MM-Vet
13 papers with code • 0 benchmarks • 0 datasets
Most implemented papers
CogAgent: A Visual Language Model for GUI Agents
People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens.
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages.
CogVLM2: Visual Language Models for Image and Video Understanding
Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications.
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data.
Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models
Multi-modal large language models (MLLMs) are expected to support multi-turn queries of interchanging image and text modalities in production.
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking.
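On point (2), MM-Vet scores open-ended answers with an LLM-based evaluator that assigns a correctness score between 0 and 1. The sketch below is a generic illustration of that LLM-as-judge pattern, not the paper's exact prompt or pipeline; `query_llm` is a hypothetical placeholder for whichever chat-completion API you use.

```python
def score_answer(question: str, ground_truth: str, prediction: str,
                 query_llm) -> float:
    """Ask an LLM judge to grade an open-ended answer on a 0.0-1.0 scale.

    `query_llm` is a hypothetical callable (prompt -> str); plug in any
    chat-completion client here.
    """
    prompt = (
        "Compare the ground truth and the prediction, then output a single "
        "correctness score between 0.0 and 1.0 (partial credit allowed).\n"
        f"Question: {question}\n"
        f"Ground truth: {ground_truth}\n"
        f"Prediction: {prediction}\n"
        "Score:"
    )
    reply = query_llm(prompt)
    try:
        # Clamp the judge's output to the valid score range.
        return min(max(float(reply.strip()), 0.0), 1.0)
    except ValueError:
        return 0.0  # unparsable judge output counts as incorrect
```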
Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision
Building on this approach, we introduce Volcano, a multimodal self-feedback guided revision model.
Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?
Recent multimodal large language models (MLLMs) have shown promising instruction following capabilities on vision-language tasks.
Self-Supervised Visual Preference Alignment
We generate chosen and rejected responses with regard to the original and augmented image pairs, and conduct preference alignment with direct preference optimization.
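For reference, the direct preference optimization (DPO) step mentioned here reduces to a margin loss over policy-vs-reference log-probabilities of the chosen and rejected responses. The snippet below is a minimal generic sketch of that loss, not the paper's implementation; the function and argument names are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer 'chosen' over
    'rejected' responses relative to a frozen reference model."""
    # Implicit rewards: log-ratio of policy vs. reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin between chosen and rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```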
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o which scored 71.0.