MM-Vet

13 papers with code • 0 benchmarks • 0 datasets

MM-Vet is a benchmark for evaluating large multimodal models on integrated vision-language capabilities, introduced in "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities" and extended by MM-Vet v2 (both listed below).

Most implemented papers

CogAgent: A Visual Language Model for GUI Agents

thudm/cogvlm CVPR 2024

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens.

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

qizekun/ShapeLLM 27 Feb 2024

This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages.

CogVLM2: Visual Language Models for Image and Video Understanding

thudm/glm-4 29 Aug 2024

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications.

To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning

x2fd/lvis-instruct4v 13 Nov 2023

Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data.
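
A minimal sketch of the text-only recipe described in this excerpt: a language model is prompted with an image's textual description (e.g., a caption) and asked to produce instruction-response pairs. `query_llm` is a hypothetical stand-in for whatever LLM API is used, and the prompt and JSON output format are illustrative, not the pipeline from any specific paper.

```python
# Sketch: generate visual instruction-tuning data from textual descriptions
# only (no image input), as described in the excerpt above.
import json

def query_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a text-only LLM and return its reply."""
    raise NotImplementedError

def make_instruction_pairs(caption: str, num_pairs: int = 3) -> list[dict]:
    """Turn an image caption into question/answer pairs via the LLM."""
    prompt = (
        f"Image description: {caption}\n"
        f"Write {num_pairs} question-answer pairs a user might ask about this "
        "image, as a JSON list of objects with 'question' and 'answer' keys."
    )
    return json.loads(query_llm(prompt))
```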

Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models

findalexli/mllm-dpo 16 Feb 2024

Multi-modal large language models (MLLMs) are expected to support multi-turn queries of interchanging image and text modalities in production.

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

yuweihao/mm-vet 4 Aug 2023

Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking.
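
For problem (2), MM-Vet scores open-ended model outputs with an LLM-based evaluator rather than exact-match metrics. Below is a minimal sketch of that idea; `query_llm` is a hypothetical helper standing in for the judge model's API, and the prompt wording is illustrative, not the official MM-Vet evaluation prompt.

```python
# Sketch: LLM-based scoring of open-ended answers on a 0-1 scale,
# averaged into a benchmark score out of 100.

def query_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM judge and return its reply."""
    raise NotImplementedError

def score_answer(question: str, ground_truth: str, prediction: str) -> float:
    """Ask the LLM judge for a correctness score in [0, 1]."""
    prompt = (
        "Compare the prediction with the ground truth and output only a "
        "correctness score between 0.0 and 1.0.\n"
        f"Question: {question}\n"
        f"Ground truth: {ground_truth}\n"
        f"Prediction: {prediction}\n"
        "Score:"
    )
    try:
        return min(max(float(query_llm(prompt).strip()), 0.0), 1.0)
    except ValueError:
        return 0.0  # unparseable judge output counts as incorrect

def benchmark_score(records: list[dict]) -> float:
    """Average per-sample scores; each record has question/answer/prediction."""
    scores = [score_answer(r["question"], r["answer"], r["prediction"]) for r in records]
    return 100.0 * sum(scores) / len(scores)
```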

Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision

kaistai/volcano 13 Nov 2023

Building on this approach, we introduce Volcano, a multimodal self-feedback guided revision model.
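
The excerpt describes the core loop: the model answers, critiques its own answer against the image, and revises. A minimal sketch of such a self-feedback guided revision loop, assuming a generic `generate(prompt, image)` callable; the prompt templates and stopping rule are illustrative, not Volcano's exact recipe.

```python
# Sketch: generate -> self-feedback -> revise loop for a multimodal model.
from typing import Callable

def revise_with_self_feedback(
    generate: Callable[[str, bytes], str],  # (prompt, image) -> text
    image: bytes,
    question: str,
    max_rounds: int = 3,
) -> str:
    answer = generate(f"Question: {question}\nAnswer:", image)
    for _ in range(max_rounds):
        # 1. The model critiques its own answer against the image.
        feedback = generate(
            f"Question: {question}\nAnswer: {answer}\n"
            "Point out anything in the answer not supported by the image:",
            image,
        )
        # 2. The model revises the answer using that feedback.
        revised = generate(
            f"Question: {question}\nDraft answer: {answer}\n"
            f"Feedback: {feedback}\nRevised answer:",
            image,
        )
        if revised.strip() == answer.strip():
            break  # converged: revision no longer changes the answer
        answer = revised
    return answer
```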

Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?

vim-bench/vim_tool 29 Nov 2023

Recent multimodal large language models (MLLMs) have shown promising instruction following capabilities on vision-language tasks.

Self-Supervised Visual Preference Alignment

Kevinz-code/SeVa 16 Apr 2024

We generate chosen and rejected responses with regard to the original and augmented image pairs, and conduct preference alignment with direct preference optimization.
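
A minimal sketch of the alignment step this excerpt describes: responses conditioned on the original image are treated as "chosen" and responses on the augmented image as "rejected", then optimized with the standard direct preference optimization (DPO) objective. How the sequence log-probabilities are computed is left abstract, and the tensor names are placeholders rather than SeVa's code.

```python
# Sketch: standard DPO loss over chosen/rejected response log-probabilities.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_chosen | x)
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_rejected | x)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen
    ref_rejected_logps: torch.Tensor,     # reference model
    beta: float = 0.1,
) -> torch.Tensor:
    """Push the policy to prefer chosen over rejected responses."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```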

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

yuweihao/mm-vet 1 Aug 2024

Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o which scored 71.0.