Image Comprehension

22 papers with code • 0 benchmarks • 1 dataset

Most implemented papers

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

internlm/internlm-xcomposer 26 Sep 2023

We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition.

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

dvlab-research/minigemini 27 Mar 2024

We try to narrow the gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation.
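
The "high-resolution visual tokens" aspect is usually realized by letting a small set of low-resolution tokens pull detail from a high-resolution feature map via cross-attention. A minimal sketch of that idea; the module name and dimensions are illustrative assumptions, not the paper's actual code:

```python
# Sketch: low-res visual tokens act as queries that mine detail from
# high-res features via cross-attention (illustrative, not Mini-Gemini's code).
import torch
import torch.nn as nn

class PatchInfoMining(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, low_res_tokens, high_res_tokens):
        # low_res_tokens:  (B, N_low, dim)  -- queries from a low-res encoder
        # high_res_tokens: (B, N_high, dim) -- keys/values from a high-res encoder
        q = self.norm_q(low_res_tokens)
        kv = self.norm_kv(high_res_tokens)
        mined, _ = self.cross_attn(q, kv, kv)
        # Residual keeps the token count at N_low, so the LLM sees the same
        # number of visual tokens, now enriched with high-res detail.
        return low_res_tokens + mined

tokens = PatchInfoMining()(torch.randn(1, 576, 1024), torch.randn(1, 2304, 1024))
print(tokens.shape)  # torch.Size([1, 576, 1024])
```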

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

umd-huang-lab/sima 24 May 2024

In this paper, we propose SIMA, a framework that enhances visual and language modality alignment through self-improvement, eliminating the need for external models or data.
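
One common way to implement self-improvement without external models or data is to have the model rank its own sampled responses and train on the resulting preference pairs. A sketch under that assumption; `model_generate` and `model_score` are hypothetical stand-ins for calls into the same VLM:

```python
# Sketch of a self-improvement alignment loop in the spirit of SIMA:
# the model critiques its own candidates, so no external reward model
# or extra data is needed. Toy stand-ins make the sketch runnable.
import random

def build_preference_pairs(model_generate, model_score, prompts, n_candidates=4):
    """Return (prompt, chosen, rejected) triples for preference tuning."""
    pairs = []
    for prompt in prompts:
        # 1) Sample several responses from the current model.
        candidates = [model_generate(prompt) for _ in range(n_candidates)]
        # 2) Let the same model rate each response (in-context self-critique).
        scored = sorted(candidates, key=lambda r: model_score(prompt, r))
        # 3) Best vs. worst response forms one preference pair.
        pairs.append((prompt, scored[-1], scored[0]))
    return pairs

gen = lambda p: p + " -> answer " + str(random.random())  # hypothetical generator
score = lambda p, r: len(r)  # placeholder for the model's self-rating
print(build_preference_pairs(gen, score, ["describe the image"], 3))
```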

ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter

dlyuangod/artgpt-4 12 May 2023

However, a grand challenge in exploiting LLMs for multimodal learning is the size of pre-trained LLMs, which typically have billions of parameters.
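
The standard remedy for this size problem, and what the "Enhanced Adapter" in the title points at, is to freeze the backbone and train only a small adapter. The following is a generic bottleneck-adapter sketch, not ArtGPT-4's actual module:

```python
# Sketch: a bottleneck adapter inserted into a frozen backbone, so only
# a tiny fraction of parameters needs training (illustrative only).
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project down
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)    # project back up
        nn.init.zeros_(self.up.weight)          # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden):
        return hidden + self.up(self.act(self.down(hidden)))

# Freeze the backbone; only the adapter's parameters remain trainable.
backbone = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False
adapter = BottleneckAdapter(512)
out = adapter(backbone(torch.randn(2, 16, 512)))
print(sum(p.numel() for p in adapter.parameters()))  # trainable params only
```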

JourneyDB: A Benchmark for Generative Image Understanding

shihaozhaozsh/lavi-bridge NeurIPS 2023

On our dataset, we have devised four benchmarks to assess the performance of generated-image comprehension in relation to both content and style interpretation.

Hierarchical Open-vocabulary Universal Image Segmentation

berkeley-hipie/hipie NeurIPS 2023

Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions.
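
At its core, this usually means matching per-pixel (or per-mask) embeddings against the embeddings of the text descriptions. A minimal sketch with random stand-ins for the image and text encoders (not HIPIE's actual pipeline):

```python
# Sketch: assign each pixel the text description whose embedding it
# matches best, which yields semantic regions for arbitrary vocabulary.
import torch
import torch.nn.functional as F

def label_pixels(pixel_emb, text_emb):
    # pixel_emb: (H, W, D) per-pixel embeddings from an image encoder
    # text_emb:  (K, D)    embeddings of K arbitrary text descriptions
    pixel_emb = F.normalize(pixel_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = pixel_emb @ text_emb.T   # (H, W, K) cosine similarities
    return sim.argmax(dim=-1)      # (H, W) index of the best-matching text

H, W, D, K = 32, 32, 256, 3        # K = 3 free-form class descriptions
seg = label_pixels(torch.randn(H, W, D), torch.randn(K, D))
print(seg.shape, seg.unique())     # per-pixel semantic region ids
```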

RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension

mightyzau/regionblip 3 Aug 2023

To this end, we propose to extract features corresponding to regional objects as soft prompts for the LLM, which provides a straightforward and scalable approach and eliminates the need for LLM fine-tuning.
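
A minimal sketch of that soft-prompt mechanism, assuming pooled region features and an illustrative projection layer (shapes and names are not from the paper's code):

```python
# Sketch: project regional object features into the LLM embedding space
# and prepend them as soft prompts; the frozen LLM consumes them like
# ordinary tokens, so no LLM fine-tuning is required.
import torch
import torch.nn as nn

class RegionSoftPrompt(nn.Module):
    def __init__(self, region_dim: int = 256, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(region_dim, llm_dim)  # the only trainable part

    def forward(self, region_feats, text_embeds):
        # region_feats: (B, R, region_dim) features of R regional objects
        # text_embeds:  (B, T, llm_dim)    LLM token embeddings of the prompt
        soft_prompts = self.proj(region_feats)        # (B, R, llm_dim)
        return torch.cat([soft_prompts, text_embeds], dim=1)

inputs = RegionSoftPrompt()(torch.randn(1, 4, 256), torch.randn(1, 10, 4096))
print(inputs.shape)  # torch.Size([1, 14, 4096]) -> fed to the frozen LLM
```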

CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

vista-h/gpt-4v_social_media 5 Jan 2024

When exploring the development of Artificial General Intelligence (AGI), a critical task for large multimodal models involves interpreting and processing information from multiple image inputs.
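
CoCoT's prompting strategy asks the model to contrast the input images before answering. A sketch of such a prompt builder; the exact wording is an assumption, not the paper's template:

```python
# Sketch: build a contrastive chain-of-thought prompt that asks the model
# to enumerate similarities and differences between the images first.
def contrastive_cot_prompt(question: str, num_images: int) -> str:
    image_tags = " ".join(f"<image_{i}>" for i in range(1, num_images + 1))
    return (
        f"{image_tags}\n"
        f"Step 1: List the similarities between these {num_images} images.\n"
        f"Step 2: List the differences between them.\n"
        f"Step 3: Using the similarities and differences above, answer:\n"
        f"{question}"
    )

print(contrastive_cot_prompt("Which image shows the same scene at night?", 2))
```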

EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain

wivizhang/earthgpt 30 Jan 2024

Multi-modal large language models (MLLMs) have demonstrated remarkable success in vision and visual-language tasks within the natural image domain.

MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification

kge-sun/mm-math 7 Apr 2024

This highlights the challenging nature of our benchmark for existing models and the significant gap between the multimodal reasoning capabilities of current models and those of humans.