Image Comprehension
22 papers with code • 0 benchmarks • 1 dataset
Most implemented papers
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition.
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
We try to narrow this gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects: high-resolution visual tokens, high-quality data, and VLM-guided generation.
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
In this paper, we propose SIMA, a framework that enhances visual and language modality alignment through self-improvement, eliminating the need for external models or data.
ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter
However, a grand challenge in exploiting LLMs for multimodal learning is the size of pre-trained LLMs, which typically contain billions of parameters.
JourneyDB: A Benchmark for Generative Image Understanding
On our dataset, we devise four benchmarks to assess generated-image comprehension in terms of both content and style interpretation.
Hierarchical Open-vocabulary Universal Image Segmentation
Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions.
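As a generic illustration only (not the paper's actual architecture), the open-vocabulary labeling step can be sketched as matching class-agnostic mask proposals against arbitrary text descriptions in a CLIP-style joint embedding space; all names below are hypothetical:

```python
import torch

def label_regions(region_embeds: torch.Tensor,
                  text_embeds: torch.Tensor,
                  prompts: list[str]) -> list[str]:
    """Assign each class-agnostic region the text prompt whose
    embedding is most similar in a shared vision-language space.

    region_embeds: (R, D) pooled features, one per mask proposal.
    text_embeds:   (T, D) encoded text descriptions.
    """
    # Cosine similarity between every region and every prompt.
    region_embeds = torch.nn.functional.normalize(region_embeds, dim=-1)
    text_embeds = torch.nn.functional.normalize(text_embeds, dim=-1)
    sim = region_embeds @ text_embeds.T   # (R, T) similarity matrix
    best = sim.argmax(dim=-1)             # best-matching prompt per region
    return [prompts[i] for i in best.tolist()]
```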
RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension
To this end, we propose to extract features corresponding to regional objects as soft prompts for the LLM, which provides a straightforward and scalable approach and eliminates the need for LLM fine-tuning.
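A rough sketch of the soft-prompt idea (not RegionBLIP's actual code; the module, shapes, and names are assumptions): pooled region features are projected into the LLM's embedding space and prepended to the text tokens, so only the small projection needs training while the LLM stays frozen.

```python
import torch
import torch.nn as nn

class RegionSoftPrompt(nn.Module):
    """Project pooled region features into the LLM embedding space
    so they can be prepended to the text tokens as soft prompts."""

    def __init__(self, region_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(region_dim, llm_dim)

    def forward(self, region_feats: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, R, region_dim) pooled features per region
        # text_embeds:  (B, T, llm_dim) embedded text tokens
        prompts = self.proj(region_feats)          # (B, R, llm_dim)
        # Prepend region prompts to the text sequence; a frozen
        # LLM would consume the concatenated embeddings.
        return torch.cat([prompts, text_embeds], dim=1)
```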
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
In the pursuit of Artificial General Intelligence (AGI), a critical task for large multimodal models is interpreting and processing information from multiple image inputs.
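A minimal sketch of what a contrastive prompt over multiple images might look like (the wording and step structure here are illustrative assumptions, not the paper's exact template):

```python
def contrastive_cot_prompt(question: str, num_images: int) -> str:
    """Build a contrastive chain-of-thought style prompt that asks the
    model to reason over similarities and differences before answering."""
    refs = ", ".join(f"image {i + 1}" for i in range(num_images))
    return (
        f"You are given {num_images} images: {refs}.\n"
        "Step 1: List the key similarities among the images.\n"
        "Step 2: List the key differences among the images.\n"
        "Step 3: Using those similarities and differences, answer:\n"
        f"{question}"
    )

# The prompt is paired with the images in any multimodal chat API.
print(contrastive_cot_prompt("Which image shows the oldest building?", 3))
```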
EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain
Multi-modal large language models (MLLMs) have demonstrated remarkable success in vision and vision-language tasks within the natural image domain.
MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification
This highlights the challenging nature of our benchmark for existing models and the significant gap between current models' multimodal reasoning capabilities and those of humans.