Image Comprehension
22 papers with code • 0 benchmarks • 1 dataset
Most implemented papers
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition.
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
We try to narrow this gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects: high-resolution visual tokens, high-quality data, and VLM-guided generation.
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
In this paper, we propose SIMA, a framework that enhances visual and language modality alignment through self-improvement, eliminating the need for external models or data.
ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter
However, a grand challenge in exploiting LLMs for multimodal learning is the size of pre-trained LLMs, which typically contain billions of parameters.
JourneyDB: A Benchmark for Generative Image Understanding
On our dataset, we devise four benchmarks to assess generated-image comprehension in terms of both content and style interpretation.
Hierarchical Open-vocabulary Universal Image Segmentation
Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions.
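As a generic illustration only (not the paper's actual architecture), the open-vocabulary labeling step can be sketched as matching class-agnostic mask proposals against arbitrary text descriptions in a CLIP-style joint embedding space; all names below are hypothetical:

```python
import torch

def label_regions(region_embeds: torch.Tensor,
                  text_embeds: torch.Tensor,
                  prompts: list[str]) -> list[str]:
    """Assign each class-agnostic region the text prompt whose
    embedding is most similar in a shared vision-language space.

    region_embeds: (R, D) pooled features, one per mask proposal.
    text_embeds:   (T, D) encoded text descriptions.
    """
    # Cosine similarity between every region and every prompt.
    region_embeds = torch.nn.functional.normalize(region_embeds, dim=-1)
    text_embeds = torch.nn.functional.normalize(text_embeds, dim=-1)
    sim = region_embeds @ text_embeds.T   # (R, T) similarity matrix
    best = sim.argmax(dim=-1)             # best-matching prompt per region
    return [prompts[i] for i in best.tolist()]
```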
RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension
To this end, we propose to extract features corresponding to regional objects as soft prompts for the LLM, which provides a straightforward and scalable approach and eliminates the need for LLM fine-tuning.
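A rough sketch of the soft-prompt idea (not RegionBLIP's actual code; the module, shapes, and names are assumptions): pooled region features are projected into the LLM's embedding space and prepended to the text tokens, so only the small projection needs training while the LLM stays frozen.

```python
import torch
import torch.nn as nn

class RegionSoftPrompt(nn.Module):
    """Project pooled region features into the LLM embedding space
    so they can be prepended to the text tokens as soft prompts."""

    def __init__(self, region_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(region_dim, llm_dim)

    def forward(self, region_feats: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, R, region_dim) pooled features per region
        # text_embeds:  (B, T, llm_dim) embedded text tokens
        prompts = self.proj(region_feats)          # (B, R, llm_dim)
        # Prepend region prompts to the text sequence; a frozen
        # LLM would consume the concatenated embeddings.
        return torch.cat([prompts, text_embeds], dim=1)
```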
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
In the pursuit of Artificial General Intelligence (AGI), a critical task for large multimodal models is interpreting and processing information from multiple image inputs.
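A minimal sketch of what a contrastive prompt over multiple images might look like (the wording and step structure here are illustrative assumptions, not the paper's exact template):

```python
def contrastive_cot_prompt(question: str, num_images: int) -> str:
    """Build a contrastive chain-of-thought style prompt that asks the
    model to reason over similarities and differences before answering."""
    refs = ", ".join(f"image {i + 1}" for i in range(num_images))
    return (
        f"You are given {num_images} images: {refs}.\n"
        "Step 1: List the key similarities among the images.\n"
        "Step 2: List the key differences among the images.\n"
        "Step 3: Using those similarities and differences, answer:\n"
        f"{question}"
    )

# The prompt is paired with the images in any multimodal chat API.
print(contrastive_cot_prompt("Which image shows the oldest building?", 3))
```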
EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain
Multi-modal large language models (MLLMs) have demonstrated remarkable success in vision and vision-language tasks within the natural image domain.
MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification
This highlights the challenging nature of our benchmark for existing models and the significant gap between current models' multimodal reasoning capabilities and those of humans.