Image Comprehension

7 papers with code • 0 benchmarks • 1 dataset


Most implemented papers

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

dvlab-research/minigemini 27 Mar 2024

We try to narrow the gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation.

ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter

dlyuangod/artgpt-4 12 May 2023

However, a grand challenge of exploiting LLMs for multimodal learning is the size of pre-trained LLMs, which typically have billions of parameters.

JourneyDB: A Benchmark for Generative Image Understanding

shihaozhaozsh/lavi-bridge NeurIPS 2023

On our dataset, we have devised four benchmarks to assess the comprehension of generated images with respect to both content and style interpretation.

Hierarchical Open-vocabulary Universal Image Segmentation

berkeley-hipie/hipie NeurIPS 2023

Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions.
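The core idea behind open-vocabulary segmentation can be illustrated with a toy sketch: region embeddings are matched against text embeddings of arbitrary descriptions, so the label set is chosen at inference time rather than fixed in training. All names and dimensions below are illustrative assumptions, not HIPIE's actual API.

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(x):
    """L2-normalize embeddings along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def label_regions(region_embs, text_embs, labels):
    """Assign each region the text label with the highest cosine similarity."""
    sims = normalize(region_embs) @ normalize(text_embs).T
    return [labels[i] for i in sims.argmax(axis=1)]

# Toy stand-ins: text embeddings from a frozen text encoder, and region
# embeddings that are noisy versions of them.
labels      = ["a dog", "a tree", "the sky"]
text_embs   = rng.normal(size=(3, 512))
region_embs = text_embs + 0.1 * rng.normal(size=(3, 512))

print(label_regions(region_embs, text_embs, labels))
```

Because the text side is just an encoder over free-form strings, swapping in new labels requires no retraining of the segmentation model.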

RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension

mightyzau/regionblip 3 Aug 2023

To this end, we propose to extract features corresponding to regional objects as soft prompts for LLM, which provides a straightforward and scalable approach and eliminates the need for LLM fine-tuning.
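The soft-prompt idea described above can be sketched minimally: pool the visual features inside each region box, project them into the LLM's embedding space, and hand the result to the frozen LLM as extra prefix tokens. This is a hypothetical illustration with made-up shapes and a plain linear projector, not RegionBLIP's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def pool_region_feature(feature_map, box):
    """Average-pool the visual features inside a bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    region = feature_map[y0:y1, x0:x1]                        # (h, w, c)
    return region.reshape(-1, region.shape[-1]).mean(axis=0)  # (c,)

def regions_to_soft_prompts(feature_map, boxes, projection):
    """Project pooled region features into the LLM embedding space.
    A frozen LLM would consume these as soft prompt tokens."""
    pooled = np.stack([pool_region_feature(feature_map, b) for b in boxes])
    return pooled @ projection                                # (num_regions, d_model)

feature_map = rng.normal(size=(16, 16, 256))   # toy visual-backbone output
projection  = rng.normal(size=(256, 4096))     # learned linear projector (toy)
boxes       = [(0, 0, 8, 8), (8, 8, 16, 16)]

soft_prompts = regions_to_soft_prompts(feature_map, boxes, projection)
print(soft_prompts.shape)  # one soft-prompt vector per region
```

Since only the projector is trained, the LLM itself needs no fine-tuning, which is what makes the approach scalable across modalities.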

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

internlm/internlm-xcomposer 26 Sep 2023

We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition.

EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain

wivizhang/earthgpt 30 Jan 2024

Multi-modal large language models (MLLMs) have demonstrated remarkable success in vision and visual-language tasks within the natural image domain.