Visual Instruction Following

7 papers with code • 1 benchmark • 1 dataset

Benchmarks

These leaderboards track progress in visual instruction following.

Dataset	Best Model
LLaVA-Bench	ShareGPT4V-13B

Libraries

Use these libraries to find visual instruction following models and implementations.

huggingface/transformers • 3 papers • 125,059 stars

salesforce/lavis • 2 papers • 8,731 stars

Datasets

LLaVA-Bench

Most implemented papers

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

salesforce/lavis • 30 Jan 2023

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.

Visual Instruction Tuning

haotian-liu/LLaVA • NeurIPS 2023

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.

Improved Baselines with Visual Instruction Tuning

huggingface/transformers • 5 Oct 2023

Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning.

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

salesforce/lavis • NeurIPS 2023

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence.

Instruction Clarification Requests in Multimodal Collaborative Dialogue Games: Tasks, and an Analysis of the CoDraw Dataset

briemadu/codraw-icr-v1 • 28 Feb 2023

In visual instruction-following dialogue games, players can engage in repair mechanisms in the face of an ambiguous or underspecified instruction that cannot be fully mapped to actions in the world.

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

InternLM/InternLM-XComposer • 21 Nov 2023

In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data.

Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

dongyh20/chain-of-spot • 19 Mar 2024

In the realm of vision-language understanding, the proficiency of models in interpreting and reasoning over visual content has become a cornerstone for numerous applications.
