Visual instruction following

7 papers with code • 1 benchmark • 1 dataset

Most implemented papers

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

salesforce/lavis 30 Jan 2023

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.

Visual Instruction Tuning

haotian-liu/LLaVA NeurIPS 2023

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.

Improved Baselines with Visual Instruction Tuning

huggingface/transformers 5 Oct 2023

Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning.

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

salesforce/lavis NeurIPS 2023

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence.

Instruction Clarification Requests in Multimodal Collaborative Dialogue Games: Tasks, and an Analysis of the CoDraw Dataset

briemadu/codraw-icr-v1 28 Feb 2023

In visual instruction-following dialogue games, players can engage in repair mechanisms in the face of an ambiguous or underspecified instruction that cannot be fully mapped to actions in the world.

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

InternLM/InternLM-XComposer 21 Nov 2023

In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data.

Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

dongyh20/chain-of-spot 19 Mar 2024

In the realm of vision-language understanding, the proficiency of models in interpreting and reasoning over visual content has become a cornerstone for numerous applications.