VL-T5

Introduced by Cho et al. in Unifying Vision-and-Language Tasks via Text Generation

VL-T5 is a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation. The model learns to generate labels as text, conditioned on the visual and textual inputs. In contrast to other existing methods, the framework casts every task as generating text labels from multimodal inputs, so a single text generation objective covers all vision-and-language tasks. The model uses text prefixes on the input to adapt to different tasks.

Source: Unifying Vision-and-Language Tasks via Text Generation
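
To make the text-prefix idea concrete, here is a minimal sketch of how examples from different tasks could be cast into the same (source text, target text) form. It covers only the textual side; the visual inputs that the model also conditions on are omitted. The prefix strings, the `TASK_PREFIXES` table, the `Example` class, and the `build_example` helper are all hypothetical names chosen for illustration and are not taken from the paper's code.

```python
from dataclasses import dataclass

# Hypothetical prefix strings, for illustration only; the actual prefixes
# are defined by the paper and its released code.
TASK_PREFIXES = {
    "vqa": "vqa:",
    "captioning": "caption:",
}


@dataclass
class Example:
    source_text: str  # prefixed text input fed to the encoder (alongside visual features)
    target_text: str  # label expressed as plain text for the decoder to generate


def build_example(task: str, text_input: str, label: str) -> Example:
    """Cast a task-specific example into a shared text-to-text form."""
    prefix = TASK_PREFIXES[task]
    # strip() drops the trailing space when a task has no text input.
    source = f"{prefix} {text_input}".strip()
    return Example(source_text=source, target_text=label)


vqa = build_example("vqa", "what is the man doing?", "skiing")
cap = build_example("captioning", "", "a man skiing down a slope")
print(vqa.source_text, "->", vqa.target_text)
print(cap.source_text, "->", cap.target_text)
```

Because every task reduces to the same pair of strings, one model and one generation loss can in principle serve all of them; only the prefix changes between tasks.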

Tasks

Task	Papers	Share
Image Captioning	3	16.67%
Text Generation	2	11.11%
Visual Question Answering (VQA)	2	11.11%
Human-Object Interaction Detection	1	5.56%
Image Retrieval	1	5.56%
Object Categorization	1	5.56%
Object Localization	1	5.56%
Conditional Text Generation	1	5.56%
Language Modelling	1	5.56%

Categories

Vision and Language Pre-Trained Models