OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, and language modeling, in a simple sequence-to-sequence learning framework. OFA follows instruction-based learning in both the pretraining and finetuning stages and requires no extra task-specific layers for downstream tasks. In contrast to recent state-of-the-art vision-and-language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new state-of-the-art results on a series of cross-modal tasks while attaining highly competitive performance on unimodal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. Our code and models are publicly available at https://github.com/OFA-Sys/OFA.
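To make the idea of casting different tasks into one sequence-to-sequence format more concrete, the following is a minimal sketch in Python. It is not taken from the OFA codebase: the instruction templates, the `Seq2SeqExample` class, the `make_example` and `box_to_tokens` helpers, the `<bin_k>` token format, and the choice of 1000 bins are all assumptions made for illustration. The point is only that every task, including one whose output is a bounding box, can be expressed as an instruction string paired with a target string, so no task-specific output layer is needed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical instruction templates, invented for illustration only.
TEMPLATES = {
    "image_captioning": "What does the image describe?",
    "visual_grounding": 'Which region does the text "{text}" describe?',
    "image_classification": "What category does the image belong to?",
    "language_modeling": 'Complete the text: "{text}"',
}


@dataclass
class Seq2SeqExample:
    task: str
    source: str                   # instruction given to the encoder
    target: str                   # sequence the decoder should produce
    image: Optional[str] = None   # image path or id, if the task has one


def box_to_tokens(box: Tuple[float, float, float, float],
                  width: float, height: float, bins: int = 1000) -> str:
    """Quantize a pixel box (x0, y0, x1, y1) into discrete tokens.

    The <bin_k> format and the default bin count are assumptions.
    """
    x0, y0, x1, y1 = box

    def quantize(value: float, size: float) -> int:
        # Map [0, size] onto integer bins 0 .. bins - 1.
        return min(bins - 1, int(value / size * bins))

    pairs = ((x0, width), (y0, height), (x1, width), (y1, height))
    return " ".join(f"<bin_{quantize(v, s)}>" for v, s in pairs)


def make_example(task: str, target: str,
                 image: Optional[str] = None, text: str = "") -> Seq2SeqExample:
    """Build one (instruction, target) pair for the given task."""
    source = TEMPLATES[task].format(text=text)
    return Seq2SeqExample(task=task, source=source, target=target, image=image)


if __name__ == "__main__":
    caption = make_example("image_captioning", "a dog on grass", image="img.jpg")
    region = make_example(
        "visual_grounding",
        box_to_tokens((0, 0, 320, 240), width=640, height=480),
        image="img.jpg",
        text="a dog",
    )
    for example in (caption, region):
        print(example.source, "->", example.target)
```

Under these assumptions, captioning and grounding examples share the same structure and differ only in the instruction and the target string, which is what lets a single encoder-decoder model handle both.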


Results from the Paper

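| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Text Summarization | GigaWord | OFA | ROUGE-1 | 39.81 | #3 |
| Text Summarization | GigaWord | OFA | ROUGE-2 | 20.66 | #2 |
| Text Summarization | GigaWord | OFA | ROUGE-L | 37.11 | #1 |
| Referring Expression Comprehension | GRIT | OFA | Refexp (ablation) | 61.7 | #2 |
| Visual Question Answering | GRIT | OFA | VQA (ablation) | 72.4 | #2 |
| Object Categorization | GRIT | OFA_Large | Categorization (ablation) | 22.6 | #4 |
| Self-Supervised Image Classification | ImageNet (finetuned) | OFA (Large) | Number of Params | 473M | #8 |
| Self-Supervised Image Classification | ImageNet (finetuned) | OFA (Large) | Top 1 Accuracy | 85.6% | #12 |
| Referring Expression Comprehension | RefCOCO+ | OFA | Val | 87.86 | #1 |
| Referring Expression Comprehension | RefCOCO+ | OFA | Test A | 91.70 | #1 |
| Referring Expression Comprehension | RefCOCO+ | OFA | Test B | 80.71 | #1 |
| Referring Expression Comprehension | RefCOCO | OFA | Val | 92.04 | #1 |
| Referring Expression Comprehension | RefCOCO | OFA | Test A | 94.03 | #1 |
| Referring Expression Comprehension | RefCOCO | OFA | Test B | 88.44 | #1 |
| Referring Expression Comprehension | RefCOCOg-test | OFA | Accuracy | 88.78 | #1 |
| Referring Expression Comprehension | RefCOCOg-val | OFA | Accuracy | 88.07 | #1 |
| Visual Entailment | SNLI-VE test | OFA | Accuracy | 91.2 | #1 |
| Visual Entailment | SNLI-VE val | OFA | Accuracy | 91.0 | #1 |
| Visual Question Answering | VQA v2 test-dev | OFA | Accuracy | 82.0 | #6 |
| Visual Question Answering | VQA v2 test-std | OFA | overall | 81.98 | #2 |
| Visual Question Answering | VQA v2 test-std | OFA | yes/no | 94.66 | #2 |
| Visual Question Answering | VQA v2 test-std | OFA | number | 71.44 | #1 |
| Visual Question Answering | VQA v2 test-std | OFA | other | 73.35 | #1 |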