ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

18 May 2023 · Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou

In this work, we explore a scalable way to build a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design makes it easy to extend to new modalities by adding adapters and FFNs, while enabling multi-modal fusion through the shared self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic spaces of different modalities while capturing fine-grained details within each modality. With its scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.
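The architecture described in the abstract (per-modality adapters feeding shared self-attention layers, with a modality-specific FFN branch in each block) and the cross-modal aligning contrast map naturally onto a small PyTorch sketch. The code below is a minimal, illustrative reading of that design, not the released implementation: all class names, widths, the mean-pooling, and the CLIP-style symmetric InfoNCE loss are assumptions, and the intra-modal denoising contrast is omitted.

```python
# Minimal sketch (illustrative, not the official ONE-PEACE code) of
# modality adapters + shared self-attention + modality-specific FFNs,
# plus a symmetric cross-modal contrastive (InfoNCE-style) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedBlock(nn.Module):
    """One Transformer block: self-attention shared across modalities,
    FFN branch selected per modality."""

    def __init__(self, dim, num_heads, modalities):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffns = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for m in modalities
        })

    def forward(self, x, modality):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # shared attention
        x = x + self.ffns[modality](self.norm2(x))          # modality-specific FFN
        return x


class TinyOnePeace(nn.Module):
    """Toy encoder: adapters project raw per-modality features into a common
    width, then a stack of shared blocks yields pooled embeddings for contrast."""

    def __init__(self, dim=256, depth=4, num_heads=8, in_dims=None):
        super().__init__()
        in_dims = in_dims or {"vision": 768, "audio": 128, "language": 512}  # placeholder widths
        self.adapters = nn.ModuleDict({m: nn.Linear(d, dim) for m, d in in_dims.items()})
        self.blocks = nn.ModuleList(
            [SharedBlock(dim, num_heads, tuple(in_dims)) for _ in range(depth)]
        )
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP-style

    def encode(self, feats, modality):
        x = self.adapters[modality](feats)
        for blk in self.blocks:
            x = blk(x, modality)
        return F.normalize(x.mean(dim=1), dim=-1)  # mean-pool tokens, unit-norm

    def align_loss(self, a, mod_a, b, mod_b):
        """Symmetric InfoNCE over a batch of paired samples from two modalities."""
        za, zb = self.encode(a, mod_a), self.encode(b, mod_b)
        logits = self.logit_scale.exp() * za @ zb.t()
        targets = torch.arange(za.size(0), device=za.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Usage with dummy paired image/text token features (shapes are placeholders).
model = TinyOnePeace()
img_tokens = torch.randn(8, 196, 768)  # e.g. patch features
txt_tokens = torch.randn(8, 32, 512)   # e.g. token embeddings
loss = model.align_loss(img_tokens, "vision", txt_tokens, "language")
```

The property this sketch tries to surface is the extension path claimed in the abstract: adding a modality only requires a new adapter and a new FFN entry per block, while the self-attention weights that perform fusion stay shared.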


Results from the Paper


 Ranked #1 on Semantic Segmentation on ADE20K (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | ONE-PEACE | Validation mIoU | 63.0 | #1 |
| Semantic Segmentation | ADE20K | ONE-PEACE | Params (M) | 1500 | #2 |
| Text-to-Audio Retrieval | AudioCaps | ONE-PEACE | R@1 | 42.5 | #3 |
| Text-to-Audio Retrieval | AudioCaps | ONE-PEACE | R@5 | 77.5 | #1 |
| Text-to-Audio Retrieval | AudioCaps | ONE-PEACE | R@10 | 88.4 | #1 |
| Audio-to-Text Retrieval | AudioCaps | ONE-PEACE | R@1 | 51.0 | #1 |
| Audio-to-Text Retrieval | AudioCaps | ONE-PEACE | R@10 | 92.0 | #1 |
| Audio-Visual Question Answering | AVQA | ONE-PEACE | Accuracy | 92.2 | #1 |
| Text-to-Audio Retrieval | Clotho | ONE-PEACE | R@1 | 22.4 | #3 |
| Text-to-Audio Retrieval | Clotho | ONE-PEACE | R@5 | 49.0 | #2 |
| Text-to-Audio Retrieval | Clotho | ONE-PEACE | R@10 | 62.7 | #2 |
| Audio-to-Text Retrieval | Clotho | ONE-PEACE | R@1 | 27.1 | #1 |
| Audio-to-Text Retrieval | Clotho | ONE-PEACE | R@10 | 65.4 | #1 |
| Image-to-Text Retrieval | Flickr30k | ONE-PEACE (finetuned, w/o ranking) | Recall@1 | 97.6 | #2 |
| Image-to-Text Retrieval | Flickr30k | ONE-PEACE (finetuned, w/o ranking) | Recall@5 | 100 | #1 |
| Image-to-Text Retrieval | Flickr30k | ONE-PEACE (finetuned, w/o ranking) | Recall@10 | 100 | #1 |
| Audio Classification | FSD50K | ONE-PEACE | mAP | 69.7 | #1 |
| Image Classification | ImageNet | ONE-PEACE | Top-1 Accuracy | 89.8% | #21 |
| Image Classification | ImageNet | ONE-PEACE | Number of params | 1520M | #960 |
| Action Classification | Kinetics-400 | ONE-PEACE | Acc@1 | 88.1 | #21 |
| Action Classification | Kinetics-400 | ONE-PEACE | Acc@5 | 97.8 | #12 |
| Image-to-Text Retrieval | MS COCO | ONE-PEACE (w/o ranking) | Recall@1 | 84.1 | #2 |
| Image-to-Text Retrieval | MS COCO | ONE-PEACE (w/o ranking) | Recall@5 | 96.3 | #2 |
| Image-to-Text Retrieval | MS COCO | ONE-PEACE (w/o ranking) | Recall@10 | 98.3 | #3 |
| Referring Expression Comprehension | RefCOCO+ | ONE-PEACE | Val | 88.77 | #1 |
| Referring Expression Comprehension | RefCOCO+ | ONE-PEACE | Test A | 92.21 | #1 |
| Referring Expression Comprehension | RefCOCO+ | ONE-PEACE | Test B | 83.23 | #1 |
| Referring Expression Comprehension | RefCOCO | ONE-PEACE | Val | 92.58 | #2 |
| Referring Expression Comprehension | RefCOCO | ONE-PEACE | Test A | 94.18 | #2 |
| Referring Expression Comprehension | RefCOCO | ONE-PEACE | Test B | 89.26 | #2 |
| Referring Expression Comprehension | RefCOCOg-test | ONE-PEACE | Accuracy | 89.27 | #2 |
| Referring Expression Comprehension | RefCOCOg-val | ONE-PEACE | Accuracy | 89.22 | #1 |
| Audio Classification | VGGSound | ONE-PEACE (Audio-Only) | Top-1 Accuracy | 59.6 | #9 |
| Audio Classification | VGGSound | ONE-PEACE (Audio-Visual) | Top-1 Accuracy | 68.2 | #2 |
| Visual Question Answering | VQA v2 test-dev | ONE-PEACE | Accuracy | 82.6 | #4 |
| Visual Question Answering | VQA v2 test-std | ONE-PEACE | Overall | 82.52 | #3 |
| Visual Question Answering | VQA v2 test-std | ONE-PEACE | Yes/No | 94.85 | #1 |
| Visual Question Answering | VQA v2 test-std | ONE-PEACE | Number | 72.24 | #1 |
| Visual Question Answering | VQA v2 test-std | ONE-PEACE | Other | 74.15 | #2 |

Methods


No methods listed for this paper.