ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

In this work, we explore a scalable way to build a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design makes it easy to extend to new modalities by adding adapters and FFNs, while also enabling multi-modal fusion through the self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic space of different modalities and capture fine-grained details within modalities concurrently. With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.
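Because the abstract describes the layer layout in words only, the following is a minimal PyTorch sketch of that structure: modality adapters feed a stack of Transformer blocks whose self-attention is shared across modalities while each modality keeps its own FFN. This is an illustration of the described design under simplifying assumptions, not the released implementation; the class names (`SharedBlock`, `OnePeaceSketch`), adapter choices, and all dimensions are hypothetical.

```python
# Illustrative sketch of the adapter / shared-attention / modality-FFN layout
# described in the abstract. Names and dimensions are assumptions, not the
# actual ONE-PEACE code.
import torch
import torch.nn as nn


class SharedBlock(nn.Module):
    def __init__(self, dim, num_heads, modalities=("vision", "audio", "language")):
        super().__init__()
        # Self-attention weights are shared by all modalities (fusion happens here).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Each modality owns its own feed-forward network.
        self.ffns = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for m in modalities
        })

    def forward(self, x, modality):
        x = x + self.attn(self.norm1(x), self.norm1(x), self.norm1(x),
                          need_weights=False)[0]
        x = x + self.ffns[modality](self.norm2(x))
        return x


class OnePeaceSketch(nn.Module):
    def __init__(self, dim=768, num_heads=12, depth=12):
        super().__init__()
        # Modality adapters map raw inputs to a common token dimension; real
        # adapters are patch/waveform/text embedders, simplified to linears here.
        self.adapters = nn.ModuleDict({
            "vision": nn.LazyLinear(dim),
            "audio": nn.LazyLinear(dim),
            "language": nn.Embedding(50000, dim),
        })
        self.blocks = nn.ModuleList([SharedBlock(dim, num_heads) for _ in range(depth)])

    def forward(self, inputs, modality):
        x = self.adapters[modality](inputs)
        for blk in self.blocks:
            x = blk(x, modality)
        return x  # token features; pooling and task heads omitted
```

Under this layout, supporting an additional modality only requires registering another adapter and another FFN entry while reusing the shared self-attention, which is the extensibility property the abstract highlights.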


Results from the Paper


 Ranked #1 on Semantic Segmentation on ADE20K (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | ONE-PEACE | Validation mIoU | 63.0 | #1 |
| Semantic Segmentation | ADE20K | ONE-PEACE | Params (M) | 1500 | #2 |
| Text to Audio Retrieval | AudioCaps | ONE-PEACE | R@1 | 42.5 | #2 |
| Text to Audio Retrieval | AudioCaps | ONE-PEACE | R@5 | 77.5 | #1 |
| Text to Audio Retrieval | AudioCaps | ONE-PEACE | R@10 | 88.4 | #1 |
| Audio to Text Retrieval | AudioCaps | ONE-PEACE | R@1 | 51.0 | #1 |
| Audio to Text Retrieval | AudioCaps | ONE-PEACE | R@10 | 92.0 | #1 |
| Text to Audio Retrieval | Clotho | ONE-PEACE | R@1 | 22.4 | #2 |
| Text to Audio Retrieval | Clotho | ONE-PEACE | R@5 | 49.0 | #2 |
| Text to Audio Retrieval | Clotho | ONE-PEACE | R@10 | 62.7 | #2 |
| Audio to Text Retrieval | Clotho | ONE-PEACE | R@1 | 27.1 | #1 |
| Audio to Text Retrieval | Clotho | ONE-PEACE | R@10 | 65.4 | #1 |
| Image-to-Text Retrieval | COCO | ONE-PEACE (w/o ranking) | Recall@1 | 84.1 | #2 |
| Image-to-Text Retrieval | COCO | ONE-PEACE (w/o ranking) | Recall@5 | 96.3 | #2 |
| Image-to-Text Retrieval | COCO | ONE-PEACE (w/o ranking) | Recall@10 | 98.3 | #3 |
| Image-to-Text Retrieval | Flickr30k | ONE-PEACE (finetuned, w/o ranking) | Recall@1 | 97.6 | #1 |
| Image-to-Text Retrieval | Flickr30k | ONE-PEACE (finetuned, w/o ranking) | Recall@5 | 100 | #1 |
| Image-to-Text Retrieval | Flickr30k | ONE-PEACE (finetuned, w/o ranking) | Recall@10 | 100 | #1 |
| Audio Classification | FSD50K | ONE-PEACE | mAP | 69.7 | #1 |
| Image Classification | ImageNet | ONE-PEACE | Top-1 Accuracy | 89.8% | #19 |
| Image Classification | ImageNet | ONE-PEACE | Number of params | 1520M | #903 |
| Action Classification | Kinetics-400 | ONE-PEACE | Acc@1 | 88.1 | #15 |
| Action Classification | Kinetics-400 | ONE-PEACE | Acc@5 | 97.8 | #10 |
| Referring Expression Comprehension | RefCOCO+ | ONE-PEACE | Val | 88.77 | #1 |
| Referring Expression Comprehension | RefCOCO+ | ONE-PEACE | Test A | 92.21 | #1 |
| Referring Expression Comprehension | RefCOCO+ | ONE-PEACE | Test B | 83.23 | #1 |
| Referring Expression Comprehension | RefCOCO | ONE-PEACE | Val | 92.58 | #2 |
| Referring Expression Comprehension | RefCOCO | ONE-PEACE | Test A | 94.18 | #2 |
| Referring Expression Comprehension | RefCOCO | ONE-PEACE | Test B | 89.26 | #2 |
| Referring Expression Comprehension | RefCOCOg-test | ONE-PEACE | Accuracy | 89.27 | #2 |
| Referring Expression Comprehension | RefCOCOg-val | ONE-PEACE | Accuracy | 89.22 | #1 |
| Audio Classification | VGGSound | ONE-PEACE (Audio-Visual) | Top-1 Accuracy | 68.2 | #1 |
| Audio Classification | VGGSound | ONE-PEACE (Audio-Only) | Top-1 Accuracy | 59.6 | #8 |
| Visual Question Answering (VQA) | VQA v2 test-dev | ONE-PEACE | Accuracy | 82.6 | #4 |
| Visual Question Answering (VQA) | VQA v2 test-std | ONE-PEACE | Overall | 82.52 | #3 |
| Visual Question Answering (VQA) | VQA v2 test-std | ONE-PEACE | Yes/No | 94.85 | #1 |
| Visual Question Answering (VQA) | VQA v2 test-std | ONE-PEACE | Number | 72.24 | #1 |
| Visual Question Answering (VQA) | VQA v2 test-std | ONE-PEACE | Other | 74.15 | #2 |
