ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

18 May 2023 · Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou

In this work, we explore a scalable way to build a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design makes it easy to extend to new modalities by adding adapters and FFNs, while enabling multi-modal fusion through the shared self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic spaces of different modalities while capturing fine-grained details within each modality. With its scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.
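The architecture described in the abstract (per-modality adapters feeding shared self-attention layers, with a modality-specific FFN branch in each block) and the cross-modal aligning contrast map naturally onto a small PyTorch sketch. The code below is a minimal, illustrative reading of that design, not the released implementation: all class names, widths, the mean-pooling, and the CLIP-style symmetric InfoNCE loss are assumptions, and the intra-modal denoising contrast is omitted.

```python
# Minimal sketch (illustrative, not the official ONE-PEACE code) of
# modality adapters + shared self-attention + modality-specific FFNs,
# plus a symmetric cross-modal contrastive (InfoNCE-style) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedBlock(nn.Module):
    """One Transformer block: self-attention shared across modalities,
    FFN branch selected per modality."""

    def __init__(self, dim, num_heads, modalities):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffns = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for m in modalities
        })

    def forward(self, x, modality):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # shared attention
        x = x + self.ffns[modality](self.norm2(x))          # modality-specific FFN
        return x


class TinyOnePeace(nn.Module):
    """Toy encoder: adapters project raw per-modality features into a common
    width, then a stack of shared blocks yields pooled embeddings for contrast."""

    def __init__(self, dim=256, depth=4, num_heads=8, in_dims=None):
        super().__init__()
        in_dims = in_dims or {"vision": 768, "audio": 128, "language": 512}  # placeholder widths
        self.adapters = nn.ModuleDict({m: nn.Linear(d, dim) for m, d in in_dims.items()})
        self.blocks = nn.ModuleList(
            [SharedBlock(dim, num_heads, tuple(in_dims)) for _ in range(depth)]
        )
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP-style

    def encode(self, feats, modality):
        x = self.adapters[modality](feats)
        for blk in self.blocks:
            x = blk(x, modality)
        return F.normalize(x.mean(dim=1), dim=-1)  # mean-pool tokens, unit-norm

    def align_loss(self, a, mod_a, b, mod_b):
        """Symmetric InfoNCE over a batch of paired samples from two modalities."""
        za, zb = self.encode(a, mod_a), self.encode(b, mod_b)
        logits = self.logit_scale.exp() * za @ zb.t()
        targets = torch.arange(za.size(0), device=za.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Usage with dummy paired image/text token features (shapes are placeholders).
model = TinyOnePeace()
img_tokens = torch.randn(8, 196, 768)  # e.g. patch features
txt_tokens = torch.randn(8, 32, 512)   # e.g. token embeddings
loss = model.align_loss(img_tokens, "vision", txt_tokens, "language")
```

The property this sketch tries to surface is the extension path claimed in the abstract: adding a modality only requires a new adapter and a new FFN entry per block, while the self-attention weights that perform fusion stay shared.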


Results from the Paper


 Ranked #1 on Semantic Segmentation on ADE20K (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | ONE-PEACE | Validation mIoU | 63.0 | #1 |
| Semantic Segmentation | ADE20K | ONE-PEACE | Params (M) | 1500 | #2 |
| Text-to-Audio Retrieval | AudioCaps | ONE-PEACE | R@1 | 42.5 | #3 |
| Text-to-Audio Retrieval | AudioCaps | ONE-PEACE | R@5 | 77.5 | #1 |
| Text-to-Audio Retrieval | AudioCaps | ONE-PEACE | R@10 | 88.4 | #1 |
| Audio-to-Text Retrieval | AudioCaps | ONE-PEACE | R@1 | 51.0 | #1 |
| Audio-to-Text Retrieval | AudioCaps | ONE-PEACE | R@10 | 92.0 | #1 |
| Audio-Visual Question Answering | AVQA | ONE-PEACE | Accuracy | 92.2 | #1 |
| Text-to-Audio Retrieval | Clotho | ONE-PEACE | R@1 | 22.4 | #3 |
| Text-to-Audio Retrieval | Clotho | ONE-PEACE | R@5 | 49.0 | #2 |
| Text-to-Audio Retrieval | Clotho | ONE-PEACE | R@10 | 62.7 | #2 |
| Audio-to-Text Retrieval | Clotho | ONE-PEACE | R@1 | 27.1 | #1 |
| Audio-to-Text Retrieval | Clotho | ONE-PEACE | R@10 | 65.4 | #1 |
| Image-to-Text Retrieval | Flickr30k | ONE-PEACE (finetuned, w/o ranking) | Recall@1 | 97.6 | #2 |
| Image-to-Text Retrieval | Flickr30k | ONE-PEACE (finetuned, w/o ranking) | Recall@5 | 100 | #1 |
| Image-to-Text Retrieval | Flickr30k | ONE-PEACE (finetuned, w/o ranking) | Recall@10 | 100 | #1 |
| Audio Classification | FSD50K | ONE-PEACE | mAP | 69.7 | #1 |
| Image Classification | ImageNet | ONE-PEACE | Top-1 Accuracy | 89.8% | #21 |
| Image Classification | ImageNet | ONE-PEACE | Number of params | 1520M | #960 |
| Action Classification | Kinetics-400 | ONE-PEACE | Acc@1 | 88.1 | #21 |
| Action Classification | Kinetics-400 | ONE-PEACE | Acc@5 | 97.8 | #12 |
| Image-to-Text Retrieval | MS COCO | ONE-PEACE (w/o ranking) | Recall@1 | 84.1 | #2 |
| Image-to-Text Retrieval | MS COCO | ONE-PEACE (w/o ranking) | Recall@5 | 96.3 | #2 |
| Image-to-Text Retrieval | MS COCO | ONE-PEACE (w/o ranking) | Recall@10 | 98.3 | #3 |
| Referring Expression Comprehension | RefCOCO+ | ONE-PEACE | Val | 88.77 | #1 |
| Referring Expression Comprehension | RefCOCO+ | ONE-PEACE | Test A | 92.21 | #1 |
| Referring Expression Comprehension | RefCOCO+ | ONE-PEACE | Test B | 83.23 | #1 |
| Referring Expression Comprehension | RefCOCO | ONE-PEACE | Val | 92.58 | #2 |
| Referring Expression Comprehension | RefCOCO | ONE-PEACE | Test A | 94.18 | #2 |
| Referring Expression Comprehension | RefCOCO | ONE-PEACE | Test B | 89.26 | #2 |
| Referring Expression Comprehension | RefCOCOg-test | ONE-PEACE | Accuracy | 89.27 | #2 |
| Referring Expression Comprehension | RefCOCOg-val | ONE-PEACE | Accuracy | 89.22 | #1 |
| Audio Classification | VGGSound | ONE-PEACE (Audio-Only) | Top-1 Accuracy | 59.6 | #9 |
| Audio Classification | VGGSound | ONE-PEACE (Audio-Visual) | Top-1 Accuracy | 68.2 | #2 |
| Visual Question Answering | VQA v2 test-dev | ONE-PEACE | Accuracy | 82.6 | #4 |
| Visual Question Answering | VQA v2 test-std | ONE-PEACE | Overall | 82.52 | #3 |
| Visual Question Answering | VQA v2 test-std | ONE-PEACE | Yes/No | 94.85 | #1 |
| Visual Question Answering | VQA v2 test-std | ONE-PEACE | Number | 72.24 | #1 |
| Visual Question Answering | VQA v2 test-std | ONE-PEACE | Other | 74.15 | #2 |

Methods


No methods listed for this paper.