OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation that jointly models visual, textual, and audio resources. OPT is built on an encoder-decoder framework comprising three single-modal encoders that produce token-based embeddings for each modality, a cross-modal encoder that captures the correlations among the three modalities, and two cross-modal decoders that generate text and images, respectively. To pre-train OPT, we design a multi-task pretext learning scheme that models multi-modal resources at three data granularities, i.e., token-, modality-, and sample-level modeling, through which OPT learns to align and translate among the different modalities. Pre-training is carried out on a large collection of image-text-audio triplets from Open Images. Experimental results show that OPT learns strong image-text-audio multi-modal representations and achieves promising results on a variety of cross-modal understanding and generation tasks.
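The abstract's architecture can be illustrated with a minimal structural sketch: three single-modal encoders project image, text, and audio tokens into a shared space, a cross-modal encoder mixes the joint token sequence, and two decoder heads produce text and image outputs. All dimensions, the class name `OPTSketch`, and the random linear layers below are illustrative assumptions, not the paper's actual implementation.

```python
import random

random.seed(0)
DIM = 8  # hypothetical shared embedding size; not specified in the abstract

def linear(din, dout):
    """Random projection standing in for a learned layer (assumption)."""
    w = [[random.gauss(0, 1) for _ in range(dout)] for _ in range(din)]
    return lambda xs: [
        [sum(x[k] * w[k][j] for k in range(din)) for j in range(dout)]
        for x in xs
    ]

class OPTSketch:
    """Structural sketch of OPT: three single-modal encoders, one
    cross-modal encoder, and text/image decoders (toy stand-ins)."""

    def __init__(self, din=16):
        self.enc_v = linear(din, DIM)      # image encoder
        self.enc_t = linear(din, DIM)      # text encoder
        self.enc_a = linear(din, DIM)      # audio encoder
        self.cross = linear(DIM, DIM)      # cross-modal encoder (token mixing)
        self.dec_text = linear(DIM, 100)   # text decoder head (toy vocab of 100)
        self.dec_img = linear(DIM, 64)     # image decoder head (toy 8x8 patch)

    def forward(self, img_tokens, txt_tokens, aud_tokens):
        # Encode each modality, then concatenate into one joint token sequence.
        joint = (self.enc_v(img_tokens)
                 + self.enc_t(txt_tokens)
                 + self.enc_a(aud_tokens))
        fused = self.cross(joint)          # correlations across modalities
        return self.dec_text(fused), self.dec_img(fused)

# Toy token sequences standing in for region, word, and frame features.
tok = lambda n, d=16: [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
model = OPTSketch()
text_out, image_out = model.forward(tok(5), tok(7), tok(4))
```

The joint sequence here has 5 + 7 + 4 = 16 tokens; the actual model would use attention-based Transformer layers rather than random projections.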

Task                     Dataset               Model  Metric Name         Metric Value  Global Rank
Image Retrieval          Localized Narratives  OPT    Text-to-image R@1   0.4196        #1
                                                      Text-to-image R@5   0.72          #1
                                                      Text-to-image R@10  0.8126        #1
Audio-to-Text Retrieval  Localized Narratives  OPT    Audio-to-text R@1   0.803         #1
                                                      Audio-to-text R@5   0.945         #1
                                                      Audio-to-text R@10  0.971         #1
Text-to-Audio Retrieval  Localized Narratives  OPT    Text-to-audio R@1   0.78          #1
                                                      Text-to-audio R@5   0.927         #1
                                                      Text-to-audio R@10  0.958         #1
Image-to-Text Retrieval  Localized Narratives  OPT    Image-to-text R@1   0.394         #1
                                                      Image-to-text R@5   0.7194        #1
                                                      Image-to-text R@10  0.8256        #1
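The R@K (recall-at-K) values above are the fraction of queries whose true match appears among the top-K retrieved candidates. A minimal sketch of the computation, assuming paired data where query i's correct candidate is candidate i (the toy similarity matrix below is illustrative, not from the paper):

```python
def recall_at_k(sim, k):
    """sim[i][j] is the similarity of query i to candidate j;
    the correct match for query i is candidate i (paired data)."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidate indices by similarity, best first.
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

# Toy similarity matrix for 4 query-candidate pairs.
sim = [
    [0.9, 0.1, 0.2, 0.3],  # true match ranked 1st
    [0.2, 0.1, 0.8, 0.3],  # true match ranked 4th
    [0.1, 0.2, 0.3, 0.9],  # true match ranked 2nd
    [0.3, 0.2, 0.1, 0.8],  # true match ranked 1st
]
print(recall_at_k(sim, 1))  # → 0.5
print(recall_at_k(sim, 4))  # → 1.0
```

In the table, a text-to-image R@5 of 0.72 means the correct image is among the top 5 retrieved images for 72% of text queries.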
