OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation that jointly models visual, textual, and audio resources. OPT is built on an encoder-decoder framework comprising three single-modal encoders that produce token-based embeddings for each modality, a cross-modal encoder that captures the correlations among the three modalities, and two cross-modal decoders that generate text and images, respectively. To pre-train OPT, we design a multi-task pretext learning scheme that models multi-modal resources at three data granularities, i.e., token-, modality-, and sample-level modeling, through which OPT learns to align and translate among the different modalities. Pre-training is carried out on a large collection of image-text-audio triplets from Open Images. Experimental results show that OPT learns strong image-text-audio multi-modal representations and achieves promising results on a variety of cross-modal understanding and generation tasks.
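The abstract's architecture can be illustrated with a minimal structural sketch: three single-modal encoders project image, text, and audio tokens into a shared space, a cross-modal encoder mixes the joint token sequence, and two decoder heads produce text and image outputs. All dimensions, the class name `OPTSketch`, and the random linear layers below are illustrative assumptions, not the paper's actual implementation.

```python
import random

random.seed(0)
DIM = 8  # hypothetical shared embedding size; not specified in the abstract

def linear(din, dout):
    """Random projection standing in for a learned layer (assumption)."""
    w = [[random.gauss(0, 1) for _ in range(dout)] for _ in range(din)]
    return lambda xs: [
        [sum(x[k] * w[k][j] for k in range(din)) for j in range(dout)]
        for x in xs
    ]

class OPTSketch:
    """Structural sketch of OPT: three single-modal encoders, one
    cross-modal encoder, and text/image decoders (toy stand-ins)."""

    def __init__(self, din=16):
        self.enc_v = linear(din, DIM)      # image encoder
        self.enc_t = linear(din, DIM)      # text encoder
        self.enc_a = linear(din, DIM)      # audio encoder
        self.cross = linear(DIM, DIM)      # cross-modal encoder (token mixing)
        self.dec_text = linear(DIM, 100)   # text decoder head (toy vocab of 100)
        self.dec_img = linear(DIM, 64)     # image decoder head (toy 8x8 patch)

    def forward(self, img_tokens, txt_tokens, aud_tokens):
        # Encode each modality, then concatenate into one joint token sequence.
        joint = (self.enc_v(img_tokens)
                 + self.enc_t(txt_tokens)
                 + self.enc_a(aud_tokens))
        fused = self.cross(joint)          # correlations across modalities
        return self.dec_text(fused), self.dec_img(fused)

# Toy token sequences standing in for region, word, and frame features.
tok = lambda n, d=16: [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
model = OPTSketch()
text_out, image_out = model.forward(tok(5), tok(7), tok(4))
```

The joint sequence here has 5 + 7 + 4 = 16 tokens; the actual model would use attention-based Transformer layers rather than random projections.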

Task                     Dataset               Model  Metric Name         Metric Value  Global Rank
Image Retrieval          Localized Narratives  OPT    Text-to-image R@1   0.4196        #1
                                                      Text-to-image R@5   0.72          #1
                                                      Text-to-image R@10  0.8126        #1
Audio-to-Text Retrieval  Localized Narratives  OPT    Audio-to-text R@1   0.803         #1
                                                      Audio-to-text R@5   0.945         #1
                                                      Audio-to-text R@10  0.971         #1
Text-to-Audio Retrieval  Localized Narratives  OPT    Text-to-audio R@1   0.78          #1
                                                      Text-to-audio R@5   0.927         #1
                                                      Text-to-audio R@10  0.958         #1
Image-to-Text Retrieval  Localized Narratives  OPT    Image-to-text R@1   0.394         #1
                                                      Image-to-text R@5   0.7194        #1
                                                      Image-to-text R@10  0.8256        #1
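The R@K (recall-at-K) values above are the fraction of queries whose true match appears among the top-K retrieved candidates. A minimal sketch of the computation, assuming paired data where query i's correct candidate is candidate i (the toy similarity matrix below is illustrative, not from the paper):

```python
def recall_at_k(sim, k):
    """sim[i][j] is the similarity of query i to candidate j;
    the correct match for query i is candidate i (paired data)."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidate indices by similarity, best first.
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

# Toy similarity matrix for 4 query-candidate pairs.
sim = [
    [0.9, 0.1, 0.2, 0.3],  # true match ranked 1st
    [0.2, 0.1, 0.8, 0.3],  # true match ranked 4th
    [0.1, 0.2, 0.3, 0.9],  # true match ranked 2nd
    [0.3, 0.2, 0.1, 0.8],  # true match ranked 1st
]
print(recall_at_k(sim, 1))  # → 0.5
print(recall_at_k(sim, 4))  # → 1.0
```

In the table, a text-to-image R@5 of 0.72 means the correct image is among the top 5 retrieved images for 72% of text queries.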
