Audio to Text Retrieval
5 papers with code • 4 benchmarks • 4 datasets
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation by jointly modeling visual, text, and audio resources.
In this work, we explore a scalable way to build a general representation model toward unlimited modalities.
Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural-language descriptions for a diverse collection of sounds complementary to those found in AudioCaps and Clotho.
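In practice, audio-to-text retrieval is commonly framed as ranking candidate text descriptions by their similarity to an audio clip in a shared embedding space. The sketch below illustrates that ranking step with toy embeddings standing in for encoder outputs; the audio and text encoders themselves (and the 4-dimensional vectors) are assumptions for illustration, not part of any specific paper above.

```python
import numpy as np

def rank_captions(audio_emb, text_embs):
    """Rank candidate captions for one audio clip by cosine similarity.

    audio_emb: (d,) embedding of the query audio clip.
    text_embs: (n, d) embeddings of n candidate captions.
    Returns indices of captions sorted best-first, plus the similarities.
    """
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ a                     # cosine similarity per caption
    return np.argsort(-sims), sims   # best-matching caption first

# Toy 4-dim embeddings; a real system would produce these with
# trained audio and text encoders.
audio = np.array([1.0, 0.0, 0.0, 0.0])
texts = np.array([
    [0.9, 0.1, 0.0, 0.0],   # close match
    [0.0, 1.0, 0.0, 0.0],   # unrelated
    [0.5, 0.5, 0.0, 0.0],   # partial match
])
order, sims = rank_captions(audio, texts)
print(order)  # caption indices, best match first
```

Benchmarks for this task typically report recall@k over such a ranking, i.e. whether the ground-truth caption appears among the top k retrieved candidates.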