Text to Audio Retrieval

9 papers with code • 4 benchmarks • 4 datasets

Text-to-audio retrieval is the task of retrieving audio clips from a collection using free-form natural language queries, typically by ranking audio items against the query in a shared cross-modal embedding space.

Most implemented papers

OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

mindspore-ai/models 1 Jul 2021

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources.

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

OFA-Sys/ONE-PEACE 18 May 2023

In this work, we explore a scalable way for building a general representation model toward unlimited modalities.

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

opengvlab/internvideo2 22 Mar 2024

We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.

Audio Retrieval with Natural Language Queries

oncescuandreea/audio-retrieval 5 May 2021

We consider the task of retrieving audio using free-form natural language queries.
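The common setup for this task embeds both the query text and the candidate audio clips into a shared space and ranks clips by similarity. A minimal sketch, assuming pre-computed L2-comparable embeddings (the function name and shapes here are illustrative, not from any specific paper):

```python
import numpy as np

def rank_audio(query_emb: np.ndarray, audio_embs: np.ndarray) -> np.ndarray:
    """Rank audio clips for one text query by cosine similarity.

    query_emb:  (d,) text embedding
    audio_embs: (n, d) embeddings of the n candidate audio clips
    Returns gallery indices sorted from best to worst match.
    """
    # L2-normalise so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = a @ q                      # (n,) cosine similarities
    return np.argsort(-scores)          # best match first
```

Retrieval benchmarks then report metrics such as Recall@k over these rankings.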

Audio Retrieval with Natural Language Queries: A Benchmark Study

akoepke/audio-retrieval-benchmark 17 Dec 2021

Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho.

Cross Modal Retrieval with Querybank Normalisation

ioanacroi/qb-norm CVPR 2022

In this work we first show that, despite their effectiveness, state-of-the-art joint embeddings suffer significantly from the longstanding "hubness problem" in which a small number of gallery embeddings form the nearest neighbours of many queries.
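Querybank normalisation counteracts hubness by re-weighting query-to-gallery similarities using a bank of held-out queries: gallery items that attract high similarity from many bank queries (hubs) are down-weighted. A simplified inverted-softmax sketch of the idea follows; the paper's full method (dynamic inverted softmax) restricts normalisation to an activated subset of gallery items, and `beta` here is an assumed temperature hyperparameter:

```python
import numpy as np

def qb_norm(test_scores: np.ndarray, querybank_scores: np.ndarray,
            beta: float = 20.0) -> np.ndarray:
    """Normalise retrieval scores with a querybank (inverted-softmax variant).

    test_scores:      (num_test_queries, num_gallery) similarity matrix
    querybank_scores: (num_bank_queries, num_gallery) similarities of the
                      bank queries to the same gallery items
    """
    # Hub items accumulate large softmax mass across the bank; dividing by
    # that per-item mass suppresses them relative to non-hub items.
    normaliser = np.exp(beta * querybank_scores).sum(axis=0)  # (num_gallery,)
    return np.exp(beta * test_scores) / normaliser
```

After normalisation, ranking proceeds as usual on the adjusted scores.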

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

TXH-mercury/VALOR 17 Apr 2023

Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner.

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

txh-mercury/vast NeurIPS 2023

Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).

Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets

optimusprimus/dcase2023_task6b 8 Aug 2023

This work presents a text-to-audio-retrieval system based on pre-trained text and spectrogram transformers.