Audio to Text Retrieval
6 papers with code • 4 benchmarks • 4 datasets
Most implemented papers
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation by jointly modeling visual, textual, and audio data.
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
In this work, we explore a scalable way to build a general representation model toward unlimited modalities.
Audio Retrieval with Natural Language Queries
We consider the task of retrieving audio using free-form natural language queries.
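The retrieval setup shared by the papers on this page reduces to ranking candidates by similarity in a shared audio-text embedding space. Below is a minimal sketch of that step, assuming the embeddings come from some pretrained audio-text model; the shapes and names are illustrative, not taken from any particular paper.

```python
import numpy as np

def rank_audio_by_query(query_emb: np.ndarray, audio_embs: np.ndarray) -> np.ndarray:
    """Return indices of audio clips sorted by similarity to the query.

    query_emb:  (d,)   embedding of the natural language query
    audio_embs: (n, d) embeddings of the n candidate audio clips
    """
    # L2-normalise so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = a @ q                # (n,) cosine similarities
    return np.argsort(-scores)    # best match first

# Toy usage: 4 clips embedded in an 8-dim shared space.
rng = np.random.default_rng(0)
audio_embs = rng.normal(size=(4, 8))
query_emb = audio_embs[2] + 0.1 * rng.normal(size=8)  # query near clip 2
print(rank_audio_by_query(query_emb, audio_embs))      # clip 2 ranked first
```

The same scoring runs in the other direction (audio query against a bank of caption embeddings) for audio-to-text retrieval.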
Audio Retrieval with Natural Language Queries: A Benchmark Study
Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho.
Contrastive Audio-Language Learning for Music
In this work, we explore cross-modal learning in an attempt to bridge audio and language in the music domain.
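The usual recipe behind such cross-modal learning is a CLIP-style symmetric contrastive loss over in-batch audio-text pairs. The sketch below shows that objective; the temperature value and embedding dimensions are assumptions, and individual papers vary the encoders and details.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_embs: torch.Tensor,
                     text_embs: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired audio/text embeddings."""
    audio_embs = F.normalize(audio_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = audio_embs @ text_embs.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0))             # matched pairs lie on the diagonal
    # Average the audio->text and text->audio retrieval losses.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings for a batch of 8 pairs.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```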
AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models
In this paper, we propose an automated pipeline that integrates audio-language models for fine-grained content extraction, LLMs for synthetic caption generation, and a refinement process based on a contrastive language-audio pretraining (CLAP) model to improve caption quality.
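One way to picture the CLAP-based refinement step: for each clip, score the LLM-generated candidate captions against the audio embedding and keep only the best match above a threshold. The function below is a hypothetical sketch of that filter; the names, the threshold, and the exact selection rule are assumptions, not the paper's pipeline.

```python
import numpy as np

def refine_captions(audio_emb: np.ndarray,
                    caption_embs: np.ndarray,
                    captions: list[str],
                    min_score: float = 0.2) -> str | None:
    """Return the best-matching candidate caption, or None if all score too low."""
    a = audio_emb / np.linalg.norm(audio_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    scores = c @ a                 # cosine similarity of each candidate to the audio
    best = int(np.argmax(scores))  # hypothetical rule: keep the single top candidate
    return captions[best] if scores[best] >= min_score else None
```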