audio-visual learning
21 papers with code • 0 benchmarks • 4 datasets
Most implemented papers
AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation
In this paper, we propose a novel method that utilizes latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings.
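As a rough illustration of the idea (module names, dimensions, and the pooling choice are assumptions, not the paper's actual code), the audio signal can be projected into the text encoder's token-embedding space so that a frozen diffusion model treats the sound as an extra conditioning "word":

```python
# Minimal sketch of the audio-to-token idea (illustrative names/sizes only):
# an audio clip is encoded and projected into the text encoder's
# token-embedding space, yielding one pseudo text-token.
import torch
import torch.nn as nn

class AudioTokenProjector(nn.Module):
    def __init__(self, audio_dim: int = 768, token_dim: int = 768):
        super().__init__()
        # Small MLP mapping audio features -> one pseudo token embedding.
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, token_dim), nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim) from a pretrained audio encoder.
        pooled = audio_feats.mean(dim=1)           # temporal pooling (assumed)
        return self.proj(pooled).unsqueeze(1)      # (batch, 1, token_dim)

# The pseudo-token would be concatenated with the ordinary prompt embeddings
# before they condition the frozen text-to-image diffusion model.
```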
Adversarial-Metric Learning for Audio-Visual Cross-Modal Matching
AML aims to generate a modality-independent representation for each person in each modality via adversarial learning, while simultaneously learning a robust similarity measure for cross-modal matching via metric learning.
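A minimal sketch of the two ingredients, with all names and sizes assumed: a modality discriminator is confused adversarially while a triplet loss shapes the cross-modal metric. In practice the adversarial part is usually realized with gradient reversal or alternating updates; the negated cross-entropy below is a simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Discriminator guessing which modality an embedding came from (assumed sizes).
disc = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))

def adversarial_loss(face_emb, voice_emb):
    # The embedding networks are trained to make the discriminator fail,
    # pushing both modalities toward a modality-independent space.
    logits = disc(torch.cat([face_emb, voice_emb], dim=0))
    labels = torch.cat([
        torch.zeros(len(face_emb), dtype=torch.long, device=face_emb.device),
        torch.ones(len(voice_emb), dtype=torch.long, device=voice_emb.device),
    ])
    return -F.cross_entropy(logits, labels)  # maximize discriminator confusion

def metric_loss(anchor, positive, negative, margin=0.5):
    # Triplet loss over cross-modal pairs: same identity ends up closer
    # than different identities under the learned metric.
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)
```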
Can audio-visual integration strengthen robustness under multimodal attacks?
In this paper, we conduct a systematic study of machine multisensory perception under adversarial attacks.
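For intuition, a joint attack on both streams might look like the FGSM-style sketch below; the model(audio, video) signature and the epsilon values are assumptions, and the paper studies such attacks systematically rather than prescribing this code.

```python
import torch
import torch.nn.functional as F

def fgsm_av(model, audio, video, target, eps_a=0.002, eps_v=0.004):
    # Perturb both modalities along the gradient sign of the task loss,
    # so robustness can be probed per modality or jointly.
    audio = audio.clone().requires_grad_(True)
    video = video.clone().requires_grad_(True)
    loss = F.cross_entropy(model(audio, video), target)
    loss.backward()
    adv_audio = audio + eps_a * audio.grad.sign()   # attack the audio stream
    adv_video = video + eps_v * video.grad.sign()   # attack the visual stream
    return adv_audio.detach(), adv_video.detach()
```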
Distilling Audio-Visual Knowledge by Compositional Contrastive Learning
Having access to multi-modal cues (e.g., vision and audio) enables some cognitive tasks to be performed faster than learning from a single modality.
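A hedged sketch of contrastive distillation between an audio-visual teacher and a single-modality student (a batch-wise InfoNCE loss with an assumed temperature; the paper's compositional objective is more elaborate):

```python
import torch
import torch.nn.functional as F

def contrastive_distill(student_emb, teacher_emb, tau=0.07):
    # student_emb: (B, D) e.g. video-only student; teacher_emb: (B, D)
    # audio-visual teacher. Matched pairs sit on the diagonal.
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / tau                           # (B, B) similarity matrix
    labels = torch.arange(len(s), device=s.device)   # diagonal = positives
    return F.cross_entropy(logits, labels)
```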
Cascaded Multilingual Audio-Visual Learning from Videos
In this paper, we explore self-supervised audio-visual models that learn from instructional videos.
Learning to Answer Questions in Dynamic Audio-Visual Scenarios
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
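One common way to wire such a model, shown only as an assumed sketch (module names, head counts, and the answer-vocabulary size are made up): the question embedding attends separately over audio and visual features before fusion.

```python
import torch
import torch.nn as nn

class SimpleAVQA(nn.Module):
    def __init__(self, dim=512, num_answers=42):  # answer count is assumed
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.video_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, question, audio, video):
        # question: (B, 1, D); audio/video: (B, T, D) per-segment features.
        a, _ = self.audio_attn(question, audio, audio)  # attend to sounds
        v, _ = self.video_attn(question, video, video)  # attend to frames
        fused = torch.cat([a, v], dim=-1).squeeze(1)    # (B, 2D)
        return self.classifier(fused)                   # answer logits
```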
Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection
In this paper, we analyze the phenomena of modality asynchrony and undifferentiated instances in the multiple instance learning (MIL) procedure, and further investigate their negative impact on weakly-supervised audio-visual learning.
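The MIL setup can be summarized with a short sketch (the top-k pooling and k=3 are assumptions; the paper's modality-aware contrastive and self-distillation terms are omitted): only a video-level label is available, so the bag score is pooled from the highest-scoring snippets.

```python
import torch
import torch.nn.functional as F

def mil_loss(instance_logits, bag_label, k=3):
    # instance_logits: (B, T) per-snippet violence scores; bag_label: (B,).
    # Pool the bag score from the k highest-scoring instances.
    topk = instance_logits.topk(k, dim=1).values.mean(dim=1)
    return F.binary_cross_entropy_with_logits(topk, bag_label.float())
```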
UAVM: Towards Unifying Audio and Visual Models
Conventional audio-visual models have independent audio and video branches.
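UAVM instead shares most parameters across modalities. A minimal sketch of that design, with all dimensions assumed: shallow modality-specific projections feed one shared Transformer backbone.

```python
import torch
import torch.nn as nn

class UnifiedAVModel(nn.Module):
    def __init__(self, dim=512, num_classes=309):  # assumed label space
        super().__init__()
        self.audio_in = nn.Linear(128, dim)   # e.g. mel-spectrogram frames
        self.video_in = nn.Linear(768, dim)   # e.g. per-frame ViT features
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=4)  # one backbone
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, modality: str):
        # Either modality passes through the same shared backbone.
        x = self.audio_in(x) if modality == "audio" else self.video_in(x)
        return self.head(self.shared(x).mean(dim=1))
```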
Revisiting Pre-training in Audio-Visual Learning
Specifically, we explore the effects of pre-trained models on two audio-visual learning scenarios: cross-modal initialization and multi-modal joint learning.
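For the cross-modal initialization scenario, a common recipe (sketched here under the assumption that only shape-compatible parameters are transferred) warm-starts an audio network from visually pre-trained weights:

```python
import torch

def cross_modal_init(audio_model, visual_state_dict):
    # Copy only the visually pre-trained parameters whose names and shapes
    # match the audio model; everything else keeps its fresh initialization.
    own = audio_model.state_dict()
    compatible = {k: v for k, v in visual_state_dict.items()
                  if k in own and v.shape == own[k].shape}
    own.update(compatible)
    audio_model.load_state_dict(own)
    return sorted(compatible)  # report which layers were transferred
```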
Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation
We present empirical results that demonstrate the effectiveness of our benchmark.