no code implementations • 14 May 2025 • Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
We propose Omni-R1, which fine-tunes a recent multi-modal LLM, Qwen2.5-Omni, on an audio question answering dataset with the reinforcement learning method GRPO.
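For context, GRPO scores a group of sampled answers per question and uses each answer's reward relative to its group as the advantage. Below is a minimal sketch of that group-relative normalization, assuming a simple scalar reward (e.g., 1 for a correct answer, 0 otherwise); the function and reward definition are illustrative, not the paper's code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each sampled answer's reward
    against the mean and std of its group (all answers for one question)."""
    # rewards: (num_questions, group_size) scalar reward per sampled answer
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 questions, 4 sampled answers each, reward = 1 if correct else 0
rewards = torch.tensor([[1., 0., 0., 1.],
                        [0., 0., 1., 0.]])
print(grpo_advantages(rewards))
```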
1 code implementation • CVPR 2025 • Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, Hilde Kuehne
Second, we resolve conflicting optimization goals by separating contrastive and reconstruction objectives through dedicated global tokens.
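A minimal sketch of the routing idea described here, assuming one set of learnable global tokens whose outputs feed the contrastive loss while patch-token outputs feed reconstruction; module names and dimensions are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DualObjectiveEncoder(nn.Module):
    """Illustrative encoder: learnable global tokens carry the contrastive
    objective, while patch tokens are reserved for reconstruction."""
    def __init__(self, dim: int = 256, num_global: int = 1, depth: int = 2):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches: torch.Tensor):
        # patches: (batch, num_patches, dim) visible patch embeddings
        g = self.global_tokens.expand(patches.size(0), -1, -1)
        out = self.encoder(torch.cat([g, patches], dim=1))
        num_global = self.global_tokens.size(1)
        global_out = out[:, :num_global].mean(dim=1)  # -> contrastive loss
        patch_out = out[:, num_global:]               # -> reconstruction loss
        return global_out, patch_out

enc = DualObjectiveEncoder()
contrastive_feat, recon_feat = enc(torch.randn(8, 49, 256))
print(contrastive_feat.shape, recon_feat.shape)
```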
1 code implementation • 3 Feb 2025 • Andrew Rouditchenko, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
In this work, we propose mWhisper-Flamingo for multilingual AVSR which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT).
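One way such a combination is commonly realized, and the assumption behind this sketch, is gated cross-attention from the pre-trained audio model's decoder onto video features, with a zero-initialized gate so training starts from the unmodified audio model; treat the module below as an illustration rather than the paper's code.

```python
import torch
import torch.nn as nn

class GatedVideoCrossAttention(nn.Module):
    """Illustrative gated cross-attention block: decoder states attend to
    video features, and a tanh gate (initialized at zero) lets training
    start from the behavior of the pre-trained audio-only model."""
    def __init__(self, dim: int = 512, nhead: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts as an identity mapping

    def forward(self, decoder_states, video_feats):
        attended, _ = self.attn(decoder_states, video_feats, video_feats)
        return decoder_states + torch.tanh(self.gate) * attended

block = GatedVideoCrossAttention()
dec = torch.randn(2, 20, 512)  # decoder token states
vid = torch.randn(2, 50, 512)  # video features projected to the decoder width
print(block(dec, vid).shape)
```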
1 code implementation • 14 Jun 2024 • Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass
Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours.
Ranked #1 on Audio-Visual Speech Recognition on LRS2 (using extra training data)
no code implementations • CVPR 2024 • Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne
Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only.
no code implementations • 29 Sep 2023 • Andrew Rouditchenko, Ronan Collobert, Tatiana Likhomanenko
Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR).
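A sketch of one way such cross-modal supervision can be exploited: a model with access to audio (or audio plus video) transcribes unlabeled clips, and the resulting pseudo-labels supervise the video-only branch. Every API below (transcribe, encode_batch, the model calls) is hypothetical and only illustrates the loop, not the paper's training recipe.

```python
import torch

def pseudo_label_step(av_model, vsr_model, optimizer, batch, ctc_loss, tokenizer):
    """One illustrative step: the audio-visual model produces transcripts for
    unlabeled clips, which then supervise the video-only (VSR) model."""
    with torch.no_grad():
        pseudo_text = av_model.transcribe(batch["audio"], batch["video"])  # hypothetical API
    targets, target_lens = tokenizer.encode_batch(pseudo_text)             # hypothetical API
    log_probs, input_lens = vsr_model(batch["video"])                      # hypothetical API
    loss = ctc_loss(log_probs, targets, input_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```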
no code implementations • 21 May 2023 • Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass
Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each.
1 code implementation • 29 Mar 2023 • Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne
Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only.
1 code implementation • 7 Oct 2022 • Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass
Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English.
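A minimal sketch of that distillation objective, assuming the student's non-English text-video similarities are regressed onto the teacher's English text-video similarities with an MSE loss; the encoders are replaced by toy tensors.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_text_emb, teacher_text_emb, video_emb):
    """Match the student's text-video similarity matrix (non-English queries)
    to the teacher's (English queries for the same videos)."""
    s_sim = F.normalize(student_text_emb, dim=-1) @ F.normalize(video_emb, dim=-1).T
    t_sim = F.normalize(teacher_text_emb, dim=-1) @ F.normalize(video_emb, dim=-1).T
    return F.mse_loss(s_sim, t_sim.detach())

# Toy tensors standing in for encoder outputs
student = torch.randn(4, 256, requires_grad=True)  # embeddings of translated captions
teacher = torch.randn(4, 256)                      # embeddings of English captions
video = torch.randn(4, 256)
print(distill_loss(student, teacher, video))
```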
1 code implementation • 2 Oct 2022 • Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities (see the sketch below).
Ranked #1 on Audio Tagging on AudioSet (using extra training data)
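A minimal sketch of the masking step behind the audio-visual extension described in the entry above: a large fraction of both audio and video patches is dropped, and only the visible patches from both streams are passed to a joint encoder. Mask ratios and shapes are illustrative.

```python
import torch

def random_mask(patches: torch.Tensor, mask_ratio: float) -> torch.Tensor:
    """Keep a random subset of patches per sample, as in MAE-style masking."""
    b, n, d = patches.shape
    num_keep = int(n * (1 - mask_ratio))
    idx = torch.argsort(torch.rand(b, n), dim=1)[:, :num_keep]
    return torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, d))

audio = torch.randn(2, 512, 768)  # audio spectrogram patches
video = torch.randn(2, 196, 768)  # video frame patches
visible = torch.cat([random_mask(audio, 0.75), random_mask(video, 0.75)], dim=1)
print(visible.shape)  # visible tokens from both modalities go to a joint encoder
```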
1 code implementation • 29 Jul 2022 • Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass
Conventional audio-visual models have independent audio and video branches.
Ranked #2 on Multi-modal Classification on AudioSet (using extra training data)
2 code implementations • 13 Mar 2022 • Yuan Gong, Sameer Khurana, Andrew Rouditchenko, James Glass
Audio classification is an active research area with a wide range of applications.
1 code implementation • CVPR 2022 • Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio S. Feris, David Harwath, James Glass, Hilde Kuehne
In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a fused representation in a joint multi-modal embedding space.
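A minimal sketch of the modality-agnostic fusion described here: token sequences from whichever modalities are available are concatenated and processed by one shared transformer, so the same weights fuse any subset of video, audio, and text. Pooling and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    """One shared transformer fuses any subset of modality token sequences."""
    def __init__(self, dim: int = 256, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, *modalities: torch.Tensor) -> torch.Tensor:
        tokens = torch.cat(modalities, dim=1)  # concatenate whatever is provided
        fused = self.encoder(tokens)
        return fused.mean(dim=1)               # pooled embedding in the shared space

fusion = FusionTransformer()
video, audio, text = torch.randn(2, 16, 256), torch.randn(2, 32, 256), torch.randn(2, 8, 256)
print(fusion(video, audio).shape)        # audio-video pair
print(fusion(video, audio, text).shape)  # all three modalities
```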
1 code implementation • 8 Dec 2021 • Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification.
no code implementations • 1 Dec 2021 • Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah
We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework on large amounts of video data.
1 code implementation • 8 Nov 2021 • Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
In this paper, we explore self-supervised audio-visual models that learn from instructional videos.
1 code implementation • 14 Oct 2021 • Ian Palmer, Andrew Rouditchenko, Andrei Barbu, Boris Katz, James Glass
These results show that models trained on other datasets and then evaluated on Spoken ObjectNet tend to perform poorly because of biases they have learned from those datasets.
no code implementations • ACL 2022 • Alexander H. Liu, SouYoung Jin, Cheng-I Jeff Lai, Andrew Rouditchenko, Aude Oliva, James Glass
Recent advances in representation learning have demonstrated an ability to represent information from different modalities such as video, text, and audio in a single high-level embedding vector.
1 code implementation • ICCV 2021 • Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
Multimodal self-supervised learning is getting more and more attention, as it allows not only training large networks without human supervision but also searching and retrieving data across various modalities.
1 code implementation • 16 Jun 2020 • Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
Further, we propose a tri-modal model that jointly processes raw audio, video, and text captions from videos to learn a multi-modal semantic embedding space useful for text-video retrieval.
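A minimal sketch of learning such a shared tri-modal space with pairwise contrastive losses; the InfoNCE form below is an assumption (the paper's objective may be margin-based), and the encoders are replaced by toy tensors.

```python
import torch
import torch.nn.functional as F

def nce_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE between two batches of aligned embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    labels = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# Toy embeddings standing in for audio, video, and text encoder outputs
audio, video, text = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)
loss = nce_loss(audio, video) + nce_loss(video, text) + nce_loss(audio, text)
print(loss)
```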
no code implementations • ICLR Workshop LLD 2019 • Tyler Lee, Ting Gong, Suchismita Padhy, Andrew Rouditchenko, Anthony Ndirango
We demonstrate that, in scenarios with limited labeled training data, one can significantly improve the performance of three different supervised classification tasks individually by up to 6% through simultaneous training with these additional self-supervised tasks.
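A minimal sketch of the joint objective this implies: the supervised classification loss is optimized together with weighted self-supervised auxiliary losses on the same data; the weighting scheme is an illustrative assumption.

```python
import torch

def joint_loss(supervised_loss, ssl_losses, weights):
    """Supervised objective plus weighted self-supervised auxiliary objectives."""
    total = supervised_loss
    for w, l in zip(weights, ssl_losses):
        total = total + w * l
    return total

# Toy example with two auxiliary self-supervised tasks
print(joint_loss(torch.tensor(1.2), [torch.tensor(0.8), torch.tensor(0.5)], [0.3, 0.3]))
```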
no code implementations • 18 Apr 2019 • Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba
Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data.
2 code implementations • ECCV 2018 • Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba
We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel.
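A minimal sketch of the mix-and-separate supervision commonly used for this setup: audio from two videos is mixed, and each video's visual features must predict a mask selecting the time-frequency bins where its own audio dominates. The binary ratio-mask target, shapes, and the random stand-in prediction below are illustrative.

```python
import torch
import torch.nn.functional as F

def mix_and_separate_targets(spec_a: torch.Tensor, spec_b: torch.Tensor):
    """Mix two magnitude spectrograms and build binary mask targets indicating
    which source dominates each time-frequency bin (self-supervised labels)."""
    mixture = spec_a + spec_b
    mask_a = (spec_a >= spec_b).float()
    return mixture, mask_a, 1.0 - mask_a

# Toy spectrograms for two videos' audio tracks
spec_a, spec_b = torch.rand(1, 256, 64), torch.rand(1, 256, 64)
mixture, target_a, target_b = mix_and_separate_targets(spec_a, spec_b)

pred_a = torch.rand(1, 256, 64)  # stands in for a mask predicted from video A's pixels
print(F.binary_cross_entropy(pred_a, target_a))
```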