no code implementations • 21 May 2023 • Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass
Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each.
no code implementations • 29 Mar 2023 • Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne
Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only.
1 code implementation • 7 Oct 2022 • Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass
Inspired by the fact that English text-video retrieval outperforms retrieval in other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English.
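A minimal sketch of this kind of cross-lingual distillation, where a student encoder reading non-English captions is trained to match the teacher's English text-to-video similarity distribution; the projections, temperature, and feature shapes below are illustrative placeholders, not the paper's actual architecture:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: in practice these come from pretrained text and video towers.
def encode(features, proj):
    return F.normalize(features @ proj, dim=-1)

torch.manual_seed(0)
dim = 256
teacher_proj = torch.randn(dim, dim)                        # frozen English text projection
student_proj = torch.randn(dim, dim, requires_grad=True)    # trainable student projection
video_emb = F.normalize(torch.randn(8, dim), dim=-1)        # batch of video embeddings

english_feat = torch.randn(8, dim)   # features of English captions
other_feat = torch.randn(8, dim)     # features of the same captions in another language

# Teacher's cross-modal predictions: English-text-to-video similarities.
with torch.no_grad():
    t_sim = encode(english_feat, teacher_proj) @ video_emb.T
# Student's predictions using the non-English captions.
s_sim = encode(other_feat, student_proj) @ video_emb.T

# Distillation: the student matches the teacher's text-video similarity distribution.
loss = F.kl_div(F.log_softmax(s_sim / 0.07, dim=-1),
                F.softmax(t_sim / 0.07, dim=-1),
                reduction="batchmean")
loss.backward()
```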
1 code implementation • 2 Oct 2022 • Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities.
Ranked #1 on Audio Tagging on AudioSet (using extra training data)
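A toy sketch of joint masked reconstruction over concatenated audio and video patch tokens, in the spirit of the audio-visual MAE entry above; the tiny encoder, linear decoder, and mask ratio are illustrative assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class TinyAVMAE(nn.Module):
    """Toy masked autoencoder over concatenated audio and video patch tokens."""
    def __init__(self, dim=192, depth=2, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decoder = nn.Linear(dim, dim)   # toy decoder; real models use a transformer
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, audio_tokens, video_tokens):
        x = torch.cat([audio_tokens, video_tokens], dim=1)   # one joint token sequence
        B, N, D = x.shape
        num_keep = int(N * (1 - self.mask_ratio))
        # Randomly keep a subset of tokens; encode only the visible ones.
        perm = torch.rand(B, N).argsort(dim=1)
        keep = perm[:, :num_keep]
        visible = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        enc = self.encoder(visible)
        # Scatter encoded tokens back and fill masked slots with a learned mask token.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, D), enc)
        recon = self.decoder(full)
        # Reconstruction loss only on the masked positions.
        mask = torch.ones(B, N, dtype=torch.bool)
        mask[torch.arange(B).unsqueeze(1), keep] = False
        return ((recon - x) ** 2)[mask].mean()

audio = torch.randn(2, 64, 192)   # e.g. spectrogram patch embeddings
video = torch.randn(2, 98, 192)   # e.g. frame patch embeddings
loss = TinyAVMAE()(audio, video)
loss.backward()
```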
1 code implementation • 29 Jul 2022 • Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass
Conventional audio-visual models have independent audio and video branches.
Ranked #2 on Multi-modal Classification on AudioSet (using extra training data)
2 code implementations • 13 Mar 2022 • Yuan Gong, Sameer Khurana, Andrew Rouditchenko, James Glass
Audio classification is an active research area with a wide range of applications.
1 code implementation • CVPR 2022 • Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio S. Feris, David Harwath, James Glass, Hilde Kuehne
In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a fused representation in a joint multi-modal embedding space.
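A rough sketch of the modality-agnostic fusion idea: one set of shared transformer weights fuses whichever subset of video, audio, and text tokens is available into a single embedding in a joint space. The dimensions, mean-pooling, and token shapes below are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionTransformer(nn.Module):
    """One shared transformer fuses any subset of modality token sequences."""
    def __init__(self, dim=256, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, *modalities):
        # Concatenate whatever modalities are given: video, audio, text, or any subset.
        tokens = torch.cat(modalities, dim=1)
        fused = self.blocks(tokens)
        # Mean-pool into a single embedding in the joint space.
        return F.normalize(fused.mean(dim=1), dim=-1)

fusion = FusionTransformer()
video = torch.randn(4, 16, 256)   # video tokens
audio = torch.randn(4, 32, 256)   # audio tokens
text  = torch.randn(4, 10, 256)   # text tokens

# The same weights handle single modalities, pairs, and the full triplet,
# so all outputs live in one joint embedding space.
emb_text = fusion(text)
emb_av   = fusion(video, audio)
emb_vat  = fusion(video, audio, text)
sim = emb_text @ emb_av.T          # e.g. scores for a contrastive retrieval loss
```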
1 code implementation • 8 Dec 2021 • Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks such as zero-shot retrieval and classification.
no code implementations • 1 Dec 2021 • Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah
We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework on large amounts of video data.
1 code implementation • 8 Nov 2021 • Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
In this paper, we explore self-supervised audio-visual models that learn from instructional videos.
1 code implementation • 14 Oct 2021 • Ian Palmer, Andrew Rouditchenko, Andrei Barbu, Boris Katz, James Glass
These results show that models trained on other datasets tend to perform poorly when evaluated on Spoken ObjectNet, due to biases they have learned from those datasets.
no code implementations • ACL 2022 • Alexander H. Liu, SouYoung Jin, Cheng-I Jeff Lai, Andrew Rouditchenko, Aude Oliva, James Glass
Recent advances in representation learning have demonstrated an ability to represent information from different modalities such as video, text, and audio in a single high-level embedding vector.
1 code implementation • ICCV 2021 • Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
Multimodal self-supervised learning is attracting increasing attention, as it allows not only training large networks without human supervision but also searching and retrieving data across various modalities.
1 code implementation • 16 Jun 2020 • Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
Further, we propose a tri-modal model that jointly processes raw audio, video, and text captions from videos to learn a multi-modal semantic embedding space useful for text-video retrieval.
Automatic Speech Recognition (ASR) +5
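A toy sketch of the tri-modal joint embedding idea from the entry above: audio, video, and text features are projected into one space and matching pairs are pulled together with a contrastive loss. The linear heads and symmetric InfoNCE objective are simplifications, not the exact training setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
# Illustrative projection heads mapping per-modality features into one space.
audio_head = nn.Linear(128, dim)
video_head = nn.Linear(512, dim)
text_head  = nn.Linear(300, dim)

def embed(x, head):
    return F.normalize(head(x), dim=-1)

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss: matching pairs sit on the diagonal."""
    logits = a @ b.T / temperature
    labels = torch.arange(a.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# Toy batch of pooled per-clip features for the three modalities.
audio = embed(torch.randn(8, 128), audio_head)
video = embed(torch.randn(8, 512), video_head)
text  = embed(torch.randn(8, 300), text_head)

# Pull all three modalities of the same clip together, pairwise.
loss = info_nce(audio, video) + info_nce(video, text) + info_nce(audio, text)
loss.backward()
```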
no code implementations • ICLR Workshop LLD 2019 • Tyler Lee, Ting Gong, Suchismita Padhy, Andrew Rouditchenko, Anthony Ndirango
We demonstrate that, in scenarios with limited labeled training data, one can significantly improve the performance of three different supervised classification tasks individually by up to 6% through simultaneous training with these additional self-supervised tasks.
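A minimal sketch of this kind of joint training, assuming a shared encoder feeding one supervised head and one self-supervised head whose losses are summed; the toy reconstruction task, dimensions, and loss weight are illustrative, not the tasks studied in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU())   # shared representation
supervised_head = nn.Linear(128, 10)    # e.g. a labeled classification task
ssl_head = nn.Linear(128, 64)           # e.g. a self-supervised reconstruction task

x = torch.randn(32, 64)
labels = torch.randint(0, 10, (32,))    # labeled data may be limited in practice

h = encoder(x)
loss_sup = F.cross_entropy(supervised_head(h), labels)
loss_ssl = F.mse_loss(ssl_head(h), x)   # toy self-supervised objective

# The shared encoder receives gradients from both objectives simultaneously.
(loss_sup + 0.5 * loss_ssl).backward()
```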
no code implementations • 18 Apr 2019 • Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh Mcdermott, Antonio Torralba
Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data.
2 code implementations • ECCV 2018 • Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh Mcdermott, Antonio Torralba
We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel.
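A rough sketch of the mix-and-separate training idea behind such a system: two clips with known solo audio are mixed, and a mask over the mixture spectrogram is predicted conditioned on each clip's visual features. The tiny network and feature shapes are illustrative, not PixelPlayer's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

F_BINS, T_FRAMES, VIS_DIM = 64, 32, 16

class MaskPredictor(nn.Module):
    """Predicts a spectrogram mask for the source associated with a visual feature."""
    def __init__(self):
        super().__init__()
        self.audio_net = nn.Conv2d(1, VIS_DIM, kernel_size=3, padding=1)

    def forward(self, mix_spec, visual_feat):
        a = self.audio_net(mix_spec.unsqueeze(1))                       # (B, VIS_DIM, F, T)
        # The visual feature modulates audio channels to produce a mask.
        mask = torch.sigmoid((a * visual_feat[:, :, None, None]).sum(1))
        return mask                                                     # (B, F, T)

model = MaskPredictor()
solo1 = torch.rand(2, F_BINS, T_FRAMES)      # magnitude spectrogram of source 1
solo2 = torch.rand(2, F_BINS, T_FRAMES)      # magnitude spectrogram of source 2
mix = solo1 + solo2                          # synthetic mixture for self-supervision
vis1, vis2 = torch.randn(2, VIS_DIM), torch.randn(2, VIS_DIM)

# Each visual stream should recover its own source from the shared mixture.
pred1 = model(mix, vis1) * mix
pred2 = model(mix, vis2) * mix
loss = F.l1_loss(pred1, solo1) + F.l1_loss(pred2, solo2)
loss.backward()
```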