Search Results for author: Andrew Rouditchenko

Found 19 papers, 11 papers with code

AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition

no code implementations 29 Sep 2023 Andrew Rouditchenko, Ronan Collobert, Tatiana Likhomanenko

Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR).

Audio-Visual Speech Recognition, Automatic Speech Recognition, +4

Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

no code implementations 21 May 2023 Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass

Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each.

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

1 code implementation 7 Oct 2022 Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass

Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English.

Knowledge Distillation, Retrieval, +2

Contrastive Audio-Visual Masked Autoencoder

1 code implementation 2 Oct 2022 Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities.

Ranked #1 on Audio Tagging on AudioSet (using extra training data)

Audio Classification, Audio Tagging, +6

Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval

1 code implementation CVPR 2022 Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio S. Feris, David Harwath, James Glass, Hilde Kuehne

In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a fused representation in a joint multi-modal embedding space.

Action Localization, Retrieval, +2

Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

1 code implementation 8 Dec 2021 Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification.

Action Localization, Retrieval, +2

Routing with Self-Attention for Multimodal Capsule Networks

no code implementations 1 Dec 2021 Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah

We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework on large amounts of video data.

Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

1 code implementation 14 Oct 2021 Ian Palmer, Andrew Rouditchenko, Andrei Barbu, Boris Katz, James Glass

These results show that models trained on other datasets and then evaluated on Spoken ObjectNet tend to perform poorly due to biases in other datasets that the models have learned.

Image Retrieval, Language Modelling, +1

Cross-Modal Discrete Representation Learning

no code implementations ACL 2022 Alexander H. Liu, SouYoung Jin, Cheng-I Jeff Lai, Andrew Rouditchenko, Aude Oliva, James Glass

Recent advances in representation learning have demonstrated an ability to represent information from different modalities such as video, text, and audio in a single high-level embedding vector.

Cross-Modal Retrieval, Quantization, +4

Label-efficient audio classification through multitask learning and self-supervision

no code implementations ICLR Workshop LLD 2019 Tyler Lee, Ting Gong, Suchismita Padhy, Andrew Rouditchenko, Anthony Ndirango

We demonstrate that, in scenarios with limited labeled training data, one can significantly improve the performance of three different supervised classification tasks individually by up to 6% through simultaneous training with these additional self-supervised tasks.

Audio Classification, Data Augmentation, +2

Self-Supervised Audio-Visual Co-Segmentation

no code implementations 18 Apr 2019 Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba

Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data.

Image Segmentation, Segmentation, +1

The Sound of Pixels

2 code implementations ECCV 2018 Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba

We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel.
