Search Results for author: David Harwath

Found 22 papers, 10 papers with code

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

1 code implementation · 30 Mar 2022 · Alan Baade, Puyuan Peng, David Harwath

In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification.

Audio Classification
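
As a rough illustration of the masked-autoencoding idea named in the title above (not the paper's actual architecture, sizes, or masking recipe), the sketch below masks most spectrogram patches, encodes only the visible ones, and reconstructs the masked ones with a light decoder; all module names and dimensions are assumptions.

```python
# Illustrative sketch of MAE-style masked spectrogram modeling.
# NOT the MAE-AST reference implementation; sizes and masking ratio are assumptions.
import torch
import torch.nn as nn

class MaskedSpectrogramAutoencoder(nn.Module):
    def __init__(self, patch_dim=256, embed_dim=192, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, embed_dim)           # flattened spectrogram patches -> tokens
        enc_layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.decoder = nn.Linear(embed_dim, patch_dim)                # shallow decoder reconstructs patches
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, patches):                                       # patches: (B, N, patch_dim)
        B, N, _ = patches.shape
        n_keep = max(1, int(N * (1 - self.mask_ratio)))
        perm = torch.rand(B, N, device=patches.device).argsort(dim=1)
        keep_idx = perm[:, :n_keep]                                   # indices of visible patches

        tokens = self.patch_embed(patches)
        visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        encoded = self.encoder(visible)                               # encode only the visible tokens

        # Scatter encoded tokens back; fill masked positions with a learned mask token.
        full = self.mask_token.expand(B, N, -1).clone()
        full.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, full.size(-1)), encoded)
        recon = self.decoder(full)

        # Reconstruction loss on the masked positions only.
        mask = torch.ones(B, N, dtype=torch.bool, device=patches.device)
        mask.scatter_(1, keep_idx, False)
        return ((recon - patches) ** 2)[mask].mean()

# toy usage: 8 clips, each split into 100 patches of 16x16 = 256 spectrogram bins
loss = MaskedSpectrogramAutoencoder()(torch.randn(8, 100, 256))
```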

Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval

1 code implementation · CVPR 2022 · Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio S. Feris, David Harwath, James Glass, Hilde Kuehne

In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a fused representation in a joint multi-modal embedding space.

Action Localization · Video Retrieval
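
The abstract above describes the general pattern of attending jointly over tokens from several modalities and projecting them into a shared embedding space; the snippet below shows that generic pattern under assumed dimensions and module names, and is not the authors' actual model or training objective.

```python
# Generic sketch of a modality-agnostic fusion transformer: tokens from any
# subset of modalities are concatenated, jointly attended over, and pooled
# into one embedding in a shared space. Names and sizes are assumptions.
import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    def __init__(self, dim=512, layers=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=layers)
        self.proj = nn.Linear(dim, dim)                 # projection into the joint embedding space

    def forward(self, *modality_tokens):
        # Each argument is (B, T_i, dim); any combination of modalities may be passed.
        x = torch.cat(modality_tokens, dim=1)           # concatenate along the token axis
        fused = self.fusion(x)                          # self-attention exchanges information across modalities
        pooled = fused.mean(dim=1)                      # average-pool to one vector per sample
        return nn.functional.normalize(self.proj(pooled), dim=-1)

# toy usage: fuse video, audio, and text token sequences for a batch of 4 clips
model = FusionTransformer()
video, audio, text = torch.randn(4, 16, 512), torch.randn(4, 20, 512), torch.randn(4, 12, 512)
joint = model(video, audio, text)                       # (4, 512) embedding in the shared space
```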

Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

1 code implementation · 8 Dec 2021 · Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

Multi-modal learning from video data has seen increased attention recently, as it allows semantically meaningful embeddings to be trained without human annotation, enabling tasks such as zero-shot retrieval and classification.

Action Localization · Video Retrieval

Routing with Self-Attention for Multimodal Capsule Networks

no code implementations · 1 Dec 2021 · Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah

We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework on large amounts of video data.

Fast-Slow Transformer for Visually Grounding Speech

1 code implementation · 16 Sep 2021 · Puyuan Peng, David Harwath

We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS.

Image Retrieval

Learning Audio-Visual Dereverberation

no code implementations · 14 Jun 2021 · Changan Chen, Wei Sun, David Harwath, Kristen Grauman

The visual environment surrounding a human speaker reveals important cues about the room geometry, materials, and speaker location, all of which influence the precise reverberation effects in the audio stream.

Automatic Speech Recognition · Speaker Identification · +1

Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

no code implementations · CVPR 2021 · Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva

With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video.

Contrastive Learning · Video Understanding

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

1 code implementation · ICCV 2021 · Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang

Multimodal self-supervised learning is receiving growing attention, as it makes it possible not only to train large networks without human supervision but also to search and retrieve data across various modalities.

Contrastive Learning · Self-Supervised Learning · +3
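
As a minimal sketch of the kind of cross-modal contrastive objective this line of work builds on (the clustering component the paper adds is not shown, and the exact loss here is an assumption), embeddings of co-occurring pairs from two modalities can be pulled together while non-matching pairs in the batch are pushed apart:

```python
# Generic symmetric InfoNCE-style loss between two modalities, e.g. a video
# clip and its audio track. Illustration only; not the paper's exact objective.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(z_a, z_b, temperature=0.07):
    """z_a, z_b: (B, D) embeddings of paired samples from two modalities."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Match row i of modality A with row i of modality B, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# toy usage
loss = cross_modal_contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
```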

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

no code implementations · ACL 2021 · Wei-Ning Hsu, David Harwath, Christopher Song, James Glass

In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision.

Image Captioning · Speech Synthesis · +1

Transfer Learning from Audio-Visual Grounding to Speech Recognition

no code implementations · 9 Jul 2019 · Wei-Ning Hsu, David Harwath, James Glass

Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks.

Speech Recognition · Transfer Learning · +1
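
For readers unfamiliar with the pattern, here is a generic sketch of the transfer-learning recipe described above: an encoder initialized from weights learned on a related source task is fine-tuned with a fresh head on the target task. The encoder, head, and dimensions are placeholders, not details from the paper.

```python
# Generic transfer-learning sketch; all modules and sizes are placeholders.
import torch
import torch.nn as nn

def build_speech_encoder():
    # stand-in encoder over 40-dim filterbank frames
    return nn.Sequential(
        nn.Conv1d(40, 256, kernel_size=5, padding=2),
        nn.ReLU(),
        nn.Conv1d(256, 256, kernel_size=5, padding=2),
    )

source_encoder = build_speech_encoder()                 # imagine this was trained on the source task
target_encoder = build_speech_encoder()
target_encoder.load_state_dict(source_encoder.state_dict())   # re-use the learned weights

asr_head = nn.Linear(256, 43)                           # new task-specific output layer
optimizer = torch.optim.Adam(
    list(target_encoder.parameters()) + list(asr_head.parameters()), lr=1e-4
)
# ... fine-tune target_encoder + asr_head on the (smaller) target-task dataset ...
```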

Towards Visually Grounded Sub-Word Speech Unit Discovery

no code implementations · 21 Feb 2019 · David Harwath, James Glass

In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes.

Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech

no code implementations · 9 Apr 2018 · David Harwath, Galen Chuang, James Glass

In this paper, we explore the learning of neural network embeddings for natural images and speech waveforms describing the content of those images.

Speech Recognition

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

no code implementations · ECCV 2018 · David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, James Glass

In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to.
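
One way to realize the association of audio segments with image regions described above is a frame-by-region similarity map; the sketch below shows that generic idea under assumed feature shapes and pooling, and is not claimed to be the paper's exact model.

```python
# Sketch of a local audio-visual similarity map: every (audio frame, image
# location) pair gets a score, so word-like segments can be localized to
# image regions. Shapes and the pooling choice are illustrative assumptions.
import torch

def matchmap_similarity(audio_feats, image_feats):
    """
    audio_feats: (T, D)      per-frame embeddings of a spoken caption
    image_feats: (H, W, D)   spatial embeddings of the paired image
    Returns the (T, H, W) similarity map and a pooled clip-level score.
    """
    matchmap = torch.einsum('td,hwd->thw', audio_feats, image_feats)
    # One common pooling: best image location per audio frame, averaged over frames.
    score = matchmap.flatten(1).max(dim=1).values.mean()
    return matchmap, score

# toy usage: 120 audio frames against a 14x14 grid of image features
mm, s = matchmap_similarity(torch.randn(120, 512), torch.randn(14, 14, 512))
```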

Learning Modality-Invariant Representations for Speech and Images

no code implementations · 11 Dec 2017 · Kenneth Leidal, David Harwath, James Glass

In this paper, we explore the unsupervised learning of a semantic embedding space for co-occurring sensory inputs.

Information Retrieval · Semantic Similarity · +2

Learning Word-Like Units from Joint Audio-Visual Analysis

no code implementations · ACL 2017 · David Harwath, James R. Glass

Given a collection of images and spoken audio captions, we present a method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions.

Automatic Speech Recognition · Language Acquisition

Deep Multimodal Semantic Embeddings for Speech and Images

no code implementations · 11 Nov 2015 · David Harwath, James Glass

In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities.

Image Retrieval
