Search Results for author: Florian Metze

Found 83 papers, 26 papers with code

EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding

4 code implementations 29 Jul 2015 Yajie Miao, Mohammad Gowayyed, Florian Metze

The performance of automatic speech recognition (ASR) has improved tremendously due to the application of deep neural networks (DNNs).

Automatic Speech Recognition (ASR) +1

A Comparison of Deep Learning Methods for Environmental Sound Detection

1 code implementation 20 Mar 2017 Juncheng Li, Wei Dai, Florian Metze, Shuhui Qu, Samarjit Das

On these features, we apply five models: Gaussian Mixture Model (GMM), Deep Neural Network (DNN), Recurrent Neural Network (RNN), Convolutional Deep Neural Network (CNN) and i-vector.


Visual Features for Context-Aware Speech Recognition

no code implementations 1 Dec 2017 Abhinav Gupta, Yajie Miao, Leonardo Neves, Florian Metze

We are working on a corpus of "how-to" videos from the web, and the idea is that an object that can be seen ("car"), or a scene that is being detected ("kitchen") can be used to condition both models on the "context" of the recording, thereby reducing perplexity and improving transcription.

Language Modelling speech-recognition +1

Subword and Crossword Units for CTC Acoustic Models

no code implementations 19 Dec 2017 Thomas Zenkel, Ramon Sanabria, Florian Metze, Alex Waibel

This paper proposes a novel approach to creating a unit set for CTC-based speech recognition systems.

Language Modelling speech-recognition +1

Sequence-based Multi-lingual Low Resource Speech Recognition

no code implementations 21 Feb 2018 Siddharth Dalmia, Ramon Sanabria, Florian Metze, Alan W. Black

Techniques for multi-lingual and cross-lingual speech recognition can help in low resource scenarios, to bootstrap systems and enable analysis of new languages and domains.

Speech Recognition

End-to-End Multimodal Speech Recognition

no code implementations 25 Apr 2018 Shruti Palaskar, Ramon Sanabria, Florian Metze

Transcription or sub-titling of open-domain videos remains challenging for Automatic Speech Recognition (ASR) due to the data's difficult acoustics, variable signal processing, and essentially unrestricted domain.

Automatic Speech Recognition (ASR) +2

Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

1 code implementation ICMR 2018 Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, Amit K. Roy-Chowdhury

Constructing a joint representation invariant across different modalities (e.g., video, language) is of significant importance in many multimedia applications.

Retrieval Text Retrieval +1

Hierarchical Multi Task Learning With CTC

no code implementations 18 Jul 2018 Ramon Sanabria, Florian Metze

Our model obtains 14.0% Word Error Rate on the Eval2000 Switchboard subset without any decoder or language model, outperforming the current state-of-the-art on acoustic-to-word models.

Automatic Speech Recognition (ASR) +3

Acoustic-to-Word Recognition with Sequence-to-Sequence Models

no code implementations 23 Jul 2018 Shruti Palaskar, Florian Metze

We present effective methods to train Sequence-to-Sequence models for direct word-level recognition (and character-level recognition) and show an absolute improvement of 4.4-5.0% in Word Error Rate on the Switchboard corpus compared to prior work.

Language Modelling speech-recognition +1

Domain Robust Feature Extraction for Rapid Low Resource ASR Development

no code implementations 28 Jul 2018 Siddharth Dalmia, Xinjian Li, Florian Metze, Alan W. Black

We demonstrate the effectiveness of using a pre-trained English recognizer, which is robust to such mismatched conditions, as a domain normalizing feature extractor on a low resource language.

Dialog-context aware end-to-end speech recognition

no code implementations 7 Aug 2018 Suyoun Kim, Florian Metze

Existing speech recognition systems are typically built at the sentence level, although it is known that dialog context, e.g., higher-level knowledge that spans across sentences or speakers, can help the processing of long conversations.

Sentence speech-recognition +1

Connectionist Temporal Localization for Sound Event Detection with Sequential Labeling

2 code implementations 22 Oct 2018 Yun Wang, Florian Metze

Research on sound event detection (SED) with weak labeling has mostly focused on presence/absence labeling, which provides no temporal information at all about the event occurrences.

Sound Audio and Speech Processing

A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling

3 code implementations 22 Oct 2018 Yun Wang, Juncheng Li, Florian Metze

This paper compares five types of pooling functions both theoretically and experimentally, with a special focus on their localization performance.
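
The snippet above does not define what a pooling function does in this setting. As a hedged illustration (showing only generic max and average pooling, not the five functions actually compared in the paper), frame-level event probabilities are aggregated into a single clip-level probability under the multiple instance learning view:

```python
import numpy as np

def max_pool(frame_probs):
    """Clip-level probability = the most confident frame ('max' pooling)."""
    return frame_probs.max(axis=0)

def avg_pool(frame_probs):
    """Clip-level probability = mean over all frames ('average' pooling)."""
    return frame_probs.mean(axis=0)

# toy usage: 100 frames x 3 event classes of frame-level probabilities
frame_probs = np.random.rand(100, 3)
print(max_pool(frame_probs), avg_pool(frame_probs))
```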

Sound Audio and Speech Processing

How2: A Large-scale Dataset for Multimodal Language Understanding

2 code implementations 1 Nov 2018 Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, Florian Metze

In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations.

Automatic Speech Recognition (ASR) +3

Multimodal Grounding for Sequence-to-Sequence Speech Recognition

1 code implementation 9 Nov 2018 Ozan Caglayan, Ramon Sanabria, Shruti Palaskar, Loïc Barrault, Florian Metze

Specifically, in our previous work, we propose a multistep visual adaptive training approach which improves the accuracy of an audio-based Automatic Speech Recognition (ASR) system.

Automatic Speech Recognition (ASR) +2

Learning from Multiview Correlations in Open-Domain Videos

no code implementations 21 Nov 2018 Nils Holzenberger, Shruti Palaskar, Pranava Madhyastha, Florian Metze, Raman Arora

This shows it is possible to learn reliable representations across disparate, unaligned and noisy modalities, and encourages using the proposed approach on larger datasets.

Representation Learning Retrieval

Learned In Speech Recognition: Contextual Acoustic Word Embeddings

no code implementations 18 Feb 2019 Shruti Palaskar, Vikas Raunak, Florian Metze

End-to-end acoustic-to-word speech recognition models have recently gained popularity because they are easy to train, scale well to large amounts of training data, and do not require a lexicon.

Sentence speech-recognition +3

Phoneme Level Language Models for Sequence Based Low Resource ASR

no code implementations 20 Feb 2019 Siddharth Dalmia, Xinjian Li, Alan W. Black, Florian Metze

Building multilingual and crosslingual models helps bring different languages together in a language-universal space.

Language Modelling

The ARIEL-CMU Systems for LoReHLT18

no code implementations 24 Feb 2019 Aditi Chaudhary, Siddharth Dalmia, Junjie Hu, Xinjian Li, Austin Matthews, Aldrian Obaja Muis, Naoki Otani, Shruti Rijhwani, Zaid Sheikh, Nidhi Vyas, Xinyi Wang, Jiateng Xie, Ruochen Xu, Chunting Zhou, Peter J. Jansen, Yiming Yang, Lori Levin, Florian Metze, Teruko Mitamura, David R. Mortensen, Graham Neubig, Eduard Hovy, Alan W. Black, Jaime Carbonell, Graham V. Horwood, Shabnam Tafreshi, Mona Diab, Efsun S. Kayi, Noura Farra, Kathleen McKeown

This paper describes the ARIEL-CMU submissions to the Low Resource Human Language Technologies (LoReHLT) 2018 evaluations for the tasks Machine Translation (MT), Entity Discovery and Linking (EDL), and detection of Situation Frames in Text and Speech (SF Text and Speech).

Machine Translation Translation

Acoustic-to-Word Models with Conversational Context Information

no code implementations NAACL 2019 Suyoun Kim, Florian Metze

Conversational context information, higher-level knowledge that spans across sentences, can help to recognize a long conversation.

Sentence speech-recognition +1

Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion

no code implementations ACL 2019 Suyoun Kim, Siddharth Dalmia, Florian Metze

We present a novel conversational-context aware end-to-end speech recognizer based on a gated neural network that incorporates conversational-context/word/speech embeddings.

Sentence Sentence Embeddings +2

Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions

no code implementations 30 Jun 2019 Tejas Srinivasan, Ramon Sanabria, Florian Metze

Multimodal learning allows us to leverage information from multiple sources (visual, acoustic and text), similar to our experience of the real world.

Automatic Speech Recognition (ASR) +1

Cross-Attention End-to-End ASR for Two-Party Conversations

no code implementations 24 Jul 2019 Suyoun Kim, Siddharth Dalmia, Florian Metze

We present an end-to-end speech recognition model that learns interaction between two speakers based on the turn-changing information.

Speech Recognition +1

Effective Dimensionality Reduction for Word Embeddings

1 code implementation WS 2019 Vikas Raunak, Vivek Gupta, Florian Metze

Pre-trained word embeddings are used in several downstream applications as well as for constructing representations for sentences, paragraphs and documents.
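
As a minimal sketch of what dimensionality reduction of such embeddings can look like (plain PCA under simple assumptions, not necessarily the post-processing pipeline proposed in this paper):

```python
import numpy as np

def pca_reduce(embeddings, target_dim=150):
    """Project mean-centered embedding vectors onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the (vocab, dim) matrix; rows of vt are the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:target_dim].T

# toy usage: 1000 "words" with 300-d vectors reduced to 150-d
reduced = pca_reduce(np.random.randn(1000, 300), target_dim=150)
print(reduced.shape)  # (1000, 150)
```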

Dimensionality Reduction Word Embeddings

Multilingual Speech Recognition with Corpus Relatedness Sampling

no code implementations 2 Aug 2019 Xinjian Li, Siddharth Dalmia, Alan W. Black, Florian Metze

For example, the target corpus might benefit more from a corpus in the same domain or a corpus from a close language.

Speech Recognition

RTC-VAE: Harnessing the Peculiarity of Total Correlation in Learning Disentangled Representations

no code implementations 25 Sep 2019 Ze Cheng, Juncheng B Li, Chenxu Wang, Jixuan Gu, Hao Xu, Xinjian Li, Florian Metze

In the problem of unsupervised learning of disentangled representations, one of the promising methods is to penalize the total correlation of sampled latent variables.
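
For context, total correlation has a standard definition (this is background, not a contribution of the paper): it is the KL divergence between the joint aggregate posterior and the product of its per-dimension marginals, and it vanishes exactly when the latent dimensions are independent:

```latex
% Total correlation of a latent code z = (z_1, ..., z_d) under q(z)
TC(z) = D_{\mathrm{KL}}\!\left( q(z) \,\Big\|\, \prod_{j=1}^{d} q(z_j) \right)
```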

Disentanglement

On Dimensional Linguistic Properties of the Word Embedding Space

2 code implementations WS 2020 Vikas Raunak, Vaibhav Kumar, Vivek Gupta, Florian Metze

Word embeddings have become a staple of several natural language processing tasks, yet much remains to be understood about their properties.

Machine Translation Sentence +3

On Leveraging the Visual Modality for Neural Machine Translation

no code implementations WS 2019 Vikas Raunak, Sang Keun Choe, Quanyang Lu, Yi Xu, Florian Metze

Leveraging the visual modality effectively for Neural Machine Translation (NMT) remains an open problem in computational linguistics.

Multimodal Machine Translation NMT +2

Multitask Learning For Different Subword Segmentations In Neural Machine Translation

no code implementations EMNLP (IWSLT) 2019 Tejas Srinivasan, Ramon Sanabria, Florian Metze

In Neural Machine Translation (NMT) the usage of subwords and characters as source and target units offers a simple and flexible solution for translation of rare and unseen words.

Machine Translation NMT +2

Adversarial Music: Real World Audio Adversary Against Wake-word Detection System

no code implementations NeurIPS 2019 Juncheng B. Li, Shuhui Qu, Xinjian Li, Joseph Szurley, J. Zico Kolter, Florian Metze

In this work, we target our attack on the wake-word detection system, jamming the model with some inconspicuous background music to deactivate the VAs while our audio adversary is present.

Real-World Adversarial Attack

On Compositionality in Neural Machine Translation

no code implementations 4 Nov 2019 Vikas Raunak, Vaibhav Kumar, Florian Metze

We investigate two specific manifestations of compositionality in Neural Machine Translation (NMT): (1) Productivity - the ability of the model to extend its predictions beyond the observed length in training data and (2) Systematicity - the ability of the model to systematically recombine known parts and rules.

Machine Translation NMT +1

Enforcing Encoder-Decoder Modularity in Sequence-to-Sequence Models

no code implementations 9 Nov 2019 Siddharth Dalmia, Abdel-rahman Mohamed, Mike Lewis, Florian Metze, Luke Zettlemoyer

Inspired by modular software design principles of independence, interchangeability, and clarity of interface, we introduce a method for enforcing encoder-decoder modularity in seq2seq models without sacrificing the overall model quality or its full differentiability.

Looking Enhances Listening: Recovering Missing Speech Using Images

no code implementations 13 Feb 2020 Tejas Srinivasan, Ramon Sanabria, Florian Metze

Speech is understood better by using visual context; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems.

Automatic Speech Recognition (ASR) +1

Towards Zero-shot Learning for Automatic Phonemic Transcription

no code implementations 26 Feb 2020 Xinjian Li, Siddharth Dalmia, David R. Mortensen, Juncheng Li, Alan W. Black, Florian Metze

The difficulty of this task is that phoneme inventories often differ between the training languages and the target language, making it infeasible to recognize unseen phonemes.

Zero-Shot Learning

AlloVera: A Multilingual Allophone Database

no code implementations LREC 2020 David R. Mortensen, Xinjian Li, Patrick Littell, Alexis Michaud, Shruti Rijhwani, Antonios Anastasopoulos, Alan W. Black, Florian Metze, Graham Neubig

While phonemic representations are language specific, phonetic representations (stated in terms of (allo)phones) are much closer to a universal (language-independent) transcription.

Speech Recognition

Contextual RNN-T For Open Domain ASR

no code implementations 4 Jun 2020 Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf

By using an attention model and a biasing model to leverage the contextual metadata that accompanies a video, we observe a relative improvement of about 16% in Word Error Rate on Named Entities (WER-NE) for videos with related metadata.

Automatic Speech Recognition (ASR) +2

How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language

1 code implementation CVPR 2021 Amanda Duarte, Shruti Palaskar, Lucas Ventura, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, Xavier Giro-i-Nieto

Towards this end, we introduce How2Sign, a multimodal and multiview continuous American Sign Language (ASL) dataset, consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth.

Sign Language Production Sign Language Translation +1

Revisiting Factorizing Aggregated Posterior in Learning Disentangled Representations

no code implementations 12 Sep 2020 Ze Cheng, Juncheng Li, Chenxu Wang, Jixuan Gu, Hao Xu, Xinjian Li, Florian Metze

In this paper, we provide a theoretical explanation that low total correlation of sampled representation cannot guarantee low total correlation of the mean representation.

Fine-Grained Grounding for Multimodal Speech Recognition

1 code implementation Findings of the Association for Computational Linguistics 2020 Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott

In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features, that the proposals enable the model to recover entities and other related words, such as adjectives, and that improvements are due to the model's ability to localize the correct proposals.

Automatic Speech Recognition (ASR) +1

Support-set bottlenecks for video-text representation learning

no code implementations ICLR 2021 Mandela Patrick, Po-Yao Huang, Yuki Asano, Florian Metze, Alexander Hauptmann, João Henriques, Andrea Vedaldi

The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs.
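
The description above states the contrastive objective only in words. As a rough illustration (a generic InfoNCE-style loss under simple assumptions, not this paper's support-set method), matched video/text pairs within a batch act as positives and all other pairings as negatives; every name below is hypothetical:

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Minimal InfoNCE sketch: matched (video, text) rows are positives,
    every other row in the batch is treated as a negative."""
    # L2-normalize so dot products are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature           # (batch, batch) similarity matrix
    # log-softmax over each row; the diagonal entries are the positives
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))      # pull positives together, push the rest apart

# toy usage: 4 video/text pairs with 16-d embeddings
rng = np.random.default_rng(0)
print(info_nce(rng.normal(size=(4, 16)), rng.normal(size=(4, 16))))
```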

Contrastive Learning Representation Learning +3

Multimodal Speech Recognition with Unstructured Audio Masking

no code implementations EMNLP (nlpbt) 2020 Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott

Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words in this unstructured masking setting.

8k Automatic Speech Recognition +2

Audio-Visual Event Recognition through the lens of Adversary

1 code implementation 15 Nov 2020 Juncheng B Li, Kaixin Ma, Shuhui Qu, Po-Yao Huang, Florian Metze

This work studies several key questions about multimodal learning through the lens of adversarial noise: (1) how the trade-off between early/middle/late fusion affects robustness and accuracy, and (2) how different frequency/time-domain features contribute to robustness.

End-to-end Quantized Training via Log-Barrier Extensions

no code implementations 1 Jan 2021 Juncheng B Li, Shuhui Qu, Xinjian Li, Emma Strubell, Florian Metze

Quantization of neural network parameters and activations has emerged as a successful approach to reducing the model size and inference time on hardware that supports native low-precision arithmetic.
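
As background for the quantization being discussed (a generic uniform "fake quantization" sketch, not the paper's log-barrier method), a weight tensor can be snapped to a k-bit integer grid and mapped back to floats; the symmetric scale choice below is a simplifying assumption:

```python
import numpy as np

def fake_quantize(w, bits=8):
    """Uniformly quantize w to a signed k-bit grid and map back to floats,
    simulating low-precision weights."""
    qmax = 2 ** (bits - 1) - 1                        # e.g. [-127, 127] for 8 bits
    scale = max(np.max(np.abs(w)) / qmax, 1e-8)       # avoid a zero scale for all-zero tensors
    q = np.clip(np.round(w / scale), -qmax, qmax)     # integer grid
    return q * scale                                  # dequantized ("fake quantized") values

w = np.random.randn(4, 4).astype(np.float32)
print(np.max(np.abs(w - fake_quantize(w, bits=8))))   # error is bounded by about scale/2
```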

Quantization

NoiseQA: Challenge Set Evaluation for User-Centric Question Answering

2 code implementations EACL 2021 Abhilasha Ravichander, Siddharth Dalmia, Maria Ryskina, Florian Metze, Eduard Hovy, Alan W Black

When Question-Answering (QA) systems are deployed in the real world, users query them through a variety of interfaces, such as speaking to voice assistants, typing questions into a search engine, or even translating questions to languages supported by the QA system.

Question Answering

Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning

1 code implementation ICCV 2021 Mandela Patrick, Yuki M. Asano, Bernie Huang, Ishan Misra, Florian Metze, Joao Henriques, Andrea Vedaldi

First, for space, we show that spatial augmentations such as cropping work well for videos too, but that previous implementations, due to high processing and memory cost, could not apply them at a scale sufficient to be effective.

Representation Learning Self-Supervised Learning

Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks

no code implementations NAACL 2021 Siddharth Dalmia, Brian Yan, Vikas Raunak, Florian Metze, Shinji Watanabe

In this work, we present an end-to-end framework that exploits compositionality to learn searchable hidden representations at intermediate stages of a sequence model using decomposed sub-tasks.

Speech Recognition +1

Differentiable Allophone Graphs for Language-Universal Speech Recognition

1 code implementation 24 Jul 2021 Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe

These phone-based systems with learned allophone graphs can be used by linguists to document new languages, build phone-based lexicons that capture rich pronunciation variations, and re-evaluate the allophone mappings of seen languages.

Speech Recognition

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

2 code implementations EMNLP 2021 Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.

 Ranked #1 on Temporal Action Localization on CrossTask (using extra training data)

Action Segmentation Long Video Retrieval (Background Removed) +4

Speech Summarization using Restricted Self-Attention

no code implementations 12 Oct 2021 Roshan Sharma, Shruti Palaskar, Alan W Black, Florian Metze

End-to-end modeling of speech summarization models is challenging due to memory and compute constraints arising from long input audio sequences.

Document Summarization speech-recognition +2

On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization

no code implementations 24 May 2022 Shruti Palaskar, Akshita Bhagia, Yonatan Bisk, Florian Metze, Alan W Black, Ana Marasović

Combining the visual modality with pretrained language models has been surprisingly effective for simple descriptive tasks such as image captioning.

Descriptive Image Captioning +5

LegoNN: Building Modular Encoder-Decoder Models

no code implementations 7 Jun 2022 Siddharth Dalmia, Dmytro Okhonko, Mike Lewis, Sergey Edunov, Shinji Watanabe, Florian Metze, Luke Zettlemoyer, Abdelrahman Mohamed

We describe LegoNN, a procedure for building encoder-decoder architectures so that their parts can be applied to other tasks without any fine-tuning.

Automatic Speech Recognition (ASR) +3

Masked Autoencoders that Listen

4 code implementations 13 Jul 2022 Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
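
As a hedged sketch of the masking step described above (not the released Audio-MAE code), one can split a spectrogram into patch tokens, drop a high fraction at random, and pass only the visible tokens to the encoder; the patch size and masking ratio below are illustrative assumptions:

```python
import numpy as np

def mask_spectrogram_patches(spec, patch=16, mask_ratio=0.8, seed=0):
    """Split a (freq, time) spectrogram into non-overlapping patches and keep
    only a small random subset, as in MAE-style pre-training."""
    f, t = spec.shape
    f, t = f - f % patch, t - t % patch                       # crop to a multiple of the patch size
    patches = (spec[:f, :t]
               .reshape(f // patch, patch, t // patch, patch)
               .transpose(0, 2, 1, 3)
               .reshape(-1, patch * patch))                   # (num_patches, patch*patch) tokens
    rng = np.random.default_rng(seed)
    keep = rng.permutation(len(patches))[: int(len(patches) * (1 - mask_ratio))]
    return patches[keep], np.sort(keep)                       # visible tokens + their positions

# toy usage: 128 mel bins x 1024 frames, 80% of patches masked out
visible, positions = mask_spectrogram_patches(np.random.randn(128, 1024))
print(visible.shape)   # only ~20% of the tokens go through the encoder
```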

Ranked #2 on Speaker Identification on VoxCeleb1 (using extra training data)

Audio Classification Representation Learning +1

ASR2K: Speech Recognition for Around 2000 Languages without Audio

1 code implementation 6 Sep 2022 Xinjian Li, Florian Metze, David R Mortensen, Alan W Black, Shinji Watanabe

We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only and improve them to 45% CER and 69% WER when using 10000 raw text utterances.

Language Modelling Speech Recognition

CTC Alignments Improve Autoregressive Translation

no code implementations 11 Oct 2022 Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, Shinji Watanabe

Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment.
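
As background for the conditional-independence assumption mentioned here (standard CTC notation, not a result of this paper), the objective marginalizes over all monotonic frame-level alignments and factorizes each alignment frame by frame:

```latex
% B maps a frame-level alignment \pi to the label sequence y by removing
% blanks and collapsing repeats; each frame is predicted independently.
P(y \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} P(\pi_t \mid x)
```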

Automatic Speech Recognition (ASR) +3

SQuAT: Sharpness- and Quantization-Aware Training for BERT

no code implementations 13 Oct 2022 Zheng Wang, Juncheng B Li, Shuhui Qu, Florian Metze, Emma Strubell

Quantization is an effective technique to reduce memory footprint, inference latency, and power consumption of deep learning models.

Quantization

Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models

1 code implementation 27 Oct 2022 Siddhant Arora, Siddharth Dalmia, Brian Yan, Florian Metze, Alan W Black, Shinji Watanabe

End-to-end spoken language understanding (SLU) systems are gaining popularity over cascaded approaches due to their simplicity and ability to avoid error propagation.

Named Entity Recognition +2

Normalized Contrastive Learning for Text-Video Retrieval

1 code implementation 30 Nov 2022 Yookoon Park, Mahmoud Azab, Bo Xiong, Seungwhan Moon, Florian Metze, Gourab Kundu, Kirmani Ahmed

Cross-modal contrastive learning has led the recent advances in multimodal retrieval with its simplicity and effectiveness.

Contrastive Learning Cross-Modal Retrieval +2

Error-aware Quantization through Noise Tempering

no code implementations 11 Dec 2022 Zheng Wang, Juncheng B Li, Shuhui Qu, Florian Metze, Emma Strubell

In this work, we incorporate exponentially decaying quantization-error-aware noise together with a learnable scale of task loss gradient to approximate the effect of a quantization operator.

Model Compression Quantization

Phone Inventories and Recognition for Every Language

no code implementations LREC 2022 Xinjian Li, Florian Metze, David R. Mortensen, Alan W Black, Shinji Watanabe

Identifying phone inventories is a crucial component in language documentation and the preservation of endangered languages.

CMU’s Machine Translation System for IWSLT 2019

no code implementations EMNLP (IWSLT) 2019 Tejas Srinivasan, Ramon Sanabria, Florian Metze

In Neural Machine Translation (NMT) the usage of sub-words and characters as source and target units offers a simple and flexible solution for translation of rare and unseen words.

Machine Translation NMT +1
