Search Results for author: Florian Metze

Found 83 papers, 26 papers with code

EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding

4 code implementations 29 Jul 2015 Yajie Miao, Mohammad Gowayyed, Florian Metze

The performance of automatic speech recognition (ASR) has improved tremendously due to the application of deep neural networks (DNNs).

Automatic Speech Recognition (ASR) +1

A Comparison of Deep Learning Methods for Environmental Sound Detection

1 code implementation 20 Mar 2017 Juncheng Li, Wei Dai, Florian Metze, Shuhui Qu, Samarjit Das

On these features, we apply five models: Gaussian Mixture Model (GMM), Deep Neural Network (DNN), Recurrent Neural Network (RNN), Convolutional Deep Neural Network (CNN) and i-vector.


Visual Features for Context-Aware Speech Recognition

no code implementations 1 Dec 2017 Abhinav Gupta, Yajie Miao, Leonardo Neves, Florian Metze

We are working on a corpus of "how-to" videos from the web, and the idea is that an object that can be seen ("car"), or a scene that is being detected ("kitchen") can be used to condition both models on the "context" of the recording, thereby reducing perplexity and improving transcription.

Language Modelling speech-recognition +1

Subword and Crossword Units for CTC Acoustic Models

no code implementations 19 Dec 2017 Thomas Zenkel, Ramon Sanabria, Florian Metze, Alex Waibel

This paper proposes a novel approach to creating a unit set for CTC-based speech recognition systems.

Language Modelling speech-recognition +1

Sequence-based Multi-lingual Low Resource Speech Recognition

no code implementations 21 Feb 2018 Siddharth Dalmia, Ramon Sanabria, Florian Metze, Alan W. Black

Techniques for multi-lingual and cross-lingual speech recognition can help in low resource scenarios, to bootstrap systems and enable analysis of new languages and domains.

Speech Recognition

End-to-End Multimodal Speech Recognition

no code implementations 25 Apr 2018 Shruti Palaskar, Ramon Sanabria, Florian Metze

Transcription or sub-titling of open-domain videos remains challenging for Automatic Speech Recognition (ASR) due to the data's difficult acoustics, variable signal processing, and essentially unrestricted domain.

Automatic Speech Recognition (ASR) +2

Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

1 code implementation ICMR 2018 Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, Amit K. Roy-Chowdhury

Constructing a joint representation invariant across different modalities (e.g., video, language) is of significant importance in many multimedia applications.

Retrieval Text Retrieval +1

Hierarchical Multi Task Learning With CTC

no code implementations 18 Jul 2018 Ramon Sanabria, Florian Metze

Our model obtains 14.0% Word Error Rate on the Eval2000 Switchboard subset without any decoder or language model, outperforming the current state-of-the-art on acoustic-to-word models.

Automatic Speech Recognition (ASR) +3

Acoustic-to-Word Recognition with Sequence-to-Sequence Models

no code implementations 23 Jul 2018 Shruti Palaskar, Florian Metze

We present effective methods to train Sequence-to-Sequence models for direct word-level recognition (and character-level recognition) and show an absolute improvement of 4.4-5.0% in Word Error Rate on the Switchboard corpus compared to prior work.

Language Modelling speech-recognition +1

Domain Robust Feature Extraction for Rapid Low Resource ASR Development

no code implementations 28 Jul 2018 Siddharth Dalmia, Xinjian Li, Florian Metze, Alan W. Black

We demonstrate the effectiveness of using a pre-trained English recognizer, which is robust to such mismatched conditions, as a domain normalizing feature extractor on a low resource language.

Dialog-context aware end-to-end speech recognition

no code implementations 7 Aug 2018 Suyoun Kim, Florian Metze

Existing speech recognition systems are typically built at the sentence level, although it is known that dialog context, e.g., higher-level knowledge that spans across sentences or speakers, can help the processing of long conversations.

Sentence speech-recognition +1

Connectionist Temporal Localization for Sound Event Detection with Sequential Labeling

2 code implementations 22 Oct 2018 Yun Wang, Florian Metze

Research on sound event detection (SED) with weak labeling has mostly focused on presence/absence labeling, which provides no temporal information at all about the event occurrences.

Sound Audio and Speech Processing

A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling

3 code implementations 22 Oct 2018 Yun Wang, Juncheng Li, Florian Metze

This paper compares five types of pooling functions both theoretically and experimentally, with a special focus on their localization performance.
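
The snippet above does not define what a pooling function does in this setting. As a hedged illustration (showing only generic max and average pooling, not the five functions actually compared in the paper), frame-level event probabilities are aggregated into a single clip-level probability under the multiple instance learning view:

```python
import numpy as np

def max_pool(frame_probs):
    """Clip-level probability = the most confident frame ('max' pooling)."""
    return frame_probs.max(axis=0)

def avg_pool(frame_probs):
    """Clip-level probability = mean over all frames ('average' pooling)."""
    return frame_probs.mean(axis=0)

# toy usage: 100 frames x 3 event classes of frame-level probabilities
frame_probs = np.random.rand(100, 3)
print(max_pool(frame_probs), avg_pool(frame_probs))
```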

Sound Audio and Speech Processing

How2: A Large-scale Dataset for Multimodal Language Understanding

2 code implementations 1 Nov 2018 Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, Florian Metze

In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations.

Automatic Speech Recognition (ASR) +3

Multimodal Grounding for Sequence-to-Sequence Speech Recognition

1 code implementation 9 Nov 2018 Ozan Caglayan, Ramon Sanabria, Shruti Palaskar, Loïc Barrault, Florian Metze

Specifically, in our previous work, we propose a multistep visual adaptive training approach which improves the accuracy of an audio-based Automatic Speech Recognition (ASR) system.

Automatic Speech Recognition (ASR) +2

Learning from Multiview Correlations in Open-Domain Videos

no code implementations 21 Nov 2018 Nils Holzenberger, Shruti Palaskar, Pranava Madhyastha, Florian Metze, Raman Arora

This shows it is possible to learn reliable representations across disparate, unaligned and noisy modalities, and encourages using the proposed approach on larger datasets.

Representation Learning Retrieval

Learned In Speech Recognition: Contextual Acoustic Word Embeddings

no code implementations 18 Feb 2019 Shruti Palaskar, Vikas Raunak, Florian Metze

End-to-end acoustic-to-word speech recognition models have recently gained popularity because they are easy to train, scale well to large amounts of training data, and do not require a lexicon.

Sentence speech-recognition +3

Phoneme Level Language Models for Sequence Based Low Resource ASR

no code implementations 20 Feb 2019 Siddharth Dalmia, Xinjian Li, Alan W. Black, Florian Metze

Building multilingual and crosslingual models helps bring different languages together in a language-universal space.

Language Modelling

The ARIEL-CMU Systems for LoReHLT18

no code implementations 24 Feb 2019 Aditi Chaudhary, Siddharth Dalmia, Junjie Hu, Xinjian Li, Austin Matthews, Aldrian Obaja Muis, Naoki Otani, Shruti Rijhwani, Zaid Sheikh, Nidhi Vyas, Xinyi Wang, Jiateng Xie, Ruochen Xu, Chunting Zhou, Peter J. Jansen, Yiming Yang, Lori Levin, Florian Metze, Teruko Mitamura, David R. Mortensen, Graham Neubig, Eduard Hovy, Alan W. Black, Jaime Carbonell, Graham V. Horwood, Shabnam Tafreshi, Mona Diab, Efsun S. Kayi, Noura Farra, Kathleen McKeown

This paper describes the ARIEL-CMU submissions to the Low Resource Human Language Technologies (LoReHLT) 2018 evaluations for the tasks Machine Translation (MT), Entity Discovery and Linking (EDL), and detection of Situation Frames in Text and Speech (SF Text and Speech).

Machine Translation Translation

Acoustic-to-Word Models with Conversational Context Information

no code implementations NAACL 2019 Suyoun Kim, Florian Metze

Conversational context information, higher-level knowledge that spans across sentences, can help to recognize a long conversation.

Sentence speech-recognition +1

Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion

no code implementations ACL 2019 Suyoun Kim, Siddharth Dalmia, Florian Metze

We present a novel conversational-context aware end-to-end speech recognizer based on a gated neural network that incorporates conversational-context/word/speech embeddings.

Sentence Sentence Embeddings +2

Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions

no code implementations 30 Jun 2019 Tejas Srinivasan, Ramon Sanabria, Florian Metze

Multimodal learning allows us to leverage information from multiple sources (visual, acoustic and text), similar to our experience of the real world.

Automatic Speech Recognition (ASR) +1

Cross-Attention End-to-End ASR for Two-Party Conversations

no code implementations 24 Jul 2019 Suyoun Kim, Siddharth Dalmia, Florian Metze

We present an end-to-end speech recognition model that learns interaction between two speakers based on the turn-changing information.

Speech Recognition +1

Effective Dimensionality Reduction for Word Embeddings

1 code implementation WS 2019 Vikas Raunak, Vivek Gupta, Florian Metze

Pre-trained word embeddings are used in several downstream applications as well as for constructing representations for sentences, paragraphs and documents.
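
As a minimal sketch of what dimensionality reduction of such embeddings can look like (plain PCA under simple assumptions, not necessarily the post-processing pipeline proposed in this paper):

```python
import numpy as np

def pca_reduce(embeddings, target_dim=150):
    """Project mean-centered embedding vectors onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the (vocab, dim) matrix; rows of vt are the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:target_dim].T

# toy usage: 1000 "words" with 300-d vectors reduced to 150-d
reduced = pca_reduce(np.random.randn(1000, 300), target_dim=150)
print(reduced.shape)  # (1000, 150)
```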

Dimensionality Reduction Word Embeddings

Multilingual Speech Recognition with Corpus Relatedness Sampling

no code implementations 2 Aug 2019 Xinjian Li, Siddharth Dalmia, Alan W. Black, Florian Metze

For example, the target corpus might benefit more from a corpus in the same domain or a corpus from a close language.

Speech Recognition

RTC-VAE: Harnessing the Peculiarity of Total Correlation in Learning Disentangled Representations

no code implementations 25 Sep 2019 Ze Cheng, Juncheng B Li, Chenxu Wang, Jixuan Gu, Hao Xu, Xinjian Li, Florian Metze

In the problem of unsupervised learning of disentangled representations, one of the promising methods is to penalize the total correlation of sampled latent variables.
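
For context, total correlation has a standard definition (this is background, not a contribution of the paper): it is the KL divergence between the joint aggregate posterior and the product of its per-dimension marginals, and it vanishes exactly when the latent dimensions are independent:

```latex
% Total correlation of a latent code z = (z_1, ..., z_d) under q(z)
TC(z) = D_{\mathrm{KL}}\!\left( q(z) \,\Big\|\, \prod_{j=1}^{d} q(z_j) \right)
```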

Disentanglement

On Dimensional Linguistic Properties of the Word Embedding Space

2 code implementations WS 2020 Vikas Raunak, Vaibhav Kumar, Vivek Gupta, Florian Metze

Word embeddings have become a staple of several natural language processing tasks, yet much remains to be understood about their properties.

Machine Translation Sentence +3

On Leveraging the Visual Modality for Neural Machine Translation

no code implementations WS 2019 Vikas Raunak, Sang Keun Choe, Quanyang Lu, Yi Xu, Florian Metze

Leveraging the visual modality effectively for Neural Machine Translation (NMT) remains an open problem in computational linguistics.

Multimodal Machine Translation NMT +2

Multitask Learning For Different Subword Segmentations In Neural Machine Translation

no code implementations EMNLP (IWSLT) 2019 Tejas Srinivasan, Ramon Sanabria, Florian Metze

In Neural Machine Translation (NMT) the usage of subwords and characters as source and target units offers a simple and flexible solution for translation of rare and unseen words.

Machine Translation NMT +2

Adversarial Music: Real World Audio Adversary Against Wake-word Detection System

no code implementations NeurIPS 2019 Juncheng B. Li, Shuhui Qu, Xinjian Li, Joseph Szurley, J. Zico Kolter, Florian Metze

In this work, we target our attack on the wake-word detection system, jamming the model with some inconspicuous background music to deactivate the VAs while our audio adversary is present.

Real-World Adversarial Attack

On Compositionality in Neural Machine Translation

no code implementations 4 Nov 2019 Vikas Raunak, Vaibhav Kumar, Florian Metze

We investigate two specific manifestations of compositionality in Neural Machine Translation (NMT): (1) Productivity - the ability of the model to extend its predictions beyond the observed length in training data and (2) Systematicity - the ability of the model to systematically recombine known parts and rules.

Machine Translation NMT +1

Enforcing Encoder-Decoder Modularity in Sequence-to-Sequence Models

no code implementations 9 Nov 2019 Siddharth Dalmia, Abdel-rahman Mohamed, Mike Lewis, Florian Metze, Luke Zettlemoyer

Inspired by modular software design principles of independence, interchangeability, and clarity of interface, we introduce a method for enforcing encoder-decoder modularity in seq2seq models without sacrificing the overall model quality or its full differentiability.

Looking Enhances Listening: Recovering Missing Speech Using Images

no code implementations 13 Feb 2020 Tejas Srinivasan, Ramon Sanabria, Florian Metze

Speech is understood better by using visual context; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems.

Automatic Speech Recognition (ASR) +1

Towards Zero-shot Learning for Automatic Phonemic Transcription

no code implementations 26 Feb 2020 Xinjian Li, Siddharth Dalmia, David R. Mortensen, Juncheng Li, Alan W. Black, Florian Metze

The difficulty of this task is that phoneme inventories often differ between the training languages and the target language, making it infeasible to recognize unseen phonemes.

Zero-Shot Learning

AlloVera: A Multilingual Allophone Database

no code implementations LREC 2020 David R. Mortensen, Xinjian Li, Patrick Littell, Alexis Michaud, Shruti Rijhwani, Antonios Anastasopoulos, Alan W. Black, Florian Metze, Graham Neubig

While phonemic representations are language specific, phonetic representations (stated in terms of (allo)phones) are much closer to a universal (language-independent) transcription.

Speech Recognition

Contextual RNN-T For Open Domain ASR

no code implementations 4 Jun 2020 Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf

By using an attention model and a biasing model to leverage the contextual metadata that accompanies a video, we observe a relative improvement of about 16% in Word Error Rate on Named Entities (WER-NE) for videos with related metadata.

Automatic Speech Recognition (ASR) +2

How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language

1 code implementation CVPR 2021 Amanda Duarte, Shruti Palaskar, Lucas Ventura, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, Xavier Giro-i-Nieto

Towards this end, we introduce How2Sign, a multimodal and multiview continuous American Sign Language (ASL) dataset, consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth.

Sign Language Production Sign Language Translation +1

Revisiting Factorizing Aggregated Posterior in Learning Disentangled Representations

no code implementations 12 Sep 2020 Ze Cheng, Juncheng Li, Chenxu Wang, Jixuan Gu, Hao Xu, Xinjian Li, Florian Metze

In this paper, we provide a theoretical explanation that low total correlation of sampled representation cannot guarantee low total correlation of the mean representation.

Fine-Grained Grounding for Multimodal Speech Recognition

1 code implementation Findings of the Association for Computational Linguistics 2020 Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott

In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features, that the proposals enable the model to recover entities and other related words, such as adjectives, and that improvements are due to the model's ability to localize the correct proposals.

Automatic Speech Recognition (ASR) +1

Support-set bottlenecks for video-text representation learning

no code implementations ICLR 2021 Mandela Patrick, Po-Yao Huang, Yuki Asano, Florian Metze, Alexander Hauptmann, João Henriques, Andrea Vedaldi

The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs.
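
The description above states the contrastive objective only in words. As a rough illustration (a generic InfoNCE-style loss under simple assumptions, not this paper's support-set method), matched video/text pairs within a batch act as positives and all other pairings as negatives; every name below is hypothetical:

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Minimal InfoNCE sketch: matched (video, text) rows are positives,
    every other row in the batch is treated as a negative."""
    # L2-normalize so dot products are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature           # (batch, batch) similarity matrix
    # log-softmax over each row; the diagonal entries are the positives
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))      # pull positives together, push the rest apart

# toy usage: 4 video/text pairs with 16-d embeddings
rng = np.random.default_rng(0)
print(info_nce(rng.normal(size=(4, 16)), rng.normal(size=(4, 16))))
```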

Contrastive Learning Representation Learning +3

Multimodal Speech Recognition with Unstructured Audio Masking

no code implementations EMNLP (nlpbt) 2020 Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott

Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words in this unstructured masking setting.

8k Automatic Speech Recognition +2

Audio-Visual Event Recognition through the lens of Adversary

1 code implementation 15 Nov 2020 Juncheng B Li, Kaixin Ma, Shuhui Qu, Po-Yao Huang, Florian Metze

This work studies several key questions about multimodal learning through the lens of adversarial noise: (1) how the trade-off between early/middle/late fusion affects robustness and accuracy, and (2) how different frequency/time-domain features contribute to robustness.

End-to-end Quantized Training via Log-Barrier Extensions

no code implementations 1 Jan 2021 Juncheng B Li, Shuhui Qu, Xinjian Li, Emma Strubell, Florian Metze

Quantization of neural network parameters and activations has emerged as a successful approach to reducing the model size and inference time on hardware that supports native low-precision arithmetic.
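
As background for the quantization being discussed (a generic uniform "fake quantization" sketch, not the paper's log-barrier method), a weight tensor can be snapped to a k-bit integer grid and mapped back to floats; the symmetric scale choice below is a simplifying assumption:

```python
import numpy as np

def fake_quantize(w, bits=8):
    """Uniformly quantize w to a signed k-bit grid and map back to floats,
    simulating low-precision weights."""
    qmax = 2 ** (bits - 1) - 1                        # e.g. [-127, 127] for 8 bits
    scale = max(np.max(np.abs(w)) / qmax, 1e-8)       # avoid a zero scale for all-zero tensors
    q = np.clip(np.round(w / scale), -qmax, qmax)     # integer grid
    return q * scale                                  # dequantized ("fake quantized") values

w = np.random.randn(4, 4).astype(np.float32)
print(np.max(np.abs(w - fake_quantize(w, bits=8))))   # error is bounded by about scale/2
```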

Quantization

NoiseQA: Challenge Set Evaluation for User-Centric Question Answering

2 code implementations EACL 2021 Abhilasha Ravichander, Siddharth Dalmia, Maria Ryskina, Florian Metze, Eduard Hovy, Alan W Black

When Question-Answering (QA) systems are deployed in the real world, users query them through a variety of interfaces, such as speaking to voice assistants, typing questions into a search engine, or even translating questions to languages supported by the QA system.

Question Answering

Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning

1 code implementation ICCV 2021 Mandela Patrick, Yuki M. Asano, Bernie Huang, Ishan Misra, Florian Metze, Joao Henriques, Andrea Vedaldi

First, for space, we show that spatial augmentations such as cropping work well for videos too, but that previous implementations, due to high processing and memory cost, could not apply them at a scale sufficient to be effective.

Representation Learning Self-Supervised Learning

Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks

no code implementations NAACL 2021 Siddharth Dalmia, Brian Yan, Vikas Raunak, Florian Metze, Shinji Watanabe

In this work, we present an end-to-end framework that exploits compositionality to learn searchable hidden representations at intermediate stages of a sequence model using decomposed sub-tasks.

Speech Recognition +1

Differentiable Allophone Graphs for Language-Universal Speech Recognition

1 code implementation 24 Jul 2021 Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe

These phone-based systems with learned allophone graphs can be used by linguists to document new languages, build phone-based lexicons that capture rich pronunciation variations, and re-evaluate the allophone mappings of seen languages.

Speech Recognition

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

2 code implementations EMNLP 2021 Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.

 Ranked #1 on Temporal Action Localization on CrossTask (using extra training data)

Action Segmentation Long Video Retrieval (Background Removed) +4

Speech Summarization using Restricted Self-Attention

no code implementations 12 Oct 2021 Roshan Sharma, Shruti Palaskar, Alan W Black, Florian Metze

End-to-end modeling of speech summarization models is challenging due to memory and compute constraints arising from long input audio sequences.

Document Summarization speech-recognition +2

On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization

no code implementations 24 May 2022 Shruti Palaskar, Akshita Bhagia, Yonatan Bisk, Florian Metze, Alan W Black, Ana Marasović

Combining the visual modality with pretrained language models has been surprisingly effective for simple descriptive tasks such as image captioning.

Descriptive Image Captioning +5

LegoNN: Building Modular Encoder-Decoder Models

no code implementations 7 Jun 2022 Siddharth Dalmia, Dmytro Okhonko, Mike Lewis, Sergey Edunov, Shinji Watanabe, Florian Metze, Luke Zettlemoyer, Abdelrahman Mohamed

We describe LegoNN, a procedure for building encoder-decoder architectures so that their parts can be applied to other tasks without any fine-tuning.

Automatic Speech Recognition (ASR) +3

Masked Autoencoders that Listen

4 code implementations 13 Jul 2022 Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
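
As a hedged sketch of the masking step described above (not the released Audio-MAE code), one can split a spectrogram into patch tokens, drop a high fraction at random, and pass only the visible tokens to the encoder; the patch size and masking ratio below are illustrative assumptions:

```python
import numpy as np

def mask_spectrogram_patches(spec, patch=16, mask_ratio=0.8, seed=0):
    """Split a (freq, time) spectrogram into non-overlapping patches and keep
    only a small random subset, as in MAE-style pre-training."""
    f, t = spec.shape
    f, t = f - f % patch, t - t % patch                       # crop to a multiple of the patch size
    patches = (spec[:f, :t]
               .reshape(f // patch, patch, t // patch, patch)
               .transpose(0, 2, 1, 3)
               .reshape(-1, patch * patch))                   # (num_patches, patch*patch) tokens
    rng = np.random.default_rng(seed)
    keep = rng.permutation(len(patches))[: int(len(patches) * (1 - mask_ratio))]
    return patches[keep], np.sort(keep)                       # visible tokens + their positions

# toy usage: 128 mel bins x 1024 frames, 80% of patches masked out
visible, positions = mask_spectrogram_patches(np.random.randn(128, 1024))
print(visible.shape)   # only ~20% of the tokens go through the encoder
```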

Ranked #2 on Speaker Identification on VoxCeleb1 (using extra training data)

Audio Classification Representation Learning +1

ASR2K: Speech Recognition for Around 2000 Languages without Audio

1 code implementation 6 Sep 2022 Xinjian Li, Florian Metze, David R Mortensen, Alan W Black, Shinji Watanabe

We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only and improve them to 45% CER and 69% WER when using 10000 raw text utterances.

Language Modelling Speech Recognition

CTC Alignments Improve Autoregressive Translation

no code implementations 11 Oct 2022 Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, Shinji Watanabe

Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment.
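
As background for the conditional-independence assumption mentioned here (standard CTC notation, not a result of this paper), the objective marginalizes over all monotonic frame-level alignments and factorizes each alignment frame by frame:

```latex
% B maps a frame-level alignment \pi to the label sequence y by removing
% blanks and collapsing repeats; each frame is predicted independently.
P(y \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} P(\pi_t \mid x)
```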

Automatic Speech Recognition (ASR) +3

SQuAT: Sharpness- and Quantization-Aware Training for BERT

no code implementations 13 Oct 2022 Zheng Wang, Juncheng B Li, Shuhui Qu, Florian Metze, Emma Strubell

Quantization is an effective technique to reduce memory footprint, inference latency, and power consumption of deep learning models.

Quantization

Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models

1 code implementation 27 Oct 2022 Siddhant Arora, Siddharth Dalmia, Brian Yan, Florian Metze, Alan W Black, Shinji Watanabe

End-to-end spoken language understanding (SLU) systems are gaining popularity over cascaded approaches due to their simplicity and ability to avoid error propagation.

Named Entity Recognition +2

Normalized Contrastive Learning for Text-Video Retrieval

1 code implementation 30 Nov 2022 Yookoon Park, Mahmoud Azab, Bo Xiong, Seungwhan Moon, Florian Metze, Gourab Kundu, Kirmani Ahmed

Cross-modal contrastive learning has led the recent advances in multimodal retrieval with its simplicity and effectiveness.

Contrastive Learning Cross-Modal Retrieval +2

Error-aware Quantization through Noise Tempering

no code implementations 11 Dec 2022 Zheng Wang, Juncheng B Li, Shuhui Qu, Florian Metze, Emma Strubell

In this work, we incorporate exponentially decaying quantization-error-aware noise together with a learnable scale of task loss gradient to approximate the effect of a quantization operator.

Model Compression Quantization

Phone Inventories and Recognition for Every Language

no code implementations LREC 2022 Xinjian Li, Florian Metze, David R. Mortensen, Alan W Black, Shinji Watanabe

Identifying phone inventories is a crucial component in language documentation and the preservation of endangered languages.

CMU’s Machine Translation System for IWSLT 2019

no code implementations EMNLP (IWSLT) 2019 Tejas Srinivasan, Ramon Sanabria, Florian Metze

In Neural Machine Translation (NMT) the usage of sub-words and characters as source and target units offers a simple and flexible solution for translation of rare and unseen words.

Machine Translation NMT +1
