Search Results for author: Florian Metze

Found 71 papers, 17 papers with code

Speech Summarization using Restricted Self-Attention

no code implementations · 12 Oct 2021 · Roshan Sharma, Shruti Palaskar, Alan W Black, Florian Metze

End-to-end modeling of speech summarization models is challenging due to memory and compute constraints arising from long input audio sequences.

Document Summarization Speech Recognition +1

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

1 code implementation · EMNLP 2021 · Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.

Action Segmentation Video Retrieval

Differentiable Allophone Graphs for Language-Universal Speech Recognition

no code implementations · 24 Jul 2021 · Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe

These phone-based systems with learned allophone graphs can be used by linguists to document new languages, build phone-based lexicons that capture rich pronunciation variations, and re-evaluate the allophone mappings of seen languages.

Speech Recognition

Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks

no code implementations · NAACL 2021 · Siddharth Dalmia, Brian Yan, Vikas Raunak, Florian Metze, Shinji Watanabe

In this work, we present an end-to-end framework that exploits compositionality to learn searchable hidden representations at intermediate stages of a sequence model using decomposed sub-tasks.

Speech Recognition Translation

Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning

1 code implementation · ICCV 2021 · Mandela Patrick, Yuki M. Asano, Bernie Huang, Ishan Misra, Florian Metze, Joao Henriques, Andrea Vedaldi

First, for space, we show that spatial augmentations such as cropping also work well for videos, but that previous implementations, due to their high processing and memory cost, could not apply them at a scale sufficient to be effective.

Representation Learning Self-Supervised Learning
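The spatial-cropping idea above can be illustrated with a minimal sketch; plain Python lists stand in for video tensors here, and the function name and shapes are illustrative, not the paper's implementation:

```python
import random

# Illustrative sketch of spatial cropping as a video augmentation: pick one
# crop window at random and apply it to every frame of the clip, so the
# augmentation stays temporally coherent across the clip.

def random_crop_clip(clip, crop_h, crop_w, rng=random):
    """clip: list of frames, each a 2-D list of pixel values."""
    h, w = len(clip[0]), len(clip[0][0])
    top = rng.randrange(h - crop_h + 1)
    left = rng.randrange(w - crop_w + 1)
    # Same (top, left) window for every frame.
    return [[row[left:left + crop_w] for row in frame[top:top + crop_h]]
            for frame in clip]

# 3 frames of 4x4 "pixels", encoded as f*100 + r*10 + c for easy checking.
clip = [[[f * 100 + r * 10 + c for c in range(4)] for r in range(4)]
        for f in range(3)]
cropped = random_crop_clip(clip, 2, 2)
assert len(cropped) == 3 and len(cropped[0]) == 2 and len(cropped[0][0]) == 2
```

Because the same window is used for all frames, adjacent values in the crop keep their original spatial offsets (+1 per column, +10 per row, +100 per frame), which is easy to verify.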

NoiseQA: Challenge Set Evaluation for User-Centric Question Answering

1 code implementation · EACL 2021 · Abhilasha Ravichander, Siddharth Dalmia, Maria Ryskina, Florian Metze, Eduard Hovy, Alan W Black

When Question-Answering (QA) systems are deployed in the real world, users query them through a variety of interfaces, such as speaking to voice assistants, typing questions into a search engine, or even translating questions to languages supported by the QA system.

Question Answering

End-to-end Quantized Training via Log-Barrier Extensions

no code implementations · 1 Jan 2021 · Juncheng B Li, Shuhui Qu, Xinjian Li, Emma Strubell, Florian Metze

Quantization of neural network parameters and activations has emerged as a successful approach to reducing the model size and inference time on hardware that supports native low-precision arithmetic.

Quantization
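The quantize/dequantize round-trip that any quantized-training scheme must survive can be sketched in a few lines. This is generic uniform (affine) quantization for illustration only, not the paper's log-barrier method, and the weight values are made up:

```python
# Minimal sketch of uniform weight quantization to a signed b-bit range,
# illustrating what "quantization of parameters" means in practice.

def quantize(weights, num_bits=8):
    """Map floats to integers in [-2^(b-1), 2^(b-1) - 1] with a shared scale."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / qmax
    q = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.52, -1.30, 0.07, 0.99]  # made-up example weights
q, scale = quantize(weights)
restored = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step.
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, restored))
```

The bounded round-trip error is what makes training through the quantizer feasible; methods like the paper's differ in how they keep gradients well-behaved across the rounding step.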

Audio-Visual Event Recognition through the lens of Adversary

no code implementations · 15 Nov 2020 · Juncheng B Li, Kaixin Ma, Shuhui Qu, Po-Yao Huang, Florian Metze

This work aims to study several key questions related to multimodal learning through the lens of adversarial noise: 1) how the choice of early/middle/late fusion affects both robustness and accuracy, and 2) how different frequency/time-domain features contribute to robustness.

Multimodal Speech Recognition with Unstructured Audio Masking

no code implementations · EMNLP (nlpbt) 2020 · Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott

Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words in this unstructured masking setting.

Speech Recognition

Support-set bottlenecks for video-text representation learning

no code implementations · ICLR 2021 · Mandela Patrick, Po-Yao Huang, Yuki Asano, Florian Metze, Alexander Hauptmann, João Henriques, Andrea Vedaldi

The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs.

Contrastive Learning Representation Learning +2
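The pull-together/push-apart behavior described above is what an InfoNCE-style contrastive objective implements: each video's paired text is the positive, all other texts in the batch are negatives. A minimal sketch with made-up similarity scores (not the paper's model or its support-set extension):

```python
import math

# Noise-contrastive (InfoNCE) loss over a batch similarity matrix:
# sim[i][j] = similarity(video_i, text_j); diagonal entries are positives.

def info_nce_loss(sim, temperature=0.1):
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [sim[i][j] / temperature for j in range(n)]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        total += -(logits[i] - log_denom)  # cross-entropy with target class i
    return total / n

# A well-aligned batch (large diagonal) yields a lower loss than a shuffled one.
aligned = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
shuffled = [[0.1, 0.9, 0.0], [0.2, 0.1, 0.8], [0.9, 0.1, 0.0]]
assert info_nce_loss(aligned) < info_nce_loss(shuffled)
```

The paper's critique starts from exactly this setup: the objective pushes away *all* non-paired samples, even semantically related ones, which the support-set bottleneck is designed to soften.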

Fine-Grained Grounding for Multimodal Speech Recognition

1 code implementation · Findings of the Association for Computational Linguistics 2020 · Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott

In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features, that the proposals enable the model to recover entities and other related words, such as adjectives, and that improvements are due to the model's ability to localize the correct proposals.

Speech Recognition

Revisiting Factorizing Aggregated Posterior in Learning Disentangled Representations

no code implementations · 12 Sep 2020 · Ze Cheng, Juncheng Li, Chenxu Wang, Jixuan Gu, Hao Xu, Xinjian Li, Florian Metze

In this paper, we provide a theoretical explanation that low total correlation of the sampled representation cannot guarantee low total correlation of the mean representation.

How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language

no code implementations · CVPR 2021 · Amanda Duarte, Shruti Palaskar, Lucas Ventura, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, Xavier Giro-i-Nieto

Towards this end, we introduce How2Sign, a multimodal and multiview continuous American Sign Language (ASL) dataset, consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth.

Sign Language Production Sign Language Translation +1

Contextual RNN-T For Open Domain ASR

no code implementations · 4 Jun 2020 · Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf

By using an attention model and a biasing model to leverage the contextual metadata that accompanies a video, we observe a relative improvement of about 16% in Word Error Rate on Named Entities (WER-NE) for videos with related metadata.

Speech Recognition

AlloVera: A Multilingual Allophone Database

no code implementations · LREC 2020 · David R. Mortensen, Xinjian Li, Patrick Littell, Alexis Michaud, Shruti Rijhwani, Antonios Anastasopoulos, Alan W. Black, Florian Metze, Graham Neubig

While phonemic representations are language specific, phonetic representations (stated in terms of (allo)phones) are much closer to a universal (language-independent) transcription.

Speech Recognition

Towards Zero-shot Learning for Automatic Phonemic Transcription

no code implementations · 26 Feb 2020 · Xinjian Li, Siddharth Dalmia, David R. Mortensen, Juncheng Li, Alan W. Black, Florian Metze

The difficulty of this task is that phoneme inventories often differ between the training languages and the target language, making it infeasible to recognize unseen phonemes.

Zero-Shot Learning

Looking Enhances Listening: Recovering Missing Speech Using Images

no code implementations · 13 Feb 2020 · Tejas Srinivasan, Ramon Sanabria, Florian Metze

Speech is understood better by using visual context; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems.

Speech Recognition

Enforcing Encoder-Decoder Modularity in Sequence-to-Sequence Models

no code implementations · 9 Nov 2019 · Siddharth Dalmia, Abdel-rahman Mohamed, Mike Lewis, Florian Metze, Luke Zettlemoyer

Inspired by modular software design principles of independence, interchangeability, and clarity of interface, we introduce a method for enforcing encoder-decoder modularity in seq2seq models without sacrificing the overall model quality or its full differentiability.

On Compositionality in Neural Machine Translation

no code implementations · 4 Nov 2019 · Vikas Raunak, Vaibhav Kumar, Florian Metze

We investigate two specific manifestations of compositionality in Neural Machine Translation (NMT): (1) Productivity - the ability of the model to extend its predictions beyond the observed length in training data, and (2) Systematicity - the ability of the model to systematically recombine known parts and rules.

Machine Translation Translation

Adversarial Music: Real World Audio Adversary Against Wake-word Detection System

no code implementations · NeurIPS 2019 · Juncheng B. Li, Shuhui Qu, Xinjian Li, Joseph Szurley, J. Zico Kolter, Florian Metze

In this work, we target our attack on the wake-word detection system, jamming the model with some inconspicuous background music to deactivate the VAs while our audio adversary is present.

Real-World Adversarial Attack

Multitask Learning For Different Subword Segmentations In Neural Machine Translation

no code implementations · 27 Oct 2019 · Tejas Srinivasan, Ramon Sanabria, Florian Metze

In Neural Machine Translation (NMT) the usage of subwords and characters as source and target units offers a simple and flexible solution for translation of rare and unseen words.

Machine Translation Translation

On Leveraging the Visual Modality for Neural Machine Translation

no code implementations · WS 2019 · Vikas Raunak, Sang Keun Choe, Quanyang Lu, Yi Xu, Florian Metze

Leveraging the visual modality effectively for Neural Machine Translation (NMT) remains an open problem in computational linguistics.

Multimodal Machine Translation Translation

On Dimensional Linguistic Properties of the Word Embedding Space

2 code implementations · WS 2020 · Vikas Raunak, Vaibhav Kumar, Vivek Gupta, Florian Metze

Word embeddings have become a staple of several natural language processing tasks, yet much remains to be understood about their properties.

Machine Translation Sentence Classification +2

RTC-VAE: Harnessing the Peculiarity of Total Correlation in Learning Disentangled Representations

no code implementations · 25 Sep 2019 · Ze Cheng, Juncheng B Li, Chenxu Wang, Jixuan Gu, Hao Xu, Xinjian Li, Florian Metze

In the problem of unsupervised learning of disentangled representations, one of the promising methods is to penalize the total correlation of sampled latent variables.

Multilingual Speech Recognition with Corpus Relatedness Sampling

no code implementations · 2 Aug 2019 · Xinjian Li, Siddharth Dalmia, Alan W. Black, Florian Metze

For example, the target corpus might benefit more from a corpus in the same domain or a corpus from a close language.

Speech Recognition

Effective Dimensionality Reduction for Word Embeddings

1 code implementation · WS 2019 · Vikas Raunak, Vivek Gupta, Florian Metze

Pre-trained word embeddings are used in several downstream applications as well as for constructing representations for sentences, paragraphs and documents.

Dimensionality Reduction Word Embeddings

Cross-Attention End-to-End ASR for Two-Party Conversations

no code implementations · 24 Jul 2019 · Suyoun Kim, Siddharth Dalmia, Florian Metze

We present an end-to-end speech recognition model that learns interaction between two speakers based on the turn-changing information.

Speech Recognition

Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions

no code implementations · 30 Jun 2019 · Tejas Srinivasan, Ramon Sanabria, Florian Metze

Multimodal learning allows us to leverage information from multiple sources (visual, acoustic and text), similar to our experience of the real world.

Speech Recognition

Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion

no code implementations · ACL 2019 · Suyoun Kim, Siddharth Dalmia, Florian Metze

We present a novel conversational-context aware end-to-end speech recognizer based on a gated neural network that incorporates conversational-context/word/speech embeddings.

Sentence Embeddings Speech Recognition

Acoustic-to-Word Models with Conversational Context Information

no code implementations · NAACL 2019 · Suyoun Kim, Florian Metze

Conversational context information, higher-level knowledge that spans across sentences, can help to recognize a long conversation.

Speech Recognition

The ARIEL-CMU Systems for LoReHLT18

no code implementations · 24 Feb 2019 · Aditi Chaudhary, Siddharth Dalmia, Junjie Hu, Xinjian Li, Austin Matthews, Aldrian Obaja Muis, Naoki Otani, Shruti Rijhwani, Zaid Sheikh, Nidhi Vyas, Xinyi Wang, Jiateng Xie, Ruochen Xu, Chunting Zhou, Peter J. Jansen, Yiming Yang, Lori Levin, Florian Metze, Teruko Mitamura, David R. Mortensen, Graham Neubig, Eduard Hovy, Alan W. Black, Jaime Carbonell, Graham V. Horwood, Shabnam Tafreshi, Mona Diab, Efsun S. Kayi, Noura Farra, Kathleen McKeown

This paper describes the ARIEL-CMU submissions to the Low Resource Human Language Technologies (LoReHLT) 2018 evaluations for the tasks Machine Translation (MT), Entity Discovery and Linking (EDL), and detection of Situation Frames in Text and Speech (SF Text and Speech).

Machine Translation Translation

Phoneme Level Language Models for Sequence Based Low Resource ASR

no code implementations · 20 Feb 2019 · Siddharth Dalmia, Xinjian Li, Alan W. Black, Florian Metze

Building multilingual and crosslingual models helps bring different languages together in a language-universal space.

Language Modelling

Learned In Speech Recognition: Contextual Acoustic Word Embeddings

no code implementations · 18 Feb 2019 · Shruti Palaskar, Vikas Raunak, Florian Metze

End-to-end acoustic-to-word speech recognition models have recently gained popularity because they are easy to train, scale well to large amounts of training data, and do not require a lexicon.

Speech Recognition Spoken Language Understanding +1

Learning from Multiview Correlations in Open-Domain Videos

no code implementations · 21 Nov 2018 · Nils Holzenberger, Shruti Palaskar, Pranava Madhyastha, Florian Metze, Raman Arora

This shows it is possible to learn reliable representations across disparate, unaligned and noisy modalities, and encourages using the proposed approach on larger datasets.

Representation Learning

Multimodal Grounding for Sequence-to-Sequence Speech Recognition

1 code implementation · 9 Nov 2018 · Ozan Caglayan, Ramon Sanabria, Shruti Palaskar, Loïc Barrault, Florian Metze

Specifically, in our previous work, we proposed a multistep visual adaptive training approach which improves the accuracy of an audio-based Automatic Speech Recognition (ASR) system.

Sequence-To-Sequence Speech Recognition

How2: A Large-scale Dataset for Multimodal Language Understanding

1 code implementation · 1 Nov 2018 · Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, Florian Metze

In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations.

Machine Translation Speech Recognition +1

A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling

3 code implementations · 22 Oct 2018 · Yun Wang, Juncheng Li, Florian Metze

This paper compares five types of pooling functions both theoretically and experimentally, with special focus on their performance of localization.

Sound Audio and Speech Processing
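Pooling functions of this kind map frame-level event probabilities to a single clip-level probability, which is all that weak (clip-level) labels can supervise. Three of them can be sketched directly; the frame probabilities below are made up, and the paper additionally compares exponential-softmax and attention pooling:

```python
# Pooling functions for multiple instance learning in sound event detection:
# each maps per-frame event probabilities to one clip-level probability.

def max_pool(probs):
    return max(probs)

def average_pool(probs):
    return sum(probs) / len(probs)

def linear_softmax_pool(probs):
    # Each frame is weighted by its own probability, so confident frames
    # dominate without the rest being ignored entirely.
    return sum(p * p for p in probs) / sum(probs)

frames = [0.9, 0.1, 0.2, 0.1]  # event clearly present in one frame
assert max_pool(frames) == 0.9
assert abs(average_pool(frames) - 0.325) < 1e-9
# Linear softmax lies between average and max pooling.
assert average_pool(frames) < linear_softmax_pool(frames) < max_pool(frames)
```

The interpolating behavior shown in the last assertion is why the choice of pooling function matters for localization: max pooling credits a single frame, average pooling dilutes short events, and the softmax variants sit in between.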

Connectionist Temporal Localization for Sound Event Detection with Sequential Labeling

2 code implementations · 22 Oct 2018 · Yun Wang, Florian Metze

Research on sound event detection (SED) with weak labeling has mostly focused on presence/absence labeling, which provides no temporal information at all about the event occurrences.

Sound Audio and Speech Processing

Dialog-context aware end-to-end speech recognition

no code implementations · 7 Aug 2018 · Suyoun Kim, Florian Metze

Existing speech recognition systems are typically built at the sentence level, although it is known that dialog context, e.g., higher-level knowledge that spans across sentences or speakers, can help the processing of long conversations.

Speech Recognition

Domain Robust Feature Extraction for Rapid Low Resource ASR Development

no code implementations · 28 Jul 2018 · Siddharth Dalmia, Xinjian Li, Florian Metze, Alan W. Black

We demonstrate the effectiveness of using a pre-trained English recognizer, which is robust to such mismatched conditions, as a domain normalizing feature extractor on a low resource language.

Acoustic-to-Word Recognition with Sequence-to-Sequence Models

no code implementations · 23 Jul 2018 · Shruti Palaskar, Florian Metze

We present effective methods to train Sequence-to-Sequence models for direct word-level recognition (and character-level recognition) and show an absolute improvement of 4.4-5.0% in Word Error Rate on the Switchboard corpus compared to prior work.

Speech Recognition
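Word Error Rate, the metric quoted throughout these entries, is word-level edit distance normalized by the reference length. A minimal sketch with hypothetical sentences:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed via standard Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> 25% WER.
assert word_error_rate("the cat sat down", "the cat sad down") == 0.25
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why an "absolute improvement" in WER is reported as a difference in percentage points.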

Hierarchical Multi Task Learning With CTC

no code implementations · 18 Jul 2018 · Ramon Sanabria, Florian Metze

Our model obtains 14.0% Word Error Rate on the Eval2000 Switchboard subset without any decoder or language model, outperforming the current state-of-the-art on acoustic-to-word models.

Multi-Task Learning Speech Recognition

Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

1 code implementation · ICMR 2018 · Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, Amit K. Roy-Chowdhury

Constructing a joint representation invariant across different modalities (e.g., video, language) is of significant importance in many multimedia applications.

Video-Text Retrieval

End-to-End Multimodal Speech Recognition

no code implementations · 25 Apr 2018 · Shruti Palaskar, Ramon Sanabria, Florian Metze

Transcription or subtitling of open-domain videos is still a challenging task for Automatic Speech Recognition (ASR) due to the data's difficult acoustics, variable signal processing, and essentially unrestricted domain.

Speech Recognition

Sequence-based Multi-lingual Low Resource Speech Recognition

no code implementations · 21 Feb 2018 · Siddharth Dalmia, Ramon Sanabria, Florian Metze, Alan W. Black

Techniques for multi-lingual and cross-lingual speech recognition can help in low resource scenarios, to bootstrap systems and enable analysis of new languages and domains.

Speech Recognition

Subword and Crossword Units for CTC Acoustic Models

no code implementations · 19 Dec 2017 · Thomas Zenkel, Ramon Sanabria, Florian Metze, Alex Waibel

This paper proposes a novel approach to create a unit set for CTC-based speech recognition systems.

Speech Recognition

Visual Features for Context-Aware Speech Recognition

no code implementations · 1 Dec 2017 · Abhinav Gupta, Yajie Miao, Leonardo Neves, Florian Metze

We are working on a corpus of "how-to" videos from the web, and the idea is that an object that can be seen ("car"), or a scene that is being detected ("kitchen") can be used to condition both models on the "context" of the recording, thereby reducing perplexity and improving transcription.

Speech Recognition

Annotating High-Level Structures of Short Stories and Personal Anecdotes

no code implementations · LREC 2018 · Boyang Li, Beth Cardier, Tong Wang, Florian Metze

Stories are a vital form of communication in human culture; they are employed daily to persuade, to elicit sympathy, or to convey a message.

Comparison of Decoding Strategies for CTC Acoustic Models

no code implementations · 15 Aug 2017 · Thomas Zenkel, Ramon Sanabria, Florian Metze, Jan Niehues, Matthias Sperber, Sebastian Stüker, Alex Waibel

The CTC loss function maps an input sequence of observable feature vectors to an output sequence of symbols.

Speech Recognition
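The sequence-to-sequence mapping that CTC defines can be made concrete with the simplest of the decoding strategies, greedy decoding: take the best symbol per frame, merge repeated symbols, then drop the blank. The symbols below are illustrative characters, not the paper's actual units:

```python
# Greedy CTC decoding collapse: frame-level symbols -> output label sequence.
# The blank both absorbs "no event" frames and separates genuine repeats.

BLANK = "_"

def ctc_greedy_collapse(frame_symbols):
    out, prev = [], None
    for s in frame_symbols:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return "".join(out)

# Eleven frames collapse to "hello"; the blank between the two l-runs
# is what lets the doubled letter survive the merge step.
assert ctc_greedy_collapse(list("h_ee_l_ll_o")) == "hello"
assert ctc_greedy_collapse(list("heelloo")) == "helo"  # no blank: repeats merge
```

Beam search and WFST-based decoding, which the paper compares against, score many such frame-to-label alignments instead of committing to the single best symbol per frame.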

A Comparison of deep learning methods for environmental sound

no code implementations · 20 Mar 2017 · Juncheng Li, Wei Dai, Florian Metze, Shuhui Qu, Samarjit Das

On these features, we apply five models: Gaussian Mixture Model (GMM), Deep Neural Network (DNN), Recurrent Neural Network (RNN), Convolutional Deep Neural Network (CNN) and i-vector.

EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding

3 code implementations · 29 Jul 2015 · Yajie Miao, Mohammad Gowayyed, Florian Metze

The performance of automatic speech recognition (ASR) has improved tremendously due to the application of deep neural networks (DNNs).

Speech Recognition