1 code implementation • Findings (ACL) 2022 • Xinjian Li, Florian Metze, David Mortensen, Shinji Watanabe, Alan Black
Grapheme-to-Phoneme (G2P) has many applications in NLP and speech fields.
no code implementations • LREC 2022 • Xinjian Li, Florian Metze, David R. Mortensen, Alan W Black, Shinji Watanabe
Identifying phone inventories is a crucial component in language documentation and the preservation of endangered languages.
no code implementations • EMNLP (IWSLT) 2019 • Tejas Srinivasan, Ramon Sanabria, Florian Metze
In Neural Machine Translation (NMT) the usage of sub-words and characters as source and target units offers a simple and flexible solution for translation of rare and unseen words.
no code implementations • 11 Dec 2022 • Zheng Wang, Juncheng B Li, Shuhui Qu, Florian Metze, Emma Strubell
In this work, we incorporate exponentially decaying quantization-error-aware noise together with a learnable scale of task loss gradient to approximate the effect of a quantization operator.
no code implementations • 30 Nov 2022 • Yookoon Park, Mahmoud Azab, Bo Xiong, Seungwhan Moon, Florian Metze, Gourab Kundu, Kirmani Ahmed
Cross-modal contrastive learning has led the recent advances in multimodal retrieval with its simplicity and effectiveness.
1 code implementation • 27 Oct 2022 • Siddhant Arora, Siddharth Dalmia, Brian Yan, Florian Metze, Alan W Black, Shinji Watanabe
End-to-end spoken language understanding (SLU) systems are gaining popularity over cascaded approaches due to their simplicity and ability to avoid error propagation.
no code implementations • 13 Oct 2022 • Zheng Wang, Juncheng B Li, Shuhui Qu, Florian Metze, Emma Strubell
Quantization is an effective technique to reduce memory footprint, inference latency, and power consumption of deep learning models.
no code implementations • 11 Oct 2022 • Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, Shinji Watanabe
Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+3
1 code implementation • 6 Sep 2022 • Xinjian Li, Florian Metze, David R Mortensen, Alan W Black, Shinji Watanabe
We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only and improve them to 45% CER and 69% WER when using 10000 raw text utterances.
2 code implementations • 13 Jul 2022 • Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
Ranked #2 on
Speaker Identification
on VoxCeleb1
no code implementations • 7 Jun 2022 • Siddharth Dalmia, Dmytro Okhonko, Mike Lewis, Sergey Edunov, Shinji Watanabe, Florian Metze, Luke Zettlemoyer, Abdelrahman Mohamed
We present several experiments to demonstrate the effectiveness of LegoNN models: a trained language generation LegoNN decoder module from German-English (De-En) MT task can be reused with no fine-tuning for the Europarl English ASR and the Romanian-English (Ro-En) MT tasks to match or beat respective baseline models.
no code implementations • 24 May 2022 • Shruti Palaskar, Akshita Bhagia, Yonatan Bisk, Florian Metze, Alan W Black, Ana Marasović
Combining the visual modality with pretrained language models has been surprisingly effective for simple descriptive tasks such as image captioning.
no code implementations • 12 Oct 2021 • Roshan Sharma, Shruti Palaskar, Alan W Black, Florian Metze
End-to-end modeling of speech summarization models is challenging due to memory and compute constraints arising from long input audio sequences.
2 code implementations • EMNLP 2021 • Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.
Ranked #1 on
Zero-Shot Video Retrieval
on YouCook2
1 code implementation • 24 Jul 2021 • Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe
These phone-based systems with learned allophone graphs can be used by linguists to document new languages, build phone-based lexicons that capture rich pronunciation variations, and re-evaluate the allophone mappings of seen language.
no code implementations • 29 Jun 2021 • Siddhant Arora, Alissa Ostapenko, Vijay Viswanathan, Siddharth Dalmia, Florian Metze, Shinji Watanabe, Alan W Black
Our splits identify performance gaps up to 10% between end-to-end systems that were within 1% of each other on the original test sets.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+3
2 code implementations • NeurIPS 2021 • Mandela Patrick, Dylan Campbell, Yuki M. Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, João F. Henriques
In video transformers, the time dimension is often treated in the same way as the two spatial dimensions.
Ranked #10 on
Action Recognition
on EPIC-KITCHENS-100
(using extra training data)
1 code implementation • Findings (ACL) 2021 • Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer
We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both for a variety of end tasks.
Ranked #2 on
Temporal Action Localization
on CrossTask
(using extra training data)
no code implementations • NAACL 2021 • Siddharth Dalmia, Brian Yan, Vikas Raunak, Florian Metze, Shinji Watanabe
In this work, we present an end-to-end framework that exploits compositionality to learn searchable hidden representations at intermediate stages of a sequence model using decomposed sub-tasks.
no code implementations • CVPR 2022 • Triantafyllos Afouras, Yuki M. Asano, Francois Fagan, Andrea Vedaldi, Florian Metze
We tackle the problem of learning object detectors without supervision.
1 code implementation • ICCV 2021 • Mandela Patrick, Yuki M. Asano, Bernie Huang, Ishan Misra, Florian Metze, Joao Henriques, Andrea Vedaldi
First, for space, we show that spatial augmentations such as cropping do work well for videos too, but that previous implementations, due to the high processing and memory cost, could not do this at a scale sufficient for it to work well.
1 code implementation • NAACL 2021 • Po-Yao Huang, Mandela Patrick, Junjie Hu, Graham Neubig, Florian Metze, Alexander Hauptmann
Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings.
2 code implementations • EACL 2021 • Abhilasha Ravichander, Siddharth Dalmia, Maria Ryskina, Florian Metze, Eduard Hovy, Alan W Black
When Question-Answering (QA) systems are deployed in the real world, users query them through a variety of interfaces, such as speaking to voice assistants, typing questions into a search engine, or even translating questions to languages supported by the QA system.
no code implementations • 1 Jan 2021 • Juncheng B Li, Shuhui Qu, Xinjian Li, Emma Strubell, Florian Metze
Quantization of neural network parameters and activations has emerged as a successful approach to reducing the model size and inference time on hardware that sup-ports native low-precision arithmetic.
1 code implementation • 15 Nov 2020 • Juncheng B Li, Kaixin Ma, Shuhui Qu, Po-Yao Huang, Florian Metze
This work aims to study several key questions related to multimodal learning through the lens of adversarial noises: 1) The trade-off between early/middle/late fusion affecting its robustness and accuracy 2) How do different frequency/time domain features contribute to the robustness?
no code implementations • EMNLP (nlpbt) 2020 • Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott
Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words in this unstructured masking setting.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+1
1 code implementation • Findings of the Association for Computational Linguistics 2020 • Vikas Raunak, Siddharth Dalmia, Vivek Gupta, Florian Metze
State-of-the-art Neural Machine Translation (NMT) models struggle with generating low-frequency tokens, tackling which remains a major challenge.
no code implementations • ICLR 2021 • Mandela Patrick, Po-Yao Huang, Yuki Asano, Florian Metze, Alexander Hauptmann, João Henriques, Andrea Vedaldi
The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs.
1 code implementation • Findings of the Association for Computational Linguistics 2020 • Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott
In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features, that the proposals enable the model to recover entities and other related words, such as adjectives, and that improvements are due to the model's ability to localize the correct proposals.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+1
no code implementations • 12 Sep 2020 • Ze Cheng, Juncheng Li, Chenxu Wang, Jixuan Gu, Hao Xu, Xinjian Li, Florian Metze
In this paper, we provide a theoretical explanation that low total correlation of sampled representation cannot guarantee low total correlation of the mean representation.
1 code implementation • CVPR 2021 • Amanda Duarte, Shruti Palaskar, Lucas Ventura, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, Xavier Giro-i-Nieto
Towards this end, we introduce How2Sign, a multimodal and multiview continuous American Sign Language (ASL) dataset, consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth.
no code implementations • 4 Jun 2020 • Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf
By using an attention model and a biasing model to leverage the contextual metadata that accompanies a video, we observe a relative improvement of about 16% in Word Error Rate on Named Entities (WER-NE) for videos with related metadata.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
no code implementations • LREC 2020 • David R. Mortensen, Xinjian Li, Patrick Littell, Alexis Michaud, Shruti Rijhwani, Antonios Anastasopoulos, Alan W. black, Florian Metze, Graham Neubig
While phonemic representations are language specific, phonetic representations (stated in terms of (allo)phones) are much closer to a universal (language-independent) transcription.
no code implementations • 13 Mar 2020 • Anirudh Mani, Shruti Palaskar, Nimshi Venkat Meripo, Sandeep Konam, Florian Metze
We propose a simple technique to perform domain adaptation for ASR error correction via machine translation.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+6
1 code implementation • 26 Feb 2020 • Xinjian Li, Siddharth Dalmia, Juncheng Li, Matthew Lee, Patrick Littell, Jiali Yao, Antonios Anastasopoulos, David R. Mortensen, Graham Neubig, Alan W. black, Florian Metze
Multilingual models can improve language processing, particularly for low resource situations, by sharing parameters across languages.
no code implementations • 26 Feb 2020 • Xinjian Li, Siddharth Dalmia, David R. Mortensen, Juncheng Li, Alan W. black, Florian Metze
The difficulty of this task is that phoneme inventories often differ between the training languages and the target language, making it infeasible to recognize unseen phonemes.
no code implementations • 13 Feb 2020 • Tejas Srinivasan, Ramon Sanabria, Florian Metze
Speech is understood better by using visual context; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+1
no code implementations • 29 Jan 2020 • Zhong Zhou, Isak Czeresnia Etinger, Florian Metze, Alexander Hauptmann, Alexander Waibel
We have interesting results both in bounding the shooter as well as detecting the gun smoke.
no code implementations • 9 Nov 2019 • Siddharth Dalmia, Abdel-rahman Mohamed, Mike Lewis, Florian Metze, Luke Zettlemoyer
Inspired by modular software design principles of independence, interchangeability, and clarity of interface, we introduce a method for enforcing encoder-decoder modularity in seq2seq models without sacrificing the overall model quality or its full differentiability.
no code implementations • 4 Nov 2019 • Vikas Raunak, Vaibhav Kumar, Florian Metze
We investigate two specific manifestations of compositionality in Neural Machine Translation (NMT) : (1) Productivity - the ability of the model to extend its predictions beyond the observed length in training data and (2) Systematicity - the ability of the model to systematically recombine known parts and rules.
no code implementations • NeurIPS 2019 • Juncheng B. Li, Shuhui Qu, Xinjian Li, Joseph Szurley, J. Zico Kolter, Florian Metze
In this work, we target our attack on the wake-word detection system, jamming the model with some inconspicuous background music to deactivate the VAs while our audio adversary is present.
no code implementations • EMNLP (IWSLT) 2019 • Tejas Srinivasan, Ramon Sanabria, Florian Metze
In Neural Machine Translation (NMT) the usage of subwords and characters as source and target units offers a simple and flexible solution for translation of rare and unseen words.
no code implementations • WS 2019 • Vikas Raunak, Sang Keun Choe, Quanyang Lu, Yi Xu, Florian Metze
Leveraging the visual modality effectively for Neural Machine Translation (NMT) remains an open problem in computational linguistics.
2 code implementations • WS 2020 • Vikas Raunak, Vaibhav Kumar, Vivek Gupta, Florian Metze
Word embeddings have become a staple of several natural language processing tasks, yet much remains to be understood about their properties.
no code implementations • 25 Sep 2019 • Ze Cheng, Juncheng B Li, Chenxu Wang, Jixuan Gu, Hao Xu, Xinjian Li, Florian Metze
In the problem of unsupervised learning of disentangled representations, one of the promising methods is to penalize the total correlation of sampled latent vari-ables.
no code implementations • 2 Aug 2019 • Xinjian Li, Zhong Zhou, Siddharth Dalmia, Alan W. black, Florian Metze
In this work, we present SANTLR: Speech Annotation Toolkit for Low Resource Languages.
no code implementations • 2 Aug 2019 • Xinjian Li, Siddharth Dalmia, Alan W. black, Florian Metze
For example, the target corpus might benefit more from a corpus in the same domain or a corpus from a close language.
1 code implementation • WS 2019 • Vikas Raunak, Vivek Gupta, Florian Metze
Pre-trained word embeddings are used in several downstream applications as well as for constructing representations for sentences, paragraphs and documents.
no code implementations • 24 Jul 2019 • Suyoun Kim, Siddharth Dalmia, Florian Metze
We present an end-to-end speech recognition model that learns interaction between two speakers based on the turn-changing information.
no code implementations • 30 Jun 2019 • Tejas Srinivasan, Ramon Sanabria, Florian Metze
Multimodal learning allows us to leverage information from multiple sources (visual, acoustic and text), similar to our experience of the real world.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+1
no code implementations • ACL 2019 • Suyoun Kim, Siddharth Dalmia, Florian Metze
We present a novel conversational-context aware end-to-end speech recognizer based on a gated neural network that incorporates conversational-context/word/speech embeddings.
no code implementations • ACL 2019 • Shruti Palaskar, Jindrich Libovický, Spandana Gella, Florian Metze
In this paper, we study abstractive summarization for open-domain videos.
Ranked #1 on
Text Summarization
on How2
no code implementations • NAACL 2019 • Suyoun Kim, Florian Metze
Conversational context information, higher-level knowledge that spans across sentences, can help to recognize a long conversation.
no code implementations • 24 Feb 2019 • Aditi Chaudhary, Siddharth Dalmia, Junjie Hu, Xinjian Li, Austin Matthews, Aldrian Obaja Muis, Naoki Otani, Shruti Rijhwani, Zaid Sheikh, Nidhi Vyas, Xinyi Wang, Jiateng Xie, Ruochen Xu, Chunting Zhou, Peter J. Jansen, Yiming Yang, Lori Levin, Florian Metze, Teruko Mitamura, David R. Mortensen, Graham Neubig, Eduard Hovy, Alan W. black, Jaime Carbonell, Graham V. Horwood, Shabnam Tafreshi, Mona Diab, Efsun S. Kayi, Noura Farra, Kathleen McKeown
This paper describes the ARIEL-CMU submissions to the Low Resource Human Language Technologies (LoReHLT) 2018 evaluations for the tasks Machine Translation (MT), Entity Discovery and Linking (EDL), and detection of Situation Frames in Text and Speech (SF Text and Speech).
no code implementations • 20 Feb 2019 • Siddharth Dalmia, Xinjian Li, Alan W. black, Florian Metze
Building multilingual and crosslingual models help bring different languages together in a language universal space.
no code implementations • 18 Feb 2019 • Shruti Palaskar, Vikas Raunak, Florian Metze
End-to-end acoustic-to-word speech recognition models have recently gained popularity because they are easy to train, scale well to large amounts of training data, and do not require a lexicon.
no code implementations • 21 Nov 2018 • Nils Holzenberger, Shruti Palaskar, Pranava Madhyastha, Florian Metze, Raman Arora
This shows it is possible to learn reliable representations across disparate, unaligned and noisy modalities, and encourages using the proposed approach on larger datasets.
1 code implementation • 9 Nov 2018 • Ozan Caglayan, Ramon Sanabria, Shruti Palaskar, Loïc Barrault, Florian Metze
Specifically, in our previous work, we propose a multistep visual adaptive training approach which improves the accuracy of an audio-based Automatic Speech Recognition (ASR) system.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
2 code implementations • 1 Nov 2018 • Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, Florian Metze
In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+3
2 code implementations • 22 Oct 2018 • Yun Wang, Florian Metze
Research on sound event detection (SED) with weak labeling has mostly focused on presence/absence labeling, which provides no temporal information at all about the event occurrences.
Sound Audio and Speech Processing
3 code implementations • 22 Oct 2018 • Yun Wang, Juncheng Li, Florian Metze
This paper compares five types of pooling functions both theoretically and experimentally, with special focus on their performance of localization.
Sound Audio and Speech Processing
no code implementations • 27 Sep 2018 • Xinjian Li, Siddharth Dalmia, David R. Mortensen, Florian Metze, Alan W Black
Our model is able to recognize unseen phonemes in the target language, if only a small text corpus is available.
no code implementations • 1 Sep 2018 • Ankit Shah, Harini Kesavamoorthy, Poorva Rane, Pramati Kalwad, Alexander Hauptmann, Florian Metze
Moments capture a huge part of our lives.
no code implementations • 7 Aug 2018 • Suyoun Kim, Florian Metze
Existing speech recognition systems are typically built at the sentence level, although it is known that dialog context, e. g. higher-level knowledge that spans across sentences or speakers, can help the processing of long conversations.
no code implementations • 28 Jul 2018 • Siddharth Dalmia, Xinjian Li, Florian Metze, Alan W. black
We demonstrate the effectiveness of using a pre-trained English recognizer, which is robust to such mismatched conditions, as a domain normalizing feature extractor on a low resource language.
no code implementations • 23 Jul 2018 • Shruti Palaskar, Florian Metze
We present effective methods to train Sequence-to-Sequence models for direct word-level recognition (and character-level recognition) and show an absolute improvement of 4. 4-5. 0\% in Word Error Rate on the Switchboard corpus compared to prior work.
no code implementations • 18 Jul 2018 • Ramon Sanabria, Florian Metze
Our model obtains 14. 0% Word Error Rate on the Eval2000 Switchboard subset without any decoder or language model, outperforming the current state-of-the-art on acoustic-to-word models.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+3
1 code implementation • ICMR 2018 • Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, Amit K. Roy-Chowdhury
Constructing a joint representation invariant across different modalities (e. g., video, language) is of significant importance in many multimedia applications.
Ranked #28 on
Video Retrieval
on MSR-VTT
no code implementations • 25 Apr 2018 • Shruti Palaskar, Ramon Sanabria, Florian Metze
Transcription or sub-titling of open-domain videos is still a challenging domain for Automatic Speech Recognition (ASR) due to the data's challenging acoustics, variable signal processing and the essentially unrestricted domain of the data.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
no code implementations • 21 Feb 2018 • Siddharth Dalmia, Ramon Sanabria, Florian Metze, Alan W. black
Techniques for multi-lingual and cross-lingual speech recognition can help in low resource scenarios, to bootstrap systems and enable analysis of new languages and domains.
no code implementations • 14 Feb 2018 • Odette Scharenborg, Laurent Besacier, Alan Black, Mark Hasegawa-Johnson, Florian Metze, Graham Neubig, Sebastian Stueker, Pierre Godard, Markus Mueller, Lucas Ondel, Shruti Palaskar, Philip Arthur, Francesco Ciannella, Mingxing Du, Elin Larsen, Danny Merkx, Rachid Riad, Liming Wang, Emmanuel Dupoux
We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography.
no code implementations • 19 Dec 2017 • Thomas Zenkel, Ramon Sanabria, Florian Metze, Alex Waibel
This paper proposes a novel approach to create an unit set for CTC based speech recognition systems.
no code implementations • 1 Dec 2017 • Abhinav Gupta, Yajie Miao, Leonardo Neves, Florian Metze
We are working on a corpus of "how-to" videos from the web, and the idea is that an object that can be seen ("car"), or a scene that is being detected ("kitchen") can be used to condition both models on the "context" of the recording, thereby reducing perplexity and improving transcription.
no code implementations • LREC 2018 • Boyang Li, Beth Cardier, Tong Wang, Florian Metze
Stories are a vital form of communication in human culture; they are employed daily to persuade, to elicit sympathy, or to convey a message.
Cultural Vocal Bursts Intensity Prediction
Vocal Bursts Intensity Prediction
no code implementations • 15 Aug 2017 • Thomas Zenkel, Ramon Sanabria, Florian Metze, Jan Niehues, Matthias Sperber, Sebastian Stüker, Alex Waibel
The CTC loss function maps an input sequence of observable feature vectors to an output sequence of symbols.
1 code implementation • 20 Mar 2017 • Juncheng Li, Wei Dai, Florian Metze, Shuhui Qu, Samarjit Das
On these features, we apply five models: Gaussian Mixture Model (GMM), Deep Neural Network (DNN), Recurrent Neural Network (RNN), Convolutional Deep Neural Net- work (CNN) and i-vector.
no code implementations • 21 Nov 2016 • Ramon Sanabria, Florian Metze, Fernando de la Torre
Speech is one of the most effective ways of communication among humans.
4 code implementations • 29 Jul 2015 • Yajie Miao, Mohammad Gowayyed, Florian Metze
The performance of automatic speech recognition (ASR) has improved tremendously due to the application of deep neural networks (DNNs).
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+1