Search Results for author: Daan van Esch

Almost none of the 2,000+ languages spoken in Africa have widely available automatic speech recognition systems, and the required data is also only available for a few languages.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +2

Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data

no code implementations • 16 May 2022 • Alëna Aksënova, Zhehuai Chen, Chung-Cheng Chiu, Daan van Esch, Pavel Golik, Wei Han, Levi King, Bhuvana Ramabhadran, Andrew Rosenberg, Suzan Schwartz, Gary Wang

However, there are not enough data sets for accented speech, and for the ones that are already available, more training approaches need to be explored to improve the quality of accented speech recognition.

Accented Speech Recognition • Benchmarking +1

Building Machine Translation Systems for the Next Thousand Languages

no code implementations • 9 May 2022 • Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod, Jason Riesa, Yuan Cao, Mia Xu Chen, Klaus Macherey, Maxim Krikun, Pidong Wang, Alexander Gutkin, Apurva Shah, Yanping Huang, Zhifeng Chen, Yonghui Wu, Macduff Hughes

In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages.

Language Identification • Machine Translation +1

XTREME-S: Evaluating Cross-lingual Speech Representations

no code implementations • 21 Mar 2022 • Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan H. Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson

Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as well as catalyze research in "universal" speech representation learning.

Representation Learning • Retrieval +4

Handling Compounding in Mobile Keyboard Input

no code implementations • 17 Jan 2022 • Andreas Kabel, Keith Hall, Tom Ouyang, David Rybach, Daan van Esch, Françoise Beaufays

This paper proposes a framework to improve the typing experience of mobile users in morphologically rich languages.

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

no code implementations • 22 Mar 2021 • Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages.

Mining Large-Scale Low-Resource Pronunciation Data From Wikipedia

no code implementations • 27 Jan 2021 • Tania Chakraborty, Manasa Prasad, Theresa Breiner, Sandy Ritchie, Daan van Esch

Pronunciation modeling is a key task for building speech technology in new languages, and while solid grapheme-to-phoneme (G2P) mapping systems exist, language coverage can stand to be improved.

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

1 code implementation • COLING 2020 • Isaac Caswell, Theresa Breiner, Daan van Esch, Ankur Bapna

Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context.

Language Identification

Data-Driven Parametric Text Normalization: Rapidly Scaling Finite-State Transduction Verbalizers to New Languages

no code implementations • LREC 2020 • Sandy Ritchie, Eoin Mahon, Kim Heiligenstein, Nikos Bampounis, Daan van Esch, Christian Schallhart, Jonas Mortensen, Benoit Brard

This paper presents a methodology for rapidly generating FST-based verbalizers for ASR and TTS systems by efficiently sourcing language-specific data.

Writing Across the World's Languages: Deep Internationalization for Gboard, the Google Keyboard

no code implementations • 3 Dec 2019 • Daan van Esch, Elnaz Sarbar, Tamar Lucassen, Jeremy O'Brien, Theresa Breiner, Manasa Prasad, Evan Crew, Chieu Nguyen, Françoise Beaufays

Today, Gboard supports 900+ language varieties across 70+ writing systems, and this report describes how and why we have been adding support for hundreds of language varieties from around the globe.

Future Directions in Technological Support for Language Documentation

no code implementations • WS 2019 • Daan van Esch, Ben Foley, Nay San

Automatic Keyboard Layout Design for Low-Resource Latin-Script Languages

no code implementations • 18 Jan 2019 • Theresa Breiner, Chieu Nguyen, Daan van Esch, Jeremy O'Brien

For many speakers, one of the barriers in accessing and creating text content on the web is the absence of input tools for their language.

Layout Design

Text Normalization Infrastructure that Scales to Hundreds of Language Varieties

no code implementations • LREC 2018 • Mason Chua, Daan van Esch, Noah Coccaro, Eunjoon Cho, Sujeet Bhandari, Libin Jia

Language Identification • Language Modelling +1
