1 code implementation • LREC 2022 • Daan van Esch, Tamar Lucassen, Sebastian Ruder, Isaac Caswell, Clara Rivera
We describe an open-source dataset providing metadata for about 2,800 language varieties used in the world today.
no code implementations • 19 May 2023 • Jacob Eisenstein, Vinodkumar Prabhakaran, Clara Rivera, Dorottya Demszky, Devyani Sharma
We introduce a new dataset of conversational speech representing English from India, Nigeria, and the United States.
1 code implementation • 31 Oct 2022 • Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan A. Botha, Michael Chavinda, Ankur Parikh, Clara Rivera
To address this lack of data, we create Table-to-Text in African languages (TaTa), the first large multilingual table-to-text dataset with a focus on African languages.
no code implementations • 25 May 2022 • Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, Ankur Bapna
We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark.
Automatic Speech Recognition (ASR) +6
no code implementations • 21 Mar 2022 • Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan H. Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson
Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as well as catalyze research in "universal" speech representation learning.
no code implementations • 22 Mar 2021 • Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages.
1 code implementation • Asian Chapter of the Association for Computational Linguistics 2020 • Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, Radu Soricut
First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations.
Ranked #1 on Dense Video Captioning on YouCook2 (ROUGE-L metric, using extra training data)
1 code implementation • 14 Oct 2020 • Alena Butryna, Shan-Hui Cathy Chu, Isin Demirsahin, Alexander Gutkin, Linne Ha, Fei He, Martin Jansche, Cibu Johny, Anna Katanova, Oddur Kjartansson, Chenfang Li, Tatiana Merkulova, Yin May Oo, Knot Pipatsrisawat, Clara Rivera, Supheakmungkol Sarin, Pasindu De Silva, Keshan Sodimana, Richard Sproat, Theeraphol Wattanavekin, Jaka Aris Eko Wibawa
This paper presents an overview of a program designed to address the growing need for developing freely available speech resources for under-represented languages.
Automatic Speech Recognition (ASR) +1
no code implementations • LREC 2020 • Oddur Kjartansson, Alexander Gutkin, Alena Butryna, Isin Demirsahin, Clara Rivera
This paper introduces new open speech datasets for three of the languages of Spain: Basque, Catalan and Galician.
Automatic Speech Recognition (ASR) +2
no code implementations • LREC 2020 • Isin Demirsahin, Oddur Kjartansson, Alexander Gutkin, Clara Rivera
This paper presents a dataset of transcribed high-quality audio of English sentences recorded by volunteers speaking with different accents of the British Isles.
no code implementations • LREC 2020 • Fei He, Shan-Hui Cathy Chu, Oddur Kjartansson, Clara Rivera, Anna Katanova, Alexander Gutkin, Isin Demirsahin, Cibu Johny, Martin Jansche, Supheakmungkol Sarin, Knot Pipatsrisawat
We present free, high-quality multi-speaker speech corpora for Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu, which are six of the twenty-two official languages of India spoken by 374 million native speakers.