Common Voice is an audio dataset in which each entry consists of a unique MP3 file and a corresponding text file. The dataset contains 9,283 recorded hours, including 7,335 validated hours in 60 languages. It also includes demographic metadata such as age, sex, and accent.
362 PAPERS • 100 BENCHMARKS
It is an extensive, high-quality cross-lingual fact-to-text dataset in 11 languages: Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Oriya (or), Punjabi (pa), Tamil (ta), and Telugu (te), together with a monolingual dataset in English (en). This is the Wikipedia text <--> Wikidata KG aligned corpus used to train the data-to-text generation model. The train and validation splits are created using distant supervision methods, and the test data is generated through human annotation.
2 PAPERS • 1 BENCHMARK
OpenSpeaks Voice: Odia is a large speech dataset in the Odia language of India that is stewarded by Subhashish Panigrahi and hosted at the O Foundation. It currently hosts over 70,000 audio files under a Universal Public Domain (CC0 1.0) release. Of these, 66,000, hosted on Wikimedia Commons, contain pronunciations of words and phrases, and the remaining 4,400 contain pronunciations of sentences and are hosted on Mozilla Common Voice. The files on Wikimedia Commons were also released in 2023 as four physical media in the form of DVD-ROMs titled OpenSpeaks Voice: Odia Volume I, OpenSpeaks Voice: Odia Volume II, OpenSpeaks Voice: Balesoria-Odia Volume I, and OpenSpeaks Voice: Balesoria-Odia Volume II. The dataset uses Free/Libre and Open Source Software, primarily web-based platforms such as Lingua Libre and Common Voice. Other tools used for this project include Kathabhidhana, developed by Panigrahi by forking the Voice Recorder for Tamil Wiktionary by Shrinivasan T, and Spell4wik
1 PAPER • NO BENCHMARKS YET
PolyNews is a multilingual dataset containing news titles in 77 languages and 19 scripts.
PolyNews is a multilingual parallel dataset containing news titles in 833 language pairs, spanning 64 languages and 17 scripts.
We provide a new dataset, XWikiRef, for the task of cross-lingual multi-document summarization. This task aims at generating Wikipedia-style text in low-resource languages by taking reference text as input. Overall, the dataset covers 8 languages: Bengali (bn), English (en), Hindi (hi), Marathi (mr), Malayalam (ml), Odia (or), Punjabi (pa), and Tamil (ta). It also covers 5 domains: books, films, politicians, sportsmen, and writers.
1 PAPER • 1 BENCHMARK
We present sentence-aligned parallel corpora across 10 Indian languages (Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, and Punjabi) and English, many of which are categorized as low-resource. The corpora are compiled from online sources that share content across languages. The corpora presented significantly extend existing resources, which are either not large enough or restricted to a specific domain (such as health). We also provide a separate test corpus, compiled from an independent online source, that can be used independently for validating performance in the 10 Indian languages. Alongside, we report on the methods of constructing such corpora using tools enabled by recent advances in machine translation and cross-lingual retrieval based on deep neural networks.
0 PAPERS • NO BENCHMARKS YET