VoxCeleb2 is a large-scale speaker recognition dataset obtained automatically from open-source media. VoxCeleb2 consists of over a million utterances from over 6k speakers. Since the dataset is collected ‘in the wild’, the speech segments are corrupted with real-world noise including laughter, cross-talk, channel effects, music and other sounds. The dataset is also multilingual, with speech from speakers of 145 different nationalities, covering a wide range of accents, ages, ethnicities and languages. Because the dataset is audio-visual, it is also useful for a number of other applications, for example visual speech synthesis, speech separation, cross-modal transfer from face to voice or vice versa, and training face recognition from video to complement existing face recognition datasets.
351 PAPERS • 5 BENCHMARKS
The Cross-lingual Natural Language Inference (XNLI) corpus is the extension of the Multi-Genre NLI (MultiNLI) corpus to 15 languages. The dataset was created by manually translating the validation and test sets of MultiNLI into each of those 15 languages, while the English training set was machine-translated into each language. The dataset is composed of 122k train, 2490 validation and 5010 test examples.
264 PAPERS • 11 BENCHMARKS
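A minimal sketch of loading XNLI, assuming the Hugging Face `datasets` library and the `xnli` dataset id with per-language configurations (these ids are an assumption, not part of the original description):

```python
from datasets import load_dataset

# Load the French portion of XNLI; the config name is assumed to be the ISO code.
xnli_fr = load_dataset("xnli", "fr")

# Each example pairs a premise with a hypothesis and an entailment label (0/1/2).
example = xnli_fr["test"][0]
print(example["premise"], example["hypothesis"], example["label"])
```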
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently, the dataset is entirely parallel across 11 languages.
123 PAPERS • 2 BENCHMARKS
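A similar hedged sketch for XQuAD, assuming the `xquad` dataset id with configurations of the form `xquad.<lang>` on the Hugging Face Hub:

```python
from datasets import load_dataset

# Spanish portion of XQuAD; only a validation split is expected,
# since the data comes from the SQuAD v1.1 development set.
xquad_es = load_dataset("xquad", "xquad.es")

sample = xquad_es["validation"][0]
print(sample["question"], "->", sample["answers"]["text"])
```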
The CoNLL-2012 shared task involved predicting coreference in English, Chinese, and Arabic, using the final version, v5.0, of the OntoNotes corpus. It was a follow-on to the English-only task organized in 2011.
83 PAPERS • 4 BENCHMARKS
mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.
60 PAPERS • NO BENCHMARKS YET
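Given its size, mC4 is usually consumed in streaming mode; a hedged sketch assuming the `mc4` dataset id with per-language configurations (the corpus is also mirrored under other ids such as `allenai/c4`):

```python
from datasets import load_dataset

# Stream the German portion instead of downloading the full corpus.
mc4_de = load_dataset("mc4", "de", split="train", streaming=True)

for doc in mc4_de:
    print(doc["text"][:100])  # each record carries raw web text plus URL/timestamp metadata
    break
```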
The FLORES evaluation benchmark consists of 3001 sentences extracted from English Wikipedia and covering a variety of different topics and domains. These sentences have been translated into 101 languages by professional translators through a carefully controlled process. The resulting dataset enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems, as all translations are multilingually aligned. By publicly releasing such a high-quality and high-coverage dataset, we hope to foster progress in the machine translation community and beyond.
46 PAPERS • NO BENCHMARKS YET
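Because all FLORES translations are sentence-aligned, system output can be scored directly against the professional references; a minimal sketch using `sacrebleu`, with hypothetical file names:

```python
import sacrebleu

# Hypothetical, sentence-aligned files: one system hypothesis and one FLORES
# reference per line for the same language.
with open("hypotheses.fr") as f:
    hypotheses = [line.strip() for line in f]
with open("flores_devtest.fr") as f:
    references = [line.strip() for line in f]

# Corpus-level BLEU against a single reference set.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```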
ParaCrawl v7.1 is a parallel dataset with 41 language pairs, primarily aligned with English (39 out of 41), mined using the parallel-data-crawling tool Bitextor, whose pipeline includes downloading documents, preprocessing and normalization, aligning documents and segments, and filtering noisy data via Bicleaner. ParaCrawl focuses on European languages, but also includes 9 lower-resource, non-European language pairs in v7.1.
Multilingual Document Classification Corpus (MLDoc) is a cross-lingual document classification dataset covering English, German, French, Spanish, Italian, Russian, Japanese and Chinese. It is a subset of the Reuters Corpus Volume 2, selected according to a set of specific design choices.
45 PAPERS • 11 BENCHMARKS
VoxPopuli is a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open dataset to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages, totaling 5.1K hours.
40 PAPERS • 4 BENCHMARKS
OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The portion of the dataset used for training multilingual models such as BART comprises 138 GB of text.
39 PAPERS • NO BENCHMARKS YET
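A hedged loading sketch, assuming the `oscar` dataset id and configuration names of the form `unshuffled_deduplicated_<lang>` on the Hugging Face Hub:

```python
from datasets import load_dataset

# Stream the deduplicated Italian portion of OSCAR.
oscar_it = load_dataset("oscar", "unshuffled_deduplicated_it",
                        split="train", streaming=True)

for record in oscar_it:
    print(record["text"][:100])
    break
```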
Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.
34 PAPERS • 1 BENCHMARK
CCAligned consists of parallel or comparable web-document pairs in 137 languages aligned with English. These web-document pairs were constructed by performing language identification on raw web documents and matching documents whose URLs contained corresponding language codes. This pattern-matching approach yielded more than 100 million documents aligned with English. Because each English document is often aligned to multiple documents in different target languages, it is possible to join on the English documents to obtain aligned documents that directly pair two non-English languages (e.g., Arabic-French), as sketched below.
27 PAPERS • NO BENCHMARKS YET
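The join mentioned above can be sketched as follows (hypothetical in-memory tables, not the tooling actually used to build CCAligned):

```python
# Hypothetical alignment tables: English document URL -> aligned target-language document.
english_to_arabic = {"en.example.com/a": "arabic doc A", "en.example.com/b": "arabic doc B"}
english_to_french = {"en.example.com/a": "french doc A", "en.example.com/c": "french doc C"}

# Joining on the shared English document yields direct Arabic-French pairs.
arabic_french_pairs = [
    (english_to_arabic[url], english_to_french[url])
    for url in english_to_arabic.keys() & english_to_french.keys()
]
print(arabic_french_pairs)  # [('arabic doc A', 'french doc A')]
```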
WikiAnn is a dataset for cross-lingual name tagging and linking based on Wikipedia articles in 295 languages.
26 PAPERS • 7 BENCHMARKS
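A hedged loading sketch, assuming the `wikiann` dataset id with per-language configurations on the Hugging Face Hub:

```python
from datasets import load_dataset

# German portion of WikiAnn; each example carries tokens with IOB2 NER tags.
wikiann_de = load_dataset("wikiann", "de")

example = wikiann_de["train"][0]
print(list(zip(example["tokens"], example["ner_tags"])))
```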
AVSpeech is a large-scale audio-visual dataset comprising speech clips with no interfering background signals. The segments are of varying length, between 3 and 10 seconds long, and in each clip the only visible face in the video and the only audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people, languages and face poses.
23 PAPERS • NO BENCHMARKS YET
This is the dataset for the 2020 Duolingo shared task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). Sentence prompts, along with automatic translations and high-coverage sets of translation paraphrases weighted by user response, are provided in 5 language pairs. Starter code for this task can be found at github.com/duolingo/duolingo-sharedtask-2020/. More details on the data set and task are available at sharedtask.duolingo.com.
9 PAPERS • NO BENCHMARKS YET
This repository contains gzipped files with more than 2 million tokens (words) from answers submitted by more than 6,000 students over the course of their first 30 days of using Duolingo, along with baseline starter code written in Python. There are three data sets, corresponding to three different language courses. More details on the data set and task are available at http://sharedtask.duolingo.com. (2018-01-10)
3 PAPERS • NO BENCHMARKS YET
This is a gzipped CSV file containing the 13 million Duolingo student learning traces used in experiments by Settles & Meeder (2016). For more details and replication source code, visit: https://github.com/duolingo/halflife-regression (2016-06-07)
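A minimal sketch of inspecting the learning traces with pandas; the file name below is hypothetical, so check the linked repository for the exact name and schema:

```python
import pandas as pd

# Hypothetical file name for the gzipped CSV of ~13 million learning traces.
traces = pd.read_csv("learning_traces.13m.csv.gz", compression="gzip")

print(traces.shape)              # number of traces and columns
print(traces.columns.tolist())   # inspect the schema before modeling
```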
A multilingual etymological database extracted from Wiktionary (described in "Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB-2.0").
2 PAPERS • NO BENCHMARKS YET
CommonCrawl News is a dataset containing news articles from news sites all over the world. The dataset is available in the form of Web ARChive (WARC) files that are released on a daily basis.
1 PAPER • NO BENCHMARKS YET
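A sketch of iterating over one daily WARC file with the `warcio` library; the file name is hypothetical:

```python
from warcio.archiveiterator import ArchiveIterator

# Print the target URI of every HTTP response record in one (hypothetical) daily dump.
with open("CC-NEWS-20230101000000-00000.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"))
```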
Automatic segmentation, tokenization, and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe (http://ufal.mff.cuni.cz/udpipe), together with 100-dimensional word embeddings computed from lowercased texts by word2vec (https://code.google.com/archive/p/word2vec/).
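A hedged sketch of loading one of the 100-dimensional word2vec embedding files with gensim; the file name, query word, and text-format assumption are all hypothetical:

```python
from gensim.models import KeyedVectors

# Assumes the embeddings are distributed in word2vec text format.
vectors = KeyedVectors.load_word2vec_format("cs.vectors.txt", binary=False)

print(vectors["praha"].shape)                 # (100,)
print(vectors.most_similar("praha", topn=3))  # nearest neighbours in embedding space
```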
We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in 1 of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification showing that the dataset poses a challenging fine-grained classification problem: the best-scoring EmbraceNet model using both visual and textual features achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text.
1 PAPER • 1 BENCHMARK
QALD-9-Plus is a dataset for Knowledge Graph Question Answering (KGQA) based on the well-known QALD-9.
WebLI (Web Language Image) is a web-scale multilingual image-text dataset, designed to support Google’s vision-language research, such as large-scale pre-training for image understanding, image captioning, visual question answering, and object detection.