WMT 2014 is a collection of datasets used in shared tasks of the Ninth Workshop on Statistical Machine Translation. The workshop featured four tasks:
282 PAPERS • 11 BENCHMARKS
OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.
209 PAPERS • 2 BENCHMARKS
WMT 2016 is a collection of datasets used in shared tasks of the First Conference on Machine Translation. The conference builds on ten previous Workshops on Statistical Machine Translation.
174 PAPERS • 18 BENCHMARKS
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance. MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between 4 different languages on average.
159 PAPERS • 1 BENCHMARK
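As a rough illustration of what "SQuAD format" means for the MLQA entry above, the sketch below builds one instance in the SQuAD v1.1 JSON layout and checks that each answer's character offset points at the answer text. The article, question, and identifier are invented for illustration and are not taken from MLQA; the helper name `iter_answers` is likewise hypothetical.

```python
# Hypothetical context and answer, invented for illustration (not MLQA data).
context = "The Rhine flows through Basel before reaching the North Sea."
answer_text = "Basel"

# One instance in the SQuAD v1.1 layout: data -> paragraphs -> qas -> answers.
example = {
    "data": [{
        "title": "Example article",
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "example-0001",
                "question": "Which city does the Rhine flow through?",
                "answers": [{
                    "text": answer_text,
                    # Character offset of the answer span inside the context.
                    "answer_start": context.index(answer_text),
                }],
            }],
        }],
    }]
}


def iter_answers(dataset):
    """Yield (question id, answer text, span recovered from the offset)."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            ctx = paragraph["context"]
            for qa in paragraph["qas"]:
                for ans in qa["answers"]:
                    start = ans["answer_start"]
                    span = ctx[start:start + len(ans["text"])]
                    yield qa["id"], ans["text"], span


for qa_id, text, span in iter_answers(example):
    # In extractive QA the stored answer must equal the span at its offset.
    print(qa_id, text == span)
```

Because the answers are extractive, the same offset check can serve as a quick sanity test when loading any file that follows this layout.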
A corpus of parallel text in 21 European languages from the proceedings of the European Parliament.
128 PAPERS • NO BENCHMARKS YET
FLoRes-200 doubles the existing language coverage of FLoRes-101. Given the nature of the new languages, which have less standardization and require more specialized professional translations, the verification process became more complex. This required modifications to the translation workflow. FLoRes-200 has several languages which were not translated from English. Specifically, several languages were translated from Spanish, French, Russian, and Modern Standard Arabic.
100 PAPERS • 1 BENCHMARK
ASPEC, Asian Scientific Paper Excerpt Corpus, is constructed by the Japan Science and Technology Agency (JST) in collaboration with the National Institute of Information and Communications Technology (NICT). It consists of a Japanese-English paper abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of 680K parallel sentences (ASPEC-JC). This corpus is one of the achievements of the Japanese-Chinese machine translation project which was run in Japan from 2006 to 2010.
87 PAPERS • NO BENCHMARKS YET
FLoRes-101 is an evaluation benchmark for low-resource and multilingual machine translation. It consists of 3001 sentences extracted from English Wikipedia, covering a variety of different topics and domains. These sentences have been translated into 101 languages by professional translators through a carefully controlled process.
81 PAPERS • 9 BENCHMARKS
OPUS-100 is an English-centric multilingual corpus covering 100 languages. It was randomly sampled from the OPUS collection.
74 PAPERS • NO BENCHMARKS YET
Europarl-ST is a multilingual Spoken Language Translation corpus containing paired audio-text samples for SLT from and into 9 European languages, for a total of 72 different translation directions. This corpus has been compiled using the debates held in the European Parliament in the period between 2008 and 2012.
57 PAPERS • NO BENCHMARKS YET
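The figure of 72 translation directions quoted for Europarl-ST follows from counting ordered (source, target) pairs among 9 languages, that is 9 × 8. A minimal sketch of that count is shown below; the language labels are hypothetical placeholders, since the entry gives only the number of languages.

```python
from itertools import permutations

# Placeholder labels: the entry states only that there are 9 languages,
# so these names are hypothetical stand-ins, not the corpus's language codes.
languages = [f"lang{i}" for i in range(1, 10)]

# A translation direction is an ordered (source, target) pair of distinct languages.
directions = list(permutations(languages, 2))

print(len(directions))
```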
The Shifts Dataset is a dataset for evaluation of uncertainty estimates and robustness to distributional shift. The dataset, which has been collected from industrial sources and services, is composed of three tasks, with each corresponding to a particular data modality: tabular weather prediction, machine translation, and self-driving car (SDC) vehicle motion prediction. All of these data modalities and tasks are affected by real, 'in-the-wild' distributional shifts and pose interesting challenges with respect to uncertainty estimation.
52 PAPERS • 1 BENCHMARK
The Machine Translation of Noisy Text (MTNT) dataset is a Machine Translation dataset that consists of noisy comments on Reddit and professionally sourced translations. The translations are between French, Japanese and English, with between 7k and 37k sentences per language pair.
51 PAPERS • NO BENCHMARKS YET
FLoRes is a benchmark dataset for machine translation between English and four low-resource languages, Nepali, Sinhala, Khmer, and Pashto, based on sentences translated from Wikipedia. The FLoRes project has two versions: FLoRes-101 and FLoRes-200.
49 PAPERS • NO BENCHMARKS YET
Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering.
42 PAPERS • NO BENCHMARKS YET
Samanantar is the largest publicly available collection of parallel corpora for Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu. The corpus has 49.6M sentence pairs between English and Indian languages.
39 PAPERS • NO BENCHMARKS YET
Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. It is available in more than 400 languages. Its name comes from the Japanese phrase “tatoeba” (例えば), meaning “for example”. It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as Tatoebans.
38 PAPERS • 27 BENCHMARKS
WMT 2018 is a collection of datasets used in shared tasks of the Third Conference on Machine Translation. The conference builds on a series of twelve previous annual workshops and conferences on Statistical Machine Translation.
37 PAPERS • 6 BENCHMARKS
WMT 2020 is a collection of datasets used in shared tasks of the Fifth Conference on Machine Translation. The conference builds on a series of annual workshops and conferences on Statistical Machine Translation.
34 PAPERS • 1 BENCHMARK
WMT 2015 is a collection of datasets used in shared tasks of the Tenth Workshop on Statistical Machine Translation. The workshop featured five tasks:
33 PAPERS • 4 BENCHMARKS
Consists of millions of entries in which the MT element of the training triplets has been obtained by translating the source side of publicly-available parallel corpora, and using the target side as an artificial human post-edit. Translations are obtained both with phrase-based and neural models.
28 PAPERS • NO BENCHMARKS YET
News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 6 languages (Czech, German, Finnish, Romanian, Russian, Turkish) and an additional 1500 sentences from each of the 6 languages translated into English. For Romanian, a third of the test set was released as a development set instead. For Turkish, an additional 500-sentence development set was released. The sentences were selected from dozens of news websites and translated by professional translators. The training data consists of parallel corpora to train translation models, monolingual corpora to train language models, and development sets for tuning. Some training corpora were identical to those of WMT 2015 (Europarl, United Nations, French-English 10⁹ corpus, Common Crawl, Russian-English parallel data provided by Yandex, Wikipedia Headlines provided by CMU) and some were updated (CzEng v1.6pre, News Commentary v11, monolingual news data). Additionally,
24 PAPERS • 8 BENCHMARKS
COCO-CN is a bilingual image description dataset enriching MS-COCO with manually written Chinese sentences and tags. The new dataset can be used for multiple tasks including image tagging, captioning and retrieval, all in a cross-lingual setting.
21 PAPERS • 3 BENCHMARKS
The Multilingual Quality Estimation and Automatic Post-editing (MLQE-PE) Dataset is a dataset for Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE). The dataset contains seven language pairs, with human labels for 9,000 translations per language pair in the following formats: sentence-level direct assessments and post-editing effort, and word-level good/bad labels. It also contains the post-edited sentences, the titles of the articles from which the sentences were extracted, and the neural MT models used to translate the text.
20 PAPERS • NO BENCHMARKS YET
PARANMT-50M is a dataset for training paraphrastic sentence embeddings. It consists of more than 50 million English-English sentential paraphrase pairs.
12 PAPERS • NO BENCHMARKS YET
This is the dataset for the 2020 Duolingo shared task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). Sentence prompts, along with automatic translations, and high-coverage sets of translation paraphrases weighted by user response are provided in 5 language pairs. Starter code for this task can be found here: github.com/duolingo/duolingo-sharedtask-2020/. More details on the data set and task are available at: sharedtask.duolingo.com
10 PAPERS • NO BENCHMARKS YET
A challenging new benchmark for language-agnostic answer retrieval from a multilingual candidate pool.
The Japanese-English business conversation corpus, namely Business Scene Dialogue corpus, was constructed in 3 steps:
9 PAPERS • 2 BENCHMARKS
News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 7 languages (Chinese, Czech, Estonian, German, Finnish, Russian, Turkish) and an additional 1500 sentences from each of the 7 languages translated into English. The sentences were selected from dozens of news websites and translated by professional translators.
8 PAPERS • NO BENCHMARKS YET
GigaST is a large-scale pseudo speech translation (ST) corpus. The corpus was created by translating the text in GigaSpeech, an English ASR corpus, into German and Chinese. The training set was translated by a strong machine translation system and the test set was translated by humans. ST models trained with the addition of the corpus obtain new state-of-the-art results on the MuST-C English-German benchmark test set.
7 PAPERS • NO BENCHMARKS YET
Hindi Visual Genome is a multimodal dataset consisting of text and images, suitable for the English-Hindi multimodal machine translation task and for multimodal research.
The NLC2CMD Competition hosted at NeurIPS 2020 aimed to bring the power of natural language processing to the command line. Participants were tasked with building models that can transform descriptions of command line tasks in English to their Bash syntax.
7 PAPERS • 1 BENCHMARK
PhoMT is a high-quality and large-scale Vietnamese-English parallel dataset of 3.02M sentence pairs for machine translation.
BMELD is a bilingual (English-Chinese) dialogue corpus for neural chat translation.
6 PAPERS • NO BENCHMARKS YET
Demetr is a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories.
The IWSLT 2015 Evaluation Campaign featured three tracks: automatic speech recognition (ASR), spoken language translation (SLT), and machine translation (MT). For ASR we offered two tasks, on English and German, while for SLT and MT a number of tasks were proposed, involving English, German, French, Chinese, Czech, Thai, and Vietnamese. All tracks involved the transcription or translation of TED talks, either made available by the official TED website or by other TEDx events. A notable change with respect to previous evaluations was the use of unsegmented speech in the SLT track in order to better fit a real application scenario.
6 PAPERS • 1 BENCHMARK
JParaCrawl is a parallel corpus for English-Japanese, for which the amount of publicly available parallel corpora is still limited. The parallel corpus was constructed by broadly crawling the web and automatically aligning parallel sentences. The corpus amassed over 8.7 million sentence pairs.
Shellcode_IA32 is a dataset containing 20 years of shellcodes from a variety of sources. It is the largest collection of shellcodes in assembly available to date.
ACES is a dataset consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. It can be used to evaluate a wide range of Machine Translation metrics.
5 PAPERS • 1 BENCHMARK
HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi, in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences.
5 PAPERS • NO BENCHMARKS YET
The MLQE dataset is a dataset for sentence-level Machine Translation Quality Estimation. It consists of 6 language pairs representing NMT training in high, medium, and low-resource scenarios. The corpus is extracted from Wikipedia, and 10K segments per language pair are annotated.
A new English-French test set for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue. The test set contains 144 spontaneous dialogues (5,700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality, produced by the dialogue participants themselves, as well as by manually normalised versions and reference translations produced a posteriori.
4 PAPERS • NO BENCHMARKS YET
FRMT is a dataset and evaluation benchmark for Few-shot Region-aware Machine Translation, a type of style-targeted translation. The dataset consists of human translations of a few thousand English Wikipedia sentences into regional variants of Portuguese and Mandarin. Source documents are selected to enable detailed analysis of phenomena of interest, including lexically distinct terms and distractor terms.
4 PAPERS • 4 BENCHMARKS
IndoNLG is a benchmark to measure natural language generation (NLG) progress in three low-resource—yet widely spoken—languages of Indonesia: Indonesian, Javanese, and Sundanese. Altogether, these languages are spoken by more than 100 million native speakers, and hence constitute an important use case of NLG systems today. Concretely, IndoNLG covers six tasks: summarization, question answering, chit-chat, and three different pairs of machine translation (MT) tasks.
SRL is the task of extracting semantic predicate-argument structures from sentences. X-SRL is a multilingual parallel Semantic Role Labelling (SRL) corpus for English (EN), German (DE), French (FR) and Spanish (ES) that is based on English gold annotations and shares the same labelling scheme across languages.
Itihasa is a large-scale corpus for Sanskrit to English translation containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata.
3 PAPERS • 1 BENCHMARK
A large-scale multilingual corpus of images, each labeled with the word it represents. The dataset includes approximately 10,000 words in each of 100 languages.
3 PAPERS • NO BENCHMARKS YET
TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.
APE is a dataset for evaluating automatic post-editing (APE) of machine translation, which is the task of improving the output of a black-box MT system by automatically fixing its mistakes. The act of post-editing text can be fully specified as a sequence of delete and insert actions at given positions.
2 PAPERS • NO BENCHMARKS YET
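The APE entry above describes post-editing as a sequence of delete and insert actions at given positions. The sketch below illustrates that idea on token lists, using Python's standard `difflib.SequenceMatcher` to derive the actions. This is an assumption chosen for illustration, not the dataset's own procedure; the function names and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def edit_actions(mt_tokens, pe_tokens):
    """Return (op, position, token) actions that turn mt_tokens into pe_tokens."""
    actions = []
    matcher = SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Everything left of this span already equals pe_tokens[:j1], so all
        # edits for the span happen at position j1 of the partially edited text.
        for _ in range(i2 - i1):
            actions.append(("delete", j1, None))
        for k in range(j1, j2):
            actions.append(("insert", k, pe_tokens[k]))
    return actions


def apply_actions(tokens, actions):
    """Replay delete/insert actions, in order, on a copy of tokens."""
    out = list(tokens)
    for op, pos, token in actions:
        if op == "delete":
            del out[pos]
        else:
            out.insert(pos, token)
    return out


# Invented example: a machine translation output and its post-edit.
mt = "the cat sit on mat".split()
pe = "the cat sits on the mat".split()

actions = edit_actions(mt, pe)
print(actions)
print(apply_actions(mt, actions) == pe)
```

Replaying the actions on the MT output reproduces the post-edit, which is the sense in which the action sequence fully specifies the edit. A substitution is expressed here as a delete followed by an insert at the same position.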