ASPEC, Asian Scientific Paper Excerpt Corpus, is constructed by the Japan Science and Technology Agency (JST) in collaboration with the National Institute of Information and Communications Technology (NICT). It consists of a Japanese-English paper abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of 680K parallel sentences (ASPEC-JC). This corpus is one of the achievements of the Japanese-Chinese machine translation project which was run in Japan from 2006 to 2010.
84 PAPERS • NO BENCHMARKS YET
The Machine Translation of Noisy Text (MTNT) dataset is a Machine Translation dataset that consists of noisy comments on Reddit and professionally sourced translation. The translation are between French, Japanese and French, with between 7k and 37k sentence per language pair.
51 PAPERS • NO BENCHMARKS YET
Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering.
37 PAPERS • NO BENCHMARKS YET
WMT 2020 is a collection of datasets used in shared tasks of the Fifth Conference on Machine Translation. The conference builds on a series of annual workshops and conferences on Statistical Machine Translation.
33 PAPERS • 1 BENCHMARK
This is the dataset for the 2020 Duolingo shared task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). Sentence prompts, along with automatic translations, and high-coverage sets of translation paraphrases weighted by user response are provided in 5 language pairs. Starter code for this task can be found here: github.com/duolingo/duolingo-sharedtask-2020/. More details on the data set and task are available at: sharedtask.duolingo.com
10 PAPERS • NO BENCHMARKS YET
The Japanese-English business conversation corpus, namely Business Scene Dialogue corpus, was constructed in 3 steps:
8 PAPERS • 2 BENCHMARKS
Demetr is a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories.
6 PAPERS • NO BENCHMARKS YET
JParaCrawl is a parallel corpus for English-Japanese, for which the amount of publicly available parallel corpora is still limited. The parallel corpus was constructed by broadly crawling the web and automatically aligning parallel sentences. The corpus amassed over 8.7 million sentence pairs.
5 PAPERS • NO BENCHMARKS YET
PheMT is a phenomenon-wise dataset designed for evaluating the robustness of Japanese-English machine translation systems. The dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena common in UGC; Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant
1 PAPER • NO BENCHMARKS YET