9 dataset results for Machine Translation AND Japanese

ASPEC (Asian Scientific Paper Excerpt Corpus)

ASPEC, Asian Scientific Paper Excerpt Corpus, is constructed by the Japan Science and Technology Agency (JST) in collaboration with the National Institute of Information and Communications Technology (NICT). It consists of a Japanese-English paper abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of 680K parallel sentences (ASPEC-JC). This corpus is one of the achievements of the Japanese-Chinese machine translation project which was run in Japan from 2006 to 2010.

84 PAPERS • NO BENCHMARKS YET

MTNT

The Machine Translation of Noisy Text (MTNT) dataset is a Machine Translation dataset that consists of noisy comments on Reddit and professionally sourced translation. The translation are between French, Japanese and French, with between 7k and 37k sentence per language pair.

51 PAPERS • NO BENCHMARKS YET

MKQA (Multilingual Knowledge Questions and Answers)

Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering.

37 PAPERS • NO BENCHMARKS YET

WMT 2020

WMT 2020 is a collection of datasets used in shared tasks of the Fifth Conference on Machine Translation. The conference builds on a series of annual workshops and conferences on Statistical Machine Translation.

33 PAPERS • 1 BENCHMARK

Duolingo STAPLE Shared Task

This is the dataset for the 2020 Duolingo shared task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). Sentence prompts, along with automatic translations, and high-coverage sets of translation paraphrases weighted by user response are provided in 5 language pairs. Starter code for this task can be found here: github.com/duolingo/duolingo-sharedtask-2020/. More details on the data set and task are available at: sharedtask.duolingo.com

10 PAPERS • NO BENCHMARKS YET

Business Scene Dialogue

The Japanese-English business conversation corpus, namely Business Scene Dialogue corpus, was constructed in 3 steps:

8 PAPERS • 2 BENCHMARKS

Demetr

Demetr is a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories.

6 PAPERS • NO BENCHMARKS YET

JParaCrawl

JParaCrawl is a parallel corpus for English-Japanese, for which the amount of publicly available parallel corpora is still limited. The parallel corpus was constructed by broadly crawling the web and automatically aligning parallel sentences. The corpus amassed over 8.7 million sentence pairs.

5 PAPERS • NO BENCHMARKS YET

PheMT

PheMT is a phenomenon-wise dataset designed for evaluating the robustness of Japanese-English machine translation systems. The dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena common in UGC; Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant

1 PAPER • NO BENCHMARKS YET

Datasets

9 dataset results for Machine Translation AND Japanese