WMT 2016 is a collection of datasets used in shared tasks of the First Conference on Machine Translation. The conference builds on ten previous Workshops on statistical Machine Translation.
168 PAPERS • 18 BENCHMARKS
A corpus of parallel text in 21 European languages from the proceedings of the European Parliament.
125 PAPERS • NO BENCHMARKS YET
Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering.
37 PAPERS • NO BENCHMARKS YET
WMT 2018 is a collection of datasets used in shared tasks of the Third Conference on Machine Translation. The conference builds on a series of twelve previous annual workshops and conferences on Statistical Machine Translation.
36 PAPERS • 6 BENCHMARKS
News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 5 languages (Czech, German, Finnish, Romanian, Russian, Turkish) and additional 1500 sentences from each of the 5 languages translated to English. For Romanian a third of the test set were released as a development set instead. For Turkish additional 500 sentence development set was released. The sentences were selected from dozens of news websites and translated by professional translators. The training data consists of parallel corpora to train translation models, monolingual corpora to train language models and development sets for tuning. Some training corpora were identical from WMT 2015 (Europarl, United Nations, French-English 10⁹ corpus, Common Crawl, Russian-English parallel data provided by Yandex, Wikipedia Headlines provided by CMU) and some were update (CzEng v1.6pre, News Commentary v11, monolingual news data). Additionally,
23 PAPERS • 8 BENCHMARKS
News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 5 languages (Chinese, Czech, Estonian, German, Finnish, Russian, Turkish) and additional 1500 sentences from each of the 7 languages translated to English. The sentences were selected from dozens of news websites and translated by professional translators.
8 PAPERS • NO BENCHMARKS YET
News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 5 languages (Czech, German, Finnish, French, Russian) and additional 1500 sentences from each of the 5 languages translated to English. The sentences are taken from newspaper articles for each language pair, except for French, where the test set was drawn from user-generated comments on the news articles (from Guardian and Le Monde). The translation was done by professional translators.
1 PAPER • NO BENCHMARKS YET