4 dataset results for segmentation AND Machine Translation

scb-mt-en-th-2020

scb-mt-en-th-2020 is an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled…

1 PAPER • NO BENCHMARKS YET

MLQE

MLQE (MultiLingual Quality Estimation)

…The corpus is extracted from Wikipedia, and 10K segments per language pair are annotated.

5 PAPERS • NO BENCHMARKS YET

Tilde MODEL Corpus

Tilde MODEL Corpus (Tilde Multilingual Open Data for European Languages)

…It contains over 10M segments of multilingual open data. The data has been collected from sites allowing free use and reuse of their content, as well as from public sector websites.

2 PAPERS • NO BENCHMARKS YET

GATITOS

GATITOS (Google's Additional Translations Into Tail-languages: Often Short)

…This dataset consists of 4,000 English segments (4,500 tokens) that have been translated into each of 26 low-resource languages, as well as three higher-resource pivot languages (es, fr, hi).

1 PAPER • NO BENCHMARKS YET
