The Tatoeba Translation Challenge is a benchmark for machine translation that provides training and test data for thousands of language pairs covering over 500 languages.
The challenge includes shuffled training data taken from OPUS, an open collection of parallel corpora, and test data from Tatoeba, a crowd-sourced collection of user-provided translations in a large number of languages.
The current release includes over 500 GB of compressed data for 2,961 language pairs covering 555 languages. The data sets are released per language pair with the following structure (using deu-eng as an example):
data/deu-eng/
data/deu-eng/train.src.gz
data/deu-eng/train.trg.gz
data/deu-eng/train.id.gz
data/deu-eng/dev.id
data/deu-eng/dev.src
data/deu-eng/dev.trg
data/deu-eng/test.src
data/deu-eng/test.trg
data/deu-eng/test.id
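As a rough illustration of how this layout might be consumed, the Python sketch below reads one split of a language pair. It assumes, without confirmation from the text above, that the .id, .src and .trg files of a split are line-aligned UTF-8 text with one entry per line, and that only the training files are gzip-compressed, as the listing suggests. The names open_text and read_split are hypothetical helpers, not part of any released tooling.

```python
import gzip
from pathlib import Path
from typing import Iterator, Tuple


def open_text(path: Path):
    """Open a plain or gzip-compressed text file for reading as UTF-8."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def read_split(pair_dir: Path, split: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (id_line, source, target) triples for one split.

    Assumption: line N of the .id, .src and .trg files belongs together.
    """
    # Per the listing, only the train files carry a .gz suffix.
    suffix = ".gz" if split == "train" else ""
    paths = [pair_dir / f"{split}.{part}{suffix}" for part in ("id", "src", "trg")]
    files = [open_text(p) for p in paths]
    try:
        # zip stops at the shortest file; the three files are assumed
        # to have the same number of lines.
        for id_line, src, trg in zip(*files):
            yield id_line.rstrip("\n"), src.rstrip("\n"), trg.rstrip("\n")
    finally:
        for f in files:
            f.close()
```

With the data unpacked under data/, iterating over read_split(Path("data/deu-eng"), "dev") would yield one triple per line of the development set. The training split is streamed the same way, so the compressed files never need to be fully loaded into memory.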