The Tatoeba Translation Challenge is a benchmark for machine translation that provides training and test data for thousands of language pairs covering over 500 languages.

The training data is shuffled and taken from OPUS, an open collection of parallel corpora. The test data comes from Tatoeba, a crowd-sourced collection of user-provided translations in a large number of languages.

The current release includes over 500 GB of compressed data for 2,961 language pairs covering 555 languages. Each language pair is released as its own directory with the following layout (using deu-eng as an example):

data/deu-eng/
data/deu-eng/train.src.gz
data/deu-eng/train.trg.gz
data/deu-eng/train.id.gz
data/deu-eng/dev.id
data/deu-eng/dev.src
data/deu-eng/dev.trg
data/deu-eng/test.src
data/deu-eng/test.trg
data/deu-eng/test.id
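
The layout above suggests that each split comes as three parallel files (src, trg, id), with the training files gzip-compressed and the dev and test files stored as plain text. A minimal sketch of reading one split in Python is shown here. It assumes that every file holds one entry per line and that the three files are aligned by line number; the function name read_parallel and its limit parameter are hypothetical and are not part of the release.

```python
import gzip
from itertools import islice
from pathlib import Path


def read_parallel(src_path, trg_path, id_path, limit=None):
    """Yield (id, source, target) triples from three files.

    Assumption: the files are line-aligned, one entry per line.
    """
    def open_text(path):
        path = Path(path)
        # Training files end in .gz; dev and test files are plain text.
        if path.suffix == ".gz":
            return gzip.open(path, "rt", encoding="utf-8")
        return open(path, "rt", encoding="utf-8")

    with open_text(id_path) as ids, \
            open_text(src_path) as src, \
            open_text(trg_path) as trg:
        # zip stops at the shortest file; islice caps the number of rows
        # (limit=None reads everything).
        for line_id, s, t in islice(zip(ids, src, trg), limit):
            yield line_id.rstrip("\n"), s.rstrip("\n"), t.rstrip("\n")
```

For example, list(read_parallel("data/deu-eng/test.src", "data/deu-eng/test.trg", "data/deu-eng/test.id", limit=5)) would return the first five test triples, provided the files exist locally. Streaming line by line avoids loading the training files fully into memory, which matters given the size of the release.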
