WikiMatrix is a dataset of parallel sentences mined from the textual content of Wikipedia for all possible language pairs, not only pairs that include English. The mined data consists of:

  • 85 different languages, 1620 language pairs
  • 134M parallel sentences, out of which 34M are aligned with English
Source: WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

