WikiMatrix

Introduced by Schwenk et al. in WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

WikiMatrix is a dataset of parallel sentences mined from the textual content of Wikipedia for all possible language pairs. The mined data consists of:

85 different languages, 1620 language pairs
134M parallel sentences, out of which 34M are aligned with English

Source: WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

Dataset Loaders

facebookresearch/LASER (3,520 stars)

Similar Datasets

CCMatrix

CCAligned

GeBioCorpus

CzEng 2.0 Parallel Corpus

License

CC BY-SA 4.0

Modalities

Texts
