CLIRMatrix is a large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval (CLIR). It includes:

  • BI-139, a bilingual dataset of queries in one language matched with relevant documents in another language, covering 139 × 138 = 19,182 language pairs,
  • MULTI-8, a multilingual dataset of queries and documents jointly aligned in 8 different languages.
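
The BI-139 pair count follows from treating each (query language, document language) combination as an ordered pair of two distinct languages, so n languages give n × (n − 1) pairs. A minimal sketch of that arithmetic is below; the function name and the placeholder language codes are illustrative assumptions, not part of the dataset or its tooling.

```python
from itertools import permutations


def count_ordered_pairs(n_languages: int) -> int:
    """Number of ordered (query language, document language) pairs
    when the two languages must differ."""
    return n_languages * (n_languages - 1)


# Placeholder codes (hypothetical); the real dataset uses its own language identifiers.
codes = [f"lang{i}" for i in range(139)]

# Enumerating the pairs explicitly agrees with the closed-form count.
pairs = list(permutations(codes, 2))
assert len(pairs) == count_ordered_pairs(139)

print(count_ordered_pairs(139))  # 19182
```

Pairs are ordered because swapping the query and document languages gives a different retrieval direction, which is why the count is 139 × 138 rather than half of that.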

In total, 49 million unique queries and 34 billion (query, document, label) triplets were mined, making CLIRMatrix the largest and most comprehensive CLIR dataset to date.

Source: CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval

License


  • Unknown
