MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 task types are Bitext Mining, Classification, Clustering, Pair Classification, Reranking, Retrieval, Semantic Textual Similarity and Summarisation. The 56 datasets vary in text length and are grouped into three categories: Sentence to sentence, Paragraph to paragraph, and Sentence to paragraph.
51 PAPERS • 8 BENCHMARKS
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
26 PAPERS • 6 BENCHMARKS
The dataset contains the main components of the news articles published online by the newspaper Gazzetta di Modena (https://gazzettadimodena.gelocal.it/modena): URL of the web page, title, subtitle, text, date of publication, and the crime category assigned to each news article by the author.
3 PAPERS • NO BENCHMARKS YET
Urdu News Headlines Dataset with VOA and BBC. An Urdu news headlines dataset is a collection of news headlines in the Urdu language, typically scraped from news websites and social media platforms. Such datasets can be valuable for researchers and developers working on a variety of tasks.
1 PAPER • 1 BENCHMARK