4 dataset results for Text Clustering AND Texts

MTEB (Massive Text Embedding Benchmark)

MTEB is a benchmark that spans 8 embedding tasks, covering a total of 56 datasets and 112 languages. The 8 task types are Bitext Mining, Classification, Clustering, Pair Classification, Reranking, Retrieval, Semantic Textual Similarity, and Summarisation. The 56 datasets vary in text length and are grouped into three categories: Sentence to sentence, Paragraph to paragraph, and Sentence to paragraph.

51 PAPERS • 8 BENCHMARKS

20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

26 PAPERS • 6 BENCHMARKS

DICE: a Dataset of Italian Crime Event news

DICE: a Dataset of Italian Crime Event news (from Gazzetta di Modena [2011-2021])

The dataset contains the main components of news articles published online by the newspaper Gazzetta di Modena (https://gazzettadimodena.gelocal.it/modena): the URL of the web page, title, subtitle, text, date of publication, and the crime category assigned to each article by its author.

3 PAPERS • NO BENCHMARKS YET

Urdu News Headlines Dataset

Urdu News Headlines Dataset with VOA and BBC. An Urdu news headlines dataset is a collection of news headlines in the Urdu language, typically scraped from news websites and social media platforms. These datasets can be valuable for researchers and developers working on a variety of tasks, such as:

1 PAPER • 1 BENCHMARK
