The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of the larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students), which contain binary images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black-and-white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio; the resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were then centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.
6,996 PAPERS • 52 BENCHMARKS
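The size-normalization and centering procedure described above can be sketched as follows. This is a minimal reimplementation, not the original NIST code; it assumes the input digit is a binary NumPy array, and the use of PIL with Lanczos resampling is illustrative:

```python
import numpy as np
from PIL import Image

def normalize_digit(bilevel: np.ndarray) -> np.ndarray:
    """Fit a bilevel digit into a 20x20 box (preserving aspect ratio),
    then center it by center of mass inside a 28x28 field."""
    img = Image.fromarray((bilevel * 255).astype(np.uint8))
    # Size-normalize: scale so the longer side becomes 20 pixels.
    scale = 20.0 / max(img.size)
    new_size = (max(1, round(img.size[0] * scale)),
                max(1, round(img.size[1] * scale)))
    # Anti-aliased resampling is what introduces the grey levels.
    small = np.asarray(img.resize(new_size, Image.LANCZOS),
                       dtype=np.float32) / 255.0

    # Center of mass of the pixel intensities.
    ys, xs = np.mgrid[0:small.shape[0], 0:small.shape[1]]
    total = small.sum()
    cy = (ys * small).sum() / total
    cx = (xs * small).sum() / total

    # Translate so the center of mass sits at the center of the 28x28 field.
    field = np.zeros((28, 28), dtype=np.float32)
    top = min(max(int(round(14 - cy)), 0), 28 - small.shape[0])
    left = min(max(int(round(14 - cx)), 0), 28 - small.shape[1])
    field[top:top + small.shape[0], left:left + small.shape[1]] = small
    return field
```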
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
26 PAPERS • 6 BENCHMARKS
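For reference, one common way to access the corpus is scikit-learn's built-in fetcher; this is just one access path, not part of the dataset itself (the loader downloads the data on first use):

```python
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
print(len(train.data), len(test.data))  # number of documents per split
print(train.target_names[:5])           # first few of the 20 newsgroup labels
```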
Antonio Gulli’s corpus of news articles is a collection of more than 1 million news articles. The articles were gathered from more than 2,000 news sources by ComeToMyHead, an academic news search engine that has been running since July 2004, over more than one year of activity. The dataset is provided to the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity.
2 PAPERS • NO BENCHMARKS YET
The arXiv HEP-TH (high-energy physics theory) citation graph comes from the e-print arXiv and covers all citations within a dataset of 27,770 papers with 352,807 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph contains no information about this. The data covers papers in the period from January 1993 to April 2003 (124 months).
34 PAPERS • 9 BENCHMARKS
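Given the edge convention above (a directed edge from i to j when paper i cites paper j), the graph can be loaded with networkx as in this sketch; the edge-list file name and format (one "i j" pair per line, "#" comment lines) are assumptions about the distributed file:

```python
import networkx as nx

G = nx.DiGraph()
with open("cit-HepTh.txt") as f:  # hypothetical edge-list file name
    for line in f:
        if line.startswith("#"):  # skip comment/header lines
            continue
        src, dst = line.split()
        G.add_edge(src, dst)      # directed edge: src cites dst

print(G.number_of_nodes(), G.number_of_edges())  # expect ~27,770 and ~352,807
```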
A syllogism is a common form of deductive reasoning that requires precisely two premises and one conclusion. The Avicenna corpus is a benchmark for syllogistic NLI and syllogistic NLG.
1 PAPER • NO BENCHMARKS YET
A dataset of 300 news articles annotated with 1,727 bias spans; the annotations provide evidence that informational bias appears in news articles more frequently than lexical bias.
23 PAPERS • NO BENCHMARKS YET
A dataset of 5,591 labeled issue tickets, originally created by Herzig et al. in "It’s Not a Bug, It’s a Feature: How Misclassification Impacts Bug Prediction".
An expert-annotated word similarity dataset which provides a highly reliable, yet challenging, benchmark for rare word representation techniques.
10 PAPERS • NO BENCHMARKS YET
The dataset is annotated with stance towards one topic, namely, the independence of Catalonia.
2 PAPERS • 3 BENCHMARKS
CLUE is a Chinese Language Understanding Evaluation benchmark. It consists of different NLU datasets. It is a community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text.
95 PAPERS • 8 BENCHMARKS
A large-scale curated dataset of over 152 million tweets related to COVID-19 chatter, growing daily, collected from January 1st to April 4th (at the time of writing).
10 PAPERS • 6 BENCHMARKS
A large-scale Chinese legal dataset for judgment prediction. The dataset contains more than 2.6 million criminal cases published by the Supreme People's Court of China, several times more than the datasets used in existing work on judgment prediction.
Evidence Inference is a corpus for inferring the reported effect of a clinical intervention on an outcome of interest, comprising 10,000+ prompts coupled with full-text articles describing randomized controlled trials (RCTs).
26 PAPERS • NO BENCHMARKS YET
A dataset of hate speech annotated at sentence level on English Internet forum posts. The source forum is Stormfront, a large online community of white nationalists. A total of 10,568 sentences have been extracted from Stormfront and classified as conveying hate speech or not.
162 PAPERS • 1 BENCHMARK
A hate speech dataset covering multiple aspects of the issue. Each post is annotated from three perspectives: the basic, commonly used 3-class classification (hate, offensive, or normal); the target community (the community victimized by the hate or offensive speech in the post); and the rationales, i.e., the portions of the post on which the labelling decision is based.
89 PAPERS • 3 BENCHMARKS
Hyperpartisan News Detection is a dataset created for PAN @ SemEval 2019 Task 4. Given the text of a news article, the task is to decide whether it follows hyperpartisan argumentation, i.e., whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person.
3 PAPERS • 1 BENCHMARK
This dataset is an extremely challenging set of over 20,000 original number-plate images captured and crowdsourced from over 700 urban and rural areas, where each image has been manually reviewed and verified by computer vision professionals at Datacluster Labs.
0 PAPERS • NO BENCHMARKS YET
The IndoNLU benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems for Bahasa Indonesia. It is a joint venture of many Indonesian NLP enthusiasts from different institutions such as Gojek, Institut Teknologi Bandung, HKUST, Universitas Multimedia Nusantara, Prosa.ai, and Universitas Indonesia.
14 PAPERS • 1 BENCHMARK
LSHTC is a dataset for large-scale text classification. The data used in the LSHTC challenges originates from two popular sources: DBpedia and the ODP (Open Directory Project) directory, also known as DMOZ. DBpedia instances were selected from the English, non-regional Extended Abstracts provided by the DBpedia site. The DMOZ instances consist of Content vectors, Description vectors, or both. A Content vector is obtained by directly indexing the web page using a standard indexing chain (preprocessing, stemming/lemmatization, stop-word removal).
18 PAPERS • NO BENCHMARKS YET
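The Content-vector indexing chain described above (preprocessing, stemming/lemmatization, stop-word removal) can be sketched as follows; the library choices (NLTK's Porter stemmer, scikit-learn's CountVectorizer and stop-word list) are illustrative, not the tooling used by the LSHTC organizers:

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def index_tokens(text):
    # Preprocess (lowercase), drop stop words and non-alphabetic tokens, stem.
    return [stemmer.stem(t) for t in text.lower().split()
            if t.isalpha() and t not in ENGLISH_STOP_WORDS]

vectorizer = CountVectorizer(tokenizer=index_tokens, token_pattern=None)
X = vectorizer.fit_transform(["The judges were judging the baking contest"])
print(vectorizer.get_feature_names_out())  # stemmed index terms
print(X.toarray())                         # term counts: a Content vector
```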
Modern Hebrew Sentiment Dataset is a sentiment analysis benchmark for Hebrew based on 12K social media comments, provided in two instances: a token-based and a morpheme-based setting.
This is a movie review dataset in the Korean language. Reviews were scraped from Naver Movies.
0 PAPERS • 1 BENCHMARK
A general-purpose text categorization dataset (NatCat) built from three online resources: Wikipedia, Reddit, and Stack Exchange. It consists of document-category pairs derived from the manual curation that occurs naturally within these communities.
Ohsumed includes medical abstracts from the MeSH categories of the year 1991. Joachims (1997) used the first 20,000 documents, divided into 10,000 for training and 10,000 for testing; the specific task was to categorize documents into the 23 cardiovascular disease categories. After restricting to this category subset, 13,929 unique abstracts remain (6,286 for training and 7,643 for testing). Since current computers can easily manage larger numbers of documents, all 34,389 cardiovascular disease abstracts out of the 50,216 medical abstracts from 1991 are made available.
11 PAPERS • 2 BENCHMARKS
OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit with at least three upvotes (38 GB).
133 PAPERS • NO BENCHMARKS YET
Paper Field is built from the Microsoft Academic Graph and maps paper titles to one of seven fields of study. Each field of study (geography, politics, economics, business, sociology, medicine, and psychology) has approximately 12K training examples.
1 PAPER • 1 BENCHMARK
RSDD-Time is a dataset of 598 manually annotated self-reported depression diagnosis posts from Reddit that include temporal information about the diagnosis. Annotations include whether a mental health condition is present and how recently the diagnosis happened. Additionally, the dataset includes exact temporal spans that relate to the date of diagnosis.
Created as part of the Social Media Mining for Health Applications (#SMM4H '20) shared tasks, this dataset consists of 9,515 tweets describing health issues. Each tweet is labeled for whether it contains information about an adverse side effect that occurred when taking a drug. The dataset was a joint effort with the UPenn HLP Center and the Chemoinformatics and Molecular Modeling Research Laboratory at Kazan Federal University.
A novel large dataset of social media posts from users with one or multiple mental health conditions along with matched control users.
16 PAPERS • NO BENCHMARKS YET
SmokEng is a dataset of 3,144 tweets selected based on the presence of colloquial slang related to smoking and annotated according to the semantics of each tweet.
A Tunisian Arabizi sentiment analysis dataset, collected from social networks, preprocessed for analytical studies, and annotated manually by native Tunisian speakers.
7 PAPERS • NO BENCHMARKS YET
This dataset consists of comments on Facebook posts by MINSA (Peru) about the HPV vaccine between 2019 and 2020. Each comment was read carefully and then classified manually. For this classification, the messages were interpreted: threads (comments and replies) were analyzed separately and labelled by topic ("Topic"). A health professional performed a second classification, and discrepancies were resolved by a third professional. Subcategories referring directly to HPV vaccines were then selected, and the classification was carried out using the "topic_c" categories.
TweetEval introduces an evaluation framework consisting of seven heterogeneous Twitter-specific classification tasks.
72 PAPERS • 2 BENCHMARKS
This is an entity-level Twitter sentiment analysis dataset. For each message, the task is to judge the sentiment of the entire sentence towards a given entity. For example, "A outperforms B" is positive for entity A but negative for entity B. The dataset contains ~70K labeled training messages and 1K labeled validation messages. It is available online for free on Kaggle.
4 PAPERS • 1 BENCHMARK
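To make the entity-level labelling concrete, a record pairs one message with one entity and one label, so the same text can appear twice with opposite labels; the field names below are assumptions for exposition, not the exact Kaggle column names:

```python
records = [
    {"text": "A outperforms B", "entity": "A", "sentiment": "positive"},
    {"text": "A outperforms B", "entity": "B", "sentiment": "negative"},
]
for r in records:
    print(f'{r["entity"]} in "{r["text"]}": {r["sentiment"]}')
```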
Web of Science (WOS) is a document classification dataset containing 46,985 documents with 134 categories, grouped under 7 parent categories.
48 PAPERS • 4 BENCHMARKS
Wiki-en is an annotated English dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN).
Wiki-zh is an annotated Chinese dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN). It contains 26,280 documents split into training, validation, and test sets.
Wikipedia Title is a dataset for learning character-level compositionality from the character visual characteristics. It consists of a collection of Wikipedia titles in Chinese, Japanese or Korean labelled with the category to which the article belongs.
3 PAPERS • NO BENCHMARKS YET
The Yelp Dataset is a valuable resource for academic research, teaching, and learning, providing a rich collection of real-world data on businesses, reviews, and user interactions. Key details:
- Reviews: 6,990,280 reviews from users.
- Businesses: information on 150,346 businesses.
- Pictures: a collection of 200,100 pictures.
- Metropolitan areas: data from 11 metropolitan areas.
- Tips: 908,915 tips provided by 1,987,897 users.
- Business attributes: details like hours, parking availability, and ambience for more than 1.2 million businesses.
- Aggregated check-ins: historical check-in data for each of the 131,930 businesses.
68 PAPERS • 21 BENCHMARKS
e-SNLI extends SNLI with human-annotated natural language explanations and is used for various goals, such as obtaining full-sentence justifications of a model's decisions, improving universal sentence representations, and transferring to out-of-domain NLI datasets.
123 PAPERS • 1 BENCHMARK
iLur News Texts is a dataset of over 12,000 news articles from iLur.am, categorized into 7 classes: sport, politics, weather, economy, accidents, art, and society. The articles are split into a train set (2,242k tokens) and a test set (425k tokens).