The MATHWELL Human Annotation Dataset contains 4,734 synthetic word problems and answers generated by MATHWELL, a context-free grade school math word problem generator released in "MATHWELL: Generating Educational Math Word Problems at Scale", and by comparison models (GPT-4, GPT-3.5, Llama-2, MAmmoTH, and LLEMMA), with expert human annotations for solvability, accuracy, appropriateness, and meets all criteria (MaC). Solvability means the problem is mathematically possible to solve; accuracy means the Program of Thought (PoT) solution arrives at the correct answer; appropriateness means the mathematical topic is familiar to a grade school student and the question's context is suitable for a young learner; and MaC denotes questions labeled as solvable, accurate, and appropriate. Null values for accuracy and appropriateness indicate a question labeled as unsolvable, which cannot have an accurate solution and is automatically inappropriate. Based on our annotations, 8
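A minimal sketch of how the MaC label follows from the other three annotations (the function and field names here are illustrative, not the dataset's actual column names):

```python
from typing import Optional

def meets_all_criteria(solvable: bool,
                       accurate: Optional[bool],
                       appropriate: Optional[bool]) -> bool:
    """Derive the MaC label from the three expert annotations."""
    if not solvable:
        # Unsolvable questions carry null accuracy/appropriateness values:
        # they cannot have an accurate solution and are automatically
        # inappropriate, so they never meet all criteria.
        return False
    return bool(accurate) and bool(appropriate)

assert meets_all_criteria(True, True, True)        # solvable, accurate, appropriate
assert not meets_all_criteria(False, None, None)   # unsolvable => nulls => not MaC
```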
1 PAPER • NO BENCHMARKS YET
MetaHate is a meta-collection of 36 hate speech datasets from social media comments, introduced in "MetaHate: A Dataset for Unifying Efforts on Hate Speech Detection".
A dataset specifically tailored to the biotech news sector, aiming to transcend the limitations of existing benchmarks. It is rich in complex content, comprising biotech news articles that cover a wide range of events, thus providing a more nuanced view of information extraction challenges.
0 PAPERS • NO BENCHMARKS YET
ShortPersianEmo is a new dataset for emotion recognition in Persian short texts. It is a single-label dataset containing 5,472 short Persian texts collected from Twitter and Digikala, annotated according to Rachael Jack's emotion model with five classes: happiness, sadness, anger, fear, and other. Unlike publicly accessible datasets that impose no restrictions on text length, ShortPersianEmo focuses specifically on short texts; the average text length is 56 words. Table 1 presents a comparison between ShortPersianEmo and other datasets from the literature for emotion detection in Persian text. For more information on this dataset, please read our paper, and if you use it in any research work, please cite our paper.
1 PAPER • 1 BENCHMARK
We introduce a large, semi-automatically generated dataset of ~400,000 descriptive sentences about commonsense knowledge that can be true or false. Negation is present, in different forms, in about two-thirds of the corpus, and we use the dataset to evaluate LLMs.
1 PAPER • 2 BENCHMARKS
LLeQA is a native French dataset for studying information retrieval and long-form question answering in the legal domain. It consists of a knowledge corpus of 27,941 statutory articles collected from the Belgian legislation, and 1,868 legal questions posed by Belgian citizens and labeled by experienced jurists with a comprehensive answer rooted in relevant articles from the corpus.
The dataset consists of titles and abstracts from NLP-related papers. Each paper is annotated with multiple fields of study from an NLP taxonomy. The training dataset contains 178,521 weakly annotated samples. The test dataset consists of 828 manually annotated samples from the EMNLP22 conference. The manually labeled test dataset might not contain all possible classes since it consists of EMNLP22 papers only, and some rarer classes haven’t been published there. Therefore, we advise creating an additional test or validation set from the train data that includes all the possible classes.
This dataset contains news headlines relevant to key forex pairs: AUDUSD, EURCHF, EURUSD, GBPUSD, and USDJPY. The data was extracted from reputable platforms Forex Live and FXstreet over a period of 86 days, from January to May 2023. The dataset comprises 2,291 unique news headlines. Each headline includes an associated forex pair, timestamp, source, author, URL, and the corresponding article text. Data was collected using web scraping techniques executed via a custom service on a virtual machine. This service periodically retrieves the latest news for a specified forex pair (ticker) from each platform, parsing all available information. The collected data is then processed to extract details such as the article's timestamp, author, and URL. The URL is further used to retrieve the full text of each article. This data acquisition process repeats approximately every 15 minutes.
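A rough sketch of the polling loop this description implies; the fetch functions are placeholders, since the actual parsing logic is specific to each platform's page structure:

```python
import time

PAIRS = ["AUDUSD", "EURCHF", "EURUSD", "GBPUSD", "USDJPY"]
POLL_INTERVAL_S = 15 * 60  # the service repeats roughly every 15 minutes

def fetch_latest_headlines(pair: str) -> list[dict]:
    # Placeholder: scrape the Forex Live and FXstreet listings for `pair`
    # and parse timestamp, source, author, URL, and headline text.
    return []

def fetch_article_text(url: str) -> str:
    # Placeholder: follow the article URL and extract the full text.
    return ""

def poll_forever() -> None:
    while True:
        for pair in PAIRS:
            for item in fetch_latest_headlines(pair):
                item["article_text"] = fetch_article_text(item["url"])
                # ...deduplicate against already-seen headlines and store...
        time.sleep(POLL_INTERVAL_S)
```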
The Dissonance Twitter Dataset is a collection of tweets annotated for dissonance.
The AI-GA (Artificial Intelligence Generated Abstracts) dataset is a collection of abstracts and titles, with half of the abstracts being AI-generated and the other half being original. This dataset is designed to be used for research and experimentation in the field of natural language processing, particularly in the context of language generation and machine learning.
The ShapeIt dataset introduced by Alper et al. (2023) consists of 109 nouns and noun phrases along with the basic shape normally associated with that item, chosen from the set {circle, rectangle, triangle}.
This project contains instructions and code to reconstruct a dataset for the development and evaluation of forensic tools for detecting machine-generated text on social media.
LoT-insts contains over 25k classes whose frequencies are naturally long-tail distributed. Its test set is drawn from four different subsets: many-, medium-, and few-shot sets, as well as a zero-shot open set. To the best of our knowledge, this is the first natural language dataset that focuses on this long-tailed and open classification problem.
Multilabeled News Dataset (MN-DS) is a dataset for news classification. It consists of 10,917 articles in 17 first-level and 109 second-level categories from 215 media sources.
3 PAPERS • NO BENCHMARKS YET
MiST (Modals In Scientific Text) is a dataset containing 3737 modal instances in five scientific domains annotated for their semantic, pragmatic, or rhetorical function.
The Medical Abstracts dataset contains 14,438 medical abstracts describing 5 different classes of patient conditions, with the entire dataset annotated. The dataset is split into training and test sets.
3 PAPERS • 1 BENCHMARK
The Failure Mode Classification dataset was released in the paper "MWO2KG and Echidna: Constructing and exploring knowledge graphs from maintenance data" by Stewart et al. The goal is to label a given observation (made by a maintainer) with the corresponding Failure Mode Code.
SciHTC is a dataset for hierarchical multi-label text classification (HMLTC) of scientific papers which contains 186,160 papers and 1,233 categories from the ACM CCS tree.
A dataset from the Law Stack Exchange, as used in "Parameter-Efficient Legal Domain Adaptation" (Li et al., 2022). The Law Stack Exchange is a community forum-based website containing legal questions and their answers. We link each question with its associated tags (e.g., "copyright" or "criminal-law") and perform a multi-label classification task.
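A small sketch of how such tag sets are typically turned into multi-label targets (using scikit-learn; the example tags are the two mentioned above):

```python
from sklearn.preprocessing import MultiLabelBinarizer

question_tags = [
    ["copyright"],                  # question tagged with one label
    ["criminal-law", "copyright"],  # question tagged with two labels
]
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(question_tags)
print(mlb.classes_)  # ['copyright' 'criminal-law']
print(Y)             # [[1 0]
                     #  [1 1]]
```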
2 PAPERS • NO BENCHMARKS YET
A new dataset introduced in "Parameter-Efficient Legal Domain Adaptation" (Li et al., 2022) from the Legal Advice Reddit community (known as "/r/legaladvice"), sourcing the Reddit posts from the Pushshift Reddit dataset. The dataset maps the text and title of each legal question posted into one of eleven classes, based on the original Reddit post's "flair" (i.e., tag). Questions are typically informal and use non-legal-specific language. Per the Legal Advice Reddit rules, posts must be about actual personal circumstances or situations. We limit the labels to the top eleven classes and remove the other samples from the dataset.
A dataset of games of the card game "Cards Against Humanity" (CAH) played by human players, derived from the online CAH labs. Each round includes the cards presented to players (a "black" prompt card with a blank or question and 10 "white" punchline cards as possible responses) and the punchline picked by a player that round, along with text and metadata.
Text Classification Attack Benchmark (TCAB) is a dataset for analyzing, understanding, detecting, and labeling adversarial attacks against text classifiers. TCAB includes 1.5 million attack instances, generated by twelve adversarial attacks targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English. The process of generating attacks is automated, so TCAB can easily be extended to incorporate new text attacks and better classifiers as they are developed.
MTEB is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. The 8 task types are Bitext mining, Classification, Clustering, Pair Classification, Reranking, Retrieval, Semantic Textual Similarity and Summarisation. The 56 datasets contain varying text lengths and they are grouped into three categories: Sentence to sentence, Paragraph to paragraph, and Sentence to paragraph.
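The benchmark ships with a companion Python package; a minimal evaluation sketch (the task name and model below are examples, and the API may differ across package versions):

```python
from sentence_transformers import SentenceTransformer
from mteb import MTEB

model = SentenceTransformer("all-MiniLM-L6-v2")       # any embedding model
evaluation = MTEB(tasks=["Banking77Classification"])  # one of the 56 datasets
results = evaluation.run(model, output_folder="results")
```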
51 PAPERS • 8 BENCHMARKS
SV-Ident comprises 4,248 sentences from social science publications in English and German. The data is the official data for the Shared Task: “Survey Variable Identification in Social Science Publications” (SV-Ident) 2022. Sentences are labeled with variables that are mentioned either explicitly or implicitly.
3 PAPERS • 2 BENCHMARKS
We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396,209 papers. To our knowledge, CSL is the first scientific document dataset in Chinese.
StEduCov is a dataset annotated for stances toward online education during the COVID-19 pandemic. It contains 17,097 tweets gathered over 15 months, from March 2020 to May 2021, using the Twitter API. The tweets are manually annotated into agree, disagree, or neutral classes. We used a set of relevant hashtags and keywords; specifically, we combined hashtags such as '#COVID 19' or '#Coronavirus' with keywords such as 'education', 'online learning', 'distance learning', and 'remote learning'. To ensure high annotation quality, three different annotators annotated each tweet, and at least one of three judges reviewed it. Annotators were guided by instructions, such as that the disagree class requires a clear negative statement about online education or its impact, with additional guidance for tweets that are negative but refer to other people (e.g., 'my children hate online learning').
The Mafia Dataset was created to model the behavior of deceptive actors in the context of the Mafia game, as described in the paper “Putting the Con in Context: Identifying Deceptive Actors in the Game of Mafia”. We hope that this dataset will be of use to others studying the effects of deception on language use.
6,000 French user reviews from three applications on Google Play (Garmin Connect, Huawei Health, Samsung Health) are manually labelled. We selected four labels: rating, bug report, feature request, and user experience.
A benchmark dataset of abstracts and titles from 100,000 arXiv scientific papers. It contains 10 classes and is balanced (exactly 10,000 papers per class). The classes are subcategories of computer science, physics, and math.
4 PAPERS • 1 BENCHMARK
JGLUE (Japanese General Language Understanding Evaluation) is built to measure general NLU ability in Japanese.
7 PAPERS • NO BENCHMARKS YET
A Russian dataset of emotional speech dialogues, assembled from ~3.5 hours of live speech by actors who each voiced pre-assigned emotions in dialogue for ~3 minutes. Each sample contains the name of the part from the original studio source, a speech file (16,000 or 44,100 Hz) of a human voice, one of 7 labeled emotions, and the speech-to-text transcription of the utterance.
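A minimal loading sketch for one sample's audio (the file name is hypothetical; any WAV reader works):

```python
import soundfile as sf  # assumed audio reader; librosa etc. also work

audio, sample_rate = sf.read("dialogue_part_01.wav")  # hypothetical file name
assert sample_rate in (16000, 44100)  # the two rates present in the dataset
print(f"{len(audio) / sample_rate:.1f} s of speech")
```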
Biographical is a semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata.
The CareerCoach 2022 gold standard is available for download in the NIF and JSON formats, and draws upon documents from a corpus of over 99,000 education courses retrieved from 488 different education providers.
A corpus of 9k German and French user comments collected from migration-related news articles. It goes beyond the hate-neutral dichotomy and is instead annotated with 23 features, which in combination become descriptors of various types of speech, ranging from critical comments to implicit and explicit expressions of hate. The annotations are performed by 4 native speakers per language and achieve a high inter-annotator agreement (0.77).
MASSIVE is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions.
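Slot annotations in MASSIVE use an inline bracket format inherited from SLURP; a small sketch of extracting (slot, value) pairs from it (the utterance below is illustrative, and the exact field format should be checked against the release):

```python
import re

# Bracket-style slot annotation as used in MASSIVE's annotated utterances.
annot_utt = "wake me up at [time : nine am] on [date : friday]"
slots = re.findall(r"\[([\w.]+) : ([^\]]+)\]", annot_utt)
print(slots)  # [('time', 'nine am'), ('date', 'friday')]
```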
52 PAPERS • 6 BENCHMARKS
BioRED is a first-of-its-kind biomedical relation extraction dataset with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene–disease, chemical–chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, BioRED labels each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information.
14 PAPERS • 3 BENCHMARKS
The Topic-Based Paragraph Classification in Genocide-Related Court Transcripts (GTC) dataset is the first reference corpus annotated with samples from genocide tribunals in different international criminal courts. It is made up of witness statements about violence experienced. The material consists of 1,475 text passages, with about 40 to 120 pages per transcript, covering 3 tribunals: the Extraordinary Chambers in the Courts of Cambodia (ECCC, 438 pages), the International Criminal Tribunal for Rwanda (ICTR, 566 pages), and the International Criminal Tribunal for the former Yugoslavia (ICTY, 416 pages). As no datasets containing genocide court transcripts, nor other forms of pre-structured or annotated text data in this field of research, have been published, the aim was to address this gap by providing a systematically annotated dataset.
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which have been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.
4 PAPERS • 3 BENCHMARKS
This is the large version of the MuMiN dataset.
This is the medium version of the MuMiN dataset.
This is the small version of the MuMiN dataset.
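All three versions can be loaded through the companion mumin package; a minimal sketch (a Twitter bearer token is required to rehydrate the tweets, and the exact API may vary across package versions):

```python
from mumin import MuminDataset

# size selects among the three versions above: 'small', 'medium', 'large'.
dataset = MuminDataset(twitter_bearer_token="<your-token>", size="small")
dataset.compile()  # downloads, rehydrates, and builds the graph
```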
MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks where the inputs consist of at least 10,000 words. The benchmark covers a wide variety of task types including translation, summarization, question answering, and classification. Additionally, there is a range of output lengths, from a single-word classification label all the way up to an output longer than the input text.
3 PAPERS • 6 BENCHMARKS
A dataset for evaluating text classification, domain adaptation, and active learning models. The dataset consists of 22,660 documents (tweets) collected in 2018 and 2019. It spans across four domains: Alzheimer's, Parkinson's, Cancer, and Diabetes.
Moral Stories is a crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations, and their respective consequences.
19 PAPERS • NO BENCHMARKS YET
With the emergence of the COVID-19 pandemic, the political and the medical aspects of disinformation merged as the problem got elevated to a whole new level to become the first global infodemic. Fighting this infodemic has been declared one of the most important focus areas of the World Health Organization, with dangers ranging from promoting fake cures, rumors, and conspiracy theories to spreading xenophobia and panic. Addressing the issue requires solving a number of challenging problems such as identifying messages containing claims, determining their check-worthiness and factuality, and their potential to do harm as well as the nature of that harm, to mention just a few. To address this gap, we release a large dataset of 16K manually annotated tweets for fine-grained disinformation analysis that focuses on COVID-19, combines the perspectives and the interests of journalists, fact-checkers, social media platforms, policy makers, and society, and covers Arabic, Bulgarian, Dutch, and
The Invisible Mobile Keyboard (IMK) Dataset contains user initials, age, type of mobile device, screen size, time taken to type each phrase, and typed phrases annotated with the coordinates (x and y) of each typed position. The collected dataset is the first and only dataset for the novel IMK decoding task.
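A sketch of one record's structure as the description implies (field names are assumptions, not the dataset's actual keys):

```python
from dataclasses import dataclass

@dataclass
class ImkSample:
    user_initials: str
    age: int
    device_type: str
    screen_size: str
    typing_time_ms: float                    # time taken to type the phrase
    phrase: str                              # the annotated typed phrase
    touch_points: list[tuple[float, float]]  # (x, y) per typed position
```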
KanHope is a code-mixed hope speech dataset for equality, diversity, and inclusion in Kannada, an under-resourced Dravidian language. The dataset consists of 6,176 user-generated comments in code-mixed Kannada crawled from YouTube and manually labelled as bearing hope speech or not.
2 PAPERS • 1 BENCHMARK
The MNAD corpus is a collection of over 1 million Moroccan news articles written in modern Arabic. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.