Antonio Gulli’s corpus of news articles is a collection of more than 1 million news articles, gathered from more than 2,000 news sources by ComeToMyHead over more than one year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004. The dataset is provided to the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity.
2 PAPERS • NO BENCHMARKS YET
Biographical is a semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata.
The dataset is annotated with stance towards one topic, namely, the independence of Catalonia.
2 PAPERS • 3 BENCHMARKS
The Headline Grouping dataset is a binary classification dataset over pairs of news headlines. For each pair, the binary label indicates whether the two headlines belong to the same group (i.e., describe the same underlying event) or to distinct groups. The dataset contains a total of 20k annotated headline pairs, split into train, validation, and test portions.
A dataset for evaluating text classification, domain adaptation, and active learning models. It consists of 22,660 documents (tweets) collected in 2018 and 2019, spanning four domains: Alzheimer's, Parkinson's, Cancer, and Diabetes.
A dataset from the Law Stack Exchange, as used in "Parameter-Efficient Legal Domain Adaptation" (Li et al., 2022). The Law Stack Exchange is a community forum website where users post and answer legal questions. Each question is linked with its associated tags (e.g., "copyright" or "criminal-law"), yielding a multi-label classification task.
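The tag-to-label mapping described above amounts to multi-label binarization. The helper below is an illustrative sketch of that step (the function name, field layout, and example tags are our assumptions, not code released with the dataset):

```python
def binarize_tags(samples, label_set):
    """Map each sample's tag list to a 0/1 vector over the sorted label set."""
    index = {tag: i for i, tag in enumerate(sorted(label_set))}
    vectors = []
    for tags in samples:
        vec = [0] * len(index)
        for tag in tags:
            if tag in index:  # tags outside the chosen label set are ignored
                vec[index[tag]] = 1
        vectors.append(vec)
    return vectors

# Example tags from the description; the dataset's full label set differs.
labels = {"contract-law", "copyright", "criminal-law"}
y = binarize_tags([["copyright"], ["criminal-law", "copyright"]], labels)
```

A question tagged with several of the chosen labels simply gets several 1s in its vector, which is what makes the task multi-label rather than multi-class.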
A dataset constructed from YouTube comments: 72,098 comments posted by 43,859 users on 623 videos relevant to the crisis.
2 PAPERS • 1 BENCHMARK
Tweets and items from psychological scales for sexism detection with counterfactual examples.
The dataset contains 2,500 manually stance-labeled tweets, 1,250 for each candidate (Joe Biden and Donald Trump). These tweets were sampled from an unlabeled set of English tweets related to the 2020 US Presidential election collected by the research team. Using the Twitter Streaming API, the authors collected data with election-related hashtags and keywords; between January 2020 and September 2020, over 5 million tweets were collected, not including quotes and retweets.
2 PAPERS • 2 BENCHMARKS
This project contains instructions and code to reconstruct a dataset for the development and evaluation of forensic tools for detecting machine-generated text on social media.
1 PAPER • NO BENCHMARKS YET
In NLP, text classification is one of the primary problems we try to solve, and its uses in language analysis are indisputable. The lack of labeled training data makes such tasks harder in low-resource languages like Amharic. Collecting, labeling, and annotating this kind of data will encourage junior researchers, schools, and machine learning practitioners to apply existing classification models to their language. In this short paper, we introduce an Amharic text classification dataset consisting of more than 50k news articles categorized into 6 classes. The dataset is made available with easy baseline performances to encourage further studies and better-performing experiments.
1 PAPER • 1 BENCHMARK
BanglaEmotion is a manually annotated Bangla emotion corpus that captures the diversity of fine-grained emotion expression in social-media text. It uses fine-grained emotion labels (Sadness, Happiness, Disgust, Surprise, Fear, and Anger), which according to Paul Ekman (1999) are the six basic emotion categories. A large amount of raw text was collected from users' comments in two Facebook groups (Ekattor TV and Airport Magistrates) and from the public posts of a popular blogger and activist, Dr. Imran H Sarker. These comments are mostly reactions to ongoing socio-political issues and to the economic successes and failures of Bangladesh. A total of 32,923 comments were scraped from the three aforementioned sources, of which 6,314 were annotated into the six categories.
A dataset of 5,591 labeled issue tickets, originally created by Herzig et al. in "It's Not a Bug, It's a Feature: How Misclassification Impacts Bug Prediction".
We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396,209 papers. To our knowledge, CSL is the first scientific document dataset in Chinese.
A dataset of games of the card game "Cards Against Humanity" (CAH) played by human players, derived from the online CAH labs. Each round includes the cards presented to players (a "black" prompt with a blank or question and 10 "white" punchlines as possible responses) and which punchline the player picked, along with text and metadata.
The CareerCoach 2022 gold standard is available for download in NIF and JSON formats, and draws upon documents from a corpus of over 99,000 education courses retrieved from 488 different education providers.
Caselaw4 is a dataset of 350k common law judicial decisions from the U.S. Caselaw Access Project, of which 250k have been automatically annotated with binary outcome labels of AFFIRM and REVERSE.
A large-scale Chinese legal dataset for judgment prediction. The dataset contains more than 2.6 million criminal cases published by the Supreme People's Court of China, several times more than other datasets used in existing work on judgment prediction.
The Dissonance Twitter Dataset is a collection of tweets annotated for dissonance.
The Failure Mode Classification dataset released in the paper "MWO2KG and Echidna: Constructing and exploring knowledge graphs from maintenance data" by Stewart et al. The goal is to label a given observation (made by a maintainer) with the corresponding Failure Mode Code.
FTR-18 is a multilingual rumour dataset on football transfer news. Transfer rumours are continuously published by sports media; they can harm the image of a player or a club, or increase the player's market value. The dataset includes transfer articles written in English, Spanish, and Portuguese, and also comprises Twitter reactions related to the transfer rumours. FTR-18 is suited for rumour classification tasks and enables research on the linguistic patterns used in sports journalism.
This dataset collects various forms of Federal Reserve communications.
This dataset contains news headlines relevant to key forex pairs: AUDUSD, EURCHF, EURUSD, GBPUSD, and USDJPY. The data was extracted from the reputable platforms Forex Live and FXstreet over a period of 86 days, from January to May 2023. The dataset comprises 2,291 unique news headlines. Each headline includes an associated forex pair, timestamp, source, author, URL, and the corresponding article text. Data was collected using web scraping techniques executed via a custom service on a virtual machine. This service periodically retrieves the latest news for a specified forex pair (ticker) from each platform, parsing all available information. The collected data is then processed to extract details such as the article's timestamp, author, and URL. The URL is further used to retrieve the full text of each article. This data acquisition process repeats approximately every 15 minutes.
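A minimal sketch of the per-headline record described above (forex pair, timestamp, source, author, URL, and article text); the field names and ISO timestamp format are our assumptions, and the released files may use a different schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Headline:
    pair: str          # e.g. "EURUSD"
    timestamp: datetime
    source: str        # "Forex Live" or "FXstreet"
    author: str
    url: str
    text: str          # full article text retrieved via the URL

def parse_row(row: dict) -> Headline:
    """Build a Headline from one scraped record (assumes ISO 8601 timestamps)."""
    return Headline(
        pair=row["pair"],
        timestamp=datetime.fromisoformat(row["timestamp"]),
        source=row["source"],
        author=row["author"],
        url=row["url"],
        text=row["text"],
    )
```

Parsing timestamps into `datetime` objects up front makes it straightforward to align headlines with 15-minute market candles downstream.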
LLeQA is a French native dataset for studying information retrieval and long-form question answering in the legal domain. It consists of a knowledge corpus of 27,941 statutory articles collected from the Belgian legislation, and 1,868 legal questions posed by Belgian citizens and labeled by experienced jurists with a comprehensive answer rooted in relevant articles from the corpus.
A dataset introduced in Parameter-Efficient Legal Domain Adaptation (Li et al., 2022), built from the Legal Advice Reddit community (/r/legaladvice), with posts sourced from the Pushshift Reddit dataset. The dataset maps the text and title of each legal question to one of eleven classes, based on the original Reddit post's "flair" (i.e., tag). Questions are typically informal and use non-legal-specific language. Per the Legal Advice Reddit rules, posts must be about actual personal circumstances or situations. The labels are limited to the top eleven classes; other samples are removed from the dataset.
A corpus of 9k German and French user comments collected from migration-related news articles. It goes beyond the hate-neutral dichotomy and is instead annotated with 23 features, which in combination become descriptors of various types of speech, ranging from critical comments to implicit and explicit expressions of hate. The annotations were performed by 4 native speakers per language and achieve high (0.77) inter-annotator agreement.
The MATHWELL Human Annotation Dataset contains 4,734 synthetic word problems and answers generated by MATHWELL, a context-free grade school math word problem generator released in MATHWELL: Generating Educational Math Word Problems at Scale, and by comparison models (GPT-4, GPT-3.5, Llama-2, MAmmoTH, and LLEMMA), with expert human annotations for solvability, accuracy, appropriateness, and meets all criteria (MaC). Solvability means the problem is mathematically possible to solve; accuracy means the Program of Thought (PoT) solution arrives at the correct answer; appropriateness means the mathematical topic is familiar to a grade school student and the question's context is appropriate for a young learner; and MaC denotes questions labeled as solvable, accurate, and appropriate. Null values for accuracy and appropriateness indicate a question labeled as unsolvable, which cannot have an accurate solution and is automatically inappropriate.
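The MaC label follows mechanically from the three annotations described above. The helper below is our sketch of that derivation (the function name is not from the dataset); it treats the null accuracy/appropriateness of unsolvable questions as failing, matching the convention that such questions cannot meet all criteria:

```python
def meets_all_criteria(solvable, accurate, appropriate):
    """Return True only when a question is solvable, accurate, and appropriate.

    For unsolvable questions, accurate and appropriate are None (null),
    so they can never meet all criteria.
    """
    if not solvable:
        return False  # null accuracy/appropriateness cannot satisfy MaC
    return bool(accurate) and bool(appropriate)
```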
MetaHate is a meta-collection of 36 hate speech datasets built from social media comments, aimed at unifying efforts on hate speech detection.
MiST (Modals In Scientific Text) is a dataset of 3,737 modal instances in five scientific domains, annotated for their semantic, pragmatic, or rhetorical function.
The Modern Hebrew Sentiment Dataset is a sentiment analysis benchmark for Hebrew based on 12K social media comments, provided in two instances: a token-based and a morpheme-based setting.
This is the large version of the MuMiN dataset.
This is the medium version of the MuMiN dataset.
This is the small version of the MuMiN dataset.
The dataset consists of titles and abstracts from NLP-related papers. Each paper is annotated with multiple fields of study from an NLP taxonomy. The training dataset contains 178,521 weakly annotated samples. The test dataset consists of 828 manually annotated samples from the EMNLP22 conference. The manually labeled test dataset might not contain all possible classes since it consists of EMNLP22 papers only, and some rarer classes haven’t been published there. Therefore, we advise creating an additional test or validation set from the train data that includes all the possible classes.
Paper Field is built from the Microsoft Academic Graph and maps paper titles to one of 7 fields of study. Each field of study - geography, politics, economics, business, sociology, medicine, and psychology - has approximately 12K training examples.
The Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE (SILICONE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems specifically designed for spoken language. All datasets are in English and cover a large variety of domains (e.g., daily life, scripted scenarios, joint task completion, phone call conversations, and television dialogue). Some datasets additionally include emotion and/or sentiment labels.
SciHTC is a dataset for hierarchical multi-label text classification (HMLTC) of scientific papers which contains 186,160 papers and 1,233 categories from the ACM CCS tree.
This resource contains 10.5 million paragraphs with associated statement labels, realized as one paragraph per file, one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, from which only the first paragraph immediately following the heading was selected. Headings include both structural sections (e.g., Introduction) and scholarly statement annotations (e.g., Definition, Proof, Remark).
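The one-paragraph-per-file, class-named-subdirectory layout described above can be traversed with a loader along these lines (a sketch assuming plain-text files; the function name and path handling are illustrative):

```python
from pathlib import Path

def load_statements(root):
    """Yield (class_label, paragraph_text) pairs from root/<class>/<file>."""
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        for path in sorted(class_dir.iterdir()):
            # the subdirectory name is the annotated class, e.g. "Proof"
            yield class_dir.name, path.read_text(encoding="utf-8")
```

Because each paragraph keeps one sentence per line, splitting the yielded text on newlines recovers the sentence segmentation.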
The ShapeIt dataset introduced by Alper et al. (2023) consists of 109 nouns and noun phrases along with the basic shape normally associated with that item, chosen from the set {circle, rectangle, triangle}.
ShortPersianEmo is a new dataset for emotion recognition in Persian short texts. It is a single-label dataset containing 5,472 short Persian texts collected from Twitter and Digikala, annotated according to Rachael Jack's emotion model with five classes: happiness, sadness, anger, fear, and other. Unlike publicly accessible datasets that impose no restrictions on text length, ShortPersianEmo focuses specifically on short texts; the average text length is 56 words. Table 1 presents a comparison between ShortPersianEmo and other Persian emotion detection datasets from the literature. For more information, please read our paper, and cite it if you use this dataset in any research work.
The Mafia Dataset was created to model the behavior of deceptive actors in the context of the Mafia game, as described in the paper “Putting the Con in Context: Identifying Deceptive Actors in the Game of Mafia”. We hope that this dataset will be of use to others studying the effects of deception on language use.
We introduce a large, semi-automatically generated dataset of ~400,000 descriptive sentences about commonsense knowledge that can be true or false. Negation is present, in different forms, in about two-thirds of the corpus, and we use the dataset to evaluate LLMs.
1 PAPER • 2 BENCHMARKS
This dataset for abusive content detection on Twitter consists of two sets of annotations for the same set of tweets: one where the human annotators had access to each tweet's context and one where they did not.
Wiki-Reliability is the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues. Templates are tags used by expert Wikipedia editors to indicate content issues, such as the presence of "non-neutral point of view" or "contradictory articles", and serve as a strong signal for detecting reliability issues in a revision. We select the 10 most popular reliability-related templates on Wikipedia, and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative with respect to each template. Each positive/negative example in the dataset comes with the full article text and 20 features from the revision's metadata. We provide an overview of the possible downstream tasks enabled by such data, and show that Wiki-Reliability can be used to train large-scale models for content reliability prediction.
Wiki-en is an annotated English dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN).
Wiki-zh is an annotated Chinese dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN). It contains 26,280 documents split into training, validation, and test sets.
With the emergence of the COVID-19 pandemic, the political and the medical aspects of disinformation merged as the problem got elevated to a whole new level to become the first global infodemic. Fighting this infodemic has been declared one of the most important focus areas of the World Health Organization, with dangers ranging from promoting fake cures, rumors, and conspiracy theories to spreading xenophobia and panic. Addressing the issue requires solving a number of challenging problems such as identifying messages containing claims, determining their check-worthiness and factuality, and their potential to do harm as well as the nature of that harm, to mention just a few. To address this gap, we release a large dataset of 16K manually annotated tweets for fine-grained disinformation analysis that focuses on COVID-19, combines the perspectives and the interests of journalists, fact-checkers, social media platforms, policy makers, and society, and covers Arabic, Bulgarian, Dutch, and
0 PAPERS • NO BENCHMARKS YET
The Eduge news classification dataset, provided by Bolorsoft LLC, was used to train the Eduge.mn production news classifier. It contains 75K news articles in 9 categories: arts and culture (урлаг соёл), economy (эдийн засаг), health (эрүүл мэнд), law (хууль), politics (улс төр), sports (спорт), technology (технологи), education (боловсрол), and environment (байгал орчин).