The AI2 Reasoning Challenge (ARC) dataset is a multiple-choice question-answering dataset containing questions from science exams from grade 3 to grade 9. The dataset is split into two partitions, Easy and Challenge, where the latter contains the more difficult questions that require reasoning. Most questions have 4 answer choices, with <1% of all questions having either 3 or 5 answer choices. ARC includes a supporting KB of 14.3M unstructured text passages.
118 PAPERS • 3 BENCHMARKS
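As a rough illustration of the multiple-choice format described above, the sketch below loads ARC-Challenge from the Hugging Face Hub and prints one question with its options; the dataset id `ai2_arc` and the field names (`question`, `choices`, `answerKey`) are assumptions based on the Hub copy and may differ in other distributions.

```python
# Minimal sketch (assumed Hub id "ai2_arc" and field names; verify against your copy).
from datasets import load_dataset

arc = load_dataset("ai2_arc", "ARC-Challenge", split="validation")

example = arc[0]
print(example["question"])                 # question stem
for label, text in zip(example["choices"]["label"], example["choices"]["text"]):
    print(f"  ({label}) {text}")           # usually 4 options, occasionally 3 or 5
print("gold answer:", example["answerKey"])
```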
LIAR is a publicly available dataset for fake news detection. It comprises 12.8K manually labeled short statements collected over a decade in various contexts from POLITIFACT.COM, which provides a detailed analysis report and links to source documents for each case. The dataset can also be used for fact-checking research, and it is an order of magnitude larger than previously released public fake news datasets of a similar type. Each statement was obtained via POLITIFACT.COM's API and evaluated by a POLITIFACT.COM editor for its truthfulness.
112 PAPERS • 1 BENCHMARK
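For reference, a minimal sketch of reading the LIAR TSV release with pandas follows; the 14-column layout is an assumption based on the dataset's README and should be checked against the downloaded copy.

```python
# Minimal sketch, assuming the standard 14-column TSV layout from the LIAR README.
import csv
import pandas as pd

columns = [
    "id", "label", "statement", "subject", "speaker", "speaker_job",
    "state", "party", "barely_true_ct", "false_ct", "half_true_ct",
    "mostly_true_ct", "pants_on_fire_ct", "context",
]
train = pd.read_csv("train.tsv", sep="\t", names=columns, quoting=csv.QUOTE_NONE)
print(train["label"].value_counts())   # distribution over the truthfulness ratings
```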
A fact-checking (FC) dataset containing pairs of a multimodal tweet and an FC article, collected from snopes.com.
21 PAPERS • 1 BENCHMARK
FNC-1 was designed as a stance detection dataset and contains 75,385 labeled headline-article pairs. Each pair is labelled as agree, disagree, discuss, or unrelated, and each headline in the dataset is phrased as a statement.
19 PAPERS • 2 BENCHMARKS
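Systems on FNC-1 are commonly compared with the challenge's weighted score, which gives 25% of the credit for the related/unrelated decision and 75% for the fine-grained stance among related pairs. The sketch below reimplements that scheme from its public description; it is not the official scorer and should be verified against it before reporting results.

```python
# Hedged reimplementation of the FNC-1 weighted score (verify against the official scorer).
RELATED = {"agree", "disagree", "discuss"}

def fnc1_score(gold, pred):
    score = 0.0
    for g, p in zip(gold, pred):
        if g == p:
            score += 0.25              # credit for the exact label
            if g in RELATED:
                score += 0.50          # extra credit for a correct related stance
        if g in RELATED and p in RELATED:
            score += 0.25              # credit for getting the related/unrelated split right
    return score

gold = ["agree", "unrelated", "discuss"]
pred = ["discuss", "unrelated", "discuss"]
print(fnc1_score(gold, pred))          # 0.25 + 0.25 + 1.0 = 1.5
```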
VAST consists of a large range of topics covering broad themes, such as politics (e.g., ‘a Palestinian state’), education (e.g., ‘charter schools’), and public health (e.g., ‘childhood vaccination’). In addition, the data includes a wide range of similar expressions (e.g., ‘guns on campus’ versus ‘firearms on campus’). This variation captures how humans might realistically describe the same topic and contrasts with the lack of variation in existing datasets.
14 PAPERS • 1 BENCHMARK
A large-scale stance detection dataset built from comments written by candidates in Swiss elections. The dataset consists of German, French and Italian text, allowing for a cross-lingual evaluation of stance detection. It contains 67,000 comments on more than 150 political issues (targets).
12 PAPERS • NO BENCHMARKS YET
Perspectrum is a dataset of claims, perspectives and evidence. Online debate websites were used to create the initial data collection, which was then augmented using search engines to expand and diversify the dataset, with crowd-sourcing used to filter out noise and ensure high-quality data. The dataset contains 1k claims, accompanied by pools of 10k perspective sentences and 8k evidence paragraphs.
10 PAPERS • 1 BENCHMARK
MGTAB is the first standardized graph-based benchmark for stance and bot detection. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. For more details, please refer to the MGTAB paper.
8 PAPERS • 2 BENCHMARKS
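Purely as an illustration of working with user nodes and typed relations of this kind, the sketch below builds a small heterogeneous graph with networkx; the attribute and relation names ("follower", "mention", etc.) are placeholders, not MGTAB's actual schema.

```python
# Illustrative only: users as nodes, typed relations as edges (placeholder names).
import networkx as nx

g = nx.MultiDiGraph()
g.add_node("user_1", stance="against", is_bot=False)
g.add_node("user_2", stance="favor", is_bot=True)
g.add_edge("user_1", "user_2", relation="follower")   # one of several relation types
g.add_edge("user_2", "user_1", relation="mention")

for u, v, data in g.edges(data=True):
    print(f"{u} -[{data['relation']}]-> {v}")
```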
For LIAR-RAW, we extended the public LIAR-PLUS dataset (Alhindi et al., 2018), which contains fine-grained claims from Politifact, with relevant raw reports. LIAR-RAW is based on LIAR, and its gold labels refer to Politifact's verdicts. To alleviate the dependency on fact-checked reports, we collected additional raw reports for each claim and stored them in a single file that follows the LIAR format.
6 PAPERS • 1 BENCHMARK
RAWFC was constructed from scratch by collecting claims from Snopes and relevant raw reports retrieved via claim keywords, which alleviates the dependency on fact-checked reports; gold labels refer to Snopes' verdicts. Each instance in the train/val/test set is presented as a single file.
4 PAPERS • 1 BENCHMARK
COVID-CQ is a stance data set of user-generated content on Twitter in the context of COVID-19.
3 PAPERS • NO BENCHMARKS YET
CoVaxLies v1 includes 17 known Misinformation Targets (MisTs) found on Twitter about the COVID-19 vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each MisT. This collection is a first step in providing large-scale resources for misinformation detection and misinformation stance identification.
Contains 3,689,229 English news articles on politics, gathered from 11 United States (US) media outlets covering a broad ideological spectrum.
2 PAPERS • NO BENCHMARKS YET
The dataset is annotated with stance towards one topic, namely, the independence of Catalonia.
2 PAPERS • 3 BENCHMARKS
CoVaxFrames includes 113 Vaccine Hesitancy Framings found on Twitter about the COVID-19 vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each framing.
CoVaxLies v2 includes 47 Misinformation Targets (MisTs) found on Twitter about the COVID-19 vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each MisT. This collection is a first step in providing large-scale resources for misinformation detection and misinformation stance identification.
Includes Russian tweets and news comments from multiple sources covering multiple stories, along with benchmark text-classification approaches to stance detection for this data in Russian.
2 PAPERS • 1 BENCHMARK
The "Stance Detection in COVID-19 Tweets" dataset represents an evolution of stance detection research, tailored to address the unique and urgent challenges presented by the COVID-19 pandemic. This dataset is designed to capture public opinions, beliefs, and sentiments towards various aspects of the COVID-19 crisis, such as government policies, vaccination campaigns, public health recommendations, and the impact of the virus on daily life. It facilitates the analysis of how people's stances on these issues are expressed in social media discourse, specifically through tweets.
The dataset contains 2,500 manually stance-labeled tweets, 1,250 for each candidate (Joe Biden and Donald Trump). These tweets were sampled from an unlabeled set of English tweets related to the 2020 US Presidential election that the research team collected through the Twitter Streaming API using election-related hashtags and keywords. Between January 2020 and September 2020, over 5 million tweets were collected, not including quotes and retweets.
2 PAPERS • 2 BENCHMARKS
Will-They-Won't-They (WT-WT) is a large dataset of English tweets targeted at stance detection for the rumor verification task. The dataset is constructed based on tweets that discuss five recent merger and acquisition (M&A) operations of US companies, mainly from the healthcare sector.
COVMis-Stance is a stance detection dataset for COVID-19 misinformation. It covers fake news and claims related to COVID-19: fake news items were collected from articles on fact-checking sites, and fake claims from the WHO's official Twitter account. The dataset contains 2,631 tweets annotated for stance towards 111 COVID-19 misinformation items.
1 PAPER • NO BENCHMARKS YET
Conversational Stance Detection (CSD) is a dataset with annotations of stances and the structures of conversation threads. It consists of 500 conversation threads (including 500 posts and 5376 comments) from six major social media platforms in Hong Kong.
ExaASC is a dataset for target-based stance detection in Arabic that covers different types of targets, such as persons, entities and events. The corpus contains about 9,500 tweets with replies, with the target specified in the source tweet. Each sample has at least two stance annotations provided by Exa Corporation annotators, and the stance of each reply is annotated toward the target in the corresponding source tweet. Each record has the following fields: id, main (source tweet), reply, target, the label given under each annotator id, and majority_label.
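As a small illustration of the record layout listed above, the sketch below aggregates per-annotator labels into a majority label; the concrete field and label values are placeholders, and the aggregation rule is illustrative rather than the corpus authors' exact procedure.

```python
# Illustrative majority-label aggregation for an ExaASC-style record (placeholder values).
from collections import Counter

record = {
    "id": 1,
    "main": "...",       # source tweet that specifies the target
    "reply": "...",      # reply whose stance is annotated
    "target": "...",
    "annotations": {"annotator_1": "favor", "annotator_2": "favor", "annotator_3": "against"},
}

counts = Counter(record["annotations"].values())
label, votes = counts.most_common(1)[0]
record["majority_label"] = label if votes > len(record["annotations"]) / 2 else "no_majority"
print(record["majority_label"])   # "favor"
```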
HpVaxFrames includes 64 Vaccine Hesitancy Framings found on Twitter about the HPV vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each framing.
MMVax-Stance includes 113 Vaccine Hesitancy Framings found on Twitter about the COVID-19 vaccines. Language experts annotated multimodal image-text tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each framing.
This dataset of medical misinformation was collected and published by the Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains the full texts of the articles, their original source URLs and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of an annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; each mapping consists of two values: claim presence (i.e., whether a claim is contained in the article) and article stance towards the claim.
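A minimal sketch of how one of the claim-to-article mappings described above might be represented is given below; the field names and stance values are illustrative stand-ins, not the dataset's actual schema.

```python
# Illustrative record for a claim-to-article mapping (field/value names are placeholders).
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClaimArticleMapping:
    article_id: str
    claim_id: str
    claim_present: bool            # is the verified claim discussed in the article?
    article_stance: Optional[str]  # e.g. "supporting" / "denying", only if the claim is present
    annotation_source: str         # "manual" (573 mappings) or "automatic" (>51k mappings)

m = ClaimArticleMapping("article_001", "claim_042", True, "supporting", "manual")
print(m)
```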
A novel stance detection dataset covering 419 different controversial issues and their related pros and cons, collected by procon.org in a nonpartisan format.
StEduCov is a dataset annotated for stances toward online education during the COVID-19 pandemic. It contains 17,097 tweets gathered over 15 months, from March 2020 to May 2021, using the Twitter API with a set of relevant hashtags and keywords: specifically, a combination of hashtags such as '#COVID19' or '#Coronavirus' with keywords such as 'education', 'online learning', 'distance learning' and 'remote learning'. The tweets are manually annotated into agree, disagree or neutral classes. To ensure high annotation quality, each tweet was annotated by three different annotators and revised by at least one of three judges, who followed annotation instructions such as requiring a clear negative statement about online education or its impact for the disagree class and giving special consideration to tweets that are negative but refer to other people (e.g. 'my children hate online learning').
1 PAPER • 1 BENCHMARK
SemEval-2016 Task 6, titled "Stance Detection in Tweets," provides a specialized dataset for the computational linguistics and natural language processing (NLP) communities to explore and analyze users' positions towards certain targets, based solely on the content of their tweets. Stance detection aims to determine whether the author of a piece of text is in favor of, against, or neutral towards a specified target, such as a political figure, policy, or product.
Combines CoVaxFrames and HpVaxFrames into a unified dataset of 113 Vaccine Hesitancy Framings found on Twitter about the COVID-19 vaccines and 64 Vaccine Hesitancy Framings found on Twitter about the HPV vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each framing.
A Natural Language Resource for Learning to Recognize Misinformation about the COVID-19 and HPV Vaccines.
Political stance in Danish. Examples represent statements by politicians and are annotated as for, against, or neutral towards a given topic/article.
This is a stance detection dataset in the Zulu language. The data was translated into Zulu from English source texts by native Zulu speakers.