LIAR is a publicly available dataset for fake news detection. It contains 12.8K manually labeled short statements spanning a decade, collected in various contexts through POLITIFACT.COM's API. POLITIFACT.COM provides a detailed analysis report and links to source documents for each case, and each statement is evaluated by a POLITIFACT.COM editor for its truthfulness. The dataset can also be used for fact-checking research. Notably, it is an order of magnitude larger than the previously largest public fake news datasets of similar type.
114 PAPERS • 1 BENCHMARK
RealNews is a large corpus of news articles scraped from Common Crawl, limited to the 5,000 news domains indexed by Google News. The authors used the Newspaper Python library to extract the body and metadata from each article. News from Common Crawl dumps from December 2016 through March 2019 was used as training data; articles published in April 2019 from the April 2019 dump were used for evaluation. After deduplication, RealNews is 120 gigabytes without compression.
76 PAPERS • NO BENCHMARKS YET
The Weibo NER dataset is a Chinese Named Entity Recognition dataset drawn from the social media website Sina Weibo.
51 PAPERS • 2 BENCHMARKS
FakeNewsNet is collected from two fact-checking websites, GossipCop and PolitiFact. It contains news content with labels annotated by professional journalists and experts, along with social context information.
27 PAPERS • NO BENCHMARKS YET
A collection of fact-checking (FC) articles from snopes.com, provided as pairs of a multimodal tweet and an FC article.
22 PAPERS • 1 BENCHMARK
A collection of fact-checking (FC) articles from politifact.com, provided as pairs of a multimodal tweet and an FC article.
20 PAPERS • 1 BENCHMARK
FNC-1 was designed as a stance detection dataset and contains 75,385 labeled headline and article pairs. Each pair is labeled as one of agree, disagree, discuss, or unrelated. Each headline in the dataset is phrased as a statement.
19 PAPERS • 2 BENCHMARKS
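As an illustration of the four FNC-1 stance labels described above, the following is a minimal sketch of how they could be encoded and tallied before training a classifier. The `Stance` enum, the `parse_stance` and `label_counts` helpers, and the `(headline, body, label)` tuple layout are hypothetical and are not part of the dataset's own tooling.

```python
from collections import Counter
from enum import Enum


class Stance(Enum):
    """The four stance labels named in the FNC-1 description."""
    AGREE = "agree"
    DISAGREE = "disagree"
    DISCUSS = "discuss"
    UNRELATED = "unrelated"


def parse_stance(raw: str) -> Stance:
    """Normalize a raw label string; raises ValueError for unknown labels."""
    return Stance(raw.strip().lower())


def label_counts(pairs):
    """Count stances over (headline, body, label) tuples (assumed layout)."""
    return Counter(parse_stance(label) for _headline, _body, label in pairs)
```

Rejecting unknown labels with an error, rather than silently mapping them to a default, makes label noise visible early.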
Weibo21 is a benchmark dataset for multi-domain fake news detection (MFND) with annotated domain labels. It consists of 4,488 fake news items and 4,640 real news items from 9 different domains.
12 PAPERS • NO BENCHMARKS YET
Along with the COVID-19 pandemic, we are also fighting an "infodemic". Fake news and rumors are rampant on social media. Believing in rumors can cause significant harm, and this is further exacerbated during a pandemic. To tackle this, we curate and release a manually annotated dataset of 10,700 social media posts and articles of real and fake news on COVID-19. We benchmark the annotated dataset with four machine learning baselines: Decision Tree, Logistic Regression, Gradient Boost, and Support Vector Machine (SVM). We obtain the best performance of 93.46% F1-score with SVM.
11 PAPERS • 1 BENCHMARK
Fakeddit is a novel multimodal dataset for fake news detection consisting of over 1 million samples from multiple categories of fake news. After being processed through several stages of review, the samples are labeled according to 2-way, 3-way, and 6-way classification categories through distant supervision.
10 PAPERS • NO BENCHMARKS YET
MM-COVID is a dataset for fake news detection related to COVID-19. It provides multilingual fake news and the relevant social context. It contains 3,981 pieces of fake news content and 7,192 pieces of trustworthy information in 6 languages: English, Spanish, Portuguese, Hindi, French, and Italian.
For benchmarking, please refer to its variants UPFD-POL and UPFD-GOS.
10 PAPERS • 2 BENCHMARKS
NELA-GT-2018 is a dataset for the study of misinformation that consists of 713k articles collected between 02/2018 and 11/2018. These articles are collected directly from 194 news and media outlets, including mainstream, hyper-partisan, and conspiracy sources. It includes ground truth ratings of the sources from 8 different assessment sites covering multiple dimensions of veracity, including reliability, bias, transparency, adherence to journalistic standards, and consumer trust.
9 PAPERS • NO BENCHMARKS YET
For LIAR-RAW, we extended the public dataset LIAR-PLUS (Alhindi et al., 2018) with relevant raw reports containing fine-grained claims from Politifact. LIAR-RAW is based on LIAR, and its gold labels are taken from Politifact. To reduce the dependency on fact-checked reports, each claim is paired with additional raw reports, which we store in a single file in the LIAR format.
6 PAPERS • 1 BENCHMARK
An annotated dataset of ~50K news items that can be used for building automated fake news detection systems for a low-resource language like Bangla.
5 PAPERS • NO BENCHMARKS YET
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.
5 PAPERS • 3 BENCHMARKS
NELA-GT-2019 is an updated version of the NELA-GT-2018 dataset. NELA-GT-2019 contains 1.12M news articles from 260 sources collected between January 1st 2019 and December 31st 2019. Just as with NELA-GT-2018, these sources come from a wide range of mainstream news sources and alternative news sources. Included with the dataset are source-level ground truth labels from 7 different assessment sites covering multiple dimensions of veracity.
NELA-GT-2020 is an updated version of the NELA-GT-2019 dataset. NELA-GT-2020 contains nearly 1.8M news articles from 519 sources collected between January 1st, 2020 and December 31st, 2020. Just as with NELA-GT-2018 and NELA-GT-2019, these sources come from a wide range of mainstream news sources and alternative news sources. Included in the dataset are source-level ground truth labels from Media Bias/Fact Check (MBFC) covering multiple dimensions of veracity. Additionally, new in the 2020 dataset are the Tweets embedded in the collected news articles, adding an extra layer of information to the data.
4 PAPERS • NO BENCHMARKS YET
We constructed RAWFC from scratch by collecting claims from Snopes and retrieving relevant raw reports using claim keywords. To reduce the dependency on fact-checked reports, RAWFC relies on raw reports, with gold labels taken from Snopes. Each instance in the train/val/test set is presented as a single file.
4 PAPERS • 1 BENCHMARK
Some Like it Hoax is a fake news detection dataset consisting of 15,500 Facebook posts and 909,236 users.
The Gossipcop variant of the UPFD dataset for benchmarking.
3 PAPERS • 1 BENCHMARK
AraCOVID19-MFH is a manually annotated multi-label Arabic COVID-19 fake news and hate speech detection dataset. The dataset contains 10,828 Arabic tweets annotated with 10 different labels.
2 PAPERS • NO BENCHMARKS YET
A Dataset to Identify Manipulated Social Media News in Bangla
The PolitiFact variant of the UPFD dataset for benchmarking.
2 PAPERS • 1 BENCHMARK
Expertly-curated benchmark dataset for fake news detection in Filipino.
1 PAPER • NO BENCHMARKS YET
The LIAR dataset has been widely used by fake news detection researchers since its release, and alongside a great deal of research, the community has provided a variety of feedback for improving it. We incorporated this feedback and released the LIAR2 dataset, a new benchmark dataset of ~23k statements manually labeled by professional fact-checkers for fake news detection tasks. We used a split ratio of 8:1:1 for the training set, the test set, and the validation set; details are provided in the paper "An Enhanced Fake News Detection System With Fuzzy Deep Learning". The LIAR2 dataset can be accessed at Huggingface and Github, and statistical information for LIAR and LIAR2 is provided in the table below:
1 PAPER • 1 BENCHMARK
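The 8:1:1 split mentioned in the LIAR2 description above can be sketched as follows. This is an illustration only: the `split_8_1_1` function, the shuffling step, and the seed are assumptions, and the procedure actually used for LIAR2 is the one described in the authors' paper.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle items and split them into three parts at an 8:1:1 ratio.

    Returns (train, validation, test). Any remainder from integer
    division ends up in the last part.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = n * 8 // 10
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Fixing the seed keeps the three parts stable across runs, which matters when results are compared against a published benchmark split.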
Search Engine Optimization (SEO) attributes provide strong signals for predicting news site reliability. We introduce a novel attributed webgraph dataset with labeled news domains and their connections to outlinking and backlinking domains. Finally, we introduce and evaluate a novel graph-based algorithm for discovering previously unknown misinformation news sources.
The task addresses the problem of the appearance and propagation of posts that share misleading multimedia content (images or video). In the context of the task, different types of misleading use are considered:
The CIDII dataset is a binary classification dataset with two classes, correct information and disinformation, related to Islamic issues. It accompanies our research paper "Disinformation Detection About Islamic Issues on Social Media Using Deep Learning Techniques", published in the MJCS journal: https://ejournal.um.edu.my/index.php/MJCS/article/view/41935
0 PAPERS • NO BENCHMARKS YET