Dataset of hate speech annotated at the sentence level on English-language Internet forum posts. The source forum is Stormfront, a large online community of white nationalists. A total of 10,568 sentences have been extracted from Stormfront and classified as conveying hate speech or not.
107 PAPERS • 1 BENCHMARK
HSOL is a dataset for hate speech detection. The authors began with a hate speech lexicon containing words and phrases identified by internet users as hate speech, compiled by Hatebase.org. Using the Twitter API they searched for tweets containing terms from the lexicon, resulting in a sample of tweets from 33,458 Twitter users. They extracted the timeline for each user, resulting in a set of 85.4 million tweets. From this corpus they took a random sample of 25k tweets containing terms from the lexicon and had them manually coded by CrowdFlower (CF) workers. Workers were asked to label each tweet as one of three categories: hate speech, offensive but not hate speech, or neither offensive nor hate speech.
54 PAPERS • NO BENCHMARKS YET
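As a rough illustration of the HSOL sampling step described above (keep tweets that contain lexicon terms, then draw a random sample for manual coding), here is a minimal Python sketch. The function names, the whole-word matching rule, and the fixed seed are assumptions made for illustration; they are not taken from the authors' code.

```python
import random
import re

# The three categories workers chose from, as listed in the description.
LABELS = (
    "hate speech",
    "offensive but not hate speech",
    "neither offensive nor hate speech",
)


def contains_lexicon_term(text, lexicon):
    """Return True if any lexicon term appears in the text as a whole word or phrase.

    Case-insensitive whole-word matching is an assumption of this sketch.
    """
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
        for term in lexicon
    )


def sample_for_annotation(tweets, lexicon, k, seed=0):
    """Filter tweets by lexicon match, then draw up to k of them at random."""
    matching = [t for t in tweets if contains_lexicon_term(t, lexicon)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(matching, min(k, len(matching)))
```

With a placeholder lexicon such as `["badword"]`, calling `sample_for_annotation(tweets, ["badword"], k=25000)` would return at most 25,000 matching tweets ready to be sent to annotators.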
Covers multiple aspects of the issue. Each post in the dataset is annotated from three different perspectives: the basic, commonly used 3-class classification (i.e., hate, offensive or normal), the target community (i.e., the community that has been the victim of hate speech/offensive speech in the post), and the rationales, i.e., the portions of the post on which the annotators' labelling decision (as hate, offensive or normal) is based.
39 PAPERS • 1 BENCHMARK
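The three annotation perspectives in the entry above (class label, target community, rationales) map naturally onto a small record type. The following Python sketch shows one possible representation; the field names, the 0/1 token mask for rationales, and the majority-vote aggregation are assumptions for illustration and are not taken from the dataset's actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    label: str            # one of "hate", "offensive", "normal"
    targets: List[str]    # communities targeted in the post (hypothetical field)
    rationale: List[int]  # 0/1 mask over tokens marking the rationale span


@dataclass
class Post:
    tokens: List[str]
    annotations: List[Annotation]


def majority_label(post: Post) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, else None."""
    counts = Counter(a.label for a in post.annotations)
    if not counts:
        return None
    label, votes = counts.most_common(1)[0]
    return label if votes > len(post.annotations) / 2 else None


def rationale_tokens(post: Post, annotation: Annotation) -> List[str]:
    """Return the tokens that one annotator marked as the rationale."""
    return [tok for tok, flag in zip(post.tokens, annotation.rationale) if flag]
```

Keeping each annotator's label, targets, and rationale together makes it easy to try other aggregation rules later without changing the stored data.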
The Implicit Hate corpus is a dataset for hate speech detection with fine-grained labels for each message and its implication. This dataset contains 22,056 tweets from the most prominent extremist groups in the United States; 6,346 of these tweets contain implicit hate speech.
14 PAPERS • NO BENCHMARKS YET
ETHOS is a hate speech detection dataset. It is built from YouTube and Reddit comments validated through a crowdsourcing platform. It has two subsets, one for binary classification and the other for multi-label classification. The former contains 998 comments, while the latter contains fine-grained hate-speech annotations for 433 comments.
6 PAPERS • 2 BENCHMARKS
This dataset contains 33,400 annotated comments used for hate speech detection on social network sites. Labels: CLEAN (non-hate), OFFENSIVE, and HATE.
5 PAPERS • NO BENCHMARKS YET
Introduces three datasets, covering expressions of hate, commonly used topics, and opinions, for hate speech detection, document classification, and sentiment analysis, respectively.
4 PAPERS • NO BENCHMARKS YET
A corpus of Offensive Language and Hate Speech Detection for Danish
3 PAPERS • 1 BENCHMARK
Presents 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea.
3 PAPERS • NO BENCHMARKS YET
A large-scale and machine-generated dataset of 274,186 toxic and benign statements about 13 minority groups.
AraCOVID19-MFH is a manually annotated multi-label Arabic COVID-19 fake news and hate speech detection dataset. The dataset contains 10,828 Arabic tweets annotated with 10 different labels.
2 PAPERS • NO BENCHMARKS YET
HatemojiCheck is a test suite of 3,930 test cases for detecting emoji-based hate, covering seven functionalities of emoji-based hate and six identities.
A new multilingual, multi-aspect hate speech analysis dataset, used to test current state-of-the-art multilingual multitask learning approaches.
The Toxic Language Detection for Brazilian Portuguese (ToLD-Br) is a dataset with tweets in Brazilian Portuguese annotated according to different toxic aspects.
2 PAPERS • 1 BENCHMARK
The #chinahate dataset contains a total of 2,172,333 tweets hashtagged #china that were posted during the collection period. It is designed for the task of hate speech detection.
1 PAPER • NO BENCHMARKS YET
The dataset contains 7,601 Gab posts classified on three different aspects: abuse presence or not, abuse severity and abuse target.
At the end of 2017 the Civil Comments platform shut down and chose to make the ~2m public comments from its platform available in a lasting open archive so that researchers could understand and improve civility in online conversations for years to come. Jigsaw sponsored this effort and extended annotation of this data by human raters for various toxic conversational attributes.
1 PAPER • 1 BENCHMARK
CoRAL is a language and culturally aware Croatian Abusive dataset covering phenomena of implicitness and reliance on local and global context.
DeToxy is a publicly available toxicity-annotated dataset for the English language. DeToxy is sourced from various openly available speech databases and consists of over 2 million utterances. The dataset is intended to act as a benchmark for the relatively new and unexplored Spoken Language Processing task of detecting toxicity from spoken utterances and to boost further research in this space.
HERDPhobia is an annotated hate speech detection dataset on Fulani herders in Nigeria, in three languages: English, Nigerian-Pidgin, and Hausa.
HS-BAN is a binary-class hate speech (HS) dataset in the Bangla language consisting of more than 50,000 labeled comments, of which 40.17% are hate speech and the rest are non-hate speech.
Korean Multi-label Hate Speech Dataset
APEACH is the first crowd-generated Korean evaluation dataset for hate speech detection. Sentences of the dataset are created by anonymous participants using the online crowdsourcing platform DeepNatural AI.
A corpus of 9k German and French user comments collected from migration-related news articles. It goes beyond the hate-neutral dichotomy and is instead annotated with 23 features, which in combination become descriptors of various types of speech, ranging from critical comments to implicit and explicit expressions of hate. The annotations are performed by 4 native speakers per language and achieve high (0.77) inter-annotator agreement.
NJH is a dataset of over 40,000 tweets about immigration from the US and UK, annotated with six labels for different aspects of incivility and intolerance. It is a more fine-grained multi-label approach to predicting incivility and hateful or intolerant content.
Peer to Peer Hate is a comprehensive hate speech dataset capturing various types of hate. It has been built from 27,330 hate speech tweets.
This is an abusive/offensive language detection dataset for Albanian. The data is formatted following the OffensEval convention. Data is from Instagram and YouTube comments.
The Sina Weibo Sexism Review (SWSR) dataset is a dataset to research online sexism in Chinese. The SWSR dataset provides labels at different levels of granularity including (i) sexism or non-sexism, (ii) sexism category and (iii) target type, which can be exploited, among others, for building computational methods to identify and investigate finer-grained gender-related abusive language.
The ComMA Dataset v0.2 is a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the "context" in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also the "type" of discursive role that the comment is performing with respect to the previous comment. The initial dataset, being discussed here (and made available as part of the ComMA@ICON shared task), consists of a total of 15,000 annotated comments in four languages - Meitei, Bangla, Hindi, and Indian English - collected from various social media platforms such as YouTube, Facebook, Twitter and Telegram. As is usual on social media websites, a large number of these comments are multilingual, mostly code-mixed with English.
This is a high-quality dataset of Danish-language posts sampled from social media and annotated for misogyny.