Sentiment detection remains a pivotal task in natural language processing, yet its development in Arabic lags due to a scarcity of training materials compared to English. Addressing this gap, we present ArSen-20, a benchmark dataset tailored to propel Arabic sentiment detection forward. ArSen-20 comprises 20,000 professionally labeled tweets sourced from Twitter, focusing on the theme of COVID-19 and spanning the period from 2020 to 2023. Beyond tweet content, the dataset incorporates metadata associated with the user, enriching the contextual understanding. ArSen-20 offers a comprehensive resource to foster advancements in Arabic sentiment analysis and facilitate research in this critical domain.
1 PAPER • NO BENCHMARKS YET
L3Cube-MahaCorpus is a Marathi monolingual data set scraped from different internet sources. We expand the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens. We also present, MahaBERT, MahaAlBERT, and MahaRoBerta all BERT-based masked language models, and MahaFT, the fast text word embeddings both trained on full Marathi corpus with 752M tokens.
This dataset contains news headlines relevant to key forex pairs: AUDUSD, EURCHF, EURUSD, GBPUSD, and USDJPY. The data was extracted from reputable platforms Forex Live and FXstreet over a period of 86 days, from January to May 2023. The dataset comprises 2,291 unique news headlines. Each headline includes an associated forex pair, timestamp, source, author, URL, and the corresponding article text. Data was collected using web scraping techniques executed via a custom service on a virtual machine. This service periodically retrieves the latest news for a specified forex pair (ticker) from each platform, parsing all available information. The collected data is then processed to extract details such as the article's timestamp, author, and URL. The URL is further used to retrieve the full text of each article. This data acquisition process repeats approximately every 15 minutes.
In AISIA-VN-Review-S and AISIA-VN-Review-F datasets, we first collect 450K customer reviewing comments from various e–commerce websites. Then, we manually label each review to be either positive or negative, resulting in 358,743 positive reviews and 100,699 negative reviews. We named this dataset the sentiment classification from reviews collected by AISIA, the full version (AISIA-VN-Review-F). However, in this work, we are interested in improving the model’s performance when the training data are limited; thus, we only consider a subset of up to 25K training reviews and evaluate the model on another 170K reviews. We refer to this subset from the full dataset as AISIA-VN-Review-S. It is important to emphasize that our team spends a lot of time and effort to manually classify each review into positive or negative sentiments.
Fallout New Vegas Dialog is a multilingual sentiment annotated dialog dataset from Fallout New Vegas. The game developers have preannotated every line of dialog in the game in one of the 8 different sentiments: anger, disgust, fear, happy, neutral, pained, sad and surprised and they have been translated into 5 different languages: English, Spanish, German, French and Italian.
LEPISZCZE is an open-source comprehensive benchmark for Polish NLP and a continuous-submission leaderboard, concentrating public Polish datasets (existing and new) in specific tasks.
2 PAPERS • NO BENCHMARKS YET
Financial Language Understanding Evaluation is an open-source comprehensive suite of benchmarks for the financial domain. It contains benchmarks across 5 NLP tasks in financial domain as well as common benchmarks used in the previous research. The tasks are financial sentiment analysis, news headline classification, named entity recognition, structure boundary detection and question answering.
A dataset of games played in the card game "Cards Against Humanity" (CAH), by human players, derived from the online CAH labs. Each round includes the cards presented to users - a "black" prompt with a blank or question and 10 "white" punchlines as possible responses, and which punchline was picked by a player each round, along with text and metadata.
Text Classification Attack Benchmark (TCAB) is a dataset for analyzing, understanding, detecting, and labeling adversarial attacks against text classifiers. TCAB includes 1.5 million attack instances, generated by twelve adversarial attack targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English. The process of generating attacks is automated, so that TCAB can easily be extended to incorporate new text attacks and better classifiers as they are developed.
3 PAPERS • NO BENCHMARKS YET
Capriccio is a sentiment classification dataset on tweets that simulates data drift. It is created by slicing the Sentiment140 dataset (homepage, Huggingface datasets) with a sliding window of 500,000 tweets, resulting in 38 slices. Thus, each slice can be used to represent the training/validation dataset of a sentiment classification model that is re-trained every day. Each slice has 425,000 tweets for training (file named %d_train.json) and 75,000 tweets for validation (file named %d_val.json).
Moral Foundations Reddit Corpus (MFRC) is a collection of 16,123 Reddit comments that have been curated from 12 distinct subreddits, hand-annotated by at least three trained annotators for 8 categories of moral sentiment (i.e., Care, Proportionality, Equality, Purity, Authority, Loyalty, Thin Morality, Implicit/Explicit Morality) based on the updated Moral Foundations Theory (MFT) framework.
4 PAPERS • NO BENCHMARKS YET
The peer-reviewed paper of AWARE dataset is published in ASEW 2021, and can be accessed through: http://doi.org/10.1109/ASEW52652.2021.00049. Kindly cite this paper when using AWARE dataset.
1 PAPER • 3 BENCHMARKS
SentimentArcs’ reference corpus for novels consists of 25 narratives selected to create a diverse set of well recognized novels that can serve as a benchmark for future studies. The composition of the corpora was limited by the effect of copyright laws as well as historical imbalances. Most works were obtained from US and Australian Gutenberg Projects. The corpora is expected to grow in size and diversity over time.
TBCOV is a large-scale Twitter dataset comprising more than two billion multilingual tweets related to the COVID-19 pandemic collected worldwide over a continuous period of more than one year. Several state-of-the-art deep learning models are used to enrich the data with important attributes, including sentiment labels, named-entities (e.g., mentions of persons, organizations, locations), user types, and gender information. A geotagging method is proposed to assign country, state, county, and city information to tweets, enabling a myriad of data analysis tasks to understand real-world issues.
SEN is a novel publicly available human-labelled dataset for training and testing machine learning algorithms for the problem of entity level sentiment analysis of political news headlines.
0 PAPER • NO BENCHMARKS YET
A Bambara dialectal dataset dedicated for Sentiment Analysis, available freely for Natural Language Processing research purposes
Laptop-ACOS is a brand new Laptop dataset collected from the Amazon platform in the years 2017 and 2018 (covering ten types of laptops under six brands such as ASUS, Acer, Samsung, Lenovo, MBP, MSI, and so on). It contains 4,076 review sentences, much larger than the SemEval Laptop datasets. For Laptop-ACOS, we annotate the four elements and their corresponding quadruples all by ourselves. We employ the aspect categories defined in the SemEval 2016 Laptop dataset. The Laptop-ACOS dataset contains 4076 sentences with 5758 quadruples. As we have mentioned, a large percentage of the quadruples contain implicit aspects or implicit opinions . By comparing two datasets, it can be observed that Laptop-ACOS has a higher percentage of implicit opinions than Restaurant-ACOS . It is worth noting that the Laptop-ACOS is available for all subtasks in ABSA, including aspect-based sentiment classification, aspect-sentiment pair extraction, aspect-opinion pair extraction, aspect-opinion sentiment tri
4 PAPERS • 1 BENCHMARK
The Restaurant-ACOS dataset is constructed based on the SemEval 2016 Restaurant dataset (Pontiki et al., 2016) and its expansion datasets (Fan et al., 2019; Xu et al., 2020). The SemEval 2016 Restaurant dataset (Pontiki et al., 2016) was annotated with explicit and implicit aspects, categories, and sentiment. (Fan et al., 2019; Xu et al., 2020) further added the opinion annotations. We integrate their annotations to construct aspect-category-opinion-sentiment quadruples and further annotate the implicit opinions. The Restaurant-ACOS dataset contains 2286 sentences with 3658 quadruples. It is worth noting that the Restaurant-ACOS is available for all subtasks in ABSA, including aspect-based sentiment classification, aspect-sentiment pair extraction, aspect-opinion pair extraction, aspect-opinion sentiment triple extraction, aspect-category-sentiment triple extraction, etc.
ReactionGIF is an affective dataset of 30K tweets which can be used for tasks like induced sentiment prediction and multilabel classification of induced emotions.
RETWEET is a dataset of tweets and overall predominant sentiment of their replies.
2 PAPERS • 1 BENCHMARK
ArSarcasm-v2 is an extension of the original ArSarcasm dataset published along with the paper From Arabic Sentiment Analysis to Sarcasm Detection: The ArSarcasm Dataset. ArSarcasm-v2 conisists of ArSarcasm along with portions of DAICT corpus and some new tweets. Each tweet was annotated for sarcasm, sentiment and dialect. The final dataset consists of 15,548 tweets divided into 12,548 training tweets and 3,000 testing tweets. ArSarcasm-v2 was used and released as a part of the shared task on sarcasm detection and sentiment analysis in Arabic.
14 PAPERS • NO BENCHMARKS YET
L3CubeMahaSent is a large publicly available Marathi Sentiment Analysis dataset. It consists of marathi tweets which are manually labelled.
7 PAPERS • NO BENCHMARKS YET
BanglaEmotion is a manually annotated Bangla Emotion corpus, which incorporates the diversity of fine-grained emotion expressions in social-media text. More fine-grained emotion labels are considered such as Sadness, Happiness, Disgust, Surprise, Fear and Anger - which are, according to Paul Ekman (1999), the six basic emotion categories. For this task, a large amount of raw text data are collected from the user’s comments on two different Facebook groups (Ekattor TV and Airport Magistrates) and from the public post of a popular blogger and activist Dr. Imran H Sarker. These comments are mostly reactions to ongoing socio-political issues and towards the economic success and failure of Bangladesh. A total of 32923 comments are scraped from the three sources aforementioned above. Out of these, a total of 6314 comments were annotated into the six categories. The distribution of the annotated corpus is as follows:
The Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE (SILICONE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems specifically designed for spoken language. All datasets are in the English language and covers a large variety of domains (e.g daily life, scripted scenarios, joint task completion, phone call conversations, and televsion dialogue). Some datasets additionally include emotion and/or sentiment labels.
1 PAPER • 1 BENCHMARK
Sentiment analysis of codemixed tweets.
27 PAPERS • NO BENCHMARKS YET
A multimodal dataset for sentiment analysis on internet memes.
6 PAPERS • NO BENCHMARKS YET
CH-SIMS is a Chinese single- and multimodal sentiment analysis dataset which contains 2,281 refined video segments in the wild with both multimodal and independent unimodal annotations. It allows researchers to study the interaction between modalities or use independent unimodal annotations for unimodal sentiment analysis.
13 PAPERS • 1 BENCHMARK
The social vision and language dataset is a large-scale multimodal dataset designed for research into social contextual learning.
ArSarcasm is a new Arabic sarcasm detection dataset. The dataset was created using previously available Arabic sentiment analysis datasets (SemEval 2017 and ASTD) and adds sarcasm and dialect labels to them. The dataset contains 10,547 tweets, 1,682 (16%) of which are sarcastic.
12 PAPERS • NO BENCHMARKS YET
The SemEval-2013 Task 2 dataset contains data for two subtasks: A, an expression-level subtask, and B, a message-level subtask. Crowdsourcing was used to label a large Twitter training dataset along with additional test sets of Twitter and SMS messages for both subtasks.
30 PAPERS • NO BENCHMARKS YET
iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic. Each sarcastic tweet is further labelled for one of the following types of ironic speech:
17 PAPERS • 1 BENCHMARK
The SPOT dataset contains 197 reviews originating from the Yelp'13 and IMDB collections (1), annotated with segment-level polarity labels (positive/neutral/negative). Annotations have been gathered on 2 levels of granulatiry:
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges.
2,016 PAPERS • 9 BENCHMARKS
The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The dataset contains additional unlabeled data.
1,579 PAPERS • 11 BENCHMARKS
The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from many product types (domains). Some domains (books and dvds) have hundreds of thousands of reviews. Others (musical instruments) have only a few hundred. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed.
53 PAPERS • 1 BENCHMARK
The MPQA Opinion Corpus contains 535 news articles from a wide variety of news sources manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.).
301 PAPERS • 3 BENCHMARKS
MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars") and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.
27 PAPERS • 3 BENCHMARKS
Antonio Gulli’s corpus of news articles is a collection of more than 1 million news articles. The articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic comunity for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non - commercial activity.
The Arabic Sentiment Twitter Dataset for the Levantine dialect (ArSenTD-LEV) is a dataset of 4,000 tweets with the following annotations: the overall sentiment of the tweet, the target to which the sentiment was expressed, how the sentiment was expressed, and the topic of the tweet.
8 PAPERS • NO BENCHMARKS YET
The Covid19-CountryImage dataset is a Twitter dataset which contains COVID-19-related tweets.
FinnSentiment introduces a 27,000 sentence dataset (in Finnish) annotated independently with sentiment polarity by three native annotators.
GeoCoV19 is a large-scale Twitter dataset containing more than 524 million multilingual tweets. The dataset contains around 378K geotagged tweets and 5.4 million tweets with Place information. The annotations include toponyms from the user location field and tweet content and resolve them to geolocations such as country, state, or city level. In this case, 297 million tweets are annotated with geolocation using the user location field and 452 million tweets using tweet content.
HappyDB is a corpus of 100,000 crowdsourced happy moments.
9 PAPERS • NO BENCHMARKS YET
LABR is a large sentiment analysis dataset to-date for the Arabic language. It consists of over 63,000 book reviews, each rated on a scale of 1 to 5 stars.
16 PAPERS • 1 BENCHMARK
Large Scale Informal Chinese Corpus (LSICC) is a large-scale corpus of informal Chinese. This corpus contains around 37 million book reviews and 50 thousand netizen's comments to the news.
MalayalamMixSentiment is a Sentiment Analysis Dataset for Code-Mixed Malayalam-English.
Modern Hebrew Sentiment Dataset is a sentiment analysis benchmark for Hebrew, based on 12K social media comments, and provide two instances of these data: in token-based and morpheme-based settings.
The Norwegian Review Corpus (NoReC) was created for the purpose of training and evaluating models for document-level sentiment analysis. More than 43,000 full-text reviews have been collected from major Norwegian news sources and cover a range of different domains, including literature, movies, video games, restaurants, music and theater, in addition to product reviews across a range of categories. Each review is labeled with a manually assigned score of 1–6, as provided by the rating of the original author.