RETWEET is a dataset of tweets and overall predominant sentiment of their replies.
WHAT: Message-level Polarity Classification.
GOAL: To predict the predominant sentiment among (potential) first-order replies to a given tweet.
IDEA: Mitigate the problem of lacking labeled training data wi treating the unsupervised nature of the problem as a supervised learning case.
To download all of the replies to a tweet, the Search API should be used. However, the Search API is limited to 75000 requests per hour, which causes the mining and downloading process to be slow. Furthermore, using the Twitter API, there is no possibility of downloading absolute random data. Therefore, we try to make the procedure as random as possible by utilizing two different strategies for data downloading and using them in an intermixed manner.
Our first strategy is based on a sample of English tweets obtained by filtering the Twitter stream via a list of cultural keywords. This list consists of 147 words that are deemed to play a "pivotal role in discussions of culture and society", covering diverse words such as aesthetics, environment, feminism, power, tourism, or youth. We extracted all tweets in 2019 that have a minimum of 20 first-order replies in the dataset. The data come with an obvious caveat: Both the source tweet as well as all the replies must contain at least one word from the list of keywords. Therewith, it is highly unlikely that the list of replies for any given source is exhaustive, i.e. there might be many more first-order replies to the source tweet that are not in the dataset.
As our second approach, we use the GetOldTweets3 library to download all the replies corresponding to every tweet. We define few restrictions to add randomization to the process. Firstly, every tweet and also every reply should contain at least 20 strings. This is due to the fact that our automatic tweet classifier, explsined in the paper, is optimized based on the message-level classification paradigm. Therefore, it operates optimal when the input contains at least a sufficient number of words. The second constraint is that every tweet should contain at least 20 first-order replies. In order to increase randomness, in this strategy, instead of referencing to a list of keywords, we manually choose some keywords, which are most likely to include long discussions, such as Coronavirus and football or the ones, which are most likely to include strong opinions such as birthday, war, or racism in order to account for the easy-to-guess examples.
5,015 tweets with their corresponding replies, collected as a combination of the two different collection strategies, were given to three different students. Each of them had to read all the replies corresponding to every tweet, without observing the original tweet in order to avoid having a prior knowledge, and decide on ONE final sentiment for the replies. The assigned sentiment can only be one of the positive, negative, or neutral labels.
Considering the fact that this is a really challenging task for the machine, to prevent human mistakes, we correlated the results of the three annotators and only chose the tweets, in which all of the annotators had the same opinion on the labels, as the final gold standard test data. Therefore, we finally, ended up with a test set consisting of 1,519 human labeled tweets, with the labels being the sentiment of the replies of a tweet and not the tweet itself.
1. Training raw dataset: 34,953 unique tweets in total and individual automatic labels for all of their corresponding replies (1,519,504 total replies). Including,
./RETWEET_data/train_reply_labels_set1.txt
./RETWEET_data/train_reply_labels_set2.txt
2. Training autamtically-labeled dataset: 34,953 unique tweets and ONE final automatic label (chosen based on the algorithm 1 of our paper) for every tweet. Including,
./RETWEET_data/train_final_label.txt
3. Gold standard test dataset (RETWEET): 1,519 unique tweets with their manual labels for replies. ONE final label, which states the predominant overall polarity of all its replies, is assigned to every tweet. Including,
./RETWEET_data/test_gold.txt
Please note that by downloading the Twitter data you agree to abide by the Twitter terms of service, and in particular you agree not to redistribute the data and to delete tweets that are marked deleted in the future.
The "neutral" label in the annotations stands for objective or neutral.
The distribution consists of a set of Twitter unique tweet IDs with annotations (overall polarity of replies). As for data privacy, the texts of the tweets and replies are not distributed. But as all the utilized resources in this dataset are taken from public tweets, having the tweet unique IDs, you can download the tweet and its replies. You can use the Semeval Twitter data downloading script to obtain the corresponding tweets:
https://github.com/seirasto/twitter_download/
The dataset URL:
https://kaggle.com/soroosharasteh/retweet/
The accompanying dataset is released under a Creative Commons Attribution 4.0 International License.
The official source code of the paper: https://github.com/starasteh/retweet
S. Tayebi Arasteh, M. Monajem, V. Christlein, P. Heinrich, A. Nicolaou, H.N. Boldaji, M. Lotfinia, S. Evert. "How Will Your Tweet Be Received? Predicting the Sentiment Polarity of Tweet Replies". Proceedings of the 2021 IEEE 15th International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, January 2021.
@inproceedings{RETWEET,
title = "How Will Your Tweet Be Received? Predicting the Sentiment Polarity of Tweet Replies",
author = "Tayebi Arasteh, Soroosh and Monajem, Mehrpad and Christlein, Vincent and
Heinrich, Philipp and Nicolaou, Anguelos and Naderi Boldaji, Hamidreza and Lotfinia, Mahshad and Evert, Stefan",
booktitle = "Proceedings of the 2021 IEEE 15th International Conference on Semantic Computing (ICSC)",
address = "Laguna Hills, CA, USA",
pages = "370-373",
doi = "10.1109/ICSC50631.2021.00068",
url = "https://ieeexplore.ieee.org/document/9364527/",
month = "01",
year = "2021"
}
E-mail: soroosh.arasteh@fau.de
label TAB id
where, "label" can be positive, neutral or negative, corresponding to the overall message-level polarity of the replies of the tweet and "id" corresponds to the Twitter unique ID for the tweets.
Paper | Code | Results | Date | Stars |
---|