RepLab 2013

The RepLab 2013 dataset consists of more than 142,000 tweets in English and Spanish. The balance between the two languages depends on the availability of data for each entity in the dataset. The tweets refer to a selected set of 61 entities from four domains: automotive, banking, universities and music/artists. The domains were chosen to offer a variety of scenarios for reputation studies.

Crawling was performed from 1 June 2012 to 31 December 2012, using each entity’s canonical name as the query. For each entity, at least 2,200 tweets were collected: at least the first 700 tweets in the timeline are used as the training set, and at least the last 1,500 tweets are reserved for the test set. This distribution was chosen to obtain a temporal separation (ideally of several months) between the training and test data. The corpus also comprises additional background tweets for each entity (up to 50,000 tweets, with large variability across entities).
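The per-entity temporal split described above can be sketched as follows. This is only an illustration, not the organizers' procedure: the function name `temporal_split`, the representation of a tweet as a `(tweet_id, timestamp)` pair, and the choice to leave any tweets between the two sets unassigned are all assumptions made for this example.

```python
from datetime import datetime, timedelta


def temporal_split(tweets, n_train=700, n_test=1500):
    """Split one entity's tweets into an early training set and a late test set.

    `tweets` is an iterable of (tweet_id, timestamp) pairs (a hypothetical
    representation). The earliest `n_train` tweets form the training set and
    the latest `n_test` tweets form the test set. Tweets in between are left
    out of both sets, which is an assumption of this sketch.
    """
    ordered = sorted(tweets, key=lambda pair: pair[1])  # oldest first
    if len(ordered) < n_train + n_test:
        raise ValueError("not enough tweets for a non-overlapping split")
    train = ordered[:n_train]
    test = ordered[-n_test:]
    return train, test


# Synthetic example: 2,300 tweets, one per hour, starting on 1 June 2012.
start = datetime(2012, 6, 1)
example = [(i, start + timedelta(hours=i)) for i in range(2300)]
train_set, test_set = temporal_split(example)
```

Because the list is sorted by timestamp before slicing, every training tweet is older than every test tweet, which is the property the split is meant to guarantee.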

Note that the final number of available tweets in these sets may be lower, since some posts may have been deleted by their users. To respect Twitter’s terms of service, we do not provide the contents of the tweets; the tweet identifiers can be used to retrieve the texts of the posts. We provide a download tool similar to the mechanism used in the TREC Microblog Track in 2011 and 2012.

For more information, please refer to the RepLab 2013 overview paper.

License


  • Unknown
