News Interactions on Globo.com (News Portal User Interactions by Globo.com - A large dataset for news recommendations offline evaluation and analytics)

Introduced by Moreira et al. in Contextual Hybrid Session-based News Recommendation with Recurrent Neural Networks

Context

This large dataset of user interaction logs (page views) from a news portal was kindly provided by Globo.com, the most popular news portal in Brazil, for reproducibility of the experiments with CHAMELEON, a meta-architecture for contextual hybrid session-based news recommender systems. The source code is available on GitHub.

The first version (v1) (download) of this dataset was released for reproducibility of the experiments presented in the following paper:

> Gabriel de Souza Pereira Moreira, Felipe Ferreira, and Adilson Marques da Cunha. 2018. News Session-Based Recommendations using Deep Neural Networks. In 3rd Workshop on Deep Learning for Recommender Systems (DLRS 2018), October 6, 2018, Vancouver, BC, Canada. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3270323.3270328

A second version (v2) (download) of this dataset was made available for reproducibility of the experiments presented in the following paper:

> Gabriel de Souza Pereira Moreira, Dietmar Jannach, and Adilson Marques da Cunha. 2019. Contextual Hybrid Session-based News Recommendation with Recurrent Neural Networks. arXiv preprint arXiv:1904.10367, 49 pages.

Compared to v1, the only differences are:

  • Included four additional user contextual attributes (click_os, click_country, click_region, click_referrer_type).
  • Removed repeated clicks (clicks on the same article) within sessions. Sessions with fewer than two clicks (the minimum for the next-click prediction task) were removed.
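The v2 cleaning rule (drop repeated clicks within a session, drop sessions with fewer than two clicks) can be illustrated with a minimal sketch. This is not the authors' preprocessing code. It assumes clicks arrive as chronologically ordered `(session_id, article_id)` pairs, that the first click on an article is the one kept, and that the session-length filter is applied after deduplication. The function name `dedupe_and_filter` is hypothetical.

```python
def dedupe_and_filter(clicks, min_clicks=2):
    """Group clicks by session, drop repeated articles, drop short sessions.

    clicks: iterable of (session_id, article_id) pairs, assumed to be in
    chronological order. Keeping the first occurrence of a repeated article
    is an assumption; the dataset description does not specify it.
    """
    sessions = {}
    for session_id, article_id in clicks:
        articles = sessions.setdefault(session_id, [])
        if article_id not in articles:  # skip repeated clicks on the same article
            articles.append(article_id)
    # Next-click prediction needs at least one input click and one target click.
    return {sid: arts for sid, arts in sessions.items() if len(arts) >= min_clicks}


if __name__ == "__main__":
    toy_clicks = [("s1", 10), ("s1", 10), ("s1", 11), ("s2", 5), ("s2", 5)]
    print(dedupe_and_filter(toy_clicks))
```

In the toy input, session `s2` contains only repeated clicks on one article, so it falls below the two-click minimum after deduplication and is dropped.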

This dataset may not be used for commercial purposes; it may be used only for academic purposes (such as education or research). If used for research, please cite the above papers.

Content

The dataset contains a sample of user interactions (page views) on the G1 news portal from Oct. 1 to 16, 2017, including about 3 million clicks across more than 1 million sessions from 314,000 users, who read more than 46,000 different news articles during that period.

It is composed of three files/folders:

  • clicks.zip - Folder with CSV files (one per hour), containing user session interactions in the news portal.
  • articles_metadata.csv - CSV file with metadata about all (364047) published articles.
  • articles_embeddings.pickle - Pickle (Python 3) of a NumPy matrix containing the Article Content Embeddings (250-dimensional vectors), trained on the articles' text and metadata by CHAMELEON's ACR module (see paper for details) for 364047 published articles.
    P.S. The full text of the news articles could not be provided due to license restrictions, but these embeddings can be used by neural networks to represent their content. See this paper for a t-SNE visualization of these embeddings, colored by category.
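A minimal sketch of how the files described above might be loaded with pandas and NumPy. The helper names `load_clicks` and `load_embeddings` are hypothetical, as are the directory and file paths in the usage note; the sketch assumes clicks.zip has already been extracted to a folder of `.csv` files. Only the 250-dimensional embedding width comes from the description above.

```python
import glob
import os
import pickle

import numpy as np
import pandas as pd


def load_clicks(clicks_dir):
    """Concatenate the hourly click CSV files found in clicks_dir.

    Files are read in sorted filename order; that this matches
    chronological order is an assumption about the file naming.
    """
    paths = sorted(glob.glob(os.path.join(clicks_dir, "*.csv")))
    if not paths:
        raise FileNotFoundError(f"no CSV files found in {clicks_dir}")
    return pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)


def load_embeddings(pickle_path, expected_dim=250):
    """Load the pickled embedding matrix and check its width.

    Only unpickle files from a source you trust: pickle can run code.
    """
    with open(pickle_path, "rb") as f:
        matrix = np.asarray(pickle.load(f))
    if matrix.ndim != 2 or matrix.shape[1] != expected_dim:
        raise ValueError(f"unexpected embedding shape {matrix.shape}")
    return matrix
```

Usage might look like `clicks = load_clicks("clicks")` and `embeddings = load_embeddings("articles_embeddings.pickle")`. If the rows of the matrix are ordered by article id (an assumption worth checking against articles_metadata.csv), `embeddings[article_id]` would return the vector for one article.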

Acknowledgements

I would like to thank Globo.com for providing this dataset for this research and for the academic community, and in particular Felipe Ferreira, who prepared the original Globo.com dataset.


Inspiration

This dataset might be very useful if you want to implement and evaluate hybrid and contextual news recommender systems that use both user interactions and article content and metadata to provide recommendations. You might also use it for analytics, for example to understand how user interactions in a news portal are distributed by user, by article, or by category.
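As a small illustration of the analytics use case, the sketch below counts clicks per value of a chosen column and measures how concentrated they are. The helper names are hypothetical, and the column names `user_id` and `click_article_id` in the toy rows are assumptions about the click files, not taken from the description above.

```python
from collections import Counter


def click_distribution(rows, key):
    """Count clicks per distinct value of `key` (e.g. per user or per article)."""
    return Counter(row[key] for row in rows)


def top_share(counts, k):
    """Fraction of all clicks that go to the k most-clicked values."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


if __name__ == "__main__":
    # Toy rows standing in for records read from the click CSV files.
    toy_rows = [
        {"user_id": 1, "click_article_id": 10},
        {"user_id": 2, "click_article_id": 10},
        {"user_id": 2, "click_article_id": 11},
        {"user_id": 3, "click_article_id": 10},
    ]
    per_article = click_distribution(toy_rows, "click_article_id")
    print(per_article.most_common(1), top_share(per_article, 1))
```

The same two functions work for any grouping key, so passing `"user_id"` instead gives the distribution of clicks by user.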

If you are interested in a dataset of user interactions on articles with the full text provided, for example to experiment with different text representations using NLP, you might want to take a look at this smaller dataset.
