News Interactions on Globo.com Dataset

### Context

This large dataset of user interaction logs (page views) from a news portal was kindly provided by [Globo.com][1], the most popular news portal in Brazil, for reproducibility of the experiments with CHAMELEON, a meta-architecture for contextual hybrid session-based news recommender systems. The source code is available on [GitHub][2].

The **first version (v1)** ([download][13]) of this dataset was released for reproducibility of the experiments presented in the following paper:

> Gabriel de Souza Pereira Moreira, Felipe Ferreira, and Adilson Marques da Cunha. 2018. [News Session-Based Recommendations using Deep Neural Networks][3]. In [3rd Workshop on Deep Learning for Recommender Systems (DLRS 2018)][4], October 6, 2018, Vancouver, BC, Canada. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3270323.3270328

A **second version (v2)** ([download][14]) of this dataset was made available for reproducibility of the experiments presented in the paper cited after the list. Compared to v1, the only differences are:

* Four additional user contextual attributes were included (`click_os`, `click_country`, `click_region`, `click_referrer_type`).
* Repeated clicks (clicks on the same article) within a session were removed. Sessions with fewer than two clicks (the minimum for the next-click prediction task) were also removed.

> Gabriel de Souza Pereira Moreira, Dietmar Jannach, and Adilson Marques da Cunha. 2019. [Contextual Hybrid Session-based News Recommendation with Recurrent Neural Networks][15]. arXiv preprint arXiv:1904.10367, 49 pages.
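
The v2 filtering described in the list above can be illustrated with a minimal sketch. This is not the authors' preprocessing code: the function name, the `(session_id, article_id)` tuple layout, and the choice to keep the first occurrence of a repeated click are all assumptions made for illustration.

```python
from collections import defaultdict


def drop_repeats_and_short_sessions(clicks, min_clicks=2):
    """Return {session_id: [article_id, ...]} after the two filtering steps.

    `clicks` is an iterable of (session_id, article_id) pairs in
    chronological order (a hypothetical layout, not the CSV schema).
    """
    sessions = defaultdict(list)
    for session_id, article_id in clicks:
        # Step 1: keep only the first click on each article within a session
        # (assumption: the source does not say which occurrence was kept).
        if article_id not in sessions[session_id]:
            sessions[session_id].append(article_id)
    # Step 2: drop sessions too short for next-click prediction.
    return {s: a for s, a in sessions.items() if len(a) >= min_clicks}


# Toy data: session 1 has one repeated click, sessions 2 and 3 end up too short.
toy_clicks = [(1, 10), (1, 10), (1, 11), (2, 20), (2, 20), (3, 30)]
filtered = drop_repeats_and_short_sessions(toy_clicks)
```

On the toy data only session 1 survives, with articles 10 and 11.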

This dataset may not be used for commercial purposes; it is available for academic use only (such as education or research). 
**If used for research, please cite the above papers.**

### Content
The dataset contains a sample of user interactions (page views) on the [G1 news portal][5] from Oct. 1 to 16, 2017, including about 3 million clicks distributed across more than 1 million sessions from 314,000 users, who read more than 46,000 different news articles during that period.

It is composed of three files/folders:

- **clicks.zip** - Folder with CSV files (one per hour) containing user session interactions on the news portal.
- **articles_metadata.csv** - CSV file with metadata about all 364,047 published articles.
- **articles_embeddings.pickle** - Pickle (Python 3) of a NumPy matrix containing the Article Content Embeddings (250-dimensional vectors), trained on the articles' text and metadata by CHAMELEON's ACR module (see the [paper][6] for details) for the 364,047 published articles.  
  P.S. The full text of the news articles could not be provided due to license restrictions, but these embeddings can be used by neural networks to represent their content. See this [paper][7] for a t-SNE visualization of the embeddings, colored by category.
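
As a rough sketch of how these files might be read, the snippet below uses only the Python standard library. The helper names, the member file names inside the toy archive, and the column names `user_id`, `session_id`, and `click_article_id` are assumptions for illustration; check the actual CSV headers before relying on them. Unpickling the real embeddings file requires NumPy to be installed, and pickles should only be loaded from trusted sources.

```python
import csv
import io
import pickle
import zipfile


def iter_click_rows(zip_file):
    """Yield one dict per click from every CSV member of the archive."""
    with zipfile.ZipFile(zip_file) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith(".csv"):
                continue  # skip directory entries and non-CSV members
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                yield from csv.DictReader(text)


def load_embeddings(pickle_file):
    """Unpickle the embeddings matrix from an open binary file object."""
    return pickle.load(pickle_file)


# Toy archive standing in for clicks.zip (hypothetical names and columns).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("clicks/hour_000.csv",
                     "user_id,session_id,click_article_id\n1,100,42\n")
    archive.writestr("clicks/hour_001.csv",
                     "user_id,session_id,click_article_id\n2,101,7\n2,101,42\n")
buffer.seek(0)
rows = list(iter_click_rows(buffer))

# Toy stand-in for the embeddings pickle: one 250-dimensional row.
toy_matrix = load_embeddings(io.BytesIO(pickle.dumps([[0.0] * 250])))
```

With the real files, `iter_click_rows("clicks.zip")` and `load_embeddings(open("articles_embeddings.pickle", "rb"))` would be the corresponding calls. `csv.DictReader` returns every value as a string, so IDs need explicit conversion before being used as row indices into the embeddings matrix.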

### Acknowledgements

I would like to thank [Globo.com][8] for providing this dataset for this research and for the academic community, and especially [Felipe Ferreira][9] for preparing the original Globo.com dataset.

*Dataset banner photo by [rawpixel][10] on [Unsplash][11]*

### Inspiration

This dataset may be very useful if you want to implement and evaluate hybrid and contextual news recommender systems that use both user interactions and article content and metadata to provide recommendations. You might also use it for analytics, for example to understand how user interactions on a news portal are distributed by user, by article, or by category.

If you are interested in a dataset of user interactions on articles with the full text provided, to experiment with different text representations using NLP, you might want to take a look at this smaller [dataset][12].
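
The kind of analytics mentioned above could start from simple counts. The following is a minimal sketch on toy rows. The column names `user_id` and `click_article_id`, the toy category labels, and the idea of mapping articles to categories through a lookup built from the metadata file are assumptions for illustration.

```python
from collections import Counter


def interaction_counts(rows, key):
    """Count interactions grouped by the given column of each row dict."""
    return Counter(row[key] for row in rows)


def counts_by_category(rows, article_to_category, article_key="click_article_id"):
    """Count interactions per category using an article -> category lookup."""
    return Counter(
        article_to_category.get(row[article_key], "unknown") for row in rows
    )


# Toy click rows (hypothetical column names and values).
toy_rows = [
    {"user_id": "1", "click_article_id": "42"},
    {"user_id": "2", "click_article_id": "7"},
    {"user_id": "2", "click_article_id": "42"},
]
# Toy lookup standing in for one built from articles_metadata.csv.
toy_categories = {"42": "c1", "7": "c2"}

per_user = interaction_counts(toy_rows, "user_id")
per_article = interaction_counts(toy_rows, "click_article_id")
per_category = counts_by_category(toy_rows, toy_categories)
```

`Counter.most_common(n)` then gives the most active users or most read articles directly.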

[1]: https://www.globo.com/
  [2]: https://github.com/gabrielspmoreira/chameleon_recsys
  [3]: https://arxiv.org/abs/1808.00076
  [4]: https://recsys.acm.org/recsys18/dlrs/
  [5]: http://g1.com.br/
  [6]: https://arxiv.org/abs/1808.00076
  [7]: https://arxiv.org/abs/1808.00076
  [8]: https://www.globo.com/
  [9]: https://www.linkedin.com/in/feliferr/
  [10]: https://unsplash.com/photos/O7lbegmDGEw?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText
  [11]: https://unsplash.com
  [12]: https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop
  [13]: https://www.kaggle.com/gspmoreira/news-portal-user-interactions-by-globocom/downloads/news-portal-user-interactions-by-globocom.zip/1
  [14]: https://www.kaggle.com/gspmoreira/news-portal-user-interactions-by-globocom/downloads/news-portal-user-interactions-by-globocom.zip/2
  [15]: https://arxiv.org/abs/1904.10367

---

News Interactions on Globo.com (News Portal User Interactions by Globo.com - A large dataset for news recommendations offline evaluation and analytics)
