A scholarly data set containing the full text of publications, annotated in-text citations, and links to metadata.

The unarXive data set contains

  • One million papers in plain text
  • 63 million citation contexts
  • 39 million reference strings
  • A citation network of 16 million connections

The data is generated from all LaTeX sources on arXiv from 1991 to July 2020 and is therefore of higher quality than data generated from PDF files. Furthermore, because all citing papers are available in full text, citation contexts of arbitrary size can be extracted.
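
To illustrate what extracting citation contexts of arbitrary size can look like, here is a minimal Python sketch. It assumes that in-text citations appear in the plain text as markers of the form `{{cite:ID}}`; that marker format, the `citation_contexts` helper, and the sample sentence are illustrative assumptions, not a documented part of the data set, so adjust the pattern to the files you actually work with.

```python
import re

# Assumed marker format (hypothetical): {{cite:<id>}} embedded in the plain text.
CITE_MARKER = re.compile(r"\{\{cite:([^}]+)\}\}")


def citation_contexts(text, window=100):
    """Yield (citation_id, context) pairs.

    The context spans up to `window` characters on each side of the
    marker, clipped at the start and end of the text.
    """
    for match in CITE_MARKER.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        yield match.group(1), text[start:end]


# Made-up sample text, for demonstration only.
sample = (
    "Neural models perform well {{cite:abc123}} on this task. "
    "Earlier work {{cite:def456}} used rules."
)
contexts = list(citation_contexts(sample, window=20))
```

Because the full text of the citing paper is at hand, `window` can be set to any size; a sentence-based window would work the same way with a sentence splitter in place of the character offsets.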

Typical uses of the data set include

  • Citation recommendation
  • Citation context analysis
  • Reference string parsing

The code for generating the data set is publicly available.

