A scholarly data set with publications’ full text, annotated in-text citations, and links to metadata.
The unarXive data set contains
The data is generated from all LaTeX sources on arXiv from 1991 to July 2020 and is therefore of higher quality than data generated from PDF files. Furthermore, because all citing papers are available in full text, citation contexts of arbitrary size can be extracted.
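As a rough illustration of extracting citation contexts of arbitrary size, the sketch below pulls a character window around each annotated in-text citation. It assumes citations appear in the plain text as markers of the form `{{cite:ID}}`; that marker syntax, the `citation_contexts` helper, and the sample sentence are hypothetical and should be adapted to the actual file format of the data set.

```python
import re

# Assumed marker syntax (hypothetical): {{cite:ID}} embedded in the full text.
MARKER = re.compile(r"\{\{cite:([^}]+)\}\}")


def citation_contexts(text, window=50):
    """Yield (cited_id, context) pairs.

    The context spans up to `window` characters on either side of the
    citation marker, clipped at the boundaries of the text.
    """
    for match in MARKER.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        yield match.group(1), text[start:end]


# Made-up sample sentence, not taken from the data set.
sample = "Neural models {{cite:abc123}} outperform earlier baselines."
contexts = list(citation_contexts(sample, window=10))
```

Because the window is a parameter, the same function can return a few words or several sentences around a citation; a sentence-based window would need a sentence splitter in place of the character offsets.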
Typical uses of the data set are approaches in
The code for generating the data set is publicly available.