The ArxivPapers dataset is an unlabelled collection of over 104K papers related to machine learning and published on arXiv between 2007 and 2020. For around 94K of these papers, LaTeX source code is available, and the dataset includes them in a structured form in which each paper is split into a title, abstract, sections, paragraphs and references. Additionally, the dataset contains over 277K tables extracted from the LaTeX papers.
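
To make the structured form more concrete, the following is a minimal Python sketch of how one such paper could be represented in memory. The class names (`Paper`, `Section`), the field names, and the helper `paragraph_count` are hypothetical and chosen only for illustration; they are not taken from the dataset or its pipeline, whose actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical container types: names and fields are assumptions for
# illustration, not the dataset's real schema.
@dataclass
class Section:
    heading: str
    paragraphs: List[str] = field(default_factory=list)


@dataclass
class Paper:
    title: str
    abstract: str
    sections: List[Section] = field(default_factory=list)
    references: List[str] = field(default_factory=list)


def paragraph_count(paper: Paper) -> int:
    """Count the paragraphs across all sections of a paper."""
    return sum(len(section.paragraphs) for section in paper.sections)


# A placeholder record with made-up content, used only to exercise the types.
example = Paper(
    title="An Example Title",
    abstract="A placeholder abstract.",
    sections=[Section("Introduction", ["First paragraph.", "Second paragraph."])],
    references=["A placeholder reference."],
)
```

Nesting paragraphs under sections mirrors the title, abstract, sections, paragraphs and references split described above; a flat list of paragraphs tagged with a section name would be an equally plausible layout.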

Because of the papers' licenses, the dataset is published as metadata together with an open-source pipeline that can be used to obtain and convert the papers.

Source: AxCell: Automatic Extraction of Results from Machine Learning Papers


