A large corpus of 81.1M English-language academic papers spanning many academic disciplines. It provides rich metadata, paper abstracts, and resolved bibliographic references, as well as structured full text for 8.1M open access papers. The full text is annotated with automatically detected inline mentions of citations, figures, and tables, each linked to its corresponding paper object. The corpus aggregates papers from hundreds of academic publishers and digital archives into a unified source, creating the largest publicly available collection of machine-readable academic text to date.
151 PAPERS • 2 BENCHMARKS
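As a rough illustration of the annotated full text described in the entry above, inline mentions can be thought of as character spans over a paragraph, each pointing at a linked object. The sketch below is a minimal assumption-based model, not the corpus's actual schema: the class names `Mention` and `Paragraph`, the field names, and the identifiers `paper-123` and `fig-2` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Mention:
    """A character span in a paragraph (hypothetical layout, not the real schema)."""
    start: int                 # inclusive character offset
    end: int                   # exclusive character offset
    kind: str                  # "citation", "figure", or "table"
    target_id: Optional[str]   # id of the linked object, or None if unresolved


@dataclass
class Paragraph:
    text: str
    mentions: List[Mention] = field(default_factory=list)


def linked_citations(paragraph: Paragraph) -> List[Tuple[str, str]]:
    """Return (surface text, target id) for every resolved citation mention."""
    return [
        (paragraph.text[m.start:m.end], m.target_id)
        for m in paragraph.mentions
        if m.kind == "citation" and m.target_id is not None
    ]


text = "Prior work [1] uses the setup in Figure 2."
cite_start = text.index("[1]")
fig_start = text.index("Figure 2")
para = Paragraph(
    text=text,
    mentions=[
        Mention(cite_start, cite_start + len("[1]"), "citation", "paper-123"),
        Mention(fig_start, fig_start + len("Figure 2"), "figure", "fig-2"),
    ],
)
print(linked_citations(para))
```

Storing offsets rather than copies of the mention text keeps the paragraph text unmodified and lets several annotation types share it.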
The Microsoft Academic Graph is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study.
123 PAPERS • 1 BENCHMARK
ACL Anthology Reference Corpus (ACL ARC) is a collection of 10,920 academic papers from the ACL Anthology. ACL ARC is cleaned to remove:
13 PAPERS • 4 BENCHMARKS
A scholarly data set with the full text of publications, annotated in-text citations, and links to metadata.
9 PAPERS • NO BENCHMARKS YET
A data set containing citations, citation contexts, and papers.
7 PAPERS • 1 BENCHMARK
SemOpenAlex is an extensive RDF knowledge graph that contains over 26 billion triples about scientific publications and their associated entities, such as authors, institutions, journals, and concepts. SemOpenAlex is licensed under CC0, providing free and open access to the data. We offer the data through multiple channels, including RDF dump files, a SPARQL endpoint, and a data source in the Linked Open Data cloud, complete with resolvable URIs and links to other data sources (including ISNI, DOI, ORCID, ROR, Scopus, DOAJ, and Wikidata). Moreover, we provide embeddings for knowledge graph entities, computed using high-performance computing.
5 PAPERS • NO BENCHMARKS YET
These images were generated using the UnityEyes simulator, after incorporating essential eyeball physiology elements and modeling binocular vision dynamics. The images are annotated with head pose and gaze direction information, as well as 2D and 3D landmarks of the eye's most important features. Additionally, the images are divided into two classes denoting the status of the eye (Open for open eyes, Closed for closed eyes). This dataset was used to train a DNN model for detecting the drowsiness status of a driver. The dataset contains 1,704 training images, 4,232 testing images, and an additional 4,103 images for improvements.
4 PAPERS • NO BENCHMARKS YET
FullTextPeerRead is a dataset created by Jeong et al. for context-aware citation recommendation. It contains context sentences for cited references along with paper metadata, which makes it a well-organized dataset for context-aware paper recommendation.
1 PAPER • 1 BENCHMARK
Internet Archive Scholar Reference Dataset.
1 PAPER • NO BENCHMARKS YET
A newly proposed dataset for local citation recommendation, consisting of 3.2 million local citation sentences along with the title and the abstract of both the citing and the cited papers. Around 1.66 million papers' titles and abstracts are available in the database.
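To make the composition of the local citation recommendation dataset above more concrete, the following sketch models one example as a citation sentence that refers, by identifier, to a citing and a cited paper held in a shared title-and-abstract database. This is only an assumed layout: the names `Paper`, `CitationExample`, and `build_query`, the identifiers, and the sample strings are hypothetical and not taken from the dataset's documentation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Paper:
    title: str
    abstract: str


@dataclass(frozen=True)
class CitationExample:
    sentence: str    # local citation sentence from the citing paper
    citing_id: str   # key of the citing paper in the paper database
    cited_id: str    # key of the cited paper (the recommendation target)


def build_query(example: CitationExample, papers: Dict[str, Paper]) -> str:
    """Concatenate the local context with the citing paper's title and abstract."""
    citing = papers[example.citing_id]
    return " ".join([example.sentence, citing.title, citing.abstract])


# Toy data, invented purely for illustration.
papers = {
    "p1": Paper("Citing title", "Citing abstract."),
    "p2": Paper("Cited title", "Cited abstract."),
}
example = CitationExample("We follow the approach of [CIT].", "p1", "p2")

query = build_query(example, papers)
target = papers[example.cited_id]
print(query)
print(target.title)
```

Keeping papers in a separate lookup table reflects the fact that the number of papers is smaller than the number of citation sentences, so each title and abstract is stored once.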