🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task (clear)

Filter by Language

15 dataset results for Document Classification

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.

492 PAPERS • 20 BENCHMARKS

MPQA Opinion Corpus (Multi-Perspective Question Answering)

The MPQA Opinion Corpus contains 535 news articles from a wide variety of news sources manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.).

301 PAPERS • 3 BENCHMARKS

IMDB-MULTI

IMDB-MULTI is a relational dataset that consists of a network of 1000 actors or actresses who played roles in movies in IMDB. A node represents an actor or actress, and an edge connects two nodes when they appear in the same movie. In IMDB-MULTI, the edges are collected from three different genres: Comedy, Romance and Sci-Fi.

226 PAPERS • 2 BENCHMARKS

Yelp

The Yelp Dataset is a valuable resource for academic research, teaching, and learning. It provides a rich collection of real-world data related to businesses, reviews, and user interactions. Here are the key details about the Yelp Dataset: Reviews: A whopping 6,990,280 reviews from users. Businesses: Information on 150,346 businesses. Pictures: A collection of 200,100 pictures. Metropolitan Areas: Data from 11 metropolitan areas. Tips: Over 908,915 tips provided by 1,987,897 users. Business Attributes: Details like hours, parking availability, and ambiance for more than 1.2 million businesses. Aggregated Check-ins: Historical check-in data for each of the 131,930 businesses.

68 PAPERS • 21 BENCHMARKS

Reuters-21578

The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary of 29,930 words.

63 PAPERS • 6 BENCHMARKS

WOS

WOS (Web of Science Dataset)

Web of Science (WOS) is a document classification dataset that contains 46,985 documents with 134 categories which include 7 parents categories.

48 PAPERS • 3 BENCHMARKS

SciDocs

SciDocs evaluation framework consists of a suite of evaluation tasks designed for document-level tasks.

40 PAPERS • 3 BENCHMARKS

HOC (Hallmarks of Cancer)

The Hallmarks of Cancer (*HOC) corpus consists of 1852 PubMed publication abstracts manually annotated by experts according to the Hallmarks of Cancer taxonomy. The taxonomy consists of 37 classes in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus.

29 PAPERS • 1 BENCHMARK

MultiEURLEX

MultiEURLEX is a multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. The dataset covers 23 official EU languages from 7 language families.

9 PAPERS • NO BENCHMARKS YET

Bengali Hate Speech

Introduces three datasets of expressing hate, commonly used topics, and opinions for hate speech detection, document classification, and sentiment analysis, respectively.

6 PAPERS • NO BENCHMARKS YET

Wikipedia Title

Wikipedia Title is a dataset for learning character-level compositionality from the character visual characteristics. It consists of a collection of Wikipedia titles in Chinese, Japanese or Korean labelled with the category to which the article belongs.

3 PAPERS • NO BENCHMARKS YET

RTC

RTC (Reddit Time Corpus)

RTC is a benchmark corpus of social media comments sampled over three years. The corpus consists of 36.36m unlabelled comments for adaptation and evaluation on an upstream masked language modelling task as well as 0.9m labelled comments for finetuning and evaluation on a downstream document classification task. The Reddit Time Corpus (RTC) covers three years between March 2017 and February 2020 and is split into 36 evenly-sized monthly subsets based on comment timestamps. RTC is sampled from the Pushshift Reddit dataset.

2 PAPERS • NO BENCHMARKS YET

MeSHup

MeSHup (A Corpus for Full Text Biomedical Document Indexing)

Contains 1,342,667 full text articles in English, together with the associated MeSH labels and metadata, authors, and publication venues that are collected from the MEDLINE database.

1 PAPER • NO BENCHMARKS YET

RVL-CDIP_MP

RVL-CDIP_MP (RVL-CDIP multi-page)

RVL-CDIP_MP is our first contribution to retrieve the original documents of the IIT-CDIP test collection which were used to create RVL-CDIP. Some PDFs or encoded images were corrupt, which explains that we have around 500 fewer instances. By leveraging metadata from OCR-IDL , we matched the original identifiers from IIT-CDIP and retrieved them from IDL using a conversion.

1 PAPER • NO BENCHMARKS YET

RVL-CDIP_N_MP

RVL-CDIP_N_MP (RVL-CDIP-N multi-page)

RVL-CDIP_MP-N can serve its original goal as a covariate shift test set, now for multi-page document classification. We were able to retrieve the original full documents from DocumentCloud and Web Search.

1 PAPER • NO BENCHMARKS YET

Datasets

15 dataset results for Document Classification