PAWS-X contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.
160 PAPERS • 2 BENCHMARKS
An unsupervised dataset for co-reference resolution. Presented in the publication: Kocijan et. al, WikiCREM: A Large Unsupervised Corpus for Coreference Resolution, presented at EMNLP 2019.
6 PAPERS • NO BENCHMARKS YET
ELI5 is a dataset for long-form question answering. It contains 270K complex, diverse questions that require explanatory multi-sentence answers. Web search results are used as evidence documents to answer each question.
121 PAPERS • 1 BENCHMARK
The Kite database is a multi-modal dataset for the control of unmanned aerial vehicles (UAVs). There are three modalities present in the dataset:
1 PAPER • NO BENCHMARKS YET
ChID is a large-scale Chinese IDiom dataset for cloze test. ChID contains 581K passages and 729K blanks, and covers multiple domains. In ChID, the idioms in a passage were replaced with blank symbols. For each blank, a list of candidate idioms including the golden idiom are provided as choice.
32 PAPERS • 3 BENCHMARKS
RealNews is a large corpus of news articles from Common Crawl. Data is scraped from Common Crawl, limited to the 5000 news domains indexed by Google News. The authors used the Newspaper Python library to extract the body and metadata from each article. News from Common Crawl dumps from December 2016 through March 2019 were used as training data; articles published in April 2019 from the April 2019 dump were used for evaluation. After deduplication, RealNews is 120 gigabytes without compression.
77 PAPERS • NO BENCHMARKS YET
Social Interaction QA (SIQA) is a question-answering benchmark for testing social commonsense intelligence. Contrary to many prior benchmarks that focus on physical or taxonomic knowledge, Social IQa focuses on reasoning about people’s actions and their social implications. For example, given an action like "Jesse saw a concert" and a question like "Why did Jesse do this?", humans can easily infer that Jesse wanted "to see their favorite performer" or "to enjoy the music", and not "to see what's happening inside" or "to see if it works". The actions in Social IQa span a wide variety of social situations, and answer candidates contain both human-curated answers and adversarially-filtered machine-generated candidates. Social IQa contains over 37,000 QA pairs for evaluating models’ abilities to reason about the social implications of everyday events and situations.
46 PAPERS • 1 BENCHMARK
Contents (As on March 4, 2019) The text corpus contains running text from various free licensed sources. - The whole content of Malayalam Wikipedia extracted on January 1, 2019 - News/Article from various sources, source mentioned in respective files: - 251 Mb - 8,60,159 lines - 98,15,533 words - 10,11,11,885 characters
2 PAPERS • NO BENCHMARKS YET
PersonalDialog is a large-scale multi-turn dialogue dataset containing various traits from a large number of speakers. The dataset consists of 20.83M sessions and 56.25M utterances from 8.47M speakers. Each utterance is associated with a speaker who is marked with traits like Age, Gender, Location, Interest Tags, etc. Several anonymization schemes are designed to protect the privacy of each speaker.
9 PAPERS • NO BENCHMARKS YET
CMRC is a dataset is annotated by human experts with near 20,000 questions as well as a challenging set which is composed of the questions that need reasoning over multiple clues.
62 PAPERS • 11 BENCHMARKS
Clotho is an audio captioning dataset, consisting of 4981 audio samples, and each audio sample has five captions (a total of 24 905 captions). Audio samples are of 15 to 30 s duration and captions are eight to 20 words long.
139 PAPERS • 5 BENCHMARKS
The OLID is a hierarchical dataset to identify the type and the target of offensive texts in social media. The dataset is collected on Twitter and publicly available. There are 14,100 tweets in total, in which 13,240 are in the training set, and 860 are in the test set. For each tweet, there are three levels of labels: (A) Offensive/Not-Offensive, (B) Targeted-Insult/Untargeted, (C) Individual/Group/Other. The relationship between them is hierarchical. If a tweet is offensive, it can have a target or no target. If it is offensive to a specific target, the target can be an individual, a group, or some other objects. This dataset is used in the OffensEval-2019 competition in SemEval-2019.
142 PAPERS • 1 BENCHMARK
Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. It is available in more than 400 languages. Its name comes from the Japanese phrase “tatoeba” (例えば), meaning “for example”. It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as Tatoebans.
37 PAPERS • 26 BENCHMARKS
emrQA has 1 million question-logical form and 400,000+ questionanswer evidence pairs.
46 PAPERS • NO BENCHMARKS YET
CC-Stories (or STORIES) is a dataset for common sense reasoning and language modeling. It was constructed by aggregating documents from the CommonCrawl dataset that has the most overlapping n-grams with the questions in commonsense reasoning tasks. The top 1.0% of highest ranked documents is chosen as the new training corpus.
Worldtree is a corpus of explanation graphs, explanatory role ratings, and associated tablestore. It contains explanation graphs for 1,680 questions, and 4,950 tablestore rows across 62 semi-structured tables are provided. This data is intended to be paired with the AI2 Mercury Licensed questions.
33 PAPERS • NO BENCHMARKS YET
The Cloze Test by Teachers (CLOTH) benchmark is a collection of nearly 100,000 4-way multiple-choice cloze-style questions from middle- and high school-level English language exams, where the answer fills a blank in a given text. Each question is labeled with a type of deep reasoning it involves, where the four possible types are grammar, short-term reasoning, matching/paraphrasing, and long-term reasoning, i.e., reasoning over multiple sentences
18 PAPERS • NO BENCHMARKS YET
EmotionLines contains a total of 29245 labeled utterances from 2000 dialogues. Each utterance in dialogues is labeled with one of seven emotions, six Ekman’s basic emotions plus the neutral emotion. Each labeling was accomplished by 5 workers, and for each utterance in a label, the emotion category with the highest votes was set as the label of the utterance. Those utterances voted as more than two different emotions were put into the non-neutral category. Therefore the dataset has a total of 8 types of emotion labels, anger, disgust, fear, happiness, sadness, surprise, neutral, and non-neutral.
42 PAPERS • 1 BENCHMARK
The MetaQA dataset consists of a movie ontology derived from the WikiMovies Dataset and three sets of question-answer pairs written in natural language: 1-hop, 2-hop, and 3-hop queries.
68 PAPERS • 1 BENCHMARK
PearRead is a dataset of scientific peer reviews. The dataset consists of over 14K paper drafts and the corresponding accept/reject decisions in top-tier venues including ACL, NIPS and ICLR, as well as over 10K textual peer reviews written by experts for a subset of the papers.
32 PAPERS • NO BENCHMARKS YET
The Semantic Scholar corpus (S2) is composed of titles from scientific papers published in machine learning conferences and journals from 1985 to 2017, split by year (33 timesteps).
70 PAPERS • NO BENCHMARKS YET
News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 5 languages (Chinese, Czech, Estonian, German, Finnish, Russian, Turkish) and additional 1500 sentences from each of the 7 languages translated to English. The sentences were selected from dozens of news websites and translated by professional translators.
8 PAPERS • NO BENCHMARKS YET
End-to-End NLG Challenge (E2E) aims to assess whether recent end-to-end NLG systems can generate more complex output by learning from datasets containing higher lexical richness, syntactic complexity and diverse discourse phenomena.
68 PAPERS • 4 BENCHMARKS
KP20k is a large-scale scholarly articles dataset with 528K articles for training, 20K articles for validation and 20K articles for testing.
79 PAPERS • 3 BENCHMARKS
TUT-SED Synthetic 2016 contains of mixture signals artificially generated from isolated sound events samples. This approach is used to get more accurate onset and offset annotations than in dataset using recordings from real acoustic environments where the annotations are always subjective. Mixture signals in the dataset are created by randomly selecting and mixing isolated sound events from 16 sound event classes together. The resulting mixtures contains sound events with varying polyphony. All together 994 sound event samples were purchased from Sound Ideas. From the 100 mixtures created, 60% were assigned for training, 20% for testing and 20% for validation. The total amount of audio material in the dataset is 566 minutes. Different instances of the sound events are used to synthesize the training, validation and test partitions. Mixtures were created by randomly selecting event instance and from it, randomly, a segment of length 3-15 seconds. Between events, random length silent re
7 PAPERS • NO BENCHMARKS YET
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.
490 PAPERS • 2 BENCHMARKS
809 PAPERS • 3 BENCHMARKS
The LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) benchmark is an open-ended cloze task which consists of about 10,000 passages from BooksCorpus where a missing target word is predicted in the last sentence of each passage. The missing word is constrained to always be the last word of the last sentence and there are no candidate words to choose from. Examples were filtered by humans to ensure they were possible to guess given the context, i.e., the sentences in the passage leading up to the last sentence. Examples were further filtered to ensure that missing words could not be guessed without the context, ensuring that models attempting the dataset would need to reason over the entire paragraph to answer questions.
184 PAPERS • 1 BENCHMARK
Datasets Spades contains 93,319 questions derived from clueweb09 sentences. Specifically, the questions were created by randomly removing an entity, thus producing sentence-denotation pairs.
The TUT Sound Events 2017 dataset contains 24 audio recordings in a street environment and contains 6 different classes. These classes are: brakes squeaking, car, children, large vehicle, people speaking, and people walking.
BookCorpus is a large collection of free novel books written by unpublished authors, which contains 11,038 books (around 74M sentences and 1G words) of 16 different sub-genres (e.g., Romance, Historical, Adventure, etc.).
322 PAPERS • 1 BENCHMARK
The UKP Argument Annotated Essays corpus consists of argument annotated persuasive essays including annotations of argument components and argumentative relations.
BeerAdvocate is a dataset that consists of beer reviews from beeradvocate. The data span a period of more than 10 years, including all ~1.5 million reviews up to November 2011. Each review includes ratings in terms of five "aspects": appearance, aroma, palate, taste, and overall impression. Reviews include product and user information, followed by each of these five ratings, and a plaintext review.
14 PAPERS • 1 BENCHMARK
The TED-LIUM corpus consists of English-language TED talks. It includes transcriptions of these talks. The audio is sampled at 16kHz. The dataset spans a range of 118 to 452 hours of transcribed speech data.
54 PAPERS • 2 BENCHMARKS
MuseData is an electronic library of orchestral and piano classical music from CCARH. It consists of around 3MB of 783 files.
27 PAPERS • NO BENCHMARKS YET
CASIA-HWDB is a dataset for handwritten Chinese character recognition. It contains 300 files (240 in HWDB1.1 training set and 60 in HWDB1.1 test set). Each file contains about 3000 isolated gray-scale Chinese character images written by one writer, as well as their corresponding labels.
17 PAPERS • NO BENCHMARKS YET
The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The dataset contains additional unlabeled data.
1,582 PAPERS • 11 BENCHMARKS
The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.
1,067 PAPERS • 24 BENCHMARKS
The RWC (Real World Computing) Music Database is a copyright-cleared music database (DB) that is available to researchers as a common foundation for research. It contains around 100 complete songs with manually labeled section boundaries. For the 50 instruments, individual sounds at half-tone intervals were captured with several variations of playing styles, dynamics, instrument manufacturers and musicians.
41 PAPERS • NO BENCHMARKS YET
There are now many computer programs for automatically determining the sense of a word in context (Word Sense Disambiguation or WSD). The purpose of SENSEVAL is to evaluate the strengths and weaknesses of such programs with respect to different words, different varieties of language, and different languages.
19 PAPERS • NO BENCHMARKS YET
The ECGs in this collection were obtained using a non-commercial, PTB prototype recorder with the following specifications:
16 PAPERS • 4 BENCHMARKS
The English Penn Treebank (PTB) corpus, and in particular the section of the corpus corresponding to the articles of Wall Street Journal (WSJ), is one of the most known and used corpus for the evaluation of models for sequence labelling. The task consists of annotating each word with its Part-of-Speech tag. In the most common split of this corpus, sections from 0 to 18 are used for training (38 219 sentences, 912 344 tokens), sections from 19 to 21 are used for validation (5 527 sentences, 131 768 tokens), and sections from 22 to 24 are used for testing (5 462 sentences, 129 654 tokens). The corpus is also commonly used for character-level and word-level Language Modelling.
978 PAPERS • 10 BENCHMARKS
AISHELL-1 is a corpus for speech recognition research and building speech recognition systems for Mandarin.
163 PAPERS • 1 BENCHMARK
ANTIQUE is a collection of 2,626 open-domain non-factoid questions from a diverse set of categories. The dataset contains 34,011 manual relevance annotations. The questions were asked by real users in a community question answering service, i.e., Yahoo! Answers. Relevance judgments for all the answers to each question were collected through crowdsourcing.
13 PAPERS • NO BENCHMARKS YET
ART consists of over 20k commonsense narrative contexts and 200k explanations.
ATOMIC is an atlas of everyday commonsense reasoning, organized through 877k textual descriptions of inferential knowledge. Compared to existing resources that center around taxonomic knowledge, ATOMIC focuses on inferential knowledge organized as typed if-then relations with variables (e.g., "if X pays Y a compliment, then Y will likely return the compliment").
162 PAPERS • NO BENCHMARKS YET
The Alexa Point of View dataset is point of view conversion dataset, a parallel corpus of messages spoken to a virtual assistant and the converted messages for delivery. The dataset contains parallel corpus of input (input column) message and POV converted messages (output column). An example of a pair is tell @CN@ that i'll be late [\t] hi @CN@, @SCN@ would like you to know that they'll be late. The input and pov-converted output pair is tab separated. @CN@ tag is a placeholder for the contact name (receiver) and @SCN@ tag is a placeholder for source contact name (sender). The total dataset has 46563 pairs. This data is then test/train/dev split into 6985 pairs/32594 pairs/6985 pairs.
1 PAPER • 1 BENCHMARK
Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a dataset of 27,770 papers with 352,807 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this. The data covers papers in the period from January 1993 to April 2003 (124 months).
34 PAPERS • 9 BENCHMARKS