A dataset composed of sentence pairs (i.e., twin sentences).
7 PAPERS • NO BENCHMARKS YET
TUT-SED Synthetic 2016 consists of mixture signals artificially generated from isolated sound event samples. This approach yields more accurate onset and offset annotations than datasets recorded in real acoustic environments, where the annotations are always subjective. Mixture signals in the dataset are created by randomly selecting and mixing together isolated sound events from 16 sound event classes. The resulting mixtures contain sound events with varying polyphony. Altogether, 994 sound event samples were purchased from Sound Ideas. Of the 100 mixtures created, 60% were assigned for training, 20% for testing and 20% for validation. The total amount of audio material in the dataset is 566 minutes. Different instances of the sound events are used to synthesize the training, validation and test partitions. Mixtures were created by randomly selecting an event instance and, from it, a random segment of 3-15 seconds in length. Between events, random-length silent regions were inserted.
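The generation recipe above is concrete enough to sketch in code. The following is a minimal, hypothetical illustration (not the official generation scripts) of assembling such mixtures, assuming the isolated event clips are 1-D NumPy arrays at a common sample rate; the names event_bank, make_event_track and make_mixture, and the silence-duration range, are assumptions made for illustration.

    import random
    import numpy as np

    def make_event_track(event_bank, sr=44100, track_len_s=60, rng=random):
        # event_bank: dict mapping class name -> list of 1-D float waveforms
        track = np.zeros(track_len_s * sr, dtype=np.float32)
        annotations = []  # (onset_s, offset_s, class) -- exact by construction
        cursor = 0
        while cursor < len(track):
            cls = rng.choice(list(event_bank))
            event = rng.choice(event_bank[cls])
            seg = event[: int(rng.uniform(3, 15) * sr)]   # random 3-15 s segment
            cursor += int(rng.uniform(0.5, 3.0) * sr)     # random-length silence (assumed range)
            end = min(cursor + len(seg), len(track))
            if end <= cursor:
                break
            track[cursor:end] += seg[: end - cursor]
            annotations.append((cursor / sr, end / sr, cls))
            cursor = end
        return track, annotations

    def make_mixture(event_bank, n_tracks=3, **kw):
        # summing several independently generated tracks yields varying polyphony
        tracks, anns = zip(*(make_event_track(event_bank, **kw) for _ in range(n_tracks)))
        return np.sum(tracks, axis=0), sorted(a for t in anns for a in t)

Because onsets and offsets are set programmatically, the annotations are exact rather than subjective, which is the point of the synthetic approach.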
The CoarseWSD-20 dataset is a coarse-grained sense disambiguation dataset built from Wikipedia (nouns only) targeting 2 to 5 senses of 20 ambiguous words. It was specifically designed to provide an ideal setting for evaluating Word Sense Disambiguation (WSD) models (e.g. no senses in test sets missing from training), both quantitatively and qualitatively.
6 PAPERS • NO BENCHMARKS YET
A fill-in-the-blank question-answering benchmark for developing and understanding video models, with over 300,000 examples based on descriptive video annotations for the visually impaired.
The WordNet Language Model Probing (WNLaMPro) dataset consists of relations between keywords and words. It contains 4 different kinds of relations: Antonym, Hypernym, Cohyponym and Corruption.
An unsupervised dataset for coreference resolution, presented in the publication: Kocijan et al., WikiCREM: A Large Unsupervised Corpus for Coreference Resolution, EMNLP 2019.
Winogender Schemas is a novel, Winograd schema-style set of minimal pair sentences that differ only by pronoun gender.
CLUECorpus2020 is a large-scale corpus that can be used directly for self-supervised learning such as pre-training of a language model, or language generation. It has 100G raw corpus with 35 billion Chinese characters, which is retrieved from Common Crawl.
5 PAPERS • NO BENCHMARKS YET
Coached Conversational Preference Elicitation is a dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. It was collected using a Wizard-of-Oz methodology between two paid crowd-workers, where one worker plays the role of an 'assistant', while the other plays the role of a 'user'.
GINC (Generative In-Context learning Dataset) is a small-scale synthetic dataset for studying in-context learning. The pretraining data is generated by a mixture of HMMs, and the in-context learning prompt examples are also generated from HMMs (either from the mixture or not). The prompt examples are out-of-distribution with respect to the pretraining data because each example is sampled independently and the examples are then concatenated, separated by delimiters. The GitHub repository provides code to generate GINC-style datasets with varying vocabulary sizes, numbers of HMMs, and other parameters.
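As a rough illustration of that setup, the sketch below (not the repository's actual code; the function names and delimiter handling are assumptions) samples token sequences from a mixture of HMMs and builds a prompt from independently drawn examples separated by a delimiter token.

    import numpy as np

    def sample_hmm(T, E, pi, length, rng):
        # T: (S, S) transitions, E: (S, V) emissions, pi: (S,) start distribution
        tokens, s = [], rng.choice(len(pi), p=pi)
        for _ in range(length):
            tokens.append(int(rng.choice(E.shape[1], p=E[s])))
            s = int(rng.choice(T.shape[0], p=T[s]))
        return tokens

    def make_prompt(hmms, n_examples, example_len, delimiter_id, rng):
        # each example is sampled independently, then concatenated with
        # delimiters -- out-of-distribution w.r.t. the pretraining stream
        prompt = []
        for _ in range(n_examples):
            T, E, pi = hmms[rng.integers(len(hmms))]  # pick one component HMM
            prompt += sample_hmm(T, E, pi, example_len, rng)
            prompt.append(delimiter_id)
        return prompt

    # usage: hmms is a list of (T, E, pi) triples; rng = np.random.default_rng(0)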
A large-scale Indonesian summarization dataset consisting of harvested articles from Liputan6.com, an online news portal, resulting in 215,827 document-summary pairs.
Tencent ML-Images is a large open-source multi-label image database, including 17,609,752 training and 88,739 validation image URLs, which are annotated with up to 11,166 categories.
The largest existing corpus of Catalan, containing 687 million words, which is a significant increase given that until now the biggest corpus of Catalan, CuCWeb, counted 166 million words.
4 PAPERS • NO BENCHMARKS YET
Databricks Dolly 15k is a dataset containing 15,000 high-quality human-generated prompt/response pairs specifically designed for instruction tuning large language models. It was authored by more than 5,000 Databricks employees during March and April of 2023. The training records are natural, expressive and designed to represent a wide range of behaviors, from brainstorming and content generation to information extraction and summarization.
Models character profiles and gives dialogue agents the ability to learn characters' language styles through their Human-Level Attributes (HLAs).
Romanian Named Entity Corpus is a named entity corpus for the Romanian language. The corpus contains over 26,000 entities in ~5,000 annotated sentences, belonging to 16 distinct classes. The sentences have been extracted from a copyright-free newspaper, covering several styles. This corpus represents the first initiative in the Romanian language space specifically targeting named entity recognition.
Benchmark dataset for abstracts and titles of 100,000 ArXiv scientific papers. This dataset contains 10 classes and is balanced (exactly 10,000 per class). The classes include subcategories of computer science, physics, and math.
4 PAPERS • 1 BENCHMARK
This is a dataset for disentangling conversations on IRC, which is the task of identifying separate conversations in a single stream of messages. It contains disentanglement information for 77,563 IRC messages.
4 PAPERS • 3 BENCHMARKS
CommitChronicle is a dataset for commit message generation (and/or completion).
3 PAPERS • NO BENCHMARKS YET
The IndicNLP corpus is a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.
Wiki-Convert is a dataset of over 900,000 sentences with precise number annotations drawn from English Wikipedia. It relies on Wiki contributors' annotations in the form of the {{Convert}} template.
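Since the annotations come from the {{Convert}} template in raw wikitext, extraction reduces to template matching. A simplified, hypothetical sketch follows (the real template supports many more options than this regex handles):

    import re

    # matches e.g. "{{convert|1991|m|ft}}"; the first two fields are value and unit
    CONVERT_RE = re.compile(r"\{\{\s*[Cc]onvert\s*\|([^|}]+)\|([^|}]+)[^}]*\}\}")

    def extract_numbers(wikitext):
        # returns (value, unit) pairs marked up by Wiki contributors
        return [(m.group(1).strip(), m.group(2).strip())
                for m in CONVERT_RE.finditer(wikitext)]

    print(extract_numbers("The bridge is {{convert|1991|m|ft}} long."))
    # [('1991', 'm')]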
WikiText-TL-39 is a benchmark language modeling dataset in Filipino that has 39 million tokens in the training set.
Cherokee-English Parallel Dataset is a low-resource dataset of 14,151 pairs of sentences with around 313K English tokens and 206K Cherokee tokens. The parallel corpus is accompanied by a monolingual Cherokee dataset of 5,120 sentences. Both datasets are mostly derived from Cherokee monolingual books.
2 PAPERS • NO BENCHMARKS YET
The Circa (meaning ‘approximately’) dataset aims to help machine learning systems solve the problem of interpreting indirect answers to polar questions.
Comparative Question Completion is a dataset for evaluating what large language models learn.
Czech restaurant information is a dataset for NLG in task-oriented spoken dialogue systems with Czech as the target language. It originated as a translation of the English San Francisco Restaurants dataset by Wen et al. (2015).
2 PAPERS • 1 BENCHMARK
KMIR (Knowledge Memorization, Identification, and Reasoning) is a benchmark that covers 3 types of knowledge, including general knowledge, domain-specific knowledge, and commonsense, and provides 184,348 well-designed questions. KMIR can be used for evaluating knowledge memorization, identification and reasoning abilities of language models.
NQuAD (Nuclear Question Answering Dataset) contains 700+ nuclear question-answer pairs developed and verified by expert nuclear researchers.
A collection of 385,705 scientific abstracts about Cognitive Control and their GPT-3 embeddings.
The Reddit Time Corpus (RTC) is a benchmark corpus of social media comments sampled over three years, covering March 2017 to February 2020, and split into 36 evenly-sized monthly subsets based on comment timestamps. The corpus consists of 36.36m unlabelled comments for adaptation and evaluation on an upstream masked language modelling task, as well as 0.9m labelled comments for finetuning and evaluation on a downstream document classification task. RTC is sampled from the Pushshift Reddit dataset.
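Building the 36 monthly subsets amounts to bucketing comments by timestamp. A minimal sketch, assuming Pushshift-style records with a Unix created_utc field (the date bounds come from the description above; the function name and the rest are illustrative):

    from collections import defaultdict
    from datetime import datetime, timezone

    START = datetime(2017, 3, 1, tzinfo=timezone.utc)
    END = datetime(2020, 3, 1, tzinfo=timezone.utc)

    def bucket_by_month(comments):
        # returns {"YYYY-MM": [comment, ...]} for the 36 covered months
        buckets = defaultdict(list)
        for c in comments:
            ts = datetime.fromtimestamp(c["created_utc"], tz=timezone.utc)
            if START <= ts < END:
                buckets[ts.strftime("%Y-%m")].append(c)
        return buckets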
SLNET is a collection of third-party Simulink models. It is curated by mining open-source repositories (GitHub and MATLAB Central) using SLNET-Miner (https://github.com/50417/SLNet_Miner).
Contents (as of March 4, 2019): the text corpus contains running text from various freely licensed sources, including the whole content of Malayalam Wikipedia extracted on January 1, 2019, and news articles from various sources (the source is mentioned in the respective files). In total: 251 MB, 860,159 lines, 9,815,533 words, 101,111,885 characters.
The Sentimental LIAR dataset is a modified and further extended version of the LIAR extension introduced by Kirilin et al. In this dataset, the multi-class labelling of LIAR is converted to a binary annotation by changing the half-true, false, barely-true and pants-fire labels to False, and the remaining labels to True. Furthermore, the speaker names are converted to numerical IDs in order to avoid bias with regard to the textual representation of names. The binary-label dataset is then extended by adding sentiments derived using the Google NLP API. Sentiment analysis determines the overall attitude of the text (i.e., whether it is positive or negative), quantified by a numerical score; if the sentiment score is positive, the sample is tagged as Positive for the sentiment attribute, otherwise Negative is assigned. A further extension adds emotion scores, extracted using the IBM NLP API for each claim, that quantify the detected level of six emotional states.
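The label conversion described above is mechanical; here is a minimal sketch of the binarization and sentiment-tagging rules (illustrative code, not the authors' scripts):

    FALSE_LABELS = {"half-true", "false", "barely-true", "pants-fire"}

    def binarize(liar_label):
        # collapse LIAR's six-way truthfulness scale into True/False
        return "False" if liar_label in FALSE_LABELS else "True"

    def sentiment_tag(score):
        # positive sentiment score -> Positive, otherwise Negative
        return "Positive" if score > 0 else "Negative"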
Spades contains 93,319 questions derived from ClueWeb09 sentences. Specifically, the questions were created by randomly removing an entity, thus producing sentence-denotation pairs.
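The construction can be pictured with a tiny, hypothetical sketch, assuming entity mentions in each sentence have already been identified (the _blank_ placeholder is illustrative):

    import random

    def make_question(sentence, entities, rng=random):
        # remove one randomly chosen entity mention; the removed
        # entity becomes the denotation (answer) for the question
        ent = rng.choice(entities)
        return sentence.replace(ent, "_blank_", 1), ent

    print(make_question("Obama was born in Hawaii.", ["Obama", "Hawaii"]))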
ViText2SQL is a dataset for the Vietnamese Text-to-SQL semantic parsing task, consisting of about 10K question and SQL query pairs.
The Alexa Point of View dataset is a point-of-view conversion dataset: a parallel corpus of messages spoken to a virtual assistant and the converted messages for delivery. The dataset pairs an input message (input column) with its POV-converted message (output column); an example pair is: tell @CN@ that i'll be late [\t] hi @CN@, @SCN@ would like you to know that they'll be late. The input and POV-converted output in each pair are tab-separated. The @CN@ tag is a placeholder for the contact name (receiver) and the @SCN@ tag is a placeholder for the source contact name (sender). The dataset has 46,563 pairs in total, split into 32,594 train, 6,985 dev, and 6,985 test pairs.
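Given the tab-separated layout described above, loading the pairs is straightforward. A minimal sketch, assuming a headerless UTF-8 file (the filename is hypothetical):

    import csv

    def load_pov_pairs(path):
        # each line: input message [TAB] POV-converted output;
        # @CN@ = contact (receiver), @SCN@ = source contact (sender)
        with open(path, newline="", encoding="utf-8") as f:
            return [(row[0], row[1]) for row in csv.reader(f, delimiter="\t")]

    pairs = load_pov_pairs("pov_dataset.tsv")  # hypothetical filename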
1 PAPER • 1 BENCHMARK
The Books3 dataset emerged as part of a broader effort to train AI models for natural language understanding and generation. It comprises an extensive collection of digitized books, spanning from classics to contemporary works, gathered from various sources, including libraries and online repositories.
The Curation Corpus is a collection of 40,000 professionally-written summaries of news articles, with links to the articles themselves.
DpgMedia2019 is a Dutch news dataset for partisanship detection. It contains more than 100K articles that are labelled on the publisher level and 776 articles that were crowdsourced using an internal survey platform and labelled on the article level.
1 PAPER • NO BENCHMARKS YET
Expertly-curated benchmark dataset for fake news detection in Filipino.
Free Law Project is a leading nonprofit organization that aims to make the legal ecosystem more equitable and competitive through technology, data, and advocacy.
A dataset of natural-language data collected by combining more than 150 existing monolingual and multilingual datasets and by crawling known multilingual websites. The focus of this dataset is on 500 extremely low-resource languages.
Collection of news websites in low-resource languages.
StoryBooks for 174 unique languages.
InstructOpenWiki is a substantial instruction tuning dataset for Open-world IE enriched with a comprehensive corpus, extensive annotations, and diverse instructions.
The Kite database is a multi-modal dataset for the control of unmanned aerial vehicles (UAVs), with three modalities present in the dataset.
The Lenta Short Sentences dataset is a text dataset for language modelling for the Russian language. It consists of 236K sentences sampled from the Lenta News dataset.