A dataset composed of sentence pairs (i.e., twin sentences).
7 PAPERS • NO BENCHMARKS YET
TUT-SED Synthetic 2016 consists of mixture signals artificially generated from isolated sound event samples. This approach yields more accurate onset and offset annotations than datasets recorded in real acoustic environments, where the annotations are always subjective. Mixture signals in the dataset are created by randomly selecting and mixing together isolated sound events from 16 sound event classes. The resulting mixtures contain sound events with varying polyphony. Altogether, 994 sound event samples were purchased from Sound Ideas. Of the 100 mixtures created, 60% were assigned for training, 20% for testing and 20% for validation. The total amount of audio material in the dataset is 566 minutes. Different instances of the sound events are used to synthesize the training, validation and test partitions. Mixtures were created by randomly selecting an event instance and, from it, a random segment of 3-15 seconds in length. Between events, random-length silent regions were inserted.
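The generation recipe above is concrete enough to sketch in code. The following is a minimal, hypothetical illustration (not the official generation scripts) of assembling such mixtures, assuming the isolated event clips are 1-D NumPy arrays at a common sample rate; the names event_bank, make_event_track and make_mixture, and the silence-duration range, are assumptions made for illustration.

    import random
    import numpy as np

    def make_event_track(event_bank, sr=44100, track_len_s=60, rng=random):
        # event_bank: dict mapping class name -> list of 1-D float waveforms
        track = np.zeros(track_len_s * sr, dtype=np.float32)
        annotations = []  # (onset_s, offset_s, class) -- exact by construction
        cursor = 0
        while cursor < len(track):
            cls = rng.choice(list(event_bank))
            event = rng.choice(event_bank[cls])
            seg = event[: int(rng.uniform(3, 15) * sr)]   # random 3-15 s segment
            cursor += int(rng.uniform(0.5, 3.0) * sr)     # random-length silence (assumed range)
            end = min(cursor + len(seg), len(track))
            if end <= cursor:
                break
            track[cursor:end] += seg[: end - cursor]
            annotations.append((cursor / sr, end / sr, cls))
            cursor = end
        return track, annotations

    def make_mixture(event_bank, n_tracks=3, **kw):
        # summing several independently generated tracks yields varying polyphony
        tracks, anns = zip(*(make_event_track(event_bank, **kw) for _ in range(n_tracks)))
        return np.sum(tracks, axis=0), sorted(a for t in anns for a in t)

Because onsets and offsets are set programmatically, the annotations are exact rather than subjective, which is the point of the synthetic approach.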
The CoarseWSD-20 dataset is a coarse-grained sense disambiguation dataset built from Wikipedia (nouns only) targeting 2 to 5 senses of 20 ambiguous words. It was specifically designed to provide an ideal setting for evaluating Word Sense Disambiguation (WSD) models (e.g. no senses in test sets missing from training), both quantitatively and qualitatively.
6 PAPERS • NO BENCHMARKS YET
A fill-in-the-blank question-answering benchmark for developing and understanding video models, with over 300,000 examples based on descriptive video annotations for the visually impaired.
The WordNet Language Model Probing (WNLaMPro) dataset consists of relations between keywords and words. It contains 4 different kinds of relations: Antonym, Hypernym, Cohyponym and Corruption.
An unsupervised dataset for coreference resolution, presented in the publication: Kocijan et al., WikiCREM: A Large Unsupervised Corpus for Coreference Resolution, EMNLP 2019.
Winogender Schemas is a novel, Winograd schema-style set of minimal pair sentences that differ only by pronoun gender.
CLUECorpus2020 is a large-scale corpus that can be used directly for self-supervised learning such as pre-training of a language model, or language generation. It has 100G raw corpus with 35 billion Chinese characters, which is retrieved from Common Crawl.
5 PAPERS • NO BENCHMARKS YET
Coached Conversational Preference Elicitation is a dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. It was collected using a Wizard-of-Oz methodology between two paid crowd-workers, where one worker plays the role of an 'assistant', while the other plays the role of a 'user'.
GINC (Generative In-Context learning Dataset) is a small-scale synthetic dataset for studying in-context learning. The pretraining data is generated by a mixture of HMMs, and the in-context learning prompt examples are also generated from HMMs (either from the mixture or not). The prompt examples are out-of-distribution with respect to the pretraining data because each example is sampled independently and the examples are then concatenated, separated by delimiters. The GitHub repository provides code to generate GINC-style datasets with varying vocabulary sizes, numbers of HMMs, and other parameters.
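As a rough illustration of that setup, the sketch below (not the repository's actual code; the function names and delimiter handling are assumptions) samples token sequences from a mixture of HMMs and builds a prompt from independently drawn examples separated by a delimiter token.

    import numpy as np

    def sample_hmm(T, E, pi, length, rng):
        # T: (S, S) transitions, E: (S, V) emissions, pi: (S,) start distribution
        tokens, s = [], rng.choice(len(pi), p=pi)
        for _ in range(length):
            tokens.append(int(rng.choice(E.shape[1], p=E[s])))
            s = int(rng.choice(T.shape[0], p=T[s]))
        return tokens

    def make_prompt(hmms, n_examples, example_len, delimiter_id, rng):
        # each example is sampled independently, then concatenated with
        # delimiters -- out-of-distribution w.r.t. the pretraining stream
        prompt = []
        for _ in range(n_examples):
            T, E, pi = hmms[rng.integers(len(hmms))]  # pick one component HMM
            prompt += sample_hmm(T, E, pi, example_len, rng)
            prompt.append(delimiter_id)
        return prompt

    # usage: hmms is a list of (T, E, pi) triples; rng = np.random.default_rng(0)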
A large-scale Indonesian summarization dataset consisting of harvested articles from Liputan6.com, an online news portal, resulting in 215,827 document-summary pairs.
Tencent ML-Images is a large open-source multi-label image database, including 17,609,752 training and 88,739 validation image URLs, which are annotated with up to 11,166 categories.
The largest existing corpus of Catalan, containing 687 million words, which is a significant increase given that until now the biggest corpus of Catalan, CuCWeb, counted 166 million words.
4 PAPERS • NO BENCHMARKS YET
Databricks Dolly 15k is a dataset containing 15,000 high-quality human-generated prompt/response pairs specifically designed for instruction tuning large language models. It was authored by more than 5,000 Databricks employees during March and April of 2023. The training records are natural, expressive and designed to represent a wide range of behaviors, from brainstorming and content generation to information extraction and summarization.
Models character profiles and gives dialogue agents the ability to learn characters' language styles through their Human-Level Attributes (HLAs).
Romanian Named Entity Corpus is a named entity corpus for the Romanian language. The corpus contains over 26,000 entities in ~5,000 annotated sentences, belonging to 16 distinct classes. The sentences have been extracted from a copyright-free newspaper, covering several styles. This corpus represents the first initiative in the Romanian language space specifically targeting named entity recognition.
Benchmark dataset for abstracts and titles of 100,000 ArXiv scientific papers. This dataset contains 10 classes and is balanced (exactly 10,000 per class). The classes include subcategories of computer science, physics, and math.
4 PAPERS • 1 BENCHMARK
This is a dataset for disentangling conversations on IRC, which is the task of identifying separate conversations in a single stream of messages. It contains disentanglement information for 77,563 IRC messages.
4 PAPERS • 3 BENCHMARKS
CommitChronicle is a dataset for commit message generation (and/or completion).
3 PAPERS • NO BENCHMARKS YET
The IndicNLP corpus is a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.
Wiki-Convert is a dataset of over 900,000 sentences with precise number annotations drawn from English Wikipedia. It relies on Wiki contributors' annotations in the form of the {{Convert}} template.
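Since the annotations come from the {{Convert}} template in raw wikitext, extraction reduces to template matching. A simplified, hypothetical sketch follows (the real template supports many more options than this regex handles):

    import re

    # matches e.g. "{{convert|1991|m|ft}}"; the first two fields are value and unit
    CONVERT_RE = re.compile(r"\{\{\s*[Cc]onvert\s*\|([^|}]+)\|([^|}]+)[^}]*\}\}")

    def extract_numbers(wikitext):
        # returns (value, unit) pairs marked up by Wiki contributors
        return [(m.group(1).strip(), m.group(2).strip())
                for m in CONVERT_RE.finditer(wikitext)]

    print(extract_numbers("The bridge is {{convert|1991|m|ft}} long."))
    # [('1991', 'm')]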
WikiText-TL-39 is a benchmark language modeling dataset in Filipino that has 39 million tokens in the training set.
Cherokee-English Parallel Dataset is a low-resource dataset of 14,151 pairs of sentences with around 313K English tokens and 206K Cherokee tokens. The parallel corpus is accompanied by a monolingual Cherokee dataset of 5,120 sentences. Both datasets are mostly derived from Cherokee monolingual books.
2 PAPERS • NO BENCHMARKS YET
The Circa (meaning ‘approximately’) dataset aims to help machine learning systems solve the problem of interpreting indirect answers to polar questions.
Comparative Question Completion is a dataset for evaluating what large language models learn.
Czech restaurant information is a dataset for NLG in task-oriented spoken dialogue systems with Czech as the target language. It originated as a translation of the English San Francisco Restaurants dataset by Wen et al. (2015).
2 PAPERS • 1 BENCHMARK
KMIR (Knowledge Memorization, Identification, and Reasoning) is a benchmark that covers 3 types of knowledge, including general knowledge, domain-specific knowledge, and commonsense, and provides 184,348 well-designed questions. KMIR can be used for evaluating knowledge memorization, identification and reasoning abilities of language models.
NQuAD (Nuclear Question Answering Dataset) contains 700+ nuclear question-answer pairs developed and verified by expert nuclear researchers.
A collection of 385,705 scientific abstracts about Cognitive Control and their GPT-3 embeddings.
The Reddit Time Corpus (RTC) is a benchmark corpus of social media comments sampled over three years, covering March 2017 to February 2020, and split into 36 evenly-sized monthly subsets based on comment timestamps. The corpus consists of 36.36m unlabelled comments for adaptation and evaluation on an upstream masked language modelling task, as well as 0.9m labelled comments for finetuning and evaluation on a downstream document classification task. RTC is sampled from the Pushshift Reddit dataset.
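Building the 36 monthly subsets amounts to bucketing comments by timestamp. A minimal sketch, assuming Pushshift-style records with a Unix created_utc field (the date bounds come from the description above; the function name and the rest are illustrative):

    from collections import defaultdict
    from datetime import datetime, timezone

    START = datetime(2017, 3, 1, tzinfo=timezone.utc)
    END = datetime(2020, 3, 1, tzinfo=timezone.utc)

    def bucket_by_month(comments):
        # returns {"YYYY-MM": [comment, ...]} for the 36 covered months
        buckets = defaultdict(list)
        for c in comments:
            ts = datetime.fromtimestamp(c["created_utc"], tz=timezone.utc)
            if START <= ts < END:
                buckets[ts.strftime("%Y-%m")].append(c)
        return buckets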
SLNET is a collection of third-party Simulink models. It is curated by mining open-source repositories (GitHub and MATLAB Central) using SLNET-Miner (https://github.com/50417/SLNet_Miner).
Contents (as of March 4, 2019): the text corpus contains running text from various freely licensed sources, including the whole content of Malayalam Wikipedia extracted on January 1, 2019, and news articles from various sources (the source is mentioned in the respective files). In total: 251 MB, 860,159 lines, 9,815,533 words, 101,111,885 characters.
The Sentimental LIAR dataset is a modified and further extended version of the LIAR extension introduced by Kirilin et al. In this dataset, the multi-class labelling of LIAR is converted to a binary annotation by changing the half-true, false, barely-true and pants-fire labels to False, and the remaining labels to True. Furthermore, the speaker names are converted to numerical IDs in order to avoid bias with regard to the textual representation of names. The binary-label dataset is then extended by adding sentiments derived using the Google NLP API. Sentiment analysis determines the overall attitude of the text (i.e., whether it is positive or negative), quantified by a numerical score; if the sentiment score is positive, the sample is tagged as Positive for the sentiment attribute, otherwise Negative is assigned. A further extension adds emotion scores, extracted using the IBM NLP API for each claim, that quantify the detected level of six emotional states.
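The label conversion described above is mechanical; here is a minimal sketch of the binarization and sentiment-tagging rules (illustrative code, not the authors' scripts):

    FALSE_LABELS = {"half-true", "false", "barely-true", "pants-fire"}

    def binarize(liar_label):
        # collapse LIAR's six-way truthfulness scale into True/False
        return "False" if liar_label in FALSE_LABELS else "True"

    def sentiment_tag(score):
        # positive sentiment score -> Positive, otherwise Negative
        return "Positive" if score > 0 else "Negative"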
Spades contains 93,319 questions derived from ClueWeb09 sentences. Specifically, the questions were created by randomly removing an entity, thus producing sentence-denotation pairs.
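The construction can be pictured with a tiny, hypothetical sketch, assuming entity mentions in each sentence have already been identified (the _blank_ placeholder is illustrative):

    import random

    def make_question(sentence, entities, rng=random):
        # remove one randomly chosen entity mention; the removed
        # entity becomes the denotation (answer) for the question
        ent = rng.choice(entities)
        return sentence.replace(ent, "_blank_", 1), ent

    print(make_question("Obama was born in Hawaii.", ["Obama", "Hawaii"]))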
ViText2SQL is a dataset for the Vietnamese Text-to-SQL semantic parsing task, consisting of about 10K question and SQL query pairs.
The Alexa Point of View dataset is a point-of-view conversion dataset: a parallel corpus of messages spoken to a virtual assistant and the converted messages for delivery. The dataset pairs an input message (input column) with its POV-converted message (output column); an example pair is: tell @CN@ that i'll be late [\t] hi @CN@, @SCN@ would like you to know that they'll be late. The input and POV-converted output in each pair are tab-separated. The @CN@ tag is a placeholder for the contact name (receiver) and the @SCN@ tag is a placeholder for the source contact name (sender). The dataset has 46,563 pairs in total, split into 32,594 train, 6,985 dev, and 6,985 test pairs.
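Given the tab-separated layout described above, loading the pairs is straightforward. A minimal sketch, assuming a headerless UTF-8 file (the filename is hypothetical):

    import csv

    def load_pov_pairs(path):
        # each line: input message [TAB] POV-converted output;
        # @CN@ = contact (receiver), @SCN@ = source contact (sender)
        with open(path, newline="", encoding="utf-8") as f:
            return [(row[0], row[1]) for row in csv.reader(f, delimiter="\t")]

    pairs = load_pov_pairs("pov_dataset.tsv")  # hypothetical filename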
1 PAPER • 1 BENCHMARK
The Books3 dataset emerged as part of a broader effort to train AI models for natural language understanding and generation. It comprises an extensive collection of digitized books, spanning from classics to contemporary works, gathered from various sources, including libraries and online repositories.
The Curation Corpus is a collection of 40,000 professionally-written summaries of news articles, with links to the articles themselves.
DpgMedia2019 is a Dutch news dataset for partisanship detection. It contains more than 100K articles that are labelled on the publisher level and 776 articles that were crowdsourced using an internal survey platform and labelled on the article level.
1 PAPER • NO BENCHMARKS YET
Expertly-curated benchmark dataset for fake news detection in Filipino.
Free Law Project is a leading nonprofit organization that aims to make the legal ecosystem more equitable and competitive through technology, data, and advocacy.
A dataset of natural-language data collected by combining more than 150 existing monolingual and multilingual datasets and by crawling known multilingual websites. The focus of this dataset is on 500 extremely low-resource languages.
Collection of news websites in low-resource languages.
StoryBooks for 174 unique languages.
InstructOpenWiki is a substantial instruction tuning dataset for Open-world IE enriched with a comprehensive corpus, extensive annotations, and diverse instructions.
The Kite database is a multi-modal dataset for the control of unmanned aerial vehicles (UAVs), with three modalities present in the dataset.
The Lenta Short Sentences dataset is a text dataset for language modelling for the Russian language. It consists of 236K sentences sampled from the Lenta News dataset.