157 dataset results for Language Modelling

OCNLI (Original Chinese Natural Language Inference)

OCNLI stands for Original Chinese Natural Language Inference. It is a corpus for Chinese natural language inference, collected by closely following the procedures of MNLI but with enhanced strategies aimed at producing more challenging inference pairs. No human or machine translation was used in creating the dataset, so the Chinese texts are original rather than translated.

41 PAPERS • 3 BENCHMARKS

RWC (Real World Computing Music Database)

The RWC (Real World Computing) Music Database is a copyright-cleared music database (DB) that is available to researchers as a common foundation for research. It contains around 100 complete songs with manually labeled section boundaries. For the 50 instruments, individual sounds at half-tone intervals were captured with several variations of playing styles, dynamics, instrument manufacturers and musicians.

41 PAPERS • NO BENCHMARKS YET

DART

DART is a large dataset for open-domain structured data record to text generation. DART consists of 82,191 examples across different domains with each input being a semantic RDF triple set derived from data records in tables and the tree ontology of the schema, annotated with sentence descriptions that cover all facts in the triple set.

40 PAPERS • 3 BENCHMARKS

MLSUM (MultiLingual SUMmarization)

MLSUM is a large-scale multilingual summarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset that can enable new research directions for the text summarization community.

40 PAPERS • 7 BENCHMARKS

SciDocs

The SciDocs evaluation framework is a suite of evaluation tasks designed for document-level tasks.

40 PAPERS • 3 BENCHMARKS

SMM4H (Social Media Mining for Health Shared Task)

The Social Media Mining for Health (SMM4H) Shared Task is a massive data source for biomedical and public health applications.

39 PAPERS • NO BENCHMARKS YET

Tatoeba

Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. It is available in more than 400 languages. Its name comes from the Japanese phrase “tatoeba” (例えば), meaning “for example”. It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as Tatoebans.

37 PAPERS • 360 BENCHMARKS

Arxiv HEP-TH citation graph

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a dataset of 27,770 papers with 352,807 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this. The data covers papers in the period from January 1993 to April 2003 (124 months).
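
The edge convention above can be sketched in a few lines of Python. This is a minimal illustration on toy data, not the dataset's own loader: the function name `build_citation_graph` and the paper identifiers are hypothetical. An edge `i -> j` means paper `i` cites paper `j`, and any citation touching a paper outside the dataset is dropped.

```python
from collections import defaultdict


def build_citation_graph(citations, papers):
    """Build adjacency sets for a directed citation graph.

    An edge i -> j means paper i cites paper j. Citations that involve a
    paper outside `papers` are dropped, mirroring the convention that the
    graph holds no information about papers outside the dataset.
    """
    graph = defaultdict(set)
    for citing, cited in citations:
        if citing in papers and cited in papers:
            graph[citing].add(cited)
    return graph


# Toy data with hypothetical identifiers (not real arXiv IDs).
papers = {"A", "B", "C"}
citations = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "X")]  # "X" is outside the dataset

graph = build_citation_graph(citations, papers)

# Total number of directed edges kept in the graph.
num_edges = sum(len(targets) for targets in graph.values())

# In-degree of a paper = how many papers inside the dataset cite it.
in_degree = defaultdict(int)
for targets in graph.values():
    for cited in targets:
        in_degree[cited] += 1
```

Storing outgoing edges as sets keeps duplicate citation records from inflating the edge count; in-degrees are derived in a second pass because the graph only stores outgoing edges.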

34 PAPERS • 9 BENCHMARKS

PeerRead

PeerRead is a dataset of scientific peer reviews. The dataset consists of over 14K paper drafts and the corresponding accept/reject decisions in top-tier venues including ACL, NIPS and ICLR, as well as over 10K textual peer reviews written by experts for a subset of the papers.

32 PAPERS • NO BENCHMARKS YET

Worldtree

Worldtree is a corpus of explanation graphs, explanatory role ratings, and an associated tablestore. It contains explanation graphs for 1,680 questions and 4,950 tablestore rows across 62 semi-structured tables. This data is intended to be paired with the AI2 Mercury Licensed questions.

32 PAPERS • NO BENCHMARKS YET

ChID (Chinese IDiom dataset)

ChID is a large-scale Chinese IDiom dataset for cloze tests. ChID contains 581K passages and 729K blanks and covers multiple domains. In ChID, the idioms in a passage are replaced with blank symbols. For each blank, a list of candidate idioms, including the golden idiom, is provided as choices.

31 PAPERS • 3 BENCHMARKS

MuseData

MuseData is an electronic library of orchestral and piano classical music from CCARH. It consists of 783 files totaling around 3 MB.

27 PAPERS • NO BENCHMARKS YET

SentiCap

The SentiCap dataset contains several thousand images with captions with positive and negative sentiments. These sentimental captions are constructed by the authors by re-writing factual descriptions. In total there are 2000+ sentimental captions.

26 PAPERS • NO BENCHMARKS YET

KPTimes

KPTimes is a large-scale dataset of news texts paired with editor-curated keyphrases.

25 PAPERS • 3 BENCHMARKS

KELM

KELM is a large-scale synthetic corpus of Wikidata KG as natural text.

24 PAPERS • NO BENCHMARKS YET

MassiveText

MassiveText is a collection of large English-language text datasets from multiple sources: web pages, books, news articles, and code. The data pipeline includes text quality filtering, removal of repetitious text, deduplication of similar documents, and removal of documents with significant test-set overlap. MassiveText contains 2.35 billion documents or about 10.5 TB of text.

24 PAPERS • NO BENCHMARKS YET

Natural Stories

The Natural Stories dataset consists of English texts edited to contain many low-frequency syntactic constructions while still sounding fluent to native speakers. The corpus is annotated with hand-corrected parse trees and includes self-paced reading time data.

24 PAPERS • NO BENCHMARKS YET

Wiki-40B

Wiki-40B is a multilingual language model benchmark composed of 40+ languages spanning several scripts and linguistic families. It contains around 40 billion characters and aims to accelerate research on multilingual modeling.

24 PAPERS • 3 BENCHMARKS

AdvGLUE (Adversarial GLUE)

Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. In particular, we systematically apply 14 textual adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.

22 PAPERS • 1 BENCHMARK

Text8

Text8 is a character-level language modelling dataset derived from cleaned English Wikipedia text.

21 PAPERS • 1 BENCHMARK

Senseval-2

There are now many computer programs for automatically determining the sense of a word in context (Word Sense Disambiguation or WSD). The purpose of SENSEVAL is to evaluate the strengths and weaknesses of such programs with respect to different words, different varieties of language, and different languages.

19 PAPERS • NO BENCHMARKS YET

CLOTH (CLOze test by TeacHers)

The Cloze Test by Teachers (CLOTH) benchmark is a collection of nearly 100,000 4-way multiple-choice cloze-style questions from middle- and high school-level English language exams, where the answer fills a blank in a given text. Each question is labeled with the type of deep reasoning it involves; the four possible types are grammar, short-term reasoning, matching/paraphrasing, and long-term reasoning, i.e., reasoning over multiple sentences.

18 PAPERS • NO BENCHMARKS YET

Taskmaster-1

Taskmaster-1 is a dialog dataset consisting of 13,215 task-based dialogs in English, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations.

18 PAPERS • 1 BENCHMARK

CASIA-HWDB

CASIA-HWDB is a dataset for handwritten Chinese character recognition. It contains 300 files (240 in HWDB1.1 training set and 60 in HWDB1.1 test set). Each file contains about 3000 isolated gray-scale Chinese character images written by one writer, as well as their corresponding labels.

17 PAPERS • NO BENCHMARKS YET

PTB Diagnostic ECG Database

The ECGs in this collection were obtained using a non-commercial, PTB prototype recorder with the following specifications:

16 PAPERS • 4 BENCHMARKS

BeerAdvocate

BeerAdvocate is a dataset of beer reviews from the BeerAdvocate website. The data span a period of more than 10 years, including all ~1.5 million reviews up to November 2011. Each review includes ratings for five "aspects": appearance, aroma, palate, taste, and overall impression. Reviews include product and user information, followed by each of these five ratings and a plaintext review.

14 PAPERS • 1 BENCHMARK

CMU DoG (CMU Document Grounded Conversations Dataset)

This is a document grounded dataset for text conversations. "Document Grounded Conversations" are conversations that are about the contents of a specified document. In this dataset the specified documents are Wikipedia articles about popular movies. The dataset contains 4112 conversations with an average of 21.43 turns per conversation.

14 PAPERS • NO BENCHMARKS YET

Humicroedit

Humicroedit is a humorous headline dataset. The data consists of regular English news headlines paired with versions of the same headlines that contain simple replacement edits designed to make them funny. The authors carefully curated crowdsourced editors to create funny headlines and judges to score a total of 15,095 edited headlines, with five judges per headline.

14 PAPERS • NO BENCHMARKS YET

IndoNLU Benchmark

The IndoNLU benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems for Bahasa Indonesia. It is a joint venture of many Indonesian NLP enthusiasts from different institutions such as Gojek, Institut Teknologi Bandung, HKUST, Universitas Multimedia Nusantara, Prosa.ai, and Universitas Indonesia.

14 PAPERS • 2 BENCHMARKS

ANTIQUE

ANTIQUE is a collection of 2,626 open-domain non-factoid questions from a diverse set of categories. The dataset contains 34,011 manual relevance annotations. The questions were asked by real users in a community question answering service, i.e., Yahoo! Answers. Relevance judgments for all the answers to each question were collected through crowdsourcing.

13 PAPERS • NO BENCHMARKS YET

OVAD benchmark (Open-Vocabulary Attribute Detection)

Vision-language modeling has enabled open-vocabulary tasks where predictions can be queried using any text prompt in a zero-shot manner. Existing open-vocabulary tasks focus on object classes, whereas research on object attributes is limited due to the lack of a reliable attribute-focused evaluation benchmark. This paper introduces the Open-Vocabulary Attribute Detection (OVAD) task and the corresponding OVAD benchmark. The objective of the novel task and benchmark is to probe object-level attribute information learned by vision-language models. To this end, we created a clean and densely annotated test set covering 117 attribute classes on the 80 object classes of MS COCO. It includes positive and negative annotations, which enables open-vocabulary evaluation. Overall, the benchmark consists of 1.4 million annotations. For reference, we provide a first baseline method for open-vocabulary attribute detection. Moreover, we demonstrate the benchmark's value by studying the attribute dete

13 PAPERS • 2 BENCHMARKS

CoDraw

The Collaborative Drawing game (CoDraw) dataset contains ~10K dialogs consisting of ~138K messages exchanged between human players in the CoDraw game. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pieces in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip art pieces. The two players communicate with each other using natural language.

12 PAPERS • NO BENCHMARKS YET

FewCLUE

Chinese Few-shot Learning Evaluation Benchmark (FewCLUE) is a comprehensive small sample evaluation benchmark in Chinese. It includes nine tasks, ranging from single-sentence and sentence-pair classification tasks to machine reading comprehension tasks.

12 PAPERS • 5 BENCHMARKS

Hutter Prize

The Hutter Prize Wikipedia dataset, also known as enwik8, is a byte-level dataset consisting of the first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset. Within these 100 million bytes are 205 unique tokens.
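
As a rough illustration of what "byte-level" means here, the sketch below counts the distinct byte values in a prefix of some raw data. The helper `byte_vocabulary` and the toy string are assumptions for demonstration only; running it on the actual dump would require reading the file in binary mode first.

```python
def byte_vocabulary(data: bytes, limit: int = 100_000_000):
    """Return the sorted distinct byte values in the first `limit` bytes."""
    prefix = data[:limit]
    # Iterating over a bytes object yields integers in the range 0..255.
    return sorted(set(prefix))


# Toy stand-in for the dump: non-ASCII characters occupy several bytes in
# UTF-8, so the byte count differs from the character count.
text = "naïve café"
data = text.encode("utf-8")

vocab = byte_vocabulary(data)
num_chars = len(text)
num_bytes = len(data)
num_unique_bytes = len(vocab)
```

The toy string shows why byte-level and character-level views differ: each accented letter is one character but two bytes, so treating the data as "character-level" is a simplification.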

12 PAPERS • 1 BENCHMARK

Dakshina

The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes a large collection of native script Wikipedia text, a romanization lexicon which consists of words in the native script with attested romanizations, and some full sentence parallel data in both a native script of the language and the basic Latin alphabet.

11 PAPERS • NO BENCHMARKS YET

Do-Not-Answer

Do-Not-Answer is a dataset to evaluate safeguards in large language models, and deploy safer open-source LLMs at a low cost. The dataset is curated and filtered to consist only of instructions that responsible language models should not follow. We annotate and assess the responses of six popular LLMs to these instructions.

10 PAPERS • NO BENCHMARKS YET

ART Dataset (Abductive Reasoning in narrative Text)

ART consists of over 20k commonsense narrative contexts and 200k explanations.

9 PAPERS • NO BENCHMARKS YET

BabyLM

BabyLM is a dataset for small scale language modeling, human language acquisition, low-resource NLP, and cognitive modeling. In partnership with CoNLL and CMCL, it provides a platform for approaches to pretraining with a limited-size corpus sourced from data inspired by the input to children. The task has three tracks, two of which restrict the training data to pre-released datasets of 10M and 100M words and are dedicated to explorations of approaches such as architectural variations, self-supervised objectives, or curriculum learning. The final track only restricts the amount of text used, allowing innovation in the choice of the data, its domain, and even its modality (i.e., data from sources other than text is welcome).

9 PAPERS • NO BENCHMARKS YET

ComQA

ComQA is a large dataset of real user questions that exhibit different challenging aspects such as compositionality, temporal reasoning, and comparisons. ComQA questions come from the WikiAnswers community QA platform, which typically contains questions that are not satisfactorily answerable by existing search engine technology.

9 PAPERS • NO BENCHMARKS YET

IndoSum

The IndoSum dataset is a benchmark dataset for Indonesian text summarization. The dataset consists of news articles and manually constructed summaries.

9 PAPERS • NO BENCHMARKS YET

PersonalDialog

PersonalDialog is a large-scale multi-turn dialogue dataset containing various traits from a large number of speakers. The dataset consists of 20.83M sessions and 56.25M utterances from 8.47M speakers. Each utterance is associated with a speaker who is marked with traits like Age, Gender, Location, Interest Tags, etc. Several anonymization schemes are designed to protect the privacy of each speaker.

9 PAPERS • NO BENCHMARKS YET

CC-Stories

CC-Stories (or STORIES) is a dataset for common sense reasoning and language modeling. It was constructed by aggregating the documents from the CommonCrawl dataset that have the most overlapping n-grams with the questions in commonsense reasoning tasks. The top 1.0% of highest-ranked documents are chosen as the new training corpus.
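
The selection step can be sketched as follows. This is a simplified illustration under stated assumptions, not the original pipeline: the description does not specify the n-gram sizes, tokenization, or scoring weights, so the sketch uses whitespace tokens, bigrams, and a raw count of shared n-grams. The function names and toy texts are hypothetical.

```python
import math


def ngrams(text, n):
    """Return the set of word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def select_top_documents(documents, questions, n=2, top_fraction=0.01):
    """Rank documents by n-gram overlap with the questions; keep the top fraction."""
    question_ngrams = set()
    for question in questions:
        question_ngrams |= ngrams(question, n)

    # Score = number of distinct n-grams shared with any question (assumed metric).
    ranked = sorted(
        documents,
        key=lambda doc: len(ngrams(doc, n) & question_ngrams),
        reverse=True,
    )
    # Always keep at least one document so tiny toy corpora are not emptied.
    keep = max(1, math.ceil(len(documents) * top_fraction))
    return ranked[:keep]


# Toy corpus and question (hypothetical examples).
documents = [
    "the trophy does not fit in the suitcase",
    "stock prices fell sharply on monday",
    "the suitcase is too small for the trophy",
]
questions = ["the trophy does not fit in the suitcase because it is too small"]

top_one_percent = select_top_documents(documents, questions)
top_half = select_top_documents(documents, questions, top_fraction=0.5)
```

With only three toy documents, the default 1% cutoff rounds up to a single document; on a CommonCrawl-scale corpus the same cutoff would retain a large training set.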

8 PAPERS • NO BENCHMARKS YET

Open-Platypus

Open-Platypus is a family of fine-tuned and merged Large Language Models (LLMs) that achieves the strongest performance and currently stands at first place in HuggingFace's Open LLM Leaderboard.

8 PAPERS • NO BENCHMARKS YET

PEYMA

Peyma is a Persian NER dataset to train and test NER systems. It is constructed by collecting documents from ten news websites.

8 PAPERS • NO BENCHMARKS YET

TUT Sound Events 2017

The TUT Sound Events 2017 dataset contains 24 audio recordings made in a street environment, covering 6 classes: brakes squeaking, car, children, large vehicle, people speaking, and people walking.

8 PAPERS • NO BENCHMARKS YET

UKP (UKP Argument Annotated Essays)

The UKP Argument Annotated Essays corpus consists of argument annotated persuasive essays including annotations of argument components and argumentative relations.

8 PAPERS • NO BENCHMARKS YET

WMT 2018 News (WMT 2018 News Translation Task)

News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 7 languages (Chinese, Czech, Estonian, German, Finnish, Russian, Turkish) and an additional 1500 sentences from each of the 7 languages translated into English. The sentences were selected from dozens of news websites and translated by professional translators.

8 PAPERS • NO BENCHMARKS YET

Definite Pronoun Resolution Dataset

The dataset is composed of sentence pairs (i.e., twin sentences).

7 PAPERS • NO BENCHMARKS YET
