66 dataset results for Language Modelling AND English

WMT 2018 News (WMT 2018 News Translation Task)

News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 7 languages (Chinese, Czech, Estonian, German, Finnish, Russian, Turkish) and an additional 1500 sentences from each of the 7 languages translated into English. The sentences were selected from dozens of news websites and translated by professional translators.

8 PAPERS • NO BENCHMARKS YET

Coached Conversational Preference Elicitation

Coached Conversational Preference Elicitation is a dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. It was collected using a Wizard-of-Oz methodology between two paid crowd-workers, where one worker plays the role of an 'assistant', while the other plays the role of a 'user'.

5 PAPERS • NO BENCHMARKS YET

arXiv-10

Benchmark dataset for abstracts and titles of 100,000 ArXiv scientific papers. This dataset contains 10 classes and is balanced (exactly 10,000 per class). The classes include subcategories of computer science, physics, and math.

4 PAPERS • 1 BENCHMARK

CCPE-M

CCPE-M (Coached Conversational Preference Elicitation dataset for Movies)

A dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language.

3 PAPERS • NO BENCHMARKS YET

Databricks Dolly 15k

Databricks Dolly 15k (databricks-dolly-15k)

Databricks Dolly 15k is a dataset containing 15,000 high-quality human-generated prompt / response pairs specifically designed for instruction tuning large language models. It was authored by more than 5,000 Databricks employees during March and April of 2023. The training records are natural, expressive and designed to represent a wide range of behaviors, from brainstorming and content generation to information extraction and summarization.

3 PAPERS • NO BENCHMARKS YET

WikiConvert

Wiki-Convert is a dataset of 900,000+ sentences with precise number annotations from English Wikipedia. It relies on Wiki contributors' annotations in the form of a {{Convert}} template.
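As a rough illustration of the kind of annotation this entry refers to, the sketch below pulls the number and unit out of simple {{convert|value|unit|...}} calls in wikitext. It is not the Wiki-Convert authors' extraction pipeline; the regular expression and the function name `extract_conversions` are assumptions made for this example, and more elaborate template forms (ranges, named parameters) are not handled.

```python
import re

# Hypothetical pattern: matches the first two positional arguments of a
# simple {{convert|<value>|<unit>...}} template call, case-insensitively.
CONVERT_RE = re.compile(
    r"\{\{\s*convert\s*\|\s*([^|{}]+?)\s*\|\s*([^|{}]+?)\s*(?:\||\}\})",
    re.IGNORECASE,
)


def extract_conversions(wikitext: str) -> list[tuple[str, str]]:
    """Return (value, unit) string pairs for each simple convert template."""
    return [(m.group(1), m.group(2)) for m in CONVERT_RE.finditer(wikitext)]


sample = "The river is {{convert|120|km|mi}} long and {{Convert|3.5|m}} wide."
conversions = extract_conversions(sample)
```

Values are kept as strings so that the original precision written by the contributor is preserved.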

3 PAPERS • NO BENCHMARKS YET

ChrEn (Cherokee-English Parallel Dataset)

Cherokee-English Parallel Dataset is a low-resource dataset of 14,151 pairs of sentences with around 313K English tokens and 206K Cherokee tokens. The parallel corpus is accompanied by a monolingual Cherokee dataset of 5,120 sentences. Both datasets are mostly derived from Cherokee monolingual books.

2 PAPERS • NO BENCHMARKS YET

Circa

The Circa (meaning ‘approximately’) dataset aims to help machine learning systems to solve the problem of interpreting indirect answers to polar questions.

2 PAPERS • NO BENCHMARKS YET

Comparative Question Completion

Comparative Question Completion is a dataset for evaluating what large language models learn.

2 PAPERS • NO BENCHMARKS YET

PubMed Cognitive Control Abstracts

PubMed Cognitive Control Abstracts (CogText)

A collection of 385,705 scientific abstracts about Cognitive Control and their GPT-3 embeddings.

2 PAPERS • NO BENCHMARKS YET

RTC

RTC (Reddit Time Corpus)

RTC is a benchmark corpus of social media comments sampled over three years. The corpus consists of 36.36m unlabelled comments for adaptation and evaluation on an upstream masked language modelling task as well as 0.9m labelled comments for finetuning and evaluation on a downstream document classification task. The Reddit Time Corpus (RTC) covers three years between March 2017 and February 2020 and is split into 36 evenly-sized monthly subsets based on comment timestamps. RTC is sampled from the Pushshift Reddit dataset.
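The monthly split described above can be sketched as a mapping from a comment timestamp to one of 36 month indices. This is only an illustration under assumptions: it supposes each comment carries a Unix timestamp in seconds (called `created_utc` here), and it covers the bucketing step only, not the sampling that makes the subsets evenly sized. The function name `month_index` is hypothetical.

```python
from datetime import datetime, timezone

START_YEAR, START_MONTH = 2017, 3  # March 2017, the first month covered
NUM_MONTHS = 36                    # runs through February 2020


def month_index(created_utc: int) -> int:
    """Map a Unix timestamp (seconds, UTC) to a 0-based monthly subset index."""
    dt = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    index = (dt.year - START_YEAR) * 12 + (dt.month - START_MONTH)
    if not 0 <= index < NUM_MONTHS:
        raise ValueError(f"timestamp {created_utc} is outside the corpus window")
    return index


first = month_index(int(datetime(2017, 3, 1, tzinfo=timezone.utc).timestamp()))
last = month_index(int(datetime(2020, 2, 29, tzinfo=timezone.utc).timestamp()))
```

Here `first` is the index of the March 2017 subset and `last` the index of the February 2020 subset.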

2 PAPERS • NO BENCHMARKS YET

SLNET

SLNET (SLNET: A Redistributable Corpus of 3rd-party Simulink Models)

SLNET is a collection of third-party Simulink models. It is curated by mining open-source repositories (GitHub and MATLAB Central) using SLNET-Miner (https://github.com/50417/SLNet_Miner).

2 PAPERS • NO BENCHMARKS YET

Alexa Point of View

The Alexa Point of View dataset is a point-of-view (POV) conversion dataset: a parallel corpus of messages spoken to a virtual assistant and the converted messages for delivery. Each record pairs an input message (input column) with its POV-converted message (output column), separated by a tab. An example pair is tell @CN@ that i'll be late [\t] hi @CN@, @SCN@ would like you to know that they'll be late. The @CN@ tag is a placeholder for the contact name (receiver), and the @SCN@ tag is a placeholder for the source contact name (sender). The dataset has 46563 pairs in total, split into test/train/dev sets of 6985/32594/6985 pairs.
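A minimal sketch of reading one such tab-separated pair and filling in the placeholders might look like the following. The names `PovPair`, `parse_pov_line`, and `fill_placeholders`, as well as the contact names used, are hypothetical and chosen only for illustration; the actual file layout may differ.

```python
from typing import NamedTuple


class PovPair(NamedTuple):
    source: str  # message spoken to the assistant (input column)
    target: str  # POV-converted message for delivery (output column)


def parse_pov_line(line: str) -> PovPair:
    """Split one tab-separated record into its input and output messages."""
    source, target = line.rstrip("\n").split("\t", maxsplit=1)
    return PovPair(source.strip(), target.strip())


def fill_placeholders(text: str, receiver: str, sender: str) -> str:
    """Replace the @CN@ (receiver) and @SCN@ (sender) tags with names."""
    return text.replace("@SCN@", sender).replace("@CN@", receiver)


record = "tell @CN@ that i'll be late\thi @CN@, @SCN@ would like you to know that they'll be late"
pair = parse_pov_line(record)
message = fill_placeholders(pair.target, receiver="Sam", sender="Alex")
```

Keeping the tags in place until delivery time means the same converted template can be reused for any sender and receiver.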

1 PAPER • 1 BENCHMARK

Kite

The Kite database is a multi-modal dataset for the control of unmanned aerial vehicles (UAVs). There are three modalities present in the dataset:

1 PAPER • NO BENCHMARKS YET

Lipogram-e

This is a dataset of 3 English books that do not contain the letter "e". It includes all of "Gadsby" by Ernest Vincent Wright, all of "A Void" by Georges Perec, and almost all of "Eunoia" by Christian Bök (all except the single chapter that uses the letter "e").

1 PAPER • 1 BENCHMARK

SVLD (Social Vision and Language Dataset)

The social vision and language dataset is a large-scale multimodal dataset designed for research into social contextual learning.

1 PAPER • NO BENCHMARKS YET

Verified Smart Contracts

Verified Smart Contracts is a dataset of real Ethereum smart contracts, containing both Solidity and Vyper source code. It consists of every Ethereum smart contract deployed as of 1 April 2022 that has been verified on Etherscan and has at least one transaction. A total of 186,397 unique smart contracts are provided, filtered down from 2,217,692 smart contracts. The dataset contains 53,843,305 lines of code.

1 PAPER • NO BENCHMARKS YET

language-modeling-recommendation

This is the Big-Bench version of our language-based movie recommendation dataset.

1 PAPER • 1 BENCHMARK
