157 dataset results for Language Modelling

This is a dataset of three English-language books that do not contain the letter "e". It includes all of "Gadsby" by Ernest Vincent Wright, all of "A Void" by Georges Perec, and almost all of "Eunoia" by Christian Bok (except for the single chapter that uses the letter "e").

1 PAPER • 1 BENCHMARK

NText

NText is an eight-million-word dataset extracted and preprocessed from nuclear research papers and theses.

1 PAPER • NO BENCHMARKS YET

PIE (Performance Improving Code Edits)

PIE stands for Performance Improving Code Edits. PIE contains trajectories of programs, where a programmer begins with an initial, slower version and iteratively makes changes to improve the program’s performance.

1 PAPER • NO BENCHMARKS YET

PhilPapers

PhilPapers is an online index and bibliography of philosophy publications that serves the philosophical community.

1 PAPER • 1 BENCHMARK

S-TEST

S-TEST is a benchmark for measuring the specificity of the language of pre-trained language models.

1 PAPER • NO BENCHMARKS YET

SART

SART is a collection of three datasets for Similarity, Analogies and Relatedness for the Tatar language. The three subsets are:
* Similarity dataset: 202 word pairs with averaged human similarity scores on a 0-to-10 scale. For example, "йорт, бина, 7.69".
* Relatedness dataset: 252 word pairs with averaged human relatedness scores. For example, "урам, балалар, 5.38".
* Analogies dataset: a set of analogy questions of the form A:B::C:D, meaning A is to B as C is to D, where D is to be predicted. For example, "Әнкара Төркия Париж Франция". It contains 34 categories and 30,144 questions in total.
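As a rough illustration of how word-pair rows in this format might be consumed, the sketch below parses one quoted example row and computes a Spearman rank correlation between human scores and model scores. The entry does not state which metric SART uses, so Spearman correlation is an assumption here (it is a common choice for similarity benchmarks). The function names and the toy score lists are hypothetical placeholders, not values from the dataset.

```python
def parse_pair_line(line):
    """Split a row such as 'йорт, бина, 7.69' into (word1, word2, score)."""
    word1, word2, score = (field.strip() for field in line.split(","))
    return word1, word2, float(score)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# The example row quoted in the description above.
example_row = parse_pair_line("йорт, бина, 7.69")

# Made-up placeholder scores, only to exercise the function.
toy_human_scores = [1.0, 2.0, 3.0]
toy_model_scores = [0.1, 0.3, 0.2]
rho = spearman(toy_human_scores, toy_model_scores)
```

In practice the model scores would come from something like cosine similarity between word embeddings, which is outside what this entry describes.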

1 PAPER • NO BENCHMARKS YET

SLING (Sino LINGuistics)

SLING consists of 38K minimal sentence pairs in Mandarin Chinese grouped into 9 high-level linguistic phenomena. Each pair demonstrates the acceptability contrast of a specific syntactic or semantic phenomenon (e.g., The keys are lost vs. The keys is lost), and an LM should assign lower perplexity to the acceptable sentence.
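The minimal-pair criterion described above can be sketched as follows. This is an illustration under stated assumptions, not SLING's evaluation code: a tiny add-one smoothed bigram model stands in for a real LM, tokenisation is by whitespace (which would not suit Mandarin text), and the toy corpus and all function names are hypothetical. Only the example pair comes from the description.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Collect context and bigram counts from whitespace-tokenised sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        contexts.update(tokens[:-1])            # tokens that have a successor
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])
    # +1 reserves probability mass for a single unseen-token type.
    return contexts, bigrams, len(vocab) + 1


def perplexity(sentence, model):
    """Per-token perplexity under the add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split()
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))


def minimal_pair_accuracy(pairs, model):
    """Fraction of (acceptable, unacceptable) pairs where the acceptable
    sentence receives strictly lower perplexity."""
    correct = sum(
        perplexity(good, model) < perplexity(bad, model) for good, bad in pairs
    )
    return correct / len(pairs)


# Hypothetical toy corpus; the pair is the example from the description.
toy_corpus = ["the keys are lost", "the key is lost", "the keys are here"]
toy_pairs = [("the keys are lost", "the keys is lost")]

model = train_bigram(toy_corpus)
accuracy = minimal_pair_accuracy(toy_pairs, model)
```

With a real pre-trained LM, `perplexity` would be replaced by the model's own token-level log-probabilities, while the pairwise comparison would stay the same.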

1 PAPER • NO BENCHMARKS YET

SVLD (Social Vision and Language Dataset)

The Social Vision and Language Dataset (SVLD) is a large-scale multimodal dataset designed for research into social contextual learning.

1 PAPER • NO BENCHMARKS YET

Stack Exchange

The Stack Exchange dataset is a collection of data from various Stack Exchange sites, including Stack Overflow, Mathematics, Super User, and many others. It includes questions, answers, comments, tags, and other related data from these sites.

1 PAPER • 1 BENCHMARK

USPTO Backgrounds

The USPTO Backgrounds dataset provides information related to patents and trademarks from the United States Patent and Trademark Office (USPTO).

1 PAPER • 1 BENCHMARK

Verified Smart Contracts

Verified Smart Contracts is a dataset of real Ethereum smart contracts, containing both Solidity and Vyper source code. It consists of every Ethereum smart contract deployed as of 1 April 2022 that has been verified on Etherscan and has at least one transaction. A total of 186,397 unique smart contracts are provided, filtered down from 2,217,692 smart contracts. The dataset contains 53,843,305 lines of code.

1 PAPER • NO BENCHMARKS YET

VietMed (A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain)

We introduce a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. To the best of our knowledge, VietMed is by far the world’s largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country.

1 PAPER • 2 BENCHMARKS

language-modeling-recommendation

This is the BIG-bench version of our language-based movie recommendation dataset.

1 PAPER • 1 BENCHMARK
