This is a dataset of 3 English books which do not contain the letter "e" in them. This dataset includes all of "Gadsby" by Ernest Vincent Wright, all of "A Void" by Georges Perec, and almost all of "Eunoia" by Christian Bok (except for the single chapter that uses the letter "e" in it)
1 PAPER • 1 BENCHMARK
NText is an eight million words dataset extracted and preprocessed from nuclear research papers and thesis.
1 PAPER • NO BENCHMARKS YET
PIE stands for Performance Improving Code Edits. PIE contains trajectories of programs, where a programmer begins with an initial, slower version and iteratively makes changes to improve the program’s performance.
PhilPapers is a remarkable resource for the philosophical community. Let me break it down for you:
S-TEST is a benchmark for measuring the specificity of the language of pre-trained language models.
SART is a collection of three datasets for Similarity, Analogies and Relatedness for the Tatar language. The three subsets are: * Similarity dataset - 202 pairs of words along with averaged human scores of similarity degree between the words (in 0-to-10 scale). For example, "йорт, бина, 7.69". * Relatedness dataset - 252 pairs of words along with averaged human scores of relatedness degree between the words. For example, "урам, балалар, 5.38". * Analogies dataset - set of analytical questions of the form A:B::C:D, meaning A to B as C to D, and D is to be predicted. For example, "Әнкара Төркия Париж Франция". Contains 34 categories, and in total 30 144 questions.
SLING consists of 38K minimal sentence pairs in Mandarin Chinese grouped into 9 high-level linguistic phenomena. Each pair demonstrates the acceptability contrast of a specific syntactic or semantic phenomenon (e.g., The keys are lost vs. The keys is lost), and an LM should assign lower perplexity to the acceptable sentence.
The social vision and language dataset is a large-scale multimodal dataset designed for research into social contextual learning.
The Stack Exchange dataset is a collection of data from various Stack Exchange sites, including Stack Overflow, Mathematics, Super User, and many others. It includes questions, answers, comments, tags, and other related data from these sites.
The USPTO Backgrounds dataset provides valuable information related to patents and trademarks. Here are some key datasets available from the United States Patent and Trademark Office (USPTO):
Verified Smart Contracts is a dataset of real Ethereum smart contracts, containing both Solidity and Vyper source code. It consists of every deployed Ethereum smart contract as of 1st of April 2022, whose been verified on Etherscan and has a least one transaction. A total of 186,397 unique smart contracts are provided, filtered down from 2,217,692 smart contracts. The dataset contains 53,843,305 lines of code.
We introduced a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. To our best knowledge, VietMed is by far the world’s largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country.
1 PAPER • 2 BENCHMARKS
This is the Big-Bench version of our language-based movie recommendation dataset