6 dataset results for Masked Language Modeling

The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.

1,063 PAPERS • 24 BENCHMARKS

WikiText-2

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

808 PAPERS • 3 BENCHMARKS

C4 (Colossal Clean Crawled Corpus)

C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It was based on Common Crawl dataset: https://commoncrawl.org. It was used to train the T5 text-to-text Transformer models.

633 PAPERS • 1 BENCHMARK

CORD-19

CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community.

157 PAPERS • 2 BENCHMARKS

Yelp

The Yelp Dataset is a valuable resource for academic research, teaching, and learning. It provides a rich collection of real-world data related to businesses, reviews, and user interactions. Here are the key details about the Yelp Dataset: Reviews: A whopping 6,990,280 reviews from users. Businesses: Information on 150,346 businesses. Pictures: A collection of 200,100 pictures. Metropolitan Areas: Data from 11 metropolitan areas. Tips: Over 908,915 tips provided by 1,987,897 users. Business Attributes: Details like hours, parking availability, and ambiance for more than 1.2 million businesses. Aggregated Check-ins: Historical check-in data for each of the 131,930 businesses.

68 PAPERS • 21 BENCHMARKS

DiFair

DiFair serves as a meticulous endeavor to address the oversight in evaluating the impact of bias mitigation on useful gender knowledge while assessing gender neutrality in pretrained language models. This metric delves into not only quantifying a model's biased tendencies but also assessing the preservation of useful gender knowledge.

1 PAPER • NO BENCHMARKS YET

Datasets

6 dataset results for Masked Language Modeling