157 dataset results for Language Modelling

Tencent ML-Images is a large open-source multi-label image database, including 17,609,752 training and 88,739 validation image URLs, which are annotated with up to 11,166 categories.

5 PAPERS • NO BENCHMARKS YET

Text8

Desc: About of Text8

21 PAPERS • 1 BENCHMARK

TweetEval

TweetEval introduces an evaluation framework consisting of seven heterogeneous Twitter-specific classification tasks.

72 PAPERS • 2 BENCHMARKS

USPTO Backgrounds

The USPTO Backgrounds dataset provides information related to patents and trademarks from the United States Patent and Trademark Office (USPTO).

1 PAPER • 1 BENCHMARK

ViText2SQL

ViText2SQL is a dataset for the Vietnamese Text-to-SQL semantic parsing task, consisting of about 10K question and SQL query pairs.

2 PAPERS • NO BENCHMARKS YET

WNLaMPro

WNLaMPro (WordNet Language Model Probing)

The WordNet Language Model Probing (WNLaMPro) dataset consists of relations between keywords and words. It contains 4 different kinds of relations: Antonym, Hypernym, Cohyponym and Corruption.

6 PAPERS • NO BENCHMARKS YET

WiC

WiC (Words in Context)

WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each instance in WiC has a target word w, either a verb or a noun, for which two contexts are provided. Each of these contexts triggers a specific meaning of w. The task is to identify whether the occurrences of w in the two contexts correspond to the same meaning or not. The dataset can also be viewed as an application of Word Sense Disambiguation in practice.
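The task format described above can be sketched in a few lines of Python. This is a minimal illustration, not the official WiC loader: the field names (`target`, `pos`, `context_1`, `context_2`, `same_meaning`), the `always_same` baseline, and the two example instances are all hypothetical and hand-written, not taken from the dataset.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class WiCInstance:
    """One WiC-style example (field names are assumptions for illustration)."""
    target: str          # the target word w
    pos: str             # "N" for noun or "V" for verb
    context_1: str       # first sentence containing w
    context_2: str       # second sentence containing w
    same_meaning: bool   # gold label: does w carry the same meaning in both?


def accuracy(instances: Sequence[WiCInstance],
             predict: Callable[[WiCInstance], bool]) -> float:
    """Fraction of instances where the binary prediction matches the gold label."""
    if not instances:
        raise ValueError("need at least one instance")
    correct = sum(predict(x) == x.same_meaning for x in instances)
    return correct / len(instances)


def always_same(instance: WiCInstance) -> bool:
    """Trivial baseline: always predict that the two uses share a meaning."""
    return True


# Hand-written toy instances, NOT drawn from the actual dataset.
EXAMPLES = [
    WiCInstance("bank", "N",
                "She sat on the bank of the river.",
                "He deposited the cheque at the bank.",
                False),
    WiCInstance("run", "V",
                "They run every morning.",
                "I run before breakfast.",
                True),
]

if __name__ == "__main__":
    print(accuracy(EXAMPLES, always_same))
```

A real system would replace `always_same` with a function that compares contextual representations of w in the two sentences; the evaluation loop stays the same.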

169 PAPERS • NO BENCHMARKS YET

Wiki-40B

Wiki-40B is a multilingual language model benchmark composed of 40+ languages spanning several scripts and linguistic families. It contains around 40 billion characters and aims to accelerate research on multilingual modeling.

24 PAPERS • 3 BENCHMARKS

WikiText-TL-39

WikiText-TL-39 is a benchmark language modeling dataset in Filipino that has 39 million tokens in the training set.

3 PAPERS • NO BENCHMARKS YET

Winogender Schemas

Winogender Schemas is a novel, Winograd schema-style set of minimal pair sentences that differ only by pronoun gender.

6 PAPERS • NO BENCHMARKS YET

WritingPrompts

WritingPrompts is a large dataset of 300K human-written stories paired with writing prompts from an online forum.

94 PAPERS • 1 BENCHMARK

caWaC

caWaC is the largest existing corpus of Catalan, containing 687 million words. This is a significant increase over the previously largest Catalan corpus, CuCWeb, which counts 166 million words.

5 PAPERS • NO BENCHMARKS YET

irc-disentanglement

This is a dataset for disentangling conversations on IRC, which is the task of identifying separate conversations in a single stream of messages. It contains disentanglement information for 77,563 messages of IRC.

4 PAPERS • 3 BENCHMARKS
