Tencent ML-Images is a large open-source multi-label image database, including 17,609,752 training and 88,739 validation image URLs, which are annotated with up to 11,166 categories.
5 PAPERS • NO BENCHMARKS YET
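Text8 consists of the first 100 million characters of a cleaned English Wikipedia dump and is commonly used for character-level language modeling.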
21 PAPERS • 1 BENCHMARK
TweetEval introduces an evaluation framework consisting of seven heterogeneous Twitter-specific classification tasks.
72 PAPERS • 2 BENCHMARKS
The USPTO Backgrounds dataset provides information related to patents and trademarks from the United States Patent and Trademark Office (USPTO).
1 PAPER • 1 BENCHMARK
ViText2SQL is a dataset for the Vietnamese Text-to-SQL semantic parsing task, consisting of about 10K question and SQL query pairs.
2 PAPERS • NO BENCHMARKS YET
The WordNet Language Model Probing (WNLaMPro) dataset consists of relations between keywords and words. It contains 4 different kinds of relations: Antonym, Hypernym, Cohyponym and Corruption.
6 PAPERS • NO BENCHMARKS YET
WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each instance in WiC has a target word w, either a verb or a noun, for which two contexts are provided. Each of these contexts triggers a specific meaning of w. The task is to identify whether the occurrences of w in the two contexts correspond to the same meaning or not. The dataset can also be viewed as a practical application of Word Sense Disambiguation.
169 PAPERS • NO BENCHMARKS YET
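The WiC task format described above can be sketched as a small data structure plus an accuracy function. This is a minimal illustration only: the class name `WiCInstance`, its field names, the `"N"`/`"V"` tags, and the two example sentences are all hypothetical and are not taken from the dataset files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WiCInstance:
    """One hypothetical WiC-style instance (field names are assumptions)."""
    target: str         # target word w
    pos: str            # "N" for noun, "V" for verb (assumed encoding)
    context1: str       # first sentence containing w
    context2: str       # second sentence containing w
    same_meaning: bool  # gold label: does w share one meaning in both contexts?


def accuracy(instances, predict):
    """Fraction of instances where predict(instance) equals the gold label."""
    if not instances:
        return 0.0
    correct = sum(predict(x) == x.same_meaning for x in instances)
    return correct / len(instances)


# Invented examples for illustration; not drawn from WiC itself.
EXAMPLES = [
    WiCInstance("bank", "N",
                "She sat on the bank of the river.",
                "He deposited cash at the bank.",
                False),
    WiCInstance("run", "V",
                "They run every morning.",
                "I run before breakfast.",
                True),
]


def always_same(instance):
    """Trivial baseline that always predicts 'same meaning'."""
    return True
```

Because the task is binary, a constant baseline such as `always_same` gives a useful floor against which a context-sensitive model can be compared.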
A multilingual language modeling benchmark composed of 40+ languages spanning several scripts and linguistic families. It contains around 40 billion characters and aims to accelerate research on multilingual modeling.
24 PAPERS • 3 BENCHMARKS
WikiText-TL-39 is a benchmark language modeling dataset in Filipino that has 39 million tokens in the training set.
3 PAPERS • NO BENCHMARKS YET
Winogender Schemas is a novel, Winograd schema-style set of minimal pair sentences that differ only by pronoun gender.
WritingPrompts is a large dataset of 300K human-written stories paired with writing prompts from an online forum.
94 PAPERS • 1 BENCHMARK
This corpus is the largest existing corpus of Catalan, containing 687 million words. That is a significant increase over CuCWeb, previously the biggest Catalan corpus, which contains 166 million words.
This is a dataset for disentangling conversations on IRC, which is the task of identifying separate conversations in a single stream of messages. It contains disentanglement information for 77,563 messages of IRC.
4 PAPERS • 3 BENCHMARKS
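Conversation disentanglement as described above can be illustrated with a small sketch. It assumes, hypothetically, that each message is identified by its index in the stream and that the annotation is available as (message, antecedent) reply links; conversations are then the connected components of those links. The function name `disentangle` and this input layout are assumptions made for illustration, not the dataset's actual file format.

```python
def disentangle(num_messages, reply_links):
    """Group message indices into conversations.

    reply_links is an iterable of (message, antecedent) index pairs
    (an assumed representation). Messages connected by any chain of
    links end up in the same conversation.
    """
    parent = list(range(num_messages))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for message, antecedent in reply_links:
        root_a, root_b = find(message), find(antecedent)
        if root_a != root_b:
            # Keep the earliest message as the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    conversations = {}
    for i in range(num_messages):
        conversations.setdefault(find(i), []).append(i)
    return sorted(conversations.values())
```

For example, `disentangle(5, [(1, 0), (3, 2), (4, 1)])` separates a five-message stream into two interleaved conversations, one containing messages 0, 1, and 4 and the other containing messages 2 and 3.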