McQueen dataset contains 15k visual conversations and over 80k queries where each one is associated with a fully-specified rewrite version. In addition, for entities appearing in the rewrite, the corresponding image box annotation is provided.
1 PAPER • NO BENCHMARKS YET
WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.
5 PAPERS • NO BENCHMARKS YET
ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected from many sources that include expert-created resources, crowd-sourcing, and games with a purpose. It is designed to represent the general knowledge involved in understanding language, improving natural language applications by allowing the application to better understand the meanings behind the words people use.
800 PAPERS • NO BENCHMARKS YET
The MUSE dataset contains bilingual dictionaries for 110 pairs of languages. For each language pair, the training seed dictionaries contain approximately 5000 word pairs while the evaluation sets contain 1500 word pairs.
2 PAPERS • 2 BENCHMARKS
WinoBias contains 3,160 sentences, split equally for development and test, created by researchers familiar with the project. Sentences were created to follow two prototypical templates but annotators were encouraged to come up with scenarios where entities could be interacting in plausible ways. Templates were selected to be challenging and designed to cover cases requiring semantics and syntax separately.
108 PAPERS • NO BENCHMARKS YET
BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), and the relevant contexts (C) (also called snippets).
162 PAPERS • 2 BENCHMARKS
EVALution dataset is evenly distributed among the three classes (hypernyms, co-hyponyms and random) and involves three types of parts of speech (noun, verb, adjective). The full dataset contains a total of 4,263 distinct terms consisting of 2,380 nouns, 958 verbs and 972 adjectives.
27 PAPERS • NO BENCHMARKS YET
PanLex translates words in thousands of languages. Its database is panlingual (emphasizes coverage of every language) and lexical (focuses on words, not sentences).
34 PAPERS • NO BENCHMARKS YET
SemEval 2014 is a collection of datasets used for the Semantic Evaluation (SemEval) workshop, an annual event that focuses on the evaluation and comparison of systems that can analyze diverse semantic phenomena in text. The datasets from SemEval 2014 are used for various tasks, including but not limited to:
6 PAPERS • NO BENCHMARKS YET
There are now many computer programs for automatically determining the sense of a word in context (Word Sense Disambiguation or WSD). The purpose of SENSEVAL is to evaluate the strengths and weaknesses of such programs with respect to different words, different varieties of language, and different languages.
19 PAPERS • NO BENCHMARKS YET
The One Billion Word dataset is a dataset for language modeling. The training/held-out data was produced from the WMT 2011 News Crawl data using a combination of Bash shell and Perl scripts.
134 PAPERS • 2 BENCHMARKS
CELEX database comprises three different searchable lexical databases, Dutch, English and German. The lexical data contained in each database is divided into five categories: orthography, phonology, morphology, syntax (word class) and word frequency.
57 PAPERS • NO BENCHMARKS YET
Chinese Gigaword corpus consists of 2.2M of headline-document pairs of news stories covering over 284 months from two Chinese newspapers, namely the Xinhua News Agency of China (XIN) and the Central News Agency of Taiwan (CNA).
2 PAPERS • NO BENCHMARKS YET
The Climate Change Claims dataset for generating fact checking summaries contains claims broadly related to climate change and global warming from climatefeedback.org. It contains 1k documents from 104 different claims from 97 different domains.
The dataset contains the main components of the news articles published online by the newspaper named <a href="https://gazzettadimodena.gelocal.it/modena">Gazzetta di Modena</a>: url of the web page, title, sub-title, text, date of publication, crime category assigned to each news article by the author.
3 PAPERS • NO BENCHMARKS YET
Over a period of many years during the 1990s, a large group of psychologists all over the world collected data in the ISEAR project, directed by Klaus R. Scherer and Harald Wallbott. Student respondents, both psychologists and non-psychologists, were asked to report situations in which they had experienced all of 7 major emotions (joy, fear, anger, sadness, disgust, shame, and guilt). In each case, the questions covered the way they had appraised the situation and how they reacted. The final data set thus contained reports on seven emotions each by close to 3000 respondents in 37 countries on all 5 continents.
49 PAPERS • NO BENCHMARKS YET
The IndicNLP corpus is a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.
The IndoSum dataset is a benchmark dataset for Indonesian text summarization. The dataset consists of news articles and manually constructed summaries.
9 PAPERS • NO BENCHMARKS YET
A dataset of sentence pairs annotated following the formalization.
This dataset contains information about Japanese word similarity including rare words. The dataset is constructed following the Stanford Rare Word Similarity Dataset. 10 annotators annotated word pairs with 11 levels of similarity.
SART is a collection of three datasets for Similarity, Analogies and Relatedness for the Tatar language. The three subsets are: * Similarity dataset - 202 pairs of words along with averaged human scores of similarity degree between the words (in 0-to-10 scale). For example, "йорт, бина, 7.69". * Relatedness dataset - 252 pairs of words along with averaged human scores of relatedness degree between the words. For example, "урам, балалар, 5.38". * Analogies dataset - set of analytical questions of the form A:B::C:D, meaning A to B as C to D, and D is to be predicted. For example, "Әнкара Төркия Париж Франция". Contains 34 categories, and in total 30 144 questions.
A large-scale evaluation set that provides human ratings for the plausibility of 10,000 SP pairs over five SP relations, covering 2,500 most frequent verbs, nouns, and adjectives in American English.
4 PAPERS • NO BENCHMARKS YET
The first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus presented consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish.
17 PAPERS • NO BENCHMARKS YET
Urban Dict spelling variant is a variant spelling dataset for use of NLP research in the informal domain. It consists of around 25k variant spelling pairs form UrbanDictionary.
The WordNet Language Model Probing (WNLaMPro) dataset consists of relations between keywords and words. It contains 4 different kinds of relations: Antonym, Hypernym, Cohyponym and Corruption.
WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each instance in WiC has a target word w, either a verb or a noun, for which two contexts are provided. Each of these contexts triggers a specific meaning of w. The task is to identify if the occurrences of w in the two contexts correspond to the same meaning or not. In fact, the dataset can also be viewed as an application of Word Sense Disambiguation in practise.
168 PAPERS • NO BENCHMARKS YET
WikiMatrix is a dataset of parallel sentences in the textual content of Wikipedia for all possible language pairs. The mined data consists of:
85 PAPERS • NO BENCHMARKS YET
The WikiSem500 dataset contains around 500 per-language cluster groups for English, Spanish, German, Chinese, and Japanese (a total of 13,314 test cases).
The pioNER corpus provides gold-standard and automatically generated named-entity datasets for the Armenian language. The automatically generated corpus is generated from Wikipedia. The gold-standard set is a collection of over 250 news articles from iLur.am with manual named-entity annotation. It includes sentences from political, sports, local and world news, and is comparable in size with the test sets of other languages.