🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task (clear)

Filter by Language (clear)

18 dataset results for Word Embeddings AND English

FrameNet is a linguistic knowledge graph containing information about lexical and predicate argument semantics of the English language. FrameNet contains two distinct entity classes: frames and lexical units, where a frame is a meaning and a lexical unit is a single meaning for a word.

436 PAPERS • NO BENCHMARKS YET

BookCorpus

BookCorpus is a large collection of free novel books written by unpublished authors, which contains 11,038 books (around 74M sentences and 1G words) of 16 different sub-genres (e.g., Romance, Historical, Adventure, etc.).

326 PAPERS • 1 BENCHMARK

WiC

WiC (Words in Context)

WiC is a benchmark for the evaluation of context-sensitive word embeddings. WiC is framed as a binary classification task. Each instance in WiC has a target word w, either a verb or a noun, for which two contexts are provided. Each of these contexts triggers a specific meaning of w. The task is to identify if the occurrences of w in the two contexts correspond to the same meaning or not. In fact, the dataset can also be viewed as an application of Word Sense Disambiguation in practise.

173 PAPERS • NO BENCHMARKS YET

Billion Word Benchmark

The One Billion Word dataset is a dataset for language modeling. The training/held-out data was produced from the WMT 2011 News Crawl data using a combination of Bash shell and Perl scripts.

136 PAPERS • 2 BENCHMARKS

WikiMatrix

WikiMatrix is a dataset of parallel sentences in the textual content of Wikipedia for all possible language pairs. The mined data consists of:

87 PAPERS • NO BENCHMARKS YET

WikiANN

WikiANN (PAN-X)

WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles that have been annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format¹². This dataset serves as a valuable resource for training and evaluating named entity recognition models across various languages.

60 PAPERS • 3 BENCHMARKS

CELEX

CELEX database comprises three different searchable lexical databases, Dutch, English and German. The lexical data contained in each database is divided into five categories: orthography, phonology, morphology, syntax (word class) and word frequency.

57 PAPERS • NO BENCHMARKS YET

ISEAR (International Survey on Emotion Antecedents and Reactions)

Over a period of many years during the 1990s, a large group of psychologists all over the world collected data in the ISEAR project, directed by Klaus R. Scherer and Harald Wallbott. Student respondents, both psychologists and non-psychologists, were asked to report situations in which they had experienced all of 7 major emotions (joy, fear, anger, sadness, disgust, shame, and guilt). In each case, the questions covered the way they had appraised the situation and how they reacted. The final data set thus contained reports on seven emotions each by close to 3000 respondents in 37 countries on all 5 continents.

49 PAPERS • NO BENCHMARKS YET

Panlex

PanLex translates words in thousands of languages. Its database is panlingual (emphasizes coverage of every language) and lexical (focuses on words, not sentences).

34 PAPERS • NO BENCHMARKS YET

EVALution

EVALution dataset is evenly distributed among the three classes (hypernyms, co-hyponyms and random) and involves three types of parts of speech (noun, verb, adjective). The full dataset contains a total of 4,263 distinct terms consisting of 2,380 nouns, 958 verbs and 972 adjectives.

27 PAPERS • NO BENCHMARKS YET

Senseval-2

There are now many computer programs for automatically determining the sense of a word in context (Word Sense Disambiguation or WSD). The purpose of SENSEVAL is to evaluate the strengths and weaknesses of such programs with respect to different words, different varieties of language, and different languages.

19 PAPERS • NO BENCHMARKS YET

United Nations Parallel Corpus

The first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus presented consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish.

17 PAPERS • NO BENCHMARKS YET

SemEval-2014 Task-10

SemEval 2014 is a collection of datasets used for the Semantic Evaluation (SemEval) workshop, an annual event that focuses on the evaluation and comparison of systems that can analyze diverse semantic phenomena in text. The datasets from SemEval 2014 are used for various tasks, including but not limited to:

6 PAPERS • NO BENCHMARKS YET

WikiNEuRal

WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.

5 PAPERS • NO BENCHMARKS YET

WikiSem500

The WikiSem500 dataset contains around 500 per-language cluster groups for English, Spanish, German, Chinese, and Japanese (a total of 13,314 test cases).

4 PAPERS • NO BENCHMARKS YET

MUSE

The MUSE dataset contains bilingual dictionaries for 110 pairs of languages. For each language pair, the training seed dictionaries contain approximately 5000 word pairs while the evaluation sets contain 1500 word pairs.

2 PAPERS • 2 BENCHMARKS

McQueen

McQueen dataset contains 15k visual conversations and over 80k queries where each one is associated with a fully-specified rewrite version. In addition, for entities appearing in the rewrite, the corresponding image box annotation is provided.

1 PAPER • NO BENCHMARKS YET

Urban Dict spelling variant

Urban Dict spelling variant is a variant spelling dataset for use of NLP research in the informal domain. It consists of around 25k variant spelling pairs form UrbanDictionary.

1 PAPER • NO BENCHMARKS YET

Datasets

18 dataset results for Word Embeddings AND English