The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. The first version of the dataset was released in 2015 and consisted of 10 treebanks over 10 languages. Version 2.7 released in 2020 consists of 183 treebanks over 104 languages. The annotation consists of UPOS (universal part-of-speech tags), XPOS (language-specific part-of-speech tags), Feats (universal morphological features), Lemmas, dependency heads and universal dependency labels.
473 PAPERS • 8 BENCHMARKS
OpenSubtitles is collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.
187 PAPERS • 1 BENCHMARK
Common Voice is an audio dataset that consists of a unique MP3 and corresponding text file. There are 9,283 recorded hours in the dataset. The dataset also includes demographic metadata like age, sex, and accent. The dataset consists of 7,335 validated hours in 60 languages.
175 PAPERS • 261 BENCHMARKS
The OLID is a hierarchical dataset to identify the type and the target of offensive texts in social media. The dataset is collected on Twitter and publicly available. There are 14,100 tweets in total, in which 13,240 are in the training set, and 860 are in the test set. For each tweet, there are three levels of labels: (A) Offensive/Not-Offensive, (B) Targeted-Insult/Untargeted, (C) Individual/Group/Other. The relationship between them is hierarchical. If a tweet is offensive, it can have a target or no target. If it is offensive to a specific target, the target can be an individual, a group, or some other objects. This dataset is used in the OffensEval-2019 competition in SemEval-2019.
110 PAPERS • NO BENCHMARKS YET
The MOldavian and ROmanian Dialectal COrpus (MOROCO) is a corpus that contains 33,564 samples of text (with over 10 million tokens) collected from the news domain. The samples belong to one of the following six topics: culture, finance, politics, science, sports and tech. The data set is divided into 21,719 samples for training, 5,921 samples for validation and another 5,924 samples for testing.
20 PAPERS • NO BENCHMARKS YET
VoxLingua107 is a dataset for spoken language recognition of 6628 hours (62 hours per language on the average) and it is accompanied by an evaluation set of 1609 verified utterances.
19 PAPERS • 1 BENCHMARK
COunter NArratives through Nichesourcing (CONAN) is a dataset that consists of 4,078 pairs over the 3 languages. Additionally, 3 types of metadata are provided: expert demographics, hate speech sub-topic and counter-narrative type. The dataset is augmented through translation (from Italian/French to English) and paraphrasing, which brought the total number of pairs to 14.988.
10 PAPERS • NO BENCHMARKS YET
The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes a large collection of native script Wikipedia text, a romanization lexicon which consists of words in the native script with attested romanizations, and some full sentence parallel data in both a native script of the language and the basic Latin alphabet.
8 PAPERS • NO BENCHMARKS YET
A parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences.
5 PAPERS • NO BENCHMARKS YET
A manually annotated dataset containing 4,779 posts from Twitter annotated as offensive and not offensive.
2 PAPERS • NO BENCHMARKS YET
WiLI-2018 is a benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free of charge dataset of short text extracts from Wikipedia. It contains 1000 paragraphs of 235 languages, totaling in 23500 paragraphs. WiLI is a classification dataset: Given an unknown paragraph written in one dominant language, it has to be decided which language it is.
2 PAPERS • NO BENCHMARKS YET
The first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author's first language based on their second language writing. The dataset includes 1,868 student essays written by learners of European Portuguese, native speakers of the following L1s: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish. NLI-PT includes the original student text and four different types of annotation: POS, fine-grained POS, constituency parses, and dependency parses. NLI-PT can be used not only in NLI but also in research on several topics in the field of Second Language Acquisition and educational NLP.
1 PAPER • NO BENCHMARKS YET
TuGebic is a corpus of recordings of spontaneous speech samples from Turkish-German bilinguals, and the compilation of a corpus called TuGebic. Participants in the study were adult Turkish and German bilinguals living in Germany or Turkey at the time of recording in the first half of the 1990s. The data were manually tokenised and normalised, and all proper names (names of participants and places mentioned in the conversations) were replaced with pseudonyms. Token-level automatic language identification was performed, which made it possible to establish the proportions of words from each language.
1 PAPER • NO BENCHMARKS YET
Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine learning approach for automatic language identification for the Nordic languages, which often suffer miscategorisation by existing state-of-the-art tools. Concretely we will focus on discrimination between six Nordic languages: Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokmål), Faroese and Icelandic.
1 PAPER • 1 BENCHMARK