This dataset contains orthographic samples of words in 19 languages (ar, br, de, en, eno, ent, eo, es, fi, fr, fro, it, ko, nl, pt, ru, sh, tr, zh). Each sample contains two text features: a Word (the textual representation of the word according to its orthography) and a Pronunciation (the highest-surface IPA pronunciation of the word as pronunced in its language).
1 PAPER • NO BENCHMARKS YET
QALD-9-Plus Dataset Description QALD-9-Plus is the dataset for Knowledge Graph Question Answering (KGQA) based on well-known QALD-9.
1 PAPER • 1 BENCHMARK
Multilingual explainable fact-checking dataset on Russia-Ukraine Conflict 2022
RUSLAN is a Russian spoken language corpus for text-to-speech task. RUSLAN contains 22,200 audio samples with text annotations – more than 31 hours of high-quality speech of one person – being one of the largest annotated Russian corpus in terms of speech duration for a single speaker.
The work provides a comprehensive overview of the corpus for the Russian language for the commonsense inference task. Namely, we construct event phrases, which cover a wide range of everyday situations with labelled intents and reactions of the event main participant and emotions of other people involved.
533 parallel examples sampled from TACRED, translated into Russian and Korean (and 3 additional examples in Russian), accompanied with tranlsation of a list of trigger words collected for the different relations.
UNER v1 adds an NER annotation layer to 18 datasets (primarily treebanks from UD) and covers 12 geneologically and ty- pologically diverse languages: Cebuano, Danish, German, English, Croatian, Portuguese, Russian, Slovak, Serbian, Swedish, Tagalog, and Chinese4. Overall, UNER v1 contains nine full datasets with training, development, and test splits over eight languages, three evaluation sets for lower-resource languages (TL and CEB), and a parallel evaluation benchmark spanning six languages.
1 PAPER • 31 BENCHMARKS
WEATHub is a dataset containing 24 languages. It contains words organized into groups of (target1, target2, attribute1, attribute2) to measure the association target1:target2 :: attribute1:attribute2. For example target1 can be insects, target2 can be flowers. And we might be trying to measure whether we find insects or flowers pleasant or unpleasant. The measurement of word associations is quantified using the WEAT metric in our paper. It is a metric that calculates an effect size (Cohen's d) and also provides a p-value (to measure statistical significance of the results). In our paper, we use word embeddings from language models to perform these tests and understand biased associations in language models across different languages.
News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 5 languages (Czech, German, Finnish, French, Russian) and additional 1500 sentences from each of the 5 languages translated to English. The sentences are taken from newspaper articles for each language pair, except for French, where the test set was drawn from user-generated comments on the news articles (from Guardian and Le Monde). The translation was done by professional translators.
The Winograd schema challenge composes tasks with syntactic ambiguity, which can be resolved with logic and reasoning.
This dataset contains human-annotated sense identifiers for 2562 contexts of 20 words used in the RUSSE'2018 shared task on Word Sense Induction and Disambiguation for the Russian language. Assembled by Dmitry Ustalov in 2017. In particular, 80 pre-annotated contexts were used for training the human annotators, and 2562 contexts were annotated by humans such that each context was annotated by 9 different annotators. After the annotation, every context was additionally inspected (“curated”) by the organizers of the shared task.
0 PAPER • NO BENCHMARKS YET
This dataset contains the opinions of Russian native speakers about the relationship between a generic term (hypernym) and a specific instance of it (hyponym). Assembled by Dmitry Ustalov in 2017. A set of 300 most frequent nouns was extracted from the Russian National Corpus. Then each method or resource (including RuThes and RuWordNet) produced at most five hypernyms, if possible. This resulted in 10,600 unique non-empty subsumption pairs, which were annotated by seven different performers whose mother tongue is Russian and were at least 20 years old as of February 1, 2017. As a result, 4,576 out of 10,600 pairs were annotated as positive while the remaining 6,024 were annotated as negative. Interestingly, the performers were more confident in the negative answers than in the positive ones.
Russian dataset of emotional speech dialogues. This dataset was assembled from ~3.5 hours of live speech by actors who voiced pre-distributed emotions in the dialogue for ~3 minutes each. <br> Each sample of dataset contains name of part from the original dataset studio source, speech file (16000 or 44100Hz) of human voice, 1 of 7 labeled emotions and the speech-to-texted part of voice speech. <br>
Created as part of the Social Media Mining for Health Applications (#SMM4H '20) shared tasks, this dataset consists of 9515 tweets describing health issues. Each tweet is labeled for whether it contains information about an adverse side effect that occurred when taking a drug. The dataset was a joint effort with the UPenn HLP Center and the Chemoinformatics and Molecular Modeling Research Laboratory at Kazan Federal University.
This dataset contains annotations of semantic frames and intra-frame syntax for 1500 Russian sentences. Each sentence is annotated with predicate-argument structures. Syntactic information is also provided for each frame.