This dataset contains orthographic samples of words in 19 languages (ar, br, de, en, eno, ent, eo, es, fi, fr, fro, it, ko, nl, pt, ru, sh, tr, zh). Each sample contains two text features: a Word (the textual representation of the word according to its orthography) and a Pronunciation (the highest-surface IPA pronunciation of the word as pronounced in its language).
Multitask learning has led to significant advances in Natural Language Processing, including the decaNLP benchmark, where question answering is used to frame 10 natural language understanding tasks in a single model. PQ-decaNLP is a crowd-sourced corpus of paraphrased questions annotated with paraphrase phenomena. This enables analysis of how transformations such as swapping class labels or changing sentence modality lead to large performance degradation.
Press Briefing Claim Dataset: The dataset contains a total of 53 press briefings spanning more than four years (2017–2021). While on average one press briefing per month is held, the distribution is heavily skewed towards recent years.
QALD-9-Plus is a dataset for Knowledge Graph Question Answering (KGQA) based on the well-known QALD-9.
This dataset is based on the Spiking Heidelberg Digits (SHD) dataset. Sample inputs consist of two spike-encoded digits sampled uniformly at random from the SHD dataset and concatenated, with the target being the sum of the digits (irrespective of language). The train/test split remains the same as in SHD, with the test set consisting of 16k such samples based on the SHD test set.
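The sampling scheme described above can be sketched as follows. This is an illustrative reconstruction, not the authors' pipeline: the toy spike rasters, array layout (time × channels), and concatenation axis are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sum_sample(spike_trains, labels, rng):
    """Build one 'digit sum' sample: pick two digits uniformly at random
    (with replacement), concatenate their spike rasters along the time
    axis, and label the pair with the sum of the digits."""
    i, j = rng.integers(0, len(spike_trains), size=2)
    x = np.concatenate([spike_trains[i], spike_trains[j]], axis=0)
    y = labels[i] + labels[j]  # target is the digit sum, irrespective of language
    return x, y

# Toy stand-in data: ten "digits" as random binary rasters of shape (time, channels).
trains = [rng.integers(0, 2, size=(100, 700)) for _ in range(10)]
labels = [k % 10 for k in range(10)]
x, y = make_sum_sample(trains, labels, rng)
```

Since each digit label lies in 0–9, the resulting targets range from 0 to 18.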
The primary data of the SaGA corpus comprise 25 dialogues between interlocutors (50 in total), who engage in a spatial communication task combining direction-giving and sight description. Six of these dialogues, with data from the direction giver only, are available including audio (.wav) and video (.mp4) data. The secondary data consist of annotations (*.eaf) of gestures and speech-gesture referents, which have been completely and systematically annotated based on an annotation grid (cf. the SaGA documentation). The corpus comprises 9,881 isolated words and 1,764 isolated gestures. The stimulus is a model of a town presented in a Virtual Reality (VR) environment. After a "bus ride" through the VR town along five landmarks, a router explained the route and the wayside landmarks to an unknown, naive follower. The SaGA Corpus was curated for CLARIN as part of the curation project "Editing and Integration of Multimodal Resources in CLARIN-D" by CLARIN-D Working Group 6.
Digital Edition: Sturm Edition. Source: Schrade, Torsten: "Startseite" ["Home page"], in: DER STURM. Digitale Quellenedition zur Geschichte der internationalen Avantgarde [digital source edition on the history of the international avant-garde], compiled and edited by Marjam Trautmann and Torsten Schrade. Mainz, Academy of Sciences and Literature, version 1 of 16 July 2018.
THEOStereo is a dataset providing synthetic stereo image pairs and their corresponding scene depth, published along with [1]. All images follow the omnidirectional camera model. In total, there are 31,250 omnidirectional image pairs: the training set contains 25,000 image pairs, and the validation and test sets contain 3,125 image pairs each. For each pair, there is a ground-truth depth map describing the pixel-wise distance to the object along the left camera's z-axis. The virtual omnidirectional cameras exhibit a FOV of 180 degrees and can be described using Kannala's camera model [2]. The distortion parameters are k_1 = 1 and k_2 = k_3 = k_4 = k_5 = 0. The length of the stereo camera's baseline was 0.3 AU (approx. 15 cm, not 30 cm!). Please do not forget to cite [1] if you use the dataset in your work. Thank you.
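With k_1 = 1 and k_2 … k_5 = 0, the Kannala projection polynomial r(θ) = k_1·θ + k_2·θ³ + … reduces to the equidistant fisheye model r = θ. A minimal projection sketch, where the intrinsics (fx, fy, cx, cy) are made-up illustrative values and not taken from the dataset:

```python
import numpy as np

def project_kb(point_3d, fx, fy, cx, cy, k=(1.0, 0.0, 0.0, 0.0, 0.0)):
    """Project a 3D camera-frame point using the Kannala polynomial
    r(theta) = k1*theta + k2*theta^3 + ... + k5*theta^9.
    With k = (1, 0, 0, 0, 0), the THEOStereo setting, this is the
    equidistant fisheye projection r = theta."""
    x, y, z = point_3d
    theta = np.arctan2(np.hypot(x, y), z)                 # angle from the optical axis
    r = sum(ki * theta ** (2 * i + 1) for i, ki in enumerate(k))
    phi = np.arctan2(y, x)                                # azimuth in the image plane
    return fx * r * np.cos(phi) + cx, fy * r * np.sin(phi) + cy

# Point 0.1 m to the right, 1 m in front of the camera (hypothetical intrinsics).
u, v = project_kb((0.1, 0.0, 1.0), fx=300.0, fy=300.0, cx=512.0, cy=512.0)
```

Because θ enters the radius linearly, points at θ = 90° (the 180-degree FOV boundary) still map to a finite image radius, which is why this model suits the dataset's omnidirectional cameras.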
A corpus of GDPR machine-readable transparency information powered by the Transparency Information Language and Toolkit (TILT). These statements were extracted from real-world services for academic research purposes. They contain information about the collection, processing, and use of personal data in accordance with the legal requirements of the GDPR. The corpus makes it possible to process the information for various applications, such as automated checks or analyses, and to illustrate its practical applicability.
The SWC is a corpus of aligned Spoken Wikipedia articles from the English, German, and Dutch Wikipedia. This corpus has several outstanding characteristics:
Thorsten-Voice (Thorsten-21.02-neutral) is a neutrally spoken voice dataset recorded by Thorsten Müller, audio-optimized by Dominik Kreutz, and licensed under CC0 so that anybody can use it without financial or licensing hurdles. It is intended as a single-speaker dataset for speech synthesis in German and contains about 23 hours of high-quality audio.
TuGebic is a corpus of recordings of spontaneous speech from Turkish–German bilinguals. Participants in the study were adult Turkish–German bilinguals living in Germany or Turkey at the time of recording, in the first half of the 1990s. The data were manually tokenised and normalised, and all proper names (names of participants and places mentioned in the conversations) were replaced with pseudonyms. Token-level automatic language identification was performed, which made it possible to establish the proportions of words from each language.
UNER v1 adds an NER annotation layer to 18 datasets (primarily treebanks from UD) and covers 12 genealogically and typologically diverse languages: Cebuano, Danish, German, English, Croatian, Portuguese, Russian, Slovak, Serbian, Swedish, Tagalog, and Chinese. Overall, UNER v1 contains nine full datasets with training, development, and test splits over eight languages, three evaluation sets for lower-resource languages (TL and CEB), and a parallel evaluation benchmark spanning six languages.
This dataset contains vibration data recorded on a rotating drive train. This drive train consists of an electronically commutated DC motor and a shaft driven by it, which passes through a roller bearing. With the help of a 3D-printed holder, unbalances with different weights and different radii were attached to the shaft. Besides the strength of the unbalances, the rotation speed of the motor was also varied. This dataset can be used to develop and test algorithms for the automatic detection of unbalances on drive trains. Datasets for 4 differently sized unbalances and for the unbalance-free case were recorded. The vibration data was recorded at a sampling rate of 4096 values per second. Datasets for development (ID "D[0-4]") as well as for evaluation (ID "E[0-4]") are available for each unbalance strength. The rotation speed was varied between approx. 630 and 2330 RPM in the development datasets and between approx. 1060 and 1900 RPM in the evaluation datasets. For each measurement of
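A classic use of such data is to look at the vibration amplitude at the shaft's rotation frequency (the 1x order), which grows with unbalance strength. The sketch below is a generic illustration under that assumption, not the authors' method; it uses the dataset's stated sampling rate of 4096 values per second and a synthetic signal in place of the real recordings.

```python
import numpy as np

FS = 4096  # sampling rate of the dataset: 4096 values per second

def amplitude_at_rotation(signal, rpm, fs=FS):
    """Return the one-sided spectral amplitude at the shaft's rotation
    frequency, a simple indicator of unbalance strength."""
    f_rot = rpm / 60.0                                   # rotation frequency in Hz
    spec = np.abs(np.fft.rfft(signal)) / len(signal)     # one-sided, normalized
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return spec[np.argmin(np.abs(freqs - f_rot))]        # nearest frequency bin

# Synthetic one-second example: a shaft at 1500 RPM (25 Hz) with an
# unbalance-induced sinusoid plus measurement noise.
t = np.arange(FS) / FS
sig = 0.8 * np.sin(2 * np.pi * 25.0 * t) + 0.1 * np.random.default_rng(0).normal(size=FS)
amp = amplitude_at_rotation(sig, rpm=1500)
```

Because the rotation speed varies across the recordings, a detector would typically evaluate this amplitude relative to the known (or estimated) RPM of each measurement window.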
VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
WEATHub is a dataset covering 24 languages. It contains words organized into groups of (target1, target2, attribute1, attribute2) to measure the association target1:target2 :: attribute1:attribute2. For example, target1 can be insects and target2 flowers, and we might measure whether insects or flowers are associated with pleasantness or unpleasantness. The word associations are quantified using the WEAT metric in our paper, which calculates an effect size (Cohen's d) and provides a p-value (to measure the statistical significance of the results). In our paper, we use word embeddings from language models to perform these tests and understand biased associations in language models across different languages.
The WEB-FORUM-52 gold standard comprises (i) 13 web forums from the health domain, (ii) 15 forums obtained from a Wikipedia list of popular forums (https://en.wikipedia.org/wiki/List_of_Internet_forums), (iii) 13 forums mentioned on a list of popular German web forums (https://www.beliebte-foren.de), (iv) nine forums obtained from WPressBlog (https://www.wpressblog.com/free-forum-posting-sites-list/), and (v) two additional forums. For most forums, two web pages (from different threads) were used and stored together with gold-standard annotations manually created by domain experts; the annotations describe the post text, post date, post user, and direct URL of each post.
The Medical Translation Task of WMT 2014 addresses the problem of domain-specific and genre-specific machine translation. The task is split into two subtasks: summary translation, focused on translation of sentences from summaries of medical articles, and query translation, focused on translation of queries entered by users into medical information search engines. Both subtasks included translation between English and Czech, German, and French, in both directions.
News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 5 languages (Czech, German, Finnish, French, Russian) and an additional 1500 sentences from each of the 5 languages translated into English. The sentences are taken from newspaper articles for each language pair, except for French, where the test set was drawn from user-generated comments on news articles (from the Guardian and Le Monde). The translation was done by professional translators.
The IT Translation Task is a shared task introduced in the First Conference on Machine Translation. Compared to WMT 2016 News, this task brought several novelties to WMT:
We present a multilingual test set for conducting speech intelligibility tests in the form of diagnostic rhyme tests. The materials currently contain audio recordings in 5 languages and further extensions are in progress. For Mandarin Chinese, we provide recordings for a consonant contrast test as well as a tonal contrast test. Further information on the audio data, test procedure and software to set up a full survey which can be deployed on crowdsourcing platforms is provided in our paper [arXiv preprint] and GitHub repository. We welcome contributions to this open-source project.
Current approaches to context-aware MT rely on a set of surface heuristics to translate pronouns, which break down when translations require real reasoning. We create a new template test set ContraCAT to assess the ability of Machine Translation to handle the specific steps necessary for successful pronoun translation.
German affixoids are a type of morpheme in between affixes and free stems.
An annotated data set consisting of user comments posted to an Austrian newspaper website (in German).
The Parzival dataset consists of 47 pages by three writers. These pages were taken from a medieval German manuscript from the 13th century that contains the epic poem Parzival by Wolfram von Eschenbach. The image size is 2000 x 3000 pixels. Of these, 24 pages are selected as the training set, 14 as the test set, and 2 as the validation set.