EUR-Lex-Sum is a dataset for cross-lingual summarization. It is based on manually curated document summaries of legal acts from the European Union law platform. Documents and their respective summaries exist as cross-lingually paragraph-aligned data in several of the 24 official European languages, enabling a variety of cross-lingual and lower-resourced summarization setups. The dataset contains up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages.
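A minimal sketch of loading one language subset and pairing a legal act with its curated summary, assuming the dataset is published on the Hugging Face Hub; the identifier dennlinger/eur-lex-sum, the config name "german", and the field names are assumptions here, not confirmed by this entry:

```python
# Minimal sketch: load one language subset of EUR-Lex-Sum and pair a
# legal act with its curated summary. The Hub identifier, the config
# name "german", and the field names are assumptions; adjust as needed.
from datasets import load_dataset

eurlex = load_dataset("dennlinger/eur-lex-sum", "german")

doc = eurlex["train"][0]
print(doc["reference"][:300])  # the full legal act (assumed field name)
print(doc["summary"][:300])    # the curated summary (assumed field name)
```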
5 PAPERS • NO BENCHMARKS YET
The German Lipreading dataset consists of 250,000 publicly available videos of the faces of speakers in the Hessian Parliament, processed for word-level lip reading with an automatic pipeline. The format is similar to that of the English-language Lip Reading in the Wild (LRW) dataset: each H264-compressed MPEG-4 video encodes one word of interest in a context of 1.16 seconds duration, which makes the two datasets compatible for studying transfer learning between them. Because the video material is naturally spoken language recorded in a natural environment, results transfer more robustly to real-world applications than with artificially generated, low-noise datasets. Each of the 500 different spoken words, ranging from 4 to 18 characters in length, has 500 instances with separate MPEG-4 audio and text metadata files, originating from 1,018 parliamentary sessions. Additionally, the complete TextGrid files containing the segmentation information of those sessions are also provided.
GermanQuAD is a Question Answering (QA) dataset of 13,722 extractive question/answer pairs in German.
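A minimal sketch of inspecting one extractive question/answer pair, assuming the dataset is available on the Hugging Face Hub in SQuAD-style format; the identifier deepset/germanquad and the field layout are assumptions:

```python
# Minimal sketch: load GermanQuAD and inspect one extractive QA pair.
# Assumes a SQuAD-style layout under the Hub identifier
# "deepset/germanquad"; both are assumptions, adjust if they differ.
from datasets import load_dataset

data = load_dataset("deepset/germanquad")

sample = data["train"][0]
print(sample["question"])                    # natural-language question
print(sample["answers"]["text"][0])          # gold answer span
print(sample["answers"]["answer_start"][0])  # character offset in the context
```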
WikiNEuRal is a high-quality, automatically generated dataset for multilingual Named Entity Recognition.
The DISRPT 2019 workshop introduces the first iteration of a cross-formalism shared task on discourse unit segmentation. Since all major discourse parsing frameworks imply a segmentation of texts into segments, learning segmentations for and from diverse resources is a promising area for converging methods and insights. We provide training, development and test datasets from all available languages and treebanks in the RST, SDRT and PDTB formalisms, using a uniform format. Because different corpora, languages and frameworks use different guidelines for segmentation, the shared task is meant to promote the design of flexible methods for dealing with various guidelines, and to help push forward the discussion of standards for discourse units. For datasets which have treebanks, we evaluate in two scenarios: with gold syntax and without it, in the latter case using provided automatic parses for comparison.
4 PAPERS • NO BENCHMARKS YET
DiS-ReX is a multilingual dataset for distantly supervised (DS) relation extraction (RE). It has over 1.5 million instances spanning four languages (English, Spanish, German and French), with 36 positive relation types plus one no-relation (NA) class.
German Guideline Program in Oncology NLP Corpus (GGPONC) is a German language corpus based on clinical practice guidelines for oncology. This corpus is one of the largest ever built from German medical documents. Unlike clinical documents, clinical guidelines do not contain any patient-related information and can therefore be used without data protection restrictions.
KRAUTS (Korpus of newspapeR Articles with Underlined Temporal expressionS) is a German temporally annotated news corpus accompanied by TimeML annotation guidelines for German. It was developed at Fondazione Bruno Kessler, Trento, Italy and at the Max Planck Institute for Informatics, Saarbrücken, Germany, with the goal of boosting temporal tagging research for German.
4 PAPERS • 1 BENCHMARK
The dataset introduces document alignments between German Wikipedia and the children's lexicon Klexikon. The source texts in Wikipedia are both written in more complex language than the Klexikon articles and significantly longer, which makes the pairing suitable for both summarization and simplification. Previous research has so far focused on only one of the two; they have not been studied comprehensively as a joint task.
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags). It spans 21 million tweets belonging to 26 thousand Twitter threads, which have been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, covering more than a decade.
4 PAPERS • 3 BENCHMARKS
MultiSubs is a dataset of multilingual subtitles gathered from the OPUS OpenSubtitles dataset, which in turn was sourced from opensubtitles.org. Some text fragments (visually salient nouns in this release) within the subtitles have been supplemented with web images, where the word sense of the fragment has been disambiguated using a cross-lingual approach. A fill-in-the-blank task and a lexical translation task are introduced to demonstrate the utility of the dataset; please refer to the paper for a more detailed description of the dataset and tasks. MultiSubs will benefit research on visual grounding of words, especially in the context of free-form sentences.
4 PAPERS • 5 BENCHMARKS
This dataset arises from the READ project (Horizon 2020).
WikiCLIR is a large-scale (German-English) retrieval data set for Cross-Language Information Retrieval (CLIR). It contains a total of 245,294 German single-sentence queries with 3,200,393 automatically extracted relevance judgments for 1,226,741 English Wikipedia articles as documents. Queries are well-formed natural language sentences that allow large-scale training of (translation-based) ranking models.
The WikiSem500 dataset contains around 500 per-language cluster groups for English, Spanish, German, Chinese, and Japanese (a total of 13,314 test cases).
Semantic role labelling (SRL) is the task of extracting semantic predicate-argument structures from sentences: in "The court approved the merger", for example, "approved" is the predicate, "the court" its agent and "the merger" its patient. X-SRL is a multilingual parallel SRL corpus for English (EN), German (DE), French (FR) and Spanish (ES) that is based on English gold annotations and shares the same labelling scheme across languages.
BenchIE is a benchmark and evaluation framework for comprehensive evaluation of open information extraction (OIE) systems for English, Chinese and German. In contrast to existing OIE benchmarks, BenchIE takes into account the informational equivalence of extractions: the gold standard consists of fact synsets, clusters that exhaustively list all surface forms of the same fact.
3 PAPERS • 1 BENCHMARK
Targeted syntactic evaluation datasets in five languages: English, French, German, Russian, and Hebrew. The data are translated from the targeted syntactic evaluation data of Marvin & Linzen (2018): https://aclanthology.org/D18-1151/. All stimuli focus on subject-verb agreement, i.e., minimal pairs such as "The author laughs" versus the ungrammatical "The author laugh".
3 PAPERS • NO BENCHMARKS YET
The DAWT dataset consists of Densely Annotated Wikipedia Texts across multiple languages. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of each entity. The dataset contains a total of 13.6M articles, 5.0B tokens, and 13.8M mention-entity co-occurrences. DAWT contains 4.8 times more anchor text to entity links than originally present in the Wikipedia markup, and it spans several languages: English, Spanish, Italian, German, French and Arabic.
The DISRPT 2021 shared task, co-located with CODI 2021 at EMNLP, introduces the second iteration of a cross-formalism shared task on discourse unit segmentation and connective detection, as well as the first iteration of a cross-formalism discourse relation classification task.
This is a gzipped CSV file containing the 13 million Duolingo student learning traces used in experiments by Settles & Meeder (2016). For more details and replication source code, visit: https://github.com/duolingo/halflife-regression (2016-06-07)
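A minimal sketch of reading the traces and scoring the half-life regression recall model from Settles & Meeder (2016), in which predicted recall is p = 2^(-Δ/h) for lag time Δ and half-life h; the file name and the column names delta and p_recall are assumptions about the release, so check the CSV header before use:

```python
# Minimal sketch: read the gzipped learning traces and evaluate the
# half-life regression recall model p = 2**(-delta / h) from
# Settles & Meeder (2016). The file name and the column names
# ("delta", "p_recall") are assumptions about the release.
import pandas as pd

traces = pd.read_csv("settles.acl16.learning_traces.13m.csv.gz",
                     compression="gzip")

half_life = 24 * 60 * 60          # toy fixed half-life of one day, in seconds
predicted = 2.0 ** (-traces["delta"] / half_life)

# Mean absolute error against the observed recall rate per trace.
print((predicted - traces["p_recall"]).abs().mean())
```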
GeoCoV19 is a large-scale Twitter dataset containing more than 524 million multilingual tweets, of which around 378K are geotagged and 5.4 million carry Place information. Toponyms extracted from the user location field and from tweet content are resolved to geolocations at the country, state, or city level: 297 million tweets are annotated with a geolocation from the user location field and 452 million from tweet content.
The Konzil dataset was created by specialists at the University of Greifswald. It contains manuscripts written in modern German. The training sample consists of 353 lines, the validation sample of 29 lines, and the test sample of 87 lines.
MRS is a multilingual reply suggestion dataset covering ten languages. It can be used to compare two families of models: 1) retrieval models that select the reply from a fixed set and 2) generation models that produce the reply from scratch. MRS thus complements existing cross-lingual generalization benchmarks, which focus on classification and sequence labeling tasks.
MultiSense is a dataset of 9,504 images annotated with an English verb and its translation in Spanish and German.
PatTR is a sentence-parallel corpus extracted from the MAREC patent collection. The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections as well as 5 million German-French sentence pairs from patent titles, abstracts and claims.
Patzig contains handwritten texts written in modern German. The training sample consists of 485 lines, the validation sample of 38 lines, and the test sample of 118 lines.
SV-Ident comprises 4,248 sentences from social science publications in English and German. It is the official data for the shared task “Survey Variable Identification in Social Science Publications” (SV-Ident) 2022. Sentences are labeled with variables that are mentioned either explicitly or implicitly.
3 PAPERS • 2 BENCHMARKS
Schiller contains handwritten texts written in modern German. The training sample consists of 244 lines, the validation sample of 21 lines, and the test sample of 63 lines.
Schwerin contains handwritten texts written in medieval German. The training sample consists of 793 lines, the validation sample of 68 lines, and the test sample of 196 lines.
AM2iCo is a wide-coverage and carefully designed cross-lingual and multilingual evaluation set. It aims to assess the ability of state-of-the-art representation models to reason over cross-lingual lexical-level concept alignment in context for 14 language pairs.
2 PAPERS • NO BENCHMARKS YET
DEplain-APA-sent is a German parallel corpus for sentence simplification on news texts. It is part of DEplain, a dataset of parallel, professionally written and manually aligned simplifications in plain German (German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.
2 PAPERS • 1 BENCHMARK
DEplain-web-sent is a German parallel corpus for sentence simplification on web texts. Like DEplain-APA-sent, it is part of DEplain, a dataset of parallel, professionally written and manually aligned simplifications in plain German (German: “Einfache Sprache”), which consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.
DeCOCO is a bilingual (English-German) corpus of image descriptions, where the English part is extracted from the COCO dataset and the German part consists of translations by a native German speaker.
The GermEval dataset is a resource for natural language processing (NLP) tasks, specifically named entity recognition (NER), in the German language.
GermanDPR is a dataset for passage retrieval in German. It comprises 8,245 question/answer pairs in the training set, 1,030 in the development set, and 1,025 in the test set. Each pair comes with one positive context and three hard negative contexts.
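A minimal sketch of the shape of one training instance in the usual DPR convention (one positive context plus hard negatives); the field names and the German example content are illustrative assumptions, not actual corpus records:

```python
# Illustrative shape of one GermanDPR training pair in the usual DPR
# convention: one positive context and three hard negative contexts.
# Field names and content are assumptions, not actual corpus records.
example = {
    "question": "Wie viele Einwohner hat Berlin?",
    "answers": ["rund 3,7 Millionen"],
    "positive_ctxs": [
        {"title": "Berlin", "text": "Berlin hat rund 3,7 Millionen Einwohner ..."},
    ],
    "hard_negative_ctxs": [
        {"title": "Hamburg", "text": "Hamburg ist die zweitgrößte Stadt ..."},
        {"title": "München", "text": "München ist die Landeshauptstadt Bayerns ..."},
        {"title": "Köln", "text": "Köln liegt am Rhein ..."},
    ],
}

assert len(example["positive_ctxs"]) == 1
assert len(example["hard_negative_ctxs"]) == 3
```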
MobIE is a German-language dataset which is human-annotated with 20 coarse- and fine-grained entity types and entity linking information for geographically linkable entities. The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities, 13.1K of which are linked to a knowledge base. A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types, while the remaining documents are annotated using a weakly-supervised labeling approach implemented with the Snorkel framework.
Morph Call is a suite of 46 probing tasks for four Indo-European languages with different morphology: Russian, French, English, and German. The tasks are designed to explore the morphosyntactic content of multilingual transformers, an aspect that remains understudied.
MuCo-VQA consists of large-scale (3.7M) multilingual and code-mixed VQA datasets in multiple languages: Hindi (hi), Bengali (bn), Spanish (es), German (de) and French (fr), plus the code-mixed language pairs en-hi, en-bn, en-fr, en-de and en-es.
MultiSpider is a large multilingual text-to-SQL dataset which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese).
SubEdits is a human-annotated post-editing dataset of neural machine translation outputs, compiled from in-house NMT outputs and human post-edits of subtitles from Rakuten Viki. It is collected from English-German annotations and contains 160k triplets.
Tilde MODEL Corpus is a set of multilingual corpora for European languages, particularly focused on the smaller languages. The collected resources have been cleaned, aligned, and formatted into the standard TMX corpus format, usable for developing new language technology products and services.
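A minimal sketch of reading such a TMX translation memory with Python's standard library; the inline snippet is an illustrative translation unit, not actual corpus content:

```python
# Minimal sketch: parse a TMX translation memory with the standard library.
# The inline snippet is an illustrative translation unit, not actual
# Tilde MODEL content.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # key for xml:lang

TMX = """<tmx version="1.4">
  <header srclang="en" adminlang="en" datatype="plaintext"
          segtype="sentence" o-tmf="example"
          creationtool="example" creationtoolversion="1.0"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The committee approved the proposal.</seg></tuv>
      <tuv xml:lang="de"><seg>Der Ausschuss billigte den Vorschlag.</seg></tuv>
    </tu>
  </body>
</tmx>"""

root = ET.fromstring(TMX)
for tu in root.iter("tu"):
    # Collect one {language: segment} pair per translation unit.
    pair = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
    print(pair)  # {'en': '...', 'de': '...'}
```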
WikiCaps is a large-scale multilingual but non-parallel dataset for multimodal machine translation and retrieval. The image-caption data was extracted from Wikimedia Commons and is thus representative of the large body of non-descriptive image-caption pairs available on the web. The current version of the dataset contains 3,816,940 images with 3,825,132 English captions, plus 1,000 additional image-caption pairs in German, French, and Russian together with their English counterparts.
X-WikiRE is a new, large-scale multilingual relation extraction dataset in which relation extraction is framed as a problem of reading comprehension to allow for generalization to unseen relations.
APE is a dataset for evaluating automatic post-editing (APE) of machine translation, the task of improving the output of a black-box MT system by automatically fixing its mistakes. The act of post-editing text can be fully specified as a sequence of delete and insert actions at given positions.
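A minimal sketch of this view of post-editing, using a hypothetical action encoding (not the dataset's own serialization) to turn an MT output into its post-edited version:

```python
# Minimal sketch: represent a post-edit as a sequence of delete/insert
# actions at given token positions. The ("del", pos) / ("ins", pos, token)
# encoding is illustrative, not the dataset's own format.
def apply_actions(tokens, actions):
    """Apply actions right-to-left so that earlier positions are not
    shifted by later edits."""
    tokens = list(tokens)
    for action in sorted(actions, key=lambda a: a[1], reverse=True):
        if action[0] == "del":
            del tokens[action[1]]
        elif action[0] == "ins":
            tokens.insert(action[1], action[2])
    return tokens

mt_output = ["das", "Haus", "ist", "rot"]
actions = [("del", 3), ("ins", 3, "blau")]   # fix a mistranslated adjective
print(apply_actions(mt_output, actions))     # ['das', 'Haus', 'ist', 'blau']
```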
1 PAPER • NO BENCHMARKS YET
The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. The AQL is the first publicly available query log that combines size, scope, and diversity, enabling research on new retrieval models and search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.
This dataset accompanies a paper analysing two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com), which publish content in Arabic, Chinese, English, French, German, and Spanish.
Digital Edition: Essays from Hannah Arendt is an NER dataset created from the digital edition "Sechs Essays" by Hannah Arendt. It consists of 23 documents from the period 1932-1976, which are available as TEI files online (see https://hannah-arendt-edition.net/3p.html?lang=de).
The CareerCoach 2022 gold standard is available for download in NIF and JSON formats, and draws upon documents from a corpus of over 99,000 education courses retrieved from 488 different education providers.