A large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, French, German, Spanish, Russian, Turkish. Together with English newspapers from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community.
40 PAPERS • 5 BENCHMARKS
WikiCLIR is a large-scale (German-English) retrieval data set for Cross-Language Information Retrieval (CLIR). It contains a total of 245,294 German single-sentence queries with 3,200,393 automatically extracted relevance judgments for 1,226,741 English Wikipedia articles as documents. Queries are well-formed natural language sentences that allow large-scale training of (translation-based) ranking models.
4 PAPERS • NO BENCHMARKS YET
The DAWT dataset consists of Densely Annotated Wikipedia Texts across multiple languages. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of the entity. The data set contains total of 13.6M articles, 5.0B tokens, 13.8M mention entity co-occurrences. DAWT contains 4.8 times more anchor text to entity links than originally present in the Wikipedia markup. Moreover, it spans several languages including English, Spanish, Italian, German, French and Arabic.
3 PAPERS • NO BENCHMARKS YET
SV-Ident comprises 4,248 sentences from social science publications in English and German. The data is the official data for the Shared Task: “Survey Variable Identification in Social Science Publications” (SV-Ident) 2022. Sentences are labeled with variables that are mentioned either explicitly or implicitly.
3 PAPERS • 2 BENCHMARKS
This data adds textual meta-infomation data to two existing corpora for cross language information retrieval: BoostCLIR, and the Large Scale CLIR Dataset (wiki-clir).
1 PAPER • NO BENCHMARKS YET
The Medical Translation Task of WMT 2014 addresses the problem of domain-specific and genre-specific machine translation. The task is split into two subtasks: summary translation, focused on translation of sentences from summaries of medical articles, and query translation, focused on translation of queries entered by users into medical information search engines. Both subtasks included translation between English and Czech, German, and French, in both directions.