The Universal Dependencies (UD) project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages. The first version of the dataset was released in 2015 and consisted of 10 treebanks over 10 languages. Version 2.7 released in 2020 consists of 183 treebanks over 104 languages. The annotation consists of UPOS (universal part-of-speech tags), XPOS (language-specific part-of-speech tags), Feats (universal morphological features), Lemmas, dependency heads and universal dependency labels.
324 PAPERS • 5 BENCHMARKS
The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. There are totally 150 semantic categories, which include stuffs like sky, road, grass, and discrete objects like person, car, bed.
245 PAPERS • 9 BENCHMARKS
The Cross-lingual Natural Language Inference (XNLI) corpus is the extension of the Multi-Genre NLI (MultiNLI) corpus to 15 languages. The dataset was created by manually translating the validation and test sets of MultiNLI into each of those 15 languages. The English training set was machine translated for all languages. The dataset is composed of 122k train, 2490 validation and 5010 test examples.
107 PAPERS • 9 BENCHMARKS
OpenSubtitles is collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.
99 PAPERS • 1 BENCHMARK
The MSRA-TD500 dataset is a text detection dataset that contains 300 training images and 200 test images. Text regions are arbitrarily orientated and annotated at sentence level. Different from the other datasets, it contains both English and Chinese text.
72 PAPERS • 1 BENCHMARK
Total-Text is a text detection dataset that consists of 1,555 images with a variety of text types including horizontal, multi-oriented, and curved text instances. The training split and testing split have 1,255 images and 300 images, respectively.
60 PAPERS • 1 BENCHMARK
ASPEC, Asian Scientific Paper Excerpt Corpus, is constructed by the Japan Science and Technology Agency (JST) in collaboration with the National Institute of Information and Communications Technology (NICT). It consists of a Japanese-English paper abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of 680K parallel sentences (ASPEC-JC). This corpus is one of the achievements of the Japanese-Chinese machine translation project which was run in Japan from 2006 to 2010.
55 PAPERS • NO BENCHMARKS YET
MuST-C currently represents the largest publicly available multilingual corpus (one-to-many) for speech translation. It covers eight language directions, from English to German, Spanish, French, Italian, Dutch, Portuguese, Romanian and Russian. The corpus consists of audio, transcriptions and translations of English TED talks, and it comes with a predefined training, validation and test split.
51 PAPERS • NO BENCHMARKS YET
The shared task of CoNLL-2002 concerns language-independent named entity recognition. The types of named entities include: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. The participants of the shared task were offered training and test data for at least two languages. Information sources other than the training data might have been used in this shared task.
43 PAPERS • 2 BENCHMARKS
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance. MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between 4 different languages on average.
43 PAPERS • 1 BENCHMARK
Common Voice is an audio dataset that consists of a unique MP3 and corresponding text file. There are 9,283 recorded hours in the dataset. The dataset also includes demographic metadata like age, sex, and accent. The dataset consists of 7,335 validated hours in 60 languages.
35 PAPERS • 57 BENCHMARKS
DuReader is a large-scale open-domain Chinese machine reading comprehension dataset. The dataset consists of 200K questions, 420K answers and 1M documents. The questions and documents are based on Baidu Search and Baidu Zhidao. The answers are manually generated. The dataset additionally provides question type annotations – each question was manually annotated as either Entity, Description or YesNo and one of Fact or Opinion.
33 PAPERS • 4 BENCHMARKS
LCSTS is a large corpus of Chinese short text summarization dataset constructed from the Chinese microblogging website Sina Weibo, which is released to the public. This corpus consists of over 2 million real Chinese short texts with short summaries given by the author of each text. The authors also manually tagged the relevance of 10,666 short summaries with their corresponding short texts 10,666 short summaries with their corresponding short texts.
30 PAPERS • NO BENCHMARKS YET
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently, the dataset is entirely parallel across 11 languages.
30 PAPERS • 1 BENCHMARK
DBP15k contains four language-specific KGs that are respectively extracted from English (En), Chinese (Zh), French (Fr) and Japanese (Ja) DBpedia, each of which contains around 65k-106k entities. Three sets of 15k alignment labels are constructed to align entities between each of the other three languages and En.
27 PAPERS • 3 BENCHMARKS
PAWS-X contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.
26 PAPERS • 1 BENCHMARK
CMRC is a dataset is annotated by human experts with near 20,000 questions as well as a challenging set which is composed of the questions that need reasoning over multiple clues.
24 PAPERS • 1 BENCHMARK
The Weibo NER dataset is a Chinese Named Entity Recognition dataset drawn from the social media website Sina Weibo.
23 PAPERS • 1 BENCHMARK
Multilingual Document Classification Corpus (MLDoc) is a cross-lingual document classification dataset covering English, German, French, Spanish, Italian, Russian, Japanese and Chinese. It is a subset of the Reuters Corpus Volume 2 selected according to the following design choices:
22 PAPERS • 11 BENCHMARKS
The SCUT-CTW1500 dataset contains 1,500 images: 1,000 for training and 500 for testing. In particular, it provides 10,751 cropped text instance images, including 3,530 with curved text. The images are manually harvested from the Internet, image libraries such as Google Open-Image, or phone cameras. The dataset contains a lot of horizontal and multi-oriented text.
22 PAPERS • 2 BENCHMARKS
THCHS-30 is a free Chinese speech database THCHS-30 that can be used to build a full-fledged Chinese speech recognition system.
22 PAPERS • NO BENCHMARKS YET
Delta Reading Comprehension Dataset (DRCD) is an open domain traditional Chinese machine reading comprehension (MRC) dataset. This dataset aimed to be a standard Chinese machine reading comprehension dataset, which can be a source dataset in transfer learning. The dataset contains 10,014 paragraphs from 2,108 Wikipedia articles and 30,000+ questions generated by annotators.
21 PAPERS • 5 BENCHMARKS
The KVASIR Dataset was released as part of the medical multimedia challenge presented by MediaEval. It is based on images obtained from the GI tract via an endoscopy procedure. The dataset is composed of images that are annotated and verified by medical doctors, and captures 8 different classes. The classes are based on three anatomical landmarks (z-line, pylorus, cecum), three pathological findings (esophagitis, polyps, ulcerative colitis) and two other classes (dyed and lifted polyps, dyed resection margins) related to the polyp removal process. Overall, the dataset contains 8,000 endoscopic images, with 1,000 image examples per class.
21 PAPERS • 3 BENCHMARKS
The iKala dataset is a singing voice separation dataset that comprises of 252 30-second excerpts sampled from 206 iKala songs (plus 100 hidden excerpts reserved for MIREX data mining contest). The music accompaniment and the singing voice are recorded at the left and right channels respectively. Additionally, the human-labeled pitch contours and timestamped lyrics are provided.
18 PAPERS • NO BENCHMARKS YET
The BUCC mining task is a shared task on parallel sentence extraction from two monolingual corpora with a subset of them assumed to be parallel, and that has been available since 2016. For each language pair, the shared task provides a monolingual corpus for each language and a gold mapping list containing true translation pairs. These pairs are the ground truth. The task is to construct a list of translation pairs from the monolingual corpora. The constructed list is compared to the ground truth, and evaluated in terms of the F1 measure.
17 PAPERS • 4 BENCHMARKS
MIR-1K (Multimedia Information Retrieval lab, 1000 song clips) is a dataset designed for singing voice separation. It contains:
14 PAPERS • NO BENCHMARKS YET
The first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus presented consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish.
13 PAPERS • NO BENCHMARKS YET
OSCAR or Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The dataset used for training multilingual models such as BART incorporates 138 GB of text.
12 PAPERS • NO BENCHMARKS YET
Resume contains eight fine-grained entity categories -score from 74.5% to 86.88%.
12 PAPERS • 1 BENCHMARK
AVSpeech is a large-scale audio-visual dataset comprising speech clips with no interfering background signals. The segments are of varying length, between 3 and 10 seconds long, and in each clip the only visible face in the video and audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people, languages and face poses.
11 PAPERS • NO BENCHMARKS YET
VATEX is multilingual, large, linguistically complex, and diverse dataset in terms of both video and natural language descriptions. It has two tasks for video-and-language research: (1) Multilingual Video Captioning, aimed at describing a video in various languages with a compact unified captioning model, and (2) Video-guided Machine Translation, to translate a source language description into the target language using the video information as additional spatiotemporal context.
11 PAPERS • 1 BENCHMARK
CASIA-HWDB is a dataset for handwritten Chinese character recognition. It contains 300 files (240 in HWDB1.1 training set and 60 in HWDB1.1 test set). Each file contains about 3000 isolated gray-scale Chinese character images written by one writer, as well as their corresponding labels.
9 PAPERS • NO BENCHMARKS YET
Features a large-scale dataset with 12,263 annotated images. Two tasks, namely text localization and end-to-end recognition, are set up. The competition took place from January 20 to May 31, 2017. 23 valid submissions were received from 19 teams.
9 PAPERS • NO BENCHMARKS YET
CN-Celeb is a large-scale speaker recognition dataset collected `in the wild'. This dataset contains more than 130,000 utterances from 1,000 Chinese celebrities, and covers 11 different genres in real world.
8 PAPERS • 1 BENCHMARK
ChID is a large-scale Chinese IDiom dataset for cloze test. ChID contains 581K passages and 729K blanks, and covers multiple domains. In ChID, the idioms in a passage were replaced with blank symbols. For each blank, a list of candidate idioms including the golden idiom are provided as choice.
8 PAPERS • 3 BENCHMARKS
CoVoST is a large-scale multilingual speech-to-text translation corpus. Its latest 2nd version covers translations from 21 languages into English and from English into 15 languages. It has total 2880 hours of speech and is diversified with 78K speakers and 66 accents.
8 PAPERS • NO BENCHMARKS YET
Contains two different types: cloze-style reading comprehension and user query reading comprehension, associated with large-scale training data as well as human-annotated validation and hidden test set.
7 PAPERS • 3 BENCHMARKS
FM-IQA is a question-answering dataset containing over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations.
7 PAPERS • NO BENCHMARKS YET
OCNLI stands for Original Chinese Natural Language Inference. It is corpus for Chinese Natural Language Inference, collected following closely the procedures of MNLI, but with enhanced strategies aiming for more challenging inference pairs. No human/machine translation is used in creating the dataset, and thus the Chinese texts are original and not translated.
7 PAPERS • 3 BENCHMARKS
C3 is a free-form multiple-Choice Chinese machine reading Comprehension dataset.
6 PAPERS • 3 BENCHMARKS
DialogRE is the first human-annotated dialogue-based relation extraction dataset, containing 1,788 dialogues originating from the complete transcripts of a famous American television situation comedy Friends. The are annotations for all occurrences of 36 possible relation types that exist between an argument pair in a dialogue. DialogRE is available in English and Chinese.
6 PAPERS • 1 BENCHMARK
News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 5 languages (Chinese, Czech, Estonian, German, Finnish, Russian, Turkish) and additional 1500 sentences from each of the 7 languages translated to English. The sentences were selected from dozens of news websites and translated by professional translators.
6 PAPERS • NO BENCHMARKS YET
COCO-CN is a bilingual image description dataset enriching MS-COCO with manually written Chinese sentences and tags. The new dataset can be used for multiple tasks including image tagging, captioning and retrieval, all in a cross-lingual setting.
5 PAPERS • NO BENCHMARKS YET
WikiAnn is a dataset for cross-lingual name tagging and linking based on Wikipedia articles in 295 languages.
5 PAPERS • NO BENCHMARKS YET
Chinese Text in the Wild is a dataset of Chinese text with about 1 million Chinese characters from 3850 unique ones annotated by experts in over 30000 street view images. This is a challenging dataset with good diversity containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc.
4 PAPERS • NO BENCHMARKS YET
A human-to-human Chinese dialog dataset (about 10k dialogs, 156k utterances), which contains multiple sequential dialogs for every pair of a recommendation seeker (user) and a recommender (bot).
4 PAPERS • NO BENCHMARKS YET
KdConv is a Chinese multi-domain Knowledge-driven Conversation dataset, grounding the topics in multi-turn conversations to knowledge graphs. KdConv contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0. These conversations contain in-depth discussions on related topics and natural transition between multiple topics, while the corpus can also used for exploration of transfer learning and domain adaptation.
4 PAPERS • NO BENCHMARKS YET
Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering.
4 PAPERS • NO BENCHMARKS YET