8 dataset results for Chinese Reading Comprehension

CMRC (Chinese Machine Reading Comprehension)

CMRC is a dataset is annotated by human experts with near 20,000 questions as well as a challenging set which is composed of the questions that need reasoning over multiple clues.

62 PAPERS • 11 BENCHMARKS

DRCD (Delta Reading Comprehension Dataset)

Delta Reading Comprehension Dataset (DRCD) is an open domain traditional Chinese machine reading comprehension (MRC) dataset. This dataset aimed to be a standard Chinese machine reading comprehension dataset, which can be a source dataset in transfer learning. The dataset contains 10,014 paragraphs from 2,108 Wikipedia articles and 30,000+ questions generated by annotators.

52 PAPERS • 5 BENCHMARKS

CMRC 2018

CMRC 2018 (Chinese Machine Reading Comprehension 2018)

CMRC 2018 is a dataset for Chinese Machine Reading Comprehension. Specifically, it is a span-extraction reading comprehension dataset that is similar to SQuAD.

46 PAPERS • 7 BENCHMARKS

RoadTracer

RoadTracer is a dataset for extraction of road networks from aerial images. It consists of a large corpus of high-resolution satellite imagery and ground truth road network graphs covering the urban core of forty cities across six countries. For each city, the dataset covers a region of approximately 24 sq km around the city center. The satellite imagery is obtained from Google at 60 cm/pixel resolution, and the road network from OSM.

37 PAPERS • 2 BENCHMARKS

Belebele

Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.

15 PAPERS • NO BENCHMARKS YET

CJRC

CJRC (Chinese judicial reading comprehension)

The Chinese judicial reading comprehension (CJRC) dataset contains approximately 10K documents and almost 50K questions with answers. The documents come from judgment documents and the questions are annotated by law experts.

9 PAPERS • 2 BENCHMARKS

ReCO

A human-curated ChineseReading Comprehension dataset on Opinion. The questions in ReCO are opinion based queries issued to the commercial search engine. The passages are provided by the crowdworkers who extract the support snippet from the retrieved documents.

9 PAPERS • NO BENCHMARKS YET

DMQA

DMQA (DeepMind Q&A)

The DeepMind Q&A Dataset consists of two datasets for Question Answering, CNN and DailyMail. Each dataset contains many documents (90k and 197k each), and each document companies on average 4 questions approximately. Each question is a sentence with one missing word/phrase which can be found from the accompanying document/context.

3 PAPERS • NO BENCHMARKS YET

Datasets

8 dataset results for Chinese Reading Comprehension