Large-scale Cloze Test Dataset Created by Teachers

EMNLP 2018  ·  Qizhe Xie, Guokun Lai, Zihang Dai, Eduard Hovy

Cloze tests are widely used in language exams to evaluate students' language proficiency. In this paper, we propose CLOTH, the first large-scale human-created cloze test dataset, containing questions used in middle-school and high-school language exams. With blanks carefully created by teachers and candidate choices purposely designed to be nuanced, CLOTH requires deeper language understanding and a wider attention span than previous automatically generated cloze datasets. We test the performance of dedicated baseline models, including a language model trained on the One Billion Word Corpus, and show that humans outperform them by a significant margin. We investigate the source of the performance gap, trace model deficiencies to some distinct properties of CLOTH, and identify the limited ability to comprehend long-term context as the key bottleneck.
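To make the task format concrete, here is a minimal sketch of how a multiple-choice cloze question could be represented and scored by accuracy. The `ClozeQuestion` schema, the `_` blank marker, the helper names, and the toy sentences are all assumptions made for illustration; they are not the CLOTH file format, the paper's evaluation code, or examples taken from the dataset.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClozeQuestion:
    """One blank with its candidate choices (hypothetical schema, not CLOTH's)."""
    passage: str        # text with the blank marked by a single "_"
    options: List[str]  # candidate words for the blank
    answer: int         # index of the correct option


def fill_blank(question: ClozeQuestion, option_index: int) -> str:
    """Return the passage with the blank replaced by the chosen option."""
    return question.passage.replace("_", question.options[option_index], 1)


def accuracy(questions: List[ClozeQuestion],
             predict: Callable[[ClozeQuestion], int]) -> float:
    """Fraction of questions for which `predict` returns the correct index."""
    if not questions:
        return 0.0
    correct = sum(1 for q in questions if predict(q) == q.answer)
    return correct / len(questions)


# Toy, made-up questions used only to exercise the functions above.
toy = [
    ClozeQuestion("She was late, so she _ to school.",
                  ["walked", "hurried", "slept", "sang"], 1),
    ClozeQuestion("He opened the _ and read it.",
                  ["letter", "river", "cloud", "silence"], 0),
]


def always_first(question: ClozeQuestion) -> int:
    """Trivial baseline that always picks the first option."""
    return 0


print(fill_blank(toy[0], toy[0].answer))
print(accuracy(toy, always_first))
```

A real baseline would replace `always_first` with a function that scores each filled-in passage, for example with a language model, and returns the index of the highest-scoring option.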


Datasets


Introduced in the Paper:

CLOTH

Used in the Paper:

BookCorpus, LAMBADA, CBT

