A Qualitative Comparison of CoQA, SQuAD 2.0 and QuAC

NAACL 2019 · Mark Yatskar

We compare three new datasets for question answering, SQuAD 2.0, QuAC, and CoQA, along several of the new features they introduce: (1) unanswerable questions, (2) multi-turn interactions, and (3) abstractive answers. We show that the datasets provide complementary coverage of the first two aspects but weak coverage of the third. Because of the datasets' structural similarity, a single extractive model can be easily adapted to any of them, and we show improved baseline results on both SQuAD 2.0 and CoQA. Despite this similarity, models trained on one dataset are not effective on the others, though we find moderate performance improvements through pretraining. To encourage cross-evaluation, we release code for conversion between the datasets at https://github.com/my89/co-squac.
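The paper's actual conversion code lives in the co-squac repository linked above. As a rough illustration of the idea only (not the repository's API), the sketch below flattens a CoQA-style dialogue into SQuAD-2.0-style extractive examples by prepending recent dialogue history to each question and mapping unanswerable turns to SQuAD 2.0's empty-answer convention. Field names such as `story`, `questions`, `answers`, `span_text`, and `span_start` follow the public CoQA JSON format; the helper `coqa_to_squad` and the `history_turns` parameter are hypothetical.

```python
def coqa_to_squad(coqa_example, history_turns=2):
    """Flatten one CoQA dialogue into SQuAD-2.0-style QA pairs.

    Illustrative sketch only; the official converter is at
    https://github.com/my89/co-squac. Prepends up to `history_turns`
    previous question/answer turns to each question so a single-turn
    extractive model sees the conversational context.
    """
    context = coqa_example["story"]
    qas = []
    history = []
    for q, a in zip(coqa_example["questions"], coqa_example["answers"]):
        question_text = " ".join(history[-2 * history_turns:] + [q["input_text"]])
        span_start = a["span_start"]
        qas.append({
            "id": f'{coqa_example["id"]}_turn{q["turn_id"]}',
            "question": question_text,
            # CoQA answers are free-form; use the rationale span as an
            # extractive approximation, and mark unanswerable turns the
            # SQuAD 2.0 way (no answers + is_impossible flag).
            "answers": [] if span_start < 0 else
                       [{"text": a["span_text"], "answer_start": span_start}],
            "is_impossible": span_start < 0,
        })
        history.extend([q["input_text"], a["input_text"]])
    return {"context": context, "qas": qas}
```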


Results from the Paper


Task                Dataset  Model                   Metric         Value  Global Rank
Question Answering  CoQA     BiDAF++ (single model)  In-domain      69.4   #3
Question Answering  CoQA     BiDAF++ (single model)  Out-of-domain  63.8   #4
Question Answering  CoQA     BiDAF++ (single model)  Overall        67.8   #7
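The In-domain, Out-of-domain, and Overall values above are CoQA's word-overlap F1 scores. As a reminder of how that metric is computed, here is a simplified sketch of per-answer token F1; the real CoQA evaluation additionally normalizes answers (lowercasing, stripping punctuation and articles) and macro-averages over multiple gold references per question.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Word-overlap F1 between a predicted and a gold answer string.

    Simplified sketch of the SQuAD/CoQA-style metric; the official
    scorers also normalize text and average over multiple references.
    """
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```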

Methods


No methods listed for this paper.