The NewsQA dataset is a crowd-sourced machine reading comprehension dataset of 120,000 question-answer pairs.
- Documents are CNN news articles.
- Questions are written by human users in natural language.
- Answers may be multiword passages of the source text.
- Questions may be unanswerable.
- NewsQA is collected using a 3-stage, siloed process.
- Questioners see only an article’s headline and highlights.
- Answerers see the question and the full article, then select an answer passage.
- Validators see the article, the question, and a set of answers that they rank.
- NewsQA is more natural and more challenging than previous datasets.