The ProPara dataset is designed to train and test comprehension of simple paragraphs describing processes (e.g., photosynthesis). The task is to predict, track, and answer questions about how entities change during a process.
ProPara aims to promote research on natural language understanding of procedural text. Understanding such text requires identifying the actions described in a paragraph and tracking the state changes of the entities involved. The dataset contains 488 paragraphs and 3,300 sentences. Each paragraph is annotated with the existence and location of every main entity (the “participants”) at every time step (sentence) throughout the procedure, for a total of about 81,000 annotations.
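To make the annotation scheme concrete, the following is a minimal Python sketch of a participant-by-time-step grid and of how state changes could be read off it. The toy paragraph, the `states` layout, the use of `None` for "does not exist", the extra initial state before the first sentence, and the `state_changes` helper are all assumptions made for illustration; they are not the dataset's actual file format or an official evaluation procedure.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy paragraph, written for this sketch (not taken from the dataset).
sentences = [
    "Roots absorb water from the soil.",
    "The water travels to the leaf.",
    "The leaf turns the water into sugar.",
]

# states[participant][t] is the participant's location after sentence t,
# with t = 0 meaning "before the first sentence" (an assumption of this sketch).
# None means the participant does not exist at that time step.
states: Dict[str, List[Optional[str]]] = {
    "water": ["soil", "root", "leaf", None],
    "sugar": [None, None, None, "leaf"],
}


def state_changes(locations: List[Optional[str]]) -> List[Tuple[int, str]]:
    """Return (sentence_number, change_type) pairs for one participant."""
    changes = []
    for step in range(1, len(locations)):
        before, after = locations[step - 1], locations[step]
        if before is None and after is not None:
            changes.append((step, "create"))
        elif before is not None and after is None:
            changes.append((step, "destroy"))
        elif before != after:
            changes.append((step, "move"))
    return changes


for participant, locations in states.items():
    print(participant, state_changes(locations))
```

In this layout, each participant has one entry per sentence plus one initial entry, so a paragraph with N sentences and P participants yields P × (N + 1) state annotations.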
ProPara paragraphs are natural (authored by crowd workers) rather than synthetic (as in, e.g., bAbI). Workers were given a prompt (e.g., “What happens during photosynthesis?”) and then asked to author a series of sentences describing the sequence of events in the procedure. From these sentences, participant entities and their existence and locations were identified. The goal of the challenge is to predict the existence and location of each participant, based on the sentences in the paragraph.
Source: Allen Institute for AI