Stanford Schema2QA Dataset

Introduced by Xu et al. in Schema2QA: High-Quality and Low-Cost Q&A Agents for the Structured Web

Schema2QA is the first large question answering dataset over real-world Schema.org data. It covers 6 common domains (restaurants, hotels, people, movies, books, and music), based on Schema.org metadata crawled from 6 websites (Yelp, Hyatt, LinkedIn, IMDb, Goodreads, and last.fm). In total, there are over 2,000,000 training examples, consisting of both augmented human paraphrase data and high-quality synthetic data generated by Genie. All questions are annotated with ThingTalk, an executable virtual assistant programming language.

Schema2QA includes challenging evaluation questions collected from crowd workers. Workers are told only the domain and which properties are supported, so the sentences are natural and diverse. They also contain entities unseen during training. The collected sentences are manually annotated with ThingTalk by the authors. In total, there are over 5,000 examples across the dev and test sets.

An example of an evaluation question and its ThingTalk annotation is shown below:

"What are the highest ranked burger joints in the 40 mile area around Asheville NC?"

sort(aggregateRating.ratingValue desc of @org.schema.Restaurant.Restaurant()
  filter distance(geo, new Location("asheville nc")) <= 40 mi &&
         servesCuisine =~ "burger")[1];
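
To make the annotation above easier to read, here is a rough Python sketch of what the query appears to express: keep restaurants within 40 miles of the location whose cuisine matches "burger", sort by rating in descending order, and take the first result. This is only an illustration, not how ThingTalk or Genie executes programs. The Restaurant class, the sample records, the coordinates used for Asheville, and the helper names are all hypothetical, and the `=~` operator is assumed here to behave like a case-insensitive substring match.

```python
import math
from dataclasses import dataclass


@dataclass
class Restaurant:
    # Hypothetical record loosely mirroring the Schema.org properties in the query.
    name: str
    lat: float
    lon: float
    serves_cuisine: str
    rating_value: float


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles, using the haversine formula."""
    earth_radius_mi = 3958.8
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_mi * math.asin(math.sqrt(a))


def top_rated(restaurants, center, radius_mi, cuisine):
    """Filter by distance and cuisine, sort by rating descending, return the first match."""
    matches = [
        r for r in restaurants
        if miles_between(r.lat, r.lon, center[0], center[1]) <= radius_mi
        and cuisine.lower() in r.serves_cuisine.lower()  # assumed reading of `=~`
    ]
    matches.sort(key=lambda r: r.rating_value, reverse=True)
    return matches[0] if matches else None  # assumed reading of `[1]`: first result


# Approximate coordinates for Asheville, NC (assumption for illustration).
ASHEVILLE = (35.5951, -82.5515)

# Made-up sample data; none of these records come from the dataset.
SAMPLE = [
    Restaurant("Burger A", 35.60, -82.55, "Burgers", 4.2),
    Restaurant("Burger B", 35.45, -82.30, "American, Burgers", 4.7),
    Restaurant("Far Burger", 35.2271, -80.8431, "Burgers", 4.9),  # well outside 40 miles
    Restaurant("Sushi Place", 35.59, -82.55, "Japanese", 4.8),    # wrong cuisine
]

best = top_rated(SAMPLE, ASHEVILLE, 40, "burger")
print(best.name if best else "no match")
```

In this sketch the highest-rated record overall is excluded by the distance filter, which shows why the filter is applied before sorting.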
