FrameNet is a linguistic knowledge graph containing information about lexical and predicate argument semantics of the English language. FrameNet contains two distinct entity classes: frames and lexical units, where a frame is a meaning and a lexical unit is a single meaning for a word.
439 PAPERS • NO BENCHMARKS YET
ATIS (Airline Travel Information Systems) is a dataset consisting of audio recordings and corresponding manual transcripts of humans asking for flight information on automated airline travel inquiry systems. The data covers 17 unique intent categories. The original split contains 4478, 500 and 893 intent-labeled reference utterances in the train, development and test sets, respectively.
268 PAPERS • 7 BENCHMARKS
WikiSQL consists of a corpus of 87,726 hand-annotated pairs of SQL queries and natural language questions. The pairs are split into training (61,297 examples), development (9,145 examples) and test (17,284 examples) sets. It can be used to build natural language interfaces to relational databases.
248 PAPERS • 4 BENCHMARKS
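As a quick sanity check on the WikiSQL figures quoted above, the short sketch below sums the three split sizes and derives each split's share of the corpus. The variable names are illustrative only and are not part of any WikiSQL tooling.

```python
# Split sizes as quoted in the WikiSQL description above.
wikisql_splits = {"train": 61_297, "dev": 9_145, "test": 17_284}

# The three splits should add up to the quoted corpus size of 87,726 pairs.
total = sum(wikisql_splits.values())

# Share of each split, as a percentage rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in wikisql_splits.items()}

print(total)
print(shares)
```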
SCAN is a dataset for grounded navigation which consists of a set of simple compositional navigation commands paired with the corresponding action sequences.
140 PAPERS • NO BENCHMARKS YET
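To make the idea of "compositional commands paired with action sequences" in the SCAN entry concrete, here is a minimal sketch of an interpreter for a toy subset of such commands. It is an illustration under stated assumptions, not the dataset's actual grammar or generation code: the action token names (`WALK`, `JUMP`, ...) and the restriction to one primitive, an optional repeat word, and a single `and`/`after` conjunction are simplifications chosen for this example.

```python
from typing import List

# Hypothetical token names for this sketch; the released data may spell them differently.
ACTIONS = {"walk": "WALK", "run": "RUN", "jump": "JUMP", "look": "LOOK"}
REPEATS = {"twice": 2, "thrice": 3}


def interpret(command: str) -> List[str]:
    """Map a toy navigation command to its action sequence."""
    # "x after y" means: do y first, then x.
    if " after " in command:
        first, second = command.split(" after ", 1)
        return interpret(second) + interpret(first)
    # "x and y" means: do x, then y.
    if " and " in command:
        first, second = command.split(" and ", 1)
        return interpret(first) + interpret(second)
    # Base case: a primitive verb, optionally followed by a repeat word.
    words = command.split()
    actions = [ACTIONS[words[0]]]
    if len(words) == 2:
        actions = actions * REPEATS[words[1]]
    return actions


print(interpret("jump twice and walk"))
print(interpret("walk after run twice"))
```

The meaning of a whole command is built from the meanings of its parts, which is the property such compositional benchmarks are designed to probe.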
This variant of the Spider dataset is used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". It is based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove explicit mentions of column names while keeping the SQL queries unchanged, to better evaluate a model's capability in aligning the NL utterance with the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
81 PAPERS • 2 BENCHMARKS
WikiTableQuestions is a question answering dataset over semi-structured tables. It consists of question-answer pairs on HTML tables, and was constructed by selecting data tables from Wikipedia that contained at least 8 rows and 5 columns. Amazon Mechanical Turk workers were then tasked with writing trivia questions about each table. WikiTableQuestions contains 22,033 questions. The questions were not generated from predefined templates but were hand-crafted by users, and show high linguistic variance. It covers nearly 4,000 unique column headers, and so contains far more relations than closed-domain datasets and earlier datasets for querying knowledge bases. Its questions cover a wide range of domains and require operations such as table lookup, aggregation, superlatives (argmax, argmin), arithmetic operations, joins and unions.
72 PAPERS • 2 BENCHMARKS
A large and realistic natural language question answering dataset.
65 PAPERS • 1 BENCHMARK
Occluded REID is an occluded person dataset captured by mobile cameras, consisting of 2,000 images of 200 occluded persons. Each identity has 5 full-body person images and 5 occluded person images with different types of occlusion.
63 PAPERS • 1 BENCHMARK
ComplexWebQuestions is a dataset for answering complex questions that require reasoning over multiple web snippets. It contains a large set of complex questions in natural language, and can be used in multiple ways:
58 PAPERS • 2 BENCHMARKS
The WebQuestionsSP dataset is released as part of our ACL-2016 paper “The Value of Semantic Parse Labeling for Knowledge Base Question Answering” [Yih, Richardson, Meek, Chang & Suh, 2016], in which we evaluated the value of gathering semantic parses, vs. answers, for a set of questions that originally comes from WebQuestions [Berant et al., 2013]. The WebQuestionsSP dataset contains full semantic parses in SPARQL queries for 4,737 questions, and “partial” annotations for the remaining 1,073 questions for which a valid parse could not be formulated or where the question itself is bad or needs a descriptive answer. This release also includes an evaluation script and the output of the STAGG semantic parsing system when trained using the full semantic parses. More detail can be found in the document and labeling instructions included in this release, as well as the paper.
58 PAPERS • 5 BENCHMARKS
SParC is a large-scale dataset for complex, cross-domain, and context-dependent (multi-turn) semantic parsing and text-to-SQL tasks (interactive natural language interfaces for relational databases).
53 PAPERS • 2 BENCHMARKS
NomBank is an annotation project at New York University that is related to the PropBank project at the University of Colorado. The goal is to mark the sets of arguments that co-occur with nouns in the PropBank Corpus (the Wall Street Journal Corpus of the Penn Treebank), just as PropBank records such information for verbs. As a side effect of the annotation process, the authors are producing a number of other resources, including various dictionaries as well as PropBank-style lexical entries called frame files. These resources help the user label the various arguments and adjuncts of the head nouns with roles (sets of argument labels for each sense of each noun). NYU and the University of Colorado are making a coordinated effort to ensure that, when possible, role definitions are consistent across parts of speech. For example, PropBank's frame file for the verb "decide" was used in the annotation of the noun "decision".
49 PAPERS • NO BENCHMARKS YET
A large dataset with over 100,000 examples consisting of Java classes from online code repositories. The accompanying work also develops an encoder-decoder architecture that models the interaction between the method documentation and the class environment.
41 PAPERS • 1 BENCHMARK
CoSQL is a corpus for building cross-domain, general-purpose database (DB) querying dialogue systems. It consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz (WOZ) collection of 3k dialogues querying 200 complex DBs spanning 138 domains. Each dialogue simulates a real-world DB query scenario, with a crowd worker acting as a user exploring the DB and a SQL expert retrieving answers with SQL, clarifying ambiguous questions, or otherwise informing the user that a question is unanswerable.
40 PAPERS • 1 BENCHMARK
A dataset containing around 200K dialogs with a total of 1.6M turns. Unlike existing large-scale QA datasets, which contain simple questions that can be answered from a single tuple, the questions in these dialogs require a larger subgraph of the knowledge graph (KG).
38 PAPERS • NO BENCHMARKS YET
The SQA dataset was created to explore the task of answering sequences of inter-related questions on HTML tables. It has 6,066 sequences with 17,553 questions in total.
34 PAPERS • 1 BENCHMARK
A new large-scale geometry problem-solving dataset:
- 3,002 multi-choice geometry problems
- dense annotations in formal language for the diagrams and text
- 27,213 annotated diagram logic forms (literals)
- 6,293 annotated text logic forms (literals)
32 PAPERS • 1 BENCHMARK
This dataset is constructed from the single-intent dataset SNIPS.
26 PAPERS • 3 BENCHMARKS
Task Oriented Parsing v2 (TOPv2) representations for intent-slot based dialog systems.
25 PAPERS • NO BENCHMARKS YET
The AMR Bank is a set of English sentences paired with simple, readable semantic representations. Version 3.0, released in 2020, consists of 59,255 sentences.
22 PAPERS • 1 BENCHMARK
A large-scale dataset for Complex KBQA.
This dataset contains card descriptions of the card game Hearthstone and the code that implements them. These are obtained from the open-source implementation Hearthbreaker (https://github.com/danielyule/hearthbreaker).
21 PAPERS • NO BENCHMARKS YET
QuaRel is a crowdsourced dataset of 2,771 multiple-choice story questions, including their logical forms.
20 PAPERS • NO BENCHMARKS YET
One of the largest commonsense knowledge bases available, describing over 2 million disambiguated concepts and activities, connected by over 18 million assertions.
Groningen Meaning Bank is a semantic resource that anyone can edit and that integrates various semantic phenomena, including predicate-argument structure, scope, tense, thematic roles, animacy, pronouns, and rhetorical relations.
18 PAPERS • NO BENCHMARKS YET
GraphQuestions is a characteristic-rich dataset designed for factoid question answering. The dataset aims to provide a systematic way of constructing QA datasets with rich and explicitly specified question characteristics.
14 PAPERS • 2 BENCHMARKS
Fashion 144K is a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information.
11 PAPERS • NO BENCHMARKS YET
SEDE is a dataset of 12,023 complex and diverse SQL queries with their natural language titles and descriptions, written by real users of the Stack Exchange Data Explorer in the course of natural interaction. These pairs contain a variety of real-world challenges that have rarely been reflected in other semantic parsing datasets. The goal of this dataset is to take a significant step towards evaluating Text-to-SQL models in a real-world setting. SEDE contains at least 10 times more SQL query templates (queries after canonization and anonymization of values) than other Text-to-SQL datasets, and has the most diverse set of utterances and SQL queries (in terms of 3-grams) of all single-domain datasets. SEDE introduces real-world challenges such as under-specification, use of parameters in queries, date manipulation and more.
11 PAPERS • 1 BENCHMARK
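The SEDE entry counts "query templates", that is, queries after canonization and anonymization of values. The sketch below illustrates that general idea with a deliberately crude normalizer: literals are replaced by placeholders and case and whitespace are normalized, so that queries differing only in their values collapse to the same template. This is an assumption-laden illustration, not SEDE's actual procedure; the function name `to_template` and the placeholders `STR` and `NUM` are hypothetical.

```python
import re


def to_template(sql: str) -> str:
    """Reduce a SQL query to a rough template by masking literal values."""
    # Mask single-quoted string literals first, so digits inside them are not touched.
    sql = re.sub(r"'[^']*'", "STR", sql)
    # Mask standalone integer and decimal literals (identifiers such as t1 are left alone).
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "NUM", sql)
    # Normalize case and collapse runs of whitespace.
    return " ".join(sql.upper().split())


queries = [
    "SELECT Id FROM Posts WHERE Score > 10",
    "select id  from posts where score > 250",
    "SELECT Id FROM Users WHERE DisplayName = 'alice'",
]

# Queries that differ only in literal values share one template.
templates = {to_template(q) for q in queries}
print(len(templates))
```

Counting distinct templates rather than distinct query strings gives a measure of structural diversity that is not inflated by changes in constants alone.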
ComQA is a large dataset of real user questions that exhibit different challenging aspects such as compositionality, temporal reasoning, and comparisons. ComQA questions come from the WikiAnswers community QA platform, which typically contains questions that are not satisfactorily answerable by existing search engine technology.
9 PAPERS • NO BENCHMARKS YET
A dataset of utterances, incorrect SQL interpretations and the corresponding natural language feedback.
6 PAPERS • NO BENCHMARKS YET
1000 query triples on 120 tables.
5 PAPERS • NO BENCHMARKS YET
The Probabilistic Context Free Grammar String Edit Task (PCFG SET) dataset consists of sequence-to-sequence problems specifically designed to test different aspects of compositional generalisation. In particular, the dataset contains splits to test for systematicity, productivity, substitutivity, localism and overgeneralisation.
4 PAPERS • NO BENCHMARKS YET
The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language. It contains 82,000 sentences, 1.2 million words and 250,000 punctuation marks. Texts were selected from six different domains, with about 200,000 words from each. The domains are the following:
A dataset of approximately 75,000 phrases and sentences, syntactically analyzed as typelogical derivations (i.e. proofs of modal intuitionistic linear logic, or programs of the corresponding λ calculus). Analyses were obtained by transforming the dependency graphs of the Lassy-Small corpus.
Question Answering (QA) is a widely-used framework for developing and evaluating an intelligent machine. In this light, QA on Electronic Health Records (EHR), namely EHR QA, can work as a crucial milestone toward developing an intelligent agent in healthcare. EHR data are typically stored in a relational database, which can also be converted to a directed acyclic graph, allowing two approaches for EHR QA: Table-based QA and Knowledge Graph-based QA.
3 PAPERS • NO BENCHMARKS YET
Multilingual TOP is a dataset for multilingual semantic parsing with human-written sentences as opposed to machine translated ones. The dataset sentences are in English, Italian and Japanese and it is based on the Facebook Task Oriented Parsing (TOP) dataset.
The first dataset contains annotated natural language queries (in Mandarin) with their Cypher equivalents. It is made up of:
- A Neo4j database
- 10,000 pairs of Text-Cypher queries
2 PAPERS • NO BENCHMARKS YET
Spades contains 93,319 questions derived from ClueWeb09 sentences. Specifically, the questions were created by randomly removing an entity from each sentence, thus producing sentence-denotation pairs.
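The construction described for Spades, removing a randomly chosen entity from a sentence and treating the removed entity as the answer, can be sketched as follows. This is a minimal illustration, assuming entity mentions are already identified; the function name `make_cloze` and the `_blank_` placeholder are choices made for this example rather than details confirmed by the description above.

```python
import random
from typing import List, Tuple


def make_cloze(sentence: str, entities: List[str], rng: random.Random) -> Tuple[str, str]:
    """Blank out one randomly chosen entity; return (question, answer)."""
    # Pick the entity that will become the denotation (answer).
    answer = rng.choice(entities)
    # Replace only the first occurrence of that entity with a placeholder.
    question = sentence.replace(answer, "_blank_", 1)
    return question, answer


rng = random.Random(0)  # seeded so that repeated runs pick the same entity
sentence = "Barack Obama was born in Honolulu"
question, answer = make_cloze(sentence, ["Barack Obama", "Honolulu"], rng)
print(question, "->", answer)
```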
Schema2QA is the first large question answering dataset over real-world Schema.org data. It covers 6 common domains (restaurants, hotels, people, movies, books, and music), based on crawled Schema.org metadata from 6 different websites (Yelp, Hyatt, LinkedIn, IMDb, Goodreads, and last.fm). In total, there are over 2,000,000 examples for training, consisting of both augmented human paraphrase data and high-quality synthetic data generated by Genie. All questions are annotated with ThingTalk, an executable virtual assistant programming language.
ViText2SQL is a dataset for the Vietnamese Text-to-SQL semantic parsing task, consisting of about 10K question and SQL query pairs.
Conic10K is an open-ended math problem dataset on conic sections in Chinese senior high school education. It contains 10,861 carefully annotated problems, each with a formal representation, the corresponding text spans, the answer, and natural language rationales. The questions require long reasoning steps, while the topic is limited to conic sections. It can be used to evaluate models on 2 tasks: semantic parsing and mathematical question answering (mathQA).
1 PAPER • NO BENCHMARKS YET
Hinglish-TOP is a human-annotated code-switched semantic parsing dataset containing 10k human annotations for Hindi-English (Hinglish) code-switched utterances, and over 170K CST5-generated code-switched utterances from the TOPv2 dataset.
Overnight is a dataset for semantic parsing in eight domains.
SimpleQuestionsWikidata maps SimpleQuestions to Wikidata.
TurkQA consists of a selection of sentences from English Wikipedia articles, with questions and answers crowdsourced from workers on Amazon Mechanical Turk.