8 dataset results for Text-To-SQL AND Texts

SParC is a large-scale dataset for complex, cross-domain, context-dependent (multi-turn) semantic parsing and text-to-SQL tasks, aimed at interactive natural language interfaces for relational databases.

48 PAPERS • 2 BENCHMARKS

CoSQL (Conversational Text-to-SQL Challenge)

CoSQL is a corpus for building cross-domain, general-purpose database (DB) querying dialogue systems. It consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz (WOZ) collection of 3k dialogues querying 200 complex DBs spanning 138 domains. Each dialogue simulates a real-world DB query scenario, with a crowd worker acting as a user exploring the DB and a SQL expert retrieving answers with SQL, clarifying ambiguous questions, or informing the user when a question is unanswerable.

33 PAPERS • 1 BENCHMARK

KaggleDBQA (KaggleDBQA: Realistic Text-to-SQL dataset)

KaggleDBQA is a challenging cross-domain and complex evaluation dataset of real Web databases, with domain-specific data types, original formatting, and unrestricted questions.

16 PAPERS • 1 BENCHMARK

SEDE (Stack Exchange Data Explorer)

SEDE is a dataset of 12,023 complex and diverse SQL queries with their natural language titles and descriptions, written by real users of the Stack Exchange Data Explorer in the course of natural interaction. These pairs contain a variety of real-world challenges that have rarely been reflected in other semantic parsing datasets. The goal of this dataset is to take a significant step towards evaluating Text-to-SQL models in a real-world setting. SEDE contains at least 10 times more SQL query templates (queries after canonicalization and anonymization of values) than other Text-to-SQL datasets, and has the most diverse set of utterances and SQL queries (in terms of 3-grams) of all single-domain datasets. SEDE introduces real-world challenges such as under-specification, use of parameters in queries, date manipulation, and more.

8 PAPERS • 1 BENCHMARK

ADVETA

ADVErsarial Table perturbAtion (ADVETA) is a robustness evaluation benchmark featuring natural and realistic adversarial table perturbations (ATPs). It is based on three mainstream Text-to-SQL datasets: Spider, WikiSQL, and WTQ.

2 PAPERS • NO BENCHMARKS YET

MultiSpider

MultiSpider is a large multilingual text-to-SQL dataset which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese).

2 PAPERS • NO BENCHMARKS YET

ViText2SQL

ViText2SQL is a dataset for the Vietnamese Text-to-SQL semantic parsing task, consisting of about 10K question and SQL query pairs.

2 PAPERS • NO BENCHMARKS YET

NSText2SQL: An Open Source Text-to-SQL Dataset for Foundation Model Training

Numbers Station Text to SQL

0 PAPERS • NO BENCHMARKS YET
