Improving Text-to-SQL Evaluation Methodology

To be informative, an evaluation must measure how well systems generalize to realistic unseen data. We identify limitations of and propose improvements to current evaluations of text-to-SQL systems. First, we compare human-generated and automatically generated questions, characterizing properties of queries necessary for real-world applications. To facilitate evaluation on multiple datasets, we release standardized and improved versions of seven existing datasets and one new text-to-SQL dataset. Second, we show that the current division of data into training and test sets measures robustness to variations in the way questions are asked, but only partially tests how well systems generalize to new queries; therefore, we propose a complementary dataset split for evaluation of future work. Finally, we demonstrate how the common practice of anonymizing variables during evaluation removes an important challenge of the task. Our observations highlight key difficulties, and our methodology enables effective measurement of future development.

ACL 2018
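
The proposed "query split" can be illustrated with a short sketch. Assuming `examples` is a list of `(question, sql)` pairs, the code below contrasts the standard question split, where paraphrases of the same query can land in both train and test, with a query split that groups examples by their variable-anonymized SQL so no query pattern seen in training reappears at test time. The `anonymize` heuristic and split ratio are simplified illustrations, not the authors' exact implementation.

```python
import random
import re
from collections import defaultdict

def anonymize(sql: str) -> str:
    """Rough variable anonymization for grouping: replace quoted string
    literals and bare numbers with placeholders (a crude stand-in for the
    paper's variable identification)."""
    sql = re.sub(r"'[^']*'", "'var'", sql)
    return re.sub(r"\b\d+\b", "num", sql)

def question_split(examples, test_fraction=0.2, seed=0):
    """Standard split: questions are shuffled individually, so paraphrases
    of the same SQL query can appear in both train and test."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def query_split(examples, test_fraction=0.2, seed=0):
    """Proposed complementary split: group examples by anonymized SQL and
    assign whole groups to train or test, so no query pattern seen in
    training reappears at test time."""
    groups = defaultdict(list)
    for question, sql in examples:
        groups[anonymize(sql)].append((question, sql))
    keys = sorted(groups)
    rng = random.Random(seed)
    rng.shuffle(keys)
    cut = int(len(keys) * (1 - test_fraction))
    train = [ex for k in keys[:cut] for ex in groups[k]]
    test = [ex for k in keys[cut:] for ex in groups[k]]
    return train, test
```

A model that merely memorizes training queries can score well under `question_split` but will fail under `query_split`, which is exactly the generalization gap the paper's evaluation is designed to expose.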

Results from the Paper


| Task | Dataset | Model | Question Split | Global Rank | Query Split | Global Rank |
|---|---|---|---:|---:|---:|---:|
| SQL Parsing | Academic | Seq2Seq with copying | 81 | #1 | 74 | #1 |
| SQL Parsing | Academic | Template Baseline | 0 | #3 | 0 | #3 |
| SQL Parsing | Advising | Seq2Seq with copying | 70 | #2 | 0 | #2 |
| SQL Parsing | Advising | Template Baseline | 80 | #1 | 0 | #2 |
| SQL Parsing | ATIS | Seq2Seq with copying | 51 | #1 | 32 | #1 |
| SQL Parsing | ATIS | Template Baseline | 45 | #2 | 0 | #3 |
| SQL Parsing | GeoQuery | Template Baseline | 66 | #2 | 0 | #3 |
| SQL Parsing | GeoQuery | Seq2Seq with copying | 71 | #1 | 20 | #2 |
| SQL Parsing | IMDb | Seq2Seq with copying | 26 | #1 | 9 | #1 |
| SQL Parsing | IMDb | Template Baseline | 0 | #3 | 0 | #3 |
| SQL Parsing | Restaurants | Seq2Seq with copying | 100 | #1 | 4 | #2 |
| SQL Parsing | Restaurants | Template Baseline | 95 | #3 | 0 | #3 |
| SQL Parsing | Scholar | Seq2Seq with copying | 59 | #1 | 5 | #1 |
| SQL Parsing | Scholar | Template Baseline | 52 | #2 | 0 | #3 |
| SQL Parsing | Yelp | Seq2Seq with copying | 12 | #1 | 4 | #2 |
| SQL Parsing | Yelp | Template Baseline | 1 | #3 | 0 | #3 |
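
The Template Baseline's zero scores on every query split follow by construction: it can only emit SQL templates observed during training. The sketch below is a simplified, hypothetical stand-in (nearest neighbor by word overlap, rather than the paper's learned classifier with slot filling) that shows why such a system cannot generalize to unseen query templates.

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased word sets (illustrative only)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

class TemplateBaseline:
    """Simplified stand-in for a template baseline: memorize
    (question, sql_template) pairs from training and return the template
    of the most lexically similar training question. Because it can only
    produce templates seen in training, its accuracy on a query split is
    0 by construction."""

    def __init__(self, train_examples):
        self.memory = list(train_examples)  # [(question, sql_template)]

    def predict(self, question: str) -> str:
        _, best_sql = max(self.memory,
                          key=lambda ex: word_overlap(question, ex[0]))
        return best_sql
```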
