Evaluating Text-to-SQL Model Failures on Real-World Data

Abstract:

Text-to-SQL generation models, capable of converting natural language prompts into SQL queries, offer significant potential for streamlining data analytics tasks. Despite state-of-the-art performance on popular academic benchmarks such as Spider [1], recent large language models, such as GPT-4, exhibit a considerable performance degradation on real-world applications with longer, more convoluted schemas [2]. This disparity raises questions about what factors contribute to this drop and whether existing academic benchmarks are effective for representing real-world challenges. To determine these factors, we first examine Text-to-SQL model failures on customer logs. We find that accuracy on customer logs was on average 30% lower than accuracy on Spider. We identify three main challenges in real-world Text-to-SQL applications: long context length, unclear question formulation, and greater query complexity. With these insights, we create a new benchmark built from manually labeled customer logs and evaluate existing open source and private LLMs to demonstrate the impact of each factor on model performance. The benchmark incorporates 20 non-join queries and 30 join queries, each accompanied by three additional question phrasing variations, resulting in 200 queries total. To capture the effects of large schemas, we vary schema size from 5 to over 300 columns while retaining the minimum columns required to answer all questions. We assess the performance of prominent Text-to-SQL models, including GPT-4, GPT-3.5, BigCode's Starcoder [3], and NSQL Llama-2 [4] on both our benchmark and the Spider benchmark for comparative analysis. We use Spider execution accuracy to measure model performance. 
The evaluation results reveal (a) a consistent decline in execution accuracy for longer schemas, of about 0.5 percentage points for every additional 10 columns, indicating that existing Text-to-SQL models struggle with progressively larger tables and schema lengths that are character...
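
The execution-accuracy metric named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code or the official Spider evaluator; it assumes a simplified definition in which a predicted query counts as correct when it runs without error and returns the same multiset of rows as the gold query. The table `orders`, the helper names `run_query` and `execution_accuracy`, and the example queries are all hypothetical.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return Counter(rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs with matching results.

    Simplifying assumption: row order is ignored, so queries that differ
    only in ORDER BY are treated as equivalent.
    """
    if not pairs:
        return 0.0
    correct = 0
    for predicted_sql, gold_sql in pairs:
        gold = run_query(conn, gold_sql)
        predicted = run_query(conn, predicted_sql)
        if gold is not None and predicted == gold:
            correct += 1
    return correct / len(pairs)


# Hypothetical toy database, used only to exercise the metric.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'a', 10.0), (2, 'b', 25.0), (3, 'a', 5.0);
    """
)

pairs = [
    # Different SQL text, same result: counted as correct.
    ("SELECT COUNT(*) FROM orders", "SELECT COUNT(id) FROM orders"),
    # Runs, but returns different rows: counted as incorrect.
    ("SELECT customer FROM orders WHERE amount > 20",
     "SELECT customer FROM orders WHERE amount >= 10"),
    # References a column that does not exist: counted as incorrect.
    ("SELECT nope FROM orders", "SELECT id FROM orders"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.3f}")
```

Comparing query results rather than query text is what lets two differently written but equivalent queries both count as correct, which matters when the same question is posed with several phrasings.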

Published in: 2024 IEEE 40th International Conference on Data Engineering (ICDE)

Date of Conference: 13-16 May 2024

Date Added to IEEE Xplore: 23 July 2024

DOI: 10.1109/ICDE60146.2024.00456

Conference Location: Utrecht, Netherlands
