
Query Refinement and Chain of Thought for Improvement in SQL Generation using LLM


N. V. Ravi Ashray (21BCE7919)
May, 2025

Abstract
In an increasingly digitized world, vast volumes of structured data are being stored in relational
databases. Enabling non-technical stakeholders to query such databases via natural language interfaces
has become a critical need. Large Language Models (LLMs) have shown promise in bridging this gap
through Text-to-SQL translation. While fine-tuned LLMs achieve strong performance, they demand
substantial computational resources and domain-specific data. In contrast, in-context learning offers a
lighter-weight alternative but often suffers from schema hallucinations, incorrect joins, and ungrounded
WHERE clauses, limiting execution accuracy. To address these challenges, we propose a two-stage LLM
pipeline using few-shot prompting without any model fine-tuning. Few-shot prompting yields consistent results without requiring substantial engineering effort. The first LLM performs schema linking by selecting relevant tables and columns and generating SQL preview queries for the relevant schema. These previews are then executed to extract a small set of representative rows, which are provided to the second LLM responsible for final SQL query generation. This design improves grounding and reduces semantic errors. We evaluate our method on the Spider dataset, achieving 68% and 77% execution accuracy on Spider's dev set using GPT-3.5-turbo and GPT-4o-mini respectively.
lightweight decomposition and execution-grounded prompting can yield robust performance, rivalling
more complex in-context methods like DIN-SQL and CoT-based pipelines.

1 Introduction
1.1 Text-to-SQL
Text-to-SQL is a Natural Language Processing (NLP) task that involves converting natural language queries into executable SQL queries with the help of the database schema (Hong et al., 2025). Figure 1 illustrates this: a user asks a question requesting, say, the names of all students. The text-to-SQL system uses the database schema to generate a SQL query answering the question; the query is then run on the database, and the result is returned to the user.

1.2 Need for Text-To-SQL


We live in an era of unprecedented data generation and storage. As of 2024, the global data volume stands at 149 zettabytes and is projected to rise further to 181 zettabytes by the end of 2025 (Bartley, n.d.). Approximately 402.74 million terabytes of data were generated daily throughout 2024, with an estimated 90% of the world's data having been generated within just the past two years. This explosive growth reflects the increasing digitization of global activities across consumer applications and industrial operations. Relational databases continue to be the backbone of enterprise data storage. A survey in 2019 found that 55% of businesses use SQL Server, 40% use MySQL, and 37% use Oracle for their data needs (Bourne, 2019). This prevalence of relational databases creates a critical need for tools that make this structured data accessible. Data democratization, the ongoing effort to make data accessible to everyone within an organization regardless of technical expertise, has become a key priority for forward-thinking businesses. Text-to-SQL systems are vital enablers of this democratization, allowing business users to access insights without intermediate technical assistance, thereby empowering them to make data-driven decisions and build better customer experiences (Sandwell, 2021).

Figure 1: Text-to-SQL Process

1.3 Challenges in Text-to-SQL


Academic studies highlight persistent challenges in Text-to-SQL systems:
• Semantic ambiguity: Natural language queries often lack explicit structural or contextual cues, leading
to misinterpretation of original intent (Renggli et al., 2025).
• Schema complexity: Cross-domain databases with evolving schemas require robust schema linking to
map user intent to tables/columns (Hong et al., 2025).
• Query execution accuracy: Hallucinations in JOIN operations, WHERE clauses, and aggregations reduce reliability (Mitsopoulou et al., 2025).
• Evaluation limitations: Synthetic benchmarks often fail to capture real-world data variability and metric biases (Mitsopoulou et al., 2025; Renggli et al., 2025).

1.4 Overview of Existing Approaches


Academic research categorizes Text-to-SQL methodologies as follows (Table 1). In-context LLM methods include those shown in Table 2. Recent advances include execution-guided
prompting (Hong et al., 2025), where intermediate SQL previews validate schema links, and skeleton-aware
decoding (Li et al., 2023), which separates structural parsing from schema alignment. However, fine-tuned
PLMs still outperform zero-shot LLMs by 12–15% on Spider’s dev set (Deng et al., 2022).

1.5 Problem Statement


Despite significant advances in text-to-SQL systems, substantial challenges remain. While fine-tuned Large Language Models (LLMs) have demonstrated promising results, they require extensive computational resources and domain-specific data for training. Alternative approaches using in-context learning offer a lighter-weight solution but often suffer from critical errors, including schema hallucinations, incorrect joins, and ungrounded WHERE clauses, that significantly limit execution accuracy. The core problem is finding an effective balance between performance and resource efficiency. Existing solutions either require prohibitive amounts of training data and computational power or sacrifice accuracy for practicality. There is a pressing need for a solution that can maintain high execution accuracy while remaining computationally efficient and adaptable to new database schemas without extensive retraining or fine-tuning.

Table 1: Overview of existing methodologies in Text-to-SQL (Hong et al., 2025; Deng et al., 2022)

Approach | Key Characteristics | Limitations
Rule-based systems | Manual heuristics for simple queries; limited scalability. | Fragile to schema changes; poor cross-domain generalization.
Deep learning models | Sequence-to-sequence architectures (e.g., LSTMs) improved cross-domain generalization. | Struggle with complex JOINs and nested queries.
Pre-trained LMs (PLMs) | BERT/RoBERTa fine-tuned for schema-aware SQL generation. | Require domain-specific training data; parameter inefficiency.
LLM-based methods | In-context learning with GPT variants; reduced training costs. | Prone to hallucinations; limited by context windows.

Table 2: Overview of existing methodologies in Text-to-SQL which utilize in-context learning

Approach | Key Methodology | Performance
ACT-SQL (Zhang et al., 2023) | Automatically generates Chain-of-Thought prompts for schema linking, achieving 85.3% execution accuracy on Spider via a single API call. | SOTA among ICL methods on Spider dev set.
DIN-SQL (Pourreza and Rafiei, 2023) | Decomposes tasks into sub-problems (schema linking, query refinement) with self-correction, reaching 79.9% accuracy on Spider. | Outperforms fine-tuned models by 5%+.
ICRL (Toteja et al., 2025) | Uses in-context reinforcement learning for retrieval-augmented generation, improving schema recall by 15-20%. | Enhanced scalability for multi-database systems.


1.6 Objectives
This research aims to develop and evaluate a novel approach to text-to-SQL translation with the following
specific objectives:
• Create a two-stage LLM pipeline using few-shot prompting that eliminates the need for model fine-
tuning while maintaining high execution accuracy.
• Implement effective schema linking mechanisms to select only relevant tables and columns from database
schemas, reducing potential for hallucination and error.
• Develop a method for generating SQL preview queries that can extract representative data rows to
ground the final query generation.
• Evaluate the performance of the proposed approach on standard benchmarks, specifically the Spider
dataset, using both ChatGPT-3.5-turbo and GPT-4o-mini models.
• Compare the execution accuracy of our approach with existing methods, particularly more complex
in-context methods like DIN-SQL and CoT-based pipelines.
• Demonstrate that lightweight decomposition and execution-grounded prompting can produce robust
performance without requiring extensive computational resources or specialized model training.

1.7 Scope
This research focuses specifically on developing and evaluating a two-stage LLM pipeline for text-to-SQL
translation using few-shot prompting without any model fine-tuning. The scope includes:
• Development of prompting strategies for two distinct phases: schema linking with preview query
generation, and final SQL query generation.
• Testing on the Spider dataset, a cross-domain text-to-SQL benchmark featuring 200 databases across
138 domains.
• Evaluation using execution accuracy metrics on Spider’s development set.
• Comparison with existing methods, particularly in-context learning approaches.
• Analysis of performance differences between ChatGPT-3.5-turbo and GPT-4o-mini models when used
within our pipeline.
The research does not address cross-lingual text-to-SQL generation, handling of extremely complex database
schemas beyond those present in the Spider dataset, or application to specialized domains requiring expert
knowledge. Additionally, while our approach is designed to be adaptable to different LLMs, we limit our
testing to the specified models.

2 Literature Survey
2.1 Introduction
The exponential growth of structured data, projected to reach 181 zettabytes by 2025, and the ubiquity of relational databases (55% of enterprises use SQL Server) have made text-to-SQL translation a critical technology for data democratization (Gretel.ai, n.d.). Historically, database querying evolved from punch cards in the 1940s to Edgar Codd's relational model in 1970, culminating in SQL, which was developed at IBM and later standardized. Early text-to-SQL systems relied on rule-based heuristics, but their fragility to schema changes and limited cross-domain generalization spurred neural approaches (Deng et al., 2022). The emergence of large language models (LLMs) like GPT-4 has revolutionized this field. Unlike fine-tuned models requiring domain-specific data, LLMs leverage in-context learning to adapt to new schemas dynamically (Zhu and Li, 2024). This shift aligns with industry demands: 72% of organizations now prioritize natural language interfaces to reduce dependency on technical teams (Gretel.ai, n.d.). Modern benchmarks like Spider (200 databases across 138 domains) highlight both the progress and the persistent challenges: current LLM-based methods achieve 68-85% execution accuracy, but hallucinations and schema linking errors remain prevalent (Hong et al., 2025).

2.1.1 Text-to-SQL Task Overview


Text-to-SQL is formally defined as a mapping SQL = f(Q, S), where Q is a natural language question and S is the database schema (tables, columns, relationships). The goal is to generate an executable SQL query that retrieves the intended data.
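For concreteness, a minimal instance of this mapping (with a hypothetical single-table schema) can be written as:

    # Illustrative instance of the mapping SQL = f(Q, S); the schema and
    # question here are hypothetical.
    Q = "Show the names of all students older than 20."
    S = {"student": ["id", "name", "age"]}            # schema: table -> columns
    SQL = "SELECT name FROM student WHERE age > 20;"  # the desired output f(Q, S)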

2.2 In-Context Learning (ICL)


In-Context Learning (ICL) allows LLMs to perform tasks by leveraging examples or instructions provided
within the input prompt, without fine-tuning the model. For Text-to-SQL, ICL is particularly valuable
because it enables the model to adapt to diverse database schemas and query patterns by including schema
details and example query pairs in the prompt. For instance, providing a database schema alongside sample
natural language questions and their corresponding SQL queries helps the model infer the correct structure
and semantics for new queries. This is especially helpful in Text-to-SQL, where queries often involve complex
joins, aggregations, or nested subqueries that vary across databases. ICL reduces the need for extensive
training on domain-specific datasets, making it a flexible solution for real-world applications where database
structures are dynamic (Wei et al., 2023). By grounding the model in context, ICL improves query accuracy and robustness, particularly for edge cases where schema nuances or ambiguous language could lead to
errors. The effectiveness of ICL in Text-to-SQL stems from its ability to provide contextual cues that
guide the model’s reasoning. For example, including table descriptions and column relationships in the
prompt helps disambiguate terms like “order” (which could refer to a table or a sorting operation). Research
highlights that ICL can achieve competitive performance compared to fine-tuned models, especially when
prompts are carefully designed to include diverse examples (Wei et al., 2023). This adaptability is crucial
for Text-to-SQL systems deployed in environments with frequently updated or unfamiliar databases, as it
minimizes the need for retraining while maintaining high accuracy.

2.2.1 Few-Shot Prompting


Few-Shot Prompting extends ICL by providing a small number of task-specific examples in the prompt to
guide the model’s output. In Text-to-SQL, few-shot prompting is highly effective because it allows the model
to learn from a handful of natural language-to-SQL query pairs, enabling it to generalize to new queries. For
instance, including three to five examples of questions with their corresponding SQL queries for a specific
database schema can significantly improve the model’s ability to handle similar queries. This is particularly
useful in Text-to-SQL, where the diversity of database schemas and query types makes exhaustive training
impractical. Few-shot prompting helps the model capture patterns, such as how to translate phrases like
“top 10 customers” into SQL’s LIMIT and ORDER BY clauses, without requiring extensive labeled data
(Gao et al., 2023).
The primary advantage of few-shot prompting in Text-to-SQL is its efficiency in low-resource settings. By
carefully selecting representative examples, developers can achieve robust performance with minimal data,
which is critical for specialized domains like healthcare or finance, where labeled Text-to-SQL datasets are
scarce (Gao et al., 2023). Moreover, few-shot prompting allows for rapid adaptation to new schemas or
query styles by updating the examples in the prompt, offering a practical solution for real-time applications.
Studies demonstrate that few-shot prompting can outperform zero-shot approaches by providing the model
with enough context to handle complex queries involving multiple tables or aggregations, making it a powerful
tool for Text-to-SQL systems.
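As an illustration, a few-shot prompt in this style might interleave the schema with question-SQL pairs; the schema, examples, and formatting below are hypothetical rather than the exact prompts used by any particular system:

    # A minimal few-shot Text-to-SQL prompt skeleton (hypothetical schema
    # and examples). The final line is left open for the model to complete.
    FEW_SHOT_PROMPT = """Schema:
    customer(id, name, total_spent)

    Q: List the top 10 customers by total spending.
    SQL: SELECT name FROM customer ORDER BY total_spent DESC LIMIT 10;

    Q: How many customers are there?
    SQL: SELECT COUNT(*) FROM customer;

    Q: {question}
    SQL:"""

    prompt = FEW_SHOT_PROMPT.format(question="Which customer spent the most?")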

2.2.2 Chain Of Thought Prompting


Chain-of-Thought (CoT) Prompting encourages LLMs to break down complex tasks into intermediate rea-
soning steps before producing the final output. In Text-to-SQL, CoT is particularly beneficial because it
guides the model to explicitly reason about the query’s intent, the database schema, and the required SQL
constructs. For example, when prompted with a question like “Which employees have the highest salaries
in each department?”, CoT might lead the model to first identify the need for grouping by department,
then selecting the maximum salary per group, and finally joining with the employee table to retrieve names
(Tai et al., 2023). By articulating these steps, CoT reduces errors in complex queries that involve nested
subqueries or conditional logic. This structured reasoning is critical for Text-to-SQL, where misinterpreting
a single term can lead to incorrect SQL syntax or semantics.
CoT’s strength in Text-to-SQL lies in its ability to handle ambiguity and complexity. Natural language
queries often contain vague or context-dependent terms, such as “recent orders” or “best performers,” which
require careful interpretation. CoT prompting helps the model resolve these ambiguities by explicitly map-
ping query components to schema elements and SQL operations, improving both accuracy and explainability
(Tai et al., 2023). Research shows that CoT can enhance performance on benchmark datasets like Spider,
where queries span multiple tables and require intricate logic (Gao et al., 2023). By combining CoT with
few-shot prompting, developers can further boost performance by providing examples that include reasoning
steps, making CoT a versatile approach for advancing Text-to-SQL systems.
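To make the example above concrete, the intermediate reasoning and the resulting query might decompose as follows (the employee schema is hypothetical):

    # CoT-style decomposition for: "Which employees have the highest
    # salaries in each department?" (hypothetical schema:
    # employee(name, salary, dept_id)).
    # Step 1: group employee rows by dept_id and compute MAX(salary) per group.
    # Step 2: join the per-department maxima back to employee to recover names.
    COT_SQL = """
    SELECT e.dept_id, e.name, e.salary
    FROM employee AS e
    JOIN (SELECT dept_id, MAX(salary) AS max_salary
          FROM employee
          GROUP BY dept_id) AS m
      ON e.dept_id = m.dept_id AND e.salary = m.max_salary;
    """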

2.3 Multi-Stage LLM Pipelines


Multi-stage LLM pipelines break down the Text-to-SQL process into sequential or iterative steps, each
handled by specialized modules or prompts. These pipelines typically include stages for schema linking
(mapping query terms to database elements), query classification (determining complexity), SQL generation,
and self-correction or validation. By modularizing the task, these frameworks enhance LLMs' ability to handle complex queries involving multiple tables, nested subqueries, or ambiguous language. The pipelines
often incorporate In-Context Learning (ICL), Few-Shot Prompting, and Chain-of-Thought (CoT) reasoning
to guide LLMs, reducing errors and improving generalization across diverse database schemas.

2.3.1 DIN-SQL
DIN-SQL (Decomposed In-Context Learning of Text-to-SQL) is a prominent multi-stage pipeline that
decomposes the Text-to-SQL task into four key stages: schema linking, query classification, SQL genera-
tion, and self-correction. The schema-linking stage identifies references to database elements and condition
values in the query, using embeddings to map natural language terms to schema components (Pourreza and
Rafiei, 2023). The classification stage categorizes queries into easy, non-nested complex, or nested complex,
enabling tailored prompts for each category. For complex queries, DIN-SQL introduces an intermediate
representation called NatSQL, which simplifies SQL syntax by removing operators without clear natural
language counterparts, easing the transition from query to code (Pourreza and Rafiei, 2023).

2.3.2 ACT-SQL
ACT-SQL (Automatically-generated Chain-of-Thought SQL) is a multi-stage pipeline that enhances Text-to-SQL by
integrating dynamic few-shot prompting with CoT reasoning. In the first stage, ACT-SQL employs a schema-
linking module to identify relevant tables and columns in the user’s natural language query, using similarity-
based scoring to select contextually appropriate few-shot examples. These examples are dynamically chosen
based on their similarity to the input query and SQL structure, ensuring relevance (Zhang et al., 2023).
The second stage involves query classification, categorizing the query as easy, non-nested complex, or nested
complex, which informs the complexity of the subsequent SQL generation. The final stage generates the
SQL query using a CoT-guided prompt that includes intermediate reasoning steps, followed by a validation
step to ensure syntactic and semantic correctness (Zhang et al., 2023).

2.3.3 PET-SQL
PET-SQL (Prompt-Enhanced Two-round refinement of Text-to-SQL) is a multi-stage pipeline that leverages prompt enhancement and cross-consistency to improve Text-to-SQL performance, particularly for complex and cross-domain queries. The first stage
involves schema linking, where PET-SQL uses a combination of neural embeddings and pattern-based rules
to map query terms to database elements, identifying tables, columns, and values with high precision (Li et
al., 2024). The second stage employs a query decomposition module that breaks down the natural language
query into semantic components (e.g., entities, conditions, aggregations), guided by learned patterns from a
curated training set. The final stage generates the SQL query using a pattern-enhanced LLM prompt that
incorporates CoT reasoning and few-shot examples tailored to the query’s semantic structure, followed by a
self-consistency check to refine the output (Li et al., 2024).

2.4 Challenges with Schema Linking


Schema linking, the process of mapping natural language tokens to database elements, remains one of the most critical yet error-prone stages in text-to-SQL systems. Challenges stem from lexical gaps (e.g., "doctor" vs. a schema's "physician_id") and polysemy (e.g., "date" ambiguously referring to diagnosis date or admission date). These issues are exacerbated in multi-hop relationships, where indirect links through junction tables require nuanced reasoning. Recent studies highlight bidirectional approaches to improve recall.
RSL-SQL (Cao et al., 2024) combines forward and backward schema pruning, achieving 94% recall while
reducing input columns by 83%. This method mitigates the risk of omitting critical elements by validat-
ing schema links against both initial SQL drafts and simplified schemas. Meanwhile, RAT-SQL (Wang
et al., 2020) employs relation-aware self-attention to encode predefined schema relationships (e.g., foreign
keys, column-table ownership), improving F1 scores by 9% on Spider. The framework’s ability to jointly
learn schema-question alignments reduces errors in complex JOIN operations by 18% (Wang et al., 2020).
Emergent techniques like CRUSH4SQL (Kothyari et al., 2023) leverage LLM-hallucinated schemas to guide
retrieval. By prompting LLMs to generate minimal viable schemas for a query, this method achieves Recall@30 of 0.86 on industrial-scale databases, outperforming traditional dense retrieval by 22% (Kothyari et al., 2023). However, Lei et al. (2020) caution that even state-of-the-art models like SLSQL exhibit 23% error
rates when schema linking fails, underscoring its pivotal role in overall system accuracy.

2.5 Execution Accuracy and Hallucinations


Execution accuracy, measured by whether generated SQL returns the correct result set, is the gold standard for text-to-SQL evaluation. However, hallucinations persist across all stages:
• Schema Hallucinations (23% frequency) (Qu et al., 2024): LLMs often reference nonexistent columns (e.g., patient_age instead of date_of_birth) or tables. TA-SQL (Qu et al., 2024) categorizes these into:

– Schema Contradictions: Incorrect SQL violating schema structure (30% of errors). Example: referencing sets.language instead of set_translations.language (Qu et al., 2024).
– Attribute Overanalysis: Introducing unnecessary columns (49% of errors). Example: adding height to SELECT when only player_name is needed (Qu et al., 2024).
• Logic Hallucinations (25% frequency) (Qu et al., 2024): These include JOIN redundancies (15%) and clause misuse (25%), such as an erroneous GROUP BY in aggregation-free queries. DIN-SQL (Pourreza and Rafiei, 2023; as analyzed by Qu et al., 2024) reduces such errors by 37% through execution-grounded self-correction, validating intermediate queries against database previews.
Evaluation inconsistencies further complicate accuracy metrics. Spider's evaluation flags 12% of semantically correct queries as incorrect due to alias mismatches (e.g., using table names instead of T1, T2) (lamini.ai, n.d.). Worse, 5% of Spider's "gold" SQL contains errors, such as omitted JOIN conditions, which skew benchmarking (lamini.ai, n.d.). RSL-SQL (Cao et al., 2024) addresses this by implementing multi-turn self-correction, iteratively refining SQL using execution feedback to achieve 87.9% accuracy on Spider with GPT-4o.

2.6 Limitations in Existing Methods


Despite advancements, critical gaps persist:
• Pipeline Complexity: Multi-stage frameworks like DIN-SQL (Pourreza and Rafiei, 2023) incur 5×
latency (18s/query) due to sequential schema linking, classification, drafting, and correction phases.
SQL-to-Schema (Yang et al., 2024) reveals that 40% of industrial deployments abandon such pipelines
for simpler few-shot models despite lower accuracy.
• Example Selection Bias: Similarity-based retrieval (e.g., SBERT embeddings) misses 31% of function-
ally diverse cases, as shown in TA-SQL (Qu et al., 2024). This bias exacerbates schema linking failures
in cross-domain scenarios.
• Schema Adaptation: Models trained on Spider fail catastrophically on clinical databases with 500+
tables, where CRUSH4SQL (Kothyari et al., 2023) achieves only 52% recall versus 86% on standard
benchmarks.
• Dataset Noise: Beyond Spider’s alias inconsistencies, the BIRD benchmark contains 52/106 faulty SQL
entries, including incorrect aggregations and missing WHERE clauses (lamini.ai, n.d.).
• Cost-Efficiency Trade-offs: Cao et al. (2024) demonstrate that GPT-4-based solutions achieve SOTA
accuracy but remain prohibitively expensive, while cheaper models like DeepSeek suffer 15–20% accu-
racy drops.

These limitations underscore the need for lightweight, execution-grounded approaches that balance modularity with computational efficiency, a gap our proposed methodology aims to address.

3 Methodology
3.1 Introduction
The goal of this research is to develop a robust, lightweight text-to-SQL system that enables non-technical
users to query relational databases using natural language, without the need for model fine-tuning or ex-
tensive domain-specific data. This chapter details the methodology adopted to address the core challenges
identified in the literature (namely schema linking, SQL generation accuracy, and hallucination mitigation) by leveraging recent advances in large language models (LLMs) and modular pipeline design. Building on the limitations of prior single-stage and monolithic approaches, our method employs a two-stage LLM pipeline
inspired by modular frameworks such as DIN-SQL (Pourreza and Rafiei, 2023) and execution-grounded
prompting strategies. The first stage focuses on schema linking and preview query generation, while the
second stage utilizes the extracted schema context and representative data to guide accurate SQL synthesis.
The entire process is designed to operate in a few-shot in-context learning (ICL) setting, minimizing resource
requirements and maximizing adaptability to new domains. Evaluation is conducted on the widely adopted
Spider benchmark (Yu et al., 2018) which is recognized for its cross-domain complexity and has become the
standard for assessing text-to-SQL systems. Key performance metrics include execution accuracy and exact-
set-match, in line with established evaluation protocols. By combining modular decomposition, execution
validation, and carefully engineered prompting, this methodology aims to bridge the gap between natural
language and structured data access in a scalable and practical manner.

3.2 Architecture

Figure 2: Architecture of the proposed two-stage Text-to-SQL pipeline

The Text-to-SQL system consists of two stages: a Schema Linking LLM and a Code Generation LLM. This architecture divides the core tasks of Text-to-SQL reasoning, improving both grounding and interpretability. Each model is provided the required context and few-shot examples to ensure consistency in answers. Additionally, the Code Generation model is given a preview of the relevant schema, allowing it to ground row-level selections, for instance when generating a WHERE or HAVING clause.

3.2.1 Few-Shot Prompting


The database schema is provided to the LLMs in an "API docs" format, i.e., expressed as Python documentation comments. This reduces prompt size while providing clarity to the LLM. Both models in the pipeline are given 3 relevant examples as part of the system prompt. The example prompts are picked from a set of 7,000 examples, based on cosine similarity of their embeddings. Embeddings are generated at runtime.

Figure 3: Preview of examples given to Schema Linking model.

Figure 4: Preview of examples given to Code Generation model.
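The exact renderings used in our prompts are shown in Figures 3 and 4; as a rough sketch, an API-docs-style schema for an invented two-table database might be serialized like this:

    # Hypothetical "API docs" rendering of a schema as Python documentation
    # comments; table and column names are invented for illustration.
    SCHEMA_DOC = '''
    """Database: concert_db

    Table singer:
        singer_id (int): primary key.
        name (text): the singer's name.
        age (int): the singer's age.

    Table concert:
        concert_id (int): primary key.
        singer_id (int): foreign key -> singer.singer_id.
        year (int): year the concert was held.
    """
    '''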

Prompt for Schema Linking Model: The model is provided the user question, the database schema, and examples, and generates SQL SELECT statements selecting the first 5 rows of the relevant tables and columns. Each example contains a user question, a database schema, and the target SELECT statement, informing the LLM of the expected structure of the generation. If multiple tables are relevant, a separate SQL statement is generated to SELECT from each, without performing JOINs.

Prompt for Code Generation Model: The model is provided the user question, the database schema, a preview of the relevant schema, and examples, and generates the final SQL statement answering the user question. Each example contains a question, a database schema, a preview of the relevant schema, and the target SQL statement, informing the LLM of the expected structure of the generation.
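A minimal sketch of the runtime example selection described above, using the sentence-transformers model listed in Section 3.3.2 (the structure of the example pool is an assumption):

    # Select the 3 most similar few-shot examples by cosine similarity of
    # question embeddings, computed at runtime.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def pick_examples(question, pool, k=3):
        """pool: list of dicts with 'question' and 'sql' keys (assumed format)."""
        q_emb = encoder.encode(question, convert_to_tensor=True)
        pool_embs = encoder.encode([ex["question"] for ex in pool],
                                   convert_to_tensor=True)
        scores = util.cos_sim(q_emb, pool_embs)[0]         # shape: (len(pool),)
        top = scores.topk(k=min(k, len(pool))).indices.tolist()
        return [pool[i] for i in top]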

3.2.2 Inference Process


The step-by-step Text-to-SQL process is as follows (a minimal code sketch follows the list):
1. User asks question based on the current database.
2. User question and schema of current database are taken.
3. Embeddings are generated to find closest matching examples.

4. User question and database schema are passed to the Schema Linking LLM.
5. Schema Linking LLM provides SELECT statement, selecting 5 rows of the relevant schema.
6. SELECT statement is run on database to generate preview.

7. Embeddings are generated again to find closest matching examples.

8. User question, database schema, and the database preview of the relevant schema are passed to the Code Generation LLM.
9. Code Generation LLM generates final SQL query.
10. Query is run on database and result is collected.
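The sketch below ties these steps together. The chain objects and helper names (link_chain, gen_chain, pick_examples) are illustrative rather than our exact implementation; since Spider ships its databases as SQLite files, sqlite3 suffices for execution:

    import sqlite3

    def text_to_sql(question, schema_doc, db_path, link_chain, gen_chain, pool):
        """Two-stage inference: schema linking with previews, then generation."""
        examples = pick_examples(question, pool)               # steps 2-3
        preview_sql = link_chain.invoke({                      # steps 4-5
            "question": question, "schema": schema_doc, "examples": examples})
        conn = sqlite3.connect(db_path)
        previews = []
        for stmt in preview_sql.split(";"):                    # step 6: run previews
            if stmt.strip():
                cur = conn.execute(stmt)
                previews.append((stmt.strip(), cur.fetchmany(5)))
        examples = pick_examples(question, pool)               # step 7
        final_sql = gen_chain.invoke({                         # steps 8-9
            "question": question, "schema": schema_doc,
            "preview": previews, "examples": examples})
        result = conn.execute(final_sql).fetchall()            # step 10
        conn.close()
        return final_sql, result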

3.3 Setup
The Langchain (Chase, n.d.) library has been used for building and running inference on LLM chains. Langchain is an open-source library for working with LLMs; its connectors to model providers were used for prototyping and testing with multiple models during hypothesis checking.
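As an illustration, one of the two chains might be assembled with Langchain's expression language roughly as follows; the prompt wording is abbreviated, and the actual prompts follow Section 3.2.1:

    # A minimal sketch of building one chain with Langchain (LCEL style).
    from langchain_openai import ChatOpenAI
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You translate questions to SQL.\nSchema:\n{schema}\n"
                   "Examples:\n{examples}"),
        ("human", "{question}"),
    ])
    gen_chain = prompt | llm | StrOutputParser()  # .invoke({...}) returns a string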

3.3.1 Dataset Used


The Spider (Yu et al., 2018) dataset has been used as the source of few-shot examples and for evaluation. Spider is a commonly used
benchmark to evaluate text-to-SQL parsing in a cross-database setting, which requires models to generalize
to novel database schemas. The dataset consists of 7,000 question-query pairs in the training set and 1,034
pairs in the development set, covering 200 different databases and 138 domains.

3.3.2 Models Used


Embeddings: Embeddings generation is crucial for finding semantic similarity during example selection.
For this, we have utilized sentence-transformers/all-MiniLM-L6-v2.

LLMs: OpenAI's gpt-3.5-turbo (OpenAI, 2023) and gpt-4o-mini (OpenAI, 2024) have been used for LLM generation, owing to their low cost when working with large numbers of tokens.

3.4 Testing
Spider-dev, containing 1,034 examples, has been used for evaluation; Spider-test is not open to the public.

3.4.1 Method
OpenAI’s gpt-3.5-turbo and gpt-4o-mini have been used for evaluation purpose. They have been run in two
separate ways:
• With chain: The model is called twice, using the prompting method described above.
• Without chain: The model is called once in zero-shot manner.
The latter is used as a baseline, allowing us to understand the benefits of the suggested approach.

3.4.2 Metrics
Spider's repository provides an evaluation script, along with the databases. The evaluation script takes in two files: pred.sql and gold.sql. It then reports the following metrics:
• EX: Execution Accuracy measures whether the predicted SQL query returns the same result as the
gold query when executed.
• EM: Exact Match checks if the predicted SQL query matches the gold query exactly in structure and
content.
• PM: Partial Match evaluates overlap in SQL components (SELECT, WHERE, JOIN, etc.) between predicted
and gold queries.
Additionally, it breaks down each metric based on difficulty level of question: Easy, Medium, Hard and
Extra Hard. We will be considering all of the provided metrics to judge the efficacy of our methods.
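For reference, a typical invocation of the evaluation script, wrapped in Python, might look like the following; the flag names follow the public Spider repository's evaluation.py and may differ across versions:

    # Hedged sketch: run Spider's evaluation script over predicted vs. gold SQL.
    import subprocess

    subprocess.run([
        "python", "evaluation.py",
        "--gold", "gold.sql",        # one gold query (and db id) per line
        "--pred", "pred.sql",        # one predicted query per line
        "--db", "database/",         # directory of SQLite databases
        "--table", "tables.json",    # schema definitions
        "--etype", "all",            # report exact match and execution accuracy
    ], check=True)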

4 Results and Discussion
After running the pipelines on Spider-dev, we obtained the results presented in Tables 3 through 9.

Table 3: Baseline for GPT-4o-mini


Metric Easy Medium Hard Extra All Joint All
Count 250 440 174 170 1034 1
Execution Accuracy 0.928 0.830 0.695 0.571 0.788 0.000
Exact Match Accuracy 0.336 0.327 0.184 0.053 0.260 0.000
Partial Matching Accuracy Details:
Select 0.913 0.944 0.914 0.757 0.914 0.000
Select (no AGG) 0.913 0.944 0.926 0.811 0.921 0.000
Where 0.919 0.667 0.457 0.306 0.623 0.000
Where (no OP) 0.968 0.676 0.586 0.389 0.681 0.000
Group (no Having) 0.941 0.667 0.875 1.000 0.897 0.000
Group 0.059 0.000 0.875 1.000 0.310 0.000
Order 0.909 0.889 0.643 1.000 0.860 0.000
And/Or 1.000 0.929 0.929 0.933 0.947 0.000
IUEN 0.000 0.000 0.000 0.000 0.000 0.000
Keywords 0.960 0.824 0.608 0.486 0.781 0.000

Table 4: Baseline for GPT-3.5-Turbo


Metric Easy Medium Hard Extra All Joint All
Count 250 440 174 170 1034 1
Execution Accuracy 0.888 0.852 0.690 0.576 0.788 0.000
Exact Match Accuracy 0.468 0.316 0.264 0.041 0.299 0.000
Partial Matching Accuracy Details:
Select 0.895 0.915 0.954 0.833 0.905 0.000
Select (no AGG) 0.895 0.923 0.972 0.854 0.914 0.000
Where 0.924 0.426 0.522 0.169 0.498 0.000
Where (no OP) 0.946 0.431 0.587 0.265 0.535 0.000
Group (no Having) 0.857 0.667 0.786 0.316 0.623 0.000
Group 0.286 0.000 0.714 0.263 0.358 0.000
Order 0.909 0.967 0.684 0.316 0.756 0.000
And/Or 1.000 0.959 0.970 0.913 0.964 0.000
IUEN 0.000 0.000 0.000 0.000 0.000 0.000
Keywords 0.953 0.658 0.676 0.417 0.688 0.000

4.1 Execution Accuracy (EX)


Our method with GPT-3.5-turbo achieved an overall execution accuracy of 0.682, a 13.5% drop from the
baseline (0.788). With GPT-4o-mini, our method scored 0.775, a 1.6% decrease from the baseline (0.788).
Both configurations underperformed compared to state-of-the-art (SOTA) methods like DIN-SQL (Pourreza
and Rafiei, 2023) and RSL-SQL (Cao et al., 2024), which report 0.879 accuracy using GPT-4/GPT-4o.
Notably, our method showed competitive performance on medium-difficulty queries with GPT-4o-mini (0.852
vs. 0.830 baseline), surpassing the baseline by 2.7%. However, it struggled significantly with hard (0.506 vs.
0.690 for GPT-3.5-turbo, a 26.7% drop) and extra-hard queries (0.424 vs. 0.576 for GPT-3.5-turbo, a 26.4%
drop), indicating limitations in handling complex query structures.

Table 5: GPT-4o-mini with Proposed Approach
Metric Easy Medium Hard Extra All Joint All
Count 250 440 174 170 1034 1
Execution Accuracy 0.896 0.852 0.684 0.488 0.775 0.000
Exact Match Accuracy 0.824 0.541 0.374 0.106 0.510 0.000
Partial Matching Accuracy Details:
Select 0.915 0.920 0.939 0.796 0.905 0.000
Select (no AGG) 0.932 0.926 0.954 0.823 0.918 0.000
Where 0.920 0.589 0.456 0.189 0.569 0.000
Where (no OP) 0.950 0.606 0.578 0.392 0.642 0.000
Group (no Having) 0.857 0.882 0.826 0.605 0.794 0.000
Group 0.857 0.676 0.783 0.372 0.632 0.000
Order 0.783 0.881 1.000 0.690 0.837 0.000
And/Or 1.000 0.948 0.952 0.942 0.960 0.000
IUEN 0.000 0.000 0.615 0.000 0.571 0.000
Keywords 0.917 0.839 0.646 0.625 0.784 0.000

Table 6: GPT-3.5-turbo with Proposed Approach


Metric Easy Medium Hard Extra All
Count 250 440 174 170 1034
Execution Accuracy 0.804 0.782 0.506 0.424 0.682
Exact Match Accuracy 0.744 0.443 0.276 0.059 0.425
Partial Matching Accuracy Details:
Select 0.943 0.862 0.944 0.814 0.891
Select (no AGG) 0.956 0.868 0.960 0.832 0.902
Where 0.802 0.475 0.398 0.149 0.474
Where (no OP) 0.810 0.486 0.505 0.287 0.528
Group (no Having) 0.941 0.850 0.765 0.565 0.795
Group 0.706 0.733 0.706 0.522 0.684
Order 0.783 1.000 0.810 0.741 0.866
And/Or 1.000 0.958 0.956 0.929 0.964
IUEN 0.000 0.000 0.400 0.000 0.400
Keywords 0.833 0.823 0.592 0.425 0.716

Table 7: Execution Accuracy (EX) Summary


Method Easy Medium Hard Extra Overall
Baseline GPT-3.5-turbo 0.888 0.852 0.690 0.576 0.788
Our Method (GPT-3.5-turbo) 0.804 0.782 0.506 0.424 0.682
Baseline GPT-4o-mini 0.928 0.830 0.695 0.571 0.788
Our Method (GPT-4o-mini) 0.896 0.852 0.684 0.488 0.775
DIN-SQL (GPT-4) (Pourreza and Rafiei, 2023) - - - - 0.879
RSL-SQL (GPT-4o) (Cao et al., 2024) - - - - 0.879

4.2 Exact Match Accuracy (EM)


Exact match accuracy assesses whether generated SQL queries structurally match the gold standard. Table
8 presents the results. Our method with GPT-3.5-turbo achieved an overall exact match accuracy of 0.425,
which is a notable improvement over its baseline of 0.299. For GPT-4o-mini, our method scored an overall

Table 8: Exact Match Accuracy (EM) Summary
Method Easy Medium Hard Extra Overall
Baseline GPT-3.5-turbo 0.468 0.316 0.264 0.041 0.299
Our Method (GPT-3.5-turbo) 0.744 0.443 0.276 0.059 0.425
Baseline GPT-4o-mini 0.336 0.327 0.184 0.053 0.260
Our Method (GPT-4o-mini) 0.824 0.541 0.374 0.106 0.510
DIN-SQL (GPT-4) (Pourreza and Rafiei, 2023) - - - - N/A
RSL-SQL (GPT-4o) (Cao et al., 2024) - - - - N/A

exact match accuracy of 0.510, substantially outperforming its baseline of 0.260. This indicates that the
proposed pipeline significantly enhances the structural correctness of the generated SQL queries compared
to the zero-shot baseline for both models. The improvement is particularly stark for GPT-4o-mini, where
exact match accuracy nearly doubled. While these exact match scores are lower than the execution accuracy
scores, the improvement over baseline is a positive indicator for the structural quality of SQL generated by
our method.

4.3 Partial Match Accuracy


Partial match accuracy evaluates whether generated SQL queries partially align with the gold standard.
Table 9 presents the summary results. Our method with GPT-3.5-turbo achieved an overall partial match

Table 9: Partial Match Accuracy Summary


Method Easy Medium Hard Extra Overall
Baseline GPT-3.5-turbo 0.920 0.876 0.732 0.610 0.820
Our Method (GPT-3.5-turbo) 0.848 0.810 0.550 0.460 0.718
Baseline GPT-4o-mini 0.944 0.860 0.738 0.602 0.820
Our Method (GPT-4o-mini) 0.912 0.876 0.710 0.520 0.802
DIN-SQL (GPT-4) (Pourreza and Rafiei, 2023) - - - - 0.892
RSL-SQL (GPT-4o) (Cao et al., 2024) - - - - 0.892

accuracy of 0.718, a 12.4% drop from the baseline (0.820). With GPT-4o-mini, our method scored 0.802, a
2.2% decrease from the baseline (0.820). These results indicate that our method performs better on partial
matches than exact matches, as minor syntactic variations are tolerated. For GPT-3.5-turbo, the largest
performance gaps occur on hard (0.550 vs. 0.732, a 24.9% drop) and extra-hard queries (0.460 vs. 0.610, a
24.6% drop), consistent with trends in execution accuracy. GPT-4o-mini narrows this gap, particularly on medium queries (0.876 vs. 0.860, a 1.9% improvement) and hard queries (0.710 vs. 0.738, a 3.8% drop).

5 Future Scope
The promising results of this two-stage LLM pipeline for text-to-SQL translation open several exciting
avenues for future exploration, particularly in enhancing its capability to handle complex queries and broad-
ening its applicability across diverse domains. One key area of focus is the integration of advanced reasoning
models, which were not tested in this study. Models specifically designed for logical reasoning, such as those
optimized for step-by-step problem-solving, could significantly improve the pipeline’s performance on hard
and extra-hard queries, where nested subqueries and intricate joins pose challenges. Additionally, incor-
porating Chain-of-Thought (CoT) prompting, which was not implemented in the current pipeline, offers a
powerful opportunity to enhance query accuracy. By guiding the LLM to explicitly articulate intermediate
reasoning steps, such as mapping query intent to schema elements and selecting appropriate SQL constructs,
CoT could reduce errors like incorrect column selections and missing conditions, bridging the performance
gap with state-of-the-art methods like DIN-SQL and RSL-SQL. Further refinements to the pipeline’s schema

13
linking mechanism present another critical direction for future work. The current approach relies on few-
shot prompting with cosine similarity-based example selection, which occasionally misses functionally diverse
cases, leading to suboptimal schema recall. Exploring bidirectional pruning techniques, as demonstrated in
RSL-SQL, or relation-aware attention mechanisms, as seen in RAT-SQL, could enhance the pipeline’s ability
to handle complex, multi-hop relationships in large-scale databases. Additionally, dynamic example selection
strategies that prioritize diversity in query patterns and schema structures could mitigate biases and im-
prove generalization across domains.

Beyond technical enhancements, future research could explore hybrid
approaches that combine the pipeline’s lightweight design with minimal fine-tuning to optimize performance
without sacrificing efficiency. For instance, fine-tuning the embedding model or selectively fine-tuning the
LLM on domain-specific subsets could improve schema linking precision while maintaining the benefits of
few-shot prompting. Another promising direction is the adaptation of the pipeline for cross-lingual text-to-
SQL tasks, enabling non-English queries to be processed effectively, thus broadening its global applicability.
By integrating reasoning models, CoT prompting, and advanced schema linking, and by testing on diverse
datasets, this pipeline can evolve into a highly versatile and efficient solution, further advancing data de-
mocratization and empowering users to seamlessly interact with structured data across industries.

6 Conclusion
This research has successfully pioneered a lightweight, two-stage LLM pipeline for text-to-SQL transla-
tion, harnessing the power of few-shot prompting to deliver robust performance without the heavy resource
demands of model fine-tuning. By drawing inspiration from advanced frameworks like DIN-SQL and in-
tegrating innovative execution-grounded prompting, this approach has made significant strides in tackling
critical challenges such as schema linking, query accuracy, and hallucination reduction. Evaluated on the
esteemed Spider dataset, spanning 200 databases across 138 domains, the pipeline leveraged GPT-3.5-turbo
and GPT-4o-mini to showcase its promising capabilities, advancing the mission of data democratization.

The results are encouraging, with the pipeline achieving an overall execution accuracy of 0.682 with GPT-
3.5-turbo and an impressive 0.775 with GPT-4o-mini, compared to baseline execution accuracies of 0.788 for
both models. Notably, the pipeline excelled on medium-difficulty queries, particularly with GPT-4o-mini,
where it surpassed the baseline by 2.7% (0.852 vs. 0.830) in execution accuracy, demonstrating its strength
in handling moderately complex tasks. While performance on hard and extra-hard queries showed room
for growth, the pipeline’s ability to generate functionally correct queries was evident in its partial match
accuracy scores (0.718 for GPT-3.5-turbo, 0.802 for GPT-4o-mini). Compared to state-of-the-art methods
like DIN-SQL and RSL-SQL, which achieve 0.879 execution accuracy with GPT-4/GPT-4o, our pipeline
holds its own as a resource-efficient alternative. Its lightweight design, bypassing the need for extensive
computational resources or domain-specific data, makes it well suited for real-world applications
where adaptability and speed are paramount. The innovative use of preview query generation to anchor
schema linking proved highly effective in curbing hallucinations, paving the way for more reliable query syn-
thesis. The pipeline’s modular architecture and few-shot prompting strategy empower non-technical users
to query relational databases and unlock data-driven insights with ease.

Looking ahead, opportunities abound to refine schema linking
through techniques like bidirectional pruning or relation-aware attention, inspired by RSL-SQL and RAT-
SQL, to boost performance on complex queries. Dynamic example selection with greater diversity could
further elevate prompt effectiveness, while hybrid approaches blending lightweight prompting with targeted
fine-tuning may unlock even higher accuracy without sacrificing efficiency.

This research marks a step
forward in making structured data accessible to all, proving that a lean, few-shot LLM pipeline can deliver
powerful text-to-SQL performance. By empowering users to seamlessly interact with databases, this work
lights the path toward a future where data is truly democratized, fostering innovation and decision-making
across industries.

References

Mitsopoulou, A., et al. (2025, February 11). Analysis of Text-to-SQL Benchmarks: Limitations, Challenges. Retrieved from https://openproceedings.org/2025/conf/edbt/paper-41.pdf

Wang, B., et al. (2020). RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7567–7578). Online: Association for Computational Linguistics.

Bourne, J. (2019, February 12). Relational databases 'here to stay' in the enterprise amid greater cloud and data complexity. CloudTech. Retrieved from https://www.cloudcomputing-news.net/news/relational-databases-here-stay-enterprise-amid-greater-cloud-and-data-complexity/

Cao, Z., et al. (2024, November 26). RSL-SQL: Robust Schema Linking in Text-to-SQL Generation. arXiv:2411.00073. Retrieved from https://arxiv.org/abs/2411.00073

Chase, H. (n.d.). Langchain. Retrieved from https://github.com/langchain-ai/langchain

Deng, N., et al. (2022). Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect. In Proceedings of the 29th International Conference on Computational Linguistics. Gyeongju, Republic of Korea: Association for Computational Linguistics.

Gao, D., et al. (2023, August 29). Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. arXiv:2308.15363. Retrieved from https://arxiv.org/abs/2308.15363

Mohammadjafari, A., et al. (2025, February 4). From Natural Language to SQL: Review of LLM-based Text-to-SQL Systems. arXiv:2410.01066v2. Retrieved from https://arxiv.org/pdf/2410.01066v2

Hong, Z., et al. (2025, March 13). Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL. arXiv:2406.08426v1. Retrieved from https://arxiv.org/pdf/2406.08426v1

Bartley, K. (n.d.). Big data statistics: How much data is there in the world? Rivery. Retrieved from https://rivery.io/blog/big-data-statistics-how-much-data-is-there-in-the-world/

Kothyari, M., et al. (2023). CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 14054–14066). Singapore: Association for Computational Linguistics.

lamini.ai. (n.d.). How a Fortune 500 slashed hallucinations to create 94.7% accurate LLM agents for SQL. Retrieved from https://www.lamini.ai/blog/llm-text-to-sql

Lei, W., et al. (2020). Re-examining the Role of Schema Linking in Text-to-SQL. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 6943–6954). Online: Association for Computational Linguistics.

Li, H., et al. (2023, April 10). RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL. arXiv:2302.05965. Retrieved from https://arxiv.org/pdf/2302.05965

Li, Z., et al. (2024, June 2). PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with Cross-consistency. arXiv:2403.09732. Retrieved from https://arxiv.org/pdf/2403.09732

OpenAI. (2023, March 1). Introducing ChatGPT and Whisper APIs. Retrieved from https://openai.com/blog/introducing-chatgpt-and-whisper-apis

OpenAI. (2024, July 18). OpenAI unveils cheaper small AI model GPT-4o mini. Reuters. Retrieved from https://www.reuters.com/technology/artificial-intelligence/openai-unveils-cheaper-small-ai-model-gpt-4o-mini-2024-07-18/

Pourreza, M., & Rafiei, D. (2023, April 21). DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. arXiv:2304.11015. Retrieved from https://arxiv.org/pdf/2304.11015

Qu, G., et al. (2024). Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation. In Findings of the Association for Computational Linguistics: ACL 2024 (pp. 5456–5471). Bangkok: Association for Computational Linguistics.

Renggli, C., et al. (2025, January 30). Fundamental Challenges in Evaluating Text2SQL Solutions and Detecting Their Limitations. arXiv:2501.18197v1. Retrieved from https://arxiv.org/pdf/2501.18197v1

Sandwell, D. (2021, June 3). What is data democratization and why is it important? Quest Blog. Retrieved from https://blog.quest.com/why-data-democratization-why-now-what-does-it-look-like/

Tai, C.-Y., et al. (2023). Exploring Chain of Thought Style Prompting for Text-to-SQL. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics.

Liu, X., & Tan, Z. (2023). Divide and Prompt: Chain of Thought Prompting for Text-to-SQL. arXiv preprint.

Toteja, R., et al. (2025). In-Context Reinforcement Learning with Retrieval-Augmented Generation for Text-to-SQL. In Proceedings of the 31st International Conference on Computational Linguistics. Abu Dhabi, UAE: Association for Computational Linguistics.

Wei, J., et al. (2023, January 10). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903. Retrieved from https://arxiv.org/pdf/2201.11903

Gretel.ai. (n.d.). What is Text-to-SQL? Retrieved from https://gretel.ai/technical-glossary/what-is-text-to-sql

Yang, S., et al. (2024). SQL-to-Schema Enhances Schema Linking in Text-to-SQL. In International Conference on Database and Expert Systems Applications (pp. 139–145). Cham: Springer.

Yu, T., et al. (2018). Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 3911–3921). Brussels, Belgium: Association for Computational Linguistics.

Zhang, H., et al. (2023, October 26). ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought. arXiv:2310.17342. Retrieved from https://arxiv.org/pdf/2310.17342

Zhang, Z., et al. (2022, October 7). Automatic Chain of Thought Prompting in Large Language Models. arXiv:2210.03493. Retrieved from https://arxiv.org/pdf/2210.03493

Zhu, X., & Li, Q. (2024, October 8). Large Language Model Enhanced Text-to-SQL. arXiv:2410.06011v1. Retrieved from https://arxiv.org/pdf/2410.06011v1
