
Query Refinement and Chain of Thought for Improvement in SQL Generation using LLM


N. V. Ravi Ashray (21BCE7919)
May, 2025

Abstract
In an increasingly digitized world, vast volumes of structured data are being stored in relational
databases. Enabling non-technical stakeholders to query such databases via natural language interfaces
has become a critical need. Large Language Models (LLMs) have shown promise in bridging this gap
through Text-to-SQL translation. While fine-tuned LLMs achieve strong performance, they demand
substantial computational resources and domain-specific data. In contrast, in-context learning offers a
lighter-weight alternative but often suffers from schema hallucinations, incorrect joins, and ungrounded
WHERE clauses, limiting execution accuracy. To address these challenges, we propose a two-stage LLM
pipeline using few-shot prompting without any model fine-tuning. Few-shot prompting yields consistent results without requiring substantial engineering effort. The first LLM performs schema linking by selecting relevant tables and columns and generating SQL preview queries for the relevant schema. These previews are then executed to extract a small set of representative rows, which are provided to the second LLM responsible for final SQL query generation. This design improves grounding and reduces semantic errors. We evaluate our method on the Spider dataset, achieving 68% and 77% execution accuracy on Spider's dev set using GPT-3.5-turbo and GPT-4o-mini respectively.
lightweight decomposition and execution-grounded prompting can yield robust performance, rivalling
more complex in-context methods like DIN-SQL and CoT-based pipelines.

1 Introduction
1.1 Text-to-SQL
Text-to-SQL is a Natural Language Processing (NLP) task that involves converting natural language queries into executable SQL queries with the help of the database schema (Hong et al., 2025). Figure 1 illustrates this: a user asks a question requesting, say, the names of all students. The text-to-SQL system uses the database schema to generate a SQL query answering the question; the query is then run on the database, and the result is returned to the user.

1.2 Need for Text-To-SQL


We live in an era of unprecedented data generation and storage. As of 2024, the global data volume stands at 149 zettabytes and is projected to rise further to 181 zettabytes by the end of 2025 (Bartley, n.d.). Approximately 402.74 million terabytes of data were generated daily throughout 2024, with an estimated 90% of the world's data having been generated within just the past two years. This explosive growth reflects the increasing digitization of global activities across consumer applications and industrial operations. Relational databases continue to be the backbone of enterprise data storage. A survey in 2019 found that 55% of businesses use SQL Server, 40% use MySQL, and 37% use Oracle for their data needs (Bourne, 2019). This prevalence of relational databases creates a critical need for tools that make this structured data accessible. Data democratization, the ongoing effort to make data accessible to everyone within an organization regardless of technical expertise, has become a key priority for forward-thinking businesses. Text-to-SQL systems are vital enablers of this democratization, allowing business users to access insights without intermediate technical assistance, thereby empowering them to make data-driven decisions and build better customer experiences (Sandwell, 2021).

Figure 1: Text-to-SQL Process

1.3 Challenges in Text-to-SQL


Academic studies highlight persistent challenges in Text-to-SQL systems:
• Semantic ambiguity: Natural language queries often lack explicit structural or contextual cues, leading
to misinterpretation of original intent (Renggli et al., 2025).
• Schema complexity: Cross-domain databases with evolving schemas require robust schema linking to
map user intent to tables/columns (Hong et al., 2025).
• Query execution accuracy: Hallucinations in JOIN operations, WHERE clauses, and aggregations reduce reliability (Mitsopoulou et al., 2025).
• Evaluation limitations: Synthetic benchmarks often fail to capture real-world data variability and metric biases (Mitsopoulou et al., 2025; Renggli et al., 2025).

1.4 Overview of Existing Approaches


Academic research categorizes Text-to-SQL methodologies as follows (Table 1). In-context LLM methods include those shown in Table 2. Recent advances include execution-guided
prompting (Hong et al., 2025), where intermediate SQL previews validate schema links, and skeleton-aware
decoding (Li et al., 2023), which separates structural parsing from schema alignment. However, fine-tuned
PLMs still outperform zero-shot LLMs by 12–15% on Spider’s dev set (Deng et al., 2022).

1.5 Problem Statement


Despite significant advances in text-to-SQL systems, substantial challenges remain. While fine-tuned Large Language Models (LLMs) have demonstrated promising results, they require extensive computational resources and domain-specific data for training. Alternative approaches using in-context learning offer a lighter-weight solution but often suffer from critical errors, including schema hallucinations, incorrect joins, and ungrounded WHERE clauses, that significantly limit execution accuracy. The core problem is finding an effective balance between performance and resource efficiency. Existing solutions either require prohibitive amounts of training data and computational power or sacrifice accuracy for practicality. There is a pressing need for a solution that can maintain high execution accuracy while remaining computationally efficient and adaptable to new database schemas without extensive retraining or fine-tuning.

Table 1: Overview of existing methodologies in Text-to-SQL (Hong et al., 2025; Deng et al., 2022)

Approach | Key Characteristics | Limitations
Rule-based systems | Manual heuristics for simple queries; limited scalability. | Fragile to schema changes; poor cross-domain generalization.
Deep learning models | Sequence-to-sequence architectures (e.g., LSTMs) improved cross-domain generalization. | Struggle with complex JOINs and nested queries.
Pre-trained LMs (PLMs) | BERT/RoBERTa fine-tuned for schema-aware SQL generation. | Require domain-specific training data; parameter inefficiency.
LLM-based methods | In-context learning with GPT variants; reduced training costs. | Prone to hallucinations; limited by context windows.

Table 2: Overview of existing methodologies in Text-to-SQL which utilize in-context learning

Approach | Key Methodology | Performance
ACT-SQL (Zhang et al., 2023) | Automatically generates Chain-of-Thought prompts for schema linking, achieving 85.3% execution accuracy on Spider via a single API call. | SOTA among ICL methods on Spider dev set.
DIN-SQL (Pourreza and Rafiei, 2023) | Decomposes tasks into sub-problems (schema linking, query refinement) with self-correction, reaching 79.9% accuracy on Spider. | Outperforms fine-tuned models by 5%+.
ICRL (Toteja et al., 2025) | Uses in-context reinforcement learning for retrieval-augmented generation, improving schema recall by 15-20%. | Enhanced scalability for multi-database systems.


1.6 Objectives
This research aims to develop and evaluate a novel approach to text-to-SQL translation with the following
specific objectives:
• Create a two-stage LLM pipeline using few-shot prompting that eliminates the need for model fine-
tuning while maintaining high execution accuracy.
• Implement effective schema linking mechanisms to select only relevant tables and columns from database
schemas, reducing potential for hallucination and error.
• Develop a method for generating SQL preview queries that can extract representative data rows to
ground the final query generation.
• Evaluate the performance of the proposed approach on standard benchmarks, specifically the Spider
dataset, using both ChatGPT-3.5-turbo and GPT-4o-mini models.
• Compare the execution accuracy of our approach with existing methods, particularly more complex
in-context methods like DIN-SQL and CoT-based pipelines.
• Demonstrate that lightweight decomposition and execution-grounded prompting can produce robust
performance without requiring extensive computational resources or specialized model training.

1.7 Scope
This research focuses specifically on developing and evaluating a two-stage LLM pipeline for text-to-SQL
translation using few-shot prompting without any model fine-tuning. The scope includes:
• Development of prompting strategies for two distinct phases: schema linking with preview query
generation, and final SQL query generation.
• Testing on the Spider dataset, a cross-domain text-to-SQL benchmark featuring 200 databases across
138 domains.
• Evaluation using execution accuracy metrics on Spider’s development set.
• Comparison with existing methods, particularly in-context learning approaches.
• Analysis of performance differences between ChatGPT-3.5-turbo and GPT-4o-mini models when used
within our pipeline.
The research does not address cross-lingual text-to-SQL generation, handling of extremely complex database
schemas beyond those present in the Spider dataset, or application to specialized domains requiring expert
knowledge. Additionally, while our approach is designed to be adaptable to different LLMs, we limit our
testing to the specified models.

2 Literature Survey
2.1 Introduction
The exponential growth of structured data, projected to reach 181 zettabytes by 2025, and the ubiquity of relational databases (55% of enterprises use SQL Server) have made text-to-SQL translation a critical technology for data democratization (Gretel.ai, n.d.). Historically, database querying evolved from punch cards in the 1940s to Edgar Codd's relational model in 1970, culminating in SQL, which was developed at IBM and later standardized. Early text-to-SQL systems relied on rule-based heuristics, but their fragility to schema changes and limited cross-domain generalization spurred neural approaches (Deng et al., 2022). The emergence of large language models (LLMs) like GPT-4 has revolutionized this field. Unlike fine-tuned models requiring domain-specific data, LLMs leverage in-context learning to adapt to new schemas dynamically (Zhu and Li, 2024). This shift aligns with industry demands: 72% of organizations now prioritize natural language interfaces to reduce dependency on technical teams (Gretel.ai, n.d.). Modern benchmarks like Spider (200 databases across 138 domains) highlight both the progress and the persistent challenges: current LLM-based methods achieve 68-85% execution accuracy, but hallucinations and schema linking errors remain prevalent (Hong et al., 2025).

2.1.1 Text-to-SQL Task Overview


Text-to-SQL is formally defined as a mapping SQL = f(Q, S), where Q is a natural language question and S is the database schema (tables, columns, relationships). The goal is to generate an executable SQL query that retrieves the intended data.
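For concreteness, a minimal instance of this mapping (with a hypothetical single-table schema) can be written as:

    # Illustrative instance of the mapping SQL = f(Q, S); the schema and
    # question here are hypothetical.
    Q = "Show the names of all students older than 20."
    S = {"student": ["id", "name", "age"]}            # schema: table -> columns
    SQL = "SELECT name FROM student WHERE age > 20;"  # the desired output f(Q, S)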

2.2 In-Context Learning (ICL)


In-Context Learning (ICL) allows LLMs to perform tasks by leveraging examples or instructions provided
within the input prompt, without fine-tuning the model. For Text-to-SQL, ICL is particularly valuable
because it enables the model to adapt to diverse database schemas and query patterns by including schema
details and example query pairs in the prompt. For instance, providing a database schema alongside sample
natural language questions and their corresponding SQL queries helps the model infer the correct structure
and semantics for new queries. This is especially helpful in Text-to-SQL, where queries often involve complex
joins, aggregations, or nested subqueries that vary across databases. ICL reduces the need for extensive
training on domain-specific datasets, making it a flexible solution for real-world applications where database
structures are dynamic (Wei et al., 2023). By grounding the model in context, ICL improves query accuracy and robustness, particularly for edge cases where schema nuances or ambiguous language could lead to
errors. The effectiveness of ICL in Text-to-SQL stems from its ability to provide contextual cues that
guide the model’s reasoning. For example, including table descriptions and column relationships in the
prompt helps disambiguate terms like “order” (which could refer to a table or a sorting operation). Research
highlights that ICL can achieve competitive performance compared to fine-tuned models, especially when
prompts are carefully designed to include diverse examples (Wei et al., 2023). This adaptability is crucial
for Text-to-SQL systems deployed in environments with frequently updated or unfamiliar databases, as it
minimizes the need for retraining while maintaining high accuracy.

2.2.1 Few-Shot Prompting


Few-Shot Prompting extends ICL by providing a small number of task-specific examples in the prompt to
guide the model’s output. In Text-to-SQL, few-shot prompting is highly effective because it allows the model
to learn from a handful of natural language-to-SQL query pairs, enabling it to generalize to new queries. For
instance, including three to five examples of questions with their corresponding SQL queries for a specific
database schema can significantly improve the model’s ability to handle similar queries. This is particularly
useful in Text-to-SQL, where the diversity of database schemas and query types makes exhaustive training
impractical. Few-shot prompting helps the model capture patterns, such as how to translate phrases like
“top 10 customers” into SQL’s LIMIT and ORDER BY clauses, without requiring extensive labeled data
(Gao et al., 2023).
The primary advantage of few-shot prompting in Text-to-SQL is its efficiency in low-resource settings. By
carefully selecting representative examples, developers can achieve robust performance with minimal data,
which is critical for specialized domains like healthcare or finance, where labeled Text-to-SQL datasets are
scarce (Gao et al., 2023). Moreover, few-shot prompting allows for rapid adaptation to new schemas or
query styles by updating the examples in the prompt, offering a practical solution for real-time applications.
Studies demonstrate that few-shot prompting can outperform zero-shot approaches by providing the model
with enough context to handle complex queries involving multiple tables or aggregations, making it a powerful
tool for Text-to-SQL systems.
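As an illustration, a few-shot prompt in this style might interleave the schema with question-SQL pairs; the schema, examples, and formatting below are hypothetical rather than the exact prompts used by any particular system:

    # A minimal few-shot Text-to-SQL prompt skeleton (hypothetical schema
    # and examples). The final line is left open for the model to complete.
    FEW_SHOT_PROMPT = """Schema:
    customer(id, name, total_spent)

    Q: List the top 10 customers by total spending.
    SQL: SELECT name FROM customer ORDER BY total_spent DESC LIMIT 10;

    Q: How many customers are there?
    SQL: SELECT COUNT(*) FROM customer;

    Q: {question}
    SQL:"""

    prompt = FEW_SHOT_PROMPT.format(question="Which customer spent the most?")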

2.2.2 Chain Of Thought Prompting


Chain-of-Thought (CoT) Prompting encourages LLMs to break down complex tasks into intermediate rea-
soning steps before producing the final output. In Text-to-SQL, CoT is particularly beneficial because it
guides the model to explicitly reason about the query’s intent, the database schema, and the required SQL
constructs. For example, when prompted with a question like “Which employees have the highest salaries
in each department?”, CoT might lead the model to first identify the need for grouping by department,
then selecting the maximum salary per group, and finally joining with the employee table to retrieve names
(Tai et al., 2023). By articulating these steps, CoT reduces errors in complex queries that involve nested
subqueries or conditional logic. This structured reasoning is critical for Text-to-SQL, where misinterpreting
a single term can lead to incorrect SQL syntax or semantics.
CoT’s strength in Text-to-SQL lies in its ability to handle ambiguity and complexity. Natural language
queries often contain vague or context-dependent terms, such as “recent orders” or “best performers,” which
require careful interpretation. CoT prompting helps the model resolve these ambiguities by explicitly map-
ping query components to schema elements and SQL operations, improving both accuracy and explainability
(Tai et al., 2023). Research shows that CoT can enhance performance on benchmark datasets like Spider,
where queries span multiple tables and require intricate logic (Gao et al., 2023). By combining CoT with
few-shot prompting, developers can further boost performance by providing examples that include reasoning
steps, making CoT a versatile approach for advancing Text-to-SQL systems.
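To make the example above concrete, the intermediate reasoning and the resulting query might decompose as follows (the employee schema is hypothetical):

    # CoT-style decomposition for: "Which employees have the highest
    # salaries in each department?" (hypothetical schema:
    # employee(name, salary, dept_id)).
    # Step 1: group employee rows by dept_id and compute MAX(salary) per group.
    # Step 2: join the per-department maxima back to employee to recover names.
    COT_SQL = """
    SELECT e.dept_id, e.name, e.salary
    FROM employee AS e
    JOIN (SELECT dept_id, MAX(salary) AS max_salary
          FROM employee
          GROUP BY dept_id) AS m
      ON e.dept_id = m.dept_id AND e.salary = m.max_salary;
    """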

2.3 Multi-Stage LLM Pipelines


Multi-stage LLM pipelines break down the Text-to-SQL process into sequential or iterative steps, each
handled by specialized modules or prompts. These pipelines typically include stages for schema linking
(mapping query terms to database elements), query classification (determining complexity), SQL generation,
and self-correction or validation. By modularizing the task, these frameworks enhance LLMs' ability to handle complex queries involving multiple tables, nested subqueries, or ambiguous language. The pipelines
often incorporate In-Context Learning (ICL), Few-Shot Prompting, and Chain-of-Thought (CoT) reasoning
to guide LLMs, reducing errors and improving generalization across diverse database schemas.

2.3.1 DIN-SQL
DIN-SQL (Decomposed In-Context Learning of Text-to-SQL) is a prominent multi-stage pipeline that
decomposes the Text-to-SQL task into four key stages: schema linking, query classification, SQL genera-
tion, and self-correction. The schema-linking stage identifies references to database elements and condition
values in the query, using embeddings to map natural language terms to schema components (Pourreza and
Rafiei, 2023). The classification stage categorizes queries into easy, non-nested complex, or nested complex,
enabling tailored prompts for each category. For complex queries, DIN-SQL introduces an intermediate
representation called NatSQL, which simplifies SQL syntax by removing operators without clear natural
language counterparts, easing the transition from query to code (Pourreza and Rafiei, 2023).

2.3.2 ACT-SQL
ACT-SQL (Automatically-generated Chain-of-Thought SQL) is a multi-stage pipeline that enhances Text-to-SQL by
integrating dynamic few-shot prompting with CoT reasoning. In the first stage, ACT-SQL employs a schema-
linking module to identify relevant tables and columns in the user’s natural language query, using similarity-
based scoring to select contextually appropriate few-shot examples. These examples are dynamically chosen
based on their similarity to the input query and SQL structure, ensuring relevance (Zhang et al., 2023).
The second stage involves query classification, categorizing the query as easy, non-nested complex, or nested
complex, which informs the complexity of the subsequent SQL generation. The final stage generates the
SQL query using a CoT-guided prompt that includes intermediate reasoning steps, followed by a validation
step to ensure syntactic and semantic correctness (Zhang et al., 2023).

2.3.3 PET-SQL
PET-SQL (Prompt-Enhanced Two-round refinement of Text-to-SQL) is a multi-stage pipeline that leverages prompt enhancement and cross-consistency to improve Text-to-SQL performance, particularly for complex and cross-domain queries. The first stage
involves schema linking, where PET-SQL uses a combination of neural embeddings and pattern-based rules
to map query terms to database elements, identifying tables, columns, and values with high precision (Li et
al., 2024). The second stage employs a query decomposition module that breaks down the natural language
query into semantic components (e.g., entities, conditions, aggregations), guided by learned patterns from a
curated training set. The final stage generates the SQL query using a pattern-enhanced LLM prompt that
incorporates CoT reasoning and few-shot examples tailored to the query’s semantic structure, followed by a
self-consistency check to refine the output (Li et al., 2024).

2.4 Challenges with Schema Linking


Schema linking, the process of mapping natural language tokens to database elements, remains one of the most critical yet error-prone stages in text-to-SQL systems. Challenges stem from lexical gaps (e.g., "doctor" vs. a schema's "physician_id") and polysemy (e.g., "date" ambiguously referring to diagnosis date or admission date). These issues are exacerbated in multi-hop relationships, where indirect links through junction tables require nuanced reasoning. Recent studies highlight bidirectional approaches to improve recall.
RSL-SQL (Cao et al., 2024) combines forward and backward schema pruning, achieving 94% recall while
reducing input columns by 83%. This method mitigates the risk of omitting critical elements by validat-
ing schema links against both initial SQL drafts and simplified schemas. Meanwhile, RAT-SQL (Wang
et al., 2020) employs relation-aware self-attention to encode predefined schema relationships (e.g., foreign
keys, column-table ownership), improving F1 scores by 9% on Spider. The framework’s ability to jointly
learn schema-question alignments reduces errors in complex JOIN operations by 18% (Wang et al., 2020).
Emergent techniques like CRUSH4SQL (Kothyari et al., 2023) leverage LLM-hallucinated schemas to guide
retrieval. By prompting LLMs to generate minimal viable schemas for a query, this method achieves Recall@30 of 0.86 on industrial-scale databases, outperforming traditional dense retrieval by 22% (Kothyari et al., 2023). However, Lei et al. (2020) caution that even state-of-the-art models like SLSQL exhibit 23% error
rates when schema linking fails, underscoring its pivotal role in overall system accuracy.

2.5 Execution Accuracy and Hallucinations


Execution accuracy, measured by whether generated SQL returns the correct result set, is the gold standard for text-to-SQL evaluation. However, hallucinations persist across all stages:
• Schema Hallucinations (23% frequency) (Qu et al., 2024): LLMs often reference nonexistent columns (e.g., patient_age instead of date_of_birth) or tables. TA-SQL (Qu et al., 2024) categorizes these into:

– Schema Contradictions: Incorrect SQL violating schema structure (30% of errors). Example: referencing sets.language instead of set_translations.language (Qu et al., 2024).
– Attribute Overanalysis: Introducing unnecessary columns (49% of errors). Example: adding height to SELECT when only player_name is needed (Qu et al., 2024).
• Logic Hallucinations (25% frequency) (Qu et al., 2024): These include JOIN redundancies (15%) and clause misuse (25%), such as an erroneous GROUP BY in aggregation-free queries. DIN-SQL (Pourreza and Rafiei, 2023; as analyzed by Qu et al., 2024) reduces such errors by 37% through execution-grounded self-correction, validating intermediate queries against database previews.
Evaluation inconsistencies further complicate accuracy metrics. Spider's evaluation flags 12% of semantically correct queries as incorrect due to alias mismatches (e.g., using table names instead of T1, T2) (lamini.ai, n.d.). Worse, 5% of Spider's "gold" SQL contains errors, such as omitted JOIN conditions, which skew benchmarking (lamini.ai, n.d.). RSL-SQL (Cao et al., 2024) addresses this by implementing multi-turn self-correction, iteratively refining SQL using execution feedback to achieve 87.9% accuracy on Spider with GPT-4o.

2.6 Limitations in Existing Methods


Despite advancements, critical gaps persist:
• Pipeline Complexity: Multi-stage frameworks like DIN-SQL (Pourreza and Rafiei, 2023) incur 5×
latency (18s/query) due to sequential schema linking, classification, drafting, and correction phases.
SQL-to-Schema (Yang et al., 2024) reveals that 40% of industrial deployments abandon such pipelines
for simpler few-shot models despite lower accuracy.
• Example Selection Bias: Similarity-based retrieval (e.g., SBERT embeddings) misses 31% of function-
ally diverse cases, as shown in TA-SQL (Qu et al., 2024). This bias exacerbates schema linking failures
in cross-domain scenarios.
• Schema Adaptation: Models trained on Spider fail catastrophically on clinical databases with 500+
tables, where CRUSH4SQL (Kothyari et al., 2023) achieves only 52% recall versus 86% on standard
benchmarks.
• Dataset Noise: Beyond Spider’s alias inconsistencies, the BIRD benchmark contains 52/106 faulty SQL
entries, including incorrect aggregations and missing WHERE clauses (lamini.ai, n.d.).
• Cost-Efficiency Trade-offs: Cao et al. (2024) demonstrate that GPT-4-based solutions achieve SOTA
accuracy but remain prohibitively expensive, while cheaper models like DeepSeek suffer 15–20% accu-
racy drops.

These limitations underscore the need for lightweight, execution-grounded approaches that balance modularity with computational efficiency, a gap our proposed methodology aims to address.

3 Methodology
3.1 Introduction
The goal of this research is to develop a robust, lightweight text-to-SQL system that enables non-technical
users to query relational databases using natural language, without the need for model fine-tuning or ex-
tensive domain-specific data. This chapter details the methodology adopted to address the core challenges
identified in the literature (namely schema linking, SQL generation accuracy, and hallucination mitigation) by leveraging recent advances in large language models (LLMs) and modular pipeline design. Building on the limitations of prior single-stage and monolithic approaches, our method employs a two-stage LLM pipeline
inspired by modular frameworks such as DIN-SQL (Pourreza and Rafiei, 2023) and execution-grounded
prompting strategies. The first stage focuses on schema linking and preview query generation, while the
second stage utilizes the extracted schema context and representative data to guide accurate SQL synthesis.
The entire process is designed to operate in a few-shot in-context learning (ICL) setting, minimizing resource
requirements and maximizing adaptability to new domains. Evaluation is conducted on the widely adopted
Spider benchmark (Yu et al., 2018) which is recognized for its cross-domain complexity and has become the
standard for assessing text-to-SQL systems. Key performance metrics include execution accuracy and exact-
set-match, in line with established evaluation protocols. By combining modular decomposition, execution
validation, and carefully engineered prompting, this methodology aims to bridge the gap between natural
language and structured data access in a scalable and practical manner.

3.2 Architecture

Figure 2: Architecture of the proposed two-stage Text-to-SQL pipeline

The Text-to-SQL system consists of two stages: a Schema Linking LLM and a Code Generation LLM. This architecture divides the core tasks of Text-to-SQL reasoning, improving both grounding and interpretability. Each model is provided the required context and few-shot examples to ensure consistency in answers. Additionally, the Code Generation model is given a preview of the relevant schema, allowing it to ground row-level selections, for instance when generating a WHERE or HAVING clause.

3.2.1 Few-Shot Prompting


The database schema is provided to the LLMs in an "API docs" format, i.e., expressed as Python documentation comments. This reduces prompt size while providing clarity to the LLM. Both models in the pipeline are given 3 relevant examples as part of the system prompt. The example prompts are picked from a set of 7,000 examples, based on cosine similarity of their embeddings. Embeddings are generated at runtime.

Figure 3: Preview of examples given to Schema Linking model.

Figure 4: Preview of examples given to Code Generation model.
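The exact renderings used in our prompts are shown in Figures 3 and 4; as a rough sketch, an API-docs-style schema for an invented two-table database might be serialized like this:

    # Hypothetical "API docs" rendering of a schema as Python documentation
    # comments; table and column names are invented for illustration.
    SCHEMA_DOC = '''
    """Database: concert_db

    Table singer:
        singer_id (int): primary key.
        name (text): the singer's name.
        age (int): the singer's age.

    Table concert:
        concert_id (int): primary key.
        singer_id (int): foreign key -> singer.singer_id.
        year (int): year the concert was held.
    """
    '''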

Prompt for Schema Linking Model: The model is provided the user question, the database schema, and examples, and generates SQL SELECT statements selecting the first 5 rows of the relevant tables and columns. Each example contains a user question, a database schema, and the target SELECT statement, informing the LLM of the expected structure of the generation. If multiple tables are relevant, a separate SQL statement is generated to SELECT from each, without performing JOINs.

Prompt for Code Generation Model: The model is provided the user question, the database schema, a preview of the relevant schema, and examples, and generates the final SQL statement answering the user question. Each example contains a question, a database schema, a preview of the relevant schema, and the target SQL statement, informing the LLM of the expected structure of the generation.
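A minimal sketch of the runtime example selection described above, using the sentence-transformers model listed in Section 3.3.2 (the structure of the example pool is an assumption):

    # Select the 3 most similar few-shot examples by cosine similarity of
    # question embeddings, computed at runtime.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def pick_examples(question, pool, k=3):
        """pool: list of dicts with 'question' and 'sql' keys (assumed format)."""
        q_emb = encoder.encode(question, convert_to_tensor=True)
        pool_embs = encoder.encode([ex["question"] for ex in pool],
                                   convert_to_tensor=True)
        scores = util.cos_sim(q_emb, pool_embs)[0]         # shape: (len(pool),)
        top = scores.topk(k=min(k, len(pool))).indices.tolist()
        return [pool[i] for i in top]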

3.2.2 Inference Process


The step-by-step Text-to-SQL process is as follows (a minimal code sketch follows the list):
1. User asks question based on the current database.
2. User question and schema of current database are taken.
3. Embeddings are generated to find closest matching examples.

4. User question and database schema are passed to the Schema Linking LLM.
5. Schema Linking LLM provides SELECT statement, selecting 5 rows of the relevant schema.
6. SELECT statement is run on database to generate preview.

7. Embeddings are generated again to find closest matching examples.

8. User question, database schema, and the database preview of the relevant schema are passed to the Code Generation LLM.
9. Code Generation LLM generates final SQL query.
10. Query is run on database and result is collected.
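The sketch below ties these steps together. The chain objects and helper names (link_chain, gen_chain, pick_examples) are illustrative rather than our exact implementation; since Spider ships its databases as SQLite files, sqlite3 suffices for execution:

    import sqlite3

    def text_to_sql(question, schema_doc, db_path, link_chain, gen_chain, pool):
        """Two-stage inference: schema linking with previews, then generation."""
        examples = pick_examples(question, pool)               # steps 2-3
        preview_sql = link_chain.invoke({                      # steps 4-5
            "question": question, "schema": schema_doc, "examples": examples})
        conn = sqlite3.connect(db_path)
        previews = []
        for stmt in preview_sql.split(";"):                    # step 6: run previews
            if stmt.strip():
                cur = conn.execute(stmt)
                previews.append((stmt.strip(), cur.fetchmany(5)))
        examples = pick_examples(question, pool)               # step 7
        final_sql = gen_chain.invoke({                         # steps 8-9
            "question": question, "schema": schema_doc,
            "preview": previews, "examples": examples})
        result = conn.execute(final_sql).fetchall()            # step 10
        conn.close()
        return final_sql, result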

3.3 Setup
The Langchain (Chase, n.d.) library has been used for building and running inference on LLM chains. Langchain is an open-source library for working with LLMs; its connectors to model providers were used for prototyping and testing with multiple models during hypothesis checking.
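As an illustration, one of the two chains might be assembled with Langchain's expression language roughly as follows; the prompt wording is abbreviated, and the actual prompts follow Section 3.2.1:

    # A minimal sketch of building one chain with Langchain (LCEL style).
    from langchain_openai import ChatOpenAI
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You translate questions to SQL.\nSchema:\n{schema}\n"
                   "Examples:\n{examples}"),
        ("human", "{question}"),
    ])
    gen_chain = prompt | llm | StrOutputParser()  # .invoke({...}) returns a string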

3.3.1 Dataset Used


The Spider (Yu et al., 2018) dataset has been used as the source of few-shot examples and for evaluation. Spider is a commonly used
benchmark to evaluate text-to-SQL parsing in a cross-database setting, which requires models to generalize
to novel database schemas. The dataset consists of 7,000 question-query pairs in the training set and 1,034
pairs in the development set, covering 200 different databases and 138 domains.

3.3.2 Models Used


Embeddings: Embeddings generation is crucial for finding semantic similarity during example selection.
For this, we have utilized sentence-transformers/all-MiniLM-L6-v2.

LLMs: OpenAI's gpt-3.5-turbo (OpenAI, 2023) and gpt-4o-mini (OpenAI, 2024) have been used for LLM generation, owing to their low cost when working with large numbers of tokens.

3.4 Testing
Spider-dev, containing 1,034 examples, has been used for evaluation; Spider-test is not open to the public.

3.4.1 Method
OpenAI’s gpt-3.5-turbo and gpt-4o-mini have been used for evaluation purpose. They have been run in two
separate ways:
• With chain: The model is called twice, using the prompting method described above.
• Without chain: The model is called once in zero-shot manner.
The latter is used as a baseline, allowing us to understand the benefits of the suggested approach.

3.4.2 Metrics
Spider's repository provides an evaluation script, along with the databases. The evaluation script takes in two files: pred.sql and gold.sql. It then reports the following metrics:
• EX: Execution Accuracy measures whether the predicted SQL query returns the same result as the
gold query when executed.
• EM: Exact Match checks if the predicted SQL query matches the gold query exactly in structure and
content.
• PM: Partial Match evaluates overlap in SQL components (SELECT, WHERE, JOIN, etc.) between predicted
and gold queries.
Additionally, it breaks down each metric based on difficulty level of question: Easy, Medium, Hard and
Extra Hard. We will be considering all of the provided metrics to judge the efficacy of our methods.
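For reference, a typical invocation of the evaluation script, wrapped in Python, might look like the following; the flag names follow the public Spider repository's evaluation.py and may differ across versions:

    # Hedged sketch: run Spider's evaluation script over predicted vs. gold SQL.
    import subprocess

    subprocess.run([
        "python", "evaluation.py",
        "--gold", "gold.sql",        # one gold query (and db id) per line
        "--pred", "pred.sql",        # one predicted query per line
        "--db", "database/",         # directory of SQLite databases
        "--table", "tables.json",    # schema definitions
        "--etype", "all",            # report exact match and execution accuracy
    ], check=True)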

4 Results and Discussion
After running the pipelines on Spider-dev, we obtained the results presented in Tables 3 through 9.

Table 3: Baseline for GPT-4o-mini


Metric Easy Medium Hard Extra All Joint All
Count 250 440 174 170 1034 1
Execution Accuracy 0.928 0.830 0.695 0.571 0.788 0.000
Exact Match Accuracy 0.336 0.327 0.184 0.053 0.260 0.000
Partial Matching Accuracy Details:
Select 0.913 0.944 0.914 0.757 0.914 0.000
Select (no AGG) 0.913 0.944 0.926 0.811 0.921 0.000
Where 0.919 0.667 0.457 0.306 0.623 0.000
Where (no OP) 0.968 0.676 0.586 0.389 0.681 0.000
Group (no Having) 0.941 0.667 0.875 1.000 0.897 0.000
Group 0.059 0.000 0.875 1.000 0.310 0.000
Order 0.909 0.889 0.643 1.000 0.860 0.000
And/Or 1.000 0.929 0.929 0.933 0.947 0.000
IUEN 0.000 0.000 0.000 0.000 0.000 0.000
Keywords 0.960 0.824 0.608 0.486 0.781 0.000

Table 4: Baseline for GPT-3.5-Turbo


Metric Easy Medium Hard Extra All Joint All
Count 250 440 174 170 1034 1
Execution Accuracy 0.888 0.852 0.690 0.576 0.788 0.000
Exact Match Accuracy 0.468 0.316 0.264 0.041 0.299 0.000
Partial Matching Accuracy Details:
Select 0.895 0.915 0.954 0.833 0.905 0.000
Select (no AGG) 0.895 0.923 0.972 0.854 0.914 0.000
Where 0.924 0.426 0.522 0.169 0.498 0.000
Where (no OP) 0.946 0.431 0.587 0.265 0.535 0.000
Group (no Having) 0.857 0.667 0.786 0.316 0.623 0.000
Group 0.286 0.000 0.714 0.263 0.358 0.000
Order 0.909 0.967 0.684 0.316 0.756 0.000
And/Or 1.000 0.959 0.970 0.913 0.964 0.000
IUEN 0.000 0.000 0.000 0.000 0.000 0.000
Keywords 0.953 0.658 0.676 0.417 0.688 0.000

4.1 Execution Accuracy (EX)


Our method with GPT-3.5-turbo achieved an overall execution accuracy of 0.682, a 13.5% drop from the
baseline (0.788). With GPT-4o-mini, our method scored 0.775, a 1.6% decrease from the baseline (0.788).
Both configurations underperformed compared to state-of-the-art (SOTA) methods like DIN-SQL (Pourreza
and Rafiei, 2023) and RSL-SQL (Cao et al., 2024), which report 0.879 accuracy using GPT-4/GPT-4o.
Notably, our method showed competitive performance on medium-difficulty queries with GPT-4o-mini (0.852
vs. 0.830 baseline), surpassing the baseline by 2.7%. However, it struggled significantly with hard (0.506 vs.
0.690 for GPT-3.5-turbo, a 26.7% drop) and extra-hard queries (0.424 vs. 0.576 for GPT-3.5-turbo, a 26.4%
drop), indicating limitations in handling complex query structures.

Table 5: GPT-4o-mini with Proposed Approach
Metric Easy Medium Hard Extra All Joint All
Count 250 440 174 170 1034 1
Execution Accuracy 0.896 0.852 0.684 0.488 0.775 0.000
Exact Match Accuracy 0.824 0.541 0.374 0.106 0.510 0.000
Partial Matching Accuracy Details:
Select 0.915 0.920 0.939 0.796 0.905 0.000
Select (no AGG) 0.932 0.926 0.954 0.823 0.918 0.000
Where 0.920 0.589 0.456 0.189 0.569 0.000
Where (no OP) 0.950 0.606 0.578 0.392 0.642 0.000
Group (no Having) 0.857 0.882 0.826 0.605 0.794 0.000
Group 0.857 0.676 0.783 0.372 0.632 0.000
Order 0.783 0.881 1.000 0.690 0.837 0.000
And/Or 1.000 0.948 0.952 0.942 0.960 0.000
IUEN 0.000 0.000 0.615 0.000 0.571 0.000
Keywords 0.917 0.839 0.646 0.625 0.784 0.000

Table 6: GPT-3.5-turbo with Proposed Approach


Metric Easy Medium Hard Extra All
Count 250 440 174 170 1034
Execution Accuracy 0.804 0.782 0.506 0.424 0.682
Exact Match Accuracy 0.744 0.443 0.276 0.059 0.425
Partial Matching Accuracy Details:
Select 0.943 0.862 0.944 0.814 0.891
Select (no AGG) 0.956 0.868 0.960 0.832 0.902
Where 0.802 0.475 0.398 0.149 0.474
Where (no OP) 0.810 0.486 0.505 0.287 0.528
Group (no Having) 0.941 0.850 0.765 0.565 0.795
Group 0.706 0.733 0.706 0.522 0.684
Order 0.783 1.000 0.810 0.741 0.866
And/Or 1.000 0.958 0.956 0.929 0.964
IUEN 0.000 0.000 0.400 0.000 0.400
Keywords 0.833 0.823 0.592 0.425 0.716

Table 7: Execution Accuracy (EX) Summary


Method Easy Medium Hard Extra Overall
Baseline GPT-3.5-turbo 0.888 0.852 0.690 0.576 0.788
Our Method (GPT-3.5-turbo) 0.804 0.782 0.506 0.424 0.682
Baseline GPT-4o-mini 0.928 0.830 0.695 0.571 0.788
Our Method (GPT-4o-mini) 0.896 0.852 0.684 0.488 0.775
DIN-SQL (GPT-4) (Pourreza and Rafiei, 2023) - - - - 0.879
RSL-SQL (GPT-4o) (Cao et al., 2024) - - - - 0.879

4.2 Exact Match Accuracy (EM)


Exact match accuracy assesses whether generated SQL queries structurally match the gold standard. Table
8 presents the results. Our method with GPT-3.5-turbo achieved an overall exact match accuracy of 0.425,
which is a notable improvement over its baseline of 0.299. For GPT-4o-mini, our method scored an overall

Table 8: Exact Match Accuracy (EM) Summary
Method Easy Medium Hard Extra Overall
Baseline GPT-3.5-turbo 0.468 0.316 0.264 0.041 0.299
Our Method (GPT-3.5-turbo) 0.744 0.443 0.276 0.059 0.425
Baseline GPT-4o-mini 0.336 0.327 0.184 0.053 0.260
Our Method (GPT-4o-mini) 0.824 0.541 0.374 0.106 0.510
DIN-SQL (GPT-4) (Pourreza and Rafiei, 2023) - - - - N/A
RSL-SQL (GPT-4o) (Cao et al., 2024) - - - - N/A

exact match accuracy of 0.510, substantially outperforming its baseline of 0.260. This indicates that the
proposed pipeline significantly enhances the structural correctness of the generated SQL queries compared
to the zero-shot baseline for both models. The improvement is particularly stark for GPT-4o-mini, where
exact match accuracy nearly doubled. While these exact match scores are lower than the execution accuracy
scores, the improvement over baseline is a positive indicator for the structural quality of SQL generated by
our method.

4.3 Partial Match Accuracy


Partial match accuracy evaluates whether generated SQL queries partially align with the gold standard.
Table 9 presents the summary results. Our method with GPT-3.5-turbo achieved an overall partial match

Table 9: Partial Match Accuracy Summary


Method Easy Medium Hard Extra Overall
Baseline GPT-3.5-turbo 0.920 0.876 0.732 0.610 0.820
Our Method (GPT-3.5-turbo) 0.848 0.810 0.550 0.460 0.718
Baseline GPT-4o-mini 0.944 0.860 0.738 0.602 0.820
Our Method (GPT-4o-mini) 0.912 0.876 0.710 0.520 0.802
DIN-SQL (GPT-4) (Pourreza and Rafiei, 2023) - - - - 0.892
RSL-SQL (GPT-4o) (Cao et al., 2024) - - - - 0.892

accuracy of 0.718, a 12.4% drop from the baseline (0.820). With GPT-4o-mini, our method scored 0.802, a
2.2% decrease from the baseline (0.820). These results indicate that our method performs better on partial
matches than exact matches, as minor syntactic variations are tolerated. For GPT-3.5-turbo, the largest
performance gaps occur on hard (0.550 vs. 0.732, a 24.9% drop) and extra-hard queries (0.460 vs. 0.610, a
24.6% drop), consistent with trends in execution accuracy. GPT-4o-mini narrows this gap, particularly on medium queries (0.876 vs. 0.860, a 1.9% improvement) and hard queries (0.710 vs. 0.738, a 3.8% drop).

5 Future Scope
The promising results of this two-stage LLM pipeline for text-to-SQL translation open several exciting
avenues for future exploration, particularly in enhancing its capability to handle complex queries and broad-
ening its applicability across diverse domains. One key area of focus is the integration of advanced reasoning
models, which were not tested in this study. Models specifically designed for logical reasoning, such as those
optimized for step-by-step problem-solving, could significantly improve the pipeline’s performance on hard
and extra-hard queries, where nested subqueries and intricate joins pose challenges. Additionally, incor-
porating Chain-of-Thought (CoT) prompting, which was not implemented in the current pipeline, offers a
powerful opportunity to enhance query accuracy. By guiding the LLM to explicitly articulate intermediate
reasoning steps, such as mapping query intent to schema elements and selecting appropriate SQL constructs,
CoT could reduce errors like incorrect column selections and missing conditions, bridging the performance
gap with state-of-the-art methods like DIN-SQL and RSL-SQL. Further refinements to the pipeline’s schema

13
linking mechanism present another critical direction for future work. The current approach relies on few-
shot prompting with cosine similarity-based example selection, which occasionally misses functionally diverse
cases, leading to suboptimal schema recall. Exploring bidirectional pruning techniques, as demonstrated in
RSL-SQL, or relation-aware attention mechanisms, as seen in RAT-SQL, could enhance the pipeline’s ability
to handle complex, multi-hop relationships in large-scale databases. Additionally, dynamic example selection
strategies that prioritize diversity in query patterns and schema structures could mitigate biases and im-
prove generalization across domains.

Beyond technical enhancements, future research could explore hybrid
approaches that combine the pipeline’s lightweight design with minimal fine-tuning to optimize performance
without sacrificing efficiency. For instance, fine-tuning the embedding model or selectively fine-tuning the
LLM on domain-specific subsets could improve schema linking precision while maintaining the benefits of
few-shot prompting. Another promising direction is the adaptation of the pipeline for cross-lingual text-to-
SQL tasks, enabling non-English queries to be processed effectively, thus broadening its global applicability.
By integrating reasoning models, CoT prompting, and advanced schema linking, and by testing on diverse
datasets, this pipeline can evolve into a highly versatile and efficient solution, further advancing data de-
mocratization and empowering users to seamlessly interact with structured data across industries.

6 Conclusion
This research has successfully pioneered a lightweight, two-stage LLM pipeline for text-to-SQL transla-
tion, harnessing the power of few-shot prompting to deliver robust performance without the heavy resource
demands of model fine-tuning. By drawing inspiration from advanced frameworks like DIN-SQL and in-
tegrating innovative execution-grounded prompting, this approach has made significant strides in tackling
critical challenges such as schema linking, query accuracy, and hallucination reduction. Evaluated on the
esteemed Spider dataset, spanning 200 databases across 138 domains, the pipeline leveraged GPT-3.5-turbo
and GPT-4o-mini to showcase its promising capabilities, advancing the mission of data democratization.

The results are encouraging, with the pipeline achieving an overall execution accuracy of 0.682 with GPT-
3.5-turbo and an impressive 0.775 with GPT-4o-mini, compared to baseline execution accuracies of 0.788 for
both models. Notably, the pipeline excelled on medium-difficulty queries, particularly with GPT-4o-mini,
where it surpassed the baseline by 2.7% (0.852 vs. 0.830) in execution accuracy, demonstrating its strength
in handling moderately complex tasks. While performance on hard and extra-hard queries showed room
for growth, the pipeline’s ability to generate functionally correct queries was evident in its partial match
accuracy scores (0.718 for GPT-3.5-turbo, 0.802 for GPT-4o-mini). Compared to state-of-the-art methods
like DIN-SQL and RSL-SQL, which achieve 0.879 execution accuracy with GPT-4/GPT-4o, our pipeline
holds its own as a resource-efficient alternative. Its lightweight design, bypassing the need for extensive
computational resources or domain-specific data, makes it well suited for real-world applications
where adaptability and speed are paramount. The innovative use of preview query generation to anchor
schema linking proved highly effective in curbing hallucinations, paving the way for more reliable query syn-
thesis. The pipeline’s modular architecture and few-shot prompting strategy empower non-technical users
to query relational databases and unlock data-driven insights with ease.

Looking ahead, opportunities abound to refine schema linking
through techniques like bidirectional pruning or relation-aware attention, inspired by RSL-SQL and RAT-
SQL, to boost performance on complex queries. Dynamic example selection with greater diversity could
further elevate prompt effectiveness, while hybrid approaches blending lightweight prompting with targeted
fine-tuning may unlock even higher accuracy without sacrificing efficiency.

This research marks a step
forward in making structured data accessible to all, proving that a lean, few-shot LLM pipeline can deliver
powerful text-to-SQL performance. By empowering users to seamlessly interact with databases, this work
lights the path toward a future where data is truly democratized, fostering innovation and decision-making
across industries.

References

Mitsopoulou, A., et al. (2025, February 11). Analysis of Text-to-SQL Benchmarks: Limitations, Challenges. Retrieved from https://openproceedings.org/2025/conf/edbt/paper-41.pdf

Wang, B., et al. (2020). RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7567–7578). Online: Association for Computational Linguistics.

Bourne, J. (2019, February 12). Relational databases 'here to stay' in the enterprise amid greater cloud and data complexity. CloudTech. Retrieved from https://www.cloudcomputing-news.net/news/relational-databases-here-stay-enterprise-amid-greater-cloud-and-data-complexity/

Cao, Z., et al. (2024, November 26). RSL-SQL: Robust Schema Linking in Text-to-SQL Generation. arXiv:2411.00073. Retrieved from https://arxiv.org/abs/2411.00073

Chase, H. (n.d.). Langchain. Retrieved from https://github.com/langchain-ai/langchain

Deng, N., et al. (2022). Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect. In Proceedings of the 29th International Conference on Computational Linguistics. Gyeongju, Republic of Korea: Association for Computational Linguistics.

Gao, D., et al. (2023, August 29). Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. arXiv:2308.15363. Retrieved from https://arxiv.org/abs/2308.15363

Mohammadjafari, A., et al. (2025, February 4). From Natural Language to SQL: Review of LLM-based Text-to-SQL Systems. arXiv:2410.01066v2. Retrieved from https://arxiv.org/pdf/2410.01066v2

Hong, Z., et al. (2025, March 13). Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL. arXiv:2406.08426v1. Retrieved from https://arxiv.org/pdf/2406.08426v1

Bartley, K. (n.d.). Big data statistics: How much data is there in the world? Rivery. Retrieved from https://rivery.io/blog/big-data-statistics-how-much-data-is-there-in-the-world/

Kothyari, M., et al. (2023). CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 14054–14066). Singapore: Association for Computational Linguistics.

lamini.ai. (n.d.). How a Fortune 500 slashed hallucinations to create 94.7% accurate LLM agents for SQL. Retrieved from https://www.lamini.ai/blog/llm-text-to-sql

Lei, W., et al. (2020). Re-examining the Role of Schema Linking in Text-to-SQL. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 6943–6954). Online: Association for Computational Linguistics.

Li, H., et al. (2023, April 10). RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL. arXiv:2302.05965. Retrieved from https://arxiv.org/pdf/2302.05965

Li, Z., et al. (2024, June 2). PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with Cross-consistency. arXiv:2403.09732. Retrieved from https://arxiv.org/pdf/2403.09732

OpenAI. (2023, March 1). Introducing ChatGPT and Whisper APIs. Retrieved from https://openai.com/blog/introducing-chatgpt-and-whisper-apis

OpenAI. (2024, July 18). OpenAI unveils cheaper small AI model GPT-4o mini. Reuters. Retrieved from https://www.reuters.com/technology/artificial-intelligence/openai-unveils-cheaper-small-ai-model-gpt-4o-mini-2024-07-18/

Pourreza, M., & Rafiei, D. (2023, April 21). DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. arXiv:2304.11015. Retrieved from https://arxiv.org/pdf/2304.11015

Qu, G., et al. (2024). Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation. In Findings of the Association for Computational Linguistics: ACL 2024 (pp. 5456–5471). Bangkok: Association for Computational Linguistics.

Renggli, C., et al. (2025, January 30). Fundamental Challenges in Evaluating Text2SQL Solutions and Detecting Their Limitations. arXiv:2501.18197v1. Retrieved from https://arxiv.org/pdf/2501.18197v1

Sandwell, D. (2021, June 3). What is data democratization and why is it important? Quest Blog. Retrieved from https://blog.quest.com/why-data-democratization-why-now-what-does-it-look-like/

Tai, C.-Y., et al. (2023). Exploring Chain of Thought Style Prompting for Text-to-SQL. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics.

Liu, X., & Tan, Z. (2023). Divide and Prompt: Chain of Thought Prompting for Text-to-SQL. arXiv preprint.

Toteja, R., et al. (2025). In-Context Reinforcement Learning with Retrieval-Augmented Generation for Text-to-SQL. In Proceedings of the 31st International Conference on Computational Linguistics. Abu Dhabi, UAE: Association for Computational Linguistics.

Wei, J., et al. (2023, January 10). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903. Retrieved from https://arxiv.org/pdf/2201.11903

Gretel.ai. (n.d.). What is Text-to-SQL? Retrieved from https://gretel.ai/technical-glossary/what-is-text-to-sql

Yang, S., et al. (2024). SQL-to-Schema Enhances Schema Linking in Text-to-SQL. In International Conference on Database and Expert Systems Applications (pp. 139–145). Cham: Springer.

Yu, T., et al. (2018). Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 3911–3921). Brussels, Belgium: Association for Computational Linguistics.

Zhang, H., et al. (2023, October 26). ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought. arXiv:2310.17342. Retrieved from https://arxiv.org/pdf/2310.17342

Zhang, Z., et al. (2022, October 7). Automatic Chain of Thought Prompting in Large Language Models. arXiv:2210.03493. Retrieved from https://arxiv.org/pdf/2210.03493

Zhu, X., & Li, Q. (2024, October 8). Large Language Model Enhanced Text-to-SQL. arXiv:2410.06011v1. Retrieved from https://arxiv.org/pdf/2410.06011v1
