Final Paper
Abstract
In an increasingly digitized world, vast volumes of structured data are being stored in relational
databases. Enabling non-technical stakeholders to query such databases via natural language interfaces
has become a critical need. Large Language Models (LLMs) have shown promise in bridging this gap
through Text-to-SQL translation. While fine-tuned LLMs achieve strong performance, they demand
substantial computational resources and domain-specific data. In contrast, in-context learning offers a
lighter-weight alternative but often suffers from schema hallucinations, incorrect joins, and ungrounded
WHERE clauses, limiting execution accuracy. To address these challenges, we propose a two-stage LLM
pipeline using few-shot prompting without any model fine-tuning. Few-shot prompting yields consistent
outputs with minimal engineering effort. The first LLM performs schema linking by selecting
relevant tables and columns and generating SQL preview queries over the relevant schema. These previews
are then executed to extract a small set of representative rows, which are provided to the second LLM
responsible for final SQL query generation. This design improves grounding and reduces semantic errors.
We evaluate our method on the Spider dataset, achieving 68% and 77% execution accuracy on Spider's
dev set using GPT-3.5-turbo and GPT-4o-mini respectively. Our approach demonstrates that
lightweight decomposition and execution-grounded prompting can yield robust performance, rivalling
more complex in-context methods like DIN-SQL and CoT-based pipelines.
1 Introduction
1.1 Text-to-SQL
Text-to-SQL is a Natural Language Processing (NLP) task that involves converting natural language queries
into executable SQL queries with the help of the database schema (Hong et al., 2025). This is
illustrated by Figure 1, where a user asks for the names of all students. The text-to-SQL system
uses the database schema to generate SQL that answers the user's question. This SQL query is then run on
the database, and the result is returned to the user.
Figure 1: Text-to-SQL Process
Table 1: Overview of existing methodologies in Text-to-SQL (Hong et al., 2025; Deng et al., 2022)

Approach               | Key Characteristics                                                                     | Limitations
Rule-based systems     | Manual heuristics for simple queries; limited scalability.                              | Fragile to schema changes; poor cross-domain generalization.
Deep learning models   | Sequence-to-sequence architectures (e.g., LSTMs) improved cross-domain generalization.  | Struggle with complex JOINs and nested queries.
Pre-trained LMs (PLMs) | BERT/RoBERTa fine-tuned for schema-aware SQL generation.                                | Require domain-specific training data; parameter inefficiency.
LLM-based methods      | In-context learning with GPT variants; reduced training costs.                          | Prone to hallucinations; limited by context windows.
Existing approaches either demand large amounts of training data and computational power or sacrifice accuracy for practicality. There is a pressing
need for a solution that can maintain high execution accuracy while remaining computationally efficient and
adaptable to new database schemas without extensive retraining or fine-tuning.
1.6 Objectives
This research aims to develop and evaluate a novel approach to text-to-SQL translation with the following
specific objectives:
• Create a two-stage LLM pipeline using few-shot prompting that eliminates the need for model fine-
tuning while maintaining high execution accuracy.
• Implement effective schema linking mechanisms to select only relevant tables and columns from database
schemas, reducing potential for hallucination and error.
• Develop a method for generating SQL preview queries that can extract representative data rows to
ground the final query generation.
• Evaluate the performance of the proposed approach on standard benchmarks, specifically the Spider
dataset, using both GPT-3.5-turbo and GPT-4o-mini models.
• Compare the execution accuracy of our approach with existing methods, particularly more complex
in-context methods like DIN-SQL and CoT-based pipelines.
• Demonstrate that lightweight decomposition and execution-grounded prompting can produce robust
performance without requiring extensive computational resources or specialized model training.
1.7 Scope
This research focuses specifically on developing and evaluating a two-stage LLM pipeline for text-to-SQL
translation using few-shot prompting without any model fine-tuning. The scope includes:
• Development of prompting strategies for two distinct phases: schema linking with preview query
generation, and final SQL query generation.
• Testing on the Spider dataset, a cross-domain text-to-SQL benchmark featuring 200 databases across
138 domains.
• Evaluation using execution accuracy metrics on Spider’s development set.
• Comparison with existing methods, particularly in-context learning approaches.
• Analysis of performance differences between GPT-3.5-turbo and GPT-4o-mini models when used
within our pipeline.
The research does not address cross-lingual text-to-SQL generation, handling of extremely complex database
schemas beyond those present in the Spider dataset, or application to specialized domains requiring expert
knowledge. Additionally, while our approach is designed to be adaptable to different LLMs, we limit our
testing to the specified models.
2 Literature Survey
2.1 Introduction
The exponential growth of structured data (projected to reach 181 zettabytes by 2025) and the ubiquity of
relational databases (55% of enterprises run SQL Server) have made text-to-SQL translation a critical
technology for data democratization (Gretel.ai, n.d.). Historically, database querying evolved from punch
cards in the 1940s to Edgar Codd's relational model in 1970, culminating in SQL, developed at IBM and later standardized.
Early text-to-SQL systems relied on rule-based heuristics, but their fragility to schema changes and limited
cross-domain generalization spurred neural approaches (Deng et al., 2022). The emergence of large language
models (LLMs) like GPT-4 has revolutionized this field. Unlike fine-tuned models requiring domain-specific
data, LLMs leverage in-context learning to adapt to new schemas dynamically (Zhu and Li, 2024). This
shift aligns with industry demands: 72% of organizations now prioritize natural language interfaces to reduce
dependency on technical teams (Gretel.ai, n.d.). Modern benchmarks like Spider (200 databases across 138
domains) highlight both the progress and the persistent challenges: current LLM-based methods achieve 68–85%
execution accuracy, but hallucinations and schema linking errors remain prevalent (Hong et al., 2025).
Few-shot examples embedded in the prompt improve accuracy and robustness, particularly for edge cases where schema nuances or ambiguous language could lead to
errors. The effectiveness of ICL in Text-to-SQL stems from its ability to provide contextual cues that
guide the model’s reasoning. For example, including table descriptions and column relationships in the
prompt helps disambiguate terms like “order” (which could refer to a table or a sorting operation). Research
highlights that ICL can achieve competitive performance compared to fine-tuned models, especially when
prompts are carefully designed to include diverse examples (Wei et al., 2023). This adaptability is crucial
for Text-to-SQL systems deployed in environments with frequently updated or unfamiliar databases, as it
minimizes the need for retraining while maintaining high accuracy.
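To make this concrete, the sketch below assembles a few-shot ICL prompt from a schema, worked examples, and a user question. It is a minimal illustration only; the schema, example content, and prompt wording are placeholders rather than the prompts of any cited system.

    def build_prompt(schema: str, examples: list[tuple[str, str]], question: str) -> str:
        """Assemble an ICL prompt: task instruction, schema, worked examples, question."""
        parts = ["Given the database schema, write a SQL query answering the question.",
                 "Schema:", schema, ""]
        for q, sql in examples:
            parts += [f"Question: {q}", f"SQL: {sql}", ""]
        parts += [f"Question: {question}", "SQL:"]
        return "\n".join(parts)

    # Illustrative usage with a toy schema and one worked example.
    prompt = build_prompt(
        schema="CREATE TABLE student (id INT, name TEXT);",
        examples=[("How many students are there?", "SELECT COUNT(*) FROM student;")],
        question="What are the names of all students?",
    )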
Multi-stage pipelines decompose the Text-to-SQL task in order to handle complex queries involving multiple tables, nested subqueries, or ambiguous language. These pipelines
often incorporate In-Context Learning (ICL), Few-Shot Prompting, and Chain-of-Thought (CoT) reasoning
to guide LLMs, reducing errors and improving generalization across diverse database schemas.
2.3.1 DIN-SQL
DIN-SQL (Decomposed In-Context Learning of Text-to-SQL) is a prominent multi-stage pipeline that
decomposes the Text-to-SQL task into four key stages: schema linking, query classification, SQL genera-
tion, and self-correction. The schema-linking stage identifies references to database elements and condition
values in the query, using embeddings to map natural language terms to schema components (Pourreza and
Rafiei, 2023). The classification stage categorizes queries into easy, non-nested complex, or nested complex,
enabling tailored prompts for each category. For complex queries, DIN-SQL introduces an intermediate
representation called NatSQL, which simplifies SQL syntax by removing operators without clear natural
language counterparts, easing the transition from query to code (Pourreza and Rafiei, 2023).
2.3.2 ACT-SQL
ACT-SQL (Augmented Chain-of-Thought SQL) is a multi-stage pipeline that enhances Text-to-SQL by
integrating dynamic few-shot prompting with CoT reasoning. In the first stage, ACT-SQL employs a schema-
linking module to identify relevant tables and columns in the user’s natural language query, using similarity-
based scoring to select contextually appropriate few-shot examples. These examples are dynamically chosen
based on their similarity to the input query and SQL structure, ensuring relevance (Zhang et al., 2023).
The second stage involves query classification, categorizing the query as easy, non-nested complex, or nested
complex, which informs the complexity of the subsequent SQL generation. The final stage generates the
SQL query using a CoT-guided prompt that includes intermediate reasoning steps, followed by a validation
step to ensure syntactic and semantic correctness (Zhang et al., 2023).
2.3.3 PET-SQL
PET-SQL (Prompt-Enhanced Text-to-SQL) is a multi-stage pipeline that leverages prompt enhancement and
learned query patterns to improve Text-to-SQL performance, particularly for complex and cross-domain queries. The first stage
involves schema linking, where PET-SQL uses a combination of neural embeddings and pattern-based rules
to map query terms to database elements, identifying tables, columns, and values with high precision (Li et
al., 2024). The second stage employs a query decomposition module that breaks down the natural language
query into semantic components (e.g., entities, conditions, aggregations), guided by learned patterns from a
curated training set. The final stage generates the SQL query using a pattern-enhanced LLM prompt that
incorporates CoT reasoning and few-shot examples tailored to the query’s semantic structure, followed by a
self-consistency check to refine the output (Li et al., 2024).
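As a rough illustration of the self-consistency idea, the sketch below executes several candidate SQL queries and keeps the one whose execution result is most common. This is a simplification of PET-SQL's cross-consistency mechanism, which votes across queries produced by different LLMs; the function name and signature are illustrative.

    import sqlite3
    from collections import Counter

    def most_consistent(db_path: str, candidate_sqls: list[str]) -> str:
        """Return the candidate whose execution result is most common."""
        results = {}
        for sql in candidate_sqls:
            try:
                with sqlite3.connect(db_path) as conn:
                    rows = conn.execute(sql).fetchall()
                results[sql] = tuple(sorted(map(repr, rows)))  # hashable fingerprint
            except sqlite3.Error:
                continue  # discard candidates that fail to execute
        if not results:
            return candidate_sqls[0]  # nothing executed; fall back to the first
        winner, _ = Counter(results.values()).most_common(1)[0]
        return next(sql for sql, res in results.items() if res == winner)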
However, Lei et al. (2020) caution that even state-of-the-art models like SLSQL exhibit 23% error
rates when schema linking fails, underscoring its pivotal role in overall system accuracy.
• Schema Hallucinations (Qu et al., 2024): errors that conflict with the database schema, including:
– Schema Contradictions: Incorrect SQL violating schema structure (30% of errors). Example:
Referencing sets.language instead of set_translations.language (Qu et al., 2024).
– Attribute Overanalysis: Introducing unnecessary columns (49% of errors). Example: Adding
height to SELECT when only player name is needed (Qu et al., 2024).
• Logic Hallucinations (25% frequency) (Qu et al., 2024): These include JOIN redundancies (15%) and
clause misuse (25%), such as erroneous GROUP BY in aggregation-free queries. DIN-SQL (Pourreza
and Rafiei, 2023), as analyzed by Qu et al. (2024), reduces such errors
by 37% through execution-grounded self-correction, validating intermediate queries against database
previews.
Evaluation inconsistencies further complicate accuracy metrics. The Spider dataset flags 12% of semantically
correct queries as incorrect due to alias mismatches (e.g., using table names instead of T1, T2) (lamini.ai,
n.d.). Worse, 5% of Spider's "gold" SQL contains errors, such as omitting JOIN conditions, which skew
benchmarking (lamini.ai, n.d.). RSL-SQL (Cao et al., 2024) addresses this by implementing multi-turn
self-correction, iteratively refining SQL using execution feedback to achieve 87.9% accuracy on Spider with
GPT-4o.
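A minimal sketch of such an execution-feedback loop follows. It regenerates the SQL with the error message appended to the prompt until the query executes or a turn budget is exhausted; llm_generate is a hypothetical wrapper around an LLM call, and the actual RSL-SQL procedure is considerably more elaborate.

    import sqlite3

    def self_correct(question: str, schema: str, db_path: str,
                     llm_generate, max_turns: int = 3) -> str:
        """Regenerate SQL with execution feedback until it runs or the budget is spent."""
        sql = llm_generate(question, schema, feedback="")
        for _ in range(max_turns):
            try:
                with sqlite3.connect(db_path) as conn:
                    conn.execute(sql).fetchmany(5)  # probe: does the query execute?
                return sql  # executes cleanly; accept it
            except sqlite3.Error as err:
                feedback = f"Previous SQL: {sql}\nExecution error: {err}"
                sql = llm_generate(question, schema, feedback=feedback)
        return sql  # best effort after the final turn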
These limitations underscore the need for lightweight, execution-grounded approaches that balance modularity
with computational efficiency, a gap our proposed methodology aims to address.
3 Methodology
3.1 Introduction
The goal of this research is to develop a robust, lightweight text-to-SQL system that enables non-technical
users to query relational databases using natural language, without the need for model fine-tuning or ex-
tensive domain-specific data. This chapter details the methodology adopted to address the core challenges
identified in the literature, namely schema linking, SQL generation accuracy, and hallucination mitigation, by
leveraging recent advances in large language models (LLMs) and modular pipeline design. Building on the
limitations of prior single-stage and monolithic approaches, our method employs a two-stage LLM pipeline
inspired by modular frameworks such as DIN-SQL (Pourreza and Rafiei, 2023) and execution-grounded
prompting strategies. The first stage focuses on schema linking and preview query generation, while the
second stage utilizes the extracted schema context and representative data to guide accurate SQL synthesis.
The entire process is designed to operate in a few-shot in-context learning (ICL) setting, minimizing resource
requirements and maximizing adaptability to new domains. Evaluation is conducted on the widely adopted
Spider benchmark (Yu et al., 2018), which is recognized for its cross-domain complexity and has become the
standard for assessing text-to-SQL systems. Key performance metrics include execution accuracy and exact-
set-match, in line with established evaluation protocols. By combining modular decomposition, execution
validation, and carefully engineered prompting, this methodology aims to bridge the gap between natural
language and structured data access in a scalable and practical manner.
3.2 Architecture
The Text-to-SQL system consists of two stages, comprising a Schema Linking LLM and a Code Generation
LLM. This architecture divides the core tasks of Text-to-SQL reasoning between the models, improving both
grounding and interpretability. Each model is provided the required context and few-shot examples to ensure
consistency in its answers. Additionally, the Code Generation model is provided a preview of the relevant schema,
allowing it to ground row-level selections when generating WHERE or HAVING clauses; for instance, a preview
revealing that a country column stores 'France' rather than 'FR' steers the model away from an ungrounded
predicate such as WHERE country = 'FR'.
Figure 3: Preview of examples given to Schema Linking model.
Both models in the pipeline are provided three relevant examples as part of the system prompt. These examples
are picked from a pool of 7,000 examples, based on the cosine similarity of their embeddings; embeddings
are generated at runtime.
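A minimal sketch of this selection step is given below, assuming a pre-embedded example pool and LangChain's OpenAIEmbeddings wrapper (with an OpenAI API key configured in the environment); the pool structure is illustrative.

    import numpy as np
    from langchain_openai import OpenAIEmbeddings

    embedder = OpenAIEmbeddings()  # requires an OpenAI API key in the environment

    def top_k_examples(question: str, pool: list[dict], k: int = 3) -> list[dict]:
        """pool items look like {'question': str, 'sql': str, 'embedding': np.ndarray}."""
        q = np.array(embedder.embed_query(question))
        def cosine(v: np.ndarray) -> float:
            return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        return sorted(pool, key=lambda ex: cosine(ex["embedding"]), reverse=True)[:k]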
Prompt for Schema Linking Model: The model is provided the user question, the database schema, and
examples, and is asked to generate SQL SELECT statements that select the first 5 rows of the relevant tables
and columns. Each example contains a user question, a database schema, and the target SELECT statement,
showing the LLM the expected structure of the output. If multiple tables are relevant, separate SQL statements
are generated for SELECTing from each, without performing JOINs.
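The generated preview statements are then executed against the database and the returned rows are rendered as text for the second-stage prompt. The sketch below, assuming a SQLite database as used in Spider, shows one way to do this; the formatting is illustrative.

    import sqlite3

    def run_previews(db_path: str, preview_sqls: list[str]) -> str:
        """Execute each preview SELECT and render its rows as text for the next prompt."""
        blocks = []
        with sqlite3.connect(db_path) as conn:
            for sql in preview_sqls:
                cur = conn.execute(sql)
                cols = [d[0] for d in cur.description]
                rows = cur.fetchmany(5)  # the statements already carry LIMIT 5
                lines = [" | ".join(cols)] + [" | ".join(map(str, r)) for r in rows]
                blocks.append(sql + "\n" + "\n".join(lines))
        return "\n\n".join(blocks)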
Prompt for Code Generation Model: The model is provided the user question, the database schema, a
preview of the relevant schema, and examples, and is asked to generate the final SQL statement that answers
the user question. Each example contains a question, a database schema, a preview of the relevant schema,
and the target SQL statement, showing the LLM the expected structure of the output.
4. The user question and database schema are passed to the Schema Linking LLM.
5. The Schema Linking LLM provides SELECT statements selecting 5 rows of the relevant schema.
6. The SELECT statements are run on the database to generate a preview.
8. The user question, database schema, and database preview of the relevant schema are passed to the Code
Generation LLM.
9. The Code Generation LLM generates the final SQL query.
10. The query is run on the database and the result is collected (a minimal end-to-end sketch of steps 4–10 follows).
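The sketch below wires steps 4 through 10 together using LangChain's expression language. The prompt wording and variable names are illustrative rather than the exact prompts used in this work, and it reuses the run_previews helper sketched above.

    from langchain_openai import ChatOpenAI
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    # Stage 1: schema linking, emitting preview SELECT statements.
    schema_linker = ChatPromptTemplate.from_messages([
        ("system", "For the question, emit one 'SELECT ... LIMIT 5' statement per "
                   "relevant table, without JOINs. Examples:\n{examples}"),
        ("human", "Schema:\n{schema}\n\nQuestion: {question}"),
    ]) | llm | StrOutputParser()

    # Stage 2: final SQL generation, grounded by the preview rows.
    sql_generator = ChatPromptTemplate.from_messages([
        ("system", "Write one SQL query that answers the question. Examples:\n{examples}"),
        ("human", "Schema:\n{schema}\n\nPreview of relevant rows:\n{preview}\n\n"
                  "Question: {question}"),
    ]) | llm | StrOutputParser()

    def answer(question: str, schema: str, db_path: str,
               link_examples: str, gen_examples: str) -> str:
        # Steps 4-6: link the schema, then execute the previews (run_previews above).
        preview_sql = schema_linker.invoke(
            {"examples": link_examples, "schema": schema, "question": question})
        statements = [s.strip() for s in preview_sql.split(";") if s.strip()]  # naive split
        preview_text = run_previews(db_path, statements)
        # Steps 8-10: grounded generation; the caller runs the returned query.
        return sql_generator.invoke(
            {"examples": gen_examples, "schema": schema,
             "preview": preview_text, "question": question})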
3.3 Setup
The LangChain (Chase, n.d.) library has been used for building and running inference on LLM chains. LangChain
is an open-source library for working with LLMs; its provider connectors were used to prototype and test
hypotheses with multiple models.
LLMs: OpenAI’s gpt-3.5-turbo (OpenAI, 2024), and gpt-4o-mini (OpenAI, 2023) have been used for LLM
generation, owing to their low cost when working with large amounts of tokens.
3.4 Testing
Spider-dev, containing 1,034 examples, has been used for evaluation; the Spider test set is not publicly available.
3.4.1 Method
OpenAI’s gpt-3.5-turbo and gpt-4o-mini have been used for evaluation purpose. They have been run in two
separate ways:
• With chain: The model is called twice, following the two-stage prompting method described above.
• Without chain: The model is called once in a zero-shot manner.
The latter is used as a baseline, allowing us to understand the benefits of the suggested approach.
3.4.2 Metrics
Spider’s repository provides an evaluation script, along with databases. The evaluation script takes in two
files; pred.sql and gold.sql. It will then provide the following metrics:
• EX: Execution Accuracy measures whether the predicted SQL query returns the same result as the
gold query when executed.
• EM: Exact Match checks if the predicted SQL query matches the gold query exactly in structure and
content.
• PM: Partial Match evaluates overlap in SQL components (SELECT, WHERE, JOIN, etc.) between predicted
and gold queries.
Additionally, it breaks down each metric by question difficulty: Easy, Medium, Hard, and
Extra Hard. We consider all of the provided metrics to judge the efficacy of our methods.
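As an illustration, the sketch below performs a simplified execution-accuracy check in the spirit of the official script: both queries are executed on the SQLite database and their result sets are compared, order-insensitively unless the gold query orders its rows. The official evaluation.py in the Spider repository additionally computes EM and the partial-match breakdown.

    import sqlite3
    from collections import Counter

    def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
        """Simplified EX check: do predicted and gold queries return the same rows?"""
        with sqlite3.connect(db_path) as conn:
            try:
                pred = conn.execute(pred_sql).fetchall()
            except sqlite3.Error:
                return False  # non-executable predictions count as incorrect
            gold = conn.execute(gold_sql).fetchall()
        if "order by" in gold_sql.lower():
            return pred == gold  # row order matters only when the gold query sorts
        return Counter(pred) == Counter(gold)  # otherwise compare as multisets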
4 Results and discussion
After running the pipelines on Spider-dev, we obtained the results presented in Tables 3 through 9.
Table 5: GPT-4o-mini with Proposed Approach

Metric               | Easy  | Medium | Hard  | Extra | All   | Joint All
Count                | 250   | 440    | 174   | 170   | 1034  | 1
Execution Accuracy   | 0.896 | 0.852  | 0.684 | 0.488 | 0.775 | 0.000
Exact Match Accuracy | 0.824 | 0.541  | 0.374 | 0.106 | 0.510 | 0.000
Partial Matching Accuracy Details:
Select               | 0.915 | 0.920  | 0.939 | 0.796 | 0.905 | 0.000
Select (no AGG)      | 0.932 | 0.926  | 0.954 | 0.823 | 0.918 | 0.000
Where                | 0.920 | 0.589  | 0.456 | 0.189 | 0.569 | 0.000
Where (no OP)        | 0.950 | 0.606  | 0.578 | 0.392 | 0.642 | 0.000
Group (no Having)    | 0.857 | 0.882  | 0.826 | 0.605 | 0.794 | 0.000
Group                | 0.857 | 0.676  | 0.783 | 0.372 | 0.632 | 0.000
Order                | 0.783 | 0.881  | 1.000 | 0.690 | 0.837 | 0.000
And/Or               | 1.000 | 0.948  | 0.952 | 0.942 | 0.960 | 0.000
IUEN                 | 0.000 | 0.000  | 0.615 | 0.000 | 0.571 | 0.000
Keywords             | 0.917 | 0.839  | 0.646 | 0.625 | 0.784 | 0.000
Table 8: Exact Match Accuracy (EM) Summary

Method                                      | Easy  | Medium | Hard  | Extra | Overall
Baseline GPT-3.5-turbo                      | 0.468 | 0.316  | 0.264 | 0.041 | 0.299
Our Method (GPT-3.5-turbo)                  | 0.744 | 0.443  | 0.276 | 0.059 | 0.425
Baseline GPT-4o-mini                        | 0.336 | 0.327  | 0.184 | 0.053 | 0.260
Our Method (GPT-4o-mini)                    | 0.824 | 0.541  | 0.374 | 0.106 | 0.510
DIN-SQL (GPT-4) (Pourreza and Rafiei, 2023) | -     | -      | -     | -     | N/A (EM not reported)
RSL-SQL (GPT-4o) (Cao et al., 2024)         | -     | -      | -     | -     | N/A (EM not reported)
Our method with GPT-3.5-turbo achieved an overall exact match accuracy of 0.425 against a baseline of 0.299, while with GPT-4o-mini it achieved an overall
exact match accuracy of 0.510, substantially outperforming its baseline of 0.260. This indicates that the
proposed pipeline significantly enhances the structural correctness of the generated SQL queries compared
to the zero-shot baseline for both models. The improvement is particularly stark for GPT-4o-mini, where
exact match accuracy nearly doubled. While these exact match scores are lower than the execution accuracy
scores, the improvement over baseline is a positive indicator for the structural quality of SQL generated by
our method.
On partial matching, our method with GPT-3.5-turbo achieved an overall accuracy of 0.718, a 12.4% drop from the baseline (0.820). With GPT-4o-mini, our method scored 0.802, a
2.2% decrease from the baseline (0.820). These results indicate that our method performs better on partial
matches than exact matches, as minor syntactic variations are tolerated. For GPT-3.5-turbo, the largest
performance gaps occur on hard (0.550 vs. 0.732, a 24.9% drop) and extra-hard queries (0.460 vs. 0.610, a
24.6% drop), consistent with trends in execution accuracy. GPT-4o-mini narrows this gap, particularly on
medium queries (0.876 vs. 0.860, a 1.9% improvement) and hard queries (0.710 vs. 0.738, a 3.8% drop).
5 Future Scope
The promising results of this two-stage LLM pipeline for text-to-SQL translation open several exciting
avenues for future exploration, particularly in enhancing its capability to handle complex queries and broad-
ening its applicability across diverse domains. One key area of focus is the integration of advanced reasoning
models, which were not tested in this study. Models specifically designed for logical reasoning, such as those
optimized for step-by-step problem-solving, could significantly improve the pipeline’s performance on hard
and extra-hard queries, where nested subqueries and intricate joins pose challenges. Additionally, incor-
porating Chain-of-Thought (CoT) prompting, which was not implemented in the current pipeline, offers a
powerful opportunity to enhance query accuracy. By guiding the LLM to explicitly articulate intermediate
reasoning steps, such as mapping query intent to schema elements and selecting appropriate SQL constructs,
CoT could reduce errors like incorrect column selections and missing conditions, bridging the performance
gap with state-of-the-art methods like DIN-SQL and RSL-SQL.

Further refinements to the pipeline's schema
linking mechanism present another critical direction for future work. The current approach relies on few-
shot prompting with cosine similarity-based example selection, which occasionally misses functionally diverse
cases, leading to suboptimal schema recall. Exploring bidirectional pruning techniques, as demonstrated in
RSL-SQL, or relation-aware attention mechanisms, as seen in RAT-SQL, could enhance the pipeline’s ability
to handle complex, multi-hop relationships in large-scale databases. Additionally, dynamic example selection
strategies that prioritize diversity in query patterns and schema structures could mitigate biases and im-
prove generalization across domains.

Beyond technical enhancements, future research could explore hybrid
approaches that combine the pipeline’s lightweight design with minimal fine-tuning to optimize performance
without sacrificing efficiency. For instance, fine-tuning the embedding model or selectively fine-tuning the
LLM on domain-specific subsets could improve schema linking precision while maintaining the benefits of
few-shot prompting. Another promising direction is the adaptation of the pipeline for cross-lingual text-to-
SQL tasks, enabling non-English queries to be processed effectively, thus broadening its global applicability.
By integrating reasoning models, CoT prompting, and advanced schema linking, and by testing on diverse
datasets, this pipeline can evolve into a highly versatile and efficient solution, further advancing data de-
mocratization and empowering users to seamlessly interact with structured data across industries.
6 Conclusion
This research has successfully pioneered a lightweight, two-stage LLM pipeline for text-to-SQL transla-
tion, harnessing the power of few-shot prompting to deliver robust performance without the heavy resource
demands of model fine-tuning. By drawing inspiration from advanced frameworks like DIN-SQL and in-
tegrating innovative execution-grounded prompting, this approach has made significant strides in tackling
critical challenges such as schema linking, query accuracy, and hallucination reduction. Evaluated on the
esteemed Spider dataset, spanning 200 databases across 138 domains, the pipeline leveraged GPT-3.5-turbo
and GPT-4o-mini to showcase its promising capabilities, advancing the mission of data democratization.

The results are encouraging, with the pipeline achieving an overall execution accuracy of 0.682 with GPT-
3.5-turbo and an impressive 0.775 with GPT-4o-mini, compared to baseline execution accuracies of 0.788 for
both models. Notably, the pipeline excelled on medium-difficulty queries, particularly with GPT-4o-mini,
where it surpassed the baseline by 2.7% (0.852 vs. 0.830) in execution accuracy, demonstrating its strength
in handling moderately complex tasks. While performance on hard and extra-hard queries showed room
for growth, the pipeline’s ability to generate functionally correct queries was evident in its partial match
accuracy scores (0.718 for GPT-3.5-turbo, 0.802 for GPT-4o-mini). Compared to state-of-the-art methods
like DIN-SQL and RSL-SQL, which achieve 0.879 execution accuracy with GPT-4/GPT-4o, our pipeline
holds its own as a resource-efficient alternative. Its lightweight design, bypassing the need for extensive
computational resources or domain-specific data, makes it well suited to real-world applications
where adaptability and speed are paramount. The use of preview query generation to anchor
schema linking proved highly effective in curbing hallucinations, paving the way for more reliable query syn-
thesis. The pipeline’s modular architecture and few-shot prompting strategy empower non-technical users
to query relational databases directly, unlocking data-driven insights.

Looking ahead, several opportunities exist to refine schema linking
through techniques like bidirectional pruning or relation-aware attention, inspired by RSL-SQL and RAT-
SQL, to boost performance on complex queries. Dynamic example selection with greater diversity could
further elevate prompt effectiveness, while hybrid approaches blending lightweight prompting with targeted
fine-tuning may unlock even higher accuracy without sacrificing efficiency. This research marks a significant
step toward making structured data accessible to all, showing that a lean, few-shot LLM pipeline can deliver
strong text-to-SQL performance. By enabling users to interact with databases in natural language, this work
advances data democratization, fostering innovation and decision-making across industries.
References
Mitsopoulou, A., & Koutrika, G. (2025, February 11). Analysis of Text-to-SQL Benchmarks: Limitations, Challenges. Retrieved
from https://openproceedings.org/2025/conf/edbt/paper-41.pdf
Bailin Wang, R. S. (2020). RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7567–7578). Online:
Association for Computational Linguistics.
Bourne, J. (2019, February 12). Relational databases ‘here to stay’ in the enterprise amid greater
cloud and data complexity. Retrieved from CloudTech: https://www.cloudcomputing-news.net/news/
relational-databases-here-stay-enterprise-amid-greater-cloud-and-data-complexity/
Zhenbiao Cao, Y. Z. (2024, November 26). RSL-SQL: Robust Schema Linking in Text-to-SQL Generation.
arXiv:2411.00073. Retrieved from https://arxiv.org/abs/2411.00073
Chase, H. (n.d.). Langchain. Retrieved from https://github.com/langchain-ai/langchain
Naihao Deng, Y. C. (2022). Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect.
In Proceedings of the 29th International Conference on Computational Linguistics (p. 22). Gyeongju, Republic of
Korea: Association for Computational Linguistics.
Dawei Gao, H. W. (2023, August 29). Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation.
arXiv:2308.15363. Retrieved from https://arxiv.org/abs/2308.15363
Gottumukkala, A. M. (2025, February 4). From Natural Language to SQL: Review of LLM-based Text-to-SQL Systems.
arXiv:2410.01066v2. Retrieved from https://arxiv.org/pdf/2410.01066v2
Zijin Hong, Z. Y. (2025, March 13). Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL.
arXiv:2406.08426v1. Retrieved from https://arxiv.org/pdf/2406.08426v1
Bartley, K. (n.d.). Big data statistics: How much data is there in the world? Retrieved from Rivery: https://rivery.io/blog/big-data-statistics-how-much-data-is-there-in-the-world/
Mayank Kothyari, D. D. (2023). CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL. In
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 14054–14066).
Singapore: Association for Computational Linguistics.
lamini.ai. (n.d.). How a Fortune 500 slashed hallucinations to create 94.7% accurate LLM agents for SQL. Retrieved
from https://www.lamini.ai/blog/llm-text-to-sql
Wenqiang Lei, W. W.-Y.-S. (2020). Re-examining the Role of Schema Linking in Text-to-SQL. In Proceedings of
the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 6943–6954). Online:
Association for Computational Linguistics.
Haoyang Li, J. Z. (2023, April 10). RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL.
arXiv:2302.05965. Retrieved from https://arxiv.org/pdf/2302.05965
Zhishuai Li, X. W. (2024, June 2). PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with
Cross-consistency. arXiv:2403.09732. Retrieved from https://arxiv.org/pdf/2403.09732
OpenAI. (2023, March 1). Introducing ChatGPT and Whisper APIs. Retrieved from https://openai.com/blog/
introducing-chatgpt-and-whisper-apis
OpenAI. (2024, July 18). OpenAI unveils cheaper small AI model GPT-4o mini.
Reuters. Retrieved from https://www.reuters.com/technology/artificial-intelligence/
openai-unveils-cheaper-small-ai-model-gpt-4o-mini-2024-07-18/
Mohammadreza Pourreza, D. R. (2023, April 21). DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with
Self-Correction. arXiv:2304.11015. Retrieved from https://arxiv.org/pdf/2304.11015
Ge Qu, J. L. (2024). Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in
Text-to-SQL Generation. In Findings of the Association for Computational Linguistics: ACL 2024 (pp. 5456–5471).
Bangkok: Association for Computational Linguistics.
Cedric Renggli, I. F. (2025, January 30). Fundamental Challenges in Evaluating Text2SQL Solutions and Detecting
Their Limitations. arXiv:2501.18197v1. Retrieved from https://arxiv.org/pdf/2501.18197v1
Sandwell, D. (2021, June 3). What is data democratization and why is it important? Retrieved from Quest Blog:
https://blog.quest.com/why-data-democratization-why-now-what-does-it-look-like/
Chang-Yu Tai, Z. C. (2023). Exploring Chain of Thought Style Prompting for Text-to-SQL. In Proceedings of the
2023 Conference on Empirical Methods in Natural Language Processing (pp. 89-102). Singapore: Association for
Computational Linguistics.
Liu, X., & Tan, Z. (2023). Divide and Prompt: Chain of Thought Prompting for Text-to-SQL. arXiv:2304.11556. Retrieved
from https://arxiv.org/abs/2304.11556
Rishit Toteja, A. S. (2025). In-Context Reinforcement Learning with Retrieval-Augmented Generation for Text-to-
SQL. Proceedings of the 31st International Conference on Computational Linguistics (p. 8). Abu Dhabi, UAE:
Association for Computational Linguistics. Retrieved from aclanthology.org.
Jason Wei, X. W. (2023, January 10). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
arXiv:2201.11903. Retrieved from https://arxiv.org/pdf/2201.11903
Gretel.ai. (n.d.). What is Text-to-SQL? Retrieved from https://gretel.ai/technical-glossary/
what-is-text-to-sql
Sun Yang, Q. S. (2024). SQL-to-Schema Enhances Schema Linking in Text-to-SQL. In International Conference on
Database and Expert Systems Applications (pp. 139-145). Cham: Springer.
Tao Yu, R. Z. (2018). Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic
Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing (pp. 3911–3921). Brussels, Belgium: Association for Computational Linguistics.
Hanchong Zhang, R. C. (2023, October 26). ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-
Generated Chain-of-Thought. arXiv:2310.17342. Retrieved from https://arxiv.org/pdf/2310.17342
Zhuosheng Zhang, A. Z. (2022, October 7). Automatic Chain of Thought Prompting in Large Language Models.
arXiv:2210.03493. Retrieved from https://arxiv.org/pdf/2210.03493
Xiaohu Zhu, Q. L. (2024, October 8). Large Language Model Enhanced Text-to-SQL. arXiv:2410.06011v1. Retrieved
from https://arxiv.org/pdf/2410.06011v1