SEEK-SQL: Self-Optimizing Agent for
Knowledge-Enhanced Text-to-SQL in Real-World
Scenario
Anonymous Submission
No Institute Given
Abstract. Deploying modern Text-to-SQL methods in real-world envi-
ronments faces two challenges: knowledge-gap errors caused by a database
environment lacking critical knowledge and the requirement for rapid
adaptability to unfamiliar environments. To meet these challenges, we
propose SEEK-SQL, a novel knowledge-enhanced self-optimizing multi-agent
framework composed of four agents: a main Manager for planning and
decomposition, and three auxiliary agents: Selector for retrieving the
relevant database schema, Generator for SQL statement generation, and
Sniffer, which for the first time introduces a knowledge retrieval module
into the Text-to-SQL framework. We also propose a new contrastive
self-refinement strategy that not only fixes bugs but also repairs system
defects through batch-wise contrastive self-reflection, enhancing our
framework's generalization. Overall, SEEK-SQL achieves new state-of-the-art
performance among in-context-learning methods on two datasets, Spider and
BIRD, with much lower average token consumption than other multi-agent
frameworks. These results validate our framework's efficacy in addressing
real-world challenges and prepare it for deployment in real environments.
Keywords: Text-to-SQL · Multi-agent Framework.
1 Introduction
The objective of the Text-to-SQL task is to facilitate the automatic translation
of users' natural language inquiries into SQL queries given a database schema [31].
This technology liberates users from the necessity of SQL expertise, facilitating
their interaction with intricate database systems and thereby enabling them to
uncover significant insights, conduct efficient data analysis, reach well-founded
conclusions, produce reports grounded in data, and extract superior features for
machine learning purposes [19, 28]. Moreover, Text-to-SQL systems are instru-
mental in automating sophisticated data analytics and driving conversational
agents, thereby extending their utility beyond the confines of conventional data
retrieval [17]. With the relentless expansion of data, the capacity to efficiently
query databases without profound SQL knowledge is becoming ever more crucial
for a diverse array of applications.
[Figure: the user asks "Which countries' channels are playing some animation by todd casey?". The system's SQL with a knowledge-gap error filters on Directed_by = 'todd caset', while the gold SQL filters on Written_by = 'Todd Casey'. The missing knowledge is that Todd Casey is a famous cartoon director. The database contains the tables TV_Channel, TV_series, and Cartoon.]
Fig. 1: A real-world example of the Text-to-SQL task and a knowledge-gap error
Unlike previous works that use single-step or multi-step strategies with a single
LLM agent [5, 15], recent works explore multi-agent frameworks to generate SQL
statements and refine detected errors [20, 2]. Because generating a completely
correct answer in a single turn is difficult, two straightforward approaches are
usually considered: using an auxiliary agent with a specially fine-tuned LLM to
fix SQL errors [2, 22, 20], and generating top-k candidates and then choosing the
right one [22, 14]. However, when trying these approaches in a real-world
environment, we find two challenges that prevent their use. First, most SQL
refinement methods only search for errors based on the SQL answer and the table
contents, assuming that the database environment is correct and complete. However,
in real environments, and even in some datasets [8], the environment usually does
not provide all the necessary information. This missing information may lead
to incorrect SQL answers, which we call knowledge-gap errors. Such errors
may account for up to 40% of total errors, as analyzed in Section 3.1, so how to
find and analyze external knowledge dynamically has become a key challenge
for real-world applications of Text-to-SQL. Second, using fine-tuned models to
correct errors or generating multiple candidates at a time exposes new drawbacks
in real environments. Fine-tuned models do not perform well in unfamiliar
environments, and frequently fine-tuning models in complex and frequently changing
production environments wastes computing power and is very inefficient. At the
same time, existing algorithms only correct SQL errors and do not learn from
experience, so the same error occurs multiple times. Generating multiple candidates
at a time also consumes more tokens in the reasoning phase, which greatly increases
operating costs and reduces robustness in high-concurrency scenarios. Therefore,
how to build an efficient,
agile, and highly versatile Text-to-SQL system has become another challenge in
the real environment.
Building on the real-world challenges mentioned above, we propose SEEK-
SQL, a knowledge-enhanced self-optimizing multi-agent framework for Text-to-
SQL tasks. To address knowledge-gap errors, we are the first to introduce a knowledge
retrieval module into the Text-to-SQL framework. We propose a new multi-agent
framework that includes a main agent, Manager, responsible for problem
decomposition and task allocation, and three auxiliary agents: Selector,
which is responsible for schema extraction; Generator, which is responsible for
code generation; and Sniffer, a knowledge-enhanced agent with the ability to
retrieve and integrate external, local, and internal knowledge. In our framework,
Manager decomposes complex queries into a series of sub-queries and uses internal
reasoning to dynamically call the auxiliary agents to retrieve knowledge, extract
sub-schemas, and generate SQL when solving each sub-query. Moreover, to extend
self-correction from SQL errors to defects of the system itself, we propose a new
self-correction algorithm. Unlike previous correction methods, we let Generator and
Manager correct errors in the current answer from back to front, i.e., in the
reverse order of generation. After the repair is completed, we adopt a batch-wise
self-reflection mechanism in which Manager reflects over batches of repair histories
and generates guidelines, so that the system learns from its error history and
adapts quickly to unfamiliar environments.
We present comprehensive evaluations on the efficacy of SEEK-SQL on two
datasets [29, 8] and two backbone LLMs [9, 13]. Experiments show that, compared
with other ICL-based single-agent and multi-agent frameworks, our method achieves
a new SOTA on both datasets while consuming far fewer tokens than multi-agent
frameworks, demonstrating its efficacy on real-world challenges. Our
contribution can be summarized as follows:
– We define and highlight the key challenges in real-world Text-to-SQL scenarios.
To address the problem of knowledge-gap errors, we are the first to introduce
the knowledge retrieval task into the Text-to-SQL framework and propose a
knowledge-enhanced framework, SEEK-SQL.
– We propose a new self-refinement strategy that extends self-correction from SQL errors to
system defects. This strategy enables our framework’s self-optimization and
robustness in unfamiliar environments.
– We evaluate our framework on two datasets and two backbone LLMs. Results
demonstrate our method's efficacy with better overall performance and token
efficiency. All code, data, and prompts are available at an anonymous GitHub
repository: https://anonymous.4open.science/r/BC3F.
2 Related Works
Large language models (LLM) have driven significant progress in the Text-to-
SQL task, evolving from prompt engineering to multi-stage frameworks and,
more recently, multi-agent collaboration. Early work focused on leveraging high-
quality prompts to harness LLMs’ potential, as demonstrated by ACT-SQL
[30] and QDecomp [18], enhancing reasoning by incorporating chain-of-thought.
DAIL-SQL [5] further refined prompt design with systematic engineering. As
research progressed, multi-stage frameworks such as DIN-SQL [15] and DEA-
SQL [27] introduced task decomposition, applying tailored prompts to subtasks
while integrating error correction. This trend continued with sophisticated frame-
works like C3-SQL [4] and StructGPT [6], which further structured the process
by incorporating database simplification, SQL generation, and verification into a
zero-shot learning pipeline.
More recently, works such as MAC-SQL [20] and SQLFixAgent [2] have
emerged, integrating multi-stage generation and self-correction modules to im-
prove efficiency and performance. Following the approach of previous work, our
work extends the multi-agent framework by incorporating unstructured knowl-
edge retrieval for the first time, enabling the combined retrieval of structured and
unstructured knowledge bases. Additionally, we have refined the self-correction
mechanism to align SQL fixing with the system’s self-evolution, thereby improv-
ing generalization performance.
3 Preliminaries
3.1 Definition of Knowledge-gap Errors
Fig. 2: Comparison of error counts before and after external knowledge is given
Previous studies on SQL-fixing tasks have focused on defining SQL errors
based on their superficial manifestations rather than the deep-seated causes that
lead to the errors, such as mismatch errors [22] or semantic errors [2]. Our ex-
perience in practice has revealed that a significant portion of such errors stem
from a loss of essential knowledge. Fig. 1 shows an example from [22]: the system
first gives a wrong answer because it does not know the identity of Todd Casey.
When provided with the external knowledge 'Todd Casey is a famous cartoon director',
the system can revise the SQL into the correct answer. We therefore hypothesize the
existence of "knowledge-gap" errors, defined as wrong SQL statements caused by the
absence of certain essential knowledge.
To confirm the existence of such errors, we sampled 100 error cases generated
by GPT-4o [13] from each of two widely used benchmarks, Spider [29] and BIRD [8].
After adding additional useful information to each query, we asked GPT-4o to
regenerate the SQL. Figure 2 shows the result: nearly 40% of the errors are fixed,
supporting the hypothesis of knowledge-gap errors. Under
this premise, we focus on the research problem of "how to retrieve and introduce
external knowledge into real-world Text-to-SQL generation".
3.2 Problem Definition
Given a triple X = (Q, S, D), where Q, S, and D are the natural language question,
the database schema, and the external knowledge corpus, the database schema S is
defined as {T, C}, where T represents the tables {T1, T2, ..., T|T|} and C represents
the columns {C1, C2, ..., C|C|}. The goal of the Text-to-SQL task is to generate the
correct SQL query Y corresponding to the question Q, based on the database schema S
and the external knowledge K retrieved from the knowledge corpus D.
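To make the notation concrete, the following sketch (illustrative Python, not part of the released code; all names are assumptions) represents the task inputs defined above:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DatabaseSchema:
    # S = {T, C}: table names mapped to their column names
    tables: Dict[str, List[str]] = field(default_factory=dict)

@dataclass
class TextToSQLInstance:
    question: str                 # Q: natural language question
    schema: DatabaseSchema        # S: database schema
    knowledge_corpus: List[str]   # D: external knowledge corpus (documents)

# The target is a SQL string Y answering `question` over `schema`,
# optionally using knowledge K retrieved from `knowledge_corpus`.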
4 Methodology
4.1 Framework of SEEK-SQL
[Figure: Manager receives the user query ("Which countries' channels are playing some animation by todd casey?") and decomposes it into sub-queries (SubQ1: filter cartoons by todd casey; SubQ2: filter channels playing the given cartoons). For each sub-query, Sniffer retrieves knowledge (e.g., "todd casey is a famous cartoon director"), Selector extracts a sub-schema, and Generator produces the sub-SQL. The gathered final SQL is repaired, when needed, by the contrastive self-refinement strategy (Algorithm 2).]
Fig. 3: The overall structure of SEEK-SQL
To address the problem of knowledge-gap errors, we propose SEEK-SQL, a
novel multi-agent collaborative framework that innovatively integrates external
knowledge sources into the Text-to-SQL process. SEEK-SQL comprises four agents
for SQL generation: a Selector for database schema extraction, a Sniffer for
external knowledge retrieval, a Generator for SQL statement generation, and a
core Manager for task decomposition and planning. Unlike previous Text-to-
SQL works [1, 20, 26], we use a self-correction and self-evolution strategy instead
of a separate SQL-refining agent for error refinement, as described in Section
4.2. Algorithm 1 shows the generation process in SEEK-SQL, and the detailed
introduction of agents is presented below.
Algorithm 1: SQL Generation Process of SEEK-SQL
Input: Query q; Database S; Knowledge Corpus D
Output: SQL answer SQL
1 subQs = Manager.decompose(q, S);
2 SQLs = [];
3 for subQ in subQs do
4 k = Sniffer.search(subQ, S, D);
5 subS = Selector.select(subQ, S, k);
6 subSQL = Generator.generate(subQ, subS, k);
7 subSQL = Generator.selfCheck(subSQL);
8 SQLs.append(subSQL)
9 end
10 sql = SQLs.gather();
11 ok, err = Execute(sql, S);
12 if ok then
13 return sql
14 else
15 sql = SelfRefine(err, sql); // self-refinement in Algorithm 2
16 return sql
17 end
Selector is responsible for extracting the minimum sub-schema needed to solve
the problem from the overall database schema. Given the input triple X =
(Q, S, D) from Manager and the external knowledge K retrieved from D, the
function of Selector can be described as follows:
S′ = fSelector{Q, S, K}    (1)
where S ′ is one sub-schema extracted from database schema S.
We utilize Agent Selector with motivations similar to [20]: first, introducing
excess irrelevant schema items in prompts increases the LLM's likelihood of
generating extraneous SQL elements, leading to potential errors; second, using
the entire database schema can cause an excessively long context, raising API
costs and possibly exceeding the LLM's context length limit. As shown in Algorithm
1, Selector dynamically extracts a different sub-schema for each sub-query
decomposed by Manager, which further minimizes the scale of each sub-schema.
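As an illustration of Equation 1, the sketch below shows one way the Selector step could be realized; the prompt wording and the call_llm helper are assumptions, not the authors' exact implementation.

import json

def select_sub_schema(call_llm, sub_query: str, schema: dict, knowledge: str) -> dict:
    """Ask the LLM to keep only the tables/columns relevant to the sub-query."""
    prompt = (
        "Given the question, external knowledge, and full database schema, "
        "return a JSON object mapping only the relevant tables to the relevant columns.\n"
        f"Question: {sub_query}\nKnowledge: {knowledge}\n"
        f"Schema: {json.dumps(schema)}\nRelevant sub-schema (JSON):"
    )
    sub_schema = json.loads(call_llm(prompt))
    # Guard against hallucinated items: keep only tables/columns present in the real schema.
    return {
        t: [c for c in cols if c in schema.get(t, [])]
        for t, cols in sub_schema.items() if t in schema
    }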
Manager is the core agent responsible for planning and decomposing the gen-
eral process into a series of intermediate steps, including sub-queries and SQL
sentences. This process can be described as:
P(Y | Q, S, D) = \prod_{j=1}^{L} P(Y^j | Y^{<j}; Q^j, S^j, K^j)    (2)
where Y is the final SQL answer; Q, S, and D are the original complex question,
database schema, and external knowledge corpus; Y^{<j} denotes the SQL statements
generated in previous steps; and Y^j, Q^j, S^j, and K^j are the SQL statement,
sub-query, sub-schema, and external knowledge produced at step j by Generator,
Manager, Selector, and Sniffer, respectively.
Like previous work [1, 26], we adopt the chain-of-thought (CoT) [23] prompting
method and few-shot learning as Manager's working pattern. Specifically, Manager
dynamically assesses the complexity of a user query: if it can be answered with a
simple SQL query, the SQL is generated directly. For more intricate questions, it
breaks them down into sub-queries, generates SQL starting from the simplest
sub-query, and progressively works toward the final SQL that corresponds to the
original question. Moreover, in order to improve generalization in held-out
environments, we also propose a new decomposition example construction method,
Forward-Backward Decomposition Generation, for SEEK-SQL.
Forward-Backward Decomposition Generation For a given tuple (Q, S, D), which
represents the original complex question, database schema, and external knowledge
corpus, and the final SQL answer Y, the goal of decomposition is to generate a
sequence (<Y^1, Q^1, S^1, K^1>, ..., <Y^t, Q^t, S^t, K^t>), where Y^j, Q^j, S^j,
and K^j are the SQL statement, sub-query, sub-schema, and external knowledge used
in the j-th step. A common method is to use an LLM to generate examples for each
step from front to back so as to ensure consistency with the final answer [20].
However, this method may include redundant information in the final example, which
can cause semantic errors in the final SQL statement [2] and reduce external
knowledge retrieval accuracy [25].
To avoid these shortcomings, we propose a new example construction method,
Forward-Backward Decomposition Generation. Specifically, we first use an LLM
(GPT-4o [13] in our experiments) to generate examples for each step from front
to back, as in common practice. Then, we use the LLM to check each step from back
to front, removing redundant information already included in earlier steps, such
as repeatedly generated SQL statements, unused schema parts, and fragments of the
sub-query repeated from previous sub-queries. Finally, we check the executability
of the remaining sequence from front to back. The final output is a sequence in
which each step contains only the minimum information and operations necessary for
that step. Results of the ablation study in Section 5.4 validate the effectiveness
of this method.
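The following sketch summarizes the forward-backward procedure described above; the three callables stand in for LLM-driven steps and are assumptions rather than the released code.

def build_decomposition_example(forward_decompose, strip_redundancy, is_executable,
                                question, schema, corpus):
    # Forward pass: generate <SQL, sub-query, sub-schema, knowledge> steps front to back.
    steps = forward_decompose(question, schema, corpus)

    # Backward pass: walk from the last step to the first, removing information
    # already contained in earlier steps (repeated SQL, unused schema items,
    # sub-query fragments repeated from previous sub-queries).
    for j in range(len(steps) - 1, -1, -1):
        steps[j] = strip_redundancy(steps[j], steps[:j])

    # Final forward check: every remaining step must still execute against the database.
    if all(is_executable(step, schema) for step in steps):
        return steps
    return None  # discard the example if any step no longer runs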
Sniffer is responsible for using various tools to retrieve and extract relevant
information from external knowledge sources to assist Manager in planning and
SQL generation. For an external knowledge corpus D, the function can be described as:
K = fSniffer{Q′, S′, D, N}    (3)
where Q′ and S′ are the (sub-)query and (sub-)schema given by Manager, and N is
optional additional information given by Manager through its CoT process.
There are multiple possible modes for external information retrieval; we implement
three in our system. In Local mode, Sniffer uses a light RAG system built with
LlamaIndex [10] to search for relevant information in a locally built knowledge
base, which is the mode most often chosen in our real-world environment. In Open
World mode, Sniffer uses a search-engine tool to look up relevant information
online and returns a summarized result, which suits environments lacking database
information. In Close World mode, intended only for highly sensitive environments,
Sniffer reflects and answers based solely on its own parametric knowledge, which
largely depends on its base model. Users can also choose a Hybrid mode that lets
Sniffer comprehensively compare information from all of these sources.
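A minimal sketch of Sniffer's mode dispatch is shown below; the local_index, web_search, and call_llm handles are placeholders for the LlamaIndex RAG system, the search-engine tool, and the backbone LLM, respectively, and are assumptions rather than the real APIs.

def sniff(mode, sub_query, sub_schema, local_index=None, web_search=None, call_llm=None):
    if mode == "local":          # RAG over a locally built knowledge base
        return local_index.query(f"{sub_query}\nSchema: {sub_schema}")
    if mode == "open_world":     # search online and return a summarized result
        hits = web_search(sub_query)
        return call_llm(f"Summarize the facts relevant to: {sub_query}\n{hits}")
    if mode == "close_world":    # rely only on the backbone LLM's own knowledge
        return call_llm(f"What background knowledge is needed to answer: {sub_query}?")
    if mode == "hybrid":         # compare and merge evidence from all origins
        parts = [sniff(m, sub_query, sub_schema, local_index, web_search, call_llm)
                 for m in ("local", "open_world", "close_world")]
        return call_llm("Merge the consistent facts:\n" + "\n".join(parts))
    raise ValueError(f"unknown mode: {mode}")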
Generator is responsible for generating the SQL statement for the current sub-query
based on the previous sub-queries and previously generated SQL statements, as in
Equation 2. To avoid accumulating mistakes across steps, we adopt a light
self-refining strategy during generation, which asks Generator itself to check the
answer for syntax and table-mismatch errors after each step. Also, since Generator
focuses purely on SQL generation, users can easily fine-tune SQL-focused code
generation models as its base model in real-world environments, further
reducing the difficulty of adapting to held-out environments.
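The sketch below illustrates Generator's per-step generation with the light self-check described above; the prompts and the call_llm helper are illustrative assumptions.

def generate_sub_sql(call_llm, sub_query, sub_schema, knowledge, prev_sqls):
    sql = call_llm(
        f"Write one SQL statement for the sub-query.\n"
        f"Sub-query: {sub_query}\nSub-schema: {sub_schema}\n"
        f"Knowledge: {knowledge}\nPrevious steps: {prev_sqls}"
    )
    # Light self-check: the same agent reviews its own output for syntax errors
    # and table/column mismatches before it is passed to the next step.
    checked = call_llm(
        f"Check this SQL for syntax errors and for tables or columns that are not "
        f"in the sub-schema; return the corrected statement.\n"
        f"SQL: {sql}\nSub-schema: {sub_schema}"
    )
    return checked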
4.2 Contrastive Self-refinement Strategy
Common approaches to fixing SQL bugs utilize single or multiple agents to
analyze the buggy SQL and generate a correct one [2, 22]. However, such approaches
only focus on fixing SQL bugs in individual cases without considering the underlying
systemic defects. As a result, the same type of error occurs repeatedly, wasting a
large amount of computing resources and time.
To address this challenge, we propose a new self-refinement strategy targeting
not only SQL sentence fixing but also system-level evolution, which involves two
parts: Backward SQL Fixing and Batch-wise Contrastive Reflection, as shown
in Algorithm 2.
Backward SQL Fixing LLM-based Text-to-SQL systems commonly face two sorts of
errors: syntax errors and semantic errors. Syntax errors are execution failures
caused by incorrect syntax or spelling in the generated SQL, mostly arising from
wrong behavior of Generator. Semantic errors, including knowledge-gap errors, are
often caused by redundant, confusing statements or by a lack of understanding of
the context; such queries often execute successfully but produce wrong and
confusing outputs. Contrary to the generation order, we use a
backward two-step method to solve these errors:
Step 1: Syntax Error Checking First, we utilize Generator to review the wrong
SQL based on the error message. Generator will try to identify and repair syntax
errors and re-execute the answer. If the answer is still incorrect, or Generator
cannot identify any syntax error, the issue is handed over to Manager and Step 2
is activated.
Algorithm 2: Contrastive Self-refinement Strategy
Input: Query q; Database S; Knowledge Corpus D; Wrong SQL SQLerror
Output: SQL answer SQL
/* SQL fixing */
1 for count in [0, MaxTryTimes] do
2 err = SyntaxCheck(Generator, SQLerror );
3 if err ≠ ∅ then
4 sql = SyntaxRefine(Generator, SQLerror , err);
5 ok, err = Execute(sql, S);
6 if ok then break;
7 end
8 Solution = SemanticCheck(Manager, err, SQLerror );
9 sql = SemanticRefine(Solution, Manager, err, SQLerror );
10 ok, err = Execute(sql, S);
11 if ok then break;
12 end
/* Batch-wise Contrastive Reflection */
13 ReflectionBatch.append([CorrectTraj, WrongTraj]);
14 if ReflectionBatch.size == SettledSize then
15 Guideline = ContrastiveReflection(Manager, ReflectionBatch);
16 GuidelineUpdate(Agents, Guideline);
17 end
18 MemoryBankUpdate(CorrectTraj);
19 return sql
Step 2: Semantic Error Repair When the error message is sent to Manager,
following [7], we utilize the rubber duck debugging method, in which a programmer
walks through their code step by step, explaining it to an unresponsive item (such
as a rubber duck) to help pinpoint and understand mistakes. Specifically, we ask
Manager to go through each line of the SQL statement to check its alignment with the
sub-query’s intention. When Manager detects mistakes, it calls different agents
to repair the mistakes. To be specific, the Manager calls Sniffer to find relevant
information when additional external information is needed to correct the SQL
answer; it calls Selector to re-extract the sub-schema when the existing sub-
schema does not match the query; and it calls itself to re-decompose the query
when certain sub-queries are not sufficiently detailed.
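For illustration, the backward two-step repair can be sketched as follows (cf. Algorithm 2, lines 1-12); the agent objects and their method names are placeholders, not the released interface.

def backward_fix(generator, manager, execute, wrong_sql, err, max_tries=3):
    sql = wrong_sql
    for _ in range(max_tries):
        # Step 1: Generator first looks for syntax errors based on the error message.
        syntax_issue = generator.find_syntax_error(sql, err)
        if syntax_issue is not None:
            sql = generator.repair_syntax(sql, syntax_issue)
        else:
            # Step 2: Manager walks the SQL line by line ("rubber duck" style),
            # calling Sniffer / Selector / itself to repair the detected mistake
            # before regenerating the statement.
            solution = manager.semantic_check(sql, err)
            sql = manager.semantic_refine(solution, sql, err)
        ok, err = execute(sql)
        if ok:
            return sql
    return sql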
Batch-wise Contrastive Reflection To address the challenge of making the
Text-to-SQL system learn from the bug-fixing process and fix its own systemic flaws,
we propose a new multi-agent self-reflection strategy based on contrastive reasoning.
This strategy involves two steps:
Step 1: Positive and Negative Trajectory Batch Construction After Manager fixes an
error case, it collects the action trajectories of both the correct and the wrong
answer for that case, including sub-queries and the agent-calling history, as
positive and negative samples. Following [24], we adopt a batched strategy: we
sample b trajectories as a batch, with an equal split of positive and negative
trajectories (b/2 each) forming two minibatches for contrastive reasoning.
Experience from [24] shows that such a batched strategy helps generate more general
and comprehensive guidelines for complex trajectories, including problem
decomposition and solutions to sub-queries, because it targets not just individual
samples but a diverse set.
Step 2: Guideline Generation and Memory Bank Construction After constructing the
sample batch, Manager compares the two minibatches based on their key
characteristics, attributes the performance gap to particular actions in the
intricate trajectories, and then generates general instruction guidelines for each
agent to boost overall task performance. The instruction guideline is appended to
each agent's prompt to improve its performance in future tasks.
Moreover, we maintain a memory bank containing the five most recent correct samples
for the agents, inspired by human decision making, in which recent past experiences
are frequently consulted [16]. The memory bank, which stores tuples of action
sequences, instructions from Manager, and the performance of these actions in
recent cases, is provided together with the task instruction and updated after
each case.
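The sketch below illustrates how batch-wise contrastive reflection and the memory bank update could be wired together; the reflection prompt, data layout, and call_llm helper are assumptions rather than the authors' implementation.

from collections import deque

reflection_batch = []            # list of (correct_trajectory, wrong_trajectory) pairs
memory_bank = deque(maxlen=5)    # most recent five correct samples

def after_repair(call_llm, agents, correct_traj, wrong_traj, batch_size=5):
    reflection_batch.append((correct_traj, wrong_traj))
    if len(reflection_batch) == batch_size:
        positives = [p for p, _ in reflection_batch]
        negatives = [n for _, n in reflection_batch]
        guideline = call_llm(
            "Compare the successful and failed trajectories, attribute the gap to "
            "specific actions, and write one short guideline per agent.\n"
            f"Successful: {positives}\nFailed: {negatives}"
        )
        for agent in agents:               # guideline is appended to each agent's prompt
            agent.prompt += "\n" + guideline
        reflection_batch.clear()
    memory_bank.append(correct_traj)       # recent successes are shown in future tasks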
5 Experiments
5.1 Experiment Setup
Datasets We evaluate our framework on two popular benchmarks: Spider [29]
and BIRD [8].
Spider, a comprehensive dataset spanning 200 databases within 138 do-
mains, is extensively utilized to gauge the adaptability and generalization of
Text-to-SQL parsers against unfamiliar database schemas. It provides a sub-
stantial training set of 8,659 samples, a development set of 1,034 samples, and a
test set of 2,147 samples designed to challenge models in navigating diverse and
novel database structures.
BIRD, introduced by Alibaba DAMO Academy, stands as a novel bench-
mark for large-scale, real-world database grounded Text-to-SQL evaluation, en-
compassing 95 large-scale databases and high-quality Text-SQL pairs. With a
substantial data storage volume of 33.4GB across 37 professional domains, BIRD
differentiates itself from Spider by emphasizing the integration of external knowl-
edge reasoning to connect natural language queries with database content and
by addressing the novel challenges associated with SQL efficiency when handling
extensive databases.
Evaluation Metrics Adopting the evaluation frameworks from BIRD [8] and
Test-suite [31], we assess the performance of our text-to-SQL model in real-world
scenarios with large databases using three key metrics: Exact Match Accuracy
(EM), Execution Accuracy (EX), and Valid Efficiency Score (VES). EM, as intro-
duced by Test-Suites [31], evaluates each SQL clause as a set, requiring a perfect
match between the predicted and reference query clauses without considering
values. EX, the proportion of queries where both predicted and ground-truth
execution results are identical, gauges the correctness of the query outcomes.
VES, introduced by BIRD [8], measures the efficiency of valid SQL queries, de-
fined as those whose result sets align with the ground-truth, thus considering
both the accuracy and the efficiency of the model-generated SQL.
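As an illustration of the EX metric, the simplified sketch below treats a prediction as correct when its execution result set matches that of the gold query; the official evaluation uses the BIRD and Test-suite scripts, and the helper layout here is an assumption.

import sqlite3

def execution_accuracy(db_paths, pred_sqls, gold_sqls):
    correct = 0
    for db_path, pred, gold in zip(db_paths, pred_sqls, gold_sqls):
        conn = sqlite3.connect(db_path)
        try:
            pred_rows = set(conn.execute(pred).fetchall())
            gold_rows = set(conn.execute(gold).fetchall())
            correct += int(pred_rows == gold_rows)
        except sqlite3.Error:
            pass  # an unexecutable prediction counts as wrong
        finally:
            conn.close()
    return correct / len(gold_sqls)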
Implementation Details In our evaluation, we separately measure our framework on
two backbone LLMs: Deepseek-V3-671B [9] and GPT-4o [13]. We also use GPT-3.5-turbo
in the token efficiency experiment. For Agent Sniffer, we use E5-v2-base [21] and
Flan-T5-base [3] as the embedding and generation models of the light RAG system in
Local mode; we choose Web-search-pro [11] with the Google engine as the search
engine for Open World mode. The number
of few-shot examples, the size of the memory bank, and the batch size for con-
trastive reflection are all set to 5 to match other baselines. All experiments are
conducted on a server with one NVIDIA A100 40G GPU.
Baselines We choose two groups of previous works as baselines in our experi-
ment:
LLM-based methods use a single LLM with in-context-learning methods
to generate SQL statements from given queries with one-stage or multi-stage
reasoning. These methods include DIN-SQL [15], DAIL-SQL [5], C3-SQL [4], and
ACT-SQL [30].
Multi-agent-based methods use various agents responsible for different
tasks to generate and refine SQL statements. These methods include MAC-SQL [20],
TOOL-SQL [22], SQLFixAgent [2], and CHASE-SQL [14].
5.2 Overall Performance
Results on Spider Table 1 shows SEEK-SQL results with two different back-
bone LLMs and two different Sniffer modes on the Spider benchmark. Results show
that SEEK-SQL outperforms all other baselines and achieves a new state of the art,
confirming the effectiveness of our method. Moreover, our method has a far better
EX score than SQLFixAgent [2], while the EM score is similar. This may be
attributed to our efficient self-refinement strategy, which allows some answers
originally generated with errors to be repaired so that they execute correctly and
return correct results, even though they are not completely identical to the gold
answers. Also, SEEK-SQL performs better in Open World mode than in Close World
mode, which shows that even though the Spider dataset is relatively simple and
equipped with sufficient knowledge, actively searching for external knowledge
online still helps relieve knowledge-gap errors.
Table 1: Evaluation of SEEK-SQL on Spider's dev/test sets.

| Group             | Methods                              | Dev EM% | Dev EX% | Test EM% | Test EX% |
|-------------------|--------------------------------------|---------|---------|----------|----------|
| LLM-based         | DAIL-SQL [5] + GPT-4 + SC            | 68.7    | 83.6    | 66.0     | 86.6     |
|                   | C3 [4] + ChatGPT                     | 71.4    | 81.8    | -        | 82.3     |
|                   | ACT-SQL [30] + GPT-4                 | 61.7    | 82.9    | -        | -        |
|                   | DIN-SQL [15] + GPT-4                 | 60.0    | 85.3    | -        | -        |
| Multi-Agent-based | MAC-SQL [20] + GPT-4                 | 23.5    | 86.8    | 19.3     | 82.8     |
|                   | SQLFixAgent [2] + ChatGPT            | 77.9    | 84.8    | 71.2     | 82.9     |
|                   | Tool-SQL [22] + GPT-4                | -       | 86.9    | -        | 85.6     |
|                   | CHASE-SQL [14] + Gemini-1.5-Pro      | -       | -       | -        | 87.6     |
| Ours              | SEEK-SQL (Close World) + Deepseek-V3 | 77.2    | 87.4    | 72.4     | 87.4     |
|                   | SEEK-SQL (Open World) + Deepseek-V3  | 78.3    | 88.2    | 74.3     | 87.9     |
|                   | SEEK-SQL (Close World) + GPT-4o      | 77.9    | 87.1    | 71.6     | 86.5     |
Table 2: Evaluation of SEEK-SQL on BIRD's dev set.

| Group             | Methods                              | Dev EX% | Dev VES% |
|-------------------|--------------------------------------|---------|----------|
| LLM-based         | DAIL-SQL [5] + GPT-4                 | 54.76   | 56.08    |
|                   | DIN-SQL [15] + GPT-4                 | 50.72   | 59.79    |
|                   | GPT-4 [13]                           | 46.35   | 49.77    |
| Multi-Agent-based | MAC-SQL [20] + GPT-4                 | 59.39   | 66.39    |
|                   | SQLFixAgent [2] + ChatGPT            | 58.67   | 62.19    |
|                   | CHASE-SQL [14] + Gemini-1.5-Pro      | 74.46   | -        |
| Ours              | SEEK-SQL (Close World) + Deepseek-V3 | 72.15   | 68.85    |
|                   | SEEK-SQL (Open World) + Deepseek-V3  | 74.53   | 69.72    |
|                   | SEEK-SQL (Close World) + GPT-4o      | 73.72   | 68.81    |
Results on BIRD We further explore the performance of SEEK-SQL on
a larger and more complex benchmark, BIRD. Results in Table 2 show that
SEEK-SQL outperforms all other baselines and achieves a new state of the art among
ICL methods. Compared with the results on Spider, the performance gap between
SEEK-SQL in Close World mode and Open World mode is larger, indicating how
essential external knowledge acquisition is in complex, realistic environments.
Moreover, compared with other multi-agent self-refinement frameworks such as
SQLFixAgent [2], even SEEK-SQL in Close World mode achieves a large improvement
(14.33%) on BIRD and reaches performance similar to CHASE-SQL [14], which employs
more complex agents in its framework. Such improvements show the effectiveness of
our contrastive self-reflection strategy, which continuously learns from its
failures and makes progress.
5.3 Efficiency Analysis
(a) Dynamics curve of Execution Accuracy (EX). (b) Dynamics curve of the average number of self-refinement iterations needed per query.
Fig. 4: Optimization dynamics of SEEK-SQL agents on the Spider dev set
Iteration Efficiency Figure 4 shows the optimization curves of EX and of the
average number of self-refinement iterations needed per query for SEEK-SQL on the
Spider dev set. Impressively, SEEK-SQL agents show significant performance
improvements, e.g., from 53% to 74% in EX, while the average number of
self-refinement iterations drops from 3.7 to 1.8. This strongly supports the
effectiveness of our contrastive self-reflection strategy, as both generation
accuracy and self-refinement efficiency improve. Additionally, our memory bank,
which stores recent successful trajectories, encourages agents to gradually
converge by the end of the optimization process.
(a) Average token consumption. (b) Average EX value.
Fig. 5: Token consumption and execution accuracy of Text-to-SQL methods on a
sample of Spider's dev set.
Token Efficiency Figure 5 shows the average token consumption and execution
accuracy on a 100-query sample of Spider's dev set. We compare our method with
four advanced methods, STRIKE [12], MAC-SQL, SQLFixAgent, and DAIL-SQL, all
powered by GPT-3.5-turbo for a fair comparison. Results
show that SEEK-SQL achieves the best performance with fewer tokens con-
sumed. Considering that LLM APIs charge by token count and that inference time is
proportional to token length, this result shows that our method
is more efficient and cheaper. Moreover, even though our method shows performance
and token efficiency similar to SQLFixAgent during the first 20 iterations,
SEEK-SQL consumes 45% fewer tokens and achieves better performance in its last 20
iterations, benefiting from the reduced number of refinement iterations shown in
Fig. 4b. SEEK-SQL's final token efficiency is comparable to that of the ICL-based
method DAIL-SQL, showing the system's continuous evolution through contrastive
reflection.
5.4 Ablation Study
Table 3: Ablation study of SEEK-SQL on the Spider and BIRD dev sets.

| Methods     | Spider EM% | Spider EX% | BIRD EX% | BIRD VES% |
|-------------|------------|------------|----------|-----------|
| SEEK-SQL    | 78.3       | 88.2       | 74.53    | 69.72     |
| w/o Sniffer | 76.9       | 86.7       | 72.07    | 67.25     |
| w/o FBD     | 77.8       | 87.2       | 73.83    | 68.27     |
| w/o CRF     | 74.4       | 85.1       | 71.25    | 65.07     |
Table 3 shows the results of our ablation study on the three main parts of our
framework. Starting from the complete framework, we separately remove Sniffer,
replace the forward-backward decomposition samples (FBD) with common GPT-generated
samples, and disable the contrastive reflection self-refinement (CRF) strategy.
Results show that removing any of these parts decreases performance in all
scenarios we tested, confirming the effectiveness and necessity of each component.
6 Conclusion
In this paper, we highlight the key challenges of Text-to-SQL systems in real-
world scenarios. To address these challenges, we propose a new knowledge-enhanced
self-optimizing multi-agent framework, SEEK-SQL, which for the first time
introduces the knowledge retrieval task into the Text-to-SQL framework. Results on
Spider and BIRD show that SEEK-SQL achieves new SOTA performance with lower token
consumption, benefiting from the system-flaw repair mechanism brought by our
contrastive self-refinement strategy and confirming its efficacy in addressing
real-world challenges.
References
1. Askari, A., Poelitz, C., Tang, X.: Magic: Generating self-correction guideline for
in-context text-to-sql. arXiv preprint arXiv:2406.12692 (2024)
2. Cen, J., Liu, J., Li, Z., Wang, J.: Sqlfixagent: Towards semantic-accurate text-to-sql
parsing via consistency-enhanced multi-agent collaboration. In: AAAI (2025)
3. Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X.,
Dehghani, M., Brahma, S., et al.: Scaling instruction-finetuned language models.
Journal of Machine Learning Research 25(70), 1–53 (2024)
4. Dong, X., Zhang, C., Ge, Y., Mao, Y., Gao, Y., Lin, J., Lou, D., et al.: C3: Zero-shot
text-to-sql with chatgpt. arXiv preprint arXiv:2307.07306 (2023)
5. Gao, D., Wang, H., Li, Y., Sun, X., Qian, Y.: Text-to-sql empowered by large
language models: A benchmark evaluation. CoRR abs/2308.15363 (2023)
6. Jiang, J., Zhou, K., Dong, Z., Ye, K., Zhao, X., Wen, J.R.: StructGPT: A general
framework for large language model to reason over structured data. In: Proceedings
of the 2023 Conference on Empirical Methods in Natural Language Processing.
pp. 9237–9251. Association for Computational Linguistics, Singapore (Dec 2023).
https://doi.org/10.18653/v1/2023.emnlp-main.574
7. Lee, C., Xia, C.S., Yang, L., Huang, J.t., Zhu, Z., Zhang, L., Lyu, M.R.: A
unified debugging approach via llm-based multi-agent synergy. arXiv preprint
arXiv:2404.17153 (2024)
8. Li, J., Hui, B., Qu, G., Yang, J., Li, B., Li, B., Wang, B., Qin, B., Geng, R.,
Huo, N., Zhou, X., Chenhao, M., Li, G., Chang, K., Huang, F., Cheng, R., Li, Y.:
Can llm already serve as a database interface? a big bench for large-scale database
grounded text-to-sqls. In: Advances in Neural Information Processing Systems.
vol. 36, pp. 42330–42357. Curran Associates, Inc. (2023)
9. Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C.,
Ruan, C., et al.: Deepseek-v3 technical report. arXiv preprint:2412.19437 (2024)
10. Liu, J.: Llamaindex (2022). https://doi.org/10.5281/zenodo.1234
11. Liu, X., Qin, B., Liang, D., Dong, G., Lai, H., Zhang, H., Zhao, H., Iong, I.L.,
Sun, J., Wang, J., et al.: Autoglm: Autonomous foundation agents for guis. arXiv
preprint arXiv:2411.00820 (2024)
12. Nan, L., Zhao, Y., Zou, W., Ri, N., Tae, J., Zhang, E., Cohan, A., Radev, D.:
Enhancing text-to-SQL capabilities of large language models: A study on prompt
design strategies. In: Findings of the Association for Computational Linguistics:
EMNLP 2023. pp. 14935–14956. Association for Computational Linguistics, Sin-
gapore (Dec 2023). https://doi.org/10.18653/v1/2023.findings-emnlp.996
13. OpenAI, Hurst, A., Lerer, A., Goucher, A.P., et al.: Gpt-4o system card
(2024), https://arxiv.org/abs/2410.21276
14. Pourreza, M., Li, H., Sun, R., Chung, Y., Talaei, S., Kakkar, G.T., Gan, Y., Saberi,
A., Ozcan, F., Arik, S.O.: CHASE-SQL: Multi-path reasoning and preference op-
timized candidate selection in text-to-SQL. In: The Thirteenth International Con-
ference on Learning Representations (2025)
15. Pourreza, M., Rafiei, D.: Din-sql: Decomposed in-context learning of text-to-sql
with self-correction. arXiv preprint arXiv:2304.11015 (2023)
16. Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: lan-
guage agents with verbal reinforcement learning. In: Proceedings of the 37th Inter-
national Conference on Neural Information Processing Systems. NIPS ’23, Curran
Associates Inc., Red Hook, NY, USA (2023)
17. Sun, R., Arik, S.Ö., Muzio, A., Miculicich, L., Gundabathula, S., Yin, P., Dai, H.,
Nakhost, H., Sinha, R., Wang, Z., et al.: Sql-palm: Improved large language model
adaptation for text-to-sql (extended). arXiv preprint arXiv:2306.00739 (2023)
18. Tai, C.Y., Chen, Z., Zhang, T., Deng, X., Sun, H.: Exploring chain of thought
style prompting for text-to-SQL. In: Proceedings of the 2023 Conference on Em-
pirical Methods in Natural Language Processing. pp. 5376–5393. Association for
Computational Linguistics, Singapore (Dec 2023)
19. Wang, B., Shin, R., Liu, X., Polozov, O., Richardson, M.: Rat-sql: Relation-
aware schema encoding and linking for text-to-sql parsers. arXiv preprint
arXiv:1911.04942 (2019)
20. Wang, B., Ren, C., Yang, J., Liang, X., Bai, J., Chai, L., Yan, Z., Zhang, Q.W., Yin,
D., Sun, X., Li, Z.: Mac-sql: A multi-agent collaborative framework for text-to-sql.
arXiv preprint (2024)
21. Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., Wei,
F.: Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint
arXiv:2212.03533 (2022)
22. Wang, Z., Zhang, R., Nie, Z., Kim, J.: Tool-assisted agent on sql inspection and
refinement in real-world scenarios. arXiv preprint (2024)
23. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E.H., Le,
Q.V., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language
models. In: Proceedings of the 36th International Conference on Neural Information
Processing Systems. NIPS ’22, Curran Associates Inc., Red Hook, NY, USA (2022)
24. Wu, S., Zhao, S., Huang, Q., Huang, K., Yasunaga, M., Cao, K., Ioannidis, V.,
Subbian, K., Leskovec, J., Zou, J.Y.: Avatar: Optimizing llm agents for tool usage
via contrastive reasoning. In: Advances in Neural Information Processing Systems.
vol. 37, pp. 25981–26010. Curran Associates, Inc. (2024)
25. Xia, Z., Wu, Y., Xia, Y., Nguyen, C.T.: Momentum posterior regularization for
multi-hop dense retrieval. In: Proceedings of the 31st International Conference on
Computational Linguistics. pp. 8255–8271. Association for Computational Linguis-
tics, Abu Dhabi, UAE (Jan 2025), https://aclanthology.org/2025.coling-main.550/
26. Xie, W., Wu, G., Zhou, B.: Mag-sql: Multi-agent generative approach with soft
schema linking and iterative sub-sql refinement for text-to-sql. arXiv preprint
arXiv:2408.07930 (2024)
27. Xie, Y., Jin, X., Xie, T., Matrixmxlin, M., Chen, L., Yu, C., Lei, C., Zhuo, C., Hu,
B., Li, Z.: Decomposition for enhancing attention: Improving LLM-based text-
to-SQL through workflow paradigm. In: Findings of the Association for Compu-
tational Linguistics: ACL 2024. pp. 10796–10816. Association for Computational
Linguistics, Bangkok, Thailand (Aug 2024)
28. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: React:
Synergizing reasoning and acting in language models. In: The Eleventh Interna-
tional Conference on Learning Representations (2023)
29. Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., Ma, J., Li, I., Yao, Q.,
Roman, S., Zhang, Z., Radev, D.: Spider: A large-scale human-labeled dataset for
complex and cross-domain semantic parsing and text-to-SQL task. In: Proceedings
of the 2018 Conference on Empirical Methods in Natural Language Processing. pp.
3911–3921. Association for Computational Linguistics, Brussels, Belgium (Oct-Nov
2018). https://doi.org/10.18653/v1/D18-1425
30. Zhang, H., Cao, R., Chen, L., Xu, H., Yu, K.: ACT-SQL: In-context learning for
text-to-SQL with automatically-generated chain-of-thought. In: Findings of the As-
sociation for Computational Linguistics: EMNLP 2023. pp. 3501–3532. Association
for Computational Linguistics, Singapore (Dec 2023)
31. Zhong, R., Yu, T., Klein, D.: Semantic evaluation for text-to-SQL with distilled
test suites. In: Proceedings of the 2020 Conference on Empirical Methods in Nat-
ural Language Processing (EMNLP). pp. 396–411. Association for Computational
Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.emnlp-main.29