LLMQuoter: Enhancing RAG Capabilities Through Efficient Quote Extraction from Large Contexts
Yuri Façanha Bezerra and Li Weigang
TransLab, Department of Computer Science, University of Brasilia, Brasilia, Federal District, Brazil
Keywords: Knowledge Distillation, Large Language Models, LLM Reasoning, Low-Rank Adaptation,
Retrieval-Augmented Generation.
where Q represents the question, C is the textual context, and R is the set of relevant quotes generated by the fine-tuned model. The training process is described in the following steps:

1. Input: Data from the Dgold dataset in the form of tuples (Q, C), where Q is the question and C is the textual context.

2. Output: The fine-tuned model fsmall is optimized to predict R, replicating the behavior of the larger model fhigh, but without knowing the answer.

2.4 Evaluation Framework and Metrics

The model's performance is evaluated using the DSpy framework, which computes task-specific metrics tailored to LLM outputs. Precision and recall are redefined for the quote extraction task using an LLM Judge to assess semantic relevance between model predictions and ground truth.

Precision measures the proportion of predicted quotes (Rmodel) that align semantically with the golden answers (Rgold), defined as:

P = \frac{\sum_{r \in R_{model}} Judge(r, R_{gold})}{|R_{model}|}

where Rmodel is the set of quotes predicted by the model, Rgold is the set of golden answers, and Judge(r, Rgold) is a scoring function returning values from 0 (no match) to 1 (perfect match).

Recall quantifies the proportion of golden answers (Rgold) captured by the model's predictions (Rmodel), defined as:

R = \frac{\sum_{r \in R_{gold}} Judge(r, R_{model})}{|R_{gold}|}

F1-score balances precision and recall and is defined as:

F_1 = 2 \cdot \frac{P \cdot R}{P + R}

DSpy-Assisted Validation with LLM Judge: The DSpy framework incorporates large language models (LLMs) as automated evaluators, enabling robust and interpretable metric calculations. This flexibility allows DSpy to integrate a wide range of LLMs, referred to here as the LLM Judge. This variation of precision and recall, tailored for LLM-generated outputs and supported by the LLM Judge's semantic judgment, ensures a nuanced evaluation of the quote extraction model. The integration of DSpy and the Judge provides a systematic, interpretable, and robust framework for assessing and iteratively improving model performance.

2.5 Proving the Benefit of Using Quotes

Let fbase represent base models without any fine-tuning, used to establish a baseline for comparison. Two experimental setups are defined to demonstrate the advantage of using relevant quotes R instead of the full context C:

1. Providing only the gold quotes Rgold from Dgold to the base models fbase to answer the questions:

f_{base}: (Q, R_{gold}) \rightarrow A_{base}

2. Providing the full context C instead of the quotes R to the same base models fbase to answer the questions:

f_{base}: (Q, C) \rightarrow A_{base}

For both setups, Q represents the question, Rgold is the set of gold quotes extracted from the Dgold dataset, C is the entire context, and Abase is the set of answers produced by the base models.

The accuracy of the answers produced by fbase is measured using Semantic Accuracy (Sacc), which evaluates the alignment between the model-generated answers Abase and the expected answers Agold. Semantic Accuracy is defined as:

S_{acc} = \frac{\sum_{a \in A_{base}} Judge(a, A_{gold})}{|A_{gold}|}

where Judge(a, Agold) is a semantic similarity function scoring the alignment between a model-generated answer a and the ground truth Agold, with scores ranging from 0 (no match) to 1 (perfect match).

3 EXPERIMENTS

This section describes the experimental setup used to analyze the performance of the proposed methodology. It begins with details of the datasets used for training and evaluation, followed by an explanation of the training configurations, including hyper-parameters and computational resources. An overview of the entire process, from data distillation to evaluation, is illustrated in Figure 2. Finally, the experiments designed to validate the effectiveness of using relevant quotes instead of full context are presented (Figure 3 illustrates the process). The code utilized in this work is available at https://github.com/yurifacanha/LLMQuoter. Concrete examples of the experimental results can be found in the appendix for further clarification.
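The metric definitions above translate directly into code. The following is a minimal Python sketch, not the authors' implementation, in which `judge` is a hypothetical stand-in for the LLM Judge and returns a semantic-match score in [0, 1]:

```python
from typing import Callable, List

# Hypothetical judge: scores semantic agreement between a candidate string
# and a set of reference strings, returning a value in [0, 1].
Judge = Callable[[str, List[str]], float]

def precision(pred_quotes: List[str], gold_quotes: List[str], judge: Judge) -> float:
    """P = sum_{r in R_model} Judge(r, R_gold) / |R_model|"""
    if not pred_quotes:
        return 0.0
    return sum(judge(r, gold_quotes) for r in pred_quotes) / len(pred_quotes)

def recall(pred_quotes: List[str], gold_quotes: List[str], judge: Judge) -> float:
    """R = sum_{r in R_gold} Judge(r, R_model) / |R_gold|"""
    if not gold_quotes:
        return 0.0
    return sum(judge(r, pred_quotes) for r in gold_quotes) / len(gold_quotes)

def f1(p: float, r: float) -> float:
    """F1 = 2 * P * R / (P + R), the harmonic mean of judge-based precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def semantic_accuracy(answers: List[str], gold_answers: List[str], judge: Judge) -> float:
    """S_acc = sum_{a in A_base} Judge(a, A_gold) / |A_gold|"""
    if not gold_answers:
        return 0.0
    return sum(judge(a, gold_answers) for a in answers) / len(gold_answers)
```

In practice the `judge` callable would wrap an LLM call (e.g., through DSpy, as described above); the functions themselves only mirror the formulas.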
3.4 Evaluation and Proving the Benefits

The evaluation of the extracted quotes was performed using the DSpy framework in conjunction with OpenAI GPT-4.0. GPT-4.0 was selected because it operates outside the scope of the training data and methods, is recognized as one of the top reasoning models, and remains unbiased regarding the problem context. By leveraging these tools, the metrics defined in the methodology section were concretely implemented for evaluating the system's performance in a structured and measurable way.

To validate the benefit of using quotes instead of the full context, comparisons were performed across several base models (fbase), including LLAMA 3.2:1B, LLAMA 3.2:3B, and GPT-3.5 Turbo. These models were evaluated in two configurations: using extracted quotes R and using the full context C. The accuracy of the answers produced by these models was assessed to determine the effectiveness of the quote extraction approach. GPT-4.0 was again chosen as the external LLM Judge to compute Semantic Accuracy (Sacc).

Table 2: Performance of the Quoter Model Before and After Fine-Tuning.

Metric      Before    After
Recall      48.3%     68.0% (+19.7%)
Precision   43.6%     71.0% (+27.4%)
F1-Score    41.3%     69.1% (+27.8%)

The results show significant improvements in all three metrics after fine-tuning the quoter model. The F1-score increased from 41.3% to 69.1%, demonstrating the quoter's ability to accurately identify relevant quotes with low computational resources and a compact model.

4.2 Benefits of Using Quotes over Full Context

To validate the benefit of using quotes instead of full context, a comparison was performed using original models without any training. Both the gold quotes and the full context were provided as inputs to different models: LLAMA 1B, LLAMA 3B, and GPT-3.5 Turbo. The accuracy of the answers generated by each model in these two configurations is summarized in Table 3.
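As an illustration of how this two-configuration comparison could be wired up, the sketch below uses DSPy with an external judge model. It is a hypothetical reconstruction, not the released LLMQuoter code: the signature classes, field names, and model identifier are assumptions, and it presumes a recent DSPy release exposing dspy.LM and dspy.Predict.

```python
import dspy

# Illustrative model id; the paper's judge ("GPT-4.0") and base models may differ.
dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini", temperature=0.0))

class AnswerWithEvidence(dspy.Signature):
    """Answer the question using only the supplied evidence."""
    question: str = dspy.InputField()
    evidence: str = dspy.InputField(desc="gold quotes or the full context")
    answer: str = dspy.OutputField()

class JudgeAnswer(dspy.Signature):
    """Decide whether the candidate answer is semantically equivalent to the gold answer."""
    question: str = dspy.InputField()
    gold_answer: str = dspy.InputField()
    candidate: str = dspy.InputField()
    verdict: str = dspy.OutputField(desc="'yes' or 'no'")

answerer = dspy.Predict(AnswerWithEvidence)
judge = dspy.Predict(JudgeAnswer)

def semantic_accuracy(examples, use_quotes: bool) -> float:
    """Score one configuration: evidence is either the gold quotes or the full context."""
    hits = 0
    for ex in examples:  # each ex: {"question", "context", "gold_quotes", "gold_answer"}
        evidence = "\n".join(ex["gold_quotes"]) if use_quotes else ex["context"]
        pred = answerer(question=ex["question"], evidence=evidence).answer
        verdict = judge(question=ex["question"], gold_answer=ex["gold_answer"],
                        candidate=pred).verdict
        hits += verdict.strip().lower().startswith("yes")
    return hits / len(examples) if examples else 0.0
```

Running the same example set with use_quotes=True and use_quotes=False yields the two accuracy columns compared in the quotes-versus-context experiment.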
APPENDIX

This section presents examples of inferences drawn from the experiments.

A.1 Distillation

The input (Q, C, A):

"""
Instruction: Given the question, the context and the expected answer bellow, provide relevant quotes from the context that support the answer. your answer must be just the quotes, not the entire context.
format: ##begin_quote##quote##end_quote## for each quote.
do not add anything else other than the quotes.
Your turn:
Question: Unlike Xuzhou, where is Rugao under the adminstration of?
Context: Rugao () is a county-level city under the administration of Nantong, Jiangsu province, China, located in [about 200 words...] Shanghai.
Answer: Nantong
Quotes:
"""

And the generated quotes (R):

"""
##begin_quote## Rugao () is a county-level city under the administration of Nantong ##end_quote##
"""
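The (Q, C, A) prompt above and its ##begin_quote## output format can be assembled and parsed with a few lines of Python. This is a hypothetical sketch, not code from the LLMQuoter repository; the helper names are illustrative, and the instruction text simply mirrors the prompt shown above, including its original wording.

```python
import re

# Matches each span delimited by the quote markers used in the prompts above.
QUOTE_PATTERN = re.compile(r"##begin_quote##(.*?)##end_quote##", re.DOTALL)

def build_distillation_prompt(question: str, context: str, answer: str) -> str:
    """Assemble the (Q, C, A) distillation prompt shown above."""
    return (
        "Instruction: Given the question, the context and the expected answer bellow, "
        "provide relevant quotes from the context that support the answer. "
        "your answer must be just the quotes, not the entire context. "
        "format: ##begin_quote##quote##end_quote## for each quote. "
        "do not add anything else other than the quotes.\n"
        "Your turn:\n"
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Answer: {answer}\n"
        "Quotes:"
    )

def parse_quotes(model_output: str) -> list[str]:
    """Extract the quoted spans from a ##begin_quote##...##end_quote## response."""
    return [q.strip() for q in QUOTE_PATTERN.findall(model_output)]
```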
A.2 Quote Train Sample

The input (Q, C):

"""
Instruction: Given the question and the context provide relevant quotes from the context that support the answer. your answer must be just the quotes, not the entire context.
format: ##begin_quote##quote##end_quote## for each quote.
do not add anything else other than the quotes.
Question: What authority manages the regional passenger train service that runs through the same junction as West Amesbury Branch Railroad?
Context: Rapido was the brand name for the Canadian National Railway's (CN) express passenger train service in the Quebec City–Windsor Corridor. "Rapido" service was introduced on [..about 100 lines hide..] The rail service may be replaced if the line is closed due to rail maintenance, a breakdown of a train, a rail accident, strike action, or if the rail service is not economically viable.
Quotes:
"""

The desired output (R):

"""
##begin_quote## The Downeaster is a 145 mi regional passenger train service, managed by the Northern New England Passenger Rail Authority (NNEPRA, created by the State of Maine), and operated by Amtrak. ##end_quote##
##begin_quote## The West Amesbury Branch Railroad was a railroad that once led form Newton Junction, New Hampshire to Merrimac, Massachusetts. Although the railroad does not exist, the Amtrak "Downeaster" line now passes through the Newton Junction station, which is now a pizza restaurant, and most of the railroad, is now a gravel walking trail. ##end_quote##
"""

Table 5 demonstrates an example of model performance before and after training on a test sample for the question: "Which film was produced first, Emil and the Detectives or Toy Story 2?"

Table 5: Golden quotes and the quotes inferred before and after training for a test sample.

Gold:
##begin_quote## Emil and the Detectives is a 1964 film directed by Peter Tewksbury based on the novel by German author Erich Kästner. ##end_quote##
##begin_quote## Toy Story 2 is a 1999 American computer-animated comedy film produced by Pixar Animation Studios for Walt Disney Pictures. ##end_quote##

Before:
##begin_quote## "A Bug's Life", "Monsters, Inc.", "Finding Nemo", "Cars", "The Incredibles", "Ratatouille" ##end_quote##
##begin_quote## "Toy Story 3" (2010) is the third installment in Pixar's "Toy Story" series, and the sequel to 1999's "Toy Story 2". ##end_quote##

After:
##begin_quote## Emil and the Detectives is a 1964 film directed by Peter Tewksbury based on the novel by German author Erich Kästner. ##end_quote##
##begin_quote## Toy Story 2 is a 1999 American computer-animated comedy film produced by Pixar Animation Studios for Walt Disney Pictures. ##end_quote##

A.3 Comparison: Quote x Context

An example illustrating the performance comparison between using full context and extracted quotes.

Question:
"""
Which Walt Disney Pictures film was created first, Finding Dory or The Wild Country?
"""

Context: a 5086-character context about Disney and Pixar films.

Quotes:
"""
##begin_quote## The Wild Country is a 1970 American adventure film produced by Walt Disney Pictures and directed by Robert Totten. ##end_quote##
##begin_quote## Finding Nemo is a 2003 American computer-animated family film produced by Pixar Animation Studios and released by Walt Disney Pictures. ##end_quote##
"""

A detailed comparison of the models' performance, based on answers generated using either the provided context or the golden quotes, is presented in Table 6.

Table 6: Comparison of Q/A results: context vs. quotes.

Model          Context Answer                    Quotes Answer
gpt3.5-turbo   Finding Nemo was created first.   The Wild Country
llama3.2:1b    Finding Dory is created first.    The Wild Country
llama3.2:3b    Finding Dory is created first.    The Wild Country