
LLMQuoter: Enhancing RAG Capabilities Through Efficient Quote Extraction from Large Contexts

Yuri Façanha Bezerra (https://orcid.org/0009-0001-8294-7163) and Li Weigang (https://orcid.org/0000-0003-1826-1850)
TransLab, Department of Computer Science, University of Brasilia, Brasilia, Federal District, Brazil

Keywords: Knowledge Distillation, Large Language Models, LLM Reasoning, Low-Rank Adaptation, Retrieval-Augmented Generation.

Abstract: We introduce LLMQuoter, a lightweight, distillation-based model designed to enhance Retrieval-Augmented Generation (RAG) by extracting the most relevant textual evidence for downstream reasoning tasks. Built on the LLaMA-3B architecture and fine-tuned with Low-Rank Adaptation (LoRA) on a 15,000-sample subset of HotpotQA, LLMQuoter adopts a "quote-first-then-answer" strategy, efficiently identifying key quotes before passing curated snippets to reasoning models. This workflow reduces cognitive overhead and outperforms full-context approaches like Retrieval-Augmented Fine-Tuning (RAFT), achieving over 20-point accuracy gains across both small and large language models. By leveraging knowledge distillation from a high-performing teacher model, LLMQuoter achieves competitive results in a resource-efficient fine-tuning setup. It democratizes advanced RAG capabilities, delivering significant performance improvements without requiring extensive model retraining. Our results highlight the potential of distilled quote-based reasoning to streamline complex workflows, offering a scalable and practical solution for researchers and practitioners alike.

1 INTRODUCTION

Large Language Models (LLMs) have revolutionized natural language processing, exhibiting robust performance across a wide range of tasks such as open-domain question answering, summarization, and conversational AI (Lin et al., 2024; Jin et al., 2024; An et al., 2024). Yet, as model sizes grow, so do their computational demands, creating inefficiencies—particularly in tasks requiring complex reasoning or retrieval from large contexts. Retrieval-Augmented Generation (RAG) has emerged as a popular solution, integrating external knowledge sources so models can dynamically access relevant information without extensive retraining (Mirzadeh et al., 2024; Hu et al., 2024). However, smaller models still struggle to maintain coherent reasoning over extensive or noisy contexts, highlighting a persistent gap in efficiency and accuracy.

Knowledge distillation addresses these challenges by transferring capabilities from high-capacity teacher models to smaller students, preserving advanced features like multi-step reasoning and factual consistency while reducing computational overhead (Fu et al., 2024; Gogate et al., 2024). Distilled student models can leverage split-step reasoning, domain-specific fine-tuning, and self-correction mechanisms to tackle intricate tasks, improving both inference efficiency and overall performance (Yao et al., 2024; Zhang et al., 2024b).

Within the realm of retrieval-augmented approaches, RAFT (Retrieval-Augmented Fine-Tuning) exemplifies how "quote while thinking" strategies can bridge the gap between retrieval and generation (Zhang et al., 2024a; Di Oliveira et al., 2024). By training the model to reason, quote relevant passages, and answer in one sequence, RAFT demonstrates that targeted fine-tuning can enhance context-aware responses (see Figure 1). Nevertheless, even well-crafted frameworks like RAFT encounter difficulties when smaller LLMs face large documents or complex multi-step reasoning (Zhang et al., 2024b; Chen et al., 2024).

To address these limitations, we propose LLMQuoter, a lightweight model that adopts a "quote-first-then-answer" strategy. Rather than reasoning over an entire context, LLMQuoter identifies and retrieves the most pertinent excerpts, which are subsequently handed off to downstream models.


This decouples retrieval from reasoning, reducing the cognitive load and enabling both large and small models to achieve higher accuracy with less computational cost. By building on knowledge distillation and leveraging Low-Rank Adaptation (LoRA) (Hu et al., 2021) to fine-tune a LLaMA-3B model, LLMQuoter streamlines RAG pipelines, surpassing full-context approaches like RAFT in efficiency and scalability.

We evaluate LLMQuoter within the DSPy framework (Khattab et al., 2023), using a 15,000-sample subset of the HotpotQA dataset (Yang et al., 2018), a benchmark commonly employed for retrieval-augmented generation (RAG). Empirical results reveal that LLMQuoter excels in accuracy, all while remaining computationally lightweight. Through a two-phase workflow—quote retrieval followed by reasoning—LLMQuoter democratizes access to advanced RAG solutions, offering a scalable alternative for researchers and practitioners constrained by computational resources.

This workflow achieves over 20-point accuracy gains compared to full-context approaches like RAFT, demonstrating significant improvements across both small and large language models. Leveraging knowledge distillation from high-performing teacher models, LLMQuoter delivers competitive results with resource-efficient fine-tuning, eliminating the need for extensive retraining. The approach highlights the scalability of distilled quote-based reasoning, providing a practical and efficient solution for RAG workflows.

Figure 1: RAFT inference example (Zhang et al., 2024a).

2 METHODOLOGY

With the goal of developing an efficient language model that extracts relevant quotes from contexts in order to properly answer questions about them, this section details the methodology employed in training and evaluating the distilled LLM. The process involves leveraging a high-performing LLM for dataset creation, fine-tuning a smaller LLM, and validating the approach with task-specific metrics.

We begin with a formalization of the distillation problem in Section 2.1, followed by an overview of the fine-tuning process in Section 2.2. Finally, the evaluation framework and metrics used to validate the model's performance are described, along with a simple approach to demonstrate the benefits of extracting relevant quotes instead of using the large content itself.

2.1 Problem Formalization

Let us consider a dataset of text samples, denoted by $D = \{(C, Q, A)\}$, where:
• $C$: a large text context.
• $Q$: a specific question.
• $A$: the expected answer.
The task is to train a model capable of extracting relevant quotes from $C$ that support $A$ in response to $Q$. To achieve this, we employ a distillation process in which a large LLM generates high-quality training data and a smaller LLM is fine-tuned on this dataset to efficiently replicate the behavior of the larger model.

2.2 LLM Distillation

The dataset creation process can be formalized as follows. Given a high-performance language model $f_{high}$, such as ChatGPT or Gemini, the task is to extract quotes $R$ from a context $C$ that directly support an answer $A$ in response to a question $Q$. Formally, this process can be represented as:

$$f_{high} : (Q, A, C) \rightarrow R$$

For each data point $(Q, A, C)$, the high-performance model $f_{high}$ generates the set of quotes $R$, which serves as the ground truth:

$$D_{gold} = \{(Q, A, C, R) \mid R = f_{high}(Q, A, C)\}$$

The result is a high-quality dataset $D_{gold}$, consisting of tuples $(Q, A, C, R)$, where $R$ represents the relevant quotes extracted by $f_{high}$. This dataset is then used to train and evaluate the smaller distilled model $f_{small}$.
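This mapping is straightforward to materialize as a data-generation loop. The following is a minimal sketch of the $D_{gold}$ construction; the Sample dataclass, field names, and the f_high callable are illustrative stand-ins, not taken from the paper's repository.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Sample:
        question: str  # Q
        context: str   # C
        answer: str    # A

    def build_gold_dataset(
        samples: List[Sample],
        f_high: Callable[[str, str, str], List[str]],  # teacher: (Q, A, C) -> R
    ) -> List[dict]:
        """Materialize D_gold = {(Q, A, C, R) | R = f_high(Q, A, C)}."""
        gold = []
        for s in samples:
            quotes = f_high(s.question, s.answer, s.context)  # teacher extracts R
            gold.append({"question": s.question, "answer": s.answer,
                         "context": s.context, "quotes": quotes})
        return gold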
2.3 Fine-Tuning LLM with LoRA

The smaller model $f_{small}$ is fine-tuned on the $D_{gold}$ dataset using Low-Rank Adaptation (LoRA) for task-specific learning in the extraction of relevant quotes. The fine-tuning process is defined as:

$$f_{small} : (Q, C) \rightarrow R$$

where $Q$ represents the question, $C$ is the textual context, and $R$ is the set of relevant quotes generated by the fine-tuned model. The training process is described in the following steps:
1. Input: Data from the $D_{gold}$ dataset in the form of tuples $(Q, C)$, where $Q$ is the question and $C$ is the textual context.
2. Output: The fine-tuned model $f_{small}$ is optimized to predict $R$, replicating the behavior of the larger model $f_{high}$, but without knowing the answer.

2.4 Evaluation Framework and Metrics

The model's performance is evaluated using the DSPy framework, which computes task-specific metrics tailored to LLM outputs. Precision and recall are redefined for the quote extraction task using an LLM Judge to assess semantic relevance between model predictions and ground truth.

Precision measures the proportion of predicted quotes ($R_{model}$) that align semantically with the golden answers ($R_{gold}$), defined as:

$$P = \frac{\sum_{r \in R_{model}} Judge(r, R_{gold})}{|R_{model}|}$$

where $R_{model}$ is the set of quotes predicted by the model, $R_{gold}$ is the set of golden answers, and $Judge(r, R_{gold})$ is a scoring function returning values from 0 (no match) to 1 (perfect match).

Recall quantifies the proportion of golden answers ($R_{gold}$) captured by the model's predictions ($R_{model}$), defined as:

$$R = \frac{\sum_{r \in R_{gold}} Judge(r, R_{model})}{|R_{gold}|}$$

The F1-score balances precision and recall and is defined as:

$$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$$

DSPy-Assisted Validation with LLM Judge: The DSPy framework incorporates large language models (LLMs) as automated evaluators, enabling robust and interpretable metric calculations. This flexibility allows DSPy to integrate a wide range of LLMs, referred to here as the LLM Judge. This variation of precision and recall, tailored for LLM-generated outputs and supported by the LLM Judge's semantic judgment, ensures a nuanced evaluation of the quote extraction model. The integration of DSPy and the Judge provides a systematic, interpretable, and robust framework for assessing and iteratively improving model performance.
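These judge-based metrics reduce to a few lines of code once a scoring function is available. The sketch below implements the three formulas directly; the judge argument stands in for any semantic scorer (such as an LLM call) and is an assumption, not the paper's implementation.

    from typing import Callable, List, Tuple

    def quote_metrics(
        r_model: List[str],
        r_gold: List[str],
        judge: Callable[[str, List[str]], float],  # Judge(r, reference_set) in [0, 1]
    ) -> Tuple[float, float, float]:
        # P = sum over predicted quotes of Judge(r, R_gold), divided by |R_model|
        precision = sum(judge(r, r_gold) for r in r_model) / len(r_model) if r_model else 0.0
        # R = sum over gold quotes of Judge(r, R_model), divided by |R_gold|
        recall = sum(judge(r, r_model) for r in r_gold) / len(r_gold) if r_gold else 0.0
        # F1 = harmonic mean of P and R (0 when both are 0)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1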
ology. It begins with details of the datasets used
for training and evaluation, followed by an ex-
DSpy-Assisted Validation with LLM Judge: The
planation of the training configurations, including
DSpy framework incorporates large language mod-
hyper-parameters and computational resources. An
els (LLMs) as automated evaluators, enabling robust
overview of the entire process, from data distillation
and interpretable metric calculations. This flexibil-
to evaluation, is illustrated in Figure 2. Finally, the
ity allows DSpy to integrate a wide range of LLMs,
experiments designed to validate the effectiveness of
referred to here as the LLM Judge. This variation
using relevant quotes instead of full context are pre-
of precision and recall, tailored for LLM-generated
sented (Figure 3 illustrates the process). The code uti-
outputs and supported by the LLM Judge’s semantic
lized in this work is available on GitHub1 . Concrete
judgment, ensures a nuanced evaluation of the quote
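Since the two setups differ only in what is passed alongside the question, a single harness can run both. A minimal sketch, assuming f_base is any callable mapping (question, evidence) to an answer and that answers are judged pairwise against the gold answers:

    from typing import Callable, List

    def semantic_accuracy(
        answers: List[str],
        gold_answers: List[str],
        judge: Callable[[str, str], float],  # Judge(a, a_gold) in [0, 1]
    ) -> float:
        # S_acc = sum over answers of Judge(a, A_gold), divided by |A_gold|
        return sum(judge(a, g) for a, g in zip(answers, gold_answers)) / len(gold_answers)

    def run_setup(f_base, questions: List[str], evidence: List[str]) -> List[str]:
        # evidence is the gold quotes R_gold (setup 1) or the full context C (setup 2)
        return [f_base(q, e) for q, e in zip(questions, evidence)]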
examples of the experimental results can be found in
extraction model. The integration of DSpy and the
the appendix for further clarification.
Judge provides a systematic, interpretable, and ro-
bust framework for assessing and iteratively improv-
ing model performance. 1 https://github.com/yurifacanha/LLMQuoter


Figure 2: The LLMQuoter diagram.

3.1 Datasets

Our method was evaluated on the HotpotQA dataset (Yang et al., 2018), an open-domain question-answering benchmark derived from Wikipedia, with a focus on common-knowledge topics such as movies, sports, and general trivia. The dataset consists of three columns: question, context, and answer, where each sample pairs a question with a large textual context and its corresponding answer.

Due to resource constraints, a random subset of 15,000 samples was selected from the original dataset to serve as the basis for applying the distillation process. From this subset, 600 samples were set aside for evaluation purposes, forming the test set. This test set was used to measure the model's performance during the evaluation phase and to validate the benefit of using extracted quotes as opposed to the entire context for answering questions. The remaining 14,400 samples were utilized for training and validation during the distillation and fine-tuning steps.
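For concreteness, this sampling and split can be reproduced with the Hugging Face datasets library. A sketch under stated assumptions: the HotpotQA configuration ("distractor") and the seed are illustrative choices not reported in the paper.

    from datasets import load_dataset

    ds = load_dataset("hotpot_qa", "distractor", split="train")  # assumption: config
    ds = ds.shuffle(seed=42).select(range(15_000))               # random 15,000-sample subset
    split = ds.train_test_split(test_size=600, seed=42)          # 600 test / 14,400 train+val
    train_ds, test_ds = split["train"], split["test"]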

3.2 Data Distillation

The distillation process was performed using Gemini Pro 1.5 as the high-performance model ($f_{high}$) and LangChain as the framework for managing the pipeline. The process involved generating relevant quotes for each sample in both the training and test datasets by leveraging the capabilities of Gemini Pro 1.5.

Gemini Pro 1.5, as one of the most powerful models available today, was tasked with extracting quotes directly supporting the answer to each question. Given the model's advanced performance and ability to generate high-quality answers, it is reasonable to assume that the resulting dataset represents an excellent "gold" standard for the task of quote extraction.

After this step, the dataset was finalized, augmented with a new column containing the extracted quotes ($R$). This enriched dataset, now comprising question ($Q$), context ($C$), and quotes ($R$), served as the foundation for training and evaluating the smaller $f_{small}$ model.
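In LangChain terms, the teacher call is a prompt template piped into a Gemini chat model. The sketch below is one plausible shape for this step, assuming the langchain-google-genai integration and a condensed version of the prompt wording from Appendix A.1; it is not the paper's exact pipeline.

    from langchain_core.prompts import ChatPromptTemplate
    from langchain_google_genai import ChatGoogleGenerativeAI

    llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro")  # assumption: exact model id
    prompt = ChatPromptTemplate.from_template(
        "Given the question, the context and the expected answer below, "
        "provide relevant quotes from the context that support the answer. "
        "Use ##begin_quote##quote##end_quote## for each quote.\n"
        "Question: {question}\nContext: {context}\nAnswer: {answer}\nQuotes:"
    )
    chain = prompt | llm  # LCEL: render the prompt, then call the teacher model

    def f_high(question: str, answer: str, context: str) -> str:
        # Returns the raw quote block, to be parsed into the quote set R.
        return chain.invoke(
            {"question": question, "context": context, "answer": answer}
        ).content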
3.3 Fine-Tuning Process

The fine-tuning process was applied to the smaller LLM, LLaMA 3.2 3B, using the Low-Rank Adaptation (LoRA) technique to optimize the model for the quote extraction task. LLaMA 3.2 3B was chosen as the base model due to its balance between computational efficiency and task-specific adaptability. The fine-tuning was completed over a single epoch, ensuring efficient adaptation without overfitting.

The fine-tuning was conducted on an NVIDIA A100-SXM4-40GB GPU, with a maximum memory capacity of 39.564 GB. The specific resource utilization and training parameters are summarized in Table 1.

Table 1: Summary of Fine-Tuning Configuration and Resource Usage.

Configuration/Metric           Value
Memory Usage                   3.56 GB (peak)
Training Memory                1.06 GB (peak)
Batch size                     2
Gradient accumulation steps    4
Total effective batch size     8
Training Steps                 60
Trainable Parameters           approx. 24M
Training Time                  5 minutes

This setup highlights the efficiency of the LoRA approach in adapting a compact model like LLaMA 3.2 3B for specific tasks with minimal resource usage and rapid training over just one epoch (see Table 1).
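Table 1's configuration maps onto the standard peft/transformers workflow. A minimal sketch follows; the checkpoint id, LoRA rank/alpha, target modules, and learning rate are assumptions (the paper reports approx. 24M trainable parameters but not the adapter hyperparameters), while the batch and step counts come from Table 1.

    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
    from peft import LoraConfig, get_peft_model

    base_id = "meta-llama/Llama-3.2-3B"  # assumption: exact checkpoint
    tokenizer = AutoTokenizer.from_pretrained(base_id)  # used to tokenize D_gold prompts
    model = AutoModelForCausalLM.from_pretrained(base_id)

    lora = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,  # assumption: not reported in the paper
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # should be on the order of Table 1's ~24M

    args = TrainingArguments(
        output_dir="llmquoter-lora",
        per_device_train_batch_size=2,  # Table 1: batch size 2
        gradient_accumulation_steps=4,  # Table 1: effective batch size 8
        max_steps=60,                   # Table 1: 60 training steps
        learning_rate=2e-4,             # assumption: not reported
    )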


3.4 Evaluation and Proving the Benefits

The evaluation of the extracted quotes was performed using the DSPy framework in conjunction with OpenAI GPT-4.0. GPT-4.0 was selected because it operates outside the scope of the training data and methods, is recognized as one of the top reasoning models, and remains unbiased regarding the problem context. By leveraging these tools, the metrics defined in the methodology section were concretely implemented for evaluating the system's performance in a structured and measurable way.

To validate the benefit of using quotes instead of the full context, comparisons were performed across several base models ($f_{base}$): LLaMA 3.2 1B, LLaMA 3.2 3B, and GPT-3.5 Turbo. These models were evaluated in two configurations: using extracted quotes $R$ and using the full context $C$. The accuracy of the answers produced by these models was assessed to determine the effectiveness of the quote extraction approach. GPT-4.0 was again chosen as the external LLM Judge to compute Semantic Accuracy ($S_{acc}$).
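In DSPy, the judge is a small program built around a signature. A minimal sketch, assuming a recent DSPy release; the model id, field names, and docstring are illustrative and not taken from the paper's code.

    import dspy

    dspy.configure(lm=dspy.LM("openai/gpt-4o"))  # assumption: judge model id

    class AnswerJudge(dspy.Signature):
        """Score how well a candidate answer semantically matches the expected answer."""
        question: str = dspy.InputField()
        expected_answer: str = dspy.InputField()
        candidate_answer: str = dspy.InputField()
        score: float = dspy.OutputField(desc="0.0 (no match) to 1.0 (perfect match)")

    judge = dspy.Predict(AnswerJudge)
    result = judge(question="Unlike Xuzhou, where is Rugao under the administration of?",
                   expected_answer="Nantong",
                   candidate_answer="Rugao is administered by Nantong.")
    print(result.score)  # feeds the S_acc computation from Section 2.5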
Figure 3: Quotes vs. context process.

4 RESULTS AND DISCUSSION

This section presents the experimental results obtained by evaluating the quote extraction model (quoter) and validating the benefit of using quotes over full context in open-domain question-answering tasks. The results demonstrate the effectiveness of the proposed method in improving the performance of both small and large language models in RAG (retrieval-augmented generation) scenarios.

4.1 Evaluation of the Quoter Model

The performance of the quoter model was evaluated using the metrics described in Section 2.4. The recall, precision, and F1-score were measured both before and after fine-tuning the smaller LLM using the LoRA approach. The results are summarized in Table 2.

Table 2: Performance of the Quoter Model Before and After Fine-Tuning.

Metric      Before    After
Recall      48.3%     68.0% (+19.7%)
Precision   43.6%     71.0% (+27.4%)
F1-Score    41.3%     69.1% (+27.8%)

The results show significant improvements in all three metrics after fine-tuning the quoter model. The F1-score increased from 41.3% to 69.1%, demonstrating the quoter's ability to accurately identify relevant quotes with low computational resources and a compact model.

4.2 Benefits of Using Quotes over Full Context

To validate the benefit of using quotes instead of full context, a comparison was performed using original models without any training. Both the gold quotes and the full context were provided as inputs to different models: LLaMA 1B, LLaMA 3B, and GPT-3.5 Turbo. The accuracy of the answers generated by each model in these two configurations is summarized in Table 3.

Table 3: Comparison of Accuracy Between Using Full Context and Quotes.

Model           Context    Quotes
LLaMA 1B        24.4%      62.2% (+37.8%)
LLaMA 3B        57.7%      83.0% (+25.3%)
GPT-3.5 Turbo   75.8%      88.5% (+12.7%)

The results highlight a clear improvement in accuracy when using gold quotes compared to full context. For instance, LLaMA 1B achieved an accuracy of 62.2% with quotes versus 24.4% with full context, and GPT-3.5 Turbo achieved 88.5% with quotes versus 75.8% with full context. These findings indicate that providing a good quoter model can significantly enhance the performance of both small and large language models in RAG scenarios.

4.3 Discussion

The results validate the hypothesis that using extracted quotes instead of full context significantly improves model performance in open-domain question-answering tasks. This finding aligns with the original RAFT approach, which involves reasoning and answering directly over the full context. However, our experiments demonstrate that separating the tasks—first extracting quotes with a simple quoter and then reasoning over the concise data—can lead to comparable or better outcomes with lower computational overhead.


Table 4: Comparison of RAFT and Full-Context Results on LLaMA2-7B over the HotpotQA dataset.

Method                      Accuracy
LLaMA2-7B + Full Context    26.43%
RAFT (LLaMA2-7B)            35.28%

To provide context, RAFT was tested with LLaMA2-7B over the full dataset, achieving an accuracy of 35.28% when reasoning over both context and question simultaneously. Using the same model (LLaMA2-7B) with only the full context reduced performance to 26.43% (see Table 4). While our experiments used a random sample of 15,000 rows from the HotpotQA dataset due to resource constraints, the results are promising. For instance, even with a lightweight 3B quoter model fine-tuned with minimal resources on Colab, the quote-based approach significantly boosted accuracy for various downstream models.

The comparison highlights that the quoter technique is a promising alternative. By offloading the task of quote extraction to a small and efficient model, we can streamline the reasoning process for larger models, avoiding the pitfalls of over-reasoning. The "divide and conquer" strategy allows each model to focus on its strength: smaller models specialize in targeted preprocessing, while larger models excel at reasoning over concise, relevant data.

While our study utilized only a subset of the HotpotQA dataset, the results suggest that the quoter technique offers a scalable and efficient solution for enhancing retrieval-augmented generation (RAG) pipelines. Notably, the models used with the extracted quotes were not fine-tuned to reason better, yet they still achieved significant improvements in accuracy. This highlights the power of the quoter approach in simplifying the reasoning task by reducing the cognitive load on base models, allowing even non-optimized models to perform effectively.

This approach could serve as a viable alternative to RAFT in scenarios with limited resources, demonstrating that a well-trained quoter can democratize access to high-performing NLP solutions. By offloading the preprocessing task of identifying relevant information, the quoter enables base models to focus their reasoning capabilities on concise, relevant data rather than on processing large and noisy contexts.

5 CONCLUSIONS AND FUTURE WORK

This study demonstrates the effectiveness of data distillation and lightweight training for enhancing Retrieval-Augmented Generation (RAG) systems. By leveraging a high-performing teacher model to distill relevant quotes and fine-tuning a compact model, we achieved significant improvements in model performance. The fine-tuning process required minimal resources, with just 5 minutes of training on an NVIDIA A100 GPU, yet delivered robust results.

The experiments validate that an efficient quoter model can substantially enhance RAG performance by reducing the cognitive load on the reasoning process. By focusing the model's efforts on the answer rather than on processing and reasoning over large contexts, we eliminate the need for extensive training while improving accuracy. This approach aligns with the principle of "divide and conquer," where the reasoning task is simplified and made more manageable for even small models. Ultimately, our results demonstrate that high-quality quote extraction can democratize access to high-performing RAG capabilities across a range of computational constraints.

While this work has established a strong foundation for quote-based RAG, several avenues for future research remain open:
• Expanded Datasets: Test the approach on diverse datasets across various domains and complexities to ensure broader applicability and robustness.
• Reinforcement Learning: Utilize techniques like Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO) to enhance quote extraction and reasoning.
• Larger Models: Explore scalability by training larger models, such as an 8B-parameter LLaMA, to assess the impact of size on performance.
• Prompt Engineering: Develop advanced prompts to optimize extraction and reasoning, improving system accuracy and efficiency.
• Extended Applications: Adapt the methodology for memory-augmented systems to efficiently retrieve and manage information from extensive external knowledge bases.

By exploring these directions, we aim to further refine the quote-based RAG pipeline and expand its applicability to broader NLP tasks, offering scalable and resource-efficient solutions for both research and real-world scenarios.


REFERENCES

An, S., Ma, Z., Lin, Z., Zheng, N., and Lou, J.-G. (2024). Make your LLM fully utilize the context. arXiv preprint arXiv:2404.16811.
Chen, X., Wang, L., Wu, W., Tang, Q., and Liu, Y. (2024). Honest AI: Fine-tuning "small" language models to say "I don't know", and reducing hallucination in RAG. arXiv preprint arXiv:2410.09699.
Di Oliveira, V., Bezerra, Y. F., Weigang, L., Brom, P. C., and Celestino, V. R. (2024). Slim-RAFT: A novel fine-tuning approach to improve cross-linguistic performance for the Mercosur Common Nomenclature. In WEBIST.
Fu, J., Ng, S.-K., Jiang, Z., and Liu, P. (2024). Generative context distillation. arXiv preprint arXiv:2411.15927.
Gogate, N. et al. (2024). Reducing LLM hallucination using knowledge distillation: A case study with Mistral Large and MMLU benchmark. TechRxiv.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Hu, S., Tu, Y., Han, X., et al. (2024). MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395.
Jin, H., Han, X., Yang, J., Jiang, Z., Liu, Z., Chang, C.-Y., Chen, H., and Hu, X. (2024). LLM maybe LongLM: Self-extend LLM context window without tuning. arXiv preprint arXiv:2401.01325.
Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., Miller, H., Zaharia, M., and Potts, C. (2023). DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.
Lin, B., Zhang, C., Peng, T., Zhao, H., Xiao, W., Sun, M., Liu, A., Zhang, Z., Li, L., Qiu, X., et al. (2024). Infinite-LLM: Efficient LLM service for long context with DistAttention and distributed KVCache. arXiv preprint arXiv:2401.02669.
Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., and Farajtabar, M. (2024). GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229.
Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.
Yao, B., Zhang, Y., Li, Q., and Qin, J. (2024). Is sarcasm detection a step-by-step reasoning process in large language models? arXiv preprint arXiv:2407.12725.
Zhang, T., Patil, S. G., Jain, N., Shen, S., Zaharia, M., Stoica, I., and Gonzalez, J. E. (2024a). RAFT: Adapting language model to domain specific RAG. arXiv preprint arXiv:2403.10131.
Zhang, Y., Khalifa, M., Logeswaran, L., et al. (2024b). Small language models need strong verifiers to self-correct reasoning. arXiv preprint arXiv:2404.17140.

APPENDIX

This section presents examples of inferences drawn from the experiments.

A.1 Distillation

The input (Q, C, A):
"""
Instruction: Given the question, the context and the expected answer below, provide relevant quotes from the context that support the answer. Your answer must be just the quotes, not the entire context.
format: ##begin_quote##quote##end_quote## for each quote.
Do not add anything else other than the quotes.
Your turn:
Question: Unlike Xuzhou, where is Rugao under the adminstration of?
Context: Rugao () is a county-level city under the administration of Nantong, Jiangsu province, China, located in [about 200 words...] Shanghai.
Answer: Nantong
Quotes:
"""

And the generated quotes (R):
"""
##begin_quote## Rugao () is a county-level city under the administration of Nantong ##end_quote##
"""

A.2 Quote Train Sample

The input (Q, C):
"""
Instruction: Given the question and the context, provide relevant quotes from the context that support the answer. Your answer must be just the quotes, not the entire context.
format: ##begin_quote##quote##end_quote## for each quote.
Do not add anything else other than the quotes.
Question: What authority manages the regional passenger train service that runs through the same junction as West Amesbury Branch Railroad?


Context: Rapido was the brand name for the Canadian National Railway's (CN) express passenger train service in the Quebec City–Windsor Corridor. "Rapido" service was introduced on [..about 100 lines hide..] The rail service may be replaced if the line is closed due to rail maintenance, a breakdown of a train, a rail accident, strike action, or if the rail service is not economically viable.
Quotes:
"""

The desired output (R):
"""
##begin_quote## The Downeaster is a 145 mi regional passenger train service, managed by the Northern New England Passenger Rail Authority (NNEPRA, created by the State of Maine), and operated by Amtrak. ##end_quote##
##begin_quote## The West Amesbury Branch Railroad was a railroad that once led form Newton Junction, New Hampshire to Merrimac, Massachusetts. Although the railroad does not exist, the Amtrak "Downeaster" line now passes through the Newton Junction station, which is now a pizza restaurant, and most of the railroad, is now a gravel walking trail. ##end_quote##
"""

Table 5 demonstrates an example of model performance before and after training on a test sample for the question: "Which film was produced first, Emil and the Detectives or Toy Story 2?"

Table 5: Golden quote, and before- and after-training quote inferences from a test sample.

Gold:
##begin_quote## Emil and the Detectives is a 1964 film directed by Peter Tewksbury based on the novel by German author Erich Kästner. ##end_quote##
##begin_quote## Toy Story 2 is a 1999 American computer-animated comedy film produced by Pixar Animation Studios for Walt Disney Pictures. ##end_quote##

Before:
##begin_quote## "A Bug's Life", "Monsters, Inc.", "Finding Nemo", "Cars", "The Incredibles", "Ratatouille" ##end_quote##
##begin_quote## "Toy Story 3" (2010) is the third installment in Pixar's "Toy Story" series, and the sequel to 1999's "Toy Story 2". ##end_quote##

After:
##begin_quote## Emil and the Detectives is a 1964 film directed by Peter Tewksbury based on the novel by German author Erich Kästner. ##end_quote##
##begin_quote## Toy Story 2 is a 1999 American computer-animated comedy film produced by Pixar Animation Studios for Walt Disney Pictures. ##end_quote##

A.3 Comparison: Quote vs. Context

An example illustrating the performance comparison between using the full context and extracted quotes.

Question:
"""
Which Walt Disney Pictures film was created first, Finding Dory or The Wild Country?
"""
Context: a 5086-character context about Disney and Pixar films.
Quotes:
"""
##begin_quote## The Wild Country is a 1970 American adventure film produced by Walt Disney Pictures and directed by Robert Totten. ##end_quote##
##begin_quote## Finding Nemo is a 2003 American computer-animated family film produced by Pixar Animation Studios and released by Walt Disney Pictures. ##end_quote##
"""

A detailed comparison of the model's performance, based on answers generated using either the provided context or the golden quotes, is presented in Table 6.

Table 6: Comparison of Q/A results: context vs. quotes.

Model          Context Answer                   Quotes Answer
gpt3.5-turbo   Finding Nemo was created first.  The Wild Country
llama3.2:1b    Finding Dory is created first.   The Wild Country
llama3.2:3b    Finding Dory is created first.   The Wild Country
