

LLM: validation framework

"The consequences of AI going wrong are serious, so we need to be proactive rather than reactive."
Elon Musk94

Framework

Large Language Models (LLMs) have great potential to transform various industries and applications, but they also pose significant risks that must be addressed. These risks include the generation of misinformation or hallucinations, perpetuation of biases, difficulty in forgetting learned information, ethical and fairness concerns, privacy issues due to misuse, difficulty in interpreting results, and the potential creation of malicious content, among others.

Given the potential impact of these risks, LLMs must be thoroughly validated before deployment in production environments. Validation of LLMs is not only a best practice, but also a regulatory requirement in many jurisdictions. In Europe, the proposed AI Act requires risk assessment and mitigation of AI systems95. At the same time, in the United States, the NIST AI Risk Management Framework96 and the AI Bill of Rights highlight the importance of understanding and addressing the risks inherent in these systems.

Validation of LLMs can be based on the principles established in the discipline of model risk, which focuses97 on assessing and mitigating the risks arising from errors, poor implementation or misuse of models. However, in the case of AI, and particularly LLMs, a broader perspective needs to be taken that encompasses the other risks involved. A comprehensive approach to validation is essential to ensure the safe and responsible use of LLMs.

This holistic approach is embodied in a multidimensional validation framework for LLMs that covers key aspects (Figure 9) such as model risk, data and privacy management, cybersecurity, legal and compliance risks, operational and technology risks, ethics and reputation, and vendor risk, among others. By systematically addressing all of these issues, organizations can proactively identify and mitigate the risks associated with LLMs and lay the foundation for unlocking their potential in a safe and responsible manner.

In LLMs, this risk assessment can be anchored in the following dimensions used in the model risk discipline, adapting the tests according to the nature and use of the LLM:

- Input data: text comprehension98, data quality99.

- Conceptual soundness and model design: selection of the model and its components (e.g., fine-tuning methodologies, database connections, RAG100), and comparison with other models101.

94 Elon Musk (b. 1971), CEO of X, SpaceX and Tesla. South African-American entrepreneur, known for founding or co-founding companies such as Tesla, SpaceX and PayPal, and owner of X (formerly Twitter), a social network that has its own LLM, called Grok.
95 European Parliament (2024). AI Act, Art. 9: "A risk management system shall be established, implemented, documented and maintained in relation to high-risk AI systems. The risk management system [...] shall [...] comprise [...] the estimation and evaluation of risks that may arise when the high-risk AI system is used in accordance with its intended purpose, and under reasonably foreseeable conditions of misuse".
96 NIST (2023): "The decision to commission or deploy an AI system should be based on a contextual assessment of reliability characteristics and relative risks, impacts, costs, and benefits, and should be informed by a broad set of stakeholders".
97 Management Solutions (2014). Model Risk Management: Quantitative and Qualitative Aspects.
98 Imperial et al. (2023).
99 Wettig et al. (2024).
100 RAG (Retrieval-Augmented Generation) is an advanced technique in which a language model searches for relevant information in an external source before generating text. This enriches answers with accurate and current knowledge by combining information retrieval and text generation. By integrating data from external sources, RAG models, such as the RAG-Token and RAG-Sequence models proposed by Lewis et al. (2020), provide more informed and consistent responses, minimizing the risk of generating inaccurate content or 'hallucinations'. This advance represents a significant step towards more reliable and evidence-based artificial intelligence models.
101 Khang (2024).

Figure 9. AI Risks and Regulatory References in the AI Act.

- Model Risk (AI Act Art. 8, 9, 10, 14, 15, 29): MRM policy, inventory, validation guidelines, risk classification, XAI and bias detection.
- Compliance & Legal Risk (AI Act Art. 8, 9): compliance with the AI Act, GDPR, ethical AI frameworks, intellectual property.
- OpRisk, IT Risk & Cybersecurity (AI Act Art. 8, 15): operational risk, AI vulnerabilities, adversarial AI, incident response, overreliance on AI, AI implementation, record keeping.
- Vendor Risk (AI Act Art. 8, 9, 12): third-party screening, AI ethics of the vendor, AI integration, copyright issues.
- ESG & Reputational Risk (AI Act Art. 8, 29a): ESG and ethics, fairness, environmental impact, social impact, reputation.
- Data Management & Data Privacy (AI Act Art. 8, 10): transparency, consent for AI usage, anonymization, record keeping, bias in data, data poisoning.

- Model evaluation and analysis of results: privacy and security of the results102, model accuracy103, consistency104, robustness105, adaptability106, interpretability (XAI)107, ethics, bias and fairness108, toxicity109, comparison against challenger models.

- Implementation and use: human review in use (including monitoring for misuse), error resolution, scalability and efficiency, user acceptance.

- Governance110 and ethics111: governance framework for generative AI, including LLMs.

- Documentation112: completeness of the model documentation.

- Regulatory compliance113: assessment of regulatory requirements (e.g., AI Act).

To ensure the effective and safe use of language models, it is essential to perform a risk assessment that considers both the model itself and its specific use. This will ensure that the model, regardless of its origin (in-house or from a vendor) or customization (fine-tuning), will function properly in its context of use and meet the necessary security, ethical, and regulatory standards.

Validation techniques

When an organization is considering implementing an LLM for a specific use case, it may be beneficial to take a holistic approach that encompasses the key dimensions of the model's lifecycle: data, design, assessment, implementation and use. It is also necessary to assess compliance with applicable regulations, such as the AI Act in the European Union, in a cross-cutting manner.

In each of these dimensions, two sets of complementary techniques allow for a more complete validation (Figure 10):

- Quantitative evaluation metrics (tests): standardized quantitative tests that measure the model's performance on specific tasks. They are predefined benchmarks and metrics for evaluating various aspects of LLM performance after pre-training or during the fine-tuning or instruction-tuning (i.e., reinforcement learning), optimization, prompt engineering, or information retrieval and generation phases. Examples include summarization accuracy, robustness to adversarial attacks, or consistency of responses to similar prompts (a minimal consistency check is sketched below).

- Human evaluation: involves qualitative judgment by experts and end users, such as a human review of a specific sample of LLM prompts and responses to identify errors.

The validation of a specific use of an LLM is therefore carried out by a combination of quantitative (tests) and qualitative (human evaluation) techniques. For each specific use case, it is necessary to design a tailor-made validation approach consisting of a selection of some of these techniques.

102 Nasr (2023).
103 Liang (2023).
104 Elazar (2021).
105 Liu (2023).
106 Dun (2024).
107 Singh (2024).
108 NIST (2023), Oneto (2020), Zhou (2021).
109 Shaikh (2023).
110 Management Solutions (2014). Model Risk Management.
111 Oneto (2020).
112 NIST (2023).
113 European Parliament (2024). AI Act.
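As an illustration of this kind of quantitative test, the sketch below checks the consistency of an LLM's answers to paraphrased prompts by comparing the TF-IDF cosine similarity of the responses. It is a minimal example assuming scikit-learn is available; ask_llm is a hypothetical wrapper around the model being validated.

# Minimal consistency check (sketch): responses to paraphrased prompts
# should be semantically close. Assumes scikit-learn; ask_llm() is a
# hypothetical wrapper around the LLM under validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def ask_llm(prompt: str) -> str:
    # Placeholder for the model call (e.g., an internal API or SDK).
    raise NotImplementedError


def consistency_score(paraphrases: list[str]) -> float:
    """Average pairwise cosine similarity of responses to paraphrased prompts."""
    responses = [ask_llm(p) for p in paraphrases]
    tfidf = TfidfVectorizer().fit_transform(responses)
    sims = cosine_similarity(tfidf)
    n = len(responses)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(sims[i, j] for i, j in pairs) / len(pairs)


prompts = [
    "Summarise the refund policy for premium customers.",
    "What is the refund policy that applies to premium customers?",
    "Explain how refunds work for customers on the premium plan.",
]
# A low score flags inconsistent answers and triggers a case-by-case review.
# score = consistency_score(prompts)

In practice the TF-IDF representation would typically be replaced by the embeddings already used in the LLM pipeline; the test logic remains the same.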

Figure 10. LLM evaluation tests.

1. Input data
   1.1 Data quality (degree of quality of the modeling or application data)
       Validation metrics (examples): Flesch-Kincaid Grade
       Human evaluation (examples): case-by-case review

2. Model design
   2.1 Model design (choice of appropriate models and methodology)
       Validation metrics (examples): review of LLM elements (RAG, input or output filters, prompt definition, fine-tuning, optimization...); comparison with other LLMs
       Human evaluation (examples): A/B testing

3. Model evaluation
   3.1 Privacy and security (respect confidentiality and do not regurgitate personal information)
       Validation metrics (examples): data leakage; PII tests, k-anonymity (a minimal leakage screen is sketched after this figure)
       Human evaluation (examples): record keeping; ethical hacking
   3.2 Accuracy (correctness and relevance of model responses)
       Validation metrics (examples): Q&A (SummaQA, word error rate); information retrieval (SSA, nDCG); summarization (ROUGE); translation (BLEU, Ruby, ROUGE-L); others (QA systems, level of overrides, level of hallucinations...); benchmarks (XSUM, LogiQA, WikiData...)
       Human evaluation (examples): backtesting of overrides; case-by-case review
   3.3 Consistency (correctness and relevance of model responses)
       Validation metrics (examples): cosine similarity; Jaccard similarity index
       Human evaluation (examples): case-by-case review; A/B testing
   3.4 Robustness (resilience to adverse or misleading information)
       Validation metrics (examples): adversarial text generation (TextFooler), regex patterns; benchmarks of adversarial attacks (PromptBench), number of refusals
       Human evaluation (examples): ethical hacking; incident drills
   3.5 Adaptability (ability to learn or adapt to new contexts)
       Validation metrics (examples): LLM performance on new data via zero/one/few-shot learning
       Human evaluation (examples): A/B testing; case-by-case review
   3.6 Explainability (understanding of the decision-making process)
       Validation metrics (examples): SHAP; explainability scores
       Human evaluation (examples): UX tracking; focus groups
   3.7 Biases and fairness (responses without demographic bias)
       Validation metrics (examples): AI Fairness 360 toolkit; WEAT score, demographic parity, word associations...; benchmarks of biases (BBQ...)
       Human evaluation (examples): ethical hacking; focus groups
   3.8 Toxicity (propensity to generate harmful content)
       Validation metrics (examples): Perspective API, Hatebase API; toxicity benchmarks (RealToxicityPrompts, BOLD, etc.)
       Human evaluation (examples): ethical hacking; focus groups

4. Implementation and use
   4.1 Human review and safety of use (avoid harmful or illegal suggestions and include a 'human-in-the-loop' review)
       Validation metrics (examples): risk protocols, safety assessments; human control
       Human evaluation (examples): ethical hacking; focus groups
   4.2 Recovery and error handling (ability to recover from errors and handle unexpected inputs)
       Validation metrics (examples): system recovery tests; error-processing metrics
       Human evaluation (examples): incident drills
   4.3 Scalability (maintain performance with more data or users)
       Validation metrics (examples): stress testing of the system (Apache JMeter...); scalability benchmarks
       Human evaluation (examples): incident drills; A/B testing
   4.4 Efficiency (resource utilization and speed of response)
       Validation metrics (examples): time-to-first-byte (TTFB), GPU/CPU utilization, broadcast inference, memory, latency
       Human evaluation (examples): incident drills
   4.5 User acceptance (user acceptance testing)
       Validation metrics (examples): user requirements checklist, user opt-out; user satisfaction (Net Promoter Score, CSAT)
       Human evaluation (examples): UX tracking; A/B testing
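As an illustration of the privacy and security tests in row 3.1, the following sketch scans a sample of LLM responses for personal-data patterns before any record-keeping or ethical-hacking review. It is a minimal example: the regular expressions and the sample responses are illustrative placeholders, not an exhaustive PII detector.

# Minimal PII leakage screen (sketch): flag responses that appear to contain
# personal data such as e-mail addresses or phone numbers. The patterns below
# are illustrative only; a real test would use a vetted PII/anonymization tool.
import re

PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def flag_pii(responses: list[str]) -> list[tuple[int, str]]:
    """Return (response index, PII type) for every suspected leak."""
    findings = []
    for i, text in enumerate(responses):
        for label, pattern in PII_PATTERNS.items():
            if pattern.search(text):
                findings.append((i, label))
    return findings


# Example: any finding should trigger a case-by-case human review.
sample = ["Your order is confirmed.", "Contact john.doe@example.com for details."]
print(flag_pii(sample))  # -> [(1, 'email')]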

The exact selection of techniques will depend on the particular characteristics of the use case. In particular, several important factors to consider when deciding on the most appropriate techniques are:

- The level of risk and criticality of the tasks to be entrusted to the LLM.

- Whether the LLM is open to the public (in which case ethical hacking becomes particularly relevant) or its use is limited to the internal scope of the organization.

- Whether the LLM processes personal data.

- The line of business or service the LLM will be used for.

Careful analysis of these factors will allow the construction of a robust validation framework tailored to the needs of each LLM application.

Quantitative evaluation metrics

Although this is an emerging field of study, there is a wide range of quantitative metrics that can be used to evaluate LLM performance. Some of these metrics are adaptations of those used in traditional machine learning models, such as accuracy, recall, F1 score, or area under the ROC curve (AUC-ROC). Other metrics are specifically designed to evaluate unique aspects of LLMs, such as the coherence of the generated text, factual fidelity, or language diversity.

In this context, holistic quantitative LLM testing frameworks already exist in Python programming environments, which facilitate the implementation of many of the quantitative validation metrics, such as:

- LLM Comparator114: a tool developed by Google researchers for automatically evaluating and comparing LLMs, which checks the quality of LLM answers.

- HELM115: Holistic Evaluation of Language Models, which compiles evaluation metrics along seven dimensions (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for a set of predefined scenarios.

- ReLM116: a system for validating and querying LLMs, covering the evaluation of language models, memorization, bias, toxicity and language comprehension.

At present, certain validation techniques, such as SHAP-based explainability methods (XAI), some metrics such as ROUGE117, or fairness analyses using demographic parity, do not yet have widely accepted predefined thresholds. In these cases, it is the task of the scientific community and the industry to continue research to establish clear criteria for robust and standardized validation.

114 Kahng (2024).
115 Liang (2023).
116 Kuchnik (2023).
117 Duan (2023).
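As an example of one of these quantitative metrics, the sketch below computes ROUGE scores for an LLM-generated summary against a human reference, assuming the open-source rouge-score Python package; the reference and candidate texts are illustrative placeholders.

# ROUGE-based summarization accuracy (sketch). Assumes the rouge-score
# package (pip install rouge-score); reference and candidate are placeholders.
from rouge_score import rouge_scorer

reference = "The committee approved the budget and postponed the hiring plan."
candidate = "The committee approved the budget but delayed the hiring plan."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

# Each entry holds precision, recall and F-measure; the F-measure is commonly
# compared against a use-case-specific acceptance threshold.
for name, result in scores.items():
    print(name, round(result.fmeasure, 3))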

Figure 11. Some LLM human evaluation techniques.

A. Overrides backtest: count and measure the significance of human modifications to LLM outputs (a minimal statistical sketch is given after this list).

B. Case-by-case check: compare a representative sample (e.g., a minimum of 200 via a Z-test) of LLM responses with human outputs ('ground truth'), including double-blind review.

C. Ethical hacking (aka Red Team): manipulate prompts to force the LLM to produce undesired outputs (incl. PII regurgitation, compliance, prompt engineering, penetration tests, AI vulnerabilities, etc.).

D. A/B testing: conduct parallel trials to evaluate different versions (A and B) or compare with human performance.

E. Focus groups: collect insights on LLM outputs from diverse users (for ethics, cultural appropriateness, discrimination, etc.).

F. User experience (UX) tracking: observe and assess user interactions with the LLM over time or in real time.

G. Incident drills: simulate adverse scenarios to test LLM response and recovery (stress test, backup check, recovery-time measurement, etc.).

H. Record-keeping: review the LLM system's logs and records, ensuring compliance with regulation.
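The overrides backtest (technique A) can be given a simple statistical form: treat each reviewed response as overridden or not, and test whether the observed override rate exceeds an agreed tolerance. The sketch below uses a one-sided one-proportion z-test with SciPy; the 5% tolerance and the counts are illustrative assumptions.

# Overrides backtest (sketch): one-sided z-test on the human override rate.
# Assumes SciPy; the 5% tolerance and the sample counts are illustrative.
from math import sqrt

from scipy.stats import norm


def override_backtest(n_reviewed: int, n_overridden: int,
                      tolerance: float = 0.05, alpha: float = 0.05) -> bool:
    """Return True if the override rate is significantly above the tolerance."""
    p_hat = n_overridden / n_reviewed
    se = sqrt(tolerance * (1 - tolerance) / n_reviewed)
    z = (p_hat - tolerance) / se
    p_value = norm.sf(z)  # one-sided: H1 is "override rate > tolerance"
    return p_value < alpha


# Example: 23 overrides in a sample of 200 reviewed responses (~11.5%).
if override_backtest(n_reviewed=200, n_overridden=23):
    print("Override rate above tolerance: escalate to model revalidation.")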

Human evaluation techniques

While quantitative assessment metrics are more directly implementable due to the multitude of online resources and publications in recent years, human assessment techniques118 are varied and must be constructed based on the specific task119 being performed by the LLM. They include (Figure 11):

- User override backtesting: counting and measuring the importance of human modifications to LLM results (e.g., how many times a sales manager must manually modify customer call summaries generated by an LLM).

- Case-by-case review: comparing a representative sample of LLM responses to user expectations ("ground truth").

- Ethical hacking (Red Team): manipulating prompts to force the LLM to produce undesired results (e.g., regurgitation of personal information, illegal content, penetration testing, vulnerability exploitation).

- A/B testing: comparison to evaluate two versions of the LLM (A and B), or an LLM against a human being.

- Focus groups: gathering opinions from various users on LLM behavior (e.g., ethics, cultural appropriateness, discrimination).

- User experience (UX) tracking: observing and evaluating user interactions with the LLM over time or in real time.

- Incident drills: simulating adverse scenarios to test LLM response (e.g., stress test, backup check, recovery-time measurement).

- Record keeping: reviewing LLM system logs and records to ensure compliance with regulations and the audit trail.

Benchmarks for LLM Evaluation

Most generative artificial intelligence models, including LLMs, are tested against public benchmarks to evaluate their performance on a variety of tasks related to natural language understanding and usage. These tests are used to measure how well the LLM handles specific tasks and mirrors human understanding. Some of these benchmarks include:

- GLUE/SuperGLUE: assesses language comprehension through tasks that measure a model's ability to understand text.

- EleutherAI Language Model Evaluation Harness: performs "few-shot" model evaluation, that is, it evaluates model accuracy with very few training examples.

- ARC (AI2 Reasoning Challenge): tests the model's ability to answer scientific questions that require reasoning.

- HellaSwag: evaluates the model's common sense through tasks that require predicting a coherent story ending.

- MMLU (Massive Multitask Language Understanding): tests the model's accuracy on a wide variety of tasks to assess its multitask understanding.

- TruthfulQA: challenges the model to distinguish between true and false information, assessing its ability to handle truthful data.

- Winogrande: another tool to assess common sense, similar to HellaSwag, but with different methods and emphasis.

- GSM8K: uses mathematical problems designed for students to assess the model's logical-mathematical capability.

A minimal sketch of a benchmark-style accuracy evaluation is given below.

118 Datta, Dickerson (2023).
119 Guzmán (2015).
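The benchmarks above ultimately reduce to scoring model answers against reference answers. The sketch below illustrates the idea with exact-match accuracy on a handful of multiple-choice items; ask_llm is a hypothetical wrapper for the model under test, and the items are illustrative, not drawn from any published benchmark.

# Benchmark-style accuracy (sketch): exact match on multiple-choice items.
# ask_llm() is a hypothetical model wrapper; the items are illustrative only.
from dataclasses import dataclass


@dataclass
class Item:
    question: str
    choices: list[str]
    answer: str  # the correct choice label, e.g. "B"


def ask_llm(question: str, choices: list[str]) -> str:
    # Placeholder: prompt the model to return a single choice label.
    raise NotImplementedError


def accuracy(items: list[Item]) -> float:
    correct = sum(ask_llm(i.question, i.choices).strip() == i.answer for i in items)
    return correct / len(items)


items = [
    Item("Which gas do plants absorb during photosynthesis?",
         ["A) Oxygen", "B) Carbon dioxide", "C) Nitrogen"], "B"),
    Item("2 + 2 * 3 equals?", ["A) 12", "B) 8", "C) 10"], "B"),
]
# acc = accuracy(items)   # compare against the use case's acceptance threshold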

New trends

The field of LLM validation is constantly evolving, driven by the rapid advances in the development of these models and a growing awareness of the importance of ensuring their reliability, fairness and alignment with ethics and regulation.

Below are some of the key emerging trends in this area:

- Explainability of LLMs: As LLMs become more complex and opaque, there is a growing need for mechanisms to understand and explain their inner workings. XAI (eXplainable AI) techniques such as SHAP, LIME, or assigning importance to input tokens are gaining importance in LLM validation. Although a variety of post-hoc techniques for understanding the operation of models at the local and global level are available for traditional models120 (e.g., Anchors, PDP, ICE), and the definition and implementation of inherently interpretable models by construction has proliferated, the implementation of these principles for LLMs is still unresolved.

- Using LLMs to explain LLMs: An emerging trend is to use one LLM to generate explanations for the behavior or responses of another LLM. In other words, one language model is used to interpret and communicate the underlying reasoning of another model in a more understandable way. To enrich these explanations, tools are being developed121 that also incorporate post-hoc analysis techniques.

- Post-hoc interpretability techniques: These techniques address the interpretability of the results at the post-training or fine-tuning stage. They make it possible to identify which parts of the input have most influenced the model response (feature importance), to find similar examples in the training data set (embedding-based similarity), or to design specific prompts that guide the model towards more informative explanations (prompting strategies).

- Attribution scores: As part of post-hoc interpretability122, techniques are being developed to identify which parts of the input text have the greatest influence on the response generated by an LLM. They help to understand which words or phrases are most important for the model. There are different methods for calculating these scores (a minimal perturbation-based sketch is given below):

  - Gradient-based methods: analyze how the gradients (a measure of sensitivity) change for each word as they are propagated back through the neural network.

  - Perturbation-based methods: slightly modify the input text and observe how the model response changes.

  - Interpretation of internal metrics: use metrics calculated by the model itself, such as attention weights in transformers, to determine the importance of each word.

120 Management Solutions (2023). Explainable Artificial Intelligence.
121 Wang (2024).
122 Sarti (2023).
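As a minimal illustration of the perturbation-based approach, the sketch below drops one token at a time and records how much a scoring function changes; score_important is a hypothetical stand-in for the model's probability that a text belongs to the class of interest.

# Perturbation-based attribution (sketch): leave one token out and measure the
# drop in the model's score. score_important() is a hypothetical stand-in for
# the LLM's probability that the text belongs to the class of interest.
def score_important(text: str) -> float:
    # Placeholder for a call to the model under validation.
    raise NotImplementedError


def leave_one_out_attributions(text: str) -> list[tuple[str, float]]:
    """Attribute importance to each token as the score drop when it is removed."""
    tokens = text.split()
    base = score_important(text)
    attributions = []
    for i, token in enumerate(tokens):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        attributions.append((token, base - score_important(reduced)))
    return attributions


# Example: tokens with the largest positive attribution drove the prediction.
# leave_one_out_attributions("The Q2 financial report shows a significant increase in revenue")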

Figure 12. Implementation of SHAP values for text summarization.

Output summary: "The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed. First Minister Nicola Sturgeon visited the area to inspect the damage. Labour Party's deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand. He said it was important to get the flood protection plan right."

[Bar chart of SHAP values (clustering cutoff = 0.5) showing the contribution of input tokens, e.g., "The full cost of damage in Newton Stewart", to the generated summary.]

An example of attribution scoring is the use of the SHAP technique to provide a quantitative measure of the importance of each word to the LLM output, which facilitates its interpretation and understanding (Figure 12).

- Continuous validation and monitoring in production: In addition to pre-deployment evaluation, the practice of continuously monitoring the behavior of LLMs in production, as is done with traditional models, is growing. This makes it possible to detect possible deviations or degradations in their performance over time, and to identify biases or risks that were not initially anticipated.

- Collaborative and participatory validation: Greater involvement of different stakeholders in the validation process is encouraged, including not only technical experts but also end users, regulators, external auditors and representatives of civil society. This plural participation allows for the inclusion of different perspectives and promotes transparency and accountability.

- Ethical and regulatory-aligned validation: In addition to performance metrics, it is becoming increasingly important to assess whether LLM behavior is ethical and in line with human values and regulations. This involves analyzing issues such as fairness, privacy, security, transparency, or the social impact of these systems.

- Machine unlearning: This is an emerging technique123 that allows known information to be unlearned from an LLM without retraining it from scratch. This is achieved, for example, by adapting the hyperparameters of the model to the data to be unlearned. The same principle can be used to remove identified biases. The result is a model that retains its general knowledge but has problematic biases removed, improving its fairness and ethical orientation in an efficient and selective way. Several machine unlearning methods are currently being explored, such as gradient ascent124, the use of fine-tuning125, or the selective modification of certain weights, layers or neurons of the model126.

SHAP (SHapley Additive exPlanations) applied to an LLM

SHAP is a post-hoc explainability method based on cooperative game theory. It assigns each feature (token) an importance value (Shapley value) that represents its contribution to the model prediction.

Formally, let x = (x_1, ..., x_n) be a sequence of input tokens. The prediction of the model is denoted by f(x). The Shapley value \varphi_i for the token x_i is defined as:

\varphi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \bigl( f(S \cup \{i\}) - f(S) \bigr)

where N is the set of all tokens, S is a subset of tokens, and f(S) is the model prediction for the subset S.

Intuitively, the Shapley value \varphi_i captures the average impact of token x_i on the model prediction, considering all possible subsets of tokens.

Example: Consider an LLM trained to classify corporate emails as "important" or "unimportant". Given a vector of input tokens:

x = [The, Q2, financial, report, shows, significant, increase, in, revenue, and, profitability]

the model classifies the email as "important" with f(x) = 0.85.

Using SHAP, the following Shapley values are obtained (one per token, in reading order):

φ1 = 0.01 (The), φ2 = 0.10 (Q2), φ3 = 0.15 (financial), φ4 = 0.20 (report), φ5 = 0.05 (shows), φ6 = 0.10 (significant), φ7 = 0.15 (increase), φ8 = 0.01 (in), φ9 = 0.12 (revenue), φ10 = 0.01 (and), φ11 = 0.08 (profitability)

Interpretation: The tokens "report" (0.20), "financial" (0.15), "increase" (0.15) and "revenue" (0.12) make the highest contribution to the classification of the email as "important". This suggests that the LLM has learned to associate these terms with the importance of the message in a business context.

123 Liu (2024).
124 Jang (2022).
125 Yu (2023).
126 Wu (2023).
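To connect the formula to code, the sketch below computes exact Shapley values by enumerating all token subsets, using a toy keyword-weight scorer in place of a real LLM. The tokens and weights are illustrative assumptions, so the resulting values will not reproduce the figures in the example above; real SHAP tooling relies on approximations because exact enumeration grows exponentially with the number of tokens.

# Exact Shapley values by subset enumeration (sketch). score() is a toy
# stand-in for the model prediction f(S); its keyword weights are illustrative
# assumptions, not the values used in the example above.
from itertools import combinations
from math import factorial

tokens = ["The", "Q2", "financial", "report", "shows", "increase", "in", "revenue"]
WEIGHTS = {"report": 0.20, "financial": 0.15, "increase": 0.15, "revenue": 0.12}


def score(subset: tuple[str, ...]) -> float:
    """Toy f(S): base rate plus a weight for each 'important' keyword present."""
    return min(1.0, 0.05 + sum(WEIGHTS.get(t, 0.0) for t in subset))


def shapley_values(tokens: list[str]) -> dict[str, float]:
    n = len(tokens)
    values = {}
    for i, tok in enumerate(tokens):
        rest = tokens[:i] + tokens[i + 1:]
        phi = 0.0
        for k in range(n):
            for subset in combinations(rest, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi += weight * (score(subset + (tok,)) - score(subset))
        values[tok] = phi
    return values


for token, phi in sorted(shapley_values(tokens).items(), key=lambda kv: -kv[1]):
    print(f"{token:12s} {phi:+.3f}")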
