Rise of LLM 05
Rise of LLM 05
36
The rise of Large Language Models : from fundamentals to application
Auge LLM-Eng- Vdef_Maquetación 1 30/05/2024 23:48 Página 37
Given the potential impact of these risks, LLMs must be 4 Conceptual soundness and model design: selection of the
thoroughly validated before deployment in production model and its components (e.g., fine-tuning methodologies,
environments. Validation of LLMs is not only a best practice, but database connections, RAG100), and comparison with other
also a regulatory requirement in many jurisdictions. In Europe, models101.
the proposed AI Act requires risk assessment and mitigation of
AI systems95. At the same time, in the United States, the NIST AI
Risk Management Framework96 and the AI Bill of Rights
highlight the importance of understanding and addressing the 37
AI Risk
Vendor Risk ESG & Reputational Risk
AI Act Art. 8, 9, 12 AI Act Art. 8, 29a
Third party screening, AI ethics of vendor, AI ESG & Ethics, fairness, environmental impact, social
integration, copyright issues Vendor Risk Reputational impact, reputation
Risk
Data
Management &
Data Privacy Data Management & Data Privacy
AI Act Art. 8, 10
Transparency, consent for AI usage, anonymization, record keeping,
bias in data, data poisoning
4 Model evaluation and analysis of results: privacy and In each of these dimensions, two sets of complementary
security of the results102, model accuracy103, consistency104, techniques allow for a more complete validation (Figure 10):
robustness105, adaptability106, interpretability (XAI)107, ethics,
bias and fairness108, toxicity109, comparison against 4 Quantitative evaluation metrics (tests): These standardized
challenger models. quantitative tests measure the model's performance on
specific tasks. They are predefined benchmarks and metrics
4 Implementation and use: human review in use (including for evaluating various LLM performance aspects after pre-
monitoring for misuse), error resolution, scalability and training or during the fine-tuning or instruction tuning (i.e.,
MANAGEMENT SOLUTIONS
Validation techniques
When an organization is considering implementing an LLM for a
specific use case, it may be beneficial to take a holistic approach 102
Nasr (2023).
103
that encompasses the key dimensions of the model's lifecycle: Liang (2023).
104
Elazar (2021).
data, design, assessment, implementation and use. It is also 105
Liu (2023).
necessary to assess compliance with applicable regulations, 106
Dun (2024).
107
such as the AI Act in the European Union, in a cross-cutting Singh (2024).d
108
NIST (2023), Oneto (2020), Zhou (2021).
manner. 109
Shaikh (2023).
110
Management Solutions (2014). Model Risk Management.
111
Oneto (2020).
112
NIST (2023).
113
European Parliament (2024). AI Act.
Auge LLM-Eng- Vdef_Maquetación 1 30/05/2024 23:48 Página 39
Human evaluation
Dimensions Validated aspects Description Validation metrics (examples) (examples)
Ability to learn or adapt to new • LLM performance on new data by Zero/One/Few- • A/B Testing
3.5.Adaptability
contexts shot learning • Case-by-case review
4.2 Recovery and Ability to recover from errors • System recovery tests
• Incident drills
error handling and handle unexpected inputs • Error processing metrics
4.Implementation Maintain performance with • Stress testing of the system, Apache Jmeter... • Incident drills
4.3 Scalability
and use more data or users • Scalability benchmarks • A/B Testing
The exact selection of techniques will depend on the particular 4 LLM Comparator114: a tool developed by Google
characteristics of the use case; and, in particular, several researchers for automatically evaluating and comparing
important factors to consider when deciding on the most LLMs, which checks the quality of LLM answers.
appropriate techniques are:
4 HELM115: Holistic Evaluation of Language Models, which
4 The level of risk and criticality of the tasks to be entrusted to compiles evaluation metrics along seven dimensions
the LLM. (accuracy, calibration, robustness, fairness, bias, toxicity,
and efficiency) for a set of predefined scenarios.
4 Whether the LLM is open to the public (n which case ethical
hacking becomes particularly relevant) or its use is limited 4 ReLM116: LLM validation and query system using language
to the internal scope of the organization. usage, including evaluation of linguistic models,
memorization, bias, toxicity and language comprehension.
4 Whether the LLM processes personal data.
At present, certain validation techniques, such as SHAP-based
4 The line of business or service the LLM will be used for. explainability methods (XAI), some metrics such as ROUGE117 or
fairness analyses using demographic parity, do not yet have
Careful analysis of these factors will allow the construction of a widely accepted predefined thresholds. In these cases, it is the
robust validation framework tailored to the needs of each LLM task of the scientific community and the industry to continue
application. research to establish clear criteria for robust and standardized
validation.
Quantitative evaluation metrics
Although this is an emerging field of study, there is a wide
range of quantitative metrics that can be used to evaluate LLM
performance. Some of these metrics are adaptations of those
used in traditional machine learning models, such as accuracy,
recall, F1 score, or area under the ROC curve (AUC-ROC). Other
metrics are specifically designed to evaluate unique aspects of
MANAGEMENT SOLUTIONS
A Overrides backtest
Count and measure the significance of human
E Focus groups
Collect insights on LLM outputs from diverse
users (for ethics, cultural appropriateness,
modifications to LLM outputs. discrimination, etc.).
B
Case-by-case check
Compare a representative sample (e.g., minimum
of 200 through Z-test1) of LLM responses with
F User experience (UX) tracking
Observe and assess user interactions with the
human outputs (‘ground truth’), incl. double-blind. LLM over time / in real time.
C G
Ethical hacking (aka Red Team) Incident drills
Manipulate prompts to force the LLM to produce Simulate adverse scenarios to test LLM response
undesired outputs (incl. PII regurgitation, and recovery (stress test, check backup, measure
compliance, prompt engineering, penetration tests, recovery time, etc.).
AI vulnerabilities, etc.).
D A/B testing
Conduct parallel trials to evaluate different
versions (A and B) or compare with human
H Record-keeping
Review the LLM system’s logs and records,
performance. ensuring compliance with regulation.
Auge LLM-Eng- Vdef_Maquetación 1 30/05/2024 23:48 Página 41
4 User override backtesting: counting and measuring the 4 GLUE/SuperGLUE: assesses language comprehension
importance of human modifications to LLM results (e.g., through tasks that measure a model's ability to understand
text.
how many times a sales manager must manually modify
customer call summaries generated by an LLM). 4 Eleuther AI Language Model Evaluation Harness: performs
"few-shot" model evaluation, that is, evaluates model
accuracy with very few training examples.
4 Case-by-case review: comparing a representative sample
of LLM responses to user expectations ("ground truth"). 4 ARC (AI2 Reasoning Challenge): tests the model's ability to
answer scientific questions that require reasoning.
4 Ethical hacking (Red Team): manipulating prompts to 4 HellaSwag: evaluates the model's common sense through
force the LLM to produce undesired results (e.g., tasks that require predicting a coherent story ending.
regurgitation of personal information, illegal content, 4 MMLU (Massive Multitask Language Understanding): tests
penetration testing, vulnerability exploitation). the model's accuracy on a variety of tasks to assess its
understanding of multitasking.
4 A/B testing: comparison to evaluate two versions of the 4 TruthfulQA: challenges the model to distinguish between
LLM (A and B), or an LLM against a human being. true and false information, assessing its ability to handle
truthful data.
4 Focus groups: gathering opinions from various users on 4 Winogrande: another tool to assess common sense, similar
LLM behavior, e.g., ethics, cultural appropriateness, to HEllaSwag, but with different methods and emphasis.
discrimination, etc. 4 GSM8K: uses mathematical problems designed for students
to assess the model's logical-mathematical capability.
4 User experience (UX tracking): observing and evaluating
user interactions with the LLM over time or in real time.
118
Datta, Dickerson (2023).
119
Guzmán (2015).
Auge LLM-Eng- Vdef_Maquetación 1 30/05/2024 23:48 Página 42
4 Explainability of LLMs: As LLMs become more complex 4 Attribution scores: As part of post-hoc interpretability122,
and opaque, there is a growing need for mechanisms to techniques are being developed to identify which parts of
understand and explain their inner workings. XAI the input text have the greatest influence on the response
(eXplainable AI) techniques such as SHAP, LIME, or assigning generated by an LLM. They help to understand which words
importance to input tokens are gaining importance in LLM or phrases are most important for the model. There are
validation. Although a variety of post-hoc techniques for different methods for calculating these scores:
understanding the operation of models at the local and
global level are available for traditional models120 (e.g., - Gradient-based methods: Analyze how the gradients (a
Anchors, PDP, ICE), and the definition and implementation measure of sensitivity) change for each word as it
of inherently interpretable models by construction has moves back through the neural network.
proliferated, the implementation of these principles for
LLMs is still unresolved. - Perturbation-based methods: Slightly modify the input
text and observe how the model response changes.
4 Using LLMs to explain LLMs: An emerging trend is to use
one LLM to generate explanations for the behavior or - Interpretation of internal metrics: Use metrics calculated
responses of another LLM. In other words, one language by the model itself, such as attention weights in
model is used to interpret and communicate the underlying transformers, to determine the importance of each
reasoning of another model in a more understandable way. word.
To enrich these explanations, tools are being developed121
MANAGEMENT SOLUTIONS
120
Management Solutions (2023). Explainable Artificial Intelligence.
121
Wang (2024).
122
42 Sarti (2023).
The rise of Large Language Models : from fundamentals to application
Output summary: “The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed . First Minister Nicola Sturgeon
visited the area to inspect the damage. Labour Party 's deputy Scottish leader Alex Row ley was in Haw ick on Monday to see the situation first hand.
He said it was important to get the flood protection plan right”
+1,81
The +1,81
Clustering cutoff = 0,5
full +1,81
cost +1,81
An example of attribution scoring is the use of the SHAP SHAP (SHapley Additive
technique to provide a quantitative measure of the importance exPlanations) applied to an LLM
of each word to the LLM output, which facilitates its
interpretation and understanding (Figure 12).
SHAP is a post-hoc explainability method based on cooperative
game theory. It assigns each feature (token) an importance
4 Continuous validation and monitoring in production: In value (Shapley value) that represents its contribution to the
addition to pre-deployment evaluation, the practice of model prediction.
continuously monitoring the behavior of LLMs in Formally, let x = (x1,…,xn) be a sequence of input tokens. The
production, as is done with traditional models, is growing. prediction of the model is denoted by f(x). The Shapley value φ
This makes it possible to detect possible deviations or value for the token xi is defined as:
degradations in their performance over time, and identify
biases or risks that were not initially anticipated.
φ2 = 0.2 (report)
4 Machine unlearning: This is an emerging technique123 that
allows unlearning "known information from a LLM without φ3 = 0.15 (financial)
retraining it from scratch. This is achieved, for example, by
φ4 = 0.02 (from)
adapting the hyperparameters of the model to the data to
be unlearned. The same principle can be used to remove φ5 = 0.1 (Q2)
identified biases. The result is a model that retains its φ6 = 0.05 (show)
general knowledge but has problematic biases removed,
φ7 = 0.01 (a) 43
improving its fairness and ethical orientation in an efficient
and selective way. Several machine unlearning methods are φ8 = 0.15 (increase)
currently being explored, such as gradient ascent124, the use φ9 = 0.1 (significant)
of fine-tuning125 or selective modification of certain weights,
φ10 = 0.01 (in)
layers or neurons of the model126.
φ11 = 0.02 (th)
123
Liu (2024).
124
Jang (2022).
125
Yu (2023).
126
Wu (2023)