LLM INSTRUCTION FINE-TUNING & EVALUATION

IN-CONTEXT LEARNING LIMITATIONS
• May be insufficient for very specific tasks.
• Examples take up space in the context window.

INSTRUCTION FINE-TUNING
The LLM is trained to estimate the next-token probability on a carefully curated dataset of high-quality examples for specific tasks. Model variants differ based on the datasets and tasks used during fine-tuning.
• The LLM generates better completions for a specific task.
• Has potentially high computing requirements.

Steps:
1. Prepare the training data (prompt-completion pairs).
2. Pass examples of training data to the pre-trained LLM (prompt and ground-truth answer).
   Prompt: "Label this review: Amazing product! Sentiment:"
   LLM completion: "Sentiment: Neutral"
   Ground truth: "Sentiment: Positive"
3. Compute the cross-entropy loss for each completion token and backpropagate to adjust the LLM weights, yielding an instruct LLM.

TASK-SPECIFIC FINE-TUNING
Task-specific fine-tuning involves training a pre-trained model on a particular task or domain (e.g., translation) using a dataset tailored for that purpose. Often, good results can be achieved with just a few hundred or a few thousand examples.

Fine-tuning can significantly increase the performance of a model on a specific task, but it can reduce the performance on other tasks ("catastrophic forgetting").

Solutions:
• It might not be an issue if only a single task matters.
• Fine-tune on multiple tasks concurrently (~50K to 100K examples needed).
• Opt for Parameter-Efficient Fine-Tuning (PEFT) instead of full fine-tuning, which involves training only a small number of task-specific adapter layers and parameters.

MULTI-TASK FINE-TUNING
Multi-task fine-tuning diversifies training with examples for multiple tasks (e.g., "Analyze the sentiment", "Translate the text", "Identify entities", "Summarize the text"), guiding the model to perform various tasks. Many examples of each task are needed for training.
Drawback: it requires a lot of data (around 50K to 100K examples).

Example: the FLAN family of models
• FLAN, or Fine-tuned LAnguage Net, provides tailored instructions for refining various models, akin to dessert after pre-training.
• FLAN-T5 is an instruct fine-tuned version of the T5 foundation model, serving as a versatile model for various tasks.
• FLAN-T5 has been fine-tuned on a total of 473 datasets across 146 task categories. For instance, the SAMSum dataset was used for summarization.
• A specialized variant of this model for chat summarization or for custom company usage could be developed through additional fine-tuning on specialized datasets (e.g., DialogSum or custom internal data).

MODEL EVALUATION
Evaluating LLMs is challenging (e.g., various tasks, non-deterministic outputs, equally valid answers with different wordings), hence the need for automated and organized performance assessments. Various approaches exist; here are a few examples:

ROUGE & BLEU SCORE
• Purpose: to evaluate LLMs on narrow tasks (summarization, translation) when a reference is available.
• Based on n-grams; they rely on precision and recall scores (multiple variants).

BERT SCORE
• Purpose: to evaluate LLMs in a task-agnostic manner when a reference is available.
• Based on token-wise comparison: a similarity score is computed between candidate and reference sentences.

LLM-AS-A-JUDGE
• Purpose: to evaluate LLMs in a task-agnostic manner when a reference is available.
• Based on prompting an LLM to assess the equivalence of a generated answer with a ground-truth answer.

To measure and compare LLMs more holistically, use evaluation benchmark datasets specific to model skills, e.g., GLUE, SuperGLUE, MMLU, BIG-bench, HELM.
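
The cross-entropy training step for instruction fine-tuning can be sketched numerically. This is a minimal Python sketch with made-up per-token probabilities: the loss is averaged over completion tokens only, with prompt positions masked out.

```python
import math

def completion_loss(token_probs, prompt_len):
    """Cross-entropy loss averaged over completion tokens only.

    token_probs: per-position probability the model assigned to the
    ground-truth token; prompt positions are excluded from the loss.
    """
    completion = token_probs[prompt_len:]
    return -sum(math.log(p) for p in completion) / len(completion)

# Toy example: 3 prompt tokens, 2 completion tokens.
# A confident model (probabilities near 1) yields a small loss.
probs = [0.9, 0.8, 0.95, 0.6, 0.7]
loss = completion_loss(probs, prompt_len=3)
```

In a real training loop this scalar would be backpropagated through the model to adjust its weights; here only the loss computation itself is shown.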
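
The PEFT adapter idea can be illustrated with a minimal LoRA-style sketch. The nested lists standing in for tensors and the rank-1 sizes are assumptions for illustration; real PEFT methods such as LoRA place learned low-rank matrices inside transformer layers. The point is that only the small matrices A and B would be trained while the base weight W stays frozen.

```python
def matmul(X, Y):
    # Naive matrix product for nested lists.
    return [[sum(x * Y[k][j] for k, x in enumerate(row))
             for j in range(len(Y[0]))] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Frozen 2x2 base weight and a rank-1 adapter: only A and B
# (4 numbers here) would be trained, not the base weights.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [1.0]]   # 2x1
A = [[1.0, 2.0]]     # 1x2
W_eff = add(W, matmul(B, A))  # effective weight W + B A at inference
```

With a rank r adapter on a d×d weight, 2·r·d parameters are trained instead of d², which is why PEFT cuts compute and storage needs so sharply.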
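
A multi-task training dataset simply interleaves prompt-completion pairs from every task. A minimal sketch, where the task pools and prompt templates are hypothetical one-example placeholders:

```python
import random

# Hypothetical per-task example pools; templates are illustrative.
TASKS = {
    "sentiment": [("Analyze the sentiment: Amazing product!", "Positive")],
    "translation": [("Translate the text: Hello", "Bonjour")],
    "ner": [("Identify entities: Bob lives in Paris", "Bob, Paris")],
    "summarization": [("Summarize the text: The cat sat on the mat and slept.",
                       "A cat slept on a mat.")],
}

def build_multitask_dataset(shuffle=True, seed=0):
    """Interleave prompt-completion pairs from every task so a single
    fine-tuning run covers them all."""
    data = [pair for pool in TASKS.values() for pair in pool]
    if shuffle:
        random.Random(seed).shuffle(data)
    return data
```

In practice each pool would hold thousands of examples (hence the ~50K to 100K total), and shuffling prevents the model from seeing tasks in blocks.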
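
The n-gram precision/recall idea behind ROUGE can be sketched with ROUGE-1 (unigram overlap). Whitespace tokenization is a simplifying assumption; real implementations normalize case and often apply stemming.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Unigram-overlap precision, recall, and F1 between two strings."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge1("the cat sat on the mat", "the cat is on the mat")
```

ROUGE-2 and ROUGE-L are variants of the same recipe over bigrams and longest common subsequences; BLEU similarly aggregates clipped n-gram precision across several n.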
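
BERTScore's token-wise comparison can be sketched with greedy cosine matching. The tiny static embedding table below is a stand-in assumption; the real metric uses contextual embeddings from a BERT-family model.

```python
import math

# Toy static embeddings standing in for contextual BERT vectors.
EMB = {
    "car":  [1.0, 0.1],
    "auto": [0.9, 0.2],
    "red":  [0.1, 1.0],
    "blue": [0.0, 1.0],
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def bertscore_f1(candidate, reference):
    """Greedy matching: each token pairs with its most similar
    counterpart; F1 combines both directions."""
    c = [EMB[t] for t in candidate.split()]
    r = [EMB[t] for t in reference.split()]
    precision = sum(max(cos(u, v) for v in r) for u in c) / len(c)
    recall = sum(max(cos(u, v) for u in c) for v in r) / len(r)
    return 2 * precision * recall / (precision + recall)
```

Because matching is done in embedding space, a synonym like "auto" for "car" scores nearly as well as an exact match, which n-gram metrics such as ROUGE would penalize.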
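
LLM-as-a-Judge reduces to building a judging prompt and parsing the verdict. A minimal sketch: the template wording and the YES/NO protocol are illustrative assumptions, and the actual call to a judge LLM is omitted.

```python
def judge_prompt(question, reference, answer):
    """Build an equivalence-judging prompt (illustrative template)."""
    return (
        "You are an impartial judge. Given a question, a reference "
        "answer, and a candidate answer, reply with exactly YES if "
        "the candidate is equivalent to the reference, else NO.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Verdict:"
    )

def parse_verdict(completion):
    # Map the judge's free-form reply to a boolean.
    return completion.strip().upper().startswith("YES")
```

The judge model's reply would be fed to parse_verdict, and accuracy over a labeled set is the resulting metric; unlike ROUGE or BERTScore, this tolerates answers with entirely different wordings.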