
LLM Evaluation

The document discusses task-specific fine-tuning, multi-task fine-tuning, and evaluating language models. Task-specific fine-tuning involves training a pre-trained model on a single task using examples for that task, while multi-task fine-tuning trains on multiple tasks concurrently. Evaluation of language models is challenging as there are no perfect metrics and models can provide valid but differently worded answers.


LLM Instruction Fine-Tuning & Evaluation

INSTRUCTION FINE-TUNING

In-context learning limitations:
• May be insufficient for very specific tasks.
• Examples take up space in the context window.

During instruction fine-tuning, the LLM is trained to estimate the next-token probability on a carefully curated dataset of high-quality prompt-completion examples for specific tasks. The LLM generates better completions for a specific task.

Steps:
1. Prepare the training data (prompt-completion pairs).
2. Pass examples of the training data to the LLM (prompt and ground-truth answer) and compare the generated completion with the ground truth. For example, given the prompt "Label this review: Amazing product! Sentiment:", the pre-trained LLM might complete with "Neutral" while the ground truth is "Positive".
3. Compute the cross-entropy loss for each completion token, backpropagate, and adjust the LLM weights.

TASK-SPECIFIC FINE-TUNING

Task-specific fine-tuning trains a pre-trained model on a particular task or domain (e.g., translation: English source text, French completion) using a dataset tailored for that purpose. Often, good results can be achieved with just a few hundred or a few thousand examples.

Fine-tuning can significantly increase the performance of a model on a specific task, but it can reduce performance on other tasks ("catastrophic forgetting").

Solutions:
• It might not be an issue if only a single task matters.
• Fine-tune on multiple tasks concurrently (~50K to 100K examples needed), which has potentially high computing requirements.
• Opt for Parameter-Efficient Fine-Tuning (PEFT) instead of full fine-tuning, which involves training only a small number of task-specific adapter layers and parameters.

MULTI-TASK FINE-TUNING

Multi-task fine-tuning diversifies training with examples for multiple tasks (e.g., "Analyze the sentiment", "Translate the text", "Identify entities", "Summarize the text"), guiding the model to perform various tasks. Many examples of each task are needed for training. Drawback: it requires a lot of data (around 50K to 100K examples). Model variants differ based on the datasets and tasks used during fine-tuning.

The FLAN family of models is an example. FLAN (Fine-tuned LAnguage Net) provides tailored instruction fine-tuning for various models; it is the last step of training, akin to dessert after the main course of pre-training.

FLAN-T5 is an instruct fine-tuned version of the T5 foundation model, serving as a versatile model for various tasks. It has been fine-tuned on a total of 473 datasets across 146 task categories; for instance, the SAMSum dataset was used for summarization. A specialized variant for chat summarization or for custom company usage could be developed through additional fine-tuning on specialized datasets (e.g., DialogSum or custom internal data).

MODEL EVALUATION

Evaluating LLMs is challenging (various tasks, non-deterministic outputs, equally valid answers with different wordings), hence the need for automated and organized performance assessments. Various approaches exist; here are a few examples:

ROUGE & BLEU SCORE
• Purpose: to evaluate LLMs on narrow tasks (summarization, translation) when a reference is available.
• Based on n-grams; rely on precision and recall scores (multiple variants).

BERTSCORE
• Purpose: to evaluate LLMs in a task-agnostic manner when a reference is available.
• Based on token-wise comparison: a similarity score is computed between candidate and reference sentences.

LLM-AS-A-JUDGE
• Purpose: to evaluate LLMs in a task-agnostic manner when a reference is available.
• Based on prompting an LLM to assess the equivalence of a generated answer with a ground-truth answer.

BENCHMARKS
To measure and compare LLMs more holistically, use evaluation benchmark datasets specific to model skills, e.g., GLUE, SuperGLUE, MMLU, BIG-bench, HELM.
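The n-gram overlap idea behind ROUGE can be sketched as a simplified ROUGE-1 (unigram) score in plain Python. This is an illustration only, not the official implementation (which adds stemming, longest-common-subsequence variants, etc.):

```python
from collections import Counter

def rouge1(candidate: str, reference: str):
    """Simplified ROUGE-1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Each unigram counts at most min(candidate count, reference count) times.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# A valid answer worded differently still scores below a perfect 1.0:
p, r, f1 = rouge1("the cat is on the mat", "the cat sat on the mat")
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.833 0.833 0.833
```

This also shows the limitation noted above: an equally valid but differently worded answer is penalized because only surface n-grams are compared.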

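BERTScore's token-wise comparison can be illustrated with toy embeddings. Real BERTScore uses contextual BERT embeddings and optional IDF weighting; the random matrices below are stand-ins for those embeddings:

```python
import numpy as np

def bertscore_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Greedy soft matching on L2-normalized token embeddings.

    cand_emb: (num_candidate_tokens, dim); ref_emb: (num_reference_tokens, dim).
    """
    sim = cand_emb @ ref_emb.T            # pairwise cosine similarities
    recall = sim.max(axis=0).mean()       # best match for each reference token
    precision = sim.max(axis=1).mean()    # best match for each candidate token
    return float(2 * precision * recall / (precision + recall))

# Toy example: 3 tokens in a 4-dimensional embedding space.
rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 4))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # normalize so dot product = cosine

print(bertscore_f1(emb, emb))  # identical token embeddings score ~1.0
```

Because matching happens in embedding space rather than on surface n-grams, two differently worded but semantically close sentences can still score highly, which is why the method is task-agnostic.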
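An LLM-as-a-judge setup boils down to a grading prompt plus verdict parsing. A minimal sketch, where the prompt wording and the EQUIVALENT/NOT_EQUIVALENT protocol are illustrative choices and the actual call to the judge model is left out:

```python
JUDGE_TEMPLATE = """You are an impartial judge. Decide whether the candidate \
answer is equivalent in meaning to the reference answer.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Reply with exactly one word: EQUIVALENT or NOT_EQUIVALENT."""

def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )

def parse_verdict(judge_response: str) -> bool:
    """True if the judge deems the generated answer equivalent to the reference."""
    return judge_response.strip().upper().startswith("EQUIVALENT")

prompt = build_judge_prompt(
    question="What is the capital of France?",
    reference="Paris",
    candidate="The capital city of France is Paris.",
)
# `prompt` would be sent to a strong judge model; here we parse sample replies.
print(parse_verdict("EQUIVALENT"))      # True
print(parse_verdict("NOT_EQUIVALENT"))  # False
```

Constraining the judge to a fixed vocabulary of verdicts keeps parsing trivial; in practice, people often also ask for a short justification before the verdict to improve judgment quality.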