
LLM Evaluation

The document discusses task-specific fine-tuning, multi-task fine-tuning, and evaluating language models. Task-specific fine-tuning involves training a pre-trained model on a single task using examples for that task, while multi-task fine-tuning trains on multiple tasks concurrently. Evaluation of language models is challenging as there are no perfect metrics and models can provide valid but differently worded answers.


LLM Instruction Fine-Tuning & Evaluation

INSTRUCTION FINE-TUNING

In-context learning limitations:
• May be insufficient for very specific tasks.
• Examples take up space in the context window.

During instruction fine-tuning, the LLM is trained to estimate the next-token probability on a carefully curated dataset of high-quality prompt-completion examples for specific tasks. The LLM generates better completions for a specific task.

Steps:
1. Prepare the training data (prompt-completion pairs).
2. Pass examples of the training data to the LLM (prompt and ground-truth answer) and compare the generated completion with the ground truth. For example, given the prompt "Label this review: Amazing product! Sentiment:", the pre-trained LLM might complete with "Neutral" while the ground truth is "Positive".
3. Compute the cross-entropy loss for each completion token, backpropagate, and adjust the LLM weights.

TASK-SPECIFIC FINE-TUNING

Task-specific fine-tuning trains a pre-trained model on a particular task or domain (e.g., translation: English source text, French completion) using a dataset tailored for that purpose. Often, good results can be achieved with just a few hundred or a few thousand examples.

Fine-tuning can significantly increase the performance of a model on a specific task, but it can reduce performance on other tasks ("catastrophic forgetting").

Solutions:
• It might not be an issue if only a single task matters.
• Fine-tune on multiple tasks concurrently (~50K to 100K examples needed), which has potentially high computing requirements.
• Opt for Parameter-Efficient Fine-Tuning (PEFT) instead of full fine-tuning, which involves training only a small number of task-specific adapter layers and parameters.

MULTI-TASK FINE-TUNING

Multi-task fine-tuning diversifies training with examples for multiple tasks (e.g., "Analyze the sentiment", "Translate the text", "Identify entities", "Summarize the text"), guiding the model to perform various tasks. Many examples of each task are needed for training. Drawback: it requires a lot of data (around 50K to 100K examples). Model variants differ based on the datasets and tasks used during fine-tuning.

The FLAN family of models is an example. FLAN (Fine-tuned LAnguage Net) provides tailored instruction fine-tuning for various models; it is the last step of training, akin to dessert after the main course of pre-training.

FLAN-T5 is an instruct fine-tuned version of the T5 foundation model, serving as a versatile model for various tasks. It has been fine-tuned on a total of 473 datasets across 146 task categories; for instance, the SAMSum dataset was used for summarization. A specialized variant for chat summarization or for custom company usage could be developed through additional fine-tuning on specialized datasets (e.g., DialogSum or custom internal data).

MODEL EVALUATION

Evaluating LLMs is challenging (various tasks, non-deterministic outputs, equally valid answers with different wordings), hence the need for automated and organized performance assessments. Various approaches exist; here are a few examples:

ROUGE & BLEU SCORE
• Purpose: to evaluate LLMs on narrow tasks (summarization, translation) when a reference is available.
• Based on n-grams; rely on precision and recall scores (multiple variants).

BERTSCORE
• Purpose: to evaluate LLMs in a task-agnostic manner when a reference is available.
• Based on token-wise comparison: a similarity score is computed between candidate and reference sentences.

LLM-AS-A-JUDGE
• Purpose: to evaluate LLMs in a task-agnostic manner when a reference is available.
• Based on prompting an LLM to assess the equivalence of a generated answer with a ground-truth answer.

BENCHMARKS
To measure and compare LLMs more holistically, use evaluation benchmark datasets specific to model skills, e.g., GLUE, SuperGLUE, MMLU, BIG-bench, HELM.
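The n-gram overlap idea behind ROUGE can be sketched as a simplified ROUGE-1 (unigram) score in plain Python. This is an illustration only, not the official implementation (which adds stemming, longest-common-subsequence variants, etc.):

```python
from collections import Counter

def rouge1(candidate: str, reference: str):
    """Simplified ROUGE-1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Each unigram counts at most min(candidate count, reference count) times.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# A valid answer worded differently still scores below a perfect 1.0:
p, r, f1 = rouge1("the cat is on the mat", "the cat sat on the mat")
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.833 0.833 0.833
```

This also shows the limitation noted above: an equally valid but differently worded answer is penalized because only surface n-grams are compared.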

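BERTScore's token-wise comparison can be illustrated with toy embeddings. Real BERTScore uses contextual BERT embeddings and optional IDF weighting; the random matrices below are stand-ins for those embeddings:

```python
import numpy as np

def bertscore_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Greedy soft matching on L2-normalized token embeddings.

    cand_emb: (num_candidate_tokens, dim); ref_emb: (num_reference_tokens, dim).
    """
    sim = cand_emb @ ref_emb.T            # pairwise cosine similarities
    recall = sim.max(axis=0).mean()       # best match for each reference token
    precision = sim.max(axis=1).mean()    # best match for each candidate token
    return float(2 * precision * recall / (precision + recall))

# Toy example: 3 tokens in a 4-dimensional embedding space.
rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 4))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # normalize so dot product = cosine

print(bertscore_f1(emb, emb))  # identical token embeddings score ~1.0
```

Because matching happens in embedding space rather than on surface n-grams, two differently worded but semantically close sentences can still score highly, which is why the method is task-agnostic.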
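An LLM-as-a-judge setup boils down to a grading prompt plus verdict parsing. A minimal sketch, where the prompt wording and the EQUIVALENT/NOT_EQUIVALENT protocol are illustrative choices and the actual call to the judge model is left out:

```python
JUDGE_TEMPLATE = """You are an impartial judge. Decide whether the candidate \
answer is equivalent in meaning to the reference answer.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Reply with exactly one word: EQUIVALENT or NOT_EQUIVALENT."""

def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )

def parse_verdict(judge_response: str) -> bool:
    """True if the judge deems the generated answer equivalent to the reference."""
    return judge_response.strip().upper().startswith("EQUIVALENT")

prompt = build_judge_prompt(
    question="What is the capital of France?",
    reference="Paris",
    candidate="The capital city of France is Paris.",
)
# `prompt` would be sent to a strong judge model; here we parse sample replies.
print(parse_verdict("EQUIVALENT"))      # True
print(parse_verdict("NOT_EQUIVALENT"))  # False
```

Constraining the judge to a fixed vocabulary of verdicts keeps parsing trivial; in practice, people often also ask for a short justification before the verdict to improve judgment quality.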