UNIVERSIDADE FEDERAL DE MINAS GERAIS
Advanced Seminars on Large Language Models
Introduction to LLMs
Rodrygo L. T. Santos
rodrygo@dcc.ufmg.br
[Cover image by DALL·E 3: silhouettes of a human on the left and a humanoid AI on the right, a white wire connecting their brains through their mouths, symbolizing communication]
Language
A natural ability for humans
◦ Effortless use for communication
◦ Expressive of thoughts, emotions, instructions
A challenge for machines
◦ Ambiguity, context-dependency, nuanced semantics
A milestone towards AGI?
[Video by Sora, generated from the prompt: "Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee."]
Language model
A probability distribution over word sequences
◦ P("Today is Wednesday") ≈ 0.001
◦ P("Today Wednesday is") ≈ 0.0000000000001
◦ P("The eigenvalue is positive") ≈ 0.00001
Also a mechanism for “generating” text
◦ P("Wednesday" | "Today is") > P("blah" | "Today is")
Language model
Ideal (aka full dependence) model
◦ P(w_1 … w_n) = P(w_1) P(w_2 | w_1) … P(w_n | w_1 … w_{n-1})
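For example, applying the chain rule to the three-word sequence from the previous slide (a worked toy expansion, not on the original slide):

P("Today is Wednesday") = P("Today") · P("is" | "Today") · P("Wednesday" | "Today is")

Each factor conditions on the entire preceding history.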
Infeasible in practice
◦ Expensive computation
◦ Poor estimation (data sparsity)
Evolution of language models
Statistical LMs (1950s-1990s)
Tunable dependence via n-grams
3-gram ("trigram")
◦ P(w_1 … w_n) = P(w_1) P(w_2 | w_1) … P(w_n | w_{n-2}, w_{n-1})
2-gram ("bigram")
◦ P(w_1 … w_n) = P(w_1) P(w_2 | w_1) … P(w_n | w_{n-1})
1-gram ("unigram")
◦ P(w_1 … w_n) = P(w_1) P(w_2) … P(w_n)
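A minimal sketch of a bigram model estimated by maximum likelihood from a toy corpus (my own illustration; the corpus and the <s>/</s> boundary tokens are assumptions, not from the slides):

# Toy bigram language model estimated by maximum likelihood counts.
from collections import Counter, defaultdict

corpus = ["today is wednesday", "today is sunny", "yesterday was wednesday"]

bigram_counts = defaultdict(Counter)
unigram_counts = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(words, words[1:]):
        bigram_counts[prev][curr] += 1
        unigram_counts[prev] += 1

def p_sentence(sentence):
    """P(w_1 ... w_n) factorized into bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(words, words[1:]):
        prob *= bigram_counts[prev][curr] / unigram_counts[prev]
    return prob

print(p_sentence("today is wednesday"))   # ~0.33: a plausible word order
print(p_sentence("today wednesday is"))   # 0.0: contains bigrams never seen in training

The second sentence gets probability zero because it contains unseen bigrams, which is the sparsity problem that smoothing addresses below.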
Improved estimation via smoothing
[Plot: probability P(w) per word w, contrasting the maximum likelihood estimate with the smoothed estimate]
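A sketch of add-one (Laplace) smoothing, reusing the toy counts from the bigram sketch above (illustrative only; production n-gram models typically use stronger schemes such as Kneser-Ney):

# Add-one (Laplace) smoothing: (count(prev, curr) + 1) / (count(prev) + V).
# Reuses bigram_counts and unigram_counts from the previous sketch; V is the vocabulary size.
def p_bigram_laplace(curr, prev):
    V = len({w for c in bigram_counts.values() for w in c} | set(unigram_counts))
    return (bigram_counts[prev][curr] + 1) / (unigram_counts[prev] + V)

print(p_bigram_laplace("is", "today"))         # 0.3: seen bigram, mass shrinks slightly
print(p_bigram_laplace("wednesday", "today"))  # 0.1: unseen bigram now gets nonzero mass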
Evolution of language models
Statistical LMs (1950s-1990s) → Neural LMs (2013)

Neurons
[Diagram: input w_{1:n-1} → dense network → output ŵ_n]
Improved word-level representation
◦ From sparse to distributional semantics
◦ Better generalization to unseen data
Context still lacking
◦ Fixed-length input and output
◦ Non-sequential representation
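A rough numpy sketch of such a fixed-window neural LM forward pass (all sizes and the random weights are illustrative assumptions): the last k word embeddings are concatenated and fed through a dense layer, which is why the input length is fixed.

# Fixed-context-window neural LM forward pass (illustrative numpy sketch).
import numpy as np

rng = np.random.default_rng(0)
V, d, k, hidden = 10_000, 64, 4, 128      # vocab size, embedding dim, window, hidden units

E  = rng.normal(size=(V, d))              # word embedding table (dense vectors)
W1 = rng.normal(size=(k * d, hidden))     # dense layer over the concatenated window
W2 = rng.normal(size=(hidden, V))         # output projection to vocabulary logits

def next_word_distribution(context_ids):
    """P(w_n | w_{n-k:n-1}) for a fixed-length context of exactly k word ids."""
    x = E[context_ids].reshape(-1)        # concatenate k embeddings -> (k*d,)
    h = np.tanh(x @ W1)                   # nonlinearity
    logits = h @ W2
    exp = np.exp(logits - logits.max())   # softmax over the vocabulary
    return exp / exp.sum()

probs = next_word_distribution([11, 42, 7, 99])   # arbitrary word ids
print(probs.shape, probs.sum())                   # (10000,) 1.0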
Neurons… with recurrence!
[Diagram: a recurrent network (left) and its unrolled view (right); at each step t, the dense network takes w_t and the previous state h_{t-1}, producing ŵ_{t+1} and the new state h_t]
Neurons… with recurrence!
[Diagram: recurrent network mapping w_{n-1} and state h_{n-1} to ŵ_n]
Sequential blessing
◦ Dynamic state maintains linguistic context
◦ Enables handling variable-length sequences
Sequential curse
◦ Single state as information bottleneck
◦ Inherently non-parallelizable
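A minimal numpy sketch of the recurrence (shapes and random weights are illustrative): a single state vector h is updated token by token, so arbitrary-length input is handled, but all context must pass through that one vector and the loop cannot be parallelized.

# Recurrent LM step: h_t = tanh(x_t W_xh + h_{t-1} W_hh) (illustrative numpy sketch).
import numpy as np

rng = np.random.default_rng(0)
V, d, hidden = 10_000, 64, 128

E    = rng.normal(size=(V, d))            # word embeddings
W_xh = rng.normal(size=(d, hidden)) * 0.1
W_hh = rng.normal(size=(hidden, hidden)) * 0.1
W_hy = rng.normal(size=(hidden, V)) * 0.1

def rnn_lm(word_ids):
    """Process a variable-length sequence one token at a time."""
    h = np.zeros(hidden)                  # single state: the information bottleneck
    for wid in word_ids:                  # inherently sequential loop
        h = np.tanh(E[wid] @ W_xh + h @ W_hh)
    logits = h @ W_hy                     # distribution over the next word
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

print(rnn_lm([11, 42, 7]).shape)          # (10000,) -- works for any sequence length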
Evolution of language models
Statistical LMs (1950s-1990s) → Neural LMs (2013) → Pretrained LMs (2018)
Vaswani et al. (NIPS 2017)
Neurons… with attention!
[Diagram: input w_{1:n-1} → attention → output ŵ_n]
The animal didn't cross the street because it was too ______
Neurons… with attention!
[Diagram: input w_{1:n-1} → attention → output ŵ_n, with attention weights over the context]
The animal didn't cross the street because it was too scared
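A compact numpy sketch of scaled dot-product attention, the core operation behind these slides (toy dimensions and random projections, purely illustrative): every position computes a weighted average over all positions, which is how a token like "it" can draw on "animal" regardless of distance.

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V (illustrative numpy sketch).
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 11, 32, 32              # e.g. the 11 tokens of the example sentence

X   = rng.normal(size=(n, d_model))       # token representations (stand-ins for embeddings)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores  = Q @ K.T / np.sqrt(d_k)          # how much each token attends to every other
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output  = weights @ V                     # context-aware representation for every token

print(weights.shape, output.shape)        # (11, 11) (11, 32)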
Attention is (not) all you need
[Diagram: input w_{1:n-1} → attn → h → dense → output ŵ_n]
◦ Preparation: tokenize; mark position; encode
◦ Enrichment: attend to multiple contexts; add nonlinearities
◦ Prediction: select best output; decode
Transformer
[Diagram: decoder stack of n blocks (attn → h → dense) mapping w_{1:n-1} to ŵ_n]
Effective representation
◦ Can attend to entire context – no bottleneck
◦ Attention heads as representation subspaces
◦ Order retained via positional encoding
Efficient processing
◦ Parallelization across tokens and heads
◦ Much faster training and inference
◦ Scalability to massive training datasets
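Since attention itself is order-agnostic, position must be injected explicitly. A sketch of the sinusoidal positional encoding from the original Transformer paper (illustrative; many later LLMs use learned or rotary encodings instead):

# Sinusoidal positional encoding (Vaswani et al., 2017), illustrative numpy sketch.
import numpy as np

def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]                 # (n, 1)
    i   = np.arange(d_model // 2)[None, :]                # (1, d/2)
    angles = pos / np.power(10_000, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions
    return pe

pe = positional_encoding(n_positions=128, d_model=512)
print(pe.shape)    # (128, 512) -- added to the token embeddings before the first block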
Transformer architectures
[Diagram: three stacks of n blocks each, producing ŵ_{1:m} (encoder-only) or ŵ_n (encoder-decoder, decoder-only) from their inputs]
◦ encoder-only (e.g. BERT (2018))
◦ encoder-decoder (e.g. T5 (2019))
◦ decoder-only (e.g. GPT (2018))
The power of transfer learning
Self-supervised pretraining (expensive)
◦ Standard language modeling objective
◦ Train on massive textual corpora
Supervised fine-tuning (cheap)
◦ Multiple task-specific objectives
◦ Improved performance downstream
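A brief sketch of the recipe with the Hugging Face transformers library (assuming it is installed; the checkpoint name and the two-label sentiment task are illustrative choices, and the actual training loop is omitted):

# Pretrain once (expensive, usually done by others), then fine-tune cheaply downstream.
# Illustrative sketch; checkpoint and task are assumptions, fine-tuning loop omitted.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")        # pretrained LM
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)                                # fresh sentiment head

batch = tokenizer(["I loved this film!"], return_tensors="pt")
logits = model(**batch).logits       # untrained head: fine-tuning would update these
print(logits.shape)                  # torch.Size([1, 2])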
Evolution of language models
Statistical LMs (1950s-1990s) → Neural LMs (2013) → Pretrained LMs (2018) → Large LMs (2020)
Model size vs. time
[Chart: model parameter counts over the 2018-2024 timeline]
◦ GPT-1 (2018): 117M parameters
◦ GPT-2 (2019): 1.5B parameters
◦ GPT-3 (2020): 175B parameters
◦ GPT-4 (2023): 1.76T* parameters
Enabled by
◦ Advent of the Transformer
◦ Availability of massive datasets
◦ Access to powerful computing
The power of scaling
LLMs show improved performance with scale
◦ Increased model size (in trillions of parameters)
◦ Increased training data (in trillions of tokens)
Improvements in next token prediction
◦ But also in unforeseen capabilities!
Instruction following
◦ Prompt: "Classify this review: I loved this film! Sentiment:"
◦ LLM completion: "Positive"
Instruction following
◦ Prompt: "Classify this review: I loved this film! Sentiment:"
◦ LLM completion: "received a very nice book review" (the instruction is not followed; the model merely continues the text)
In-context learning
◦ Prompt:
  Classify this review: I don't like this chair!
  Sentiment: Negative
  Classify this review: I loved this film!
  Sentiment:
◦ LLM completion: "Positive"
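In-context learning requires no weight updates, only a prompt that demonstrates the task. A minimal sketch of assembling such a few-shot prompt (llm_complete is a hypothetical stand-in for whatever completion API or local model is used):

# Building a few-shot (in-context) prompt; no parameters are updated.
def build_prompt(examples, query):
    """Concatenate labeled demonstrations followed by the unlabeled query."""
    parts = [f"Classify this review: {text}\nSentiment: {label}"
             for text, label in examples]
    parts.append(f"Classify this review: {query}\nSentiment:")
    return "\n\n".join(parts)

demos = [("I don't like this chair!", "Negative")]
prompt = build_prompt(demos, "I loved this film!")
print(prompt)
# completion = llm_complete(prompt)   # hypothetical LLM call; expected completion: "Positive"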
Basic, emergent, augmented capabilities!
The challenges of scaling
System challenges
◦ Substantial compute and energy consumption
◦ Continual learning and adaptation
Data challenges
◦ Data quality and representativeness
◦ Low-resource domains and languages
The challenges of scaling
Human challenges
◦ Responsible alignment
◦ Interpretability and explainability
◦ Privacy and security
Course goals
Understand the fundamentals of LLMs
Explore the capabilities and limitations of LLMs
Keep up with the current state of the field
Have a grasp of where the field is headed
Course scope
LLM architectures – Transformers and beyond
LLM lifecycle
◦ Pretraining: data preparation, objectives
◦ Adaptation: instruction, alignment, PEFT/MEFT
◦ Utilization: prompting, in-context, augmentation
◦ Evaluation: language, downstream
Course structure (tentative)
Intro lectures by instructor
Paper seminars by students
◦ 1 group per class (rotating every 2 weeks)
◦ 2 papers per group (30min + 20min discussion)
◦ 2 students per paper

Week   Mon  Wed
18/03  G1   G2
25/03  G3   G4
01/04  G1   G2
08/04  G3   G4
15/04  G1   G2
22/04  G3   G4
Course structure (tentative)
Final paper list and seminar schedule will be available later today for enrollment
Course grading
Seminar presentations
◦ 3x 20% = 60%
Seminar feedback
◦ 21x 1% = 21%
Class participation
◦ 21x 1% = 21%
Course attendance
"Credits for each course will only be granted to students who obtain at least a D grade and who demonstrate effective attendance in at least 75% (seventy-five percent) of the activities in which they are enrolled, with no excusing of absences permitted."
NGPG, art. 65
Course materials: books & surveys
Build a Large Language Model (from Scratch)
by Raschka (2024)
Large Language Models: A Survey
by Minaee et al. (2024)
A Comprehensive Overview of Large Language Models
by Naveed et al. (2024)
Course materials: books & surveys
Efficient Large Language Models: A Survey
by Wan et al. (2024)
A Survey of Large Language Models
by Zhao et al. (2023)
Course materials: courses and tutorials
Generative AI with Large Language Models
by DeepLearning.AI / AWS
Large Language Models
by Databricks
Neural Networks: Zero to Hero
by Karpathy
Pre-course survey
Fill in a short survey describing your past experience
and expectations related to the course
◦ https://forms.gle/7mcatGc5LtAFM2ta7
UNIVERSIDADE FEDERAL DE MINAS GERAIS
Coming next…
Architecture of LLMs
Rodrygo L. T. Santos
rodrygo@dcc.ufmg.br