
CS447: Natural Language Processing

http://courses.grainger.illinois.edu/cs447

Lecture 27: Intro to Large Language Models
Julia Hockenmaier
juliahmr@illinois.edu
Today’s class
Recap: Using RNNs for various NLP tasks

From static to contextual embeddings: ELMo

Recap: Transformers

Subword tokenizations

Early Large Language Models (GPT, BERT)



Recap: Using RNNs for different NLP tasks
RNNs for language generation
AKA “autoregressive generation”

[Figure 9.7: Autoregressive generation with an RNN-based neural language model. The input word at each step (<s>, In, a, hole, …) is passed through the embedding, RNN, and softmax layers to produce a distribution from which the next word is sampled; each sampled word is fed back as the next input.]
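A minimal sketch of this sampling loop in numpy (the toy vocabulary size, the random weights, and the simple Elman-style update are illustrative assumptions, not the lecture's model):

import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                              # toy vocabulary and hidden size (assumptions)
E  = rng.normal(size=(V, d)) * 0.1        # word embeddings
W  = rng.normal(size=(d, d)) * 0.1        # input-to-hidden weights
U  = rng.normal(size=(d, d)) * 0.1        # hidden-to-hidden weights
Wo = rng.normal(size=(V, d)) * 0.1        # hidden-to-output (softmax) weights
BOS, EOS = 0, 1                           # placeholder <s> and </s> token ids

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(max_len=20):
    h, tok, out = np.zeros(d), BOS, []
    for _ in range(max_len):
        h = np.tanh(W @ E[tok] + U @ h)   # RNN step on the embedding of the last word
        p = softmax(Wo @ h)               # distribution over the vocabulary
        tok = int(rng.choice(V, p=p))     # sample the next word
        if tok == EOS:
            break
        out.append(tok)
    return out

print(generate())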
An RNN for Machine Translation
Concatenate the source and target sentences (bitexts) and train a single language model over the combined sequence. Then begin autoregressive generation, asking for a word in the context of the hidden layer from the end of the source input as well as the end-of-sentence marker. Subsequent words are conditioned on the previous hidden state and the embedding of the last word generated.

[Figure 10.2: Training setup for a neural language model approach to machine translation. Source-target bitexts are concatenated (e.g. “there lived a hobbit </s> vivait un hobbit </s>”) and used to train a language model.]

Early efforts using this approach demonstrated surprisingly good results.
RNNs for sequence classification
If we just want to assign one label to the entire sequence, we don’t need to produce output at each time step, so we can use a simpler architecture.

We can use the hidden state of the last word in the sequence as input to a feedforward net:
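A minimal sketch of this idea (the hidden-state matrix H and the classifier parameters Wc, b are placeholders for a trained model):

import numpy as np

def classify_from_last_state(H, Wc, b):
    """H: (T, d) RNN hidden states for the sequence; Wc: (K, d), b: (K,).
    Use only the last hidden state H[-1] as input to a feedforward classifier."""
    logits = Wc @ H[-1] + b
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # probability distribution over the K class labels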
Basic RNNs for sequence labeling
Sequence labeling (e.g. POS tagging):
Assign one label to each element in the sequence.

In an RNN approach to POS tagging, the inputs at each time step are pre-trained word embeddings and the outputs are tag probabilities generated by a softmax layer over the tagset. To generate a tag sequence for a given input, we can run forward inference over the input sequence and select the most likely tag from the softmax at each step.

RNN Architecture:
Each time step has a distribution over output classes.

[Figure 9.8: Part-of-speech tagging as sequence labeling with a simple RNN (input: “Janet will back the bill”). Pre-trained word embeddings serve as inputs and a softmax layer provides a probability distribution over the part-of-speech tags at each time step.]

Extension: add a CRF layer to capture dependencies among labels of adjacent tokens.
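A minimal sketch of greedy decoding for this setup (H, Wt, and tagset are placeholders for a trained model); because the softmax is monotone, the argmax over raw scores picks the same tag as the argmax over probabilities:

import numpy as np

def tag_sequence(H, Wt, tagset):
    """H: (T, d) RNN hidden states, one per input token; Wt: (K, d) output weights;
    tagset: list of K tag names. Pick the most likely tag at each time step."""
    scores = H @ Wt.T                     # (T, K) unnormalized tag scores
    return [tagset[i] for i in scores.argmax(axis=1)]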
ELMo
Embeddings from Language Models
Replace static embeddings (lexicon lookup) with context-dependent embeddings (produced by a neural language model)

=> Each token’s representation is a function of the entire input sentence, computed by a deep (multi-layer) bidirectional language model
=> Return for each token a (task-dependent) linear combination of its representation across layers.
=> Different layers capture different information

Peters et al., NAACL 2018


ELMo
Pre-training:
— Train a multi-layer bidirectional language model with character convolutions on raw text
— Each layer of this language model network computes a vector representation for each token.
— Freeze the language model parameters.

Fine-tuning (for each task):
Train task-dependent softmax weights to combine the layer-wise representations into a single vector for each token, jointly with a task-specific model that uses those vectors.
ELMo’s input token representations
The input token representations are purely character-based: a character CNN, followed by a linear projection to reduce dimensionality

“2048 character n-gram convolutional filters with two highway layers, followed by a linear projection to 512 dimensions”

Advantage over using fixed embeddings:
no UNK tokens, any word can be represented
ELMo’s bidirectional language models
Forward LM: a deep LSTM that goes over the sequence from start to end to predict token tk based on the prefix t1…tk−1:
    p(tk | t1, …, tk−1; Θx, →ΘLSTM, Θs)
Parameters: token embeddings Θx, forward LSTM →ΘLSTM, softmax Θs

Backward LM: a deep LSTM that goes over the sequence from end to start to predict token tk based on the suffix tk+1…tN:
    p(tk | tk+1, …, tN; Θx, ←ΘLSTM, Θs)

Train these LMs jointly, with the same parameters for the token representations and the softmax layer (but not for the LSTMs), maximizing:
    Σ_{k=1..N} [ log p(tk | t1, …, tk−1; Θx, →ΘLSTM, Θs) + log p(tk | tk+1, …, tN; Θx, ←ΘLSTM, Θs) ]
ELMo’s output token representations
Given an input token representation xk, each layer j of the LSTM language models computes a vector representation hk,j for every token tk.

For each token tk, an L-layer biLM computes a set of 2L + 1 representations:
    Rk = { xk^LM, →hk,j^LM, ←hk,j^LM | j = 1, …, L } = { hk,j^LM | j = 0, …, L }
where hk,0^LM is the token layer and hk,j^LM = [ →hk,j^LM ; ←hk,j^LM ] for each biLSTM layer.

For inclusion in a downstream model, ELMo collapses all layers in Rk into a single vector, ELMok = E(Rk; Θe). In the simplest case, ELMo just selects the top layer, E(Rk) = hk,L^LM, as in TagLM (Peters et al., 2017) and CoVe (McCann et al., 2017).

More generally, ELMo learns softmax-normalized weights sj^task and a task-specific scalar γ^task to collapse these vectors into a single task-specific token vector:
    ELMok^task = E(Rk; Θ^task) = γ^task Σ_{j=0..L} sj^task hk,j^LM     (1)
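A minimal sketch of equation (1) for a single token (layer_reps, s_task, and gamma_task stand in for the L+1 layer vectors and the learned task-specific parameters):

import numpy as np

def elmo_task_vector(layer_reps, s_task, gamma_task):
    """layer_reps: list of L+1 vectors h_{k,0}, ..., h_{k,L} for one token;
    s_task: L+1 raw layer weights; gamma_task: task-specific scalar.
    Returns gamma^task * sum_j softmax(s)_j * h_{k,j}."""
    s = np.asarray(s_task, dtype=float)
    s = np.exp(s - s.max())
    s /= s.sum()                                      # softmax-normalized layer weights
    return gamma_task * sum(w * np.asarray(h) for w, h in zip(s, layer_reps))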
Results
ELMo gave improvements on a variety of tasks:
— question answering (SQuAD)
— entailment/natural language inference (SNLI)
— semantic role labeling (SRL)
— coreference resolution (Coref)
— named entity recognition (NER)
— sentiment analysis (SST-5)

TASK   | PREVIOUS SOTA                       | OUR BASELINE | ELMO + BASELINE | INCREASE (ABSOLUTE / RELATIVE)
SQuAD  | Liu et al. (2017)     84.4          | 81.1         | 85.8            | 4.7 / 24.9%
SNLI   | Chen et al. (2017)    88.6          | 88.0         | 88.7 ± 0.17     | 0.7 / 5.8%
SRL    | He et al. (2017)      81.7          | 81.4         | 84.6            | 3.2 / 17.2%
Coref  | Lee et al. (2017)     67.2          | 67.2         | 70.4            | 3.2 / 9.8%
NER    | Peters et al. (2017)  91.93 ± 0.19  | 90.15        | 92.22 ± 0.10    | 2.06 / 21%
SST-5  | McCann et al. (2017)  53.7          | 51.4         | 54.7 ± 0.5      | 3.3 / 6.8%

Table 1: Test set comparison of ELMo-enhanced neural models with state-of-the-art single model baselines across six benchmark NLP tasks. The performance metric varies across tasks – accuracy for SNLI and SST-5; F1 for SQuAD, SRL and NER; average F1 for Coref. Due to the small test sizes for NER and SST-5, we report the mean …
ELMo:
ELMo showed that contextual embeddings are very useful: it outperformed other models on many tasks.
ELMo embeddings could also be concatenated with other token-specific features, depending on the task.

ELMo requires training a task-specific softmax and scalar to predict how best to combine each layer.
Not all layers were equally useful for each task.
Recap: Seq2seq, Transformers
Encoder-Decoder (seq2seq) model
The decoder is a language model that generates an output sequence conditioned on the input sequence.
— Vanilla RNN: condition on the last hidden state
— Attention: condition on all hidden states

[Figure: an encoder-decoder model, with input, hidden, and output layers in both the encoder and the decoder.]

Transformers use Self-Attention
Attention so far (in seq2seq architectures):
In the decoder (which has access to the complete input sequence), compute attention weights over encoder positions that depend on each decoder position.

Self-attention:
If the encoder has access to the complete input sequence, we can also compute attention weights over encoder positions that depend on each encoder position.

Encoder self-attention:
For each encoder position t,
… compute an attention weight for each encoder position s
… renormalize these weights (that depend on t) with a softmax
to get a new weighted average of the input sequence vectors
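A minimal numpy sketch of single-head, unmasked self-attention over one sequence (the projection matrices Wq, Wk, Wv are placeholders for learned parameters; the scaling by sqrt(d_k) follows Vaswani et al.):

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (T, d) input vectors. Returns a new (T, d_v) representation where each
    position t is a weighted average over all positions s of the same sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # one score per (t, s) pair
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # softmax over s, separately for each t
    return w @ V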
Transformer Architecture
Non-recurrent encoder-decoder architecture

— No recurrent hidden states
— Context information captured via attention and positional encodings
— Consists of stacks of layers with various sublayers

Vaswani et al, NIPS 2017


Encoder (Vaswani et al, NIPS 2017)
A stack of N=6 identical layers
All layers and sublayers are 512-dimensional

Each layer consists of two sublayers:
— one multi-head self-attention layer
— one position-wise feed-forward layer

Each sublayer is followed by an “Add & Norm” layer:
… a residual connection x + Sublayer(x) (the input x is added to the output of the sublayer)
… followed by a normalization step (using the mean and standard deviation of its activations):
    LayerNorm(x + Sublayer(x))
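A minimal sketch of this step (without the learned gain and bias that the full LayerNorm also has):

import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's activations using their mean and standard deviation."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))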
Decoder (Vaswani et al, NIPS 2017)
A stack of N=6 identical layers
All layers and sublayers are 512-dimensional

Each layer consists of three sublayers:
— one masked multi-head self-attention layer over the decoder output (masked, i.e. ignoring future tokens)
— one multi-head attention layer over the encoder output
— one position-wise feed-forward layer

Each sublayer has a residual connection and is normalized: LayerNorm(x + Sublayer(x))
Subword Tokenization
BPE Tokenization (Sennrich et al, ACL 2016)
Byte Pair Encoding (Gage, 1994): a compression algorithm that iteratively replaces the most common pair of adjacent bytes with a single, unused byte.

BPE tokenization: introduce new tokens by merging the most common adjacent pairs of tokens.
Start with all characters, plus a special end-of-word character.
Introduce a new token by merging the most common pair of adjacent tokens.
(Assumption: each individual token will still occur in a different context, so we also keep both tokens in the vocabulary.)

Machine translation: train one tokenizer across both languages (better generalization for related languages).
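A minimal sketch of the BPE merge loop on a made-up toy corpus (the word frequencies are invented; '</w>' is the end-of-word symbol):

from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(pair, vocab):
    """Replace every occurrence of the pair with a single merged token."""
    a, b = pair
    merged = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

# Toy corpus: each word is a sequence of space-separated symbols ending in '</w>'.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(5):                                  # perform 5 merges
    best_pair = pair_counts(vocab).most_common(1)[0][0]
    vocab = apply_merge(best_pair, vocab)
    print("merge:", best_pair)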
Wordpiece tokenization (Wu et al, 2016)
Part of Google’s LSTM-based Neural Machine Translation system (https://arxiv.org/pdf/1609.08144.pdf)

Segment words into subtokens (with special word boundary symbols to recover the original tokenization):
Input: Jet makers feud over seat width with big orders at stake
Output: _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake

Training of Wordpiece:
Specify the desired number of tokens, D
Add a word boundary token (at the beginning of words)
Optimization task: greedily merge adjacent characters to improve the log-likelihood of the data until the vocabulary has size D.


Subword Regularization (Kudo, ACL 2018)
Observation: Subword tokenization can be ambiguous. Can this be harnessed?
Approach: Train a (translation) model with (multiple) subword segmentations that are sampled from a character-based unigram language model.

Training the unigram model:
Start with an overly large seed vocabulary V (all possible single-character tokens and many multi-character tokens)
Randomly sample a segmentation from the unigram model
Decide which multi-character words to remove from V based on how much the likelihood decreases by removing them
Stop when the vocabulary is small enough.


GPT
Generative Pre-Training (Radford et al, 2018)
Auto-regressive 12-layer transformer decoder
Each token is only conditioned on the preceding context
BPE tokenization (|V| = 40K), 768 hidden size, 12 attention heads

Pre-trained on raw text as a language model
(maximize the probability of predicting the next word)

Fine-tuned on labeled data (and language modeling):
Include new start, delimiter and end tokens, plus a linear layer added to the last layer’s output for the end token.

Task-specific input transformations: for some tasks, like text classification, the model can be fine-tuned directly. Tasks with structured inputs (ordered sentence pairs for entailment, or document/question/answer triples) are converted into a single ordered token sequence that the pre-trained model can process, followed by a linear+softmax layer; this avoids task-specific architectures on top of the transferred representations. All transformations add randomly initialized start and end tokens. For entailment, the premise p and hypothesis h token sequences are concatenated with a delimiter token ($) in between.

[Figure 1 (Radford et al., 2018): (left) Transformer architecture and training objectives; (right) input transformations for fine-tuning on different tasks.]
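A minimal sketch of the traversal-style input transformation for entailment (the start, delimiter, and end token strings here are placeholders, not GPT's actual special-token vocabulary):

def entailment_input(premise_tokens, hypothesis_tokens,
                     start="<s>", delim="$", end="<e>"):
    """Concatenate premise and hypothesis into one token sequence with
    start, delimiter, and end tokens, as fed to the pre-trained decoder."""
    return [start] + list(premise_tokens) + [delim] + list(hypothesis_tokens) + [end]

print(entailment_input(["a", "man", "is", "sleeping"], ["a", "person", "rests"]))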
BERT
BERT (Devlin et al, NAACL 2019)
Fully bidirectional transformer encoder
BERT-base: 12 layers, hidden size 768, 12 attention heads (110M parameters)
BERT-large: 24 layers, hidden size 1024, 16 attention heads (340M parameters)

Input: sum of token, positional, and segment embeddings
Segment embeddings (A and B): is this token part of sentence A (before [SEP]) or sentence B (after [SEP])?
[CLS] and [SEP] tokens: added during pre-training

Pre-training tasks:
– Masked language modeling
– Next sentence prediction
BERT Input

[CLS] Sentence A [SEP] Sentence B [SEP]



Pre-training tasks
BERT is jointly pre-trained on two tasks:

Next-sentence prediction [based on the CLS token]:
Does sentence B follow sentence A in a real document?

Masked language modeling:
15% of tokens are randomly chosen as masking tokens
80% of the time, a masking token is replaced by [MASK], and the output layer has to predict the original token
10% of the time, a masking token is replaced by a random token
10% of the time, a masking token remains unchanged
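A minimal sketch of this masking procedure (the vocab argument is just a list of candidate replacement tokens; exact sampling details may differ from the original implementation):

import random

def mask_for_mlm(tokens, vocab, mask_token="[MASK]", p_select=0.15, seed=0):
    """Select ~15% of positions as prediction targets; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < p_select:
            targets[i] = tok                      # the output layer must predict this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # random replacement
            # else: keep the original token unchanged
    return corrupted, targets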


Using BERT for classification

[Figure: sentence-pair classification (left) and single-sentence classification (right). Input: [CLS] Tok 1 … Tok N (with [SEP] Tok 1′ … Tok M′ for sentence pairs); BERT produces output vectors C, T1, …, and the class label is predicted from C.]

Add a softmax classifier on the final layer of the [CLS] token
Using BERT for Question-Answering

[Figure: BERT with input [CLS] Question [SEP] Paragraph [SEP]; the model predicts a start/end span over the paragraph tokens.]

Input: [CLS] question [SEP] answer passage [SEP]
Learn to predict a START and an END token on the answer tokens
Represent START and END as H-dimensional vectors S, E
Find the most likely start and end tokens in the answer by computing a softmax over the dot product of all token embeddings Ti and S (or E):
    P(Ti is start) = exp(Ti · S) / Σj exp(Tj · S)
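A minimal sketch of the start/end scoring (T, S, and E are placeholders for the final-layer token vectors and the learned start/end vectors):

import numpy as np

def span_probabilities(T, S, E):
    """T: (N, H) final-layer token vectors; S, E: (H,) start and end vectors.
    P(token i is start) = softmax over i of (T_i . S); analogously for the end."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()
    return softmax(T @ S), softmax(T @ E)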
Using BERT for Sequence Labeling

[Figure: token-level classification with BERT. Input: [CLS] Tok 1 Tok 2 … Tok N; each output vector Ti is mapped to a label (O, B-PER, …).]

Add a softmax classifier to the tokens in the sequence
Fine-tuning BERT
To use BERT on any task, it needs to be fine-tuned:

— Add any new parts to the model (e.g. classifier layers)
This will add new parameters (initialized randomly)

— Retrain the entire model (update all parameters)
More compact BERT models (Turc et al., 2019)
Pre-training and fine-tuning works well on much smaller BERT variants
https://arxiv.org/abs/1908.08962

Additional improvements through knowledge distillation:
– Pre-train a compact model (‘student’) in the standard way
– Train/Fine-tune a large model (‘teacher’) on the target task
– Knowledge distillation step: train the student on noisy task predictions made by the teacher
– Fine-tune the student on actual task data
Students can have more layers (but smaller embeddings) than models trained in the standard way
BERT Variants
RoBERTa (Liu et al., 2019)
Investigates better pre-training for BERT:
Found that BERT was undertrained.
Optimizes hyperparameter choices.
Evaluates the next-sentence prediction task.
RoBERTa outperforms BERT on several tasks.

Pre-training improvements:
Dynamic masking: randomly change which tokens in a sentence get masked (BERT: same tokens in each epoch)
Much larger batch sizes (2K sentences instead of 256)
Use byte-level BPE, not character-level BPE
BART (Lewis et al., ACL 2020)
Combines a bidirectional encoder (like BERT) with an auto-regressive (unidirectional) decoder (like GPT)
Used for classification, generation, translation
Uses the final token of the decoder sequence for classification tasks.

Pre-training: corrupts the (encoder) input with masking, deletion, rotation, permutation, infilling.
The decoder needs to recover the original input.
SentenceBERT (Reimers & Gurevych, EMNLP 2019)
For tasks that require scoring of sentence pairs
(e.g. semantic textual similarity, or entailment recognition)
Motivation: BERT treats sequence pairs as one (long) sequence, but cross-attention across O(2n) words is very slow.

SentenceBERT solution: Siamese network
Run BERT over each sentence independently.
Compute one vector (u and v) for each sentence by (mean or max) pooling over word embeddings or by using the CLS token.
Classification tasks: concatenate u, v, and the element-wise difference |u−v|, and use this as input to a softmax classifier.
Similarity tasks: use the cosine similarity of u and v as the similarity score.
Training: start with BERT, fine-tune the Siamese model on task-specific data.
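A minimal sketch of how the pooled sentence vectors u and v are used (assuming mean pooling; the softmax classifier itself is omitted):

import numpy as np

def mean_pool(token_vectors):
    """Pool BERT token vectors (T, d) into one sentence vector."""
    return np.asarray(token_vectors).mean(axis=0)

def pair_features(u, v):
    """Classification: concatenate (u, v, |u - v|) as classifier input.
    Similarity: cosine similarity of u and v."""
    features = np.concatenate([u, v, np.abs(u - v)])
    cosine = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return features, cosine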