Intro to Large Language Models
http://courses.grainger.illinois.edu/cs447
Lecture 27: Intro to Large Language Models
Julia Hockenmaier
juliahmr@illinois.edu
Today’s class
Recap: Using RNNs for various NLP tasks
Recap: Transformers
Subword tokenizations
[Figure 10.2: Training setup for a neural language model approach to machine translation. Source-target bitexts are concatenated and used to train a language model. The figure shows source and target tokens passing through embedding, RNN, and softmax layers.]
Basic RNNs for sequence labeling
[Figure 9.7: Autoregressive generation with an RNN-based neural language model.]
[Figure 9.8: Part-of-speech tagging as sequence labeling with a simple RNN. Pre-trained word embeddings serve as inputs and a softmax layer provides a probability distribution over the part-of-speech tags as output at each time step.]
Extension: add a CRF layer to capture dependencies among the labels of adjacent tokens.
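As a concrete illustration, here is a minimal PyTorch sketch of the tagger in Figure 9.8: word embeddings feed a simple RNN, and a softmax layer over the tag set is applied at each time step. The class name, dimensions, and tag-set size are illustrative, not taken from the figure.

```python
# A minimal sketch of an RNN sequence labeler: embedding -> RNN -> softmax
# over tags at each time step. All sizes are illustrative.
import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=50, hidden_dim=64, n_tags=17):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)      # word embeddings
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_tags)          # softmax layer over the tag set

    def forward(self, token_ids):                         # token_ids: (batch, seq_len)
        h, _ = self.rnn(self.emb(token_ids))
        return torch.log_softmax(self.out(h), dim=-1)     # (batch, seq_len, n_tags)

tagger = RNNTagger()
tokens = torch.randint(0, 5000, (1, 6))                   # one 6-token sentence
print(tagger(tokens).shape)                               # torch.Size([1, 6, 17])
```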
ELMo
Embeddings from Language Models
Replace static embeddings (lexicon lookup) with context-dependent embeddings (produced by a neural language model).
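A minimal sketch of the difference, assuming a toy vocabulary and a small BiLSTM standing in for the language model: the static lookup returns the same vector for a word type in every sentence, while the encoder output depends on context. None of these names or sizes come from ELMo itself.

```python
# Contrast a static lookup with a context-dependent encoder (illustrative only).
import torch
import torch.nn as nn

vocab = {"the": 0, "bank": 1, "river": 2, "money": 3}
static = nn.Embedding(len(vocab), 8)                       # lexicon lookup: one vector per type
encoder = nn.LSTM(8, 8, batch_first=True, bidirectional=True)

def contextual_embeddings(token_ids):
    # The static vector for "bank" is identical in every sentence;
    # the LSTM output for "bank" depends on the surrounding tokens.
    x = static(token_ids)                                  # (batch, seq, 8)
    h, _ = encoder(x)                                      # (batch, seq, 16)
    return h

s1 = torch.tensor([[vocab["the"], vocab["river"], vocab["bank"]]])
s2 = torch.tensor([[vocab["the"], vocab["money"], vocab["bank"]]])
print(torch.allclose(static(s1)[0, 2], static(s2)[0, 2]))              # True: same static vector
print(torch.allclose(contextual_embeddings(s1)[0, 2],
                     contextual_embeddings(s2)[0, 2]))                 # False: context-dependent
```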
ELMo’s input token representations
The input token representations are purely character-based: a character CNN, followed by a linear projection to reduce dimensionality.
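A rough sketch of such a character-based token encoder (character embeddings, 1D convolutions, max-pooling, linear projection). The filter sizes and dimensions are illustrative; ELMo's actual configuration also includes highway layers, omitted here.

```python
# Character CNN token encoder sketch: char embeddings -> Conv1d -> max-pool -> projection.
import torch
import torch.nn as nn

class CharCNNTokenEncoder(nn.Module):
    def __init__(self, n_chars=262, char_dim=16, n_filters=32,
                 kernel_sizes=(2, 3, 4), out_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, n_filters, k) for k in kernel_sizes])
        self.proj = nn.Linear(n_filters * len(kernel_sizes), out_dim)

    def forward(self, char_ids):
        # char_ids: (n_tokens, max_chars_per_token)
        x = self.char_emb(char_ids).transpose(1, 2)        # (n_tokens, char_dim, n_chars)
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        return self.proj(torch.cat(pooled, dim=1))         # (n_tokens, out_dim)

enc = CharCNNTokenEncoder()
tokens = torch.randint(0, 262, (5, 10))                    # 5 tokens, 10 characters each
print(enc(tokens).shape)                                   # torch.Size([5, 64])
```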
Train the forward and backward LMs jointly, with the same parameters for the token representations and the softmax layer (but not for the LSTMs):
$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) \Big)$$
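A minimal sketch of this objective with toy dimensions: the token embeddings (Θ_x) and the softmax layer (Θ_s) are shared, while the forward and backward LSTMs have separate parameters.

```python
# Joint biLM log-likelihood sketch: shared embeddings and softmax, separate LSTMs.
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D, H = 1000, 32, 64
emb = nn.Embedding(V, D)                         # Theta_x (shared)
fwd_lstm = nn.LSTM(D, H, batch_first=True)       # forward LSTM parameters
bwd_lstm = nn.LSTM(D, H, batch_first=True)       # backward LSTM parameters
softmax = nn.Linear(H, V)                        # Theta_s (shared)

def biLM_log_likelihood(tokens):                 # tokens: (1, N)
    x = emb(tokens)
    # forward LM: predict t_k from t_1 .. t_{k-1}
    h_f, _ = fwd_lstm(x[:, :-1])
    ll_fwd = -F.cross_entropy(softmax(h_f).squeeze(0), tokens[0, 1:], reduction="sum")
    # backward LM: predict t_k from t_{k+1} .. t_N (run the LSTM over the reversed sequence)
    h_b, _ = bwd_lstm(x.flip(1)[:, :-1])
    ll_bwd = -F.cross_entropy(softmax(h_b).squeeze(0), tokens.flip(1)[0, 1:], reduction="sum")
    return ll_fwd + ll_bwd                       # maximized during training

tokens = torch.randint(0, V, (1, 12))
print(biLM_log_likelihood(tokens))
```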
ELMo's output token representations

Given an input token representation x_k, each layer j of the LSTM language models computes a vector representation h^{LM}_{k,j} for every token t_k. ELMo learns softmax weights s_j and a task-specific scalar γ to collapse these vectors into a single task-specific token vector.

From Peters et al. (2018): For each token t_k, an L-layer biLM computes a set of 2L + 1 representations

$$R_k = \{\, x_k^{LM},\ \overrightarrow{h}_{k,j}^{LM},\ \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \,\} = \{\, h_{k,j}^{LM} \mid j = 0, \ldots, L \,\},$$

where $h_{k,0}^{LM}$ is the token layer and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$ for each biLSTM layer.

For inclusion in a downstream model, ELMo collapses all layers in $R_k$ into a single vector, $\mathrm{ELMo}_k = E(R_k; \Theta_e)$. In the simplest case, ELMo just selects the top layer, $E(R_k) = h_{k,L}^{LM}$, as in TagLM (Peters et al., 2017) and CoVe (McCann et al., 2017). More generally, we compute a task-specific weighting of all biLM layers:

$$\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM} \qquad (1)$$

In (1), $s^{task}$ are softmax-normalized weights and $\gamma^{task}$ is a task-specific scalar.
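A minimal sketch of the layer collapse in equation (1): softmax-normalized weights s_j over the layer representations plus the task-specific scalar γ. The class name and dimensions are illustrative.

```python
# ELMo-style scalar mix of biLM layer representations (equation (1) above).
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, n_layers):                        # n_layers = L + 1 (incl. token layer)
        super().__init__()
        self.s = nn.Parameter(torch.zeros(n_layers))     # softmax-normalized per-layer weights
        self.gamma = nn.Parameter(torch.ones(1))         # task-specific scalar

    def forward(self, layer_reps):
        # layer_reps: (n_layers, seq_len, dim) -- h^{LM}_{k,j} for j = 0..L
        w = torch.softmax(self.s, dim=0)                 # s_j^{task}
        return self.gamma * (w[:, None, None] * layer_reps).sum(dim=0)

L, seq_len, dim = 2, 7, 16
mix = ScalarMix(L + 1)
reps = torch.randn(L + 1, seq_len, dim)                  # token layer + L biLSTM layers
print(mix(reps).shape)                                   # torch.Size([7, 16])
```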
Results
ELMo gave improvements on a variety of tasks:
— question answering (SQuAD)
— entailment/natural language inference (SNLI)
— semantic role labeling (SRL)
— coreference resolution (Coref)
— named entity recognition (NER)
— sentiment analysis (SST-5)

TASK    PREVIOUS SOTA                        OUR BASELINE   ELMO + BASELINE   INCREASE (ABSOLUTE / RELATIVE)
SQuAD   Liu et al. (2017)     84.4           81.1           85.8              4.7 / 24.9%
SNLI    Chen et al. (2017)    88.6           88.0           88.7 ± 0.17       0.7 / 5.8%
SRL     He et al. (2017)      81.7           81.4           84.6              3.2 / 17.2%
Coref   Lee et al. (2017)     67.2           67.2           70.4              3.2 / 9.8%
NER     Peters et al. (2017)  91.93 ± 0.19   90.15          92.22 ± 0.10      2.06 / 21%
SST-5   McCann et al. (2017)  53.7           51.4           54.7 ± 0.5        3.3 / 6.8%

Table 1 (Peters et al., 2018): Test set comparison of ELMo-enhanced neural models with state-of-the-art single model baselines across six benchmark NLP tasks. The performance metric varies across tasks: accuracy for SNLI and SST-5; F1 for SQuAD, SRL and NER; average F1 for Coref. Due to the small test sizes for NER and SST-5, the mean and standard deviation across multiple runs are reported.
ELMo:
ELMo showed that contextual embeddings are very useful: it outperformed other models on many tasks.
ELMo embeddings could also be concatenated with other token-specific features, depending on the task.
Encoder-Decoder
[Figure: encoder-decoder network with input, hidden, and output layers.]
Self-attention:
If the encoder has access to the complete input sequence, we can also compute attention weights over encoder positions that depend on each encoder position (encoder self-attention):
For each encoder position t…
…compute an attention weight for each encoder position s
…renormalize these weights (that depend on t) with a softmax
to get a new weighted avg. of the input sequence vectors
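A minimal sketch of this computation, using plain dot-product scores for simplicity (the Transformer adds learned query/key/value projections and scaling, which are omitted here).

```python
# Basic self-attention: score every pair of positions, softmax over positions s,
# then take a weighted average of the input vectors for each position t.
import numpy as np

def self_attention(X):
    # X: (seq_len, dim) -- one vector per encoder position
    scores = X @ X.T                                       # score(t, s) for all pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over positions s
    return weights @ X                                     # weighted avg. per position t

X = np.random.randn(5, 8)
out = self_attention(X)
print(out.shape)                                           # (5, 8): one contextualized vector per position
```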
Transformer Architecture
Non-recurrent encoder-decoder architecture:
— No hidden states
— Context information captured via attention and positional encodings
— Consists of stacks of layers with various sublayers
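As one concrete example of positional encodings, here is a sketch of the sinusoidal encodings used in the original Transformer; the sequence length and model dimension are illustrative, and the model simply adds these to the token embeddings since there is no recurrence.

```python
# Sinusoidal positional encodings: sin on even dimensions, cos on odd dimensions.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions
    return pe

print(positional_encoding(seq_len=6, d_model=8).shape)     # (6, 8)
```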
Training of Wordpiece:
Specify desired number of tokens, D
Add a word boundary token (at the beginning of words)
Optimization task: greedily merge adjacent symbols (initially single characters) to improve the log-likelihood of the data until the vocabulary has size D.
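A toy sketch of this greedy merge loop. Exact WordPiece picks the merge that most improves the training-data likelihood; the score used here, count(a,b) / (count(a) · count(b)), is a common approximation of that criterion. The tiny corpus and the "_" boundary marker are illustrative.

```python
# Toy WordPiece-style training: start from characters, greedily merge the
# highest-scoring adjacent pair until the vocabulary reaches the desired size D.
from collections import Counter

def merge_pair(word, pair, merged):
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(merged); i += 2
        else:
            out.append(word[i]); i += 1
    return out

corpus = ["_low", "_lower", "_lowest", "_new", "_newer"]
words = [list(w) for w in corpus]                 # words as sequences of single characters
D = 15                                            # desired vocabulary size

while len({s for w in words for s in w}) < D:
    sym_counts = Counter(s for w in words for s in w)
    pair_counts = Counter((w[i], w[i + 1]) for w in words for i in range(len(w) - 1))
    if not pair_counts:
        break
    # greedily pick the merge that most improves the (approximate) likelihood
    a, b = max(pair_counts,
               key=lambda p: pair_counts[p] / (sym_counts[p[0]] * sym_counts[p[1]]))
    words = [merge_pair(w, (a, b), a + b) for w in words]

print({s for w in words for s in w})              # learned subword vocabulary
```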
GPT (Radford et al., 2018)
Auto-regressive 12-layer transformer decoder: each token is only conditioned on preceding context.
BPE tokenization (|V| = 40K), 768 hidden size, 12 attention heads.
Pre-trained on raw text as a language model (maximize the probability of predicting the next word).
Fine-tuned on labeled data (and language modeling).
Input transformations: include new start, delimiter and end tokens, plus a linear layer added to the last layer's output at the end token.

From the paper, Figure 1: "(left) Transformer architecture and training objectives used in this work. (right) Input transformations for fine-tuning on different tasks. We convert all structured inputs into token sequences to be processed by our pre-trained model, followed by a linear+softmax layer."

From Section 3.3, Task-specific input transformations: "For some tasks, like text classification, we can directly fine-tune our model as described above. Certain other tasks, like question answering or textual entailment, have structured inputs such as ordered sentence pairs, or triplets of document, question, and answers. Since our pre-trained model was trained on contiguous sequences of text, we require some modifications to apply it to these tasks. Previous work proposed learning task specific architectures on top of transferred representations [44]. Such an approach re-introduces a significant amount of task-specific customization and does not use transfer learning for these additional architectural components. Instead, we use a traversal-style approach [52], where we convert structured inputs into an ordered sequence that our pre-trained model can process. These input transformations allow us to avoid making extensive changes to the architecture across tasks. We provide a brief description of these input transformations below and Figure 1 provides a visual illustration. All transformations include adding randomly initialized start and end tokens (⟨s⟩, ⟨e⟩)."
Textual entailment For entailment tasks, we concatenate the premise p and hypothesis h token
sequences, with a delimiter token ($) in between.
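A minimal sketch of this input transformation; the token strings for the start, delimiter, and end symbols are illustrative stand-ins for the actual vocabulary entries.

```python
# Traversal-style input for entailment: <s> premise $ hypothesis <e>.
START, DELIM, END = "<s>", "$", "<e>"

def entailment_input(premise_tokens, hypothesis_tokens):
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [END]

seq = entailment_input(["a", "man", "is", "sleeping"], ["someone", "is", "awake"])
print(seq)
# The classifier is a linear+softmax layer applied to the transformer's
# representation of the final end token.
```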
BERT (Devlin et al., NAACL 2019)
Pre-training tasks:
– Masked language modeling
– Next sentence prediction
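A minimal sketch of the masked-language-modeling corruption (BERT masks roughly 15% of positions; of those, 80% become [MASK], 10% a random token, and 10% stay unchanged). The vocabulary is a toy example and next sentence prediction is not shown.

```python
# Masked LM input corruption sketch.
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                       # the model must predict this token
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")               # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))   # 10%: random replacement
            else:
                inputs.append(tok)                    # 10%: keep the original token
        else:
            inputs.append(tok)
            targets.append(None)                      # no prediction at this position
    return inputs, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"], vocab))
```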
BERT Input
[Figure: BERT input formats. A sentence pair (e.g. Question + Paragraph) is packed as [CLS] Tok 1 … Tok N [SEP] Tok 1 … Tok M; a single sentence as [CLS] Tok 1 … Tok N. BERT maps the input embeddings E_[CLS], E_1, …, E_N to contextual representations C, T_1, …, T_N.]
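A minimal sketch of this input packing, with segment (token-type) ids distinguishing the two parts of a pair; WordPiece tokenization is assumed to have already been applied, and the helper name and example tokens are illustrative.

```python
# Pack BERT inputs: [CLS] A [SEP] (+ B [SEP] for sentence pairs), plus segment ids.
def bert_input(tokens_a, tokens_b=None):
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)                  # segment A
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)     # segment B
    return tokens, segment_ids

# sentence pair, e.g. (question, paragraph)
print(bert_input(["who", "wrote", "it", "?"], ["she", "wrote", "it", "."]))
# single sentence
print(bert_input(["great", "movie", "!"]))
```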
Pre-training improvements:
Dynamic masking: randomly change which tokens in a sentence get masked in each epoch (BERT: the same tokens are masked in every epoch)
Much larger batch sizes (2K sentences instead of 256)
Use byte-level BPE, not character-level BPE
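A minimal sketch contrasting static and dynamic masking: static masking fixes the corrupted copy of each sentence once during preprocessing, while dynamic masking resamples the masked positions every time the sentence is seen. The helper function is illustrative, not a library API.

```python
# Static vs. dynamic masking of token positions.
import random

def mask_positions(n_tokens, mask_prob=0.15):
    return {i for i in range(n_tokens) if random.random() < mask_prob}

sentence = ["the", "cat", "sat", "on", "the", "mat"]

# static masking: one fixed mask, reused in every epoch
static_mask = mask_positions(len(sentence))
for epoch in range(3):
    print("static :", static_mask)

# dynamic masking: resample the mask each epoch
for epoch in range(3):
    print("dynamic:", mask_positions(len(sentence)))
```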