DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Outline
● Question Answering
● Transfer learning
● BERT
● T5
Question Answering
● Context-based model: answers from a provided passage.
● Closed book model: answers from what it learned during training, with no passage given.
Not just the model
[Figure: classical training: data feeds the model during training; the trained model is then used for inference.]
[Figure: training on a “downstream” task: course review data feeds the model.]
Transfer Learning: Different Tasks
[Figure: Pre-training on sentiment classification: “Watching the movie is like ...” goes into the model.
Training on the downstream task, question answering: “When is Pi Day?” goes in, “March 14!” comes out.
Inference: “When’s my birthday?” goes in, and the model answers “Umm…”]
BERT: Bi-directional Context
Uni-directional: “Learning from deeplearning.ai is like watching the sunset with my best friend!” A uni-directional model predicts each word from the context on one side only.
Bi-directional: the same sentence, but the prediction uses context on both sides of the word.
T5: Single task vs. Multi task
[Figure: single-task: each input (“Studying with deeplearning.ai was ...”) goes to its own model (Model 1, Model 2); multi-task: one model handles them all.]
T5: more data, better performance
● English Wikipedia: ~13 GB
● C4 (Colossal Clean Crawled Corpus): ~800 GB
Transfer Learning in NLP
deeplearning.ai
Desirable Goals
● Improve predictions
● Learn from small datasets
Transfer Learning!
Transfer learning options
1. Transfer method: feature-based (train a new model on extracted features) vs. fine-tuning (keep training the pre-trained model to make predictions).
2. Pre-train data: labeled vs. unlabeled.
3. Pre-training task: language modeling, masked words, next sentence prediction.
General purpose learning
CBOW example: “I am ___ because I am learning” predicts “Happy” from the surrounding words.
Word Embeddings: embeddings learned this way become general-purpose input “features” for other tasks, such as translation.
Feature-based vs. Fine-Tuning
● Feature-based: pre-train, extract features, and train a new model on them to make predictions.
● Fine-tuning: pre-train, then fine-tune the same model on the downstream task to make predictions.
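A minimal sketch of the two options using the Hugging Face transformers library (covered later in this deck); the checkpoint name and the two-label head are illustrative assumptions:

    import torch
    from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

    checkpoint = "bert-base-uncased"   # assumed checkpoint, for illustration only
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    inputs = tokenizer("Learning from deeplearning.ai is like ...", return_tensors="pt")

    # Feature-based: freeze the pre-trained model and use its outputs as features
    encoder = AutoModel.from_pretrained(checkpoint)
    with torch.no_grad():
        features = encoder(**inputs).last_hidden_state[:, 0]   # [CLS] vector
    # ... train a brand-new classifier on `features`

    # Fine-tuning: add a task head and keep training the same weights
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    # ... train `model` end-to-end on the downstream task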
Fine-tune: adding a layer
[Figure: pre-training on review data (movies, course reviews, ...); a new output layer is added on top of the pre-trained model for the downstream task.]
Data and performance (option 2: pre-train data)
[Figure: more pre-train data leads to a better model.]
Labeled vs Unlabeled Data (option 2: pre-train data)
Labeled data example: “What day is Pi day?” goes into the model, which outputs “March 14”.
Self-supervised task (option 3: pre-training task)
[Figure: unlabeled data is used to create targets; both the inputs (features) and the targets (labels) come from the raw text.]
Self-supervised tasks (option 3: pre-training task)
Unlabeled data: “Learning from deeplearning.ai is like watching the sunset with my best friend.”
Input: the sentence with the last word hidden. Target: “friend”.
This is language modeling: predict a word from its context, compare with the target, and update the model.
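A toy sketch of deriving (input, target) pairs from unlabeled text for language modeling; the split rule here is an illustrative assumption:

    def make_lm_pairs(sentence):
        """Each prefix of the sentence predicts the word that follows it."""
        tokens = sentence.split()
        return [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

    pairs = make_lm_pairs(
        "Learning from deeplearning.ai is like watching the sunset with my best friend."
    )
    print(pairs[-1][1])   # 'friend.' is the target created from the raw text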
Fine-tune a model for each downstream task
[Figure: one pre-trained model is copied and fine-tuned separately, yielding one model per downstream task.]
Bi-directional LSTM
[Figure: in an RNN encoder-decoder, the encoder reads uni-directionally; a bi-directional LSTM pairs a left-to-right and a right-to-left LSTM so the decoder can predict “right” from the words “on”, “the”, and “side” around the gap.]
Why not bi-directional?
[Figure: a Transformer stacks attention layers inside its encoder and decoder.]
“The legislators believed that they were on the _____ side of history, so they changed the law.”
Filling in the blank needs bi-directional context: the decisive clue (“so they changed the law”) comes after the gap.
Transformer + Bi-directional Context
[Figure: the model reads Sentence “A” and Sentence “B” together and predicts the relationship between them: does B follow A?]
T5: Encoder vs. Encoder-Decoder
[Figure: an encoder-only stack vs. a full encoder-decoder stack; T5 uses the encoder-decoder.]
T5: Multi-task
[Figure: a single model serves every task, e.g. “Studying with deeplearning.ai was ...” goes into one shared model.] How?
T5: Text-to-Text
Every task is cast as text in, text out: “Classify: Learning from deeplearning.ai is like...” goes into the model, which outputs “5 stars”.
BERT
● Positional embeddings
● BERT_base:
          12 layers (12 transformer blocks)
          12 attention heads
          110 million parameters
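These numbers can be checked with a quick sketch using the Hugging Face transformers library, building the architecture from a config without downloading any weights:

    from transformers import BertConfig, BertModel

    # BERT_base: 12 layers, 12 attention heads, hidden size 768
    config = BertConfig(num_hidden_layers=12, num_attention_heads=12, hidden_size=768)
    model = BertModel(config)
    print(sum(p.numel() for p in model.parameters()))   # roughly 110 million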
BERT pre-training
● Choose 15% of the tokens at random; of those, mask them 80% of the time, replace them with a random token 10% of the time, or keep them as is 10% of the time.
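A minimal sketch of this 80/10/10 rule over a list of tokens; the toy vocabulary and the None labels are spelled out for illustration:

    import random

    def mask_tokens(tokens, vocab, mask_token="[MASK]", select_prob=0.15):
        """BERT-style masking: returns corrupted tokens and prediction targets."""
        corrupted, labels = [], []
        for tok in tokens:
            if random.random() < select_prob:        # choose 15% of tokens at random
                labels.append(tok)                   # the model must recover the original
                r = random.random()
                if r < 0.8:                          # 80%: replace with [MASK]
                    corrupted.append(mask_token)
                elif r < 0.9:                        # 10%: replace with a random token
                    corrupted.append(random.choice(vocab))
                else:                                # 10%: keep as is
                    corrupted.append(tok)
            else:
                corrupted.append(tok)
                labels.append(None)                  # position is not predicted
        return corrupted, labels

    tokens = "my dog is cute he likes playing".split()
    print(mask_tokens(tokens, vocab=tokens))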
Input:                [CLS]   my    dog   is    cute   [SEP]  he    likes  play   ##ing  [SEP]
Token embeddings:     E[CLS]  Emy   Edog  Eis   Ecute  E[SEP] Ehe   Elikes Eplay  E##ing E[SEP]
Segment embeddings:   EA      EA    EA    EA    EA     EA     EB    EB     EB     EB     EB
Position embeddings:  E0      E1    E2    E3    E4     E5     E6    E7     E8     E9     E10
The input representation for each token is the sum of its token, segment, and position embeddings.
Visualizing the output
[Figure: BERT takes the embeddings E[CLS], E1 ... EN, E[SEP], E1’ ... EM’ of masked sentence A and masked sentence B (an unlabeled sentence A and B pair) and outputs C plus T1 ... TN, T[SEP], T1’ ... TM’. C feeds the NSP head; the other outputs feed the Mask LM head.]
● [CLS]: a special classification symbol added in front of every input
● [SEP]: a special separator token
BERT Objective
● Objective 1: Multi-Mask LM: predict each masked token over the vocabulary (output size V)
● Objective 2: Next Sentence Prediction: a binary output (size 2)
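A rough sketch of how the two objectives combine into one loss, in plain PyTorch; the sizes and stand-in tensors are illustrative assumptions:

    import torch
    import torch.nn as nn

    hidden, V = 768, 30522                      # assumed BERT_base sizes
    mlm_head = nn.Linear(hidden, V)             # Objective 1: masked tokens (V classes)
    nsp_head = nn.Linear(hidden, 2)             # Objective 2: is B next? (2 classes)

    seq_out = torch.randn(1, 11, hidden)        # stand-in for T1 ... TM'
    cls_out = seq_out[:, 0]                     # stand-in for C, the [CLS] output

    mlm_labels = torch.full((1, 11), -100)      # -100 marks positions not scored
    mlm_labels[0, 2] = 1037                     # one masked position with its true token id
    nsp_label = torch.tensor([0])               # 0 means sentence B really follows A

    mlm_loss = nn.CrossEntropyLoss(ignore_index=-100)(
        mlm_head(seq_out).view(-1, V), mlm_labels.view(-1))
    nsp_loss = nn.CrossEntropyLoss()(nsp_head(cls_out), nsp_label)
    loss = mlm_loss + nsp_loss                  # pre-training minimizes the sum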
Summary
● BERT objective
● Model inputs/outputs
Fine-tuning BERT
deeplearning.ai
Fine-tuning BERT: Outline
[Figure: pre-train BERT on Sentence A / Sentence B pairs, then fine-tune a copy per task:
MNLI: Hypothesis and Premise go into BERT;
NER: Sentence A goes in, Tags come out;
SQuAD: Question goes in, Answer comes out.]
Inputs: Summary
Sentence A  / Sentence B
Hypothesis  / Premise
Sentence    / Entities
⋮
Transformer: T5
deeplearning.ai
Outline
● Classification
● Question Answering (Q&A)
● Summarization
● Sentiment
© Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Transformer: T5 Model
[Figure: spans of the original text are dropped out to form the model inputs.]
© Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Model Architecture
[Figure: attention patterns over inputs x1 ... x4 and targets y1, y2:
Encoder-decoder: fully-visible attention in the encoder, causal attention in the decoder.
Language model: strictly causal attention over the whole sequence.
Prefix LM: fully-visible attention over the prefix (x1 ... x3), causal attention over the targets (y1, y2).]
© Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Model Architecture
● Encoder/decoder: T5 uses the full encoder-decoder Transformer.
© Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Summary
● Prefix LM attention
● Model architecture
● Pre-training T5 (MLM)
Multi-task Training Strategy
deeplearning.ai
Multi-task training strategy
“Translate English to German: That is good.” goes into T5, which outputs “Das ist gut”.
“cola sentence: The course is jumping well.” goes into the same T5, which outputs “not acceptable”.
© Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Input and Output Format
● Machine translation:
  • translate English to German: That is good.
● Predict entailment, contradiction, or neutral:
  • mnli premise: I hate pigeons hypothesis: My feelings towards pigeons are filled with animosity. target: entailment
● Winograd schema:
  • The city councilmen refused the demonstrators a permit because *they* feared violence
© Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Multi-task Training Strategy
[Table: fine-tuning methods compared on GLUE, CNNDM, SQuAD, SGLUE, and the EnDe, EnFr, and EnRo translation benchmarks.]
© Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Data Training Strategies
● Examples-proportional mixing: sample from each dataset in proportion to its size.
● Equal mixing: sample from every dataset at the same rate.
● Temperature-scaled mixing: interpolate between the two by raising each mixing rate to the power 1/T and renormalizing.
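A small sketch of these mixing rates following the formulas in the T5 paper; the per-task dataset sizes below are illustrative assumptions:

    def mixing_rates(sizes, temperature=1.0, K=2**21):
        """Temperature-scaled mixing: T=1 gives examples-proportional mixing
        (with sizes capped at K); a large T approaches equal mixing."""
        capped = [min(n, K) for n in sizes]                   # limit very large datasets
        scaled = [c ** (1.0 / temperature) for c in capped]   # flatten with temperature
        total = sum(scaled)
        return [s / total for s in scaled]

    sizes = [400_000, 100_000, 7_000]                 # assumed per-task example counts
    print(mixing_rates(sizes, temperature=1.0))       # proportional to size
    print(mixing_rates(sizes, temperature=100.0))     # close to equal mixing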
      Gradual unfreezing vs. Adapter layers
GLUE Benchmark
deeplearning.ai
General Language Understanding Evaluation
● A collection of resources used to train, evaluate, and analyze natural language understanding systems
● Datasets of different genres, sizes, and difficulties
● Leaderboard
Tasks Evaluated On
● Is a sentence grammatical or not?
● Sentiment
● Paraphrase
● Similarity
● Are two questions duplicates?
● Is a question answerable?
● Contradiction
● Entailment
● Winograd (co-reference)
General Language Understanding Evaluation
● Drive research
● Model agnostic
Transformer encoder
[Figure: Inputs feed Input Embedding plus Positional Encoding, then Multi-Head Attention with Add & Norm, then Feed Forward with Add & Norm.]

Feedforward block:
[
    LayerNorm,
    dense,
    activation,
    dropout_middle,
    dense,
    dropout_final,
]

Encoder block:
[
    Residual(
        LayerNorm,
        attention,
        dropout_,
    ),
    Residual(
        feed_forward,
    ),
]
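A minimal runnable version of this block in PyTorch; the dimensions and dropout rates are illustrative assumptions (the bracketed listings above follow the combinator style used in the course):

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        """Pre-norm Transformer encoder block mirroring the listing above."""
        def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            self.attention = nn.MultiheadAttention(d_model, n_heads,
                                                   dropout=dropout, batch_first=True)
            self.dropout = nn.Dropout(dropout)
            # Feedforward: LayerNorm, dense, activation, dropout_middle, dense, dropout_final
            self.feed_forward = nn.Sequential(
                nn.LayerNorm(d_model),
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(d_ff, d_model),
                nn.Dropout(dropout),
            )

        def forward(self, x):
            # Residual(LayerNorm, attention, dropout)
            h = self.norm(x)
            h, _ = self.attention(h, h, h, need_weights=False)
            x = x + self.dropout(h)
            # Residual(feed_forward)
            return x + self.feed_forward(x)

    out = EncoderBlock()(torch.randn(2, 16, 512))   # (batch, seq_len, d_model)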
Data examples
Question: What percentage of the French population today is non-European?
Context: Since the end of the Second World War, France has become an ethnically diverse country. Today, approximately five percent of the French population is non-European and non-white. This does not approach the number of non-white citizens in the United States (roughly 28–37%, depending on how Latinos are classified; see Demographics of the United States). Nevertheless, it amounts to at least three million people, and has forced the issues of ethnic diversity onto the French policy agenda. France has developed an approach to dealing with ethnic problems that stands in contrast to that of many advanced, industrialized countries. Unlike the United States, Britain, or even the Netherlands, France maintains a "color-blind" model of public policy. This means that it targets virtually no policies directly at racial or ethnic groups. Instead, it uses geographic or class criteria to address issues of social inequalities. It has, however, developed an extensive anti-racist policy repertoire since the early 1970s. Until recently, French policies focused primarily on issues of hate speech, going much further than their American counterparts, and relatively less on issues of discrimination in jobs, housing, and in provision of goods and services.
● Process data to get the required inputs and outputs: "question: Q context: C" as input and "A" as target
● Fine-tune your model on the new task and input
● Predict using your own model
Examples (input into T5, output from T5):
“cola sentence: The course is jumping well.” yields “not acceptable”
“stsb sentence1: The rhino grazed on the grass. sentence2: A rhino is grazing in a field.” yields “3.8”
“summarize: state authorities dispatched emergency crews tuesday to survey the damage after an onslaught of severe weather in mississippi…” yields “six people hospitalized after a storm in attala county”
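A tiny sketch of the first step, casting a QA example into this text-to-text format; the helper name is an assumption, and the example reuses the slide's own SQuAD passage:

    def to_text_to_text(question, context, answer):
        """Build the 'question: Q context: C' input and the 'A' target."""
        return f"question: {question} context: {context}", answer

    src, tgt = to_text_to_text(
        question="What percentage of the French population today is non-European?",
        context="Today, approximately five percent of the French population "
                "is non-European and non-white.",
        answer="approximately five percent",
    )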
Hugging Face: Introduction
deeplearning.ai
Outline
● What is Hugging Face?
Transformers library
Use it, for example, for question answering: context and questions go into a Q/A model, which returns answers.
Hugging Face: Fine-Tuning Transformers
● Datasets: one thousand
● Model checkpoints: more than 14 thousand
● Workflow: Tokenizer → Trainer → evaluation metrics → human-readable output
Checkpoint: a set of learned parameters for a model, obtained with a training procedure for some task.
Hugging Face: Using Transformers
deeplearning.ai
Using Transformers
Pipelines: 1. Pre-processing your inputs (the pipeline also runs the model and post-processes its outputs).
Example: context and questions go into a Q/A pipeline, which returns answers.
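A minimal question-answering pipeline; the library picks a default checkpoint for the task, and the question and context here are illustrative:

    from transformers import pipeline

    # The pipeline wraps tokenization, the model call, and post-processing.
    qa = pipeline("question-answering")
    result = qa(
        question="What day is Pi Day?",
        context="Pi Day is celebrated on March 14 in honor of the constant pi.",
    )
    print(result["answer"])   # expected: "March 14"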
Tasks
Pipelines: initialization takes a task and a model checkpoint; use takes the inputs for the task.
Tokenizer
Datasets: one thousand; load them using just one function.
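Loading one of those datasets really is a single function call; the dataset name is an illustrative choice:

    from datasets import load_dataset

    # Downloads and caches the dataset in one call
    squad = load_dataset("squad")
    print(squad["train"][0]["question"])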
Trainer hyperparameters:
● Number of epochs
● Warm-up steps
● Weight decay
● ...
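These knobs live in TrainingArguments, which the Trainer consumes; the output directory and values are illustrative assumptions:

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="checkpoints",   # assumed path
        num_train_epochs=3,         # number of epochs
        warmup_steps=500,           # warm-up steps
        weight_decay=0.01,          # weight decay
    )
    # Trainer(model=model, args=args, train_dataset=..., eval_dataset=...).train()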