DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Outline
● Question Answering
● Transfer learning
● BERT
● T5
Question Answering
● Context-based model: answers from a provided passage.
● Closed book model: answers from what it learned during training, with no passage given.
Not just the model
[Figure: classical training: data feeds the model during training; the trained model is then used for inference.]
[Figure: training on a “downstream” task: course review data feeds the model.]
Transfer Learning: Different Tasks
[Figure: Pre-training on sentiment classification: “Watching the movie is like ...” goes into the model.
Training on the downstream task, question answering: “When is Pi Day?” goes in, “March 14!” comes out.
Inference: “When’s my birthday?” goes in, and the model answers “Umm…”]
BERT: Bi-directional Context
Uni-directional: “Learning from deeplearning.ai is like watching the sunset with my best friend!” A uni-directional model predicts each word from the context on one side only.
Bi-directional: the same sentence, but the prediction uses context on both sides of the word.
T5: Single task vs. Multi task
[Figure: single-task: each input (“Studying with deeplearning.ai was ...”) goes to its own model (Model 1, Model 2); multi-task: one model handles them all.]
T5: more data, better performance
● English Wikipedia: ~13 GB
● C4 (Colossal Clean Crawled Corpus): ~800 GB
Transfer Learning in NLP
deeplearning.ai
Desirable Goals
● Improve predictions
● Learn from small datasets
Transfer Learning!
Transfer learning options
1. Transfer method: feature-based (train a new model on extracted features) vs. fine-tuning (keep training the pre-trained model to make predictions).
2. Pre-train data: labeled vs. unlabeled.
3. Pre-training task: language modeling, masked words, next sentence prediction.
General purpose learning
CBOW example: “I am ___ because I am learning” predicts “Happy” from the surrounding words.
Word Embeddings: embeddings learned this way become general-purpose input “features” for other tasks, such as translation.
Feature-based vs. Fine-Tuning
● Feature-based: pre-train, extract features, and train a new model on them to make predictions.
● Fine-tuning: pre-train, then fine-tune the same model on the downstream task to make predictions.
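A minimal sketch of the two options using the Hugging Face transformers library (covered later in this deck); the checkpoint name and the two-label head are illustrative assumptions:

    import torch
    from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

    checkpoint = "bert-base-uncased"   # assumed checkpoint, for illustration only
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    inputs = tokenizer("Learning from deeplearning.ai is like ...", return_tensors="pt")

    # Feature-based: freeze the pre-trained model and use its outputs as features
    encoder = AutoModel.from_pretrained(checkpoint)
    with torch.no_grad():
        features = encoder(**inputs).last_hidden_state[:, 0]   # [CLS] vector
    # ... train a brand-new classifier on `features`

    # Fine-tuning: add a task head and keep training the same weights
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    # ... train `model` end-to-end on the downstream task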
Fine-tune: adding a layer
[Figure: pre-training on review data (movies, course reviews, ...); a new output layer is added on top of the pre-trained model for the downstream task.]
Data and performance (option 2: pre-train data)
[Figure: more pre-train data leads to a better model.]
Labeled vs Unlabeled Data (option 2: pre-train data)
Labeled data example: “What day is Pi day?” goes into the model, which outputs “March 14”.
Self-supervised task (option 3: pre-training task)
[Figure: unlabeled data is used to create targets; both the inputs (features) and the targets (labels) come from the raw text.]
Self-supervised tasks (option 3: pre-training task)
Unlabeled data: “Learning from deeplearning.ai is like watching the sunset with my best friend.”
Input: the sentence with the last word hidden. Target: “friend”.
This is language modeling: predict a word from its context, compare with the target, and update the model.
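A toy sketch of deriving (input, target) pairs from unlabeled text for language modeling; the split rule here is an illustrative assumption:

    def make_lm_pairs(sentence):
        """Each prefix of the sentence predicts the word that follows it."""
        tokens = sentence.split()
        return [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

    pairs = make_lm_pairs(
        "Learning from deeplearning.ai is like watching the sunset with my best friend."
    )
    print(pairs[-1][1])   # 'friend.' is the target created from the raw text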
Fine-tune a model for each downstream task
[Figure: one pre-trained model is copied and fine-tuned separately, yielding one model per downstream task.]
Bi-directional LSTM
[Figure: in an RNN encoder-decoder, the encoder reads uni-directionally; a bi-directional LSTM pairs a left-to-right and a right-to-left LSTM so the decoder can predict “right” from the words “on”, “the”, and “side” around the gap.]
Why not bi-directional?
[Figure: a Transformer stacks attention layers inside its encoder and decoder.]
“The legislators believed that they were on the _____ side of history, so they changed the law.”
Filling in the blank needs bi-directional context: the decisive clue (“so they changed the law”) comes after the gap.
Transformer + Bi-directional Context
[Figure: the model reads Sentence “A” and Sentence “B” together and predicts the relationship between them: does B follow A?]
T5: Encoder vs. Encoder-Decoder
[Figure: an encoder-only stack vs. a full encoder-decoder stack; T5 uses the encoder-decoder.]
T5: Multi-task
[Figure: a single model serves every task, e.g. “Studying with deeplearning.ai was ...” goes into one shared model.] How?
T5: Text-to-Text
Every task is cast as text in, text out: “Classify: Learning from deeplearning.ai is like...” goes into the model, which outputs “5 stars”.
BERT
● Positional embeddings
● BERT_base:
          12 layers (12 transformer blocks)
          12 attention heads
          110 million parameters
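These numbers can be checked with a quick sketch using the Hugging Face transformers library, building the architecture from a config without downloading any weights:

    from transformers import BertConfig, BertModel

    # BERT_base: 12 layers, 12 attention heads, hidden size 768
    config = BertConfig(num_hidden_layers=12, num_attention_heads=12, hidden_size=768)
    model = BertModel(config)
    print(sum(p.numel() for p in model.parameters()))   # roughly 110 million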
BERT pre-training
● Choose 15% of the tokens at random; of those, mask them 80% of the time, replace them with a random token 10% of the time, or keep them as is 10% of the time.
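A minimal sketch of this 80/10/10 rule over a list of tokens; the toy vocabulary and the None labels are spelled out for illustration:

    import random

    def mask_tokens(tokens, vocab, mask_token="[MASK]", select_prob=0.15):
        """BERT-style masking: returns corrupted tokens and prediction targets."""
        corrupted, labels = [], []
        for tok in tokens:
            if random.random() < select_prob:        # choose 15% of tokens at random
                labels.append(tok)                   # the model must recover the original
                r = random.random()
                if r < 0.8:                          # 80%: replace with [MASK]
                    corrupted.append(mask_token)
                elif r < 0.9:                        # 10%: replace with a random token
                    corrupted.append(random.choice(vocab))
                else:                                # 10%: keep as is
                    corrupted.append(tok)
            else:
                corrupted.append(tok)
                labels.append(None)                  # position is not predicted
        return corrupted, labels

    tokens = "my dog is cute he likes playing".split()
    print(mask_tokens(tokens, vocab=tokens))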
Input:                [CLS]   my    dog   is    cute   [SEP]  he    likes  play   ##ing  [SEP]
Token embeddings:     E[CLS]  Emy   Edog  Eis   Ecute  E[SEP] Ehe   Elikes Eplay  E##ing E[SEP]
Segment embeddings:   EA      EA    EA    EA    EA     EA     EB    EB     EB     EB     EB
Position embeddings:  E0      E1    E2    E3    E4     E5     E6    E7     E8     E9     E10
The input representation for each token is the sum of its token, segment, and position embeddings.
Visualizing the output
[Figure: BERT takes the embeddings E[CLS], E1 ... EN, E[SEP], E1’ ... EM’ of masked sentence A and masked sentence B (an unlabeled sentence A and B pair) and outputs C plus T1 ... TN, T[SEP], T1’ ... TM’. C feeds the NSP head; the other outputs feed the Mask LM head.]
● [CLS]: a special classification symbol added in front of every input
● [SEP]: a special separator token
BERT Objective
● Objective 1: Multi-Mask LM: predict each masked token over the vocabulary (output size V)
● Objective 2: Next Sentence Prediction: a binary output (size 2)
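A rough sketch of how the two objectives combine into one loss, in plain PyTorch; the sizes and stand-in tensors are illustrative assumptions:

    import torch
    import torch.nn as nn

    hidden, V = 768, 30522                      # assumed BERT_base sizes
    mlm_head = nn.Linear(hidden, V)             # Objective 1: masked tokens (V classes)
    nsp_head = nn.Linear(hidden, 2)             # Objective 2: is B next? (2 classes)

    seq_out = torch.randn(1, 11, hidden)        # stand-in for T1 ... TM'
    cls_out = seq_out[:, 0]                     # stand-in for C, the [CLS] output

    mlm_labels = torch.full((1, 11), -100)      # -100 marks positions not scored
    mlm_labels[0, 2] = 1037                     # one masked position with its true token id
    nsp_label = torch.tensor([0])               # 0 means sentence B really follows A

    mlm_loss = nn.CrossEntropyLoss(ignore_index=-100)(
        mlm_head(seq_out).view(-1, V), mlm_labels.view(-1))
    nsp_loss = nn.CrossEntropyLoss()(nsp_head(cls_out), nsp_label)
    loss = mlm_loss + nsp_loss                  # pre-training minimizes the sum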
Summary
● BERT objective
● Model inputs/outputs
Fine-tuning BERT
deeplearning.ai
Fine-tuning BERT: Outline
[Figure: pre-train BERT on Sentence A / Sentence B pairs, then fine-tune a copy per task:
MNLI: Hypothesis and Premise go into BERT;
NER: Sentence A goes in, Tags come out;
SQuAD: Question goes in, Answer comes out.]
Inputs: Summary
Sentence A  / Sentence B
Hypothesis  / Premise
Sentence    / Entities
⋮
Transformer: T5
deeplearning.ai
Outline
● Classification
● Question Answering (Q&A)
● Summarization
● Sentiment
© Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Transformer: T5 Model
[Figure: spans of the original text are dropped out to form the model inputs.]
© Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Model Architecture
[Figure: attention patterns over inputs x1 ... x4 and targets y1, y2:
Encoder-decoder: fully-visible attention in the encoder, causal attention in the decoder.
Language model: strictly causal attention over the whole sequence.
Prefix LM: fully-visible attention over the prefix (x1 ... x3), causal attention over the targets (y1, y2).]
© Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Model Architecture
● Encoder/decoder: T5 uses the full encoder-decoder Transformer.
© Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Summary
● Prefix LM attention
● Model architecture
● Pre-training T5 (MLM)
Multi-task Training Strategy
deeplearning.ai
Multi-task training strategy
“Translate English to German: That is good.” goes into T5, which outputs “Das ist gut”.
“cola sentence: The course is jumping well.” goes into the same T5, which outputs “not acceptable”.
© Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Input and Output Format
● Machine translation:
  • translate English to German: That is good.
● Predict entailment, contradiction, or neutral:
  • mnli premise: I hate pigeons hypothesis: My feelings towards pigeons are filled with animosity. target: entailment
● Winograd schema:
  • The city councilmen refused the demonstrators a permit because *they* feared violence
© Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Multi-task Training Strategy
[Table: fine-tuning methods compared on GLUE, CNNDM, SQuAD, SGLUE, and the EnDe, EnFr, and EnRo translation benchmarks.]
© Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al., 2020.
Data Training Strategies
● Examples-proportional mixing: sample from each dataset in proportion to its size.
● Equal mixing: sample from every dataset at the same rate.
● Temperature-scaled mixing: interpolate between the two by raising each mixing rate to the power 1/T and renormalizing.
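A small sketch of these mixing rates following the formulas in the T5 paper; the per-task dataset sizes below are illustrative assumptions:

    def mixing_rates(sizes, temperature=1.0, K=2**21):
        """Temperature-scaled mixing: T=1 gives examples-proportional mixing
        (with sizes capped at K); a large T approaches equal mixing."""
        capped = [min(n, K) for n in sizes]                   # limit very large datasets
        scaled = [c ** (1.0 / temperature) for c in capped]   # flatten with temperature
        total = sum(scaled)
        return [s / total for s in scaled]

    sizes = [400_000, 100_000, 7_000]                 # assumed per-task example counts
    print(mixing_rates(sizes, temperature=1.0))       # proportional to size
    print(mixing_rates(sizes, temperature=100.0))     # close to equal mixing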
      Gradual unfreezing vs. Adapter layers
GLUE Benchmark
deeplearning.ai
General Language Understanding Evaluation
● A collection of resources used to train, evaluate, and analyze natural language understanding systems
● Datasets of different genres, sizes, and difficulties
● Leaderboard
Tasks Evaluated On
● Is a sentence grammatical or not?
● Sentiment
● Paraphrase
● Similarity
● Are two questions duplicates?
● Is a question answerable?
● Contradiction
● Entailment
● Winograd (co-reference)
General Language Understanding Evaluation
● Drive research
● Model agnostic
Transformer encoder
[Figure: Inputs feed Input Embedding plus Positional Encoding, then Multi-Head Attention with Add & Norm, then Feed Forward with Add & Norm.]

Feedforward block:
[
    LayerNorm,
    dense,
    activation,
    dropout_middle,
    dense,
    dropout_final,
]

Encoder block:
[
    Residual(
        LayerNorm,
        attention,
        dropout_,
    ),
    Residual(
        feed_forward,
    ),
]
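A minimal runnable version of this block in PyTorch; the dimensions and dropout rates are illustrative assumptions (the bracketed listings above follow the combinator style used in the course):

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        """Pre-norm Transformer encoder block mirroring the listing above."""
        def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            self.attention = nn.MultiheadAttention(d_model, n_heads,
                                                   dropout=dropout, batch_first=True)
            self.dropout = nn.Dropout(dropout)
            # Feedforward: LayerNorm, dense, activation, dropout_middle, dense, dropout_final
            self.feed_forward = nn.Sequential(
                nn.LayerNorm(d_model),
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(d_ff, d_model),
                nn.Dropout(dropout),
            )

        def forward(self, x):
            # Residual(LayerNorm, attention, dropout)
            h = self.norm(x)
            h, _ = self.attention(h, h, h, need_weights=False)
            x = x + self.dropout(h)
            # Residual(feed_forward)
            return x + self.feed_forward(x)

    out = EncoderBlock()(torch.randn(2, 16, 512))   # (batch, seq_len, d_model)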
Data examples
Question: What percentage of the French population today is non-European?
Context: Since the end of the Second World War, France has become an ethnically diverse country. Today, approximately five percent of the French population is non-European and non-white. This does not approach the number of non-white citizens in the United States (roughly 28–37%, depending on how Latinos are classified; see Demographics of the United States). Nevertheless, it amounts to at least three million people, and has forced the issues of ethnic diversity onto the French policy agenda. France has developed an approach to dealing with ethnic problems that stands in contrast to that of many advanced, industrialized countries. Unlike the United States, Britain, or even the Netherlands, France maintains a "color-blind" model of public policy. This means that it targets virtually no policies directly at racial or ethnic groups. Instead, it uses geographic or class criteria to address issues of social inequalities. It has, however, developed an extensive anti-racist policy repertoire since the early 1970s. Until recently, French policies focused primarily on issues of hate speech, going much further than their American counterparts, and relatively less on issues of discrimination in jobs, housing, and in provision of goods and services.
● Process data to get the required inputs and outputs: "question: Q context: C" as input and "A" as target
● Fine-tune your model on the new task and input
● Predict using your own model
Examples (input into T5, output from T5):
“cola sentence: The course is jumping well.” yields “not acceptable”
“stsb sentence1: The rhino grazed on the grass. sentence2: A rhino is grazing in a field.” yields “3.8”
“summarize: state authorities dispatched emergency crews tuesday to survey the damage after an onslaught of severe weather in mississippi…” yields “six people hospitalized after a storm in attala county”
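A tiny sketch of the first step, casting a QA example into this text-to-text format; the helper name is an assumption, and the example reuses the slide's own SQuAD passage:

    def to_text_to_text(question, context, answer):
        """Build the 'question: Q context: C' input and the 'A' target."""
        return f"question: {question} context: {context}", answer

    src, tgt = to_text_to_text(
        question="What percentage of the French population today is non-European?",
        context="Today, approximately five percent of the French population "
                "is non-European and non-white.",
        answer="approximately five percent",
    )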
Hugging Face: Introduction
deeplearning.ai
Outline
● What is Hugging Face?
Transformers library
Use it, for example, for question answering: context and questions go into a Q/A model, which returns answers.
Hugging Face: Fine-Tuning Transformers
● Datasets: one thousand
● Model checkpoints: more than 14 thousand
● Workflow: Tokenizer → Trainer → evaluation metrics → human-readable output
Checkpoint: a set of learned parameters for a model, obtained with a training procedure for some task.
Hugging Face: Using Transformers
deeplearning.ai
Using Transformers
Pipelines: 1. Pre-processing your inputs (the pipeline also runs the model and post-processes its outputs).
Example: context and questions go into a Q/A pipeline, which returns answers.
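A minimal question-answering pipeline; the library picks a default checkpoint for the task, and the question and context here are illustrative:

    from transformers import pipeline

    # The pipeline wraps tokenization, the model call, and post-processing.
    qa = pipeline("question-answering")
    result = qa(
        question="What day is Pi Day?",
        context="Pi Day is celebrated on March 14 in honor of the constant pi.",
    )
    print(result["answer"])   # expected: "March 14"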
Tasks
Pipelines: initialization takes a task and a model checkpoint; use takes the inputs for the task.
Tokenizer
Datasets: one thousand; load them using just one function.
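Loading one of those datasets really is a single function call; the dataset name is an illustrative choice:

    from datasets import load_dataset

    # Downloads and caches the dataset in one call
    squad = load_dataset("squad")
    print(squad["train"][0]["question"])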
Trainer hyperparameters:
● Number of epochs
● Warm-up steps
● Weight decay
● ...
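These knobs live in TrainingArguments, which the Trainer consumes; the output directory and values are illustrative assumptions:

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="checkpoints",   # assumed path
        num_train_epochs=3,         # number of epochs
        warmup_steps=500,           # warm-up steps
        weight_decay=0.01,          # weight decay
    )
    # Trainer(model=model, args=args, train_dataset=..., eval_dataset=...).train()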