Natural Language Processing
Contextualized Embeddings and Large
         Language Models
           Felipe Bravo-Marquez
             June 20, 2023
                 Representations for a word
• So far, we have basically had one representation for words: the word embeddings we have already learned (Word2vec, GloVe, fastText).1
  • These embeddings have a useful semi-supervised quality, as they can be
    learned from unlabeled corpora and used in our downstream task-oriented
    architectures (LSTM, CNN, Transformer).
  • However, they exhibit two problems.
• Problem 1: They always produce the same representation for a word type, regardless of the context in which a word token occurs.
• We might want very fine-grained word sense disambiguation.
• Problem 2: We have just one representation per word, but words have different aspects, including semantics, syntactic behavior, and register/connotations.
1 These slides are partially based on the Stanford CS224N: Natural Language Processing with Deep Learning course: http://web.stanford.edu/class/cs224n/
Neural Language Models can produce Contextualized
                  Embeddings
• In a Neural Language Model (NLM), we immediately feed word vectors (perhaps trained only on the corpus at hand) through LSTM layers.
• Those LSTM layers are trained to predict the next word.
    • But these language models produce context-specific word representations in the
      hidden states of each position.
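As a concrete illustration, here is a minimal PyTorch sketch (toy vocabulary and dimensions, not any particular published model) of an LSTM LM whose per-position hidden states can be read off as contextualized embeddings:

import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)   # next-word prediction head

    def forward(self, token_ids):
        x = self.embed(token_ids)      # static word vectors (same for every context)
        h, _ = self.lstm(x)            # one hidden state per position: context-specific
        return self.out(h), h          # next-word logits and contextualized vectors

model = LSTMLanguageModel(vocab_size=10_000)
ids = torch.randint(0, 10_000, (1, 6))   # a toy "sentence" of 6 token ids
logits, contextual = model(ids)          # shapes (1, 6, 10000) and (1, 6, 256)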
ELMo: Embeddings from Language Models
• Idea: train a large language model (LM) with a recurrent neural network and use
  its hidden states as “contextualized word embeddings” [Peters et al., 2018].
• ELMo is a bidirectional LM with 2 biLSTM layers and around 100 million parameters.
• Uses character CNN to build initial word representation (only)
• 2048 char n-gram filters and 2 highway layers, 512 dim projection
• Uses 4096-dim hidden/cell LSTM states with 512-dim projections to the next input
• Uses a residual connection
• Parameters of token input and output (softmax) are tied.
                   ELMo: Use with a task
• First run the biLM to get representations for each word.
• Then let the end-task model (whatever it is) use them.
• Freeze the weights of ELMo for purposes of the supervised model.
• Concatenate the ELMo representations into the task-specific model (see the sketch below).
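A minimal sketch of this usage pattern, assuming a generic frozen biLM module (bilm) that maps token ids to one contextual vector per position; the real ELMo additionally learns a task-specific weighted sum of its layers:

import torch
import torch.nn as nn

class TaskModelWithFrozenBiLM(nn.Module):
    def __init__(self, bilm, bilm_dim, num_classes):
        super().__init__()
        self.bilm = bilm                      # pre-trained biLM (hypothetical module)
        for p in self.bilm.parameters():      # freeze: its weights are not updated
            p.requires_grad = False
        self.classifier = nn.Linear(bilm_dim, num_classes)   # task-specific layer

    def forward(self, token_ids):
        with torch.no_grad():
            contextual = self.bilm(token_ids)  # (batch, seq_len, bilm_dim)
        return self.classifier(contextual)     # per-token predictions for the end task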
ELMo: Results
                                   ULMfit
• Howard and Ruder (2018) Universal Language Model Fine-tuning for Text
  Classification [Howard and Ruder, 2018].
• Same general idea of transferring NLM knowledge
• Here applied to text classification
                                  ULMfit
• Train LM on big general domain corpus (use biLM)
• Tune LM on target task data
• Fine-tune as classifier on target task
                       ULMfit emphases
• Use a reasonable-size “1 GPU” language model, not a really huge one
• A lot of care in LM fine-tuning
• Different per-layer learning rates
• Slanted triangular learning rate (STLR) schedule
• Gradual layer unfreezing and STLR when learning classifier
• Classify using the concatenation [h_T, maxpool(h), meanpool(h)] (sketched below)
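A sketch of two of these ingredients: the slanted triangular schedule (formula and default hyperparameters cut_frac=0.1, ratio=32 taken from the ULMfit paper) and the concat-pooling classifier input:

import math
import torch

def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    # Short linear warm-up followed by a long linear decay (Howard & Ruder, 2018).
    cut = max(1, math.floor(T * cut_frac))
    p = t / cut if t < cut else 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

def concat_pool(hidden):
    # hidden: (batch, seq_len, dim) LSTM outputs -> [h_T, maxpool(h), meanpool(h)]
    h_T = hidden[:, -1]
    return torch.cat([h_T, hidden.max(dim=1).values, hidden.mean(dim=1)], dim=1)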
                              Text classifier error rates
ULMfit transfer learning
Let’s scale it up!
Transformer models
BERT (Bidirectional Encoder Representations from
                  Transformers)
• Idea: combine ideas from ELMo, ULMfit, and the Transformer [Devlin et al., 2019].
   • How: Train a large model (335 million parameters) from a large unlabeled corpus
     using a Transformer encoder and then fine-tune it for other downstream tasks.
   • The parallelizable properties of the Transformer (unlike RNNs, which must be
     processed sequentially) allow the model to scale to more parameters.
• This model is related to, but a little different from, a standard language model.
BERT (Bidirectional Encoder Representations from
                  Transformers)
• BERT doesn’t predict the next word in a sentence like a traditional language model, but instead uses a “masked language modeling” (MLM) objective during pre-training.
   • In MLM, random words in a sentence are masked and the model is trained to
     predict those masked words based on the surrounding context.
   • BERT also incorporates a “next sentence prediction” task, where pairs of
     sentences are fed to the model, and it learns to predict whether the second
     sentence follows the first in the original text.
   • Fine-tuning BERT involves adding a task-specific layer on top of the pre-trained
     model and training it on a labeled dataset for the target task.
   • BERT achieved state-of-the-art results at the time of its release on NLP tasks,
     including sentence classification, named entity recognition, question answering,
     and more.
Masked Language Modeling and Next Sentence
                Prediction
 • MLM: Mask out k% of the input words, and then predict the masked words
 • They always use k = 15%.
 • Too little masking: Too expensive to train
 • Too much masking: Not enough context
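A minimal sketch of the masking step, using the 80/10/10 replacement rule described in the BERT paper (mask_token_id and vocab_size are placeholders; special tokens are ignored for brevity):

import torch

def mask_for_mlm(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mlm_prob      # pick ~15% of positions
    labels[~masked] = -100                               # loss is computed only on masked positions
    corrupted = input_ids.clone()
    r = torch.rand(input_ids.shape)
    corrupted[masked & (r < 0.8)] = mask_token_id        # 80%: replace with [MASK]
    swap = masked & (r >= 0.8) & (r < 0.9)               # 10%: replace with a random token
    corrupted[swap] = torch.randint(vocab_size, input_ids.shape)[swap]
    return corrupted, labels                             # remaining 10%: token left unchanged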
Masked Language Modeling and Next Sentence
                Prediction
• Next sentence prediction: To learn relationships between sentences, predict whether Sentence B is the actual sentence that follows Sentence A or a random sentence (sketched below)
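A sketch of how such sentence pairs could be constructed from a list of consecutive sentences (plain Python, not BERT’s actual data pipeline):

import random

def make_nsp_example(sentences, i):
    # sentences: consecutive sentences from a document; returns (A, B, is_next)
    sentence_a = sentences[i]
    if random.random() < 0.5 and i + 1 < len(sentences):
        return sentence_a, sentences[i + 1], 1           # B really follows A
    return sentence_a, random.choice(sentences), 0       # B is a random sentence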
            BERT sentence pair encoding
• Token embeddings: Words are divided into smaller units called word pieces, and
  each word piece is assigned a token embedding.
• Segment embeddings: BERT learns a segment embedding for each of the two sentences in a pair (and uses a special [SEP] token) to differentiate between them.
• BERT utilizes positional embeddings to capture the position of each word within
  the sentence.
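This encoding can be inspected with the Hugging Face transformers tokenizer; a sketch, assuming the bert-base-uncased checkpoint is available:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("Hills and mountains are especially sanctified in Jainism.",
          "Jainism hates nature.")
print(tok.convert_ids_to_tokens(enc["input_ids"]))  # [CLS] pieces of A [SEP] pieces of B [SEP]
print(enc["token_type_ids"])                        # segment ids: 0 for sentence A, 1 for sentence B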
     BERT Model Architecture and Training
• BERT is based on the Transformer encoder.
• The multi-headed self-attention block of the Transformer allows BERT to
  consider long-distance context effectively.
• The use of self-attention also enables efficient computations on GPU/TPU, with
  only a single multiplication per layer.
• BERT was trained on a large amount of unlabeled text data from Wikipedia and
  BookCorpus.
• Two different model sizes were trained:
     1. BERT-Base: 12 layers, 768 hidden units, and 12 attention heads.
     2. BERT-Large: 24 layers, 1024 hidden units, and 16 attention heads.
• The training process involved utilizing 4x4 or 8x8 TPU (Tensor Processing Unit)
  configurations for faster computation.
• Training BERT models took approximately 4 days to complete.
                  BERT model fine tuning
• Fine-tuning involves customizing the pre-trained BERT model for specific tasks.
• To fine-tune BERT, we add a task-specific layer on top of the pre-trained BERT
  model.
• The task-specific layer can vary depending on the task at hand, such as
  sequence labeling or sentence classification.
• We train the entire model, including the pre-trained BERT and the added
  task-specific layer, for the specific task.
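A minimal fine-tuning sketch with the Hugging Face transformers library (toy two-example batch; a real setup would iterate over a labeled dataset for several epochs):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tok(["a great movie", "a boring movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
optimizer.zero_grad()
out = model(**batch, labels=labels)   # new classification head on top of the [CLS] representation
out.loss.backward()                   # gradients flow into BERT and the task-specific layer
optimizer.step()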
                 BERT results on GLUE tasks
• BERT was massively popular and hugely versatile; fine-tuning BERT led to new state-of-the-art results on a broad range of tasks.
   • BERT’s performance was assessed using the GLUE benchmark, a collection of
     diverse NLP tasks.
   • The GLUE benchmark primarily consists of natural language inference tasks, but
     also includes sentence similarity and sentiment analysis tasks.
Example Task: MultiNLI (Natural Language Inference)
   • Premise: "Hills and mountains are especially sanctified in Jainism."
   • Hypothesis: "Jainism hates nature."
   • Label: Contradiction
Example Task: CoLA (Corpus of Linguistic Acceptability)
   • Sentence: "The wagon rumbled down the road."
   • Label: Acceptable
   • Sentence: "The car honked down the road."
   • Label: Unacceptable
             BERT results on GLUE tasks
• QQP: Quora Question Pairs (detect paraphrase questions)
• QNLI: natural language inference over question answering data
• SST-2: sentiment analysis
• CoLA: corpus of linguistic acceptability (detect whether sentences are
  grammatical.)
• STS-B: semantic textual similarity
• MRPC: Microsoft Research Paraphrase Corpus
• RTE: a small natural language inference corpus
BERT Effect of pre-training task
     Pre-training decoders GPT and GPT-2
• Contemporary to BERT, OpenAI introduced an alternative approach called the Generative Pretrained Transformer (GPT) [Radford et al., 2018].
• The idea behind GPT is to train a large standard language model using the
  generative part of the Transformer, specifically the decoder.
• GPT is a Transformer decoder with 12 layers and 117 million parameters.
• It has 768-dimensional hidden states and 3072-dimensional feed-forward hidden
  layers.
• GPT utilizes byte-pair encoding with 40,000 merges to handle subword units.
• GPT was trained on BooksCorpus, which consists of over 7,000 unique books.
• OpenAI later introduced GPT-2, a larger version with 1.5 billion parameters,
  trained on even more data.
• GPT-2 has been shown to generate relatively convincing samples of natural
  language.
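A short sketch of sampling continuations from the publicly released GPT-2 weights via the Hugging Face transformers library (the gpt2 checkpoint is the smallest released model; sampled text will vary):

from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "In a shocking finding, scientist discovered a herd of unicorns"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=40, do_sample=True, top_k=50)
print(tok.decode(out[0], skip_special_tokens=True))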
  GPT-2 language model (cherry-picked) output
Human provided prompt:
In a shocking finding, scientist discovered a herd of unicorns living in a remote,
previously unexplored valley, in the Andes Mountains. Even more surprising to the
researchers was the fact that the unicorns spoke perfect English.
Model Completion:
The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These
four-horned, silver-white unicorns were previously unknown to science.
Now, after almost two centuries, the mystery of what sparked this odd phenomenon is
finally solved.
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several
companions, were exploring the Andes Mountains when they found a small valley, with
no other animals or humans. Pérez noticed that the valley had what appeared to be a
natural fountain, surrounded by two peaks of rock and silver snow.
What kinds of things does pretraining learn?
• Stanford University is located in     , California. [Trivia]
• I put      fork down on the table. [syntax]
• The woman walked across the street, checking for traffic over           shoulder.
  [coreference]
• I went to the ocean to see the fish, turtles, seals, and     . [lexical
  semantics/topic]
• Overall, the value I got from the two hours watching it was the sum total of the
  popcorn and the drink. The movie was           . [sentiment]
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko
  pondered his destiny. Zuko left the       . [some reasoning – this is harder]
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21,        [some basic arithmetic; they don’t learn the Fibonacci sequence]
             Phase Change: GPT-3 (2020)
• GPT-3 is another Transformer-based Language Model (LM) that pushed the boundaries with 175 billion parameters, making it the largest model at the time [Brown et al., 2020].
• It was trained on a massive corpus of nearly 500 billion tokens.
• In-context learning: GPT-3 demonstrated the ability to solve various natural
  language processing (NLP) tasks using zero-shot, one-shot and few-shot
  learning.
• The key to this capability lies in the prompt or context provided to GPT-3.
• GPT-3 demonstrated the ability to solve various tasks without performing
  gradient updates to the base model.
Zero-shot, One-shot, and Few-shot Learning with
                     GPT-3
  • Zero-shot learning: With zero-shot learning, GPT-3 can tackle tasks without any
    specific training. It achieves this by providing a prompt or instruction to guide its
    generation process. For example, by providing GPT-3 with a prompt like,
    “Translate this English sentence to French,” it can generate the translated
    sentence without any explicit training for translation tasks.
  • One-shot learning: In one-shot learning, GPT-3 can perform a task by adding a
    single input-output pair to the instruction.
• Few-shot learning: the same idea, but providing a small number of input-output pairs after the instruction in the prompt (see the sketch below).
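The three settings differ only in how many demonstrations appear in the prompt. A sketch of building such a prompt (the translation demonstrations are adapted from the GPT-3 paper; the template itself is an arbitrary choice):

def few_shot_prompt(instruction, examples, query):
    # instruction + k input/output demonstrations + the new input to complete
    parts = [instruction]
    for x, y in examples:
        parts.append(f"Input: {x}\nOutput: {y}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

prompt = few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],   # k = 2 demonstrations
    "peppermint",
)
# Zero-shot: examples = []   One-shot: a single demonstration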
Zero-shot, One-shot, and Few-shot Learning with
                     GPT-3
GPT-3 Few-shot Learning Results
              Chain-of-thought Prompting
• Chain-of-thought prompting is a simple mechanism for eliciting multi-step reasoning behavior in large language models.
• Idea: augment each exemplar in few-shot prompting with a chain of thought for the associated answer [Wei et al., 2022] (see the sketch below).
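A sketch of such a prompt; the exemplar (question plus worked-out chain of thought) is adapted from [Wei et al., 2022], and the template is an arbitrary choice:

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def cot_prompt(question):
    # Prepend the worked exemplar so the model imitates the step-by-step reasoning style.
    return COT_EXEMPLAR + f"Q: {question}\nA:"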
Language Models as User Assistants (or Chatbots)
  • Autoregressive Large Language Models are not aligned with user intent
    [Ouyang et al., 2022]
  • Solution: align the language model with user intent via fine-tuning.
LaMDA: Language Models for Dialog Applications
  • LaMDA is a language model developed by Google based on Transformer
    optimized for open domain dialog [Thoppilan et al., 2022].
• It has 137 billion parameters and is trained on 1.56 trillion words.
• It is initially pre-trained in the same way as traditional language models (predicting the next word), with a strong focus on dialog data.
  • It is then fine-tuned to generate responses with respect to several other criteria.
• In order to fit LaMDA to all these criteria, they worked with a large number of crowd-workers.
  • These are people who manually labeled conversations from the pre-trained
    model.
                 LaMDA Optimization Criteria
Quality
  • Sensibleness: give meaningful answers.
  • Specificity: avoid vague answers.
   • Interestingness: give insightful, unexpected or witty answers.
Safety
   • Avoid violent language.
   • Avoid hate speech.
   • Avoid stereotyped speech.
Groundedness and Informativeness
   • Avoid giving answers not validated by external sources.
   • Optimize the fraction of responses that can be validated in authoritative sources
     using search engines.
                      LaMDA Evaluation
• The system is compared with the original pre-trained model (PT) and with human-generated responses.
• The evaluation is done by another group of people through questionnaires.
                       ChatGPT and RLHF
• A model similar to LaMDA, launched by OpenAI at the end of 2022.
• It also uses crowdsourcing to improve its responses, but its fine-tuning process uses Reinforcement Learning (RL), a different learning paradigm from supervised learning.
• In particular, it uses Reinforcement Learning from Human Feedback (RLHF) [Ouyang et al., 2022].
• It builds a preference model that assigns a score to a generated sentence and adjusts the language model accordingly (a sketch of this step follows).
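A sketch of the preference-model ingredient only (the pairwise ranking loss used to train the reward model in [Ouyang et al., 2022]); the subsequent RL step that updates the language model against this reward is omitted:

import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    # Pairwise ranking loss: the human-preferred response should receive a higher
    # scalar reward than the rejected response for the same prompt.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# toy scalar rewards from a reward model for two candidate responses
loss = preference_loss(torch.tensor([1.2]), torch.tensor([0.3]))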
                   ChatGPT and RLHF
Source: https://huggingface.co/blog/rlhf
                              GPT-4 (2023)
• The most recent LM from OpenAI [OpenAI, 2023], this time able to include images in the prompt.
• Still a Transformer LM.
• Able to pass exams in several disciplines, processing the images included in the questions.
• From ChatGPT onwards, companies have stopped making public all the details of the construction of their models.
                   Instruction Fine-tuning
• A more efficient way to fine-tune Large Language Models is Instruction
  Fine-Tuning [Chung et al., 2022].
• Idea: collect examples of (instruction, output) pairs across many tasks and fine-tune an LM on them (see the sketch below).
• Evaluate on unseen tasks.
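A sketch of how (instruction, output) pairs might be serialized into training strings; the template below is an assumption for illustration, not the format of any specific dataset:

def format_instruction_example(instruction, output, input_text=""):
    # Serialize one (instruction, output) pair into a single training string.
    if input_text:
        return f"Instruction: {instruction}\nInput: {input_text}\nOutput: {output}"
    return f"Instruction: {instruction}\nOutput: {output}"

example = format_instruction_example(
    "Classify the sentiment of the sentence as positive or negative.",
    "positive",
    input_text="The movie was insightful and witty.",
)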
          Dangers of Large Language Models
The research community has raised concerns about several dangers associated with
Large Language Models [Bender et al., 2021].
   • Hallucination: Probabilistic language models can generate fabricated
      information lacking factual basis.
   • Fairness: These models can perpetuate biases present in the training data,
      including toxic language, racism, and gender discrimination.
   • Copyright infringement: Large language models may violate copyright laws by
      reproducing content without proper authorization.
   • Lack of transparency: The complex nature of these models makes it difficult to
      interpret their predictions and understand the reasoning behind specific
      responses.
   • Monopolization: The high costs of training these models create barriers for
      non-big-tech companies to compete.
   • High carbon footprint: The energy-intensive training process of these models
      contributes to a significant carbon footprint.
         Large Language Models Time-line
• As of today (2023), the development of new Large Language Models continues
  uninterrupted.
• A timeline of existing large language models (with more than 10B parameters) released in recent years is given in [Zhao et al., 2023].
                     Prompt Engineering
• Prompt engineering is a new discipline for developing and optimizing prompts to
  efficiently use language models (LMs).
                             Conclusions
• The growth in the size and power of language models has accelerated
  dramatically.
• It is very difficult to predict what they will do in the future.
• What can we predict with confidence?
• There will be an overload of generative models for multiple formats (text, code,
  image, video, virtual realities).
• There will be a plethora of agents/programs that act and make decisions by
  interacting with these models (medical appointments, investments, travel).
        Questions?
Thanks for your Attention!
                           References I
Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021).
On the dangers of stochastic parrots: Can language models be too big?
In Proceedings of the 2021 ACM conference on fairness, accountability, and
transparency, pages 610–623.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P.,
Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020).
Language models are few-shot learners.
Advances in neural information processing systems, 33:1877–1901.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X.,
Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X.,
Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., Valter, D., Narang, S.,
Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E. H.,
Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q. V., and Wei, J. (2022).
Scaling instruction-finetuned language models.
Howard, J. and Ruder, S. (2018).
Universal language model fine-tuning for text classification.
In Proceedings of the 56th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia.
Association for Computational Linguistics.
                          References II
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019).
BERT: Pre-training of deep bidirectional transformers for language understanding.
In Proceedings of NAACL-HLT, pages 4171–4186.
OpenAI (2023).
GPT-4 technical report.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C.,
Agarwal, S., Slama, K., Ray, A., et al. (2022).
Training language models to follow instructions with human feedback.
Advances in Neural Information Processing Systems, 35:27730–27744.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and
Zettlemoyer, L. (2018).
Deep contextualized word representations.
In Proceedings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana.
Association for Computational Linguistics.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018).
Improving language understanding by generative pre-training.
                          References III
Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng,
H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. (2022).
LaMDA: Language models for dialog applications.
arXiv preprint arXiv:2201.08239.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D.
(2022).
Chain of thought prompting elicits reasoning in large language models.
arXiv preprint arXiv:2201.11903.
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B.,
Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y.,
Tang, X., Liu, Z., Liu, P., Nie, J.-Y., and Wen, J.-R. (2023).
A survey of large language models.