NUS ACE SUMMER PROGRAMME
AI & MACHINE LEARNING
Manoranjan Dash
Professor and Dean
School of Computing and Data Science
FLAME University, Pune
Ex-Senior Research Fellow
Singapore Data Science Consortium
National University of Singapore
Outline
• Large Language Model
  • Introduction
Introduction
1. What is an LLM?
   a. An LLM is an instance of a foundation model
       i. Foundation models are pretrained on large amounts of unlabeled data using self-supervised learning
   b. Large means really large: training data can run to terabytes or even petabytes
   c. Large also refers to the parameter count
       i. GPT-3: 175 billion parameters; its Common Crawl portion started from about 45 TB of raw text before filtering
2. How do they work?
   a. LLM = Data + Architecture (Transformer) + Training
   b. Training: generation starts out arbitrary, and the model learns through backpropagation (BP) until it
      generates coherent text
3. Business Applications
   a. Chatbots: customer queries
   b. Content Creation: articles, emails, social media posts, etc
   c. Code generation
     Introduction
     • Two years ago, implementing an LLM was considered cutting-edge
       and esoteric
          • Now it can be implemented and applied quite easily
          • Many businesses and large organizations have built their own LLMs
               • Example: BloombergGPT (50 billion parameters)
                    • Built to handle financial data
     • No need to build an LLM from scratch
          • More efficient options: prompt engineering, fine-tuning
https://www.youtube.com/watch?v=ZLbVdvOoTKM
So, we won’t be training an LLM soon.
Let’s discuss the technical aspects of building one of these models.
4 Key Steps
• Data Curation
• Model Architecture
• Training at Scale
• Evaluation
Step 1: Data Curation
• The most important and most time-consuming part of the process
• Remember:
  • Garbage in, garbage out
  • The quality of your model is driven by the quality of your data
  • But LLMs require a very large training dataset
     • So where can we get quality training data?
Scale: a trillion words = 1,000,000,000,000 words
  (approx.) 1,000,000 novels
  (approx.) 1,000,000,000 news articles
Step 1: Data Curation
• Where do we get all this data?
  • Internet
     • Web pages, Wikipedia, forums, books, scientific articles, code bases, etc.
         • Since ChatGPT, copyright has come under much closer scrutiny
             • Web scraping may grab data that you are not permitted to collect
             • Using scraped data commercially may also be restricted
  • Public datasets
     • Common Crawl and its filtered derivatives (Colossal Clean Crawled Corpus (C4), Falcon RefinedWeb)
     • The Pile – brings together a wide variety of training datasets
     • Hugging Face Datasets – Hugging Face has emerged as a big player in AI and LLMs
  • Private data sources
     • FinPile – used to train BloombergGPT
     • Advantage: not available to others
  • Using an LLM
     • Alpaca (Stanford) – an LLM fine-tuned on structured text generated by a GPT-3-family model
[Figure: Alpaca (Stanford): Prompt -> Training Data]
While Common Crawl is derived from raw web data, it is
aggregated and standardized in a way that differentiates it
from a random collection of web pages. This added
structure and cleanliness make it more valuable for training
LLMs.
Step 1: Data Curation
- Higher data diversity yields models that perform well across a wide variety of tasks, making them general purpose
- GPT-3: web pages + some books
- Gopher: web pages + some books and some code
- Llama: web pages + some books, some code, and some scientific articles
- PaLM: mainly conversational data + web pages, books, and code
- Knowing these data mixes guides how we query each model
Quality of a model is driven by the quality of the training data
Step 1: Data Curation
• How to prepare the data?
  • Four steps
     1. Quality filtering – remove low-quality text from the dataset
         • Toxic language, hate speech, objectively false statements (2+2=5), …
         • Two types of filtering
             • Classifier based
                   • Classify text as high or low quality
             • Heuristic based
                   • Remove specific words or repeated words, or remove text based on statistical
                     properties
             • Or combine the two approaches
     2. De-duplication – several instances of the same or similar text can bias the model
         • It also skews evaluation if the same web page appears in both the training and test sets
     3. Privacy redaction – removal of sensitive and confidential information
     4. Tokenization – translate text into numbers
         • An ANN (artificial neural network) does not understand text, only numerical values
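The four preparation steps can be sketched in a few lines of Python. This is only a toy illustration: the thresholds, the blocklist, the email pattern, and the sample documents are made-up assumptions, and real pipelines use subword tokenizers and near-duplicate detection rather than the word-level vocabulary and exact hashing shown here.

```python
import hashlib
import re

def is_high_quality(text, min_words=5, max_repeat_ratio=0.5, blocklist=("buy now",)):
    """Heuristic filter: reject very short, highly repetitive, or blocklisted text."""
    words = text.lower().split()
    if len(words) < min_words:
        return False
    # Share of the text taken up by its single most frequent word
    most_common = max(words.count(w) for w in set(words))
    if most_common / len(words) > max_repeat_ratio:
        return False
    return not any(phrase in text.lower() for phrase in blocklist)

def deduplicate(docs):
    """Exact de-duplication after normalising case and whitespace."""
    seen, unique = set(), []
    for doc in docs:
        normalised = " ".join(doc.lower().split())
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def redact_emails(text):
    """Privacy redaction for one kind of sensitive data: email addresses."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", text)

def split_tokens(text):
    """Split into lowercase words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(docs):
    """Word-level vocabulary; id 0 is reserved for unknown tokens."""
    vocab = {"<unk>": 0}
    for doc in docs:
        for token in split_tokens(doc):
            vocab.setdefault(token, len(vocab))
    return vocab

def tokenize(text, vocab):
    """Translate text into a list of integer ids."""
    return [vocab.get(token, 0) for token in split_tokens(text)]

# Made-up sample documents
raw_docs = [
    "The transformer architecture relies on attention mechanisms.",
    "the transformer   architecture relies on attention mechanisms.",  # duplicate
    "spam spam spam spam spam spam",                                   # repetitive
    "Contact me at alice@example.com for the full training dataset.",  # contains PII
    "Too short.",
]

clean_docs = [redact_emails(d) for d in deduplicate(raw_docs) if is_high_quality(d)]
vocab = build_vocab(clean_docs)
token_ids = tokenize(clean_docs[0], vocab)
print(clean_docs)
print(token_ids)
```

Only two of the five sample documents survive, and the model would see the integer ids rather than the text itself.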
Step 2: Model Architecture
• Transformers
  • ANN architecture that uses attention mechanisms
     • Attention mechanism – learns dependencies between different elements of a
       sequence based on position and content.
         • E.g., I like samosa. In general I like Indian cuisine.
             • The word Indian is dependent on samosa
  • 3 Types of Transformers
     1. Encoder-only – the encoder translates tokens into a semantically meaningful
        representation | tasks: text classification
     2. Decoder-only – similar to the encoder, but a token cannot attend to future
        elements | tasks: text generation
     3. Encoder-decoder – combines both and allows cross-attention between them | tasks: translation
  • Most popular for LLMs: decoder-only architecture
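To make the decoder-only idea concrete, here is a minimal sketch of scaled dot-product self-attention with a causal mask, written in plain Python. It assumes the queries, keys, and values are the raw token vectors themselves (a real Transformer first applies learned projection matrices), and the three 2-dimensional vectors are made-up numbers.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i only sees positions <= i."""
    d = len(queries[0])
    n = len(queries)
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Decoder-only masking: only keys at positions 0..i are scored
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors
        out = [
            sum(w * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ]
        outputs.append(out)
        # Pad with zeros so each row shows the weight given to every position
        all_weights.append(weights + [0.0] * (n - i - 1))
    return outputs, all_weights

# Toy vectors for three tokens (not learned embeddings)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention weights. The zeros to the right of the diagonal are the masked future positions, which is what distinguishes a decoder from an encoder.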
There is a lot more detail to model architecture, but
considering the interest and level of the students, I do not
go into it here.
Step 2: Model Architecture
• How big do I make it?
   • If a model is too big or trained too long, it can overfit
   • If a model is too small or not trained long enough, it can underperform
Step 3: Training at Scale
• The central challenge of LLMs is their scale – training on trillions
  of tokens with billions of parameters carries a very high
  computational cost
• So, some computational tricks/techniques are necessary
• 3 Training Techniques
   1. Mixed Precision Training – uses both 32-bit and 16-bit floating-point data types
      • Use 16-bit precision whenever possible; otherwise use 32-bit for higher precision
   2. 3D Parallelism – combination of pipeline, model, and data parallelism
      • Pipeline Parallelism – distributes layers across multiple GPUs
      • Model Parallelism – decomposes a parameter matrix operation into multiple smaller matrix
        operations and distributes them across multiple GPUs
      • Data Parallelism – distributes training data across multiple GPUs
   3. Zero Redundancy Optimizer (ZeRO) – reduces memory redundancy by partitioning the
      optimizer states, gradients, and parameters across devices
• Example library: DeepSpeed
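A small numerical experiment shows why mixed precision keeps a higher-precision copy of the weights instead of using 16-bit everywhere. The sketch below uses only the Python standard library. The weight, update size, and step count are made-up values, and Python's 64-bit float stands in for the 32-bit master copy used in practice.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

weight = 1.0
update = 1e-4   # a small gradient step (made-up value)
steps = 1000

# Pure 16-bit: the weight is stored in fp16, so each tiny update is rounded away
w16 = to_fp16(weight)
for _ in range(steps):
    w16 = to_fp16(w16 + update)

# Mixed precision: updates accumulate in a higher-precision master copy
master = weight
for _ in range(steps):
    master += update

# The 16-bit copy used for the forward/backward pass is cast from the master
compute_copy = to_fp16(master)

print("pure fp16 weight:  ", w16)
print("master weight:     ", master)
print("fp16 compute copy: ", compute_copy)
```

Near 1.0 the gap between adjacent 16-bit values is about 0.001, so an update of 0.0001 disappears on every step in the pure 16-bit run, while the master copy accumulates all of them.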
Step 3: Training at Scale
• Training Stability
   • Checkpointing – takes a snapshot of model artifacts so training can
     resume from that point
      • E.g., training is going well and the error is decreasing, but then the error
        suddenly spikes. With checkpointing, it is possible to go back to a point
        where training was going well
   • Weight Decay – regularization strategy that penalizes large parameter
     values by adding a term (e.g., L2 norm of weights) to the loss function or
     changing the parameter update rule
   • Gradient Clipping – rescales the gradient of the objective function if its
     norm exceeds a pre-specified value
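Two of these stability techniques fit in a few lines. The sketch below shows gradient clipping by L2 norm and weight decay applied through the parameter update rule, on a plain SGD step; the learning rate, decay coefficient, and clipping threshold are illustrative assumptions, not recommended settings.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def sgd_step(weights, grads, lr=0.1, weight_decay=0.01, max_norm=1.0):
    """One SGD update with gradient clipping and weight decay."""
    clipped = clip_by_norm(grads, max_norm)
    # Weight decay: each weight is also pulled toward zero in the update rule
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, clipped)]

grads = [3.0, 4.0]                           # L2 norm is 5.0, above the threshold
clipped = clip_by_norm(grads, max_norm=1.0)  # direction kept, norm scaled to 1.0
new_weights = sgd_step([1.0, -2.0], grads)

print(clipped)
print(new_weights)
```

Clipping keeps the direction of the gradient and only limits its size, which is why a single bad batch cannot throw the weights far off course.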
Step 3: Training at Scale
• Hyperparameters
  • Batch Size: (Static) typically ~16M tokens; (Dynamic) GPT-3 increased
    from 32K to 3.2M tokens
  • Learning Rate (LR): (Dynamic) LR increases linearly until reaching a
    maximum value and then decreases via cosine decay until the LR is about
    10% of its max value
  • Optimizer: Adam-based optimizers are most commonly used for LLMs
  • Dropout: typical values between 0.2 and 0.5
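The dynamic learning-rate schedule described above (linear warmup, then cosine decay to about 10% of the maximum) can be sketched as a single function. The maximum LR, warmup length, and total step count are hypothetical values chosen for illustration.

```python
import math

def learning_rate(step, max_lr=6e-4, warmup_steps=100, total_steps=1000, min_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay down to min_ratio * max_lr."""
    if step < warmup_steps:
        # Warmup: LR grows linearly and reaches max_lr on the last warmup step
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, capped at 1.0
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = max_lr * min_ratio
    return min_lr + (max_lr - min_lr) * cosine

for step in (0, 50, 99, 100, 550, 1000):
    print(step, learning_rate(step))
```

The warmup avoids large, unstable updates while the weights are still close to their random initial values; the decay lets training settle as it approaches the end.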
Step 4: Evaluation
• Just having a model is not enough
   • You need to evaluate it and find the cases in which it works well
   • For this, there are many benchmark datasets
• Benchmark Datasets (Open LLM Leaderboard)
Step 4: Evaluation
• Multiple-choice Tasks
  • ARC, HellaSwag, MMLU
     • ARC: grade-school science questions; MMLU: questions across many subjects (e.g., history, law, mathematics)
         • E.g., which technology was developed most recently: (a) cellphone, (b) airplane, (c) microwave,
           (d) refrigerator
     • HellaSwag is different: it is based on commonsense sentence completion
• Open-ended Tasks
  • TruthfulQA
     • Human Evaluation – a person scores completion based on ground truth, guidelines,
       or both
     • NLP Metrics – quantify completion quality via metrics such as perplexity, BLEU, or
       ROUGE scores
     • Auxiliary Fine-tuned LLM – use an LLM to compare completions to ground truth
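Two of the simpler metrics can be computed by hand. The sketch below scores multiple-choice answers by accuracy and computes perplexity from the probabilities a model assigned to the correct tokens; all the predictions and probabilities are made-up numbers, not results from any real model.

```python
import math

def multiple_choice_accuracy(predictions, answers):
    """Fraction of questions where the predicted option matches the answer key."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical model answers versus the answer key
accuracy = multiple_choice_accuracy(["a", "c", "b", "d"], ["a", "c", "d", "d"])

# Hypothetical probabilities assigned to each correct next token
confident = perplexity([0.5, 0.5, 0.5, 0.5])
uncertain = perplexity([0.1, 0.1, 0.1, 0.1])

print(accuracy, confident, uncertain)
```

Lower perplexity is better: a model that gives every correct token probability 0.5 is, on average, as unsure as if it were choosing between 2 options, while probability 0.1 corresponds to choosing between 10.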
What’s Next?
• Base models are typically a starting point, not a final solution
Prompt Engineering
• Prompt engineering is a crucial technique in the use of large
  language models (LLMs) like GPT-4, designed to optimize the way
  prompts are crafted to elicit the best possible responses from the
  model.
   • As LLMs are driven by the context provided to them, the way questions or
     tasks are framed can significantly impact the quality and relevance of the
     output.
What is Prompt Engineering?
• Prompt engineering involves designing and refining the input
  prompts given to an LLM to achieve specific, desired responses.
  • This includes choosing the right words, structure, and context to
    maximize the effectiveness of the model's output.
Contextual Information
• Contextual Prompts
  • Use context-rich prompts to help the model understand the scenario
    better. This might include examples, definitions, or prior conversation
    history.
• Multi-turn Prompts
  • In a conversational setting, use a series of prompts that build on previous
    responses to maintain context and coherence.
Format and Structure
• Question Format
  • Frame questions or tasks in a way that directs the model towards the type of response
    needed.
  • For example, asking "List three benefits of exercise" rather than "What are the benefits of
    exercise?"
• Templates
  • Use consistent templates for repetitive tasks to standardize responses and ensure
    reliability.
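A template can be as simple as a string with named placeholders. The sketch below uses Python's standard `string.Template`; the template wording and the `build_prompt` helper are hypothetical, shown only to illustrate keeping the format fixed while the task details change.

```python
from string import Template

# One fixed structure for every "list N things" request
LIST_TEMPLATE = Template(
    "Task: List $count $item_type of $topic.\n"
    "Format: a numbered list, one item per line."
)

def build_prompt(topic, count=3, item_type="benefits"):
    """Fill the shared template so similar tasks always get the same prompt shape."""
    return LIST_TEMPLATE.substitute(count=count, item_type=item_type, topic=topic)

print(build_prompt("exercise"))
print(build_prompt("remote work", count=5, item_type="drawbacks"))
```

Because only the placeholders change, differences in the model's output are easier to attribute to the task rather than to accidental changes in wording.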
Prompt Length
• Brevity vs. Detail
   • Balance between being concise and providing enough detail.
   • Prompts that are too short may lack context, while overly long prompts can be
     confusing and dilute the main instruction.
Examples and Demonstrations
• Few-shot Learning
   • Provide examples within the prompt (few-shot learning) to illustrate the
     desired response format.
   • For example, “Translate the following sentences to French. Example: ‘Hello,
     how are you?’ -> 'Bonjour, comment ça va?' Now translate: ‘Good
     morning.’”
• Role-Playing
   • Use role-playing to set the scene, such as "You are a helpful assistant.
     How would you explain photosynthesis to a 5-year-old?"
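Few-shot examples and a role can be assembled into a prompt programmatically. The helper below is a hypothetical sketch: the `Input:`/`Output:` labels are one common convention, not a requirement of any particular model.

```python
def build_few_shot_prompt(instruction, examples, query, role=None):
    """Assemble an optional role line, an instruction, worked examples, and the new query."""
    lines = []
    if role:
        lines.append(f"You are {role}.")
    lines.append(instruction)
    for source, target in examples:
        # Each demonstration shows the model the expected input/output format
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    # The final item has no answer, so the model completes it in the same format
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    instruction="Translate the following sentences to French.",
    examples=[("Hello, how are you?", "Bonjour, comment ça va?")],
    query="Good morning.",
    role="a helpful translator",
)
print(prompt)
```

Ending the prompt with an empty `Output:` line nudges the model to answer in exactly the format the examples demonstrated.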
Iterative Refinement
• Feedback Loop
  • Continuously refine prompts based on the outputs received. If the
    model’s response is not as expected, tweak the prompt and try again.
• Experimentation
  • Experiment with different phrasings and structures to see which
    variations yield the best results.
Examples of Prompt Engineering
• Example 1: Simple Query
  • Before
     • "Tell me about climate change."
  • After
     • "Explain the causes and effects of climate change in detail, and provide examples of
       how it impacts different regions around the world."
Examples of Prompt Engineering
• Example 2: Supplying Context
  • Before
     • "Generate a summary."
  • After
     • "Read the following paragraph and generate a summary that highlights the main
       points: [Insert paragraph here]."
Examples of Prompt Engineering
• Example 3: Few-shot Learning
  • Before
     • "Translate to Spanish: 'Good night.'"
  • After
     • "Translate the following English sentences to Spanish. Example: 'Hello, how are
       you?' -> 'Hola, ¿cómo estás?'. Now translate: 'Good night.'"
Best Practices
• Be Explicit
   • Avoid vague language and be as explicit as possible about what you want
     the model to do.
• Test Variations
   • Try different versions of your prompt to see which one works best.
• Leverage Examples
   • Use in-context examples to guide the model.
• Maintain Consistency
   • Keep a consistent format for similar types of prompts to ensure reliable
     outputs.
Model Fine-Tuning
• Introduction to Model Fine-Tuning
  • Definition: Fine-tuning is the process of taking a pre-trained language
    model and training it further on a specific dataset to adapt it to a
    particular task.
  • Purpose: It helps the model perform better on specific tasks by leveraging
    the general knowledge it has already learned during pre-training.
Why is Fine-Tuning Important?
• Specialization
   • Adapts a general-purpose model to perform well on specific tasks (e.g.,
     summarization, translation, sentiment analysis).
• Improved Performance
   • Enhances the model's accuracy and effectiveness for the given task.
• Resource Efficiency
   • Saves time and computational resources compared to training a model
     from scratch.
Steps in Fine-Tuning a Language Model
1. Select a Pre-trained Model
  • Choose a base model that has been pre-trained on a large corpus of text
    (e.g., GPT-3, BERT).
2. Prepare the Dataset
  • Collect and preprocess a dataset relevant to the specific task you want
    the model to perform.
  • Ensure the dataset is clean, well-labeled, and representative of the task.
3. Adjust Model Architecture (if needed)
  • Sometimes, minor modifications to the model architecture are made to
    better suit the task.
Steps in Fine-Tuning a Language Model
4. Set Hyperparameters
   • Define hyperparameters such as learning rate, batch size, and number of
     training epochs.
5. Training
   • Train the model on the specific dataset by updating its weights through
     backpropagation.
   • Use techniques like gradient descent to minimize the loss function, which
     measures the difference between the model’s predictions and the actual
     results.
6. Validation
   • Evaluate the model’s performance on a validation set to monitor overfitting and
     adjust hyperparameters if necessary.
7. Testing
   • Test the fine-tuned model on a separate test set to assess its final performance.
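The seven steps can be traced on a deliberately tiny stand-in model. In the sketch below the "pre-trained model" is just a line y = w·x + b with weights (2.0, 0.0), and the task data is synthetic; both are assumptions made so the whole loop (gradient descent, validation with early stopping, final test) runs in plain Python. A real fine-tuning run follows the same shape with far more parameters.

```python
import random

def predict(params, x):
    w, b = params
    return w * x + b

def mse(params, data):
    """Loss: mean squared difference between predictions and actual values."""
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)

def gradients(params, data):
    """Gradient of the MSE loss with respect to w and b."""
    w, b = params
    n = len(data)
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / n
    gb = sum(2 * (w * x + b - y) for x, y in data) / n
    return gw, gb

def fine_tune(pretrained, train, val, lr=0.1, epochs=200, patience=10):
    """Gradient descent from the pre-trained weights, keeping the best validation checkpoint."""
    params, best = pretrained, pretrained
    best_val, bad_epochs = mse(pretrained, val), 0
    for _ in range(epochs):
        gw, gb = gradients(params, train)
        params = (params[0] - lr * gw, params[1] - lr * gb)
        val_loss = mse(params, val)
        if val_loss < best_val:
            best, best_val, bad_epochs = params, val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # validation stopped improving
                break
    return best

random.seed(0)
# Synthetic task data: y = 2x + 1 plus a little noise
xs = [(k - 15) / 10 for k in range(30)]
data = [(x, 2 * x + 1 + random.gauss(0, 0.05)) for x in xs]
random.shuffle(data)
train, val, test = data[:20], data[20:25], data[25:]

pretrained = (2.0, 0.0)   # step 1: start from existing weights
tuned = fine_tune(pretrained, train, val)   # steps 4-6
print("test loss before:", mse(pretrained, test))   # step 7
print("test loss after: ", mse(tuned, test))
```

The test set is touched only once, at the end, so the reported loss reflects data the model never saw during training or hyperparameter selection.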
Key Concepts to Understand
• Pre-training vs. Fine-tuning
   • Pre-training involves training on a large, diverse dataset to learn general
     language patterns, while fine-tuning adapts the model to a specific task.
• Overfitting
   • A situation where the model performs well on the training data but poorly
     on unseen data. Fine-tuning on a small dataset is prone to this, so monitor
     performance on a validation set.
• Transfer Learning
   • The concept of transferring knowledge from one task (pre-training) to
     another (fine-tuning).
     LLM Hands On
     • LangChain
           • Framework for building applications that leverage LLMs for various purposes
     • FAISS
           • Library for efficient similarity search over dense vectors
https://github.com/codebasics/langchain/blob/main/3_project_codebasics_q_and_a/google_palm_codebasics_q_and_a.ipynb
     LLM Hands On Using Google PaLM
     • Basic working of Google PaLM LLM in langchain
     • Loading and retrieving data from a CSV file using Langchain
       and FAISS
     • Create a RetrievalQA chain with a prompt template
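The retrieval idea behind this hands-on can be sketched without any external library. The code below is a stand-in, not the notebook's code: a bag-of-words vector replaces a learned embedding model, a sorted list replaces the FAISS index, the FAQ rows and column names are hypothetical, and no LLM is called. It only builds the final prompt that a retrieval QA chain would send.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real pipeline would use a learned embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, rows, k=1):
    """Return the k rows whose stored question is most similar to the new question."""
    q = embed(question)
    ranked = sorted(rows, key=lambda row: cosine(q, embed(row["prompt"])), reverse=True)
    return ranked[:k]

PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, say \"I don't know.\"\n\n"
    "CONTEXT: {context}\n\nQUESTION: {question}"
)

# Hypothetical FAQ rows standing in for the CSV file
faq_rows = [
    {"prompt": "Do you offer a refund policy?", "response": "Refunds are available within 30 days."},
    {"prompt": "Is there a course on Python?", "response": "Yes, a beginner Python course is available."},
]

question = "Can I get a refund?"
top = retrieve(question, faq_rows, k=1)
final_prompt = PROMPT_TEMPLATE.format(context=top[0]["response"], question=question)
print(final_prompt)
```

Restricting the model to the retrieved context is what lets it answer from the CSV data rather than from whatever it memorized during pre-training.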
Basic working of Google PaLM LLM in LangChain
Loading and retrieving data from a CSV file using LangChain and FAISS