Model Pretraining:
What is a generative configuration?
How do you decide which model to choose for your work?
What is the difference between autoregressive models and autoencoding models?
Autoencoding Models: Encoder-Only Models
            Autoencoding models, also known as encoder-only models, are pre-trained
            using masked language modeling. In this approach, tokens in the input
            sequence are randomly masked, and the model’s objective is to predict the
            masked tokens to reconstruct the original sentence.
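As a rough illustration, here is a minimal Python sketch of the masking step. The word-level tokens and fixed 15% masking rate are simplifications; real encoder-only models such as BERT work on sub-word token IDs and use a slightly more involved masking recipe.

    import random

    # Minimal sketch of the masked-language-modeling objective: randomly
    # replace tokens with [MASK]; the model must predict the originals.
    def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
        masked, labels = [], []
        for tok in tokens:
            if random.random() < mask_prob:
                masked.append(mask_token)
                labels.append(tok)      # target: reconstruct the original token
            else:
                masked.append(tok)
                labels.append(None)     # no loss is computed on unmasked positions
        return masked, labels

    masked, labels = mask_tokens("the teacher teaches the student".split())
    print(masked)   # e.g. ['the', '[MASK]', 'teaches', 'the', 'student']
    print(labels)   # e.g. [None, 'teacher', None, None, None]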
Autoregressive Models: Decoder-Only Models
            Autoregressive models, or decoder-only models, are pre-trained using causal
            language modeling. The objective is to predict the next token based on the
previous sequence of tokens. These models apply a causal attention mask so
that, at each position, the model can only see the input tokens leading up to
the token in question.
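A minimal sketch of the causal objective: inputs and targets are the same sequence shifted by one position, and a causal mask ensures each position only attends to earlier ones. The token list here is just an illustrative example.

    # Causal language modeling: the target at position i is the token at i+1,
    # and attention is masked so position i can only see positions <= i.
    tokens = "the teacher teaches the student".split()

    inputs  = tokens[:-1]   # ['the', 'teacher', 'teaches', 'the']
    targets = tokens[1:]    # ['teacher', 'teaches', 'the', 'student']

    # Causal attention mask: position i may attend to position j only if j <= i.
    n = len(inputs)
    causal_mask = [[j <= i for j in range(n)] for i in range(n)]
    for row in causal_mask:
        print(["x" if ok else "." for ok in row])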
Sequence-to-Sequence Models: Encoder-Decoder Models
Sequence-to-sequence models utilize both the encoder and decoder components of
the original transformer architecture. The pre-training objective for these
models varies depending on the specific model; T5, for example, is pre-trained
with span corruption, where random spans of input tokens are masked and the
decoder learns to reconstruct them.
Sequence-to-sequence models are commonly used for translation, summarization,
and question-answering tasks. BART is another well-known encoder-decoder model.
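Span corruption can be pictured roughly as follows. The sentence, the chosen span, and the sentinel names are illustrative only, not T5's exact tokenization.

    # Rough illustration of a span-corruption objective in the spirit of T5:
    # a span of input tokens is replaced by a sentinel token, and the decoder
    # is trained to generate the missing span.
    source = "the teacher teaches the student".split()

    encoder_input  = ["the", "teacher", "<extra_id_0>", "student"]       # span masked out
    decoder_target = ["<extra_id_0>", "teaches", "the", "<extra_id_1>"]  # span to reconstruct
    print(encoder_input, "->", decoder_target)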
What Matters Most:
    Model size
    Training dataset size
    Compute power and training time
There are three ways to give your model a boost:
    A. Give your model more horsepower — increase the compute power and the
    time you are going to train the model
    B. Give your model more muscles — increase the model size, i.e. the number
    of model parameters
    C. Give your model more training material — increase the size of the dataset
            The paper Scaling Laws for Neural Language Models shows that increasing
            model size, dataset size, and training compute all independently enhance
            performance.
This has driven the trend toward ever-larger models such as GPT-3 (175B
parameters), Jurassic-1 (178B parameters), and the massive Megatron-Turing NLG
(530B parameters).
B. Increase the model size or the number of model parameters:
In the DeepMind paper Training Compute-Optimal Large Language Models (also
known as the Chinchilla paper), the authors suggest that current big models
might be over-parameterised and under-trained in terms of dataset size.
Cost issues:
    Inference cost — the cost of calling an LLM to generate a response
    Tuning cost — the cost of fine-tuning an LLM so that the pre-trained model
    produces tailored responses
                  Pre-training cost — the cost of training a new LLM from scratch
                  Hosting cost — the cost of deploying and maintaining a model behind an
                  API, supporting inference or tuning
These choices are influenced by the available compute budget, encompassing
            factors such as hardware limitations, training time constraints, and financial
            considerations.
            C. Increase the size of the dataset:
The authors show that the optimal training dataset size for a given model is
about 20 times larger than the number of parameters of the model (measured in
tokens).
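As a quick sanity check of that rule of thumb, using the publicly reported Chinchilla numbers:

    # Back-of-the-envelope check of the "~20 training tokens per parameter" rule.
    params = 70e9                    # Chinchilla: 70B parameters
    optimal_tokens = 20 * params     # ~1.4e12 tokens, i.e. the ~1.4T reported below
    print(f"{optimal_tokens:.2e}")   # 1.40e+12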
The Chinchilla model trained by DeepMind was trained compute-optimally: roughly
1.4T training tokens for 70B parameters. LLaMA-65B follows a similar pattern,
in contrast to models such as GPT-3 or BLOOM, which are highlighted in red in
the figure below.
[Figure: optimal models]
            Why choose a Small Language Model?
            Benefits and shortcomings of Small Language
            Models
            Benefits
Efficient — SLMs are more nimble and require less computational power, which
makes them more efficient to deploy in production
Cost-effective — Fewer parameters mean that you need fewer resources to
train, maintain and run an SLM compared to an LLM
                  Specialised — SLMs can be trained on high quality datasets for specific
                  domain tasks. This often leads to better performance within that niche
Explainable — Because these models are less complex and use more targeted
data, they can offer more transparency into their outputs. Explainability is
valued in most enterprises, especially in sensitive applications
            Shortcomings
                  Task-limited — Due to their specialised nature SLMs might struggle to
                  perform as well on tasks outside their training domain. They lack the
                  breadth of knowledge that LLMs possess
                  Performance-limited — SLMs have a lower capacity for learning and
                  understanding complex language patterns compared to larger models.
                  This can lead to limitations in the types of tasks they can handle effectively
Dataset-dependent — Smaller, less curated datasets can lead to less robust
models, as the performance of SLMs relies heavily on the quality and relevance
of the data they are trained on
Why Customize Language Models for Specialized Domains?
            Understanding Domain Adaptation
            Certain domains, such as law, medicine, finance, and science, possess their
            own vocabulary and unique language structures. Common terms and phrases
            in these domains may be unfamiliar outside their respective fields.
BloombergGPT: A Finance-Focused LLM
BloombergGPT serves as a prime example of a
            specialized LLM in the financial domain. Developed by Bloomberg researchers,
            this model combines finance-specific data with general-purpose text during
            pretraining. By maintaining a balance between finance and public data (51%
            financial data and 49% public data), BloombergGPT achieves superior results
            on financial benchmarks while still demonstrating competitive performance on
            general-purpose LLM benchmarks.
            What is Quantization?
            Quantization involves reducing the memory required to store model weights by
            decreasing their precision. Instead of the default 32-bit floating-point numbers
            (FP32) used to represent parameters, quantization employs 16-bit floating-
            point numbers (FP16) or even 8-bit integers (INT8). This reduction in precision
            helps optimize the memory footprint of the models.
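A minimal NumPy sketch of the idea, assuming simple symmetric quantization; real toolkits add zero-points, per-channel scales, calibration, and so on.

    import numpy as np

    # Symmetric INT8 quantization: project FP32 weights onto 8-bit integers
    # with a scale factor derived from the range of the original values.
    weights_fp32 = np.random.randn(5).astype(np.float32)

    scale = np.abs(weights_fp32).max() / 127.0        # map the observed range onto [-127, 127]
    weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)
    dequantized  = weights_int8.astype(np.float32) * scale

    print(weights_fp32)
    print(weights_int8)   # 1 byte per weight instead of 4
    print(dequantized)    # approximate reconstruction of the originals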
Quantization Process and Memory Savings
Quantization statistically projects the original 32-bit floating-point numbers
into lower-precision spaces, utilizing scaling factors derived from the range
of the original numbers. For instance, a model with one billion parameters
needs roughly 4 GB of GPU RAM just to store its weights at full 32-bit
precision, and on the order of 80 GB to train once optimizer states, gradients,
and activations are included, so quantization can yield significant memory
savings.
By employing 16-bit half precision (FP16), these memory requirements are
roughly halved: about 2 GB for the weights and about 40 GB to train.
Representing the model parameters as 8-bit integers (INT8) reduces the weight
storage further to just one gigabyte, a 75% reduction compared to the 4 GB
needed at full 32-bit precision.
            BFLOAT16 (BF16) has emerged as a widely adopted precision format in deep
            learning. Developed by Google Brain, BF16 serves as a hybrid between FP16
            and FP32, capturing the full dynamic range of FP32 with just 16 bits.
Key takeaways:
1. Quantization reduces the memory required to store and train models
2. It projects weights into lower-precision spaces
3. Quantization-aware training (QAT) applies quantization during training so
   the model learns to cope with the lower precision
4. BFLOAT16 (BF16) is a popular precision choice for pretraining
THINK:
A 500B-parameter model at 32-bit precision: how much GPU RAM is needed to
store and train it?
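A rough back-of-the-envelope answer, using the ~4 bytes per parameter for FP32 weights and the ~80 GB per billion parameters training estimate discussed above:

    # Rough estimate only: real requirements depend on optimizer, precision,
    # activation checkpointing, and parallelism strategy.
    params_billions = 500

    weights_gb  = params_billions * 4    # ~2,000 GB (~2 TB) just for FP32 weights
    training_gb = params_billions * 80   # ~40,000 GB (~40 TB) to train at FP32
    print(weights_gb, training_gb)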
What are the scaling techniques for model training with multiple GPUs?
Improving Efficiency and Performance
            Two popular techniques
            1. Distributed Data-Parallel (DDP)
            2. Fully Sharded Data Parallel (FSDP)
            DDP is a widely used model replication technique that distributes large datasets
            across multiple GPUs, enabling parallel processing of batches of data. With
            DDP, each GPU receives a copy of the model and processes data
            independently. Afterward, a synchronization step combines the results,
            updating the identical model on each GPU. DDP is suitable when the model
            and its additional parameters fit onto a single GPU, resulting in faster
            training.
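A minimal PyTorch DDP sketch, assuming a torchrun-style launch that sets up the process group environment; the model, data, and hyperparameters are placeholders.

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # Each process drives one GPU; torchrun provides the rendezvous info.
        dist.init_process_group(backend="nccl")
        local_rank = dist.get_rank() % torch.cuda.device_count()
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda(local_rank)
        ddp_model = DDP(model, device_ids=[local_rank])   # each GPU holds a full replica

        optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
        for _ in range(10):
            x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
            loss = ddp_model(x).pow(2).mean()
            loss.backward()              # gradients are all-reduced across GPUs here
            optimizer.step()
            optimizer.zero_grad()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()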
When the model is too large to fit in the memory of a single GPU, replication
is no longer enough. Sharding involves splitting and distributing one logical
data set across multiple databases that share nothing and can be deployed
across multiple servers; in model training, the same idea is applied to the
model state instead of a database.
            Fully Sharded Data Parallel (FSDP):
            FSDP, inspired by the ZeRO technique, provides a solution when the model is
            too large to fit in the memory of a single GPU. ZeRO (Zero Redundancy
            Optimizer) aims to optimize memory usage by distributing or sharding model
            parameters, gradients, and optimizer states across GPUs. FSDP applies
            sharding strategies specified in ZeRO to distribute these components across
            GPU nodes. This enables working with models that would otherwise exceed the
            capacity of a single chip.
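A minimal FSDP sketch under the same launch assumptions as the DDP example; here parameters, gradients, and optimizer states are sharded across ranks instead of being replicated.

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def main():
        dist.init_process_group(backend="nccl")
        local_rank = dist.get_rank() % torch.cuda.device_count()
        torch.cuda.set_device(local_rank)

        model = torch.nn.Sequential(
            torch.nn.Linear(4096, 4096),
            torch.nn.ReLU(),
            torch.nn.Linear(4096, 4096),
        ).cuda(local_rank)

        fsdp_model = FSDP(model)          # shards parameters across all ranks
        optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

        x = torch.randn(8, 4096, device=f"cuda:{local_rank}")
        loss = fsdp_model(x).pow(2).mean()
        loss.backward()
        optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()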
            Memory Optimization with ZeRO:
            ZeRO offers three optimization stages:
                  Stage 1 shards only optimizer states, reducing memory usage by up to a
                  factor of four.
                  Stage 2 shards gradients, further reducing memory usage by up to eight
                  times when combined with Stage 1.
                  Stage 3 shards all components, including model parameters, with memory
                  reduction scaling linearly with the number of GPUs.
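In practice these stages are typically selected through a framework configuration; for example, a DeepSpeed-style config chooses the stage via its zero_optimization block. The sketch below is minimal and illustrative, not a complete config.

    # Minimal, illustrative DeepSpeed-style config: the ZeRO stage controls how
    # much state is sharded (1: optimizer states, 2: + gradients, 3: + parameters).
    ds_config = {
        "train_micro_batch_size_per_gpu": 4,
        "zero_optimization": {"stage": 2},
        "bf16": {"enabled": True},
    }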