Distributed Machine Learning
and the Parameter Server
CS4787 Lecture 24 — Fall 2023
So far, we’ve been talking about ways to scale our machine
learning pipeline that focus on a single machine. But if we
really want to scale up to huge datasets and models,
eventually one machine won’t be enough.
This lecture will cover methods for using multiple
machines to do learning.
Distributed computing basics
• Distributed parallel computing involves two or more machines
collaborating on a single task by communicating over a network.
• Unlike parallel programming on a single machine, distributed computing requires
explicit (i.e. written in software) communication among the workers.
[Diagram: machines with GPUs communicating over a network]
• There are a few basic patterns of communication that are used by
distributed programs.
Basic patterns of communication
Push
• Machine A sends some data to machine B.
Basic patterns of communication
Pull
• Machine B requests some data from machine A.
• This differs from push only in terms of who initiates the communication
Basic patterns of communication
Broadcast
• Machine A sends data to many machines.
Basic patterns of communication
Reduce
• Compute some reduction (usually a sum) of data on multiple machines
C1, C2, …, Cn and materialize the result on one machine B.
Basic patterns of communication
All-Reduce
• Compute some reduction (usually a sum) of data on multiple machines
and materialize the result on all those machines.
Basic patterns of communication
Scatter-Reduce
• Compute some reduction of data on M machines and materialize 1/M of
the result on each machine (sharding the result).
Basic patterns of communication
Wait
• One machine pauses its computation and waits on a signal from another
machine
Basic patterns of communication
Barrier
• Many machines wait until all those machines reach a point in their
execution, then continue from there
Patterns of Communication Summary
• Push/Pull. Machine A sends data to machine B, or B requests data from A.
• Broadcast. Machine A sends some data to many machines C1, C2, …, Cn.
• Reduce. Compute some reduction (usually a sum) of data on multiple
machines C1, C2, …, Cn and materialize the result on one machine B.
• All-reduce. Compute some reduction (usually a sum) of data on multiple
machines C1, C2, …, Cn and materialize the result on all those machines.
• Scatter-reduce. Compute some reduction (usually a sum) of data on multiple
machines C1, C2, …, Cn and materialize the result in a sharded fashion.
• Wait. One machine pauses its computation and waits for data to be received
from another machine.
• Barrier. Many machines wait until all other machines reach a point in their
code before proceeding.
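These patterns map directly onto the collective operations provided by standard distributed computing libraries such as MPI or PyTorch's torch.distributed. Below is a minimal sketch using torch.distributed, assuming each process has already joined a process group; the tensor contents are just placeholders.

```python
import torch
import torch.distributed as dist

# Sketch of the patterns above using torch.distributed collectives. Assumes each
# process was launched with a rank and world size (e.g. via torchrun) and has
# already called dist.init_process_group(); the tensor contents are placeholders.
def demo_patterns(rank, world_size):
    x = torch.ones(4) * (rank + 1)

    # Push / pull: point-to-point communication between two machines.
    if rank == 0:
        dist.send(x, dst=1)      # rank 0 pushes its tensor to rank 1
    elif rank == 1:
        dist.recv(x, src=0)      # rank 1 receives it (a pull differs only in who initiates)

    dist.broadcast(x, src=0)                     # broadcast: rank 0's x is copied to everyone
    dist.reduce(x, dst=0, op=dist.ReduceOp.SUM)  # reduce: sum materialized only on rank 0
    dist.all_reduce(x, op=dist.ReduceOp.SUM)     # all-reduce: sum materialized on every rank
    dist.barrier()                               # barrier: wait until every rank reaches here
    # (scatter-reduce corresponds to dist.reduce_scatter, which shards the summed result)
```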
Overlapping computation and communication
• Communicating over the network can have high latency
• we want to hide this latency
• An important principle of distributed computing is overlapping
computation and communication
• For the best performance, we want our workers to still be doing useful
work while communication is going on
• rather than having to stop and wait for the communication to finish
• sometimes called a stall
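One concrete way to get this overlap with collectives like all-reduce is to launch the communication asynchronously and only wait on it when the result is actually needed. A hedged sketch using torch.distributed; the function and variable names are illustrative.

```python
import torch
import torch.distributed as dist

# Start an all-reduce, keep doing useful work while the bytes are on the wire,
# and only block (stall) at the point where the summed result is required.
def overlapped_step(grad_to_sync, next_batch):
    handle = dist.all_reduce(grad_to_sync, op=dist.ReduceOp.SUM, async_op=True)

    # Useful work that does not depend on the communication result,
    # e.g. preprocessing / augmenting the next minibatch.
    next_batch = next_batch.float() / 255.0

    handle.wait()   # the stall, if any, happens only here
    return grad_to_sync, next_batch
```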
Running SGD with All-reduce
• All-reduce gives us a simple way of running learning algorithms such as
SGD in a distributed fashion with data parallelism.
• Simply put, the idea is to just parallelize the minibatch. We start with an
identical copy of the parameter $w_t$ on each worker.
• Recall that the SGD update step looks like:
$$w_{t+1} = w_t - \alpha_t \cdot \frac{1}{B} \sum_{b=1}^{B} \nabla f_{i_{b,t}}(w_t)$$
Running SGD with All-reduce (continued)
• If there are $M$ worker machines such that $B = M \cdot B'$, then we can re-write
this update step as
$$w_{t+1} = w_t - \alpha_t \cdot \frac{1}{M} \sum_{m=1}^{M} \frac{1}{B'} \sum_{b=1}^{B'} \nabla f_{i_{m,b,t}}(w_t).$$
• Now, we assign the computation of the sum when $m = 1$ to worker 1, the
computation of the sum when $m = 2$ to worker 2, et cetera.
• After all the gradients are computed, we can perform the outer sum with
an all-reduce operation.
Running SGD with All-reduce (continued)
• After this all-reduce, the whole sum (which is essentially the minibatch
gradient) will be present on all the machines
• so each machine can now update its copy of the parameters
• Since the sum is the same on all machines, the parameters will update in lockstep
• Statistically equivalent to sequential SGD!
• This corresponds to the following algorithm:
Algorithm 1 Distributed SGD with All-Reduce
input: loss function examples $f_1, f_2, \dots$, number of machines $M$, per-machine minibatch size $B'$
input: learning rate schedule $\alpha_t$, initial parameters $w_0$, number of iterations $T$
for $m = 1$ to $M$, run in parallel on machine $m$:
    load $w_0$ from algorithm inputs
    for $t = 1$ to $T$ do
        select a minibatch $i_{m,1,t}, i_{m,2,t}, \dots, i_{m,B',t}$ of size $B'$
        compute $g_{m,t} \leftarrow \frac{1}{B'} \sum_{b=1}^{B'} \nabla f_{i_{m,b,t}}(w_{t-1})$
        all-reduce across all workers to compute $G_t = \sum_{m=1}^{M} g_{m,t}$
        update model $w_t \leftarrow w_{t-1} - \frac{\alpha_t}{M} \cdot G_t$
    end for
end parallel for
return $w_T$ (from any machine)
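A hedged sketch of what each of the M workers in Algorithm 1 might run, using torch.distributed for the all-reduce; the model, data loader, and learning-rate schedule are placeholders rather than a prescribed implementation.

```python
import torch
import torch.distributed as dist

def worker_loop(model, data_loader, loss_fn, alpha, T, M):
    """Runs identically on each of the M workers; assumes init_process_group was called."""
    # Every worker starts from the same parameters: broadcast rank 0's copy.
    for p in model.parameters():
        dist.broadcast(p.data, src=0)

    data_iter = iter(data_loader)
    for t in range(1, T + 1):
        x, y = next(data_iter)              # this worker's minibatch of size B'
        model.zero_grad()
        loss_fn(model(x), y).backward()     # p.grad now holds g_{m,t}, averaged over B' examples

        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # G_t = sum over workers of g_{m,t}
            p.data -= (alpha(t) / M) * p.grad               # w_t = w_{t-1} - (alpha_t / M) * G_t
```

Because every worker applies exactly the same update, the parameter copies stay identical without any further synchronization.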
• It is straightforward to use the same all-reduce pattern to run variants of SGD such as momentum and Adam.
What are the benefits of
distributing SGD with all-reduce?
What are the drawbacks?
Benefits of distributed SGD with All-reduce
• The algorithm is easy to reason about, since it’s statistically equivalent to
minibatch SGD.
• And we can use the same hyperparameters for the most part.
• The algorithm is easy to implement
• since all the worker machines have the same role and it runs on top of standard
distributed computing primitives.
Drawbacks of distributed SGD with all-reduce
• While the communication for the all-reduce is happening, the workers are (for
the most part) idle.
• We’re not overlapping computation and communication.
• At least by default
• We can overlap communication with preprocessing/data augmentation
• The effective minibatch size is growing with the number of machines,
and for cases where we don’t want to run with a large minibatch size for
statistical reasons, this can prevent us from scaling to large numbers of
machines using this method.
Where do we get the training examples from?
• There are two general options for distributed learning.
• Training data servers
• Have one or more non-worker servers dedicated to storing the training examples
(e.g. a distributed in-memory filesystem)
• The worker machines load training examples from those servers.
• These servers can handle preprocessing and data augmentation (but usually don’t)
• Partitioned dataset
• Partition the training examples among the workers themselves and store them
locally in memory on the workers.
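For the partitioned-dataset option, one common realization is to give each worker a disjoint 1/M slice of the examples; a minimal sketch using PyTorch's DistributedSampler, where the dataset is synthetic and m, M stand for this worker's index and the number of workers.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, DistributedSampler

# Synthetic stand-in for the real training set.
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

M, m = 4, 0   # number of workers and this worker's index (placeholders)

# Each worker sees a disjoint 1/M partition of the examples each epoch.
sampler = DistributedSampler(dataset, num_replicas=M, rank=m, shuffle=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```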
The Parameter Server Model
The Basic Idea
• Recall from the early lectures in this course that a lot of our theory talked
about the convergence of optimization algorithms.
• This convergence was measured by some function over the parameters at time t
(e.g. the objective function or the norm of its gradient) that is decreasing with t,
which shows that the algorithm is making progress.
• For this to even make sense, though, we need to be able to talk about the
value of the parameters at time t as the algorithm runs.
• E.g. in SGD, we had
$$w_{t+1} = w_t - \alpha_t \nabla f_{i_t}(w_t)$$
Parameter Server Basics Continued
• For a program running on a single machine, the value of the parameters at
time t is just the value of some array in the memory hierarchy (backed by
DRAM) at that time.
• But in a distributed setting, there is no shared memory, and communication
must be done explicitly.
• Each machine will usually have one or more copies of the parameters live at any given
time, some of which may have been updated less recently than others, especially if we
want to do something more complicated than all-reduce.
• This raises the question: when reasoning about a distributed algorithm,
what should we consider to be the value of the parameters at a given time?
• For SGD with all-reduce, we can answer this question easily, since the value of
the parameters is the same on all workers (guaranteed to be the same by the
all-reduce operation); we just appoint this identical shared value to be the
value of the parameters at any given time.
The Parameter Server Model
• The parameter server model answers this question differently, by appointing a
single machine, the parameter server, to take explicit responsibility for
maintaining the current value of the parameters.
• The most up-to-date gold-standard parameters are the ones stored in memory on the
parameter server.
• The parameter server updates its parameters by using gradients that are
computed by the other machines, known as workers, and pushed to the
parameter server.
• Periodically, the parameter server broadcasts its updated parameters to all
the other worker machines, so that they can use the updated parameters to
compute gradients.
[Diagram: workers 1 through M compute gradients from the training data and send them to the parameter server; the parameter server sends new parameters back to the workers.]
Learning with the parameter server
• Many ways to learn with a parameter server
• Synchronous distributed training
• Similar to all-reduce, but with gradients summed on a central parameter server
• Asynchronous distributed training
• Compute and send gradients and add them to the model as soon as possible
• Broadcast updates whenever they are available
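As a rough illustration of the asynchronous variant, here is a conceptual sketch in plain Python, with threads standing in for machines and a queue standing in for the network; all class and function names are made up for illustration.

```python
import queue
import threading
import numpy as np

grad_queue = queue.Queue()   # "network" channel: workers push gradients to the server

class ParameterServer:
    def __init__(self, dim, alpha):
        self.w = np.zeros(dim)        # the gold-standard copy of the parameters
        self.alpha = alpha
        self.lock = threading.Lock()

    def run(self, num_updates):
        for _ in range(num_updates):
            g = grad_queue.get()              # gradient pushed by some worker
            with self.lock:
                self.w -= self.alpha * g      # apply it as soon as it arrives

    def current_params(self):
        with self.lock:
            return self.w.copy()              # workers periodically fetch fresh parameters

def worker(server, stochastic_grad, steps):
    for _ in range(steps):
        w_local = server.current_params()          # may already be slightly stale
        grad_queue.put(stochastic_grad(w_local))   # compute and push a gradient, then repeat
```

In a real system the parameter fetch would be a pull from (or broadcast by) the parameter server over the network rather than a method call, and workers keep computing without waiting for their gradients to be applied.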
Multiple parameter servers
• If the parameters are too numerous for a single parameter server to
handle, we can use multiple parameter server machines.
• We partition the parameters among the multiple parameter servers
• Each server is only responsible for maintaining the parameters in its partition.
• When a worker wants to send a gradient, it will partition that gradient vector and
send each chunk to the corresponding parameter server; later, it will receive the
corresponding chunk of the updated model from that parameter server machine.
• This lets us scale up to very large models!
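A small sketch of the sharding step, assuming the model is a flat parameter vector split into contiguous chunks, one per parameter server; the push_gradient call is a hypothetical RPC, not a specific library API.

```python
import numpy as np

def shard(vector, num_servers):
    """Split a flat vector into one contiguous chunk per parameter server."""
    return np.array_split(vector, num_servers)

def push_sharded_gradient(grad, servers):
    # Server s maintains only its own partition of the parameters, so it only
    # ever receives (and later returns) the matching chunk of each gradient/model.
    for server, chunk in zip(servers, shard(grad, len(servers))):
        server.push_gradient(chunk)   # hypothetical RPC to that parameter server
```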
Other Ways To Distribute
The methods we discussed so far distributed across the minibatch (for all-reduce SGD)
and across iterations of SGD (for asynchronous parameter-server SGD).
But there are other ways to distribute that are used in practice too.
Distribution for hyperparameter optimization
• This is something we’ve already talked about.
• Many commonly used hyperparameter optimization algorithms, such as
grid search and random search, are very simple to distribute.
• They can easily be run on many parallel workers to get results faster.
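For example, random search needs nothing more than a pool of independent workers; a minimal sketch in which the search space and the training routine are placeholders.

```python
import random
from concurrent.futures import ProcessPoolExecutor

def sample_config():
    # Hypothetical search space: log-uniform learning rate, random minibatch size.
    return {"lr": 10 ** random.uniform(-4, -1),
            "batch_size": random.choice([32, 64, 128, 256])}

def train_and_validate(lr, batch_size):
    # Placeholder for actually training a model; returns a fake validation loss here.
    return (lr - 0.01) ** 2 + 0.0001 * batch_size

def evaluate(config):
    return train_and_validate(**config)

if __name__ == "__main__":
    configs = [sample_config() for _ in range(64)]
    with ProcessPoolExecutor(max_workers=8) as pool:   # 8 parallel workers
        losses = list(pool.map(evaluate, configs))
    best = configs[losses.index(min(losses))]
    print("best config:", best)
```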
Model Parallelism
• Main idea: partition the layers of a neural network among different worker
machines.
• This makes each worker responsible for a subset of the parameters.
• Forward and backward signals running through the neural network during
backpropagation now also run across the computer network between the
different parallel machines.
• Particularly useful if the parameters won’t fit in memory on a single machine.
• This is very important when we move to specialized machine learning accelerator
hardware, where we’re running on chips that typically have limited memory and
communication bandwidth.
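A hedged sketch of the idea in PyTorch, splitting a small two-stage network across two devices so that the forward activation and the backward gradient cross the boundary between them; the devices, sizes, and layers are illustrative, and .to() stands in for the explicit send/recv a real multi-machine setup would use.

```python
import torch
import torch.nn as nn

# Two "workers", here just two devices; use "cuda:0" / "cuda:1" if available.
dev0, dev1 = torch.device("cpu"), torch.device("cpu")

stage0 = nn.Sequential(nn.Linear(784, 256), nn.ReLU()).to(dev0)   # first chunk of layers
stage1 = nn.Linear(256, 10).to(dev1)                              # remaining layers

def forward(x):
    h = stage0(x.to(dev0))       # forward signal computed on worker 0
    return stage1(h.to(dev1))    # activation crosses to worker 1, which finishes the pass

loss = forward(torch.randn(32, 784)).sum()
loss.backward()                  # backward signals flow back across the same boundary
```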
A Diagram of Model Parallelism
• From “PipeDream: Fast and Efficient Pipeline Parallel DNN Training.”
Pipeline Parallelism
• A variant of model parallelism that tries to improve throughput by
overlapping minibatch computation.
• From “GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism”
Fully Sharded Data Parallel
• A hybrid of data parallelism
and sharded parameter server
strategies.
• Splits the weights for each
layer among all machines, then
uses a broadcast to get them
whenever they’re needed.
Conclusion and Summary
• Distributed computing is a powerful tool for scaling machine
learning
• We talked about a few methods for distributed training:
• Minibatch SGD with All-reduce
• The parameter server approach
• Model parallelism
• And distribution can be beneficial for many other tasks too!