CS516: Parallelization of Programs
Introduction
Vishwesh Jatala
Assistant Professor
Department of CSE
Indian Institute of Technology Bhilai
vishwesh@iitbhilai.ac.in
2023-24 W
What is the output?
Outline of Today’s Lecture
■ Why?
■ What?
■ How?
Moore’s Law
■ The number of transistors on an IC doubles about every two years
Moore’s Law Effect
Moore’s Law Effect
[Diagram: single-core performance scaling runs into power and heat limits, driving the shift to multicore processors]
Parallel Architectures are Everywhere!
Parallel Hardware
■ Distributed CPUs
■ Multicores
■ GPUs
Hardware and Software
[Diagram: single-core CPU memory hierarchy: Core, L1 Cache, L2 Cache, DRAM. A sequential program runs on the single core.]
Hardware and Software
[Diagram: multi-core CPU: Core1-Core4, each with a private L1 Cache, sharing an L2 Cache and DRAM]
Will the same sequential program run any faster here?
Professor P
15 questions, 300 answer sheets
Professor P’s Teaching Assistants
TA#1, TA#2, TA#3
Benefits of Parallel Programming
■ Fast: less execution time
■ Saves money
■ Solves larger problems
Parallel Programming Applications
■ Deep Learning and Machine Learning
■ Medical Imaging
■ Climate Modeling
■ Computational Fluid Dynamics (CFD)
Parallel Programming Applications
OpenAI used a supercomputer with more than 285,000 CPU cores, 10,000 GPUs, and 400 gigabits per second of network connectivity for each GPU server.
Challenges!
Example-1
Sequential Execution
Core-0:
for (int i = 0; i < 5; i++)
    A[i] = i;
A[0]=0  A[1]=1  A[2]=2  A[3]=3  A[4]=4
Example-1
for (int i = 0; i < 5; i++)
    A[i] = i;
Parallel Execution
Core-0: A[0]=0   Core-1: A[1]=1   Core-2: A[2]=2   Core-3: A[3]=3   Core-4: A[4]=4
Example-2
Sequential Execution
Core-0:
int count = 0;
for (int i = 0; i < 5; i++)
    A[i] = count++;
A[0]=0  A[1]=1  A[2]=2  A[3]=3  A[4]=4
Example-2
int count = 0;
for (int i = 0; i < 5; i++)
    A[i] = count++;
Parallel Execution (one possible outcome)
Core-0: A[0]=1   Core-1: A[1]=2   Core-2: A[2]=0   Core-3: A[3]=4   Core-4: A[4]=3
Challenges:
Detecting Parallelism is Hard!
Example-3: Sequential Version
int sum = 0;
for (i = 0; i < n; i++) {
    x = f(i);
    sum = sum + x;
}
A Sequential Program for Sum
Core-0 processes all the data:
1 4 3 9 2 8 5 1 1 6 2 7 2 5 0 4 1 8 6 5 1 2 3 9
sum = 95
Example-3: Parallel Version-1
my_sum = 0;
my_first_i = . . . ;
my_last_i = . . . ;
for (my_i = my_first_i; my_i < my_last_i; my_i++) {
    my_x = f(my_i);
    my_sum += my_x;
}
Each core sums its own block of the data:
Core-0 Core-1 Core-2 Core-3 Core-4 Core-5 Core-6 Core-7
1 4 3 9 2 8 5 1 1 6 2 7 2 5 0 4 1 8 6 5 1 2 3 9
my_sum: 8 19 7 15 7 13 12 14
Example-3: Parallel Version-1
if (I’m the master core) {
    sum = my_sum;
    for (each core other than myself) {
        receive value from core;
        sum += value;
    }
} else {
    send my_sum to the master;
}
Core-0 Core-1 Core-2 Core-3 Core-4 Core-5 Core-6 Core-7
my_sum: 8 19 7 15 7 13 12 14
sum = 95 (8+19+7+15+7+13+12+14)
Example-3: Parallel Version-2
Parallel Version-1 or Version-2?
■ Both perform the same number of additions
■ Version-1’s final sum is sequential: the master combines the p partial sums one at a time
■ Version-2’s tree reduction exposes parallelism: log2(p) combining steps
Challenges:
Detecting Parallelism is Hard!
Communication
Synchronization
Example-4:
int sum = 0;
for (i = 0; i < n; i++) {
    x = f(i);
    sum = sum + x;
}
Challenges:
Detecting Parallelism is Hard!
Communication
Synchronization
Load Balancing
What did we learn so far?
■ Parallel hardware is inevitable
■ Sequential programs are inefficient
■ Parallelization is promising
■ Challenges!
Let’s discuss details from next class!
Course Outline (Part-1)
■ Introduction (this lecture)
■ Overview of Parallel Architectures and
Programming models
■ Amdahl's law and Performance
■ Parallel programming
❑ GPUs and CUDA programming
❑ Optimizations
■ Case studies
■ Extracting Parallelism from Sequential Programs
Automatically
Course Logistics
■ Lecture Hours:
❑ Mon, Tue, Thursday 10:30 am - 11:25 am
■ Course Website: Canvas platform
❑ Lecture notes
❑ Assignments
❑ Project
❑ Discussions
❑ Marks
Course Logistics: Evaluation Scheme
■ Evaluation scheme (can be changed slightly):
❑ Exams: ~40%
❑ Project: ~35%
❑ Attendance: ~10%
❑ Assignments (paper presentation?): 15%
■ Attendance
❑ 0% - 50%: 0 marks
❑ >50%: marks awarded out of 10 proportionally
❑ Example:
■ Total sessions: 16
■ #sessions attended = 7 (<50%), marks = 0
■ #sessions attended = 10 (62.5%), marks = 2.5 ((10-8) × 10/8)
Course Logistics: References
■ Lecture notes will be available on Canvas
■ Reference material will be provided on Canvas
■ Textbook for extracting parallelism:
❑ Randy Allen and Ken Kennedy, Optimizing Compilers for Modern Architectures: A Dependence-based Approach, Morgan Kaufmann, 2001
Course Logistics: Tools
■ Platforms:
❑ Prefer Google Colab for GPUs and CUDA Programming.
❑ A demo session
Course Logistics
■ Evaluation Policy:
❑ Acknowledge all the sources
❑ Do not cheat
Outcome of the Course?
■ State-of-the-art techniques in parallel computing
■ Develop parallel programming skills
■ Transferable skills:
❑ Parallel programming is used in multiple disciplines
❑ Industries
❑ Education and research
■ Handle projects
■ High-performance computing
Course Logistics
Office: B-010
vishwesh@iitbhilai.ac.in
References
■ https://www.cse.iitd.ac.in/~soham/COL380/page.html
■ https://s3.wp.wsu.edu/uploads/sites/1122/2017/05/6-9-2017-slides-vFinal.pptx
■ Miscellaneous resources on the internet
Thank you!