Profiling – III and Revision
Lecture 25
          April 17, 2024
 Profiling
for (int i = 0; i < 50; i++)
{
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Alltoall(message, arrSize, MPI_INT,
                 recvMessage, arrSize, MPI_INT, MPI_COMM_WORLD);
}
Profiling results (figure panels):
Alltoall - NPROCS=4, Data size = 4 KB: Max. barrier time 0.2 ms
Alltoall - NPROCS=4, Data size = 4 MB: Max. barrier time 1.2 ms; Max. Alltoall time 7.8 ms
Alltoall - NPROCS=16, Data size = 4 KB
Alltoall - NPROCS=16, Data size = 4 MB
Bcast - NPROCS=16, Data size = 4 KB
Bcast - NPROCS=16, Data size = 4 MB
Bcast - NPROCS=32, Data size = 4 KB
Bcast - NPROCS=32, Data size = 4 MB
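The per-call maxima above could be gathered with a harness along these lines: a minimal sketch that times the barrier and the Alltoall separately with MPI_Wtime and reduces the per-rank maxima with MPI_MAX. The choice arrSize = 1024 ints (4 KB sent to each destination) is an assumption about how "data size" is defined here.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int arrSize = 1024;  /* assumed: 1024 ints = 4 KB per destination */
    int *message     = malloc((size_t)arrSize * nprocs * sizeof(int));
    int *recvMessage = malloc((size_t)arrSize * nprocs * sizeof(int));
    for (int i = 0; i < arrSize * nprocs; i++) message[i] = rank;

    double maxBarrier = 0.0, maxAlltoall = 0.0;
    for (int i = 0; i < 50; i++) {
        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        MPI_Alltoall(message, arrSize, MPI_INT,
                     recvMessage, arrSize, MPI_INT, MPI_COMM_WORLD);
        double t2 = MPI_Wtime();
        if (t1 - t0 > maxBarrier)  maxBarrier  = t1 - t0;
        if (t2 - t1 > maxAlltoall) maxAlltoall = t2 - t1;
    }

    /* Reduce the per-rank maxima to global maxima on rank 0. */
    double gBarrier, gAlltoall;
    MPI_Reduce(&maxBarrier,  &gBarrier,  1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&maxAlltoall, &gAlltoall, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Max. barrier time: %.3f ms, Max. Alltoall time: %.3f ms\n",
               gBarrier * 1e3, gAlltoall * 1e3);

    free(message);
    free(recvMessage);
    MPI_Finalize();
    return 0;
}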
Revision
Communication Graph Mapping
[Figure: a weighted task communication graph over tasks 0-5 (edge weights of 64, 256, and 512 bytes) mapped onto processors 0-5 arranged as a 2 x 3 grid]
Linear mapping: task i is placed on processor i
Q1: What are the communicating pairs?
Q2: Distance/hops between the communicating pairs?
Q3: Total hop-bytes? (a sketch for computing this follows below)
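Once the graph and the mapping are fixed, all three questions can be answered mechanically. A sketch that computes hops as the Manhattan distance on the 2 x 3 processor grid and sums hop-bytes over the communicating pairs for the linear mapping; the edge list below uses illustrative weights, not the exact edges from the slide's figure.

#include <stdio.h>
#include <stdlib.h>

#define ROWS 2
#define COLS 3

/* Manhattan distance between processors p and q on a ROWS x COLS grid. */
static int hops(int p, int q)
{
    int pr = p / COLS, pc = p % COLS;
    int qr = q / COLS, qc = q % COLS;
    return abs(pr - qr) + abs(pc - qc);
}

int main(void)
{
    /* Hypothetical communicating pairs: (task u, task v, bytes exchanged). */
    int edges[][3] = { {0, 1, 512}, {0, 3, 256}, {1, 4, 512},
                       {2, 3, 64},  {3, 5, 256}, {4, 5, 64} };
    int nedges = sizeof(edges) / sizeof(edges[0]);

    /* Linear mapping: task i -> processor i. */
    int map[6] = {0, 1, 2, 3, 4, 5};

    long hopBytes = 0;
    for (int e = 0; e < nedges; e++) {
        int h = hops(map[edges[e][0]], map[edges[e][1]]);
        printf("pair (%d,%d): %d hop(s), %d bytes\n",
               edges[e][0], edges[e][1], h, edges[e][2]);
        hopBytes += (long)h * edges[e][2];
    }
    printf("Total hop-bytes: %ld\n", hopBytes);
    return 0;
}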
Estimation Function
• f_est(t, p, M) = cost of placing a task t onto processor p under the current task mapping M
• Estimate how critical it is to place a task in the current cycle, and select the task with maximum criticality (a greedy sketch follows below)
• T_k is the set of tasks yet to be placed; its complement T̄_k is the set of already-placed tasks, so T_k ∩ T̄_k = Ø
• P_k is the set of processors still available; its complement P̄_k is the set of already-occupied processors, so P_k ∩ P̄_k = Ø
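A minimal sketch of the greedy placement loop described above. The functions criticality() and f_est() are hypothetical placeholders (their exact definitions are not given here), and the sketch assumes at most one task per processor; placed[] and used[] track T̄_k and P̄_k.

#include <stdlib.h>
#include <float.h>

double criticality(int t, const int *mapping);   /* assumed, defined elsewhere */
double f_est(int t, int p, const int *mapping);  /* assumed, defined elsewhere */

void greedy_map(int numTasks, int numProcs, int *mapping)
{
    int *placed = calloc(numTasks, sizeof(int));   /* 1 if task is in T̄_k */
    int *used   = calloc(numProcs, sizeof(int));   /* 1 if proc is in P̄_k */

    for (int cycle = 0; cycle < numTasks; cycle++) {
        /* Select the unplaced task with maximum criticality. */
        int bestTask = -1;
        double bestCrit = -DBL_MAX;
        for (int t = 0; t < numTasks; t++)
            if (!placed[t] && criticality(t, mapping) > bestCrit) {
                bestCrit = criticality(t, mapping);
                bestTask = t;
            }

        /* Place it on the available processor with minimum estimated cost. */
        int bestProc = -1;
        double bestCost = DBL_MAX;
        for (int p = 0; p < numProcs; p++)
            if (!used[p] && f_est(bestTask, p, mapping) < bestCost) {
                bestCost = f_est(bestTask, p, mapping);
                bestProc = p;
            }

        mapping[bestTask] = bestProc;
        placed[bestTask] = 1;
        used[bestProc] = 1;   /* assumes one task per processor */
    }
    free(placed);
    free(used);
}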
Graph Model for SpMV
• Computation load?
  • Similar for both processors
• Number of communications?
  • 8 as per this graph
  • Actually 6
Parallel I/O
Access Pattern
[Figure: interleaved access pattern; chunks of a common file alternate between P0 and P1]
Each process reads a small chunk of data from a common file (see the sketch below)
MPI_File_set_view(fh, displacement, etype, filetype, "native", info)
MPI_File_read_all(fh, data, datacount, MPI_INT, status)
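Filling in the two calls above, a sketch of the interleaved read: each rank describes its own chunks with a vector filetype and reads them with a collective call. The file name "input.dat" and the CHUNK/NCHUNKS values are assumptions.

#include <mpi.h>
#include <stdlib.h>

#define CHUNK   1024   /* ints per chunk               */
#define NCHUNKS 16     /* chunks owned by each process */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "input.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* Filetype: NCHUNKS blocks of CHUNK ints, strided by nprocs*CHUNK ints. */
    MPI_Datatype filetype;
    MPI_Type_vector(NCHUNKS, CHUNK, nprocs * CHUNK, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    /* Displacement (in bytes) skips the chunks owned by lower ranks. */
    MPI_Offset disp = (MPI_Offset)rank * CHUNK * sizeof(int);
    MPI_File_set_view(fh, disp, MPI_INT, filetype, "native", MPI_INFO_NULL);

    int *data = malloc((size_t)NCHUNKS * CHUNK * sizeof(int));
    MPI_File_read_all(fh, data, NCHUNKS * CHUNK, MPI_INT, MPI_STATUS_IGNORE);

    MPI_Type_free(&filetype);
    MPI_File_close(&fh);
    free(data);
    MPI_Finalize();
    return 0;
}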
Multiple Non-contiguous Accesses
[Figure: 2 x 2 block decomposition of a 2D array among P0-P3]
• Every process's local array is non-contiguous in the file
• Every process needs to make small I/O requests
• Can these requests be merged? (see the collective I/O sketch below)
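One standard way to let the MPI-IO layer merge the small requests is collective I/O with a subarray filetype: each process describes its whole non-contiguous block once, and the collective read lets the library aggregate the accesses. A sketch for the 2 x 2 decomposition in the figure (run with exactly 4 processes); N and the file name are assumptions.

#include <mpi.h>
#include <stdlib.h>

#define N 1024   /* assumed global N x N matrix of ints */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int dims[2]   = { 2, 2 };                        /* 2 x 2 process grid */
    int coords[2] = { rank / dims[1], rank % dims[1] };
    int gsizes[2] = { N, N };
    int lsizes[2] = { N / dims[0], N / dims[1] };
    int starts[2] = { coords[0] * lsizes[0], coords[1] * lsizes[1] };

    /* Describe this rank's (non-contiguous) block of the file. */
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "matrix.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

    int *local = malloc((size_t)lsizes[0] * lsizes[1] * sizeof(int));
    /* Collective read: the four ranks' requests can be merged internally. */
    MPI_File_read_all(fh, local, lsizes[0] * lsizes[1], MPI_INT,
                      MPI_STATUS_IGNORE);

    MPI_Type_free(&filetype);
    MPI_File_close(&fh);
    free(local);
    MPI_Finalize();
    return 0;
}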
Revision Q3: 3D domain decomposition

// initialize
for (int i=0; i<N; i++)
  for (int j=0; j<N; j++)
    for (int k=0; k<N; k++)
      data[i][j][k] = (rank+1) * (i+j+k);

int xStart=_____________, yStart=_____________, zStart=_____________;
int xEnd=_____________, yEnd=_____________, zEnd=_____________;
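One possible way to fill the blanks, assuming the P processes form a px x py x pz grid (px*py*pz = P, with N divisible by px, py, and pz) and ranks are numbered with the Z coordinate varying fastest. Both the grid shape and the rank-to-coordinate convention are assumptions, not the official answer.

/* Hypothetical helper: bounds of the sub-block owned by 'rank' on a
 * px x py x pz process grid, with rank = (cx*py + cy)*pz + cz. */
void blockBounds(int rank, int N, int px, int py, int pz,
                 int *xStart, int *xEnd,
                 int *yStart, int *yEnd,
                 int *zStart, int *zEnd)
{
    int cx = rank / (py * pz);   /* grid coordinate along X */
    int cy = (rank / pz) % py;   /* grid coordinate along Y */
    int cz = rank % pz;          /* grid coordinate along Z */

    *xStart = cx * (N / px);  *xEnd = *xStart + N / px;
    *yStart = cy * (N / py);  *yEnd = *yStart + N / py;
    *zStart = cz * (N / pz);  *zEnd = *zStart + N / pz;
}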
Revision Q4
A 3D matrix of size NxNxN was written to the file in the usual XYZ
memory order. P processes read this 3D matrix from a file using parallel
I/O following a 1D domain decomposition along Y-axis. Write an MPI
code snippet for this (you may ignore the obvious initializations and
finalizations). Assume that N is divisible by P.
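A sketch of one possible answer, assuming an int matrix stored in C (row-major XYZ) order and a hypothetical file name "matrix3d.dat": each rank reads its N x (N/P) x N slab along Y with a subarray filetype and a collective read.

#include <mpi.h>
#include <stdlib.h>

#define N 512   /* assumed matrix dimension, divisible by P */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* Each rank owns the Y-slab j in [rank*N/P, (rank+1)*N/P). */
    int gsizes[3] = { N, N,     N };
    int lsizes[3] = { N, N / P, N };
    int starts[3] = { 0, rank * (N / P), 0 };

    MPI_Datatype filetype;
    MPI_Type_create_subarray(3, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "matrix3d.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

    int *slab = malloc((size_t)N * (N / P) * N * sizeof(int));
    MPI_File_read_all(fh, slab, N * (N / P) * N, MPI_INT, MPI_STATUS_IGNORE);

    MPI_Type_free(&filetype);
    MPI_File_close(&fh);
    free(slab);
    MPI_Finalize();
    return 0;
}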
Revision Q5
A sequential program P consists of three parts A, B, C. Part B is not
parallelizable. Parts A and C are parallelizable. The sequential runtimes
are Sa, Sb, Sc for parts A, B, C respectively. Derive the speedup of P
on N processes, where the overhead to parallelize part A is Oa and the
overhead to parallelize part C is Oc.
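A sketch of the derivation, under the assumptions that parts A and C parallelize perfectly across the N processes and that the overheads Oa and Oc are one-time serial costs:

\begin{align*}
T_{\text{seq}}    &= S_a + S_b + S_c \\
T_{\text{par}}(N) &= \frac{S_a}{N} + O_a + S_b + \frac{S_c}{N} + O_c \\
\text{Speedup}(N) &= \frac{T_{\text{seq}}}{T_{\text{par}}(N)}
                   = \frac{S_a + S_b + S_c}{\frac{S_a + S_c}{N} + S_b + O_a + O_c}
\end{align*}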
Revision Q6