Profiling – III and Revision
Lecture 25
          April 17, 2024
 Profiling
for (int i = 0; i < 50; i++)
{
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Alltoall(message, arrSize, MPI_INT,
                 recvMessage, arrSize, MPI_INT, MPI_COMM_WORLD);
}
Profiling results (figure panels):
Alltoall - NPROCS=4, Data size = 4 KB: Max. barrier time 0.2 ms
Alltoall - NPROCS=4, Data size = 4 MB: Max. barrier time 1.2 ms; Max. Alltoall time 7.8 ms
Alltoall - NPROCS=16, Data size = 4 KB
Alltoall - NPROCS=16, Data size = 4 MB
Bcast - NPROCS=16, Data size = 4 KB
Bcast - NPROCS=16, Data size = 4 MB
Bcast - NPROCS=32, Data size = 4 KB
Bcast - NPROCS=32, Data size = 4 MB
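The per-call maxima above could be gathered with a harness along these lines: a minimal sketch that times the barrier and the Alltoall separately with MPI_Wtime and reduces the per-rank maxima with MPI_MAX. The choice arrSize = 1024 ints (4 KB sent to each destination) is an assumption about how "data size" is defined here.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int arrSize = 1024;  /* assumed: 1024 ints = 4 KB per destination */
    int *message     = malloc((size_t)arrSize * nprocs * sizeof(int));
    int *recvMessage = malloc((size_t)arrSize * nprocs * sizeof(int));
    for (int i = 0; i < arrSize * nprocs; i++) message[i] = rank;

    double maxBarrier = 0.0, maxAlltoall = 0.0;
    for (int i = 0; i < 50; i++) {
        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        MPI_Alltoall(message, arrSize, MPI_INT,
                     recvMessage, arrSize, MPI_INT, MPI_COMM_WORLD);
        double t2 = MPI_Wtime();
        if (t1 - t0 > maxBarrier)  maxBarrier  = t1 - t0;
        if (t2 - t1 > maxAlltoall) maxAlltoall = t2 - t1;
    }

    /* Reduce the per-rank maxima to global maxima on rank 0. */
    double gBarrier, gAlltoall;
    MPI_Reduce(&maxBarrier,  &gBarrier,  1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&maxAlltoall, &gAlltoall, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Max. barrier time: %.3f ms, Max. Alltoall time: %.3f ms\n",
               gBarrier * 1e3, gAlltoall * 1e3);

    free(message);
    free(recvMessage);
    MPI_Finalize();
    return 0;
}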
Revision
Communication Graph Mapping
[Figure: a weighted task communication graph over tasks 0-5 (edge weights of 64, 256, and 512 bytes) mapped onto processors 0-5 arranged as a 2 x 3 grid]
Linear mapping: task i is placed on processor i
Q1: What are the communicating pairs?
Q2: Distance/hops between the communicating pairs?
Q3: Total hop-bytes? (a sketch for computing this follows below)
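Once the graph and the mapping are fixed, all three questions can be answered mechanically. A sketch that computes hops as the Manhattan distance on the 2 x 3 processor grid and sums hop-bytes over the communicating pairs for the linear mapping; the edge list below uses illustrative weights, not the exact edges from the slide's figure.

#include <stdio.h>
#include <stdlib.h>

#define ROWS 2
#define COLS 3

/* Manhattan distance between processors p and q on a ROWS x COLS grid. */
static int hops(int p, int q)
{
    int pr = p / COLS, pc = p % COLS;
    int qr = q / COLS, qc = q % COLS;
    return abs(pr - qr) + abs(pc - qc);
}

int main(void)
{
    /* Hypothetical communicating pairs: (task u, task v, bytes exchanged). */
    int edges[][3] = { {0, 1, 512}, {0, 3, 256}, {1, 4, 512},
                       {2, 3, 64},  {3, 5, 256}, {4, 5, 64} };
    int nedges = sizeof(edges) / sizeof(edges[0]);

    /* Linear mapping: task i -> processor i. */
    int map[6] = {0, 1, 2, 3, 4, 5};

    long hopBytes = 0;
    for (int e = 0; e < nedges; e++) {
        int h = hops(map[edges[e][0]], map[edges[e][1]]);
        printf("pair (%d,%d): %d hop(s), %d bytes\n",
               edges[e][0], edges[e][1], h, edges[e][2]);
        hopBytes += (long)h * edges[e][2];
    }
    printf("Total hop-bytes: %ld\n", hopBytes);
    return 0;
}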
Estimation Function
• f_est(t, p, M) = cost of placing a task t onto processor p under the current task mapping M
• Estimate how critical it is to place a task in the current cycle, and select the task with maximum criticality (a greedy sketch follows below)
• T_k is the set of tasks yet to be placed; its complement T̄_k is the set of already-placed tasks, so T_k ∩ T̄_k = Ø
• P_k is the set of processors still available; its complement P̄_k is the set of already-occupied processors, so P_k ∩ P̄_k = Ø
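A minimal sketch of the greedy placement loop described above. The functions criticality() and f_est() are hypothetical placeholders (their exact definitions are not given here), and the sketch assumes at most one task per processor; placed[] and used[] track T̄_k and P̄_k.

#include <stdlib.h>
#include <float.h>

double criticality(int t, const int *mapping);   /* assumed, defined elsewhere */
double f_est(int t, int p, const int *mapping);  /* assumed, defined elsewhere */

void greedy_map(int numTasks, int numProcs, int *mapping)
{
    int *placed = calloc(numTasks, sizeof(int));   /* 1 if task is in T̄_k */
    int *used   = calloc(numProcs, sizeof(int));   /* 1 if proc is in P̄_k */

    for (int cycle = 0; cycle < numTasks; cycle++) {
        /* Select the unplaced task with maximum criticality. */
        int bestTask = -1;
        double bestCrit = -DBL_MAX;
        for (int t = 0; t < numTasks; t++)
            if (!placed[t] && criticality(t, mapping) > bestCrit) {
                bestCrit = criticality(t, mapping);
                bestTask = t;
            }

        /* Place it on the available processor with minimum estimated cost. */
        int bestProc = -1;
        double bestCost = DBL_MAX;
        for (int p = 0; p < numProcs; p++)
            if (!used[p] && f_est(bestTask, p, mapping) < bestCost) {
                bestCost = f_est(bestTask, p, mapping);
                bestProc = p;
            }

        mapping[bestTask] = bestProc;
        placed[bestTask] = 1;
        used[bestProc] = 1;   /* assumes one task per processor */
    }
    free(placed);
    free(used);
}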
Graph Model for SpMV
• Computation load?
  • Similar for both processors
• Number of communications?
  • 8 as per this graph
  • Actually 6
Parallel I/O
Access Pattern
[Figure: interleaved access pattern; chunks of a common file alternate between P0 and P1]
Each process reads a small chunk of data from a common file (see the sketch below)
MPI_File_set_view(fh, displacement, etype, filetype, "native", info)
MPI_File_read_all(fh, data, datacount, MPI_INT, status)
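Filling in the two calls above, a sketch of the interleaved read: each rank describes its own chunks with a vector filetype and reads them with a collective call. The file name "input.dat" and the CHUNK/NCHUNKS values are assumptions.

#include <mpi.h>
#include <stdlib.h>

#define CHUNK   1024   /* ints per chunk               */
#define NCHUNKS 16     /* chunks owned by each process */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "input.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* Filetype: NCHUNKS blocks of CHUNK ints, strided by nprocs*CHUNK ints. */
    MPI_Datatype filetype;
    MPI_Type_vector(NCHUNKS, CHUNK, nprocs * CHUNK, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    /* Displacement (in bytes) skips the chunks owned by lower ranks. */
    MPI_Offset disp = (MPI_Offset)rank * CHUNK * sizeof(int);
    MPI_File_set_view(fh, disp, MPI_INT, filetype, "native", MPI_INFO_NULL);

    int *data = malloc((size_t)NCHUNKS * CHUNK * sizeof(int));
    MPI_File_read_all(fh, data, NCHUNKS * CHUNK, MPI_INT, MPI_STATUS_IGNORE);

    MPI_Type_free(&filetype);
    MPI_File_close(&fh);
    free(data);
    MPI_Finalize();
    return 0;
}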
Multiple Non-contiguous Accesses
[Figure: 2 x 2 block decomposition of a 2D array among P0-P3]
• Every process's local array is non-contiguous in the file
• Every process needs to make small I/O requests
• Can these requests be merged? (see the collective I/O sketch below)
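One standard way to let the MPI-IO layer merge the small requests is collective I/O with a subarray filetype: each process describes its whole non-contiguous block once, and the collective read lets the library aggregate the accesses. A sketch for the 2 x 2 decomposition in the figure (run with exactly 4 processes); N and the file name are assumptions.

#include <mpi.h>
#include <stdlib.h>

#define N 1024   /* assumed global N x N matrix of ints */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int dims[2]   = { 2, 2 };                        /* 2 x 2 process grid */
    int coords[2] = { rank / dims[1], rank % dims[1] };
    int gsizes[2] = { N, N };
    int lsizes[2] = { N / dims[0], N / dims[1] };
    int starts[2] = { coords[0] * lsizes[0], coords[1] * lsizes[1] };

    /* Describe this rank's (non-contiguous) block of the file. */
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "matrix.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

    int *local = malloc((size_t)lsizes[0] * lsizes[1] * sizeof(int));
    /* Collective read: the four ranks' requests can be merged internally. */
    MPI_File_read_all(fh, local, lsizes[0] * lsizes[1], MPI_INT,
                      MPI_STATUS_IGNORE);

    MPI_Type_free(&filetype);
    MPI_File_close(&fh);
    free(local);
    MPI_Finalize();
    return 0;
}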
Revision Q3: 3D domain decomposition

// initialize
for (int i=0; i<N; i++)
  for (int j=0; j<N; j++)
    for (int k=0; k<N; k++)
      data[i][j][k] = (rank+1) * (i+j+k);

int xStart=_____________, yStart=_____________, zStart=_____________;
int xEnd=_____________, yEnd=_____________, zEnd=_____________;
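One possible way to fill the blanks, assuming the P processes form a px x py x pz grid (px*py*pz = P, with N divisible by px, py, and pz) and ranks are numbered with the Z coordinate varying fastest. Both the grid shape and the rank-to-coordinate convention are assumptions, not the official answer.

/* Hypothetical helper: bounds of the sub-block owned by 'rank' on a
 * px x py x pz process grid, with rank = (cx*py + cy)*pz + cz. */
void blockBounds(int rank, int N, int px, int py, int pz,
                 int *xStart, int *xEnd,
                 int *yStart, int *yEnd,
                 int *zStart, int *zEnd)
{
    int cx = rank / (py * pz);   /* grid coordinate along X */
    int cy = (rank / pz) % py;   /* grid coordinate along Y */
    int cz = rank % pz;          /* grid coordinate along Z */

    *xStart = cx * (N / px);  *xEnd = *xStart + N / px;
    *yStart = cy * (N / py);  *yEnd = *yStart + N / py;
    *zStart = cz * (N / pz);  *zEnd = *zStart + N / pz;
}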
Revision Q4
A 3D matrix of size NxNxN was written to the file in the usual XYZ
memory order. P processes read this 3D matrix from a file using parallel
I/O following a 1D domain decomposition along Y-axis. Write an MPI
code snippet for this (you may ignore the obvious initializations and
finalizations). Assume that N is divisible by P.
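A sketch of one possible answer, assuming an int matrix stored in C (row-major XYZ) order and a hypothetical file name "matrix3d.dat": each rank reads its N x (N/P) x N slab along Y with a subarray filetype and a collective read.

#include <mpi.h>
#include <stdlib.h>

#define N 512   /* assumed matrix dimension, divisible by P */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* Each rank owns the Y-slab j in [rank*N/P, (rank+1)*N/P). */
    int gsizes[3] = { N, N,     N };
    int lsizes[3] = { N, N / P, N };
    int starts[3] = { 0, rank * (N / P), 0 };

    MPI_Datatype filetype;
    MPI_Type_create_subarray(3, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "matrix3d.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

    int *slab = malloc((size_t)N * (N / P) * N * sizeof(int));
    MPI_File_read_all(fh, slab, N * (N / P) * N, MPI_INT, MPI_STATUS_IGNORE);

    MPI_Type_free(&filetype);
    MPI_File_close(&fh);
    free(slab);
    MPI_Finalize();
    return 0;
}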
Revision Q5
A sequential program P consists of three parts A, B, C. Part B is not
parallelizable. Parts A and C are parallelizable. The sequential runtimes
are Sa, Sb, Sc for parts A, B, C respectively. Derive the speedup of P
on N processes, where the overhead to parallelize part A is Oa and the
overhead to parallelize part C is Oc.
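A sketch of the derivation, under the assumptions that parts A and C parallelize perfectly across the N processes and that the overheads Oa and Oc are one-time serial costs:

\begin{align*}
T_{\text{seq}}    &= S_a + S_b + S_c \\
T_{\text{par}}(N) &= \frac{S_a}{N} + O_a + S_b + \frac{S_c}{N} + O_c \\
\text{Speedup}(N) &= \frac{T_{\text{seq}}}{T_{\text{par}}(N)}
                   = \frac{S_a + S_b + S_c}{\frac{S_a + S_c}{N} + S_b + O_a + O_c}
\end{align*}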
Revision Q6