2012 Summer School on Concurrency
August 22–29, 2012 | St. Petersburg, Russia
Introduction to Parallel Programming
Section 3.
Parallel Methods for Matrix Multiplication
Victor Gergel, Professor, D.Sc.
Lobachevsky State University of Nizhni Novgorod (UNN)
Contents
Problem Statement
Sequential Algorithm
Algorithm 1 – Block-Striped Decomposition
Algorithm 2 – Fox’s method
Algorithm 3 – Cannon’s method
Summary
Problem Statement
Matrix multiplication:
$C = A \cdot B$, or

$$
\begin{pmatrix} c_{0,0} & c_{0,1} & \cdots & c_{0,l-1} \\ \cdots & \cdots & \cdots & \cdots \\ c_{m-1,0} & c_{m-1,1} & \cdots & c_{m-1,l-1} \end{pmatrix}
=
\begin{pmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,n-1} \\ \cdots & \cdots & \cdots & \cdots \\ a_{m-1,0} & a_{m-1,1} & \cdots & a_{m-1,n-1} \end{pmatrix}
\cdot
\begin{pmatrix} b_{0,0} & b_{0,1} & \cdots & b_{0,l-1} \\ \cdots & \cdots & \cdots & \cdots \\ b_{n-1,0} & b_{n-1,1} & \cdots & b_{n-1,l-1} \end{pmatrix}
$$
The matrix multiplication problem is thus reduced to the execution of m·l independent operations of computing the inner products of the rows of matrix A and the columns of matrix B:

$$c_{ij} = (a_i, b_j^T) = \sum_{k=0}^{n-1} a_{ik} b_{kj}, \qquad 0 \le i < m, \; 0 \le j < l$$
Data parallelism can be exploited to design
parallel computations
Sequential Algorithm…
// Algorithm 8.1
// Sequential algorithm of matrix multiplication
double MatrixA[Size][Size];
double MatrixB[Size][Size];
double MatrixC[Size][Size];
int i,j,k;
...
for (i = 0; i < Size; i++) {
  for (j = 0; j < Size; j++) {
    MatrixC[i][j] = 0;
    for (k = 0; k < Size; k++) {
      MatrixC[i][j] = MatrixC[i][j] + MatrixA[i][k] * MatrixB[k][j];
    }
  }
}
Sequential Algorithm
The algorithm computes the rows of matrix C sequentially: at every iteration of the outer loop over i, a single row of matrix A is multiplied by all columns of matrix B:
[Figure: one row of A is multiplied by all columns of B, producing the corresponding row of C]
In total, m·l inner products are calculated to perform the matrix multiplication; the complexity of the algorithm is O(mnl).
Algorithm 1: Block-Striped Decomposition…
A fine-grained approach: the basic subtask is the calculation of one element of matrix C:

$$c_{ij} = (a_i, b_j^T), \quad a_i = (a_{i0}, a_{i1}, \ldots, a_{i,n-1}), \quad b_j^T = (b_{0j}, b_{1j}, \ldots, b_{n-1,j})^T$$

The number of basic subtasks is n², so the achieved level of parallelism is redundant: as a rule, the number of available processors is less than n² (p < n²), and it will be necessary to scale (aggregate) the subtasks.
Algorithm 1: Block-Striped Decomposition…
The aggregated subtask is the calculation of one row of matrix C (the number of subtasks is n).
Data distribution: rowwise block-striped decomposition for matrix A and columnwise block-striped decomposition for matrix B.
[Figure: matrix A divided into horizontal stripes, matrix B into vertical stripes]
Algorithm 1: Block-Striped Decomposition…
Analysis of Information Dependencies…
– Each subtask holds one row of matrix A and one column of matrix B,
– At every iteration each subtask computes the inner product of its row and column, obtaining the corresponding element of matrix C,
– Then every subtask i, 0 ≤ i < n, transmits its column of matrix B to the subtask with the number (i+1) mod n.
After all iterations of the algorithm, every column of matrix B has thus visited each subtask in turn.
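A minimal MPI sketch of one iteration of this ring exchange (an illustration, not the course code; the one-row/one-column layout and all the names below are assumptions):

// Sketch: each of the `size` processes owns one row of A (aRow) and,
// at any given moment, one column of B (bCol); cRow accumulates the
// process's row of C. All vectors have length n, and n == size here.
#include <mpi.h>

void RingIteration(const double *aRow, double *bCol, double *cRow,
                   int n, int rank, int size, int iter) {
    // Column of B currently held: it started at process `rank` and has
    // been shifted `iter` times toward higher ranks.
    int j = (rank - iter + size) % size;
    double sum = 0.0;                 // inner product of row and column
    for (int k = 0; k < n; k++)
        sum += aRow[k] * bCol[k];
    cRow[j] = sum;
    // Pass the column to the next process in the ring and receive the
    // column of the previous one, in place.
    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;
    MPI_Sendrecv_replace(bCol, n, MPI_DOUBLE, next, 0, prev, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}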
Algorithm 1: Block-Striped Decomposition…
Scheme of Information Dependencies
[Figure: four algorithm iterations; at every iteration each subtask multiplies its row of A by the column of B it currently holds, after which the columns of B are shifted along the ring]
Algorithm 1: Block-Striped Decomposition…
Aggregating and Distributing the Subtasks among the
Processors:
– In the case when the number of processors p is less than the number of basic subtasks n, the calculations can be aggregated so that each processor executes several inner products of matrix A rows and matrix B columns; after the computations complete, each aggregated subtask determines several rows of the result matrix C,
– Under these conditions the initial matrix A is decomposed into p horizontal stripes and matrix B is decomposed into p vertical stripes,
– The distribution of subtasks among the processors has to preserve the ring structure of the subtask information dependencies; a small sketch of computing the stripe boundaries follows below.
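For instance, when p does not divide n evenly the stripes can be made as equal as possible (a hypothetical helper, not from the course sources):

// Computes the first row and the row count of the stripe owned by
// process `rank`: the first n % p stripes get one extra row.
void StripeBounds(int n, int p, int rank, int *first, int *count) {
    int base = n / p, rem = n % p;
    *count = base + (rank < rem ? 1 : 0);
    *first = rank * base + (rank < rem ? rank : rem);
}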
Algorithm 1: Block-Striped Decomposition…
Efficiency Analysis…
– Speed-up and efficiency generalized estimates:

$$S_p = \frac{n^3}{n^3/p} = p, \qquad E_p = \frac{n^3}{p \cdot (n^3/p)} = 1$$

The developed method of parallel computations makes it possible to achieve ideal speed-up and efficiency characteristics.
Algorithm 1: Block-Striped Decomposition…
Efficiency Analysis (detailed estimates):
– Time of the parallel algorithm execution that corresponds to the processor calculations:

$$T_p(calc) = (n^2/p) \cdot (2n - 1) \cdot \tau$$

– Time of the data transmission operations can be obtained by means of the Hockney model (here τ is the execution time of a basic computational operation, α the latency and β the bandwidth of the network, and w the size of a matrix element in bytes):

$$T_p(comm) = (p - 1) \cdot (\alpha + w \cdot n \cdot (n/p) / \beta)$$

(we assume that all the data transmission operations between the processors during one iteration can be executed in parallel).
The total time of the parallel algorithm execution is:

$$T_p = (n^2/p) \cdot (2n - 1) \cdot \tau + (p - 1) \cdot (\alpha + w \cdot n \cdot (n/p) / \beta)$$
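As an illustration, this estimate can be evaluated numerically; a minimal sketch in C (the function name and the parameter values in main are assumptions chosen only as an example, not measured machine constants):

#include <stdio.h>

// Estimated execution time of Algorithm 1 under the Hockney model,
// following the formula above: tau - time of one basic operation,
// alpha - latency (s), beta - bandwidth (bytes/s), w - element size.
double StripedModelTime(double n, double p, double tau,
                        double alpha, double beta, double w) {
    double calc = (n * n / p) * (2.0 * n - 1.0) * tau;
    double comm = (p - 1.0) * (alpha + w * n * (n / p) / beta);
    return calc + comm;
}

int main(void) {
    // Illustrative values: n = 1000, p = 4, tau = 6 ns,
    // alpha = 50 us, beta = 100 MB/s, w = 8 bytes.
    double t = StripedModelTime(1000, 4, 6e-9, 5e-5, 1e8, 8);
    printf("Estimated time: %.4f s\n", t);
    return 0;
}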
Algorithm 1: Block-Striped Decomposition…
Results of computational experiments…
– Comparison of theoretical estimations and results of computational
experiments
Matrix     2 processors            4 processors            8 processors
 Size      Model      Experiment   Model      Experiment   Model      Experiment
  500        0.8243     0.3758       0.4313     0.1535       0.2353     0.0968
 1000        6.51822    5.4427       3.3349     2.2628       1.7436     0.6998
 1500       21.9137    20.9503      11.1270    11.0804       5.7340     5.1766
 2000       51.8429    45.7436      26.2236    21.6001      13.4144     9.4127
 2500      101.1377    99.5097      51.0408    56.9203      25.9928    18.3303
 3000      174.6301   171.9232      87.9946   111.9642      44.6772    45.5482
[Figure: execution time (model vs. experiment) as a function of matrix size]
Algorithm 1: Block-Striped Decomposition…
Results of computational experiments:
– Speedup
Matrix     Serial       2 processors          4 processors          8 processors
 Size      Algorithm    Time      Speed Up    Time      Speed Up    Time      Speed Up
  500        0.8752      0.3758    2.3287      0.1535    5.6982      0.0968    9.0371
 1000       12.8787      5.4427    2.3662      2.2628    5.6912      0.6998   18.4014
 1500       43.4731     20.9503    2.0750     11.0804    3.9234      5.1766    8.3978
 2000      103.0561     45.7436    2.2529     21.6001    4.7710      9.4127   10.9485
 2500      201.2915     99.5097    2.0228     56.9203    3.5363     18.3303   10.9813
 3000      347.8434    171.9232    2.0232    111.9642    3.1067     45.5482    7.6368
[Figure: speedup of Algorithm 1 vs. number of processors (2, 4, 8) for matrix sizes 500–3000]
Algorithm 1': Block-Striped Decomposition…
Another possible approach to data distribution is the rowwise block-striped decomposition for both matrices A and B.
[Figure: matrices A and B both divided into horizontal stripes]
Algorithm 1': Block-Striped Decomposition…
Analysis of Information Dependencies
– Each subtask holds one row of matrix A and one row of matrix B,
– At every iteration the subtasks perform element-wise multiplication of their rows; as a result, a row of partial results for matrix C is obtained,
– Then every subtask i, 0 ≤ i < n, transmits its row of matrix B to the subtask with the number (i+1) mod n.
After all iterations of the algorithm, every row of matrix B has thus visited each subtask in turn.
Algorithm 1': Block-Striped Decomposition
Scheme of Information Dependencies
[Figure: four algorithm iterations; at every iteration each subtask multiplies its row of A element-wise by the row of B it currently holds, after which the rows of B are shifted along the ring]
Algorithm 2: Fox’s method…
Data distribution – checkerboard scheme
[Figure: matrices A, B, and C divided into blocks (checkerboard decomposition)]
The basic subtask is a procedure that calculates all elements of one block of matrix C:

$$A = \begin{pmatrix} A_{00} & \cdots & A_{0,q-1} \\ \cdots & \cdots & \cdots \\ A_{q-1,0} & \cdots & A_{q-1,q-1} \end{pmatrix}, \quad B = \begin{pmatrix} B_{00} & \cdots & B_{0,q-1} \\ \cdots & \cdots & \cdots \\ B_{q-1,0} & \cdots & B_{q-1,q-1} \end{pmatrix}, \quad C = \begin{pmatrix} C_{00} & \cdots & C_{0,q-1} \\ \cdots & \cdots & \cdots \\ C_{q-1,0} & \cdots & C_{q-1,q-1} \end{pmatrix}, \quad C_{ij} = \sum_{s=0}^{q-1} A_{is} B_{sj}$$
Algorithm 2: Fox’s method…
Analysis of Information Dependencies
– The subtask with the number (i,j) calculates the block Cij of the result matrix C; thus, the subtasks form a q×q two-dimensional grid,
– Each subtask holds four matrix blocks:
 • the block Cij of the result matrix C, which is calculated by the subtask,
 • the block Aij of matrix A, which was placed in the subtask before the calculations started,
 • the blocks Aij' and Bij' of matrices A and B, which are received by the subtask during the calculations.
Algorithm 2: Fox’s method…
Analysis of information dependencies – during iteration l, 0 ≤ l < q, the algorithm performs:
– The subtask (i,j) transmits its block Aij of matrix A to all subtasks of the same horizontal row i of the grid; the index j that determines the position of the broadcasting subtask in the row is obtained as
 j = (i + l) mod q,
 where mod denotes the remainder of integer division,
– Every subtask multiplies the received blocks Aij' and Bij' and adds the result to the block Cij:

$$C_{ij} = C_{ij} + A_{ij}' \cdot B_{ij}'$$

– Every subtask (i,j) transmits its block Bij' to the neighbor which precedes it in the same vertical line (the blocks of the subtasks of the first row are transmitted to the subtasks of the last row of the grid).
Algorithm 2: Fox’s method…
Scheme of information dependencies (example for a 2×2 block grid):
– First iteration: A0,0 and A1,1 are broadcast along their grid rows; the subtasks compute C0,0 = A0,0·B0,0, C0,1 = A0,0·B0,1, C1,0 = A1,1·B1,0, C1,1 = A1,1·B1,1, after which the B blocks are shifted along the columns,
– Second iteration: A0,1 and A1,0 are broadcast along their grid rows; the subtasks accumulate C0,0 = A0,0·B0,0 + A0,1·B1,0, C0,1 = A0,0·B0,1 + A0,1·B1,1, C1,0 = A1,1·B1,0 + A1,0·B0,0, C1,1 = A1,1·B1,1 + A1,0·B0,1.
Algorithm 2: Fox’s method…
Scaling and Distributing the Subtasks among the Processors
– The sizes of the matrix blocks can be selected so that the number of subtasks coincides with the number of available processors p,
– The most efficient execution of the parallel Fox algorithm is achieved when the communication network topology is a two-dimensional grid,
– In this case the subtasks can be distributed among the processors in a natural way: the subtask (i,j) is placed on the processor pi,j
Algorithm 2: Fox’s method…
Efficiency Analysis…
– Speed-up and efficiency generalized estimates:

$$S_p = \frac{n^3}{n^3/p} = p, \qquad E_p = \frac{n^3}{p \cdot (n^3/p)} = 1$$

The developed method of parallel computations makes it possible to achieve ideal speed-up and efficiency characteristics.
Algorithm 2: Fox’s method
Efficiency Analysis (detailed estimates):
– Time of the parallel algorithm execution that corresponds to the processor calculations:

$$T_p(calc) = q \cdot [(n^2/p) \cdot (2n/q - 1) + (n^2/p)] \cdot \tau$$

– At every iteration one of the processors in each row of the processor grid broadcasts its block of matrix A to the other processors of the row:

$$T_p^1(comm) = \lceil \log_2 q \rceil \cdot (\alpha + w \cdot (n^2/p) / \beta)$$

– After the matrix block multiplication each processor transmits its block of matrix B to its neighbor in the vertical column:

$$T_p^2(comm) = \alpha + w \cdot (n^2/p) / \beta$$

The total time of the parallel algorithm execution is:

$$T_p = q \cdot [(n^2/p)(2n/q - 1) + (n^2/p)] \cdot \tau + (q \lceil \log_2 q \rceil + q - 1) \cdot (\alpha + w (n^2/p) / \beta)$$
Algorithm 2: Fox’s method…
Description of the parallel program sample…
– The main function
• implements the computational scheme of the method by sequentially calling the necessary subprograms
Algorithm 2: Fox’s method…
Description of the parallel program sample…
– The function CreateGridCommunicators:
• creates a communicator as a two-dimensional square grid,
• determines the coordinates of each process in the grid,
• creates communicators for each row and each column separately.
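A sketch of how CreateGridCommunicators could look using the MPI Cartesian topology routines (the names GridComm, ColComm, GridSize and GridCoords, and keeping them as file-scope variables, are assumptions of this sketch; RowComm is named on the slides):

#include <mpi.h>
#include <math.h>

static int GridSize;                 // q - order of the processor grid
static int GridCoords[2];            // coordinates of this process
static MPI_Comm GridComm, RowComm, ColComm;

void CreateGridCommunicators(void) {
    int ProcNum, rank, dims[2], periodic[2], subdims[2];
    MPI_Comm_size(MPI_COMM_WORLD, &ProcNum);
    GridSize = (int)sqrt((double)ProcNum);   // assumes ProcNum == q*q
    dims[0] = dims[1] = GridSize;
    periodic[0] = periodic[1] = 1;           // wraparound for cyclic shifts
    // Two-dimensional periodic grid of processes
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periodic, 1, &GridComm);
    MPI_Comm_rank(GridComm, &rank);
    MPI_Cart_coords(GridComm, rank, 2, GridCoords);
    // Communicator for each grid row (the column coordinate varies)
    subdims[0] = 0; subdims[1] = 1;
    MPI_Cart_sub(GridComm, subdims, &RowComm);
    // Communicator for each grid column (the row coordinate varies)
    subdims[0] = 1; subdims[1] = 0;
    MPI_Cart_sub(GridComm, subdims, &ColComm);
}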
Algorithm 2: Fox’s method…
Description of the parallel program sample…
– The function ProcessInitialization:
• sets the matrix sizes,
• allocates memory for storing the initial matrices and their blocks,
• initializes all the original problem data (the functions DummyDataInitialization and RandomDataInitialization are used to determine the elements of the initial matrices)
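A possible sketch of ProcessInitialization, reduced to allocating the per-process blocks (the argument list and the extra buffer pMatrixAblock, which keeps the block of A assigned to the process initially, are assumptions; the course sample also allocates the full matrices and fills them as described above):

#include <stdlib.h>

void ProcessInitialization(double **pAblock, double **pBblock,
                           double **pCblock, double **pMatrixAblock,
                           int Size, int GridSize, int *BlockSize) {
    *BlockSize = Size / GridSize;          // assumes Size % GridSize == 0
    int n = (*BlockSize) * (*BlockSize);
    *pAblock = malloc(n * sizeof(double));        // A block of the current iteration
    *pBblock = malloc(n * sizeof(double));        // B block held by the process
    *pMatrixAblock = malloc(n * sizeof(double));  // A block assigned initially
    *pCblock = calloc(n, sizeof(double));         // C block, zero-initialized
}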
Algorithm 2: Fox’s method…
Description of the parallel program sample…
– The function ParallelResultCalculation:
• executes the parallel Fox algorithm of matrix multiplication (the matrix blocks and their sizes must be passed to the function as arguments)
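A sketch of ParallelResultCalculation under the same assumptions (GridSize comes from the grid-creation sketch above; the helper functions are sketched after their descriptions below):

// Prototypes of the helper functions sketched below:
void AblockCommunication(int iter, double *pAblock,
                         double *pMatrixAblock, int BlockSize);
void BlockMultiplication(double *pAblock, double *pBblock,
                         double *pCblock, int BlockSize);
void BblockCommunication(double *pBblock, int BlockSize);

// q iterations of Fox's scheme.
void ParallelResultCalculation(double *pAblock, double *pMatrixAblock,
                               double *pBblock, double *pCblock,
                               int BlockSize) {
    for (int iter = 0; iter < GridSize; iter++) {
        // 1) broadcast the pivot block of A along each grid row
        AblockCommunication(iter, pAblock, pMatrixAblock, BlockSize);
        // 2) multiply the received A block by the local B block
        BlockMultiplication(pAblock, pBblock, pCblock, BlockSize);
        // 3) cyclic shift of the B blocks along the grid columns
        BblockCommunication(pBblock, BlockSize);
    }
}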
Algorithm 2: Fox’s method…
Description of the parallel program sample…
– Iteration execution: transmission of the matrix A blocks to the processors in the same row of the processor grid (the function AblockCommunication):
• The number of the sending processor, Pivot, is determined, and then the data transmission is performed,
• The block pAblock of matrix A, which was placed on the processor before the calculations started, is used for the transmission,
• The block transmission is performed by means of the function MPI_Bcast (the communicator RowComm is used).
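A sketch of AblockCommunication (GridCoords, GridSize and RowComm as in the grid-creation sketch above; pMatrixAblock, the buffer that keeps the block placed on the processor initially, is an assumption of this sketch):

#include <mpi.h>
#include <string.h>

void AblockCommunication(int iter, double *pAblock,
                         double *pMatrixAblock, int BlockSize) {
    // Pivot column for grid row i at iteration iter: (i + iter) mod q
    int Pivot = (GridCoords[0] + iter) % GridSize;
    int BlockCount = BlockSize * BlockSize;
    // The pivot process broadcasts its initial block of A along the row
    // (its rank in RowComm equals its column coordinate).
    if (GridCoords[1] == Pivot)
        memcpy(pAblock, pMatrixAblock, BlockCount * sizeof(double));
    MPI_Bcast(pAblock, BlockCount, MPI_DOUBLE, Pivot, RowComm);
}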
Algorithm 2: Fox’s method…
Description of the parallel program sample…
– Iteration execution: Matrix block multiplication is executed
(the function BlockMultiplication):
• This function multiplies the block of matrix A (pAblock) and the
block of matrix B (pBblock) and adds the result to the block of
matrix C
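A sketch of BlockMultiplication — an ordinary serial multiply-accumulate of two BlockSize×BlockSize blocks stored in row-major order:

void BlockMultiplication(double *pAblock, double *pBblock,
                         double *pCblock, int BlockSize) {
    // C block += A block * B block
    for (int i = 0; i < BlockSize; i++)
        for (int j = 0; j < BlockSize; j++) {
            double sum = 0.0;
            for (int k = 0; k < BlockSize; k++)
                sum += pAblock[i * BlockSize + k] * pBblock[k * BlockSize + j];
            pCblock[i * BlockSize + j] += sum;
        }
}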
Algorithm 2: Fox’s method…
Description of the parallel program sample
– Iteration execution: cyclic shift of the matrix B blocks along the columns of the processor grid (the function BblockCommunication):
• Every processor transmits its block to the processor with the number NextProc in the grid column,
• Every processor receives the block that was sent by the processor with the number PrevProc in the same grid column,
• To perform these actions the function MPI_Sendrecv_replace is used, which provides all the necessary block transmissions using only a single memory buffer, pBblock.
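A sketch of BblockCommunication (ColComm as in the grid-creation sketch above; a process's rank in ColComm equals its row coordinate). Following the slides, NextProc names the destination of the shift and PrevProc the source; the shift moves the blocks one position upward, the block of the first row going to the last row:

#include <mpi.h>

void BblockCommunication(double *pBblock, int BlockSize) {
    int NextProc = GridCoords[0] - 1;          // destination: the row above
    if (GridCoords[0] == 0) NextProc = GridSize - 1;
    int PrevProc = GridCoords[0] + 1;          // source: the row below
    if (GridCoords[0] == GridSize - 1) PrevProc = 0;
    // One buffer suffices: send pBblock and receive into it in place
    MPI_Sendrecv_replace(pBblock, BlockSize * BlockSize, MPI_DOUBLE,
                         NextProc, 0, PrevProc, 0, ColComm,
                         MPI_STATUS_IGNORE);
}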
Algorithm 2: Fox’s method…
Results of computational experiments…
– Comparison of theoretical estimations and results of
computational experiments
Matrix     4 processors            9 processors
 Size      Model      Experiment   Model      Experiment
  500        0.4217     0.2190       0.2200     0.1468
 1000        3.2970     3.0910       1.5924     2.1565
 1500       11.0419    10.8678       5.1920     7.2502
 2000       26.0726    24.1421      12.0927    21.4157
 2500       50.8049    51.4735      23.3682    41.2159
 3000       87.6548    87.0538      40.0923    58.2022

[Figure: execution time (model vs. experiment) as a function of matrix size]
Algorithm 2: Fox’s method
Results of computational experiments:
– Speedup
Matrix     Serial       4 processors          9 processors
 Size      Algorithm    Time      Speed Up    Time      Speed Up
  500        0.8527      0.2190    3.8925      0.1468    5.8079
 1000       12.8787      3.0910    4.1664      2.1565    5.9719
 1500       43.4731     10.8678    4.0001      7.2502    5.9960
 2000      103.0561     24.1421    4.2687     21.4157    4.8121
 2500      201.2915     51.4735    3.9105     41.2159    4.8838
 3000      347.8434     87.0538    3.9957     58.2022    5.9764
[Figure: speedup of Fox's algorithm vs. number of processors (4, 9) for matrix sizes 500–3000]
Algorithm 3: Cannon’s Method…
Data distribution – Checkerboard scheme
[Figure: matrices A, B, and C divided into blocks (checkerboard decomposition)]
The basic subtask is a procedure that calculates all elements of one block of matrix C:

$$A = \begin{pmatrix} A_{00} & \cdots & A_{0,q-1} \\ \cdots & \cdots & \cdots \\ A_{q-1,0} & \cdots & A_{q-1,q-1} \end{pmatrix}, \quad B = \begin{pmatrix} B_{00} & \cdots & B_{0,q-1} \\ \cdots & \cdots & \cdots \\ B_{q-1,0} & \cdots & B_{q-1,q-1} \end{pmatrix}, \quad C = \begin{pmatrix} C_{00} & \cdots & C_{0,q-1} \\ \cdots & \cdots & \cdots \\ C_{q-1,0} & \cdots & C_{q-1,q-1} \end{pmatrix}, \quad C_{ij} = \sum_{s=0}^{q-1} A_{is} B_{sj}$$
Algorithm 3: Cannon’s Method…
Analysis of Information Dependencies:
– The subtask with the number (i,j) calculates the block Cij of the result matrix C; thus, the subtasks form a q×q two-dimensional grid,
– The initial distribution of matrix blocks in Cannon’s algorithm is selected in such a way that the first block multiplication can be performed without additional data transmission:
 • At the beginning each subtask (i,j) holds the blocks Aij and Bij,
 • For the i-th row of the subtask grid the matrix A blocks are shifted i positions to the left,
 • For the j-th column of the subtask grid the matrix B blocks are shifted j positions upward
 (rows and columns of the grid are numbered from zero),
– These data transmission operations are an example of circular shift communication
Algorithm 3: Cannon’s Method…
Redistribution of matrix blocks on the first stage of the
algorithm
Initial block distribution (all Cij = 0):
 A0,0 B0,0   A0,1 B0,1   A0,2 B0,2
 A1,0 B1,0   A1,1 B1,1   A1,2 B1,2
 A2,0 B2,0   A2,1 B2,1   A2,2 B2,2
After the initial skew (all Cij = 0):
 A0,0 B0,0   A0,1 B1,1   A0,2 B2,2
 A1,1 B1,0   A1,2 B2,1   A1,0 B0,2
 A2,2 B2,0   A2,0 B0,1   A2,1 B1,2
Algorithm 3: Cannon’s Method…
Analysis of Information Dependencies:
– After the redistribution performed at the first stage, the matrix blocks can be multiplied without additional data transmission operations,
– To obtain all the remaining blocks, after each block multiplication (a sketch of both shift stages follows below):
 • the matrix A blocks are shifted one position to the left along the grid row,
 • the matrix B blocks are shifted one position upward along the grid column.
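A sketch of both shift stages of Cannon's algorithm, assuming the periodic q×q grid communicator GridComm from the Fox sketches above (the function names here are hypothetical); MPI_Cart_shift computes the source and destination ranks of a cyclic shift along one grid dimension:

#include <mpi.h>

// Initial skew: row i of the A blocks moves i positions to the left,
// column j of the B blocks moves j positions upward.
void CannonInitialSkew(double *pAblock, double *pBblock, int BlockSize,
                       MPI_Comm GridComm, const int GridCoords[2]) {
    int src, dst, n = BlockSize * BlockSize;
    MPI_Cart_shift(GridComm, 1, -GridCoords[0], &src, &dst);
    MPI_Sendrecv_replace(pAblock, n, MPI_DOUBLE, dst, 0, src, 0,
                         GridComm, MPI_STATUS_IGNORE);
    MPI_Cart_shift(GridComm, 0, -GridCoords[1], &src, &dst);
    MPI_Sendrecv_replace(pBblock, n, MPI_DOUBLE, dst, 0, src, 0,
                         GridComm, MPI_STATUS_IGNORE);
}

// After each block multiplication: the A blocks shift one position to
// the left along the grid row, the B blocks one position upward.
void CannonIterationShift(double *pAblock, double *pBblock, int BlockSize,
                          MPI_Comm GridComm) {
    int src, dst, n = BlockSize * BlockSize;
    MPI_Cart_shift(GridComm, 1, -1, &src, &dst);
    MPI_Sendrecv_replace(pAblock, n, MPI_DOUBLE, dst, 0, src, 0,
                         GridComm, MPI_STATUS_IGNORE);
    MPI_Cart_shift(GridComm, 0, -1, &src, &dst);
    MPI_Sendrecv_replace(pBblock, n, MPI_DOUBLE, dst, 0, src, 0,
                         GridComm, MPI_STATUS_IGNORE);
}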
Algorithm 3: Cannon’s Method…
Scaling and Distributing the Subtasks among the
Processors:
– The sizes of the matrix blocks can be selected so that the number of subtasks coincides with the number of available processors p,
– The most efficient execution of the parallel Cannon algorithm is achieved when the communication network topology is a two-dimensional grid,
– In this case the subtasks can be distributed among the processors in a natural way: the subtask (i,j) is placed on the processor pi,j
Algorithm 3: Cannon’s Method…
Efficiency Analysis…
– Speed-up and efficiency generalized estimates:

$$S_p = \frac{n^3}{n^3/p} = p, \qquad E_p = \frac{n^3}{p \cdot (n^3/p)} = 1$$

The developed method of parallel computations makes it possible to achieve ideal speed-up and efficiency characteristics.
Algorithm 3: Cannon’s Method…
Efficiency Analysis (detailed estimates):
– Cannon’s algorithm differs from Fox’s algorithm only in the types of communication operations, that is why:

$$T_p(calc) = (n^2/p) \cdot (2n - 1) \cdot \tau$$

– Time of the initial redistribution of the matrix blocks:

$$T_p^1(comm) = 2 \cdot (\alpha + w \cdot (n^2/p) / \beta)$$

– After every multiplication operation the matrix blocks are shifted:

$$T_p^2(comm) = 2 \cdot (\alpha + w \cdot (n^2/p) / \beta)$$

The total time of the parallel algorithm execution is:

$$T_p = q \cdot [(n^2/p)(2n/q - 1) + (n^2/p)] \cdot \tau + (2q + 2) \cdot (\alpha + w (n^2/p) / \beta)$$
Algorithm 3: Cannon’s Method…
Results of computational experiments…
– Comparison of theoretical estimations and results of
computational experiments
Matrix     4 processors            9 processors
 Size      Model      Experiment   Model      Experiment
 1000        3.4485     3.0806       1.5669     1.1889
 1500       11.3821    11.1716       5.1348     4.6310
 2000       26.6769    24.0502      11.9912    14.4759
 2500       51.7488    53.1444      23.2098    23.5398
 3000       89.0138    88.2979      39.8643    36.3688

[Figure: execution time (model vs. experiment) as a function of matrix size]
Algorithm 3: Cannon’s Method
Results of computational experiments:
– Speedup
Matrix     Serial       4 processors          9 processors
 Size      Algorithm    Time      Speed Up    Time      Speed Up
 1000       12.8787      3.0806    4.1805      1.1889   10.8324
 1500       43.4731     11.1716    3.8913      4.6310    9.3872
 2000      103.0561     24.0502    4.2850     14.4759    7.1191
 2500      201.2915     53.1444    3.7876     23.5398    8.5511
 3000      347.8434     88.2979    3.9394     36.3688    9.5643

[Figure: speedup of Cannon's algorithm vs. number of processors (4, 9) for matrix sizes 1000–3000]
Summary…
Three parallel algorithms for matrix multiplication have been discussed:
– Algorithm 1 – block-striped decomposition,
– Algorithm 2 – Fox’s method (checkerboard decomposition),
– Algorithm 3 – Cannon’s method (checkerboard decomposition).
The parallel program sample for Fox’s algorithm has been described.
Theoretical analysis makes it possible to predict the speed-up and efficiency characteristics of the parallel computations with sufficiently high accuracy.
Summary
All presented algorithms have nearly the same theoretical
estimations for speed-up and efficiency characteristics
[Figure: speedup of the block-striped, Fox, and Cannon algorithms as a function of matrix size]
Discussions
What sequential algorithms for matrix multiplication are
known? What is the complexity of these algorithms?
What basic approaches can be used for developing the
parallel algorithms for matrix multiplication?
Which algorithm possesses the best speedup and efficiency characteristics?
Which of the described algorithms requires the most memory, and which the least?
What data communication operations are necessary for the
parallel algorithms of matrix multiplication?
Exercises
Develop parallel programs for the two matrix multiplication algorithms based on block-striped decomposition. Compare their execution times.
Develop a parallel program for Cannon’s algorithm.
Formulate theoretical estimations for the execution time of these algorithms.
Execute the programs. Compare the results of the computational experiments with the theoretical estimations obtained.
References
Gergel, V.P. (2007). Theory and Practice of Parallel Computations. Moscow: INTUIT.RU. 424 p. (In Russian)
Gergel, V.P. (2010). High Performance Computations for Multiprocessor Multicore Systems. Moscow: MSU. 544 p. (In Russian)
Kumar, V., Grama, A., Gupta, A., Karypis, G. (1994). Introduction to Parallel Computing. The Benjamin/Cummings Publishing Company, Inc. (2nd edn., 2003)
Quinn, M.J. (2004). Parallel Programming in C with MPI and OpenMP. New York, NY: McGraw-Hill.
Fox, G.C., Otto, S.W., Hey, A.J.G. (1987). Matrix Algorithms on a Hypercube I: Matrix Multiplication. Parallel Computing, 4, 17-31.
Next Section
Parallel Methods for Solving Linear Systems
Author’s Team
Gergel V.P., Professor, Doctor of Science in Engineering, Course Author
Grishagin V.A., Associate Professor, Candidate of Science in Mathematics
Abrosimova O.N., Assistant Professor (chapter 10)
Kurylev A.L., Assistant Professor (learning labs 4,5)
Labutin D.Y., Assistant Professor (ParaLab system)
Sysoev A.V., Assistant Professor (chapter 1)
Gergel A.V., Post-Graduate Student (chapter 12, learning lab 6)
Labutina A.A., Post-Graduate Student (chapters 7,8,9, learning labs 1,2,3,
ParaLab system)
Senin A.V., Post-Graduate Student (chapter 11, learning labs on Microsoft
Compute Cluster)
Liverko S.V., Student (ParaLab system)
About the project
The purpose of the project is to develop a set of educational materials for the teaching course “Multiprocessor computational systems and parallel programming”. The course is designed around the parallel computation problems stipulated in the recommendations of the IEEE-CS and ACM Computing Curricula 2001. The educational materials can be used for teaching/training specialists in the fields of informatics, computer engineering and information technologies. The curriculum consists of the training course “Introduction to the methods of parallel programming” and the computer laboratory training “The methods and technologies of parallel program development”. These educational materials make it possible to seamlessly combine fundamental education in computer science with practical training in the methods of developing software for solving complicated, time-consuming computational problems on high performance computational systems.
The project was carried out at Nizhny Novgorod State University by the Software Department of the Computing Mathematics and Cybernetics Faculty (http://www.software.unn.ac.ru). The project was implemented with the support of Microsoft.