7: Introductory MPI
John Burkardt
Virginia Tech
10-12 June 2008
Introductory MPI
MPI: Why, Where, How?
Overview of an MPI computation
Designing an MPI computation
The Source Code
Compiling, linking, running.
MPI: Why, Where, How?
In 1917, Richardson’s first efforts to compute a weather prediction
were simplistic and mysteriously inaccurate.
Over time, it became clear that weather could be predicted, but
that accuracy required huge amounts of data.
Soon there was so much data that making a prediction 24 hours in
advance could take...24 hours of computer time.
Weather events like Hurricane Wilma ($30 billion in damage)
meant accurate weather prediction was worth paying for.
Richardson’s Weather Computation for 1917
Predicting the Path of Hurricane Wilma, 2005
MPI: Why, Where, How?
In the 1990’s, the cost of designing and maintaining specially built
supercomputers became unacceptably high.
It also seemed to be the case that the performance of commodity
chips was not improving fast enough.
However, inter-computer communication had gotten faster and
cheaper.
It seemed possible to imagine that an “orchestra” of low-cost
machines could work together and outperform supercomputers in
both speed and cost.
But where was the conductor?
Cheap, Ugly, Effective
MPI: Why, Where, How?
MPI (the Message Passing Interface) manages a parallel
computation on a distributed memory system.
MPI is told the number of computers available, and the program
to be run.
MPI then
distributes a copy of the program to each computer,
assigns each computer a process id,
synchronizes the start of the programs,
transfers messages between the processors.
MPI: Why, Where, How?
Suppose a user has written a C program myprog.c that includes
the necessary MPI calls (we’ll worry about what those are later!).
The program must be compiled and loaded into an executable
program. This is usually done on a special compile node of the
cluster, which is available for just this kind of interactive use.
mpicc -o myprog myprog.c
MPI: Why, Where, How?
On some systems, the executable can be run interactively, with the
mpirun command. Here, we request that 4 processors be used in
the execution:
mpirun -np 4 myprog > output.txt
MPI: Why, Where, How?
Interactive execution may be ruled out if you want to run for a
long time, or with a lot of processors.
In that case, you write a job script that describes time limits, input
files, and the program to be run.
You submit the job to a batch system, perhaps like this:
qsub myprog.sh
When your job is completed, you will be able to access the output
file, as well as a log file describing how the job ran on the
compute nodes.
MPI: Why, Where, How?
Since your program ran on N computers, you might have some
natural concerns:
Q: How do we avoid doing the exact same thing N times?
A: MPI gives each computer a unique ID, and that’s enough.
Q: Do we end up with N separate output files?
A: No, MPI collects them all together for you.
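Jumping ahead slightly, here is a minimal sketch (not taken from the
original slides) of a complete program in which each process uses its
unique ID to decide what to do:

# include <stdio.h>
# include "mpi.h"

int main ( int argc, char *argv[] )
{
  int id, p;

  MPI_Init ( &argc, &argv );
  MPI_Comm_rank ( MPI_COMM_WORLD, &id );  /* my unique ID */
  MPI_Comm_size ( MPI_COMM_WORLD, &p );   /* total number of processes */

  if ( id == 0 )
  {
    printf ( "Process 0: I will coordinate the %d processes.\n", p );
  }
  else
  {
    printf ( "Process %d: I will work on my share of the data.\n", id );
  }

  MPI_Finalize ( );
  return 0;
}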
Overview of an MPI Computation
We’ll begin with a discussion of MPI computation “without MPI”.
That is, we’ll hold off on the details of the MPI language, but we
will go through the motions of re-implementing a sequential
algorithm using the capabilities of MPI.
The algorithm we have chosen is a simple example of domain
decomposition, the time dependent heat equation on a wire (a one
dimensional region).
Overview of an MPI Computation: DE-HEAT
Determine the values of H(x, t) over a range t0 <= t <= t1 and
space x0 <= x <= x1, given an initial value H(x, t0), boundary
conditions, a heat source function f(x, t), and a partial differential
equation

\[
\frac{\partial H}{\partial t} - k \frac{\partial^2 H}{\partial x^2} = f(x,t)
\]
Overview of an MPI Computation: DE-HEAT
The discrete version of the differential equation is:
\[
\frac{h(i,j+1) - h(i,j)}{dt} - k \, \frac{h(i-1,j) - 2h(i,j) + h(i+1,j)}{dx^2} = f(i,j)
\]
We have the values of h(i, j) for 0 <= i <= N and a particular
“time” j. We seek the value of h at the “next time”, j + 1.
Boundary conditions give us h(0, j + 1) and h(N, j + 1), and we
use the discrete equation to get the values of h for the remaining
spatial indices 0 < i < N.
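Solving the discrete equation for the value at the new time gives the
explicit update rule that each process will apply (this rearrangement
is implicit in the slides, and reappears in the source code later):

\[
h(i,j+1) = h(i,j) + dt \left( k \, \frac{h(i-1,j) - 2h(i,j) + h(i+1,j)}{dx^2} + f(i,j) \right)
\]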
Overview of an MPI Computation: DE-HEAT
Overview of an MPI Computation: DE-HEAT
At a high level of abstraction, it’s easy to see how this
computation could be done by three processors, which we can call
red, green and blue, or perhaps “0”, “1”, and “2”.
Each processor has a part of the h array.
The red processor, for instance, updates h(0) using boundary
conditions, and h(1) through h(6) using the differential equation.
Because red and green are neighbors, they will also need to
exchange messages containing the values of h(6) and h(7) at the
nodes that are touching.
One Program Binds Them All
At the next level of abstraction, we have to address the issue of
writing one program that all the processors can carry out.
This is going to sound like a Twilight Zone episode.
You’ve just woken up, and been told to help on the HEAT problem.
You know there are P processors on the job.
You look in your wallet and realize your ID number is ID (ID
numbers run from 0 to P-1).
Who do you need to talk to? What do you do?
Overview of an MPI Computation: DE-HEAT
Who do you need to talk to?
If your ID is 0, you will need to share some information with your
right-hand neighbor, processor 1.
If your ID is P-1, you’ll share information with your left-hand
neighbor, processor P-2.
Otherwise, talk to both ID-1 and ID+1.
In the communication, you “trade” information about the current
value of your solution that touches the border with your neighbor.
You need your neighbor’s border value in order to complete the
stencil that lets you compute the next set of data.
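As a small illustration (not part of the original slides), each
process can work out its neighbors from its ID alone. This fragment
assumes id and p have been set as above; MPI_PROC_NULL is the
standard MPI constant meaning “no such neighbor”, and a send or
receive addressed to it is simply ignored.

/* Determine left and right neighbors from my ID. */
int left  = ( id == 0     ) ? MPI_PROC_NULL : id - 1;
int right = ( id == p - 1 ) ? MPI_PROC_NULL : id + 1;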
Overview of an MPI Computation: DE-HEAT
What do you do?
We’ll redefine N to be the number of nodes in our own single
program, so that the total number of nodes is P*N.
We’ll store our data in entries H[1] through H[N].
We include two extra locations, H[0] and H[N+1], for values
copied from neighbors. These are sometimes called “ghost values”.
We can figure out our range of X values as
[ ID*N/(P*N-1), ((ID+1)*N-1)/(P*N-1) ];
To do our share of the work, we must update H[1] through H[N].
To update H[1], if ID=0, use the boundary condition, else a
stencil that includes H[0] copied from the left-hand neighbor.
To update H[N], if ID=P-1, use the boundary condition, else a
stencil that includes H[N+1] copied from the right-hand neighbor.
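Here is a minimal sketch of the local storage this implies, assuming
for simplicity that the region is the unit interval [0,1]; the full
source code later handles a general [x_min, x_max].

/* Each process stores its N values in h[1..n], plus two ghost
   slots h[0] and h[n+1] for values copied from the neighbors. */
double *h = ( double * ) malloc ( ( n + 2 ) * sizeof ( double ) );
double *x = ( double * ) malloc ( ( n + 2 ) * sizeof ( double ) );

for ( i = 0; i <= n + 1; i++ )   /* X coordinate of local node i */
{
  x[i] = ( double ) ( id * n + i - 1 ) / ( double ) ( p * n - 1 );
}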
Overview of an MPI Computation: DE-HEAT
Some things to notice from our overview.
This program would be considered a good use of MPI, since the
problem is easy to break up into cooperating programs.
The amount of communication between processors is small, and
the pattern of communication is very regular.
The data for this problem is truly distributed. No single processor
has access to the whole solution.
In this case, the individual program that runs on one computer
looks a lot like the sequential program that would solve the whole
problem. That’s not always how it works, though!
How to Say it in MPI: Initialize and Finalize
# include <stdlib.h>
# include <stdio.h>
# include "mpi.h"

int main ( int argc, char *argv[] )
{
  int id, p;

  MPI_Init ( &argc, &argv );
  MPI_Comm_rank ( MPI_COMM_WORLD, &id );
  MPI_Comm_size ( MPI_COMM_WORLD, &p );

  /* Here's where the good stuff goes! */

  MPI_Finalize ( );
  return 0;
}
How to Say it in MPI: The “Good Stuff”
As we begin our calculation, processes 1 through P-1 must send
what they call h[1] “to the left”.
Processes 0 through P-2 must receive these values, storing them in
the ghost value slot h[n+1].
Similarly, h[n] gets tossed “to the right” into the ghost slot h[0] of
the next higher processor.
Sending this data is done with matching calls to MPI_Send and
MPI_Recv. The details of the calls are more complicated than I am
showing here!
How to Say it in MPI: The “Good Stuff”
if ( 0 < id )
MPI_Send ( h[1] => id-1 )
if ( id < p-1 )
MPI_Recv ( h[n+1] <= id+1 )
if ( id < p-1 )
MPI_Send ( h[n] => id+1 )
if ( 0 < id )
MPI_Recv ( h[0] <= id-1 )
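For reference, here is roughly what those pseudocode lines become
with the real calls (a sketch; the complete loop appears in the
source code later). The tag is just an integer agreed on by sender
and receiver, and status records information about the received
message.

MPI_Status status;
int tag = 1;

if ( 0 < id )      /* send my h[1] to the left neighbor */
  MPI_Send ( &h[1], 1, MPI_DOUBLE, id-1, tag, MPI_COMM_WORLD );
if ( id < p-1 )    /* receive the right neighbor's h[1] into my ghost slot h[n+1] */
  MPI_Recv ( &h[n+1], 1, MPI_DOUBLE, id+1, tag, MPI_COMM_WORLD, &status );

tag = 2;
if ( id < p-1 )    /* send my h[n] to the right neighbor */
  MPI_Send ( &h[n], 1, MPI_DOUBLE, id+1, tag, MPI_COMM_WORLD );
if ( 0 < id )      /* receive the left neighbor's h[n] into my ghost slot h[0] */
  MPI_Recv ( &h[0], 1, MPI_DOUBLE, id-1, tag, MPI_COMM_WORLD, &status );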
How to Say it in MPI: The “Good Stuff”
Our communication scheme is defective, however. It comes very
close to deadlock.
Remember deadlock? That’s when a process waits for access to a
device, or for data, or for a message that will never arrive.
The problem here is that, by default, an MPI process that sends a
message won’t continue until that message has been received.
If you think about the implications, it’s almost surprising that the
code I have described will work at all.
It will, but more slowly than it should!
Don’t worry about this fact right now, but realize that with MPI
you must also consider these communication issues.
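One standard remedy, which these slides do not use but which MPI
provides, is MPI_Sendrecv: it pairs a send and the matching receive
in a single call, so the library can carry out the exchange without
the deadlock risk. A sketch, reusing the left and right neighbor
values from the earlier fragment and assuming tag and status are
declared:

/* Send my h[n] to the right neighbor while receiving the left
   neighbor's h[n] into my ghost slot h[0]. With MPI_PROC_NULL at
   the ends, no special cases are needed. */
MPI_Sendrecv ( &h[n], 1, MPI_DOUBLE, right, tag,
               &h[0], 1, MPI_DOUBLE, left,  tag,
               MPI_COMM_WORLD, &status );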
How to Say it in MPI: The “Good Stuff”
All processes can use the four node stencil now to compute the
updated value of h.
Actually, hnew[1] in the first process, and hnew[n] in the last
one, need to be set from the boundary conditions.
But it’s easier to treat them all the same way, and then correct the
two special cases afterwards.
How to Say it in MPI: The “Good Stuff”
for ( i = 1; i <= n; i++ )
  hnew[i] = h[i] + dt * (
    + k * ( h[i-1] - 2 * h[i] + h[i+1] ) / dx / dx
    + f ( x[i], t ) );

/* Process 0 sets left node by BC   */
/* Process P-1 sets right node by BC */
if ( 0 == id )   hnew[1] = bc ( x[1], t );
if ( id == p-1 ) hnew[n] = bc ( x[n], t );

/* Replace old H by new. */
for ( i = 1; i <= n; i++ ) h[i] = hnew[i];
THE SOURCE CODE
Here is almost all the source code for a working version of the heat
equation solver.
I’ve chopped it up a bit and compressed it, but I wanted you to see
how things really look.
This example is also available in a FORTRAN77 version. We will
be able to send copies of these examples to an MPI machine for
processing later.
Heat Equation Source Code (Page 1)
# include <stdlib.h>
# include <stdio.h>
# include <math.h>
# include "mpi.h"

int main ( int argc, char *argv[] )
{
  int id, p;
  double wtime;

  MPI_Init ( &argc, &argv );
  MPI_Comm_rank ( MPI_COMM_WORLD, &id );
  MPI_Comm_size ( MPI_COMM_WORLD, &p );

  update ( id, p );

  MPI_Finalize ( );
  return 0;
}
Heat Equation Source Code (Page 2)
double boundary_condition ( double x, double time )
/* BOUNDARY_CONDITION returns H(0,T) or H(1,T), any time. */
{
  if ( x < 0.5 )
  {
    return ( 100.0 + 10.0 * sin ( time ) );
  }
  else
  {
    return ( 75.0 );
  }
}

double initial_condition ( double x, double time )
/* INITIAL_CONDITION returns H(X,T) for the initial time. */
{
  return 95.0;
}

double rhs ( double x, double time )
/* RHS returns the right hand side function f(x,t). */
{
  return 0.0;
}
Heat Equation Source Code (Page 3)
/* Set the X coordinates of the N nodes. */
x = ( double * ) malloc ( ( n + 2 ) * sizeof ( double ) );

for ( i = 0; i <= n + 1; i++ )
{
  x[i] = ( ( double ) ( id * n + i - 1 ) * x_max
         + ( double ) ( p * n - id * n - i ) * x_min )
         / ( double ) ( p * n - 1 );
}

/* Set the values of H at the initial time. */
time = time_min;
h     = ( double * ) malloc ( ( n + 2 ) * sizeof ( double ) );
h_new = ( double * ) malloc ( ( n + 2 ) * sizeof ( double ) );

h[0] = 0.0;
for ( i = 1; i <= n; i++ )
{
  h[i] = initial_condition ( x[i], time );
}
h[n+1] = 0.0;

time_delta = ( time_max - time_min ) / ( double ) ( j_max - j_min );
x_delta    = ( x_max - x_min ) / ( double ) ( p * n - 1 );
Heat Equation Source Code (Page 4)
for ( j = 1; j <= j_max; j++ ) {
  time_new = j * time_delta;

  /* Send H[1] to ID-1. */
  if ( 0 < id ) {
    tag = 1;
    MPI_Send ( &h[1], 1, MPI_DOUBLE, id-1, tag, MPI_COMM_WORLD );
  }
  /* Receive H[N+1] from ID+1. */
  if ( id < p-1 ) {
    tag = 1;
    MPI_Recv ( &h[n+1], 1, MPI_DOUBLE, id+1, tag, MPI_COMM_WORLD, &status );
  }
  /* Send H[N] to ID+1. */
  if ( id < p-1 ) {
    tag = 2;
    MPI_Send ( &h[n], 1, MPI_DOUBLE, id+1, tag, MPI_COMM_WORLD );
  }
  /* Receive H[0] from ID-1. */
  if ( 0 < id ) {
    tag = 2;
    MPI_Recv ( &h[0], 1, MPI_DOUBLE, id-1, tag, MPI_COMM_WORLD, &status );
  }
Heat Equation Source Code (Page 5)
  /* Update the temperature based on the four point stencil. */
  for ( i = 1; i <= n; i++ )
  {
    h_new[i] = h[i]
      + ( time_delta * k / x_delta / x_delta ) * ( h[i-1] - 2.0 * h[i] + h[i+1] )
      + time_delta * rhs ( x[i], time );
  }

  /* Correct settings of first H in first interval, last H in last interval. */
  if ( 0 == id )     h_new[1] = boundary_condition ( x[1], time_new );
  if ( id == p - 1 ) h_new[n] = boundary_condition ( x[n], time_new );

  /* Update time and temperature. */
  time = time_new;
  for ( i = 1; i <= n; i++ ) h[i] = h_new[i];

/* End of time loop. */
}
COMPILING, Linking, Running
Now that we have a source code file, let’s go through the process
of using it.
The first step is to compile the program.
On the MPI machine, a special version of the compiler
automatically knows how to include MPI:
mpicc -c myprog.c
Compiling on your laptop can be a convenient check for syntax
errors. But you may need to find a copy of mpi.h for C/C++ or
mpif.h for FORTRAN and place that in the directory.
gcc -c myprog.c
Compiling, LINKING, Running
We can only link on a machine that has the MPI libraries. So let’s
assume we’re on the MPI machine now.
To compile and link in one step:
mpicc myprog.c
To link a compiled code:
mpicc myprog.o
Either command creates the MPI executable a.out.
Compiling, Linking, RUNNING
Sometimes it is legal to run your program interactively on an MPI
machine, if your program is small in time and memory.
Let’s give our executable a more memorable name:
mv a.out myprog
To run interactively with 4 processors:
mpirun -np 4 ./myprog
Compiling, Linking, RUNNING
Most jobs on an MPI system go through a batch system. That
means you copy a script file, change a few parameters including
the name of the program, and submit it.
Here is a script file for System X, called myprog.sh
Compiling, Linking, RUNNING
#!/bin/bash
#PBS -lwalltime=00:00:30
#PBS -lnodes=2:ppn=2
#PBS -W group_list=???
#PBS -q production_q
#PBS -A $$$
NUM_NODES=`/bin/cat $PBS_NODEFILE | /usr/bin/wc -l \
| /usr/bin/sed "s/ //g"`
cd $PBS_O_WORKDIR
export PATH=/nfs/software/bin:$PATH
jmdrun -printhostname \
-np $NUM_NODES \
-hostfile $PBS_NODEFILE \
./myprog &> myprog.txt
exit;
Compiling, Linking, RUNNING
If you look, you can spot the most important line in this file, which
says to run myprog and put the output into myprog.txt.
The option -lnodes=2:ppn=2 says to use two nodes, with two
processors per node, for a total of four processors. (System X
uses dual-core chips.)
Compiling, Linking, RUNNING
So to use the batch system, you first compile your program, then
send the job to be processed:
qsub myprog.sh
The system will accept your job, and report to you a queueing
number that can be used to locate the job while it is waiting, and
which will be part of the name of the log files at the end.
If your output does not show up in a reasonable time, you can
issue the command qstat to see its status.
Conclusion
We’ve gone very quickly over the steps involved in using MPI:
starting with an algorithm, writing a code that uses MPI, and
sending the code through the machine.
We will return to discuss some features of MPI a little more
carefully. In particular, we will look at the six MPI commands we
have used so far, add a few more, and explain how they work.
We will look at another programming example, which uses the
“master-worker” model, where one processor is in charge, and
issues work to the others.