Introduction to Parallel Computing
Solution Manual
Ananth Grama
Anshul Gupta
George Karypis
Vipin Kumar
Copyright 2003 by Addison Wesley
Contents
CHAPTER 1 Introduction
CHAPTER 2 Models of Parallel Computers
CHAPTER 3 Principles of Parallel Algorithm Design
CHAPTER 4 Basic Communication Operations
CHAPTER 5 Analytical Modeling of Parallel Programs
CHAPTER 6 Programming Using the Message-Passing Paradigm
CHAPTER 7 Programming Shared Address Space Platforms
CHAPTER 8 Dense Matrix Algorithms
CHAPTER 9 Sorting
CHAPTER 10 Graph Algorithms
CHAPTER 11 Search Algorithms for Discrete Optimization Problems
CHAPTER 12 Dynamic Programming
CHAPTER 13 Fast Fourier Transform
Bibliography
Preface
This instructor's guide to accompany the text Introduction to Parallel Computing contains solutions to selected problems.
For some problems the solution has been sketched, and the details have been left out. When solutions to problems
are available directly in publications, references have been provided. Where necessary, the solutions are supplemented
by figures. Figure and equation numbers are represented in roman numerals to differentiate them from the figures and
equations in the text.
CHAPTER 1
Introduction
1 At the time of compilation (11/02), the five most powerful computers on the Top 500 list along with their peak
GFLOP ratings are:
1. NEC Earth-Simulator, 5120 processors, 40960.00.
2. IBM ASCI White, SP Power3 375 MHz, 8192 processors, 12288.00.
3. Linux NetworX MCR Linux Cluster Xeon 2.4 GHz - Quadrics, 2304 processors, 11060.00.
4. Hewlett-Packard ASCI Q - AlphaServer SC ES45/1.25 GHz, 4096 processors, 10240.00.
5. Hewlett-Packard ASCI Q - AlphaServer SC ES45/1.25 GHz, 4096 processors, 10240.00.
2 Among many interesting applications, here are a representative few:
1. Structural mechanics: crash testing of automobiles, simulation of structural response of buildings and
bridges to earthquakes and explosions, response of nanoscale cantilevers to very small electromagnetic
fields.
2. Computational biology: structure of biomolecules (protein folding, molecular docking), sequence match-
ing for similarity searching in biological databases, simulation of biological phenomena (vascular flows,
impulse propagation in nerve tissue, etc).
3. Commercial applications: transaction processing, data mining, scalable web and database servers.
3 Data too fluid to plot.
4 Data too fluid to plot.
CHAPTER 2
Models of Parallel Computers
1 A good approximation to the bandwidth can be obtained from a loop that adds a large array of integers:
for (i = 0; i < 1000000; i++)
sum += a[i];
with sum and array a suitably initialized. The time for this loop along with the size of an integer can be used
to compute bandwidth (note that this computation is largely memory bound and the time for addition can be
largely ignored).
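As an illustration, the loop can be packaged into a complete timing program. The following is a minimal sketch, not code from the text; the array size, the use of clock_gettime, and the reported units are illustrative assumptions.

/* Minimal sketch: estimate memory bandwidth by timing the summation loop.
   Array size and timer are illustrative choices. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 100000000          /* large enough to defeat the caches */

int main(void)
{
    int *a = malloc(N * sizeof(int));
    long long sum = 0;
    for (int i = 0; i < N; i++) a[i] = 1;   /* touch the array once */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* bytes moved / time; the additions themselves are essentially free */
    printf("approx. bandwidth: %.2f MB/s (sum=%lld)\n",
           N * sizeof(int) / (secs * 1e6), sum);
    free(a);
    return 0;
}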
To estimate the L1 cache size, write a 3-loop matrix multiplication program. Plot the computation rate of this program as a function of the matrix size n. From this plot, determine sudden drops in performance. The size at which these drops occur, combined with the data size (2n^2) and the word size, can be used to estimate the L1 cache size.
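The cache-size experiment can be sketched in the same spirit. The program below is an illustrative outline rather than a prescribed solution; the matrix sizes probed and the timing mechanism are assumptions.

/* Minimal sketch: time an n x n x n triple-loop matrix multiplication for
   increasing n and report the computation rate.  Drops in MFLOPS mark the
   points where the working set (about 2n^2 words) falls out of a cache level. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double matmul_mflops(int n)
{
    double *a = calloc((size_t)n * n, sizeof(double));
    double *b = calloc((size_t)n * n, sizeof(double));
    double *c = calloc((size_t)n * n, sizeof(double));
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    free(a); free(b); free(c);
    return 2.0 * n * n * n / (secs * 1e6);   /* 2n^3 floating point operations */
}

int main(void)
{
    for (int n = 16; n <= 1024; n *= 2)
        printf("n = %4d  rate = %8.1f MFLOPS\n", n, matmul_mflops(n));
    return 0;
}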
2 The computation performs 8 FLOPS on 2 cache lines, i.e., 8 FLOPS in 200 ns. This corresponds to a compu-
tation rate of 40 MFLOPS.
3 In the best case, the vector gets cached. In this case, 8 FLOPS can be performed on 1 cache line (for the
matrix). This corresponds to a peak computation rate of 80 MFLOPS (note that the matrix does not fit in the
cache).
4 In this case, 8 FLOPS can be performed on 5 cache lines (one for matrix a and four for column-major access
to matrix b). This corresponds to a speed of 16 MFLOPS.
5 For sample codes, see any SGEMM/DGEMM BLAS library source code.
6 Mean access time = 0.8 × 1 + 0.1 × 100 + 0.1 × 400 ≈ 50 ns. This corresponds to a computation rate of 20 MFLOPS (assuming 1 FLOP/word).
Mean access time for serial computation = 0.7 × 1 + 0.3 × 100 ≈ 30 ns. This corresponds to a computation rate of 33 MFLOPS.
Fractional CPU rate = 20/33 ≈ 0.60.
7 Solution in text.
8 Scaling the switch while maintaining throughput is a major challenge. The complexity of the switch is O(p^2).
9 CRCW PRAM is the most powerful because it can emulate other models without any performance overhead.
The reverse is not true.
10 We illustrate the equivalence of a butterfly and an omega network for an 8-input network by rearranging the switches of an omega network so that it looks like a butterfly network. This is shown in Figure 2.1 [Lei92a].
Figure 2.1 An 8-input omega network redrawn to look like a butterfly network. Node <i, l> (node i at level l) is identical to node <j, l> in the butterfly network, where j is obtained by right circular shifting the binary representation of i l times.
from one partition to the other, we eliminate one communication link across the boundary. However, this processor is connected to d − 1 processors in the original subcube. Therefore, an additional d − 1 links are added. In the next step, one of these d − 1 processors is moved to the second partition. However, this processor is connected to d − 2 processors other than the original processors. In this way, moving processors across the boundary, we can see that the minimum resulting from any perturbation is one in which the two partitions are subcubes. Therefore, the minimum number of communication links across any two halves of a d-dimensional hypercube is 2^(d−1).
16 Partitioning the mesh into two equal parts of p/2 processors each would leave at least √p communication links between the partitions. Therefore, the bisection width is √p. By configuring the mesh appropriately, the
distance between any two processors can be made to be independent of the number of processors. Therefore,
the diameter of the network is O(1). (This can however be debated because reconfiguring the network in a par-
ticular manner might leave other processors that may be more than one communication link away from each
other. However, for several communication operations, the network can be configured so that the communi-
cation time is independent of the number of processors.) Each processor has a reconfigurable set of switches
associated with it. From Figure 2.35 (page 80), we see that each processor has six switches. Therefore, the
total number of switching elements is 6p. The number of communication links is identical to that of a regular two-dimensional mesh, and is given by 2(p − √p).
The basic advantage of the reconfigurable mesh results from the fact that any pair of processors can commu-
nicate with each other in constant time (independent of the number of processors). Because of this, many
communication operations can be performed much faster on a reconfigurable mesh (as compared to its regular
counterpart). However, the number of switches in a reconfigurable mesh is larger.
17 Partitioning the mesh into two equal parts of p/2 processors each would leave at least √p communication links between the partitions. Therefore, the bisection width of a mesh of trees is √p. The processors at the two extremities of the mesh of trees require the largest number of communication links to communicate. This is given by 2 log(√p) + 2 log(√p), or 2 log p. A complete binary tree is imposed on each row and each column of the mesh of trees. There are 2√p such rows and columns. Each such tree has √p − 1 switches. Therefore, the total number of switches is given by 2√p(√p − 1), or 2(p − √p).
Leighton [Lei92a] discusses this architecture and its properties in detail.
18 In the d-dimensional mesh of trees, each dimension has p^(1/d) processors. The processor labels can be expressed in the form of a d-tuple. The minimum number of communication links across a partition is obtained when the coordinate along one of the dimensions is fixed. This would result in p^((d−1)/d) communication links. Therefore, the bisection width is p^((d−1)/d).
Connecting p^(1/d) processors into a complete binary tree requires p^(1/d) − 1 switching elements. There are p^((d−1)/d) distinct ways of fixing any one dimension and there are d dimensions. Therefore, the total number of switching elements is given by d p^((d−1)/d) (p^(1/d) − 1), or d(p − p^((d−1)/d)).
Similarly, the number of communication links required to connect processors along any one dimension is given by 2(p^(1/d) − 1). Using a procedure similar to the one above, we can show that the total number of communication links is given by d p^((d−1)/d) × 2(p^(1/d) − 1), or 2d(p − p^((d−1)/d)).
The diameter of the network can be derived by traversing along each dimension. There are d dimensions and traversing each dimension requires 2 log(p^(1/d)) links. Therefore, the diameter is d × 2 log(p^(1/d)), or 2 log p.
The advantage of a mesh of trees is that it has a smaller diameter compared to a mesh. However, this comes
at the cost of increased hardware in the form of switches. Furthermore, it is difficult to derive a clean planar
structure, as is the case with 2-dimensional meshes.
19 Leighton [Lei92a] discusses this solution in detail.
20 Figure 2.2 illustrates a 4 × 4 wraparound mesh with equal wire lengths.
21 Consider a p × q × r mesh being embedded into a 2^d-processor hypercube. Assume that p = 2^x, q = 2^y, and r = 2^z. Furthermore, since p × q × r = 2^d, x + y + z = d.
The embedding of the mesh can be performed as follows: Map processor (i, j, k) in the mesh to processor
G(i, x)G(j, y)G(k, z) (concatenation of the Gray codes) in the hypercube using the Gray code function G described in Section 2.7.1 (page 67).
To understand how this mapping works, consider the partitioning of the d bits in the processor labels into three groups consisting of x, y, and z bits. Fixing the bits corresponding to any two groups yields a subcube corresponding to the other group. For instance, fixing y + z bits yields a subcube of 2^x processors. A processor (i, j, k) in the mesh has direct communication links to processors (i + 1, j, k), (i − 1, j, k), (i, j + 1, k), (i, j − 1, k), (i, j, k + 1), and (i, j, k − 1). Let us verify that processors (i + 1, j, k) and (i − 1, j, k) are indeed neighbors of processor (i, j, k). Since j and k are identical, G(j, y) and G(k, z) are fixed. This means that the two processors lie in a subcube of 2^x processors corresponding to the first x bits. Using the embedding of a linear array into a hypercube, we can verify that processors (i + 1, j, k) and (i − 1, j, k) are directly connected in the hypercube. It can be verified similarly that the other processors in the mesh which are directly connected to processor (i, j, k) also have a direct communication link in the hypercube.
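The mapping itself is easy to express in code. The sketch below is illustrative rather than authoritative; the function names gray and mesh_to_hypercube and the sample mesh dimensions are assumptions.

/* Map mesh coordinates (i, j, k) of a 2^x x 2^y x 2^z mesh to a hypercube
   label by concatenating binary reflected Gray codes, as described above. */
#include <stdio.h>

static unsigned gray(unsigned i) { return i ^ (i >> 1); }   /* G(i, .) */

static unsigned mesh_to_hypercube(unsigned i, unsigned j, unsigned k,
                                  int y, int z)
{
    return (gray(i) << (y + z)) | (gray(j) << z) | gray(k);
}

int main(void)
{
    int y = 1, z = 1;                       /* a 4 x 2 x 2 mesh: x = 2, d = 4 */
    printf("(1,0,1) -> %u\n", mesh_to_hypercube(1, 0, 1, y, z));
    printf("(2,0,1) -> %u\n", mesh_to_hypercube(2, 0, 1, y, z));
    /* the two labels differ in exactly one bit, i.e., the processors are
       hypercube neighbours */
    return 0;
}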
23 Ranka and Sahni [RS90] present a discussion of the embedding of a complete binary tree into a hypercube.
24 The mapping of a hypercube into a mesh follows directly from an inverse mapping of the mesh into a hypercube. Consider the congestion of the inverse mapping. A single subcube of √p processors is mapped onto each row of the mesh (assuming a √p × √p mesh). To compute the congestion of this mapping, consider the number of hypercube links that map onto the mesh links connecting one half of this row to the other. The hypercube has √p/2 links going across and a single row of the mesh (with wraparound) has two links going across. Therefore, the congestion of this mapping is √p/4.
It can be shown that this mapping yields the best congestion for the mapping of a hypercube into a mesh. Therefore, if the mesh links are faster by a factor of √p/4 or more, the mesh computer is superior. For the example cited, p = 1024. Hence, for the mesh to be better, its links must be faster by a factor of √1024/4 = 8. Since the mesh links operate at 25 million bytes per second and those of the hypercube operate at 2 million bytes per second, the mesh architecture is indeed strictly superior to the hypercube.
25 The diameter of a k-ary d-cube can be derived by traversing the farthest distance along each dimension. The farthest distance along each dimension is k/2, and since there are d such dimensions, the diameter is dk/2.
Each processor in a k-ary d-cube has 2d communication links. Therefore, the total number of communication links is pd.
The bisection width of a k-ary d-cube can be derived by fixing one of the dimensions and counting the number of links crossing this hyperplane. Any such hyperplane is intersected by 2k^(d−1) (for k > 2) communication links. The factor of 2 results from the wraparound connections. (Note that the bisection width can also be written as 2k^(d−1).)
Figure 2.3 Communication time plotted against the degree d of a cut-through routing network, using the number of communication links as a cost metric (p = 256, 512, 1024).
Figure 2.4 Communication time plotted against the degree d of a cut-through routing network, using the bisection width as a cost metric (p = 256, 512, 1024).
The average communication distance along each dimension is k/4. Therefore, in d dimensions, the average
distance lav is kd/4.
27 (1) The cost of a k-ary d-cube of p processors in terms of the number of communication links is given by dp (for k > 2). The corresponding cost for a binary hypercube is given by p log p/2. Therefore, if the width of each channel in the k-ary d-cube is r, then the total cost is given by dpr. If this cost is identical to the cost of a binary hypercube, then dpr = p log p/2, or r = log p/(2d).
(2) The bisection width of a k-ary d-cube of p processors is given by 2k^(d−1) and that of a binary hypercube is given by p/2. If the channel width of the k-ary d-cube is r, then equating the costs yields r × 2k^(d−1) = p/2. Therefore, the channel width is r = p/(4k^(d−1)). Since k^d = p, r = k/4.
(The cost and bisection width of these networks are given in Table 2.1 (page 44).)
28 The average distance between two processors in a hypercube is given by log p/2. The cost of communicating a message of size m between two processors in this network with cut-through routing is

Tcomm = ts + th (log p)/2 + tw m.

The average distance between two processors in a k-ary d-cube is given by kd/4. The cost of communicating
Figure 2.5 Communication time plotted against the degree d of a store-and-forward routing network, using the number of communication links as a cost metric (p = 256, 512, 1024).
Figure 2.6 Communication time plotted against the degree d of a store-and-forward routing network, using the bisection width as a cost metric (p = 256, 512, 1024).
a message of size m between two processors in this network with cut-through routing is

Tcomm = ts + th (kd/4) + (tw/r) m,

where r is the scaling factor for the channel bandwidth.
From the solution to Problem 27, if the number of channels is used as a cost metric, then we have r = s = log p/(2d). Therefore,

Tcomm = ts + th (kd/4) + (2 tw d/log p) m.
Similarly, using the bisection width as a cost metric, we have r = s = k/4. Therefore,

Tcomm = ts + th (kd/2) + (k tw/4) m.
The communication times are plotted against the dimension of the k-ary d-cube for both of these cost metrics
in Figures 2.3 and 2.4.
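The trends in Figures 2.3 and 2.4 can be reproduced qualitatively by evaluating the general cut-through expression Tcomm = ts + th (kd/4) + (tw/r) m for the two values of r derived above. The following sketch is illustrative only; the constants ts, th, tw and the message size m are assumed values, not those used for the figures.

/* Evaluate the cut-through communication time of a k-ary d-cube for the two
   channel-width scalings (equal link count and equal bisection width). */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double ts = 25.0, th = 2.0, tw = 4.0, m = 100.0;  /* assumed costs */
    const double p = 1024.0;
    for (int d = 1; d <= 10; d++) {
        double k = pow(p, 1.0 / d);                  /* k-ary d-cube, k^d = p */
        double r_links  = log2(p) / (2.0 * d);       /* equal number of links */
        double r_bisect = k / 4.0;                   /* equal bisection width */
        double t_links  = ts + th * k * d / 4.0 + (tw / r_links)  * m;
        double t_bisect = ts + th * k * d / 4.0 + (tw / r_bisect) * m;
        printf("d=%2d  k=%6.1f  T(links)=%8.1f  T(bisection)=%8.1f\n",
               d, k, t_links, t_bisect);
    }
    return 0;
}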
29 The cost of communicating a message of size m between two processors in a hypercube with store-and-forward
routing is
Tcomm = ts + tw m (log p)/2.
Using the number of links as a cost metric, the corresponding communication time for a k-ary d-cube is given by

Tcomm = ts + (2 tw d/log p)(kd/2) m.

This communication time is plotted against the degree of the network in Figure 2.5.
Using the bisection width as a cost metric, the corresponding communication time for a k-ary d-cube is given by

Tcomm = ts + (k tw/4)(kd/2) m.

This communication time is plotted against the degree of the network in Figure 2.6.
CHAPTER 3
Principles of Parallel Algorithm Design
Figure 3.1 Task-dependency graphs, critical-path lengths, and mappings onto 3 and 4 processes for 3 × 3 block LU factorization.
CHAPTER 4
Basic Communication Operations
1 Add an extra iteration to the outer loop. Some processes will not participate in the communication in the first iteration of Algorithms 4.1 and 4.2, and in the last iteration of Algorithm 4.3 (iteration d in all cases).
2 Note: There is a typo in the problem statement. Algorithm 3.1 should read Algorithm 4.1.
5 Refer to Figure 4.7 (page 155). In the first iteration, the following pairs of processors exchange their data of
size m: (0,4), (1,5), (2,6), and (3,7). This step takes ts + 4tw m time because four messages of size m pass
through the root of the tree. At the end of this step, each processor has data of size 2m. Now the following
pairs of processors exchange their data: (0,2), (1,3), (4,6), and (5,7). In this step, the size of each message
is 2m, and two messages traverse each channel in the same direction. Thus, this step takes ts + 4tw m. In
the final step, messages of size 4m are exchanged without channel contention between the following pairs of
processors: (0,1), (2,3), (4,5), and (6,7). This step also takes ts + 4tw m time. In general, all-to-all broadcast
can be performed on a p-processor tree of the type shown in Figure 4.7 (page 155) in (ts + tw mp/2) log p
time.
Note that if the order of the steps is changed, then the communication cost will increase. For example, if
the pairs of processors that exchange data in the first step are (0,1), (2,3), (4,5), and (6,7), and the pairs that
exchange data in the last step are (0,4), (1,5), (2,6), and (3,7), then the total communication time will be
ts log p + tw m(p^2 − 1)/3.
On the other hand, another scheme can perform all-to-all broadcast on a p-processor tree with CT routing in
(ts + tw m)(p − 1) time. This can be done by embedding a logical ring onto the tree, as shown in Figure 4.1.
Note that no two messages travelling in the same direction share a communication link. It is easy to verify
that executing the algorithms of Figure 4.9 (page 159) of the text results in total communication time of
(ts + tw m)(p − 1). This scheme leads to a higher ts term, but a lower tw term.
7 In the hypercube algorithm for all-to-all personalized communication described in Section 4.4 (page 167), the
entire message received by a processor is not required by that processor. On a completely connected network,
this operation can be performed in (ts + tw m)(p − 1) time. However, the best communication time for one-to-
all broadcast is (ts + tw m) log p (the same as in the case of a hypercube) if an entire message is routed along
the same path.
8 The communication pattern for multinode accumulation on a hypercube is exactly the opposite of the pattern
shown in Figure 4.11 (page 164) for all-to-all broadcast. Each processor has p messages to start with (as in
Figure 4.11 (page 164)(d)), one for each processor (Figure 4.8 (page 157)). In the ith iteration (0 < i ≤ log p) of the multinode accumulation algorithm, processors communicate along the (log p − i + 1)th dimension
Figure 4.1 All-to-all broadcast on an eight-processor tree by mapping a logical eight-processor ring onto it.
of the hypercube so that processor j communicates with processor j ⊕ 2^(log p − i). For example, in the first
step, processor 0 sends the messages destined for processors 4, 5, 6, and 7 to processor 7, and processor 7
sends the messages destined for processors 0, 1, 2, and 3 to processor 0. After the first communication step,
each processor adds the mp/2 numbers received to the mp/2 numbers already residing on them. The size
of the data communicated and added in an iteration is half of that in the previous iteration. The total time is
Σ_{i=1}^{log p} (ts + (tw + tadd) mp/2^i), which is equal to ts log p + m(tw + tadd)(p − 1).
9 The shortest distance between processors i and j is their Hamming distance Hi,j (that is, the number of 1s in the bitwise exclusive-or of i and j). Let log p = d. The solution to this problem is the same as the number of d-bit numbers with exactly Hi,j 1s in their binary representation, which is d!/(Hi,j!(d − Hi,j)!).
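The count can be verified directly. The snippet below is a small illustrative sketch (not from the text): it computes the Hamming distance and evaluates d!/(Hi,j!(d − Hi,j)!) as a binomial coefficient; the example labels and hypercube size are assumptions.

/* Count shortest paths between hypercube nodes i and j per the formula above. */
#include <stdio.h>

static unsigned long long binom(int n, int k)
{
    unsigned long long r = 1;
    for (int t = 1; t <= k; t++)
        r = r * (n - k + t) / t;     /* exact at every step */
    return r;
}

static int hamming(unsigned i, unsigned j)
{
    int h = 0;
    for (unsigned x = i ^ j; x; x >>= 1)
        h += x & 1;
    return h;
}

int main(void)
{
    int d = 4;                                   /* 16-processor hypercube */
    unsigned i = 1, j = 6;                       /* Hamming distance 3 */
    printf("paths(%u,%u) = %llu\n", i, j, binom(d, hamming(i, j)));
    return 0;
}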
10 First, each processor performs a local prefix sum of its n/p numbers in (n/p − 1) tadd time. In the second step, the p processors compute prefix sums of p numbers by using the last prefix sum resulting from the local computation on each processor. This step takes (tadd + ts + tw) log p time. Finally, the result of the parallel prefix sums operation of the second step is added to all the n/p prefix sums of the first step at each processor in tadd n/p time. Therefore,

TP = (2n/p − 1) tadd + (tadd + ts + tw) log p.
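The three phases are easy to see in code. The following serial sketch (not from the text) mimics the parallel algorithm by looping over p logical blocks; the array contents and the block count are illustrative assumptions.

/* Three-phase prefix sum: local scans, a scan of the block totals, and an
   offset added back into each block. */
#include <stdio.h>

#define N 16
#define P 4

int main(void)
{
    int a[N], last[P];
    for (int i = 0; i < N; i++) a[i] = 1;      /* prefix sums become 1..N */

    int b = N / P;
    for (int q = 0; q < P; q++) {              /* phase 1: local prefix sums */
        for (int i = q * b + 1; i < (q + 1) * b; i++)
            a[i] += a[i - 1];
        last[q] = a[(q + 1) * b - 1];
    }
    for (int q = 1; q < P; q++)                /* phase 2: scan of block totals */
        last[q] += last[q - 1];
    for (int q = 1; q < P; q++)                /* phase 3: add block offsets */
        for (int i = q * b; i < (q + 1) * b; i++)
            a[i] += last[q - 1];

    for (int i = 0; i < N; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}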
11 Consider a square mesh without wraparound connections.
3tw m p^(1/3) and tw mp time, respectively) than on a regular 3-D mesh (on which they take the same amount of time as on the sparse 3-D mesh). On the other hand, a sparse 3-D mesh is less cost-effective for all-to-all broadcast (approximately 2tw mp time) than a regular 3-D mesh (approximately tw mp time).
14 In this communication model, one-to-all broadcast, all-to-all broadcast, and one-to-all personalized commu-
nication take approximately 3ts p^(1/3) time on both architectures. Thus, a sparse 3-D mesh is more cost-
effective.
15 In all-to-all broadcast on a hypercube, the message size doubles in each iteration. In k-to-all broadcast, in the worst case, the message size doubles in each of the first log k iterations, and then remains mk in the remaining log(p/k) iterations. The total communication time of the first log k iterations is ts log k + tw m(k − 1). The total communication time of the last log(p/k) iterations is (ts + tw mk) log(p/k). Thus, the entire operation is performed in ts log p + tw m(k log(p/k) + k − 1) time.
21 As the base case of induction, it is easy to see that the statement is true for a 2-processor hypercube. Let all the p data paths be congestion-free in a p-processor hypercube for all q < p. In a 2p-processor hypercube, if q-shifts for all q < p are congestion-free, then q-shifts for all q < 2p are also congestion-free (Hint (1)). So the proof is complete if we show that q-shifts for all q < p are congestion-free in a 2p-processor hypercube.
Consider a circular q-shift for any q < p on a 2p-processor hypercube. All the p − q data paths leading from a processor i to a processor j such that i < j < p are the same as in a p-processor hypercube, and hence, by the induction hypothesis, do not conflict with each other. The remaining q data paths leading from a processor i to a processor j on the p-processor hypercube, such that j < i < p, lead to processor j + p on the 2p-processor hypercube. Processor j + p is connected to processor j by a single link in the highest (that is, (log p + 1)th) dimension of the 2p-processor hypercube. Thus, following E-cube routing, the data path from processor i to processor j + p in a circular q-shift on the 2p-processor hypercube is the data path from processor i to processor j in a circular q-shift on a p-processor hypercube appended by a single link. The original path from processor i to j is congestion-free (induction hypothesis) and the last link is not shared by any other message because it is unique for each message. Thus, q-shifts for all q < p are congestion-free in a 2p-processor hypercube.
22 In a 1-shift on a 2-processor hypercube, the highest integer j such that 1 is divisible by 2^j is 0. Since log 2 = 1, the statement is true for p = 2 (because q = 1 is the only possible value of q for p = 2). Let the statement hold for a p-processor hypercube. As shown in the solution to Problem 4.22 (page 193), the length of a path between processors i and j, such that j < i < p, increases by a single link if the same shift operation is performed on a 2p-processor hypercube. Thus, the length of the longest path increases from log p − γ(q) to log p − γ(q) + 1 (which is equal to log(2p) − γ(q)) as the number of processors is increased from p to 2p.
Note that a circular q-shift is equivalent to a (2p − q)-shift on a 2p-processor hypercube. So the result holds for all q such that 0 < q < 2p.
24 Cost of network proportional to the total number of links:

T_one-to-all personalized = 2ts(√p − 1) + 4tw m(p − 1)/log p

T_all-to-all personalized = 2ts(√p − 1) + 4tw m √p(√p − 1)/log p
If the cost of a network is proportional to the total number of links in it, then a mesh is asymptotically more
cost-effective than a hypercube for all operations except all-to-all personalized communication.
If the cost of a network is proportional to its bisection width, then a mesh is asymptotically more cost-effective
than a hypercube for all operations except all-to-all personalized communication. For all-to-all personalized
communication, both interconnection networks have the same asymptotic cost-performance characteristics.
25 Assuming the one-port communication model, one-to-all broadcast, all-to-all broadcast, and one-to-all person-
alized communication take the same amount of time on a completely connected network as on a hypercube.
All-to-all personalized communication can be performed in (ts + tw m)(p − 1) time on a completely connected
network.
CHAPTER 5
Analytical Modeling of Parallel Programs
1
S = W/TP = W/(WS + (W − WS)/p)

As p increases, the fraction (W − WS)/p approaches zero. However, no matter how large p is, S cannot exceed W/WS.
2 On a single processor, 11 arcs are traversed by DFS before the solution is found. On two processors, only four
arcs are traversed by P1 before it finds the solution. Therefore, speedup on two processors is 11/4 = 2.75.
The anomaly is due to the fact that in Figure 5.10 (page 228)(b), the algorithm being executed is not the same
as in Figure 5.10 (page 228)(a). The parallel algorithm performs less overall work than the serial algorithm.
The algorithm in Figure 5.10 (page 228)(b) is not true depth-first search; it is part depth-first and part breadth-first. If a single processor alternates between the nodes of the left and right subtrees (thereby emulating the two-processor algorithm of Figure 5.10 (page 228)(b)), then the serial algorithm traverses only eight arcs.
3 The degree of concurrency, the maximum speedup, and the speedup when p = (1/2) × (degree of concurrency) for the graphs are given in the following table:
The efficiency corresponding to various speedups can be computed by using E = S/p, and the overhead function can be computed by using the following relations:

E = W/(p TP)
p TP = W/E
To = p TP − W = W(1/E − 1)
Figure 5.1 Standard (Plot (a)), scaled (Plot (b)), and isoefficient (Plot (c)) speedup curves for Problems 5.5 (page 230) and 5.6 (page 230).
4 Let the isoefficiency function f(p) be greater than Θ(p). If p = Θ(W), or W = Θ(p), then W < f(p) because f(p) > Θ(p). Now if W < f(p), then E < Θ(1), or p TP > Θ(W).
The converse implies that p TP = Θ(W) only if p < Θ(W), or E = Θ(1) only if p < Θ(W); that is, W > Θ(p). Thus, the isoefficiency function is greater than Θ(p).
5 Plots (a) and (b) in Figure 5.1 represent the standard and scaled speedup curves, respectively.
6 See Plot (c) in Figure 5.1.
7 Figure 5.2 plots the efficiency curves corresponding to the speedup curves in Figure 5.1.
8 Scaled speedup, in general, does not solve the problem of declining efficiency as the number of processors is
increased. The scaled speedup curve is linear only if the isoefficiency function of a parallel system is linear
(that is, Θ(p)).
9
512 ≥ TP = n/p − 1 + 11 log p
(513 − 11 log p) p ≥ n
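The bound can be tabulated for a few processor counts. The sketch below is illustrative only (not part of the original solution); the range of p is an arbitrary choice.

/* Largest n with n/p - 1 + 11 log p <= 512, i.e., n <= (513 - 11 log p) p. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    for (long p = 2; p <= (1L << 16); p *= 4) {
        double slack = 513.0 - 11.0 * log2((double)p);
        long n_max = slack > 0 ? (long)(slack * (double)p) : 0;
        printf("p = %6ld   largest n = %ld\n", p, n_max);
    }
    return 0;
}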
It is not possible to solve an arbitrarily large problem in a fixed amount of time, even if an unlimited number
of processors are available. For any parallel system with an isoefficiency function greater than Θ(p) (such as
the one in Problem 5.5 (page 230)), a plot between p and the size of the largest problem that can be solved in
a given time using p processors will reach a maximum.
It can be shown that for cost-optimal algorithms, the problem size can be increased linearly with the number of processors while maintaining a fixed execution time if and only if the isoefficiency function is Θ(p). The proof is as follows. Let C(W) be the degree of concurrency of the algorithm. As p is increased, W has to be increased at least as Θ(p), or else p will eventually exceed C(W). Note that C(W) is upper-bounded by Θ(W) and p is upper-bounded by C(W). TP is given by W/p + To(W, p)/p. Now consider the following
Figure 5.2 Efficiency curves (Plots (a), (b), and (c)) corresponding to the speedup curves in Figure 5.1.
two cases. In the first case, C(W) is smaller than Θ(W). In this case, even if as many as C(W) processors are used, the term W/C(W) of the expression for TP will diverge with increasing W, and hence, it is not possible to continue to increase the problem size and maintain a fixed parallel execution time. At the same time, the overall isoefficiency function grows faster than Θ(p) because the isoefficiency due to concurrency exceeds Θ(p). In the second case, in which C(W) = Θ(W), as many as Θ(W) processors can be used. If Θ(W) processors are used, then the first term in TP can be maintained at a constant value irrespective of W. The second term in TP will remain constant if and only if To(W, p)/p remains constant when p = Θ(W) (in other words, To/W remains constant while p and W are of the same order). This condition is necessary and sufficient for linear isoefficiency.
For further details, see the paper by Kumar and Gupta [KG91].
11 See solution to Problem 4.10 (page 191).
TP = 2n/p − 1 + 10 log p

S = (n − 1)/(2n/p − 1 + 10 log p)

Isoefficiency function = Θ(p log p)
12 The algorithm is a single-node accumulation in which the associative operation is a comparison that chooses
the greater of the two numbers. The algorithm is not cost-optimal for finding the maximum of n numbers on
an n-processor mesh.
13 Refer to the paper by Gupta and Kumar [GK93].
14 Equation 5.21, in this case, leads to the following:

TP = (n log n)/p + (10 n log p)/p

dTP/dp = −(n log n)/p^2 − (10 n log p)/p^2 + 10n/p^2 = 0
10 log p = 10 − log n
log p = 1 − (log n)/10
p = 2/n^(1/10)
This is actually a condition for a maximum, rather than a minimum (this can be verified by taking a second derivative). Since To < Θ(p) in this case, the minimum parallel run time is obtained when as many processors as possible are used to solve the problem [GK93]. Hence, the actual value of TPmin should be derived by putting p = n in the expression for TP. This leads to TPmin = 11 log n.
15 Refer to the paper by Gupta and Kumar [GK93].
CHAPTER 6
Programming Using the Message-Passing Paradigm
CHAPTER 7
Programming Shared Address Space Platforms
Since all of the questions in this chapter are implementation based and implementations (and their performance) differ
markedly, we only give outlines of solutions.
1 To time thread creation, create a large number of threads in a for loop and time this loop.
To time thread join, time the corresponding join functions executed in a for loop (a sketch of these two measurements appears after this list).
To time successful locks, create a large array of locks and lock all of these in a for loop.
To time successful unlocks, time corresponding unlocks in a for loop.
To time successful trylock, repeat the process above for successful locks.
To time unsuccessful trylocks, first lock a mutex and repeatedly try to lock it using trylock in a for loop.
To time a condition wait, set up two threads, one that waits and another that signals. Execute a large
number of wait-signal pairs. The smallest of these times, with high probability, gives the condition-wait
time.
To time a condition signal, time the function with various numbers of waiting threads.
The condition broadcast case is similar to the condition signal case, above.
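The first two items, for example, can be implemented as follows. This is a minimal sketch (not from the text); the thread count, the no-op worker function, and the use of clock_gettime are illustrative assumptions.

/* Time pthread_create and pthread_join over many threads and report the
   average per-thread cost. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 1000

static void *worker(void *arg) { return arg; }

int main(void)
{
    pthread_t tid[NTHREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double create = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double join = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    printf("create: %.2f us/thread, join: %.2f us/thread\n",
           create * 1e6 / NTHREADS, join * 1e6 / NTHREADS);
    return 0;
}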
2 The queue is implemented as a heap. The heap is protected by a single lock. Each insertion/extraction is
protected by a lock associated with the heap. At 64 threads, in most thread systems, students should observe
considerable contention.
3 The condition wait case is similar to the one above, except, instead of locking, a thread executes a condition
wait and instead of unlocking, a thread does a condition signal. The performance of this formulation should
be marginally better but will saturate as well.
4 This is a simple extension of a producer-consumer paradigm in which the network thread is the producer for
the decompressor thread, which in turn is the producer for the render thread. The code can be implemented
starting from the code in Example 7.6.
7 The read-write locks are illustrated in Section 7.8.1. These functions can be easily adapted to the problem of binary tree search. Only insertions require write locks. All other threads only need read locks.
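A sketch of this adaptation is shown below. It is illustrative only and uses the POSIX pthread_rwlock_t type rather than the read-write lock developed in Section 7.8.1; the tree layout and function names are assumptions.

/* Readers search the tree under a shared lock; insertions take the write lock. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct node { int key; struct node *left, *right; } node_t;

static node_t *root = NULL;
static pthread_rwlock_t tree_lock = PTHREAD_RWLOCK_INITIALIZER;

int tree_contains(int key)
{
    pthread_rwlock_rdlock(&tree_lock);       /* many readers may hold this */
    const node_t *n = root;
    while (n && n->key != key)
        n = key < n->key ? n->left : n->right;
    int found = (n != NULL);
    pthread_rwlock_unlock(&tree_lock);
    return found;
}

void tree_insert(node_t *fresh)
{
    pthread_rwlock_wrlock(&tree_lock);       /* writers get exclusive access */
    node_t **link = &root;
    while (*link)
        link = fresh->key < (*link)->key ? &(*link)->left : &(*link)->right;
    fresh->left = fresh->right = NULL;
    *link = fresh;
    pthread_rwlock_unlock(&tree_lock);
}

int main(void)
{
    int keys[] = { 5, 2, 8 };
    for (int i = 0; i < 3; i++) {
        node_t *n = malloc(sizeof *n);
        n->key = keys[i];
        tree_insert(n);
    }
    printf("contains(8) = %d, contains(3) = %d\n",
           tree_contains(8), tree_contains(3));
    return 0;
}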
8 The performance saturates as the number of threads increases because of contention on the global lock.
9 The saturation point is pushed back because of the use of read-write locks since reads can be concurrently
executed.
10 There are several strategies for implementing a Sieve. As the numbers pass through, the first number installs
itself as a thread. The thread is responsible for filtering all multiples of the associated number. The key
programming task is to ensure coordination between threads eliminating different multiples (i.e., the thread
eliminating multiples of 3 must be able to filter a block before the thread responsible for eliminating multiples
of 5).
An alternate strategy is to determine all the primes whose multiples need to be eliminated first and then to instantiate them as threads. This algorithm has a significant serial component and does extra work.
11 The implementation described saturates because of contention on the open list.
12 The saturation point is pushed back (depending on the choice of k). Also, as the number of heaps is increased,
excess work increases, causing a reduction in the speedup.
14 When the function becomes computationally expensive, all of the formulations can be expected to perform
similarly.
16 The implementation is very similar to the pthreads version, except the task of creation and joining is accom-
plished using sections, and explicit locks can be used in OpenMP, much the same way as in pthreads.
CHAPTER 8
Dense Matrix Algorithms
1 The communication time of the algorithm for all-to-all personalized communication on a hypercube with
cut-through routing (Section 4.4 (page 167)) is
With the given parameters, the communication time is 22428. Thus the algorithm for cut-through routing is
faster for the given parameters. For a different set of parameters, the algorithm for store-and-forward routing
may be faster.
4 Let us start from the central relation that must be satisfied in order to maintain a fixed efficiency:

W = K To
n^2 = K(ts p log p + tw n √p log p)
0 = n^2 − n(K tw √p log p) − K ts p log p
n = (K tw √p log p + √(K^2 tw^2 p (log p)^2 + 4K ts p log p))/2
n^2 = (K^2 tw^2 p (log p)^2)/2 + K ts p log p + (K tw p log p √(K^2 tw^2 (log p)^2 + 4K ts log p))/2

From the preceding equation, it follows that if K tw^2 (log p)^2 is greater than 4 ts log p (which is most likely to be the case for practical values of K, tw, and p), then the overall isoefficiency function is Θ(p (log p)^2) due to the tw term of To.
5 The communication time remains 2(ts log p + tw n^2/√p). The algorithm performs √p matrix multiplications with submatrices of size (n/√p) × (n/√p). Each of these multiplications takes Θ(n^2.81/p^1.41) time. The total parallel run time is Θ(n^2.81/p^0.91) + 2(ts log p + tw n^2/√p). The processor-time product is Θ(n^2.81 p^0.09) + 2(p ts log p + tw n^2 √p). The Θ(n^2.81 p^0.09) term of the processor-time product is greater than the Θ(n^2.81) sequential complexity of the algorithm. Hence, the parallel algorithm is not cost-optimal with respect to a serial implementation of Strassen's algorithm.
6 This variant of the DNS algorithm works with n^2 q processors, where 1 < q < n. The algorithm is similar to that for one element per processor except that a logical processor array of q^3 superprocessors (instead of n^3 processors) is used, each superprocessor containing (n/q)^2 hypercube processors. In the second step,
multiplication of blocks of (n/q) × (n/q) elements instead of individual elements is performed. Assume that this multiplication of (n/q) × (n/q) blocks is performed by using Cannon's algorithm for one element per processor. This step requires a communication time of 2(ts + tw)n/q.
In the first stage of the algorithm, each data element is broadcast over q processors. In order to place the elements of matrix A in their respective positions, first A[j, k] stored at processor P(0,j,k) is sent to processor P(k,j,k) in log q steps. This data is then broadcast from processor P(k,j,k) to processors P(k,j,l), 0 ≤ l < q, again in log q steps. By following a similar procedure, the elements of matrix B are transmitted to their respective processors. In the second stage, groups of (n/q)^2 processors multiply blocks of (n/q) × (n/q) elements, each processor performing n/q computations and 2n/q communications. In the final step, the elements of matrix C are restored to their designated processors in log q steps. The total communication time is (ts + tw)(5 log q + 2n/q), resulting in the parallel run time given by the following equation:

TP = n^3/p + (ts + tw)(5 log(p/n^2) + 2n^3/p)
The communication time of this variant of the DNS algorithm depends on the choice of the parallel matrix
multiplication algorithm used to multiply the (n/q) (n/q) submatrices. The communication time can be
reduced if the simple algorithm is used instead of Cannon's algorithm.
7 Let the division operation on the ith row (0 ≤ i < n) be performed during time step ti. Then in time step ti + 1, the active part of the ith row after division is sent to the processor storing the (i + 1)st row. In time step ti + 2, the processor storing the (i + 1)st row sends this data to the processor storing the (i + 2)nd row. In time step ti + 3, the elimination operation is performed on the (i + 1)st row, and in time step ti + 4, the division operation on the (i + 1)st row. Thus, t(i+1) = ti + 4. In other words, four time steps separate the division operations on any two consecutive rows. Hence, in 4n steps, all the n division operations can be performed and Gaussian elimination can be completed on an n × n matrix. However, the communication operations of time steps t(n−2) + 3, t(n−1) + 1, t(n−1) + 2, and t(n−1) + 3 are not performed. Thus, the actual number of steps is 4(n − 1).
8 See papers by Geist and Romine [GR88] and Demmel et al. [DHvdV93] for solutions to Problems 8.8
(page 373), 8.9 (page 374), and 8.10 (page 374).
11 The communication time of each one-to-all broadcast is Θ(√n). Therefore, the total time spent in performing these broadcasts in all the iterations is Θ(n√n). Since the number of processors is n^2, the communication time contributes Θ(n^3 √n) to the processor-time product, which is greater than the Θ(n^3) complexity of serial Gaussian elimination.
12 As shown in Section 8.3.1 (page 353), disregarding the communication time, the parallel run time of Gaussian elimination with block-striped and block-checkerboard partitioning is n^3/p and 2n^3/p, respectively. Since the sequential run time is 2n^3/3, the total overhead due to processor idling in the two cases is n^3/3 and 4n^3/3, respectively.
With cyclic-striped mapping, the difference between the maximally loaded processor and the minimally loaded processor in any iteration is, at most, that of one row. Thus, the overhead due to idle time per iteration is O(np), and the overhead due to idle time over all the iterations is O(n^2 p).
With cyclic-checkerboard mapping, the difference between the maximally loaded processor and the minimally loaded processor in any iteration is, at most, that of one row and one column. The maximum overhead due to idle time per iteration is O(p × (n^2/p − (n/√p − 1)^2)), which is O(n√p). The overhead due to idle time over all the iterations is O(n^2 √p).
13 As shown in Section 8.3.1 (page 353), the parallel run time of the asynchronous (pipelined) version of Gaussian elimination with checkerboard partitioning is Θ(n^3/p). This implies that for all values of n and p, the order of magnitude of the communication time is either the same as or less than that of the time spent in useful computation. The reason is that the sequential complexity of Gaussian elimination is Θ(n^3), and the parallel run time on p processors has to be at least Θ(n^3/p). Hence, the overall isoefficiency function is the same as
that due to concurrency. Since p ≤ n^2, W = n^3 = Θ(p^(3/2)) is the isoefficiency function due to concurrency.
16 The communication pattern (and hence, the parallel run time) of pipelined Cholesky factorization with checkerboard partitioning is very similar to that of pipelined Gaussian elimination without pivoting. Figure 8.1 illustrates the first 16 steps of Cholesky factorization on a 5 × 5 matrix on 25 processors. A comparison of Figures 8.11 (page 364) and 8.1 shows that the communication and computation steps in the upper triangular halves of the coefficient matrices are the same in both cases. For simplicity we remove the step of line 5 of Program 8.6 (page 375) and replace line 7 by A[k, j] := A[k, j]/A[k, k]. The original A[k, k] can be replaced by √A[k, k] later, whenever Pk,k is free, and this step is not shown in Figure 8.1.
The actions in the first eight steps of Figure 8.1 are as follows:
(a) A[0, 0] sent to P0,1 from P0,0 (iteration k = 0 starts).
(b) A[0, 1] := A[0, 1]/A[0, 0].
(c) A[0, 0] sent to P0,2 from P0,1;
A[0, 1] (after division) sent to P1,1 from P0,1.
(d) A[0, 2] := A[0, 2]/A[0, 0];
A[1, 1] := A[1, 1] − A[0, 1] × A[0, 1].
(e) A[0, 0] sent to P0,3 from P0,2;
A[0, 2] (after division) sent to P1,2 from P0,2;
A[0, 1] sent to P1,2 from P1,1.
(f) A[0, 3] := A[0, 3]/A[0, 0];
A[1, 2] := A[1, 2] − A[0, 1] × A[0, 2].
(g) A[1, 1] sent to P1,2 from P1,1 (iteration k = 1 starts);
A[0, 0] sent to P0,4 from P0,3;
A[0, 3] (after division) sent to P1,3 from P0,3;
A[0, 1] sent to P1,3 from P1,2;
A[0, 2] sent to P2,2 from P1,2.
(h) A[1, 2] := A[1, 2]/A[1, 1];
A[0, 4] := A[0, 4]/A[0, 0];
A[1, 3] := A[1, 3] − A[0, 1] × A[0, 3];
A[2, 2] := A[2, 2] − A[0, 2] × A[0, 2].
17 Plots (a) and (b) in Figure 8.2 represent the standard and scaled speedup curves, respectively.
18 See Plot (c) in Figure 8.2.
19 Figure 8.3 plots the efficiency curves corresponding to the speedup curves in Figure 8.2.
20 Scaled speedup, in general, does not solve the problem of declining efficiency as the number of processors is
increased. Only if the isoefficiency function of a parallel system is linear (that is, Θ(p)) will the efficiency corresponding to the scaled speedup curve be constant.
21 The following table gives the largest n corresponding to a p such that two n × n matrices can be multiplied in
time less than or equal to 512 units. However, it should be noted that these values are based on the expression
for TP given by Equation 8.14 (page 347). Equation 8.14 (page 347) gives an expression for the parallel
run time when the matrices are partitioned uniformly among the processors. For the values of n given in the
following table, it may not be possible to partition the matrices uniformly.
It is not possible to solve an arbitrarily large problem in a fixed amount of time, even if an unlimited number
of processors are available. For any parallel system with an isoefficiency function greater than Θ(p) (such as
the one in Problem 8.17 (page 375)), a plot between p and the size of the largest problem that can be solved
in a given time using p processors will reach a maximum.
See the solution to Problem 5.9 (page 231) for more details.
23 Steps 3 and 6 of Program 8.7 (page 376) involve the initialization of the resultant matrix and can be performed in constant time. Step 8 requires processor (i, j) to read the values of elements A[i, k] and B[k, j] for k = 0 to n − 1. A CREW PRAM allows these read operations to be performed concurrently. There are three memory read operations, one memory write operation, and one multiply-and-add operation. Therefore, the total time for step 8 is 4tlocal + tc. For n iterations of this loop, the total time taken is n(4tlocal + tc). The formulation is cost-optimal because the processor-time product of Θ(n^3) is equal to the serial runtime of the algorithm.
24 In the absence of the concurrent read operation, memory reads in step 8 will be serialized. During iteration 1 (k = 0), a processor (i, j) accesses values C[i, j], A[i, 0], and B[0, j]. There is no conflict in accessing C[i, j]. However, there are n processors trying to access a single element of A and n processors trying to access a single element of B. Similarly, in any iteration, there are n accesses to any memory location. Serializing these accesses, we have a single memory read and write operation (for C[i, j]), n read operations for A[i, k], and n read operations for B[k, j]. Therefore, the time taken for this step is 2tlocal + 2ntlocal + tc. For n iterations of the loop, the total time taken is n(2tlocal + 2ntlocal + tc).
The algorithm is not cost-optimal because the processor-time product is Θ(n^4), which is higher than the serial runtime of Θ(n^3).
25 Let us consider the time taken in step 8 of Program 8.7 (page 376). Elements C[i, j], A[i, j], and B[i, j] are available locally. The remaining values have to be fetched from other processors. We assume that at any point of time, only one processor can access a processor's memory. In any iteration, there are n − 1 non-local accesses to each element of matrices A and B, and one local reference. This takes time 2(tlocal + (n − 1)tnonlocal). Furthermore, 2tlocal time is taken for reading and writing the values of C[i, j]. Therefore, step 8 takes time tc + 2tlocal + 2(tlocal + (n − 1)tnonlocal). For n iterations, the total time is n(tc + 2tlocal + 2(tlocal + (n − 1)tnonlocal)), which is equal to ntc + 4ntlocal + 2n(n − 1)tnonlocal.
26 (a) Access times described in Problem 5.31: In Program 8.8 (page 377), step 8 staggers access to elements of matrices A and B such that there are no concurrent read operations. Therefore, step 8 can be performed in time 4tlocal + tc. For n iterations, the total time taken is n(4tlocal + tc). The formulation is cost-optimal because the processor-time product of Θ(n^3) is equal to the serial runtime of the algorithm.
(b) Access times described in Problem 5.32: During any iteration of step 8, one processor is accessing an element of A locally and the remaining processors are accessing n − 1 different elements of A non-locally. This is also the case with matrix B. Therefore, the time taken for executing step 8 is tc + 2tlocal + 2tnonlocal. The formulation is still cost-optimal because the processor-time product of Θ(n^3) is equal to the serial runtime of the algorithm.
27 Consider the case in which p processors are organized in a logical mesh of √p × √p. Each processor is assigned a block of (n/√p) × (n/√p) elements. In the kth step, the processor responsible for computing C[i, j] reads the values of A[i, (i + j + k) mod n] and B[(i + j + k) mod n, j] and keeps a running sum C[i, j] = C[i, j] + A[i, (i + j + k) mod n] × B[(i + j + k) mod n, j]. The processor then computes the values of the other product matrix elements assigned to it. To compute each value of C, the processor reads 2(n − n/√p) values from non-local memories and 2n/√p values locally. The time taken for computing each value of the resulting matrix is ntc + 2tlocal + 2tlocal n/√p + 2tnonlocal (n − n/√p). The parallel runtime of this formulation is
TP = (n^3/p) tc + 2(n − n/√p)(n^2/p) tnonlocal + 2(n/√p)(n^2/p) tlocal.
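The staggered access pattern in the kth step can be illustrated with a short serial loop nest; this sketch is not from the text, and the matrix contents are arbitrary. For a fixed k, no two processors in the same row read the same element of A, and no two in the same column read the same element of B.

/* Serial illustration of the staggered index (i + j + k) mod n used above. */
#include <stdio.h>

#define N 4

int main(void)
{
    double A[N][N], B[N][N], C[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + 1;
            B[i][j] = j + 1;
        }

    for (int k = 0; k < N; k++)              /* k-th "time step" */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int s = (i + j + k) % N;     /* staggered index */
                C[i][j] += A[i][s] * B[s][j];
            }

    printf("C[0][0] = %.0f (expected %.0f)\n", C[0][0], 1.0 * N);
    return 0;
}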
28 In this algorithm, each processor first receives all the data it needs to compute its n^2/p elements and then
performs the computation. The total amount of non-local data required by a processor is 2n^2/√p − 2n^2/p. The time taken to compute all n^2/p elements assigned to a processor is

TP = (n^3/p) tc + 2(n^2/p) tlocal + 2(n^2/p) tlocal + (2n^2/√p − 2n^2/p) tnonlocal
   ≈ (n^3/p) tc + 2(n^2/√p) tnonlocal.
The relative performance of the two formulations of Problems 8.27 (page 377) and 8.28 (page 377) becomes
obvious when we compute the speedups obtained from the two. The speedup from the previous formulation
(Problem 5.34) is
S = p/(1 + 2tnonlocal/tc).
From the above expression, the algorithm's maximum speedup is determined by the ratio 2tnonlocal/tc. Typically the time for a non-local memory access tnonlocal is much higher than tc; consequently, the speedup may be poor. The speedup of the latter formulation is
S = p/(1 + (√p/n)(2tnonlocal/tc)).
As the size of the matrices increases for a fixed value of p, the contribution of essential computation grows faster than that of the overhead. It is thus possible to obtain parallel run times close to optimal by solving larger problem instances. Furthermore, the speedup approaches p when n/√p ≫ 2tnonlocal/tc. This algorithm is therefore better than the first.
29 Problems 8.23 (page 376)–8.28 (page 377) illustrate the inadequacy of PRAM models for algorithm design. They fail to model the communication costs of parallel algorithms on practical computers. Consequently, an algorithm designed using the PRAM model may exhibit very poor performance, even on well-connected practical architectures like shared address space computers. In general, for an architecture in which non-local data access is more expensive than local data access, maximizing locality is critical. All practical architectures fall into this category.
Furthermore, these problems also illustrate that shared address space computers should in fact be programmed with a view to maximizing data locality, much the same way message-passing computers are programmed. Ignoring data locality in a shared address space computer leads to poor performance.
Figure 8.1 The first sixteen steps of pipelined Cholesky factorization on a 5 × 5 matrix on 25 processors.
Figure 8.2 Standard (Plot (a)), scaled (Plot (b)), and isoefficient (Plot (c)) speedup curves for Problems 8.17 (page 375) and 8.18 (page 375).
Figure 8.3 Efficiency curves (Plots (a), (b), and (c)) corresponding to the speedup curves in Figure 8.2.
CHAPTER 9
Sorting
1 The correctness of this compare-split operation can be proved using induction on the number of compare-
exchange operations. Processor Pi has a sequence of k elements x = x1, x2, ..., xk sorted in increasing order, and processor Pj has k elements y = y1, y2, ..., yk sorted in decreasing order. During the nth step of the compare-split operation, processors Pi and Pj exchange their nth elements. Processor Pi keeps max{xn, yn} and Pj keeps min{xn, yn}. The inductive proof is as follows:
1. After the first step of the compare-split operation, Pi holds z = max{x1, y1} and Pj holds w = min{x1, y1}. For the operation to be correct, z must be larger than at least k elements of the combined sequences x and y, and w must be smaller than at least k elements of the combined sequences x and y. Since sequence x is sorted in increasing order, we know that x1 is smaller than the remaining k − 1 elements of sequence x. Since w = min{x1, y1}, w is either x1 or y1. If w = x1, then x1 < y1; thus, y1 and x2, x3, ..., xk are the k elements that are larger than w. If w = y1, then the elements of sequence x are the k elements that are larger than w. Therefore w is correctly assigned to processor Pj. Similarly we can show that z is also correctly assigned to processor Pi.
2. Assume that up to step n − 1 of the compare-split operation, elements were correctly assigned to processors Pi and Pj.
3. We need to prove that w = min{xn, yn} and z = max{xn, yn} are assigned to processors Pj and Pi correctly. Again, consider the case where w = xn; w is smaller than the remaining k − n elements of sequence x. Also, w is smaller than yn. However, since sequence y is sorted in decreasing order, yn is smaller than the first n − 1 elements of sequence y. Thus, there are k elements that are larger than w. If w = yn, then w is smaller than the k − n + 1 elements of x (that is, xn, xn+1, ..., xk) and is also smaller than the previous n − 1 elements of y (that is, y1, y2, ..., yn−1). Therefore, in either case, w is correctly assigned to Pj. Similarly, it can be shown that z is assigned to Pi correctly.
As mentioned in the problem, the compare-split operation can terminate as soon as xn ≥ yn. The reason is that xi ≥ yi for all i > n, as the x sequence is sorted in increasing order and the y sequence is sorted in decreasing order. Consequently, if the compare-exchange operations are performed, all elements xi with i > n stay at processor Pi and all elements yi with i > n stay at processor Pj.
Because the compare-exchanges can stop as soon as xn ≥ yn, the run time of this operation varies. If we have to sort the sequences after each compare-split operation, the run time of this formulation will be Θ(k log k). However, at the end of the compare-split operation, processor Pi has two subsequences, one from x and one from y. Furthermore, each of these subsequences is sorted. Therefore, they can be merged in linear time to obtain the final sorted sequence of size k. An advantage of this operation is that it requires no extra memory.
Fox et al. [fox] provide further details of this compare-split operation.
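The early-terminating exchange can be sketched as follows; this is an illustrative fragment (not from the text or [fox]), with the two sorted sequences chosen arbitrarily.

/* Compare-split with early termination: Pi holds x[] (increasing), Pj holds
   y[] (decreasing); the exchange stops once x[n] >= y[n]. */
#include <stdio.h>

#define K 8

int main(void)
{
    int x[K] = { 1, 3, 4, 7, 9, 12, 15, 20 };        /* increasing, on Pi */
    int y[K] = { 18, 16, 13, 10, 8, 6, 2, 0 };       /* decreasing, on Pj */

    for (int n = 0; n < K && x[n] < y[n]; n++) {     /* early termination */
        int tmp = x[n];                              /* Pi keeps the larger, */
        x[n] = y[n];                                 /* Pj keeps the smaller */
        y[n] = tmp;
    }

    /* After the loop, Pi holds the k largest and Pj the k smallest elements,
       each as two sorted runs that can be merged in O(k) time. */
    printf("Pi: ");
    for (int n = 0; n < K; n++) printf("%d ", x[n]);
    printf("\nPj: ");
    for (int n = 0; n < K; n++) printf("%d ", y[n]);
    printf("\n");
    return 0;
}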
2 A relation R on A is
reflexive if for every a ∈ A, (a, a) ∈ R;
antisymmetric if for every a, b ∈ A with (a, b) ∈ R and (b, a) ∈ R, then a = b;
transitive if for every a, b, c ∈ A, whenever (a, b) ∈ R and (b, c) ∈ R, then (a, c) ∈ R.
From the above definitions, the fact that ≤ is a partial ordering can be easily shown.
3 To prove (a), note that there is an element a_i (0 < i < n/2) of sequence s such that for all j < i, min{a_j, a_{n/2+j}} belongs to {a_0, a_1, ..., a_{n/2−1}}, and for all i < j < n/2, min{a_j, a_{n/2+j}} belongs to {a_{n/2}, a_{n/2+1}, ..., a_{n−1}}. Similarly, for all j < i, max{a_j, a_{n/2+j}} belongs to {a_{n/2}, a_{n/2+1}, ..., a_{n−1}}, and for all n/2 > j > i, max{a_j, a_{n/2+j}} belongs to {a_0, a_1, ..., a_{n/2−1}}. Therefore,
s1 = {a_0, a_1, ..., a_i, a_{n/2+i+1}, ..., a_{n−1}}
and
s2 = {a_{n/2}, a_{n/2+1}, ..., a_{n/2+i}, a_{i+1}, ..., a_{n/2−1}}.
Note that both s1 and s2 are bitonic sequences.
Parts (b) and (c) can be proved in a similar way. For a more detailed proof of the correctness of the bitonic split operation, refer to Knuth [Knu73], Jaja [Jaj92], Quinn [Qui87], or Batcher [Bat68].
5 From Figure 9.10 (page 389) we see that bitonic merge operations of sequences of size 2k take place during the
kth stage of the bitonic sort. The elements of each bitonic sequence merged during the kth stage correspond to
wires whose binary representation have the same (log n k) most significant bits (where n is the number of
element to be sorted). Because each wire is mapped to a hypercube processor whose label is the same as that
of the wires, each sequence is mapped onto a k-dimensional subcube. Since the most significant (log n k)
bits of each bitonic sequence of length 2k are different, these sequences are mapped onto distinct subcubes.
7 Let n be the number of processors such that d = log n.
(a) From Figure 9.11 (page 391) we can see that the communication requirements are shown in Table 9.1.
Let A be the communication required during the first d/2 iterations. Then
d/21 i
d/21 1
A= 2j = 22i 1 = 2 n 2 log n.
2
i =0 j=0 i =0
Let B be the communication required during the first d/2 steps of the last d/2 iterations. Then
d i
d/21
1 1 1
B= 2 = log n( n 1) = n log n log n.
2 2 2 2
i =0
Chapter 9 35
1st iteration 1
2nd iteration 1 + 2
..
.
(d/2)th iteration 1 + 2 + + 2d/21
(d/2 + 1)st iteration 1 + 2 + + 2d/21 + 1+ n
(d/2 + 2)nd iteration 1 + 2 + + 2d/21 + 1+ n + 2
(d/2 + 3)rd iteration 1 + 2 + + 2d/21 + 1+ n + 2 + 4+ n
..
.
dth iteration 1 + 2 + + 2d/21 + 1+ n + 2 + 4+ n + + 2d/21
The total communication performed by the row-major formulation of the bitonic sort is
1 3
B + 2A = n log n + 4 n log n 4.
2 2
(b) From Figure 9.11 (page 391) we can see that the communication requirements are shown in Table 9.2.
Note that the communication requirements for the first d/2 iterations are similar to the row-major map-
ping. However, during the last d/2 iterations, the communication requirements of snakelike mapping is
higher than that of the row-major mapping. The additional communication is approximately
d/4
n log2 n
n2 i .
16
i =1
Now note that in Table 9.3, column two differs from column one by 1, column four differs from column
two by 2 and so on. Therefore, the sum of the even numbered columns is equal to that of the odd
36 Sorting
Therefore, row-major shuffled mapping requires less communication than the other two mappings. However,
Nassimi and Sahni [NS79] have developed variants of bitonic sorting for the row-major and the snakelike
mapping that have the same communication requirement as the row-major shuffled mapping.
8 The solution is presented by Knuth [Knu73] (merging network theorem).
9 In a ring-connected parallel computer, input wires are mapped onto processors in a way similar to the hyper-
cube mapping. The wire with label a is mapped onto processor with label a.
The total amount of communication performed by the processors is:
log n1
j
2i = 2n log n 2 2n
j=0 i =0
The parallel run time, speedup, and efficiency when p processors are used to sort n elements are:
n n n
TP = log + log p + (n)
2
p p p
(n log n)
S =
p log p + np log2 p + (n)
n n
1
E =
log p log2 p p
1 log n + log n + log n
For this formulation to be cost optimal p/ log n = O(1). Thus, this formulation can efficiently utilize up to
p = (log n) processors. The isoefficiency function is ( p2 p ).
10 This proof is similar to that presented in Solution 8.
11 Recall that in the hypercube formulation of shellsort, each processor performs compare-split operations with
each of its neighbors. In a d-dimensional hypercube, each processors has d neighbors. One way of applying
the spirit of shellsort to a mesh-connected parallel computer is for each processor to perform compare-split
operations with each of its four neighboring processors. The power of shellsort lies in the fact that during the
first few iterations, elements move significant closer to their final destination. However, this is not the case
with the mesh formulation of shellsort. In this formulation elements move at most p processors.
An alternate way of performing shellsort on a mesh-connected computer is shown in Figure 9.1. The mesh
is divided in to equal halves along the y-dimension, and corresponding processors compare-exchange their
elements. Then, the mesh is halved and the same is repeated for the smaller meshes. This is applied recursively
until communication takes place between adjacent mesh rows. A similar sequence of compare-exchanges is
performed along the x-dimension. Note that the processors paired-up for compare-split are similar to those
paired-up in the hypercube formulation of shellsort.
12 The sequential shellsort algorithm consists of a number of iterations. For each iteration i there is an associated
distance di . The n elements to be sorted are divided into groups, such that each group consists of elements
whose indices are multiples of di . These groups are then sorted. In each iteration di decreases. During the last
step, di = 1, and the algorithms sorts the entire list. Shellsort can be parallelized by performing each iteration
in parallel. In order to take advantage of the fact that as the sort progresses sequences tend to become almost
sorted, a sorting algorithm such as the odd-even transposition must be used.
Chapter 9 37
13 [fox] The worst case can be summarized as follows. Let n be the number of elements to be sorted on a d-
dimensional hypercube. Each processors is assigned n/2d = m elements. Now if more than m of the largest
2m elements occupy a particular pair of processors in the first compare-split step, some of the items will be
put in the wrong subcube in this first step. The subsequent (d 1) stages will just rearrange the items in
each (d 1)-dimensional subcube, but these elements will stay in the wrong subcube. Consequently, in the
final odd-even transposition sort, these items must travel a long distance to reach the top of the list, requiring
2d1 1 steps. The probability that such a collection of items will occur, for random initial lists, is given by:
m1
1 (2m)!
2d1 m!m!
This probability becomes small rapidly as m increases.
14 The CREW formulation of quicksort that assigns each subproblem to a different processor can be extended
for a p processor system by stopping the recursive subdivision as soon as p subproblems have been created.
At this point, each of the p processors internally sorts one of the p subsequences. At the end of this step, the
n elements are sorted.
If we assume that the pivot selection leads to perfectly balanced splits, then, if n = 2d and p = 2k , the parallel
run time is:
n n
k1
n n n 1
log + i
= log + 2n(1 ).
p p 2 p p p
i =0
The efficiency of this formulation is:
n log n
E =
n log n n log p + 2np 2n
1
=
1 log p/ log n + 2 p/ log n 2/ log p
From this equation, we see that for a cost efficient formulation p = O(log n). Therefore, this formulation
cannot use more that O(log n) processors. The isoefficiency function is ( p2 p ).
15 The algorithm is described by Chlebus and Vrto [CV91].
16 The expected height of the tree is less than or equal to 3 log n + 6. The analysis is discussed by Chlebus and
Vrto [CV91].
17 The solution is provided in the first edition of this book.
18 The proof that the hypercube formulation of quicksort correctly sorts n elements on p processors, is as follows.
After the first split, p/2 processors will end up having elements greater than the pivot while the remaining
p/2 processors have elements smaller than the pivot. In particular, the processors whose processor label has
38 Sorting
a zero at the most significant bit get the smaller, while the other processors get the larger elements. Thus, the
first split, puts the smaller elements in the right half-cube.
Assume that after k splits, the elements are sorted according to the k most significant bits of the processors
labels. That is, if we treat each subcube of size 2dk as a single processor, and the elements assigned to this
subcube, as a block of elements, then these blocks are sorted. Now, during the (k + 1)st split each block is
partitioned into two smaller blocks. In each subcube of size 2dk , each of these two smaller blocks is assigned
to one of the two 2dk1 half-cubes contained into a 2dk subcube. This assignment is such that the block with
the smaller elements is assigned to the half-cube that has a zero at the (d k 1)st bit, while the block with
larger elements is assigned to the half-cube that has a one. Therefore, the label of the half-cube determines the
ordering. Furthermore, since this is a local rearrangement of the elements in each 2dk subcube, the global
ordering has not changed.
20 This pivot selection scheme was proposed by Fox et al., [fox]. The performance of the algorithm depends on
the sample size l; the larger the sample the better the pivot selection. However, if the sample is large, then
the algorithm spends a significant amount of time in selecting the pivots. Since l cannot be too large, a good
choice seems to be l = (log n). Furthermore, if the distribution of elements on each processor is fairly
uniform, this pivot selection scheme leads to an efficient algorithm.
21 This problem was motivated by the (n) worst case selection algorithm is described in [CLR89]. Following
a proof similar to that in [CLR89] it can be shown that this scheme selects a pivot element such that there are
(n) elements smaller than the pivot, and (n) elements larger than the pivot.
Initially, the sequences stored in each processor are sorted. For each processor pair, computing the median
of the combined sequences takes (log n) time. Sorting 2i 1 medians can be done using bitonic sort, which
takes (i 2 ) time. Finally, the median of the median is broadcasted in (i) time.
22 The barrier synchronization is not necessary. Removing it does not affect neither the correctness nor the
asymptotic performance.
23 The parallel run time of the hypercube formulation of quicksort if we include the tc , ts , and tw costs is approx-
imately
n n n
TP = log tc + tw + ts log p + log2 p(ts + tw )
p p p
The efficiency is:
n log ntc
E =
n log ntc n log ptc + (ntw + pts ) log p + p log2 p(ts + tw )
1
log p log p tw p log p ts p log2 p ts + tw
= 1 + + + (9.1)
log n log n tc n log n tc n log n tc
If K = E/(1E), then from Equation 9.1 we have that the isoefficiency function due to the (log p/ log n)(tw /tc )
term is
log p tw 1 tw tw
= log n = K log p n = 2(K tw /tc ) log p = p K tw /tc n log n = K p K tw /tc log p.
log n tc K tc tc
The isoefficiency function due to the ( p log p)/(n log n)(ts /tc ) term is K (ts /tc ) p log p, due to the ( p log2 p)/(n log n)(ts /tc )
term is K (ts /tc ) p log2 p, and due to the ( p log2 p)/(n log n)(tw /tc ) term is K (tw /tc ) p log2 p. Therefore, the
most significant isoefficiency functions are:
tw tw ts
K p K tw /tc log p, K p log2 p, and K p log2 p.
tc tc tc
Note that the asymptotic complexity of the first term depends on the value of K and the tw /tc ratio.
(a) tc = 1, tw = 1, ts = 1
Chapter 9 39
The next free processor is specified along with the number of elements that can be added to the proces-
sors. This takes ( p) time.
(d) Elements are moved in large messages. Notice that a message from one processor may have to be split
across two destination processors. Then the portion for the other destination processor can be sent in one
hop. This takes (n/ p) time.
Since we assume perfect pivot selection, the algorithm requires (log p) iterations, the parallel run time is:
n n n n
TP = log + ( p) + + (log p)
p p p p
p log p p log
The isoefficiency function of the algorithm is (2 p).
26 The CRCW PRAM algorithm can be easily adapted to other architectures by emulation.
CREW PRAM The main difference between a CREW and a CRCW PRAM architecture is the ability to
perform a concurrent write. However, a concurrent write operation performed by n processors can be
emulated in a CREW PRAM in (log n) time. Therefore, the parallel run time of the enumeration sort
is (log n).
EREW PRAM In this model, in addition to requiring (log n) time to emulate a concurrent write operation,
it needs a (log n) time to perform a concurrent read operation. Therefore, the parallel run time of the
enumeration sort is (log n).
Hypercube The hypercube formulation is similar to the EREW PRAM formulation. Each read or write
operation takes (log n) time. Also, the elements can be permuted to their final destination in (log n)
time, since n 2 processors are used. Therefore, the parallel run time of the enumeration sort is (log n).
Mesh In a mesh-connected computer we can emulate a concurrent write or read operation in ( n) time.
Therefore, the parallel run time of the enumeration sort is ( n).
When p processors are used, we assign n/ p elements to each processor. Each processor now performs compu-
tations for each of its elements, as if a single element is assigned to it. Thus, each physical processor emulates
n/ p virtual processors. This results in a slowdown by a factor of n/ p.
27 Since the sequential run time is (n), the speedup and efficiency are:
(n)
S =
(n/ p) + ( p log p)
(n)
E =
(n) + ( p2 log p)
1
=
1 + (( p log p)/n)
2
29 When the sizes of the subblocks are roughly the same, the speedup and efficiency of sample sort are:
(n log n)
S =
((n/ p) log(n/ p)) + ( p2 logp) + ( p log(n/ p)) + (n/ p) + ( p log p)
(n log n)
E =
(n log(n/ p)) + ( p3 log p) + ( p2 log(n/ p)) + (n) + ( p2 log p)
1
log p p3 log p p2 p 2 log p 1
= 1+ + + + +
log n n log n n n log n log n
Graph Algorithms
If p = (n), the parallel run time is (n log n) and is determined by the second term of the above equation.
The minimum parallel run time can be obtained by differentiating Equation 10.1 with respect to p, setting the
derivative equal to zero, and solving for p as follows:
n2 n
2 + = 0 p = (n)
p p
However, if we use (n/ log n) processors, then the parallel run time is
TP = (n log n) + (n log n n log log n) = (n log n). (10.3)
From Equations 10.2 and Equation 10.3 we see that the parallel run times obtained using (n) and (n/ log n)
processors are the same in asymptotic terms. However, the exact run time when (n/ log n) processors are
used is larger than the one obtained when p = n, by a constant factor.
2 Dijkstras single-source shortest paths algorithm can record the shortest path by using an additional array path,
such that path[v] stores the predecessor of v in the shortest path to v. This array can be updated by modifying
Line 12 of Algorithm 10.2 (page 437). Whenever l[u] + w(u, v) is smaller than l[v] in Line 12, path[v] is set
to u. This modification does not alter the performance of the algorithm.
3 Let G = (V, E) be a graph. For each edge (u, v) E let w(u, v) = 1. The breadth-first ranking can
be obtained by applying Dijkstras single-source shortest path algorithm starting at node u on graph G =
(V, E, W ). The parallel formulation of this algorithm is similar to that of Dijkstras algorithm.
4 The solution is discussed by Bertsekas and Tsitsiklis [BT89].
5 The sequential all-pairs shortest paths algorithm requires (n 2 ) memory. This memory is used to store the
weighted adjacency matrix of G = (V, E). The source-partitioning formulation of Dijkstras algorithm re-
quires (n 2 ) memory on each processor, and the source-parallel formulation requires (n 3 / p) memory on
43
44 Graph Algorithms
Table 10.1 Memory overhead of the all-pairs shortest paths algorithm presented in Section 10.4 (page 437). Memory overhead
is defined as the ratio of the memory required by all the p processors with the memory required by the sequential algorithm.
each processor. The reason is that in the source-parallel formulation, there are n groups of p/n processors,
each running a copy of Dijkstras single-source shortest path. For Floyds algorithm (both the 2D block-
mapping and its pipelined variant) requires (n 2 / p) memory per processor. Table 10.1 summarizes the mem-
ory requirements, and the memory overhead factors of each formulation. For further details refer to Kumar
and Singh [KS91].
6 Replacing Line 7 of Algorithm 10.3 (page 440) by
we perform the updates of the D matrix in place, without requiring any additional memory for the various
(k1) (k1)
D (k) matrices. The replacement is correct because during the kth iteration, the values of di,k and dk, j are
(k) (k)
the same as di,k and dk, j . This is because the operation performed at Line 7, does not change these values.
7 Recall from Section 10.4.2 (page 441) that in each iteration of Floyds algorithm, the algorithm performs
two broadcasts, one along the rows and one along the columns. Each broadcast requires a message of size
(n/ p) to be sent to all the processors in a row or column.
On a mesh with store-and-forward routing, each broadcast operation takes (n) time, and on a mesh with cut-
through routing, it takes ((n log p)/ p + p) time. Since the algorithm requires n iterations, the parallel
run time on a mesh with store-and-forward routing is
n3
TP = + (n 2 ),
p
The isoefficiency function due to the first term is ( p1.5 log3 p) and the isoefficiency function due to the
second term is ( p 2.25 ). Thus, the overall isoefficiency function is ( p 2.25 ).
(k) (k)
8 (a) In each iteration of Floyds algorithm, each processor needs to know the di,k and dk, j values to compute
di,(k+1)
j
(k)
. By partitioning the D (k) matrix in a block-striped fashion, the dk, j values are stored locally in
Chapter 10 45
(k)
each processor. However, the di,k values needs to be obtained by the processor that stores the kth column
(k)
of the D matrix. Therefore, in each iteration of the block-striped formulation, the processor that has
the kth column needs to broadcast it to all the other processors. Since each column contains n elements,
this operation takes (n log p) time on a hypercube-connected parallel computer. Therefore, the parallel
run time is
n3
TP = + (n 2 log p).
p
The isoefficiency function due to communication is (( p log p)3 ), which is the overall isoefficiency
function. The isoefficiency function of the block-striped formulation is worse than that of the 2D block
formulation, which is ( p1.5 log3 p). Therefore, this formulation is less scalable, and has no advantages
over the 2D block partitioning.
(b) On a mesh with store-and-forward routing, it takes (n p) time to broadcast a column. Therefore, the
parallel run time is
n3
TP = + (n 2 p).
p
The isoefficiency function due to the first communication term is (( p log p)3 ) and due to the second
communication term is ( p2.25 ). Therefore, the overall isoefficiency function is (( p log p)3 ).
On a ring with store-and-forward routing, the column broadcast takes (np) time. The parallel run time
is
n3
TP = + (n 2 p).
p
The isoefficiency function is ( p 6 ). The same column broadcast takes (n log p + p) time on a ring
with cut-through routing. The parallel run time is
n3
TP = + (n 2 log p) + (np).
p
The isoefficiency function is (( p log p)3 ). Thus block-striped partitioning has similar performance on
a ring with CT routing and on a mesh with CT routing.
9 The pipelined formulation of the block-striped partitioning proceeds along the same lines as that with the 2D
block partitioning. As soon as the processor containing the kth column has finished the (k 1)st iteration, it
sends the (k 1)st column to its two neighbor processors. When a processor receives elements of the (k 1)st
column it stores them locally and then forwards them to its other neighbor. A processor starts working on the
kth iteration as soon as it has computed the (k 1)st iteration and has received the (k 1)st columns.
It takes (np) time for the first column to reach processor Pp . After that each subsequent column follows after
(n 2 / p) time in a pipelined mode. Hence, processor P p finishes its share of the shortest path computation in
(n 3 / p) + (np) time. When processor Pp has finished the (n 1)st iteration, it sends the relevant values of
the nth column to the other processors. These values reach processor P1 after (np) time. The overall parallel
run time is
n3
TP = + (np).
p
46 Graph Algorithms
Figure 10.1 Merging the spanning forests along a row of a mesh. Each arrow indicates the direction of data movement. In this
example, each row has 16 processors.
The efficiency is
1
E= .
1 + ( p2 /n 2 )
Therefore, the isoefficiency function is ( p 3 ).
10 The analysis is presented by Kumar and Singh [KS91]. The exact parallel run time is
n3 n
TP = tc + 4( p 1) ts + tw , (10.4)
p p
where tc is the time to perform the operation in Line 7 of Algorithm 10.3 (page 440), ts is the message
startup time, and tw is the per-word transfer time. Note that Equation 10.4 is valid under the following two
assumptions (a) n 2 tc / p ts + ntw / p and (b) when a processor transmits data it is blocked for ts time; after
that it can proceed with its computations while the message is being transmitted.
11 The main step in deriving a parallel formulation of the connected components algorithm for a mesh-connected
parallel computer is finding an efficient way to merge the spanning forests. One possible way is to first merge
the spanning forests along each row of the mesh. At the end of this step, a column of mesh processors stores
the merged spanning forests of each row. The global spanning forest is found by merging these spanning
forests. This is accomplished by performing a merge along a single column of the mesh.
One possible way of merging the spanning forests in each row is shown in Figure 10.1. Notice that in each but
the last step, the distance among processors that need to communicate doubles.
Recall that in order for a pair of processors to merge their spanning forests, one processor must send (n)
edges to the other processor. In a mesh with store-and-forward routing in order to merge the spanning forests
along a row, the total time spent in sending edges is
log( p/4)
(n) 2i + 1 = (n p).
i =0
Similarly, the time to perform the merge along the column is also (n p). Therefore, the parallel run time
of the algorithm is
n2
TP = + (n p).
p
The speedup and efficiency are:
(n 2 )
S = ,
(n 2 / p) + (n p)
1
E = .
1 + ( p1.5 /n)
In a mesh with cut-through routing, the time spent merging the spanning forests along each row is:
log( p/4)
(n log p) + 2i = (n log p) + ( p).
i =0
Since the sequential complexity of Dijkstras algorithm is (n 2 ), the speedup and efficiency are as follows:
(n 2 )
S =
(n 2 / p) + (n log p) + (nm + n log p)
1
E =
1 + (( p log p)/n) + (mp/n + ( p log p)/n)
For a cost optimal formulation, ( p log p) = (n) and (mp) = (n). Therefore, this formulation can use
min{(n/m), (n/ log n)} processors.
Now consider the case where each processor is assigned m/ p elements from each adjacency list. Assume
that the l-values of the vertices are distributed among the p processors, so each processor has n/ p l-values.
Finding the vertex u with the minimum value of l takes (n/ p) + (log p) time. In order for the l-values
to be updated for each vertex v adjacent to u, the processor responsible for l[v] must know w(u, v). Each
processor has m/ p elements of the adjacency list of u. These elements need to be send to different processors.
Therefore each processor spends at least (m log p/ p) time. Thus, in the worse case the overall time spent in
communication is (m log p). Therefore, the overall run time for n iterations is
n2
TP = ( + (nm log p).
p
Since the sequential complexity is (n 2 ), this formulation is cost-optimal. However, this formulation can use
only p = min{n/(m log n), m} processors.
Note that Dijkstras algorithm has the same complexity for both sparse and dense graphs. The operation that
determines the complexity is finding the vertex with the minimum value of l.
15 The solution to this problem is similar to Solution 14.
17 If we ignore overhead due to extra work, we essentially make the assumption that computation progresses in a
wave fashion. That is, at any time, only the vertices belonging to a diagonal of the grid graph are performing
computations. The average length of each diagonal is n/2 vertices, and there are a total of (2n 1) diagonals.
In the 2D cyclic mapping, each diagonal of vertices is mapped onto a diagonal of the p p processor
grid. This diagonal of processors is wrapped-around in such a way that at any time, at most p processors are
performing computation. This can be verified by looking at Figure 10.2.
Since there are an average of n/2 vertices in each diagonal and p processors assigned to it, the shortest path
computation performed by these processors takes (n/(2 p)) = (n/ p) time. However, each shortest
path computation requires sending the data to neighbor processors. Therefore, after computing each diagonal,
the algorithm spends (n/ p) time sending path information. Therefore, the overall run time is
n2 n2
TP = + .
p p
The processor-time product of this formulation is (n 2 p); therefore, this algorithm is not scalable.
18 Let b be the block size of the 2D block-cyclic-mapping. As in Solution 17, each diagonal has n/2 vertices.
Each diagonal now can intersect up to 2 p processors. Therefore, the algorithm spends (n/ p) time for
each diagonal. Since the block size is b, each of the 2 p processors has approximately (n/(b p)) blocks
assigned to it. Each processor needs to send path information for only the boundary vertices of each block,
therefore the communication is (n/(b p)). The parallel run time of this formulation is
n2 n2
TP = + .
p b p
The processor-time product is (n 2 p); thus, this formulation is not cost-optimal.
Chapter 10 49
Figure 10.2 (a) A grid graph, (b) Mapping the grid graph onto a processor grid using 2D cyclic mapping. (c) The processors
intersected by the diagonal in (a). Note that each diagonal intersects at most p processors.
1 The value of V ( p) for the GRR-M scheme is identical to that of the GRR scheme. However, the number of
accesses to target are reduced drastically in this case. Each request now incurs a delay of time at each of
the log p processors in the path. Therefore, there is a delay time of O( log p) associated with each request.
This does not change the overall time, since the time to transfer work across the network is also (log p). The
isoefficiency function due to communication overhead is given by O(V ( p) log p log W ), or O( p log p log W ).
Equating this with the essential work W , we have the isoefficiency function of this scheme to be O( p log2 p).
Since there is no contention in this load balancing scheme, the overall isoefficiency is given by O( p log2 p).
Kumar, Ananth, and Rao [KGR91] discuss this scheme in detail and provide experimental evaluation.
2 This load balancing technique uses a distributed scheme for implementing the counter. The step in the counter
is used to locate the processor to be requested for work. Each time a request is made, the step moves to the
right. In this way, after p requests, the step returns to the processor it started at. The step in the value of
counters can be detected in (log p) using the algorithm presented by Lin [Lin92].
This algorithm can be used in conjunction with the load balancing strategy described in the problem. V ( p) for
this scheme is p. Therefore, the communication overhead is given by O( p log p log W ). Equating this with
the essential work W , we can determine the isoefficiency of this scheme to be O( p log2 p).
The isoefficiency term resulting from the number of messages required to balance load is the same as that of
GRR (and GRR-M), which is O( p 2 log p). Therefore, the overall isoefficiency of this scheme on a network
of workstations is O( p 2 log p).
Lin [Lin92] discusses this result in detail.
3 Size of message grows as w: The contribution to the overall isoefficiency due to contention remains the
same and can be obtained by equating
W/ p = O(V ( p) log W )
51
52 Search Algorithms for Discrete Optimization Problems
by
log W 1
Tcomm = O( V ( p) (1 )i W log p)
i =0
For GRR, V ( p) = p. If (1 ) = r, this expression reduces to
W 1
log
Tcomm = O( p W log p ri)
i =0
(1 r log W 1 )
Tcomm = O( p W log p )
1r
Since r < 1, this expression reduces to
Tcomm = O( p W log p)
Equating this overhead with essential computation W , we get
W = O( p2 log2 p)
Therefore the overall isoefficiency is now O( p 2 log2 p).
4 The run time of a termination detection algorithm is considered as the time taken by an algorithm to signal
termination after the last processor has finished its computation. Since the token may need to traverse the ring
once after all processors have gone idle, in the worst case, the run time of the termination algorithm is ( p).
Since all processors have to spend this time, the total overhead due to termination detection is ( p 2 ).
The constant associated with this isoefficiency term is very small. Therefore, unless the value of p is very
large, the contribution of termination detection to overall isoefficiency can be neglected.
6 The solution is discussed in detail by Mahapatra and Dutt [DM93].
7 The solution is discussed in detail by Kimura and Nobuyuki [KI90]. Ignoring communication latency, they
show that the isoefficiency function of the single level load balancing scheme is given by O( p 2 ).
8 The solution is discussed in detail by Kimura and Nobuyuki [KI90]. Ignoring communication latency, they
show that the isoefficiency function of the multi-level load balancing scheme is given by O( p (l+1)/l (log p)(l+1)/l ),
where l is the number of levels.
9 Ferguson and Korf [FK88] discuss this scheme in detail.
10 Let the total number of nodes to be expanded at a certain moment by the search algorithm be r and number of
processors with at least one node in its local open list be X r . As shown by Manzini [MS90], E[X r ] = ( p)
as r becomes ( p) (which is a reasonable assumption).
In each iteration, ( p) nodes are expanded and the corresponding communication overhead is ( p). This
corresponds to an isoefficiency function of ( p). Since there are no other overheads, the overall isoefficiency
of this scheme is given by ( p).
The analysis considers only the number of nodes and not the quality of nodes expanded. The quality of
the nodes can be analyzed in a similar manner. Manzini [MS90] discusses these results in detail. Dutt and
Mahapatra [MD93] experimentally evaluate this scheme and propose improved variants.
11 Each node in the parallel best-first search formulation is hashed to a random processor. The time taken to
perform this communication is given by O(m + log p) for a message of size m. Since there are W/ p such
hash operations, the total communication time is given by O(W m/ p + n/ p log p). The total communication
overhead is given by O(W m + W log p). Equating this with the total essential work, which is (W ), we can
see that the formulation is unscalable. This is because no matter how fast the problem size W is increased, the
efficiency cannot be held constant.
The formulation is scalable only for architectures for which the cost of communicating a node across is O(1);
e.g., on PRAMs.
C HAPTER 12
Dynamic Programming
1
= 2ts tw
1
p + ctc + ptc
Now, as we increase the problem size (by increasing c), all terms in the denominator reduce except 1 + tw /tc .
Therefore, 1 + tw /tc is the lower bound on the value of the denominator. Therefore, the efficiency is bounded
from above by
1
E=
1 + ttwc
2 The solution is discussed by Lee and Sahni in [LSS88].
3 NOTE
The hint in the problem should point to block-cyclic striped mapping and not cyclic striped map-
ping.
In the cyclic striped mapping shown in Figure 12.1, in the computation of the first diagonal, one processor
is busy; in the computation of the second diagonal, two processors are busy; and so on. This continues until
diagonal p. The computation of each of these diagonals takes time tc + ts + tw . This is because a single word
needs to be communicated from the processor on its left and a single entry needs to be computed. Therefore
the total time for computing these p diagonals is p(tc + ts + tw ). The computation of each of the next p
diagonals takes time 2(tc + ts + tw ) since some of the processors may be computing two entries in the table.
The total time taken for these p diagonals is therefore 2 p(tc + ts + tw ). Similarly, the computation of the next
p diagonals takes time 3(tc + ts + tw ) and the total time is 3 p(tc + ts + tw ). The time for the p diagonals
leading up to the longest diagonal is given by n/ p(tc + ts + tw ). The total parallel run time of the algorithm is
53
54 Dynamic Programming
0 1 2 3 0 3
twice the time taken to compute the first n diagonals. This is given by
n/ p
TP = 2 p(tc + ts + tw ) i
i =1
n n
= p(tc + ts + tw ) ( + 1)
p p
n
= (tc + ts + tw )n( + 1)
p
n 2
= (tc + ts + tw ) + (tc + ts + tw )n
p
n 2 tc
S = 2
(tc + ts + tw ) np + (tc + ts + tw )n
1
E = ts +tw p
1+ tc + n (tc + ts + tw )
We can see that on increasing the problem size (by increasing n), the efficiency increases. However, it is
bounded from above by
1
E=
1 + ts +t
tc
w
This upper bound can be removed by using block-cyclic striped mapping. In this case, processors are assigned
p columns in a cyclic manner. Therefore, the first p columns are assigned to processor P0 , the next p
Chapter 12 55
0 1
p
Figure 12.2 Block cyclic partitioning of the LSC problem. The nodes are organized into blocks of size p p and assigned
to the processors in a cyclic fashion.
columns to processor P1 and so on. The assignment wraps back to processor P0 after all processors have been
assigned p columns. Computation is performed in blocks of size p p as shown in Figure 12.2. The
computation of the block at the top left corner is performed first. After this block has been processed, two
blocks can be processed. One by processor P0 and the other by processor P1 . Using this mapping scheme,
the computation of the first block is done by processor 0 in time ( p p)tc . The next set of two blocks is
computed by processors P0 and P1 in time (ts + tw p) + ptc . This continues until all processors become busy
(that is, until diagonal p1.5 ). For the next p1.5 diagonals, computing each block takes time 2((ts +tw p)+ ptc ).
This continues and the longest diagonal. The total parallel run-time of the algorithm is twice the time for the
computation of the first n diagonals. This is given by
n/( p 1.5 )
TP = 2((ts + tw p) + ptc ) p i
i =1
n n
= ((ts + tw p) + ptc ) p
( + 1)
p1.5 p1.5
n n2 n2 n2
= ts ( + 2 ) + tw ( 1.5 + n) + tc ( + n p)
p p p p
The corresponding speedup and efficiency are given by
n 2 tc
S =
n2 2 2
ts ( np + p2
) + tw ( pn1.5 + n) + tc ( np + n p)
1
E =
p 1.5 tw 1 p ts p
1+ n + tc ( p + n + tc ( n + 1p )
56 Dynamic Programming
From this expression, we can see that the efficiency does not have an upper bound.
4 The TSP can be solved by constructing a table which stores values of f (S, k) for increasingly larger sets S.
Starting at city v1 , we can construct n 1 sets of two cities with the second city being the terminating city.
Each of these n 1 sets now leads to n 2 sets of 3 cities, with the added city being the terminating city. In
general, at the k th level, there are (n 1)n2 C k nodes (since the initial and terminating cities are fixed). The
total number of entires to be computed, N , is given by:
n2
N= (n 1)n2 C k = (n 1)2n2
k=0
Since computing an entry in the (n 2)nd row takes time (n 1)tc , the serial complexity of this algorithm is
(n 2 2n ).
The formulation is serial monadic, but is considerably harder to parallelize than the standard multistage graph
procedure. This is because the average number of nodes at a level is O(2n ), but each node is connected to a
small number of nodes (O(n)) at the next level. A simple parallel formulation of this algorithm is given below.
The formulation uses only O(n) processors, but it uses them efficiently.
The table can be constructed in parallel by assigning the responsibility of one terminating city to each pro-
cessor. Therefore, using n 1 processors, processor i in step k computes all paths through nodes in set k
terminating at node i. After each step, an all-to-all broadcast is used to communicate all the results from
the current step to all the processors. Therefore, in step k, an all to all broadcast of n2 C k elements is re-
quired. The corresponding computation for the step is given by (k n2 C k ). All-to-all broadcast takes time
ts log p + tw m( p 1). The first term is dominated by the second and can be ignored without significant loss in
accuracy. The all-to-all broadcast operation can be performed in approximately tw mp time. Since m =n2 C k
and p = n 1, the time taken is given by tw n n2 C k .
The total parallel time is therefore given by
n2
n2
TP = k n2
C k tc + tw n n2 C k
k=0 k=0
This corresponds to TP = (n2n ). Since (n) processors are used, the formulation is cost optimal. The
same formulation gives identical performance on the mesh connected computer because the dominant term in
all-to-all broadcast is the same for the mesh.
5 Let c(k) be the number of records in file k and f (S) be the cost of merging the set of files S. Furthermore, let
uv represent the file of length c(u) + c(v) resulting from the merger of files u and v. The recursive equation
for solving the optimal merge problem is given by
0 S = {} or |S| = 1
f (S) = min { f (S {u} {v} + {uv}) + c(u) + c(v)}
u,vS
However, a better greedy formulation of the algorithm is represented by the following recursive equation
0 S = {} or |S| = 1
f (S) = f (S {u} {v} + {uv}) + min c(u) + min c(v)
uS vS{u}
The serial algorithm requires the two smallest files to be determined at each step. This is based on the heap
data structure or on explicit determination of the smallest files. Parallel formulations of the algorithm can be
derived based on the technique used to determine the smallest files.
Chapter 12 57
7 Given a polygon v0 , v1 , . . . , vn1 , the optimal triangulation problem can be solved using the following DP
formulation: define C[i, j] as the weight of an optimal triangulation of vertices vi 1 , . . . , v j . The objective
is to determine C[1, n 1]. The following recursive equation can be used to determine C[i, j]:
min {C[i, k] + C[k + 1, j] + f (vi 1 , vk , v j )} i < j
C[i, j] = i k j (12.1)
0 i= j
Here, f (vi 1 , vk , v j ) is a weight function corresponding to the triangle formed by vertices vi 1 , vk , and v j .
Using the perimeter of the triangle as the weight function,
Comparing this with the recursive DP formulation for the matrix parenthesization problem, we can see that the
two are identical. It is a nonserial polyadic DP formulation and parallel formulations identical to the optimal
matrix parenthesization problem can be used here.
Cormen [CLR90] provides details of the DP formulation to solve this problem.
C HAPTER 13
1 (a) This problem assumes a hypothetical parallel computer with a hypercube interconnection network, in
which there is no message startup time and a message contains only one word. The parallel run time of
the binary exchange algorithm is
tc n log n tw n log p
TP = +
p p
p
S = tw log p
1+ tc log n
1
E =
1 + 1.2 log p
log n
59
60 Fast Fourier Transform
Table 13.1 The binary representation of the various powers of calculated in different iterations of an 16-point FFT (also see
Figure 13.1 (page 539)). The value of m refers to the iteration number of the outer loop, and i is the index of the inner loop of
Program 13.2 (page 540).
i
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
m =0 0000 0000 0000 0000 0000 0000 0000 0000 1000 1000 1000 1000 1000 1000 1000 1000
m =1 0000 0000 0000 0000 1000 1000 1000 1000 0100 0100 0100 0100 1100 1100 1100 1100
m =2 0000 0000 1000 1000 0100 0100 1100 1100 0010 0010 1010 1010 0110 0110 1110 1110
m =3 0000 1000 0100 1100 0010 1010 0110 1110 0001 1001 0101 1101 0011 1011 0111 1111
For E = 0.6,
n log n = 3 p3 log p
For E = 0.4,
4 4/3
n log n = p log p
3
2 Refer to the paper by Thompson [Tho83].
3 On a linear array of p processors, the distance between the communicating processors in the i th iteration (of
the log p iterations that require interprocessor communication) is 2i (0 i < log p).
tc n log n
log p1
n
TP = + (ts + tw 2i )
p p
i =0
tc n log n
+ ts log p + tw n
p
Table 13.2 The maximum number of new powers of used by any processor in each iteration of an 8-point FFT computation.
p = 1 p = 2 p = 4 p = 8 p = 16
m =0 2 1 1 1 1
m =1 2 2 1 1 1
m =2 4 4 2 1 1
m =3 8 8 4 2 1
Total = h(16, p) 16 15 8 5 4
value of the entry (log n 1, p) is n/ p; that is, except for p = 1, all entries in the last row of the tables are
n/ p. Thus, if the number of processors is changed from 2 p to p, all but the first entry of column 2 p are present
in column p, but these entries are offset by one row. An entry n/ p is added to the last row of column p and the
original entry (0, 2 p) = 1 does not move from column 2 p to column p. Hence, h(n, p) = h(n, 2 p)+n/ p 1,
which is the last equation of the recurrence.
n
h(n, p) = h(n, 2 p) + 1
p
n n
= h(n, 4 p) + + 2
2p p
.. ..
. .
n n n n
= h(n, n) + + + + log( )
n/2 n/4 p p
n
= log n + 2 + 4 + + log n + log p
p
n
= 2 2 + log p
p
8 Since the computation time is the same in each case, we give the communication times in the following tables.
A indicates that the algorithm is not applicable for the given values of n and p. The least run time in each
case is boldfaced, and hence, indicates the algorithm of choice.
n = 215 , p = 212 :
Communication Runtimes
constants 2-D trans. 3-D trans. 4-D trans. 5-D trans. Bin. ex.
ts = 250, tw = 1 3.15E4 8032 3096
ts = 50, tw = 1 6316 1632 696
ts = 10, tw = 1 1276 352 216
ts = 2, tw = 1 268 96 120
ts = 0, tw = 1 16 32 96
n = 212 , p = 26 :
Communication Runtimes
constants 2-D trans. 3-D trans. 4-D trans. 5-D trans. Bin. ex.
ts = 250, tw = 1 1.58E4 3628 2442 1884
ts = 50, tw = 1 3214 828 642 684
ts = 10, tw = 1 694 268 282 444
ts = 2, tw = 1 190 156 210 396
ts = 0, tw = 1 64 128 192 384
62 Fast Fourier Transform
n = 220 , p = 212 :
Communication Runtimes
constants 2-D trans. 3-D trans. 4-D trans. 5-D trans. Bin. ex.
ts = 250, tw = 1 1.04E6 1.28E4 9024 6072
ts = 50, tw = 1 2.05E5 3168 2624 3672
ts = 10, tw = 1 4.12E4 1248 1344 3192
ts = 2, tw = 1 8446 864 1088 3096
ts = 0, tw = 1 256 768 1024 3072
9 With a constant tw , the parallel run time of the binary exchange algorithm on a two-dimensional mesh is
approximately tc n log n/ p + ts log p + 2tw n/ p. If the per-word time is tw / p x for a p-processor mesh, then
the parallel run time is tc n log n/ p + ts log p + 2tw n/ p0.5+x . The isoefficiency function due to ts is ( p log p).
To compute the isoefficiency function due to tw ,
tc n log n 2tw n
p p0.5+x
tw
log n 2 p0.5x
tc
/tc 0.5x
n 22tw p
tw 0.5x 2tw p0.5x /tc
n log n = W 2 p 2
tc
If the channel width is p x , then in each communication step, at least p x words must be transferred between
any two communicating processors to utilize all links of a channel. Hence,
n
px
p
n p1+x
n log n = W (1 + x) p1+x log p
Thus, the isoefficiency function due to communication overhead is ( p 0.5x 22(tw /tc ) p
0.5x
) and that due to
concurrency is ( p 1+x log p). For x > 0.5, the isoefficiency function due to concurrency is greater than
( p1.5 log p) and that due to communication is less than ( p 1.5 log p) if all channels are fully utilized. For
x < 0.5, the isoefficiency function due to communication exceeds ( p 1.5 log p). The best isoefficiency
function is therefore ( p1.5 log p) for x = 0.5. A higher rate of increase of channel width with respect to the
number of processors does not improve the overall scalability because the size of data stored at each processor
cannot be increased beyond (n/ p) without causing the scalability due to concurrency to deteriorate.
Bibliography
[Bat68] K. E. Batcher. Sorting networks and their applications. In Proceedings of the 1968 Spring Joint Computer
Conference, 307314, 1968.
[BLM+ 91] G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. C. Plaxton, S. J. Smith, and M. Zagha. A comparison of
sorting algorithms for the connection machine cm-2. Technical report, Thinking Machines Corporation,
1991.
[BT89] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-
Hall, NJ, 1989.
[CLR89] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, McGraw-Hill,
Cambridge, MA, 1989.
[CLR90] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, McGraw-Hill,
New York, NY, 1990.
[CV86] R. Cole and U. Vishkin. Deterministic coin tossing and accelerating cascades: Micro and macro tech-
niques for designing parallel algorithms. In Proceedings of the 18th Annual ACM Symposium on Theory
of Computing, 206219, 1986.
[CV91] B. S. Chlebus and I. Vrto. Parallel quicksort. Journal of Parallel and Distributed Processing, 1991.
[DHvdV93] J. W. Demmel, M. T. Heath, and H. A. van der Vorst. Parallel numerical linear algebra. Acta Numerica,
111197, 1993.
[DM93] S. Dutt and N. R. Mahapatra. Parallel A* algorithms and their performance on hypercube multiproces-
sors. In Proceedings of the Seventh International Parallel Processing Symposium, 797803, 1993.
[FK88] C. Ferguson and R. Korf. Distributed tree search and its application to alpha-beta pruning. In Proceedings
of the 1988 National Conference on Artificial Intelligence, 1988.
[fox]
[GK93] A. Gupta and V. Kumar. Performance properties of large scale parallel systems. Journal of Parallel and
Distributed Computing, 19:234244, 1993. Also available as Technical Report TR 92-32, Department of
Computer Science, University of Minnesota, Minneapolis, MN.
63
64 Bibliography
[KG91] V. Kumar and A. Gupta. Analyzing scalability of parallel algorithms and architectures. Technical Report
TR 91-18, Department of Computer Science Department, University of Minnesota, Minneapolis, MN,
1991. To appear in Journal of Parallel and Distributed Computing, 1994. A shorter version appears in
Proceedings of the 1991 International Conference on Supercomputing, pages 396-405, 1991.
[KGR91] V. Kumar, A. Y. Grama, and V. N. Rao. Scalable load balancing techniques for parallel computers.
Technical Report 91-55, Computer Science Department, University of Minnesota, 1991. To appear in
Journal of Distributed and Parallel Computing, 1994.
[KI90] K. Kimura and N. Ichiyoshi. Probabilistic analysis of the optimal efficiency of the multi-level dynamic
load balancing scheme. Technical report, ICOT, 1990.
[Knu73] D. E. Knuth. The Art of Computer Programming: Sorting and Searching. Addison-Wesley, Reading,
MA, 1973.
[KR84] S. C. Kwan and W. L. Ruzzo. Adaptive parallel algorithms for finding minimum spanning trees. In
Proceedings of the 1984 International Conference on Parallel Processing, 439443, 1984.
[KS91] V. Kumar and V. Singh. Scalability of parallel algorithms for the all-pairs shortest path problem. Journal
of Parallel and Distributed Computing, 13(2):124138, October 1991. A short version appears in the
Proceedings of the International Conference on Parallel Processing, 1990.
[Lei92a] F. T. Leighton. Introduction to Parallel Algorithms and Architectures. Morgan Kaufmann, San Mateo,
CA, 1992.
[Lei92b] F. T. Leighton. Introduction to Parallel Algorithms and Architectures. Morgan Kaufmann, San Mateo,
CA, 1992.
[Lin92] Z. Lin. A distributed fair polling scheme applied to or-parallel logic programming. International Journal
of Parallel Programming, 20(4), August 1992.
[LSS88] J. Lee, E. Shragowitz, and S. Sahni. A hypercube algorithm for the 0/1 knapsack problem. Journal of
Parallel and Distributed Computing, (5):438456, 1988.
[MD93] N. R. Mahapatra and S. Dutt. Scalable duplicate pruning strategies for parallel A* graph search. In
Proceedings of the Fifth IEEE Symposium on Parallel and Distributed Processing, 1993.
[MS90] G. Manzini and M. Somalvico. Probabilistic performance analysis of heuristic search using parallel hash
tables. In Proceedings of the International Symposium on Artificial Intelligence and Mathematics, 1990.
[NS79] D. Nassimi and S. Sahni. Bitonic sort on a mesh connected parallel computer. IEEE Transactions on
Computers, C28(1), January 1979.
[Qui87] M. J. Quinn. Designing Efficient Algorithms for Parallel Computers. McGraw-Hill, New York, NY,
1987.
[RS90] S. Ranka and S. Sahni. Hypercube Algorithms for Image Processing and Pattern Recognition. Springer-
Verlag, New York, NY, 1990.
[SKAT91] V. Singh, V. Kumar, G. Agha, and C. Tomlinson. Efficient algorithms for parallel sorting on mesh
multicomputers. International Journal of Parallel Programming, 20(2):95131, 1991.
[SS88] Y. Saad and M. H. Schultz. Topological properties of hypercubes. IEEE Transactions on Computers,
37:867872, 1988.
[SS90] H. Shi and J. Schaeffer. Parallel sorting by regular sampling. Journal of Parallel and Distributed Com-
puting, (14):361372, 1990.
Bibliography 65
[Tho83] C. D. Thompson. Fourier transforms in VLSI. IBM Journal of Research and Development, C-
32(11):10471057, 1983.
[WI89] K. Wada and N. Ichiyoshi. A distributed shortest path algorithm and its mapping on the Multi-PSI. In
Proceedings of International Conference of Parallel Processing, 1989.
[WS89] J. Woo and S. Sahni. Hypercube computing: Connected components. Journal of Supercomputing,
3:209234, 1989.