The PRAM Model
and Algorithms
Advanced Topics Spring 2008
Prof. Robert van Engelen
Overview
The PRAM model of parallel computation
Simulations between PRAM models
Work-time presentation framework of parallel algorithms
Example algorithms
1/23/08
HPC Fall 2007
The PRAM Model of Parallel
Computation
Parallel Random Access Machine (PRAM)
Natural extension of RAM: each processor is a RAM
Processors operate synchronously
Earliest and best-known model of parallel computation
Shared memory with m locations
Shared Memory
p processors, each with private memory
P1
1/23/08
P2
P3
Pp
All processors operate synchronously, by
executing load, store, and operations on data
HPC Fall 2007
Synchronous PRAM
Synchronous PRAM is a SIMD-style model
All processors execute the same program
All processors execute the same PRAM step instruction stream
in lock-step
Effect of operation depends on local data
Instructions can be selectively disabled (if-then-else flow)
Asynchronous PRAM
Several competing models
No lock-step
1/23/08
HPC Fall 2007
Classification of PRAM Model
A PRAM step (clock cycle) consists of three phases
Read: each processor may read a value from shared memory
2. Compute: each processor may perform operations on local data
3. Write: each processor may write a value to shared memory
1.
Model is refined for concurrent read/write capability
Exclusive Read Exclusive Write (EREW)
Concurrent Read Exclusive Write (CREW)
Concurrent Read Concurrent Write (CRCW)
CRCW PRAM
Common CRCW: all processors must write the same value
Arbitrary CRCW: one of the processors succeeds in writing
Priority CRCW: processor with highest priority succeeds in
writing
1/23/08
HPC Fall 2007
Comparison of PRAM Models
A model A is less powerful compared to model B if either
The time complexity is asymptotically less in model B for solving
a problem compared to A
Or the time complexity is the same and the work complexity is
asymptotically less in model B compared to A
From weakest to strongest:
1/23/08
EREW
CREW
Common CRCW
Arbitrary CRCW
Priority CRCW
HPC Fall 2007
Simulations Between PRAM
Models
An algorithm designed for a weaker model can be
executed within the same time complexity and work
complexity on a stronger model
An algorithm designed for a stronger model can be
simulated on a weaker model, either with
Asymptotically more processors (more work)
Or asymptotically more time
1/23/08
HPC Fall 2007
Simulating a Priority CRCW on
an EREW PRAM
Theorem: An algorithm that runs in T time on the p-processor priority
CRCW PRAM can be simulated by EREW PRAM to run in O(T log p)
time
1/23/08
A concurrent read or write of an p-processor CRCW PRAM can be
implemented on a p-processor EREW PRAM to execute in O(log p) time
Q1,,Qp CRCW processors, such that Qi has to read (write) M[ji]
P1,,Pp EREW processors
M1,,Mp denote shared memory locations for special use
Pi stores <ji,i> in Mi
Sort pairs in lexicographically non-decreasing order in O(log p) time
using EREW merge sort algorithm
Pick representative from each block of pairs that have same first
component in O(1) time
Representative Pi reads (writes) from M[k] with <k,_> in Mi and copies
data to each M in the block in O(log p) time using EREW segmented
parallel prefix algorithm
Pi reads data from Mi
HPC Fall 2007
Reduction on the EREW PRAM
Reduce p values on the p-processor EREW PRAM in
O(log p) time
Reduction algorithm uses exclusive reads and writes
Algorithm is the basis of other EREW algorithms
1/23/08
HPC Fall 2007
Sum on the EREW PRAM
Sum of n values using n processors (i)
Input: A[1,,n], n = 2k
Output: S
begin
B[i] := A[i]
for h = 1 to log n do
if i < n/2h then
B[i] := B[2i-1] + B[2i]
if i = 1 then
S := B[i]
end
1/23/08
HPC Fall 2007
10
Matrix Multiplication
Consider nn matrix multiplication with n3 processors
Each cij = k=1..n aik bkj can be computed on the CREW
PRAM in parallel using n processors in O(log n) time
On the EREW PRAM exclusive reads of aij and bij values
can be satisfied by making n copies of a and b, which
takes O(log n) time with n processors (broadcast tree)
Total time is still O(log n)
Memory requirement is huge
1/23/08
HPC Fall 2007
11
Matrix Multiplication on the
CREW PRAM
Matrix multiply with n3 processors (i,j,l)
Input: nn matrices A and B, n = 2k
Output: C = AB
begin
C[i,j,l] := A[i,l]B[l,j]
for h = 1 to log n do
if i < n/2h then
C[i,j,l] := C[i,j,2l-1] + C[i,j,2l]
if l = 1 then
C[i,j] := C[i,j,1]
end
1/23/08
HPC Fall 2007
12
The WT Scheduling Principle
The work-time (WT) scheduling principle schedules p
processors to execute an algorithm
Algorithm has T(n) time steps
A time step can be parallel, i.e. pardo
Let Wi(n) be the number of operations (work) performed
in time unit i, 1 < i < T(n)
Simulate each set of Wi(n) operations in Wi(n)/p
parallel steps, for each 1 < i < T(n)
The p-processor PRAM takes
i Wi(n)/p < i (Wi(n)/p+1) < W(n)/p + T(n)
steps, where W(n) is the total number of operations
1/23/08
HPC Fall 2007
13
Work-Time Presentation
The WT presentation can be used to determine
computation and communication requirements of an
algorithm
The upper-level WT presentation framework describes
the algorithm in terms of a sequence of time units
The lower-level follows the WT scheduling principle
1/23/08
HPC Fall 2007
14
Matrix Multiplication on the
CREW PRAM WT-Presentation
Input: nn matrices A and B, n = 2k
Output: C = AB
begin
for 1 < i, j, l < n pardo
C[i,j,l] := A[i,l]B[l,j]
for h = 1 to log n do
for 1 < i, j < n, 1 < l < n/2h pardo
C[i,j,l] := C[i,j,2l-1] + C[i,j,2l]
for 1 < i, j < n pardo
C[i,j] := C[i,j,1]
end
1/23/08
HPC Fall 2007
WT scheduling principle:
O(n3/p + log n) time
15
PRAM Recursive Prefix Sum
Algorithm
Input: Array of (x1, x2, , xn) elements, n = 2k
Output: Prefix sums si, 1 < i < n
begin
if n = 1 then s1 = x1; exit
for 1 < i < n/2 pardo
yi := x2i-1 + x2i
Recursively compute prefix sums of y and store in z
for 1 < i < n pardo
if i is even then si := zi/2
else if i = 1 then s1 := x1
else si := z(i-1)/2 + xi
end
1/23/08
HPC Fall 2007
16
Proof of Work Optimality
Theorem: The PRAM prefix sum algorithm correctly
computes the prefix sum and takes T(n) = O(log n) time
using a total of W(n) = O(n) operations
Proof by induction on k, where input size n = 2k
Base case k = 0: s1 = x1
Assume correct for n = 2k
For n = 2k+1
For all 1 < j < n/2 we have
zj = y1 + y2 + + yj = (x1 + x2) + (x3 + x4) + (x2j-1 + x2j)
Hence, for i = 2j < n we have si = s2j = zj = zi/2
And i = 2j+1 < n we have si = s2j+1 = s2j + x2j+1 = zj + x2j+1 = z(i-1)/2 + xi
T(n) = T(n/2) + a
W(n) = W(n/2) + bn
1/23/08
T(n) = O(log n)
W(n) = O(n)
HPC Fall 2007
17
PRAM Nonrecursive Prefix Sum
Input: Array A of size n = 2k
Output: Prefix sums in C[0,j], 1 < j < n
begin
for 1 < j < n pardo
B[0,j] := A[j]
for h = 1 to log n do
for 1 < j < n/2h pardo
B[h,j] := B[h-1,2j-1] + B[h-1,2j]
for h = log n to 0 do
for 1 < j < n/2h pardo
if j is even then C[h,j] := C[h+1,j/2]
else if i = 1 then C[h,1] := B[h,1]
else C[h,j] := C[h+1,(j-1)/2] + B[h,j]
end
1/23/08
HPC Fall 2007
18
First Pass: Bottom-Up
B[3,j] =
27
B[2,j] =
B[1,j] =
23
-4
17
B[0,j] =
-6
10
-2
A[j] =
-6
10
-2
1/23/08
HPC Fall 2007
19
Second Pass: Top-Down
B[3,j] =
C[3,j] =
27
27
B[2,j] =
C[2,j] =
B[1,j] =
C[1,j] =
23
27
-4
17
21
27
B[0,j] =
C[0,j] =
-6
10
-2
11
21
19
27
A[j] =
-6
10
-2
1/23/08
HPC Fall 2007
20
Pointer Jumping
Finding the roots of a forest using pointer-jumping
1/23/08
HPC Fall 2007
21
Pointer Jumping on the CREW
PRAM
Input: A forest of trees, each with a self-loop at its root,
consisting of arcs (i,P(i)) and nodes i, where 1 < i < n
Output: For each node i, the root S[i]
begin
for 1 < i < n pardo
S[i] := P[i]
while S[i] S[S[i]] do
S[i] := S[S[i]]
end
T(n) = O(log h) with h the maximum height of trees
W(n) = O(n log h)
1/23/08
HPC Fall 2007
22
PRAM Model Summary
PRAM removes algorithmic details concerning
synchronization and communication, allowing the
algorithm designer to focus on problem properties
A PRAM algorithm includes an explicit understanding of
the operations performed at each time unit and an
explicit allocation of processors to jobs at each time unit
PRAM design paradigms have turned out to be robust
and have been mapped efficiently onto many other
parallel models and even network models
1/23/08
A SIMD network model considers communication diameter,
bisection width, and scalability properties of the network topology
of a parallel machine such as a mesh or hypercube
HPC Fall 2007
23
Further Reading
An Introduction to Parallel Algorithms, by J. JaJa, 1992
1/23/08
HPC Fall 2007
24