ACA UNIT-1 B Kai Hwang
CONDITIONS OF PARALLELISM
The exploitation of parallelism has created a new dimension in computer science. In order to move parallel processing into the mainstream of computing, H. T. Kung (1991) has identified the need to make significant progress in three key areas: computation models for parallel computing, interprocessor communication in parallel architectures, and system integration for incorporating parallel systems into general computing environments.
A theoretical treatment of parallelism is thus needed to build a basis for the above challenges. In practice, parallelism appears in various forms in a computing environment. All forms can be attributed to levels of parallelism, computational granularity, time and space complexities, communication latencies, scheduling policies, and load balancing. Very often, tradeoffs exist among time, space, performance, and cost factors.
Data Dependence The ordering relationship between statements is indicated by the data dependence. Five types of data dependence are defined below:
(1) Flow dependence: A statement S2 is flow-dependent on statement S1 if an execution path exists from S1 to S2 and if at least one output (variables assigned) of S1 feeds in as input (operands to be used) to S2. Flow dependence is denoted as S1 → S2.
(2) Antidependence: Statement S2 is antidependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input to S1. A direct arrow crossed with a bar, as in S1 -|-> S2, indicates antidependence from S1 to S2.
(3) Output dependence: Two statements are output-dependent if they produce (write) the same output variable. S1 o→ S2 indicates output dependence from S1 to S2.
(4) I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
(5) Unknown dependence: The dependence relation between two statements cannot be determined at compile time in certain situations, for example when a variable's subscript is itself subscripted (indirect addressing) or when the subscript is nonlinear in the loop index variable.
Example 2.1 Data dependence in programs
Consider the following code fragment of four instructions:
S1: Load R1, A      /R1 ← Memory(A)/
S2: Add R2, R1      /R2 ← (R1) + (R2)/
S3: Move R1, R3     /R1 ← (R3)/
S4: Store B, R1     /Memory(B) ← (R1)/
As illustrated in Fig. 2.1a, S2 is flow-dependent on S1 because the variable A is passed via the register R1. S3 is antidependent on S2 because of potential conflicts in register content in R1. S3 is output-dependent on S1 because they both modify the same register R1. Other data dependence relationships can be similarly revealed on a pairwise basis. Note that dependence is a partial ordering relation; that is, the members of not every pair of statements are related. For example, the statements S2 and S4 in the above program are totally independent.
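These relations can also be derived mechanically from the read and write sets of each instruction. The following short Python sketch (an illustration added here, not part of Hwang's text) applies the pairwise definitions given above to the four instructions of Example 2.1; note that such a simple set-intersection test ignores intervening redefinitions, so it also reports a flow relation from S1 to S4 even though R1 is rewritten by S3 in between.

    # Read/write sets of S1..S4 in Example 2.1 (registers R1-R3, memory A, B).
    insts = {
        "S1": ({"A"}, {"R1"}),          # Load  R1, A
        "S2": ({"R1", "R2"}, {"R2"}),   # Add   R2, R1
        "S3": ({"R3"}, {"R1"}),         # Move  R1, R3
        "S4": ({"R1"}, {"B"}),          # Store B, R1
    }
    names = list(insts)
    for i, si in enumerate(names):
        reads_i, writes_i = insts[si]
        for sj in names[i + 1:]:
            reads_j, writes_j = insts[sj]
            if writes_i & reads_j:  print(si, "->",   sj, "(flow)")
            if reads_i & writes_j:  print(si, "-|->", sj, "(anti)")
            if writes_i & writes_j: print(si, "o->",  sj, "(output)")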
Next, we consider a code fragment involving I/O operations:
S1: Read (4), A(I)      /Read array A from file 4/
S2: Process             /Process data/
S3: Write (4), B(I)     /Write array B into file 4/
S4: Close (4)           /Close file 4/
As shown in Fig. 2.1b, the read/write statements, S1 and S3, are I/O-dependent on each other because they both access the same file. The above data dependence relations should not be arbitrarily violated during program execution; otherwise, erroneous results may be produced with a changed program order. The order in which statements are executed in a sequential program is well defined and repetitive runs produce identical results. On a multiprocessor system, the program order may or may not be preserved, depending on the memory model used. Determinism yielding predictable results can be controlled by a programmer as well as by constrained modification of writable data in a shared memory.
Control Dependence This refers to the situation where the order of execution of statements cannot be determined before run time. For example, conditional statements will not be resolved until run time. Different paths taken after a conditional branch may introduce or eliminate data dependence among instructions. Dependence may also exist between operations performed in successive iterations of a looping procedure. In the following, we show one loop example with and another without control-dependent iterations. The successive iterations of the following loop are control-independent:
      Do 20 I = 1, N
        A(I) = C(I)
        IF (A(I) .LT. 0) A(I) = 1
20    Continue
The successive iterations of the following loop are control-dependent:
      Do 10 I = 1, N
        IF (A(I - 1) .EQ. 0) A(I) = 0
10    Continue
Control dependence often prohibits parallelism from being exploited. Compiler techniques or hardware
branch prediction techniques are needed to get around the control dependence in order to exploit more
parallelism.
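To make the contrast concrete, the following Python sketch (illustrative only, with hypothetical data; not from the text) restates the two loops: each iteration of the first loop uses only its own element of C, so the iterations could run in any order or in parallel, while each iteration of the second loop tests the element that the previous iteration may have just written and therefore must run in program order.

    N = 8
    C = [3, -1, 4, -1, 5, -9, 2, -6]            # hypothetical input data

    # Loop 1 (control-independent): A(I) = C(I); IF (A(I) .LT. 0) A(I) = 1
    A = [c if c >= 0 else 1 for c in C]         # every element computed independently

    # Loop 2 (control-dependent): IF (A(I-1) .EQ. 0) A(I) = 0
    for i in range(1, N):                       # iterations must follow program order
        if A[i - 1] == 0:
            A[i] = 0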
Resource Dependence This is different from data or control dependence, which demand independence of the work to be done. Resource dependence is concerned with conflicts in using shared resources, such as integer units, floating-point units, registers, and memory areas, among parallel events. When the conflicting resource is an ALU, we call it ALU dependence.
If the conflicts involve workplace storage, we call it storage dependence. In the case of storage dependence, each task must work on independent storage locations or use protected access (such as locks or monitors, to be described in Chapter 11) to shared writable data.
The detection of parallelism in programs requires a check of the various dependence relations.
Bernstein's Conditions In 1966, Bernstein revealed a set of conditions based on which two processes can execute in parallel. A process is a software entity corresponding to the abstraction of a program fragment defined at various processing levels. We define the input set Ii of a process Pi as the set of all input variables needed to execute the process.
Similarly, the output set Oi consists of all output variables generated after execution of the process Pi. Input variables are essentially operands which can be fetched from memory or registers, and output variables are the results to be stored in working registers or memory locations.
Now, consider two processes P1 and P2 with their input sets I1 and I2 and output sets O1 and O2, respectively. These two processes can execute in parallel and are denoted P1 || P2 if they are independent and therefore create deterministic results. Formally, these conditions are stated as follows:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅        (2.1)
O1 ∩ O2 = ∅
These three conditions are known as Bernstein's conditions. The input set Ii is also called the read set or the domain of Pi by other authors. Similarly, the output set Oi has been called the write set or the range of a process Pi. In terms of data dependences, Bernstein's conditions simply imply that two processes can execute in parallel if they are flow-independent, antiindependent, and output-independent.
The parallel execution of two such processes produces the same results regardless of whether they are executed sequentially in any order or in parallel. This is because the output of one process will not be used as input to the other process. Furthermore, the two processes do not modify (write) the same set of variables, either in memory or in the registers.
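Equation 2.1 translates directly into a small test on the input and output sets of a pair of processes. The Python sketch below is an illustration added here (not from the text); it returns True exactly when the three conditions hold.

    def bernstein_parallel(in1, out1, in2, out2):
        """True if I1 ∩ O2 = I2 ∩ O1 = O1 ∩ O2 = ∅ (Eq. 2.1)."""
        return not (in1 & out2) and not (in2 & out1) and not (out1 & out2)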
In general, a set of processes, P1, P2, ..., Pk, can execute in parallel if Bernstein's conditions are satisfied on a pairwise basis; that is, P1 || P2 || ... || Pk if and only if Pi || Pj for all i ≠ j. This is exemplified by the following program, illustrated in Fig. 2.2.
Example 2.2 Detection of parallelism in a program using Bernstein's conditions
Consider the simple case in which each process is a single HLL statement. We want to detect the parallelism embedded in the following five statements, labeled P1, P2, P3, P4, and P5, in program order:
P1:  C = D × E
P2:  M = G + C
P3:  A = B + C        (2.2)
P4:  C = L + M
P5:  F = G + E
Assume that each statement requires one step to execute. No pipelining is considered here. The dependence graph shown in Fig. 2.2a demonstrates flow dependence as well as resource dependence. In sequential execution, five steps are needed (Fig. 2.2b).
Fig. 2.2 Detection of parallelism in the program of Example 2.2: (a) a dependence graph showing both data dependence and resource dependence; (b) sequential execution in five steps, assuming one step per statement (no pipelining); (c) parallel execution in three steps, assuming two adders are available per step.
If two adders are available simultaneously, the parallel execution requires only three steps, as shown in Fig. 2.2c. Pairwise, there are 10 pairs of statements to check against Bernstein's conditions. Only 5 pairs, P1 || P5, P2 || P3, P2 || P5, P5 || P3, and P4 || P5, can execute in parallel as revealed in Fig. 2.2a if there are no resource conflicts. Collectively, only P2 || P3 || P5 is possible (Fig. 2.2c) because P2 || P3, P3 || P5, and P5 || P2 are all possible.
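This result can be reproduced mechanically by applying the same three intersection tests to the five statements of Eq. 2.2, treating each variable name as a set element. The Python sketch below is illustrative only and is not part of the original text.

    from itertools import combinations

    procs = {                                  # (input set, output set) per Eq. 2.2
        "P1": ({"D", "E"}, {"C"}),             # C = D x E
        "P2": ({"G", "C"}, {"M"}),             # M = G + C
        "P3": ({"B", "C"}, {"A"}),             # A = B + C
        "P4": ({"L", "M"}, {"C"}),             # C = L + M
        "P5": ({"G", "E"}, {"F"}),             # F = G + E
    }
    pairs = [(p, q) for p, q in combinations(procs, 2)
             if not (procs[p][0] & procs[q][1]        # I_p ∩ O_q
                     or procs[q][0] & procs[p][1]     # I_q ∩ O_p
                     or procs[p][1] & procs[q][1])]   # O_p ∩ O_q
    print(pairs)   # [('P1','P5'), ('P2','P3'), ('P2','P5'), ('P3','P5'), ('P4','P5')]

Checking the pairs among P2, P3, and P5 likewise confirms that those three statements can execute together.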
In gcncral, thc parallelism relation I is commutative; i.c., P, I implies Pf; I P,-. Bill the relation is not
transitive; i.c., P, I 111,- and 11,- I Pk do not necessarily guarantee P, I Pg For example, we have P, I P, and F,
I P3, but P, II P3, whcrc II mcans P, and P3 cannot cxccutc in parallcl. ln other words, thc ordcr in which P,
and P; arc crtccutcd will make a diifcrcncc in thc computational results.
Thcrciorc, I is not an equivalence relation. l-Iowcver,.P,- I Ff, I Pk implies assoclativity; i.c. (P, I Pi] I P, =
P, I t P,- I P1], since the order in which the parallel executable processes are executed should not make any
diffcrcncc in thc output scts. lt should bc notcd that thc condition 1, {"1 I,-as cl docs not prcvcnt parallelism
bctwccn P, and i
Violation of any one or more of the three conditions in Eq. 2.1 prohibits parallelism between two processes. In general, violation of any one or more of the 3n(n − 1)/2 Bernstein's conditions among n processes prohibits parallelism collectively or partially.
In general, data dependence, control dependence, and resource dependence all prevent parallelism from being exploitable.
The statement-level dependence can be generalized to higher levels, such as code segment, subroutine, process, task, and program levels. The dependence of two higher-level objects can be inferred from the dependence of statements in the corresponding objects. The goal of analyzing the data dependence, control dependence, and resource dependence in a code is to identify opportunities for parallelization or vectorization. Hardware techniques for detecting instruction-level parallelism in a running program are described in Chapter 12.
Very often, program restructuring or code transformations need to be performed before such opportunities can be revealed. The dependence relations are also used in instruction issue and pipeline scheduling operations described in Chapters 6 and 12.
Hardware Parallelism This refers to the type of parallelism defined by the machine architecture and hardware multiplicity. Hardware parallelism is often a function of cost and performance tradeoffs. It displays the resource utilization patterns of simultaneously executable operations. It can also indicate the peak performance of the processor resources.
One way to characterize the parallelism in a processor is by the number of instruction issues per machine cycle. If a processor issues k instructions per machine cycle, then it is called a k-issue processor.
A conventional pipelined processor takes one machine cycle to issue a single instruction. These types of processors are called one-issue machines, with a single instruction pipeline in the processor. In a modern processor, two or more instructions can be issued per machine cycle.
For example, the Intel i960CA was a three-issue processor with one arithmetic, one memory-access, and one branch instruction issued per cycle. The IBM RISC/System 6000 is a four-issue processor capable of issuing one arithmetic, one memory-access, one floating-point, and one branch operation per cycle.
Software Parallelism This type of parallelism is revealed in the program profile or in the program flow graph. Software parallelism is a function of algorithm, programming style, and program design. The program flow graph displays the patterns of simultaneously executable operations.
Fig. 2.3 Software parallelism versus hardware parallelism for an example program (L: load operation; X: multiply operation; +, −: add/subtract operation): (a) software parallelism; (b) hardware parallelism.
Fig. 2.4 Dual-processor execution of the program in Fig. 2.3a (L/S: load/store operation; X: multiply operation; +, −: add/subtract operation).
Of the many types of software parallelism, two are most frequently cited as important to parallel programming. The first is control parallelism, which allows two or more operations to be performed simultaneously. The second type has been called data parallelism, in which almost the same operation is performed over many data elements by many processors simultaneously.
Control parallelism, appearing in the form of pipelining or multiple functional units, is limited by the pipeline length and by the multiplicity of functional units. Both pipelining and functional parallelism are handled by the hardware; programmers need take no special actions to invoke them.
Data parallelism offers the highest potential for concurrency. It is practiced in both SIMD and MIMD modes on MPP systems. Data parallel code is easier to write and to debug than control parallel code. Synchronization in SIMD data parallelism is handled by the hardware. Data parallelism exploits parallelism in proportion to the quantity of data involved. Thus data parallel computations appeal to scaled problems, in which the performance of an MPP system does not drop sharply with a possibly small sequential fraction in the program.
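As a small illustration of the two styles (hypothetical code, not from the text), the first fragment below exhibits control parallelism: two unrelated operations that could be issued to different functional units or threads at the same time. The second exhibits data parallelism: one operation applied uniformly to every element of an array, which a SIMD machine processes in lock-step.

    x = [1.0, 2.0, 3.0, 4.0]
    y = [5.0, 6.0, 7.0, 8.0]

    # Control parallelism: two independent operations that may proceed together.
    total = sum(x)
    scaled = [3.0 * e for e in y]

    # Data parallelism: the same add applied to every pair of elements.
    z = [a + b for a, b in zip(x, y)]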
To solve the mismatch problem between software parallelism and hardware parallelism, one approach is to develop compilation support, and the other is through hardware redesign for more efficient exploitation of parallelism. These two approaches must cooperate with each other to produce the best result.
Hardware processors can be better designed to exploit parallelism by an optimizing compiler. Pioneer work in processor technology with this objective was seen in the IBM 801, Stanford MIPS, and Berkeley RISC. Such processors use a large register file and sustained instruction pipelining to execute nearly one instruction per cycle. The large register file supports fast access to temporary values generated by an optimizing compiler. The registers are exploited by the code optimizer and global register allocator in such a compiler.
The instruction scheduler exploits the pipeline hardware by filling branch and load delay slots. In superscalar processors, hardware and software branch prediction, multiple instruction issue, speculative execution, a high-bandwidth instruction cache, and support for dynamic scheduling are needed to facilitate the detection of parallelism opportunities. Further discussion on these topics can be found in Chapters 6 and 12.
Instruction Level At the lowest level, a typical grain contains less than 20 instructions, called fine grain in Fig. 2.5. Depending on individual programs, fine-grain parallelism at this level may range from two to thousands. Butler et al. (1991) have shown that single-instruction-stream parallelism is greater than two. Wall (1991) finds that the average parallelism at the instruction level is around five, rarely exceeding seven, in an ordinary program. For scientific applications, Kumar (1988) has measured the average parallelism in the range of 500 to 3000 Fortran statements executing concurrently in an idealized environment.
Fig. 2.5 Levels of parallelism in program execution on modern computers (reprinted from Hwang, Proc. IEEE, October 1987). Level 1: instructions or statements (fine grain); Level 2: nonrecursive loops or unfolded iterations (fine grain); Level 3: procedures, subroutines, tasks, or coroutines (medium grain); Level 4: subprograms, job steps, or related parts of a program (coarse grain). Communication demand and scheduling overhead increase, and the degree of available parallelism decreases, toward the coarser levels.
The exploitation of fine-grain parallelism can be assisted by an optimizing compiler, which should be able to automatically detect parallelism and translate the source code to a parallel form which can be recognized by the run-time system. Instruction-level parallelism can be detected and exploited within the processors, as we shall see in Chapter 12.
Loop Level This corresponds to iterative loop operations. A typical loop contains less than 500 instructions. Some loop operations, if independent in successive iterations, can be vectorized for pipelined execution or for lock-step execution on SIMD machines. Some loop operations can be self-scheduled for parallel execution on MIMD machines.
Loop-level parallelism is often the most optimized program construct to execute on a parallel or vector computer. However, recursive loops are rather difficult to parallelize. Vector processing is mostly exploited at the loop level (level 2 in Fig. 2.5) by a vectorizing compiler. The loop level may also be considered a fine grain of computation.
Procedure Level This level corresponds to medium-grain parallelism at the task, procedure, subroutine, and coroutine levels. A typical grain at this level contains less than 2000 instructions. Detection of parallelism at this level is much more difficult than at the finer-grain levels. Interprocedural dependence analysis is much more involved and history-sensitive.
The communication requirement is often less compared with that required in MIMD execution mode. SPMD execution mode is a special case at this level. Multitasking also belongs in this category. Significant efforts by programmers may be needed to restructure a program at this level, and some compiler assistance is also needed.
Subprogram Level This corresponds to the level of job steps and related subprograms. The grain size may typically contain tens or hundreds of thousands of instructions. Job steps can overlap across different jobs. Subprograms can be scheduled for different processors in SPMD or MPMD mode, often on message-passing multicomputers.
Multiprogramming on a uniprocessor or on a multiprocessor is conducted at this level. Traditionally, parallelism at this level has been exploited by algorithm designers or programmers, rather than by compilers. Good compilers for exploiting medium- or coarse-grain parallelism require suitably designed parallel programming languages.
Job (Program) Level This corresponds to the parallel execution of essentially independent jobs (programs) on a parallel computer. The grain size can be as high as millions of instructions in a single program. For supercomputers with a small number of very powerful processors, such coarse-grain parallelism is practical. Job-level parallelism is handled by the program loader and by the operating system in general. Time-sharing or space-sharing multiprocessors explore this level of parallelism. In fact, both time and space sharing are extensions of multiprogramming.
To summarize, fine-grain parallelism is often exploited at the instruction or loop levels, preferably assisted by a parallelizing or vectorizing compiler. Medium-grain parallelism at the task or job-step level demands significant roles for the programmer as well as compilers. Coarse-grain parallelism at the program level relies heavily on an effective OS and on the efficiency of the algorithm used. Shared-variable communication is often used to support fine-grain and medium-grain computations.
Message-passing multicomputers have been used for medium- and coarse-grain computations. Massive parallelism is often explored at the fine-grain level, such as data parallelism on SIMD or MIMD computers.
Communication Latency By balancing granularity and latency, one can achieve better performance of a computer system. Various latencies are attributed to machine architecture, implementing technology, and the communication patterns involved. The architecture and technology affect the design choices for latency tolerance between subsystems. In fact, latency imposes a limiting factor on the scalability of the machine size. For example, over the years memory latency has increased with respect to processor cycle time. Various latency hiding or tolerating techniques will be studied in Chapters 9 and 12.
The latency incurred with interprocessor communication is another important parameter for a system designer to minimize. Besides signal delays in the data path, IPC latency is also affected by the communication patterns involved. In general, n tasks communicating with each other may require n(n − 1)/2 communication links among them. Thus the complexity grows quadratically. This leads to a communication bound which limits the number of processors allowed in a large computer system.
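For instance (a worked instance of the formula, not taken from the text), fully connecting n = 16 communicating tasks would require 16 × 15 / 2 = 120 links, while n = 1024 tasks would require 1024 × 1023 / 2 = 523,776 links, far too many to provide directly; this quadratic growth is what bounds the number of processors in practice.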
Communication patterns are determined by the algorithms used as well as by the architectural support provided. Frequently encountered patterns include permutations and broadcast, multicast, and conference (many-to-many) communications. The communication demand may limit the granularity or parallelism; very often, tradeoffs exist between the two.
The communication issue thus involves the reduction of latency or complexity, the prevention of deadlock, minimizing blocking in communication patterns, and the tradeoff between parallelism and communication overhead. We will study techniques that minimize communication latency, prevent deadlock, and optimize grain size in later chapters of the book.
Example 2.4 Program graph before and after grain packing (Kruatrachue and Lewis, 1988)
The basic concept of program partitioning is introduced below. In Fig. 2.6, we show an example program graph in two different grain sizes. A program graph shows the structure of a program. It is very similar to the dependence graph introduced in Section 2.1.1. Each node in the program graph corresponds to a computational unit in the program. The grain size is measured by the number of basic machine cycles (including both processor and memory cycles) needed to execute all the operations within the node.
We denote each node in Fig. 2.6 by a pair (n, s), where n is the node name (id) and s is the grain size of the node. Thus grain size reflects the number of computations involved in a program segment. Fine-grain nodes have a smaller grain size, and coarse-grain nodes have a larger grain size.
The edge label (v, d) between two end nodes specifies the output variable v from the source node (or the input variable to the destination node) and the communication delay d between them. This delay includes all the path delays and memory latency involved.
There are 17 nodes in the fine-grain program graph (Fig. 2.6a) and 5 in the coarse-grain program graph (Fig. 2.6b). Each coarse-grain node is obtained by combining (grouping) multiple fine-grain nodes, as shown in Fig. 2.6.
Fig. 2.6 A program graph before and after grain packing in Example 2.4 (modified from Kruatrachue and Lewis, IEEE Software, Jan. 1988): (a) the fine-grain program graph; (b) the coarse-grain program graph after packing.
The idea of grain packing is to apply fine grain first in order to achieve a higher degree of parallelism. Then one combines (packs) multiple fine-grain nodes into a coarse-grain node if doing so can eliminate unnecessary communication delays or reduce the overall scheduling overhead.
Usually, all fine-grain operations within a single coarse-grain node are assigned to the same processor for execution. A fine-grain partition of a program often demands more interprocessor communication than that required in a coarse-grain partition. Thus grain packing offers a tradeoff between parallelism and scheduling/communication overhead.
Internal delays among fine-grain operations within the same coarse-grain node are negligible, because the communication delay is contributed mainly by interprocessor delays rather than by delays within the same processor. The choice of the optimal grain size is meant to achieve the shortest schedule for the nodes on a parallel computer system.
Fig. 2.7 Scheduling of the fine-grain and coarse-grain programs (arrows: idle time; shaded areas: communication delays): (a) fine-grain schedule (Fig. 2.6a); (b) coarse-grain schedule (Fig. 2.6b).
With respect to the fine-grain versus coarse-grain program graphs in Fig. 2.6, two multiprocessor schedules are shown in Fig. 2.7. The fine-grain schedule is longer (42 time units) because more communication delays are included, as shown by the shaded area. The coarse-grain schedule is shorter (38 time units) because the communication delays among nodes 12, 13, and 14 within the same node D (and also the delays among 15, 16, and 17 within node E) are eliminated after grain packing.
Node Duplication In order to eliminate the idle time and to further reduce the communication delays among processors, one can duplicate some of the nodes in more than one processor.
Figure 2.8a shows a schedule without duplicating any of the five nodes. This schedule contains idle time as well as long interprocessor delays (8 units) between P1 and P2. In Fig. 2.8b, node A is duplicated into A' and assigned to P2 besides retaining the original copy A in P1. Similarly, a duplicated node C' is copied into P1 besides the original node C in P2. The new schedule shown in Fig. 2.8b is almost 50% shorter than that in Fig. 2.8a. The reduction in schedule time is caused by elimination of the (a, 8) and (c, 8) delays between the two processors.
Fig. 2.8 Node-duplication scheduling to eliminate communication delays between processors (I: idle time; shaded areas: communication delays): (a) schedule without node duplication; (b) schedule with node duplication (A → A and A'; C → C and C').
Grain packing and node duplication are often used jointly to determine the best grain size and corresponding schedule. Four major steps are involved in the grain determination and the process of scheduling optimization:
Step 1. Construct a fine-grain program graph.
Step 2. Schedule the fine-grain computation.
Step 3. Perform grain packing to produce the coarse grains.
Step 4. Generate a parallel schedule based on the packed graph.
The purpose of multiprocessor scheduling is to obtain a minimal-time schedule for the computations involved. The following example clarifies this concept.
Fig. 2.9 Calculation of grain size and communication delay for the program graph in Example 2.5 (courtesy of Kruatrachue and Lewis; reprinted with permission from IEEE Software, 1988): (a) grain size calculation in M68000 assembly code at 20 MHz for C11 = A11 × B11 + A12 × B21 (e.g., Move W A1, D1: 15 cycles; Move W B1, D2: 15 cycles; MPY D1, D2: 71 cycles; Move L D2, PAR: 20 cycles; ADD L D1, D2: 8 cycles); (b) calculation of the communication delay d = T1 + T2 + T3 + T4 + T5 + T6 = 20 + 20 + 32 + 20 + 20 + 100 = 212 cycles, where T3 is the 32-bit transmission time at 20 Mbps normalized to M68000 cycles at 20 MHz and T6 is the delay due to software protocols (assuming 5 Move instructions, 100 cycles); (c) the resulting fine-grain program graph.
In this example (Example 2.5), the sum of the four elements of the product C = A × B of two 2 × 2 matrices is computed:
C11 = A11 × B11 + A12 × B21
C12 = A11 × B12 + A12 × B22
C21 = A21 × B11 + A22 × B21
C22 = A21 × B12 + A22 × B22
Sum = C11 + C12 + C21 + C22
As shown in Fig. 2.9a, the eight multiplications are performed in eight ⊗ nodes, each of which has a grain size of 101 CPU cycles. The remaining seven additions are performed in a three-level binary tree consisting of seven ⊕ nodes. Each addition node requires 8 CPU cycles.
The interprocessor communication latency along all edges in the program graph is estimated as d = 212 cycles, obtained by adding all path delays between two communicating processors (Fig. 2.9b).
A fine-grain program graph is thus obtained in Fig. 2.9c. Note that the grain size and communication delay may vary with the different processors and communication links used in the system.
Figure 2.10 shows the scheduling of the fine-grain program, first on a sequential uniprocessor (P1) and then on an eight-processor (P1 to P8) system (Step 2). Based on the fine-grain graph (Fig. 2.9c), the sequential execution requires 864 cycles to complete, without incurring any communication delay.
Figure 2.10b shows the reduced schedule of 741 cycles needed to execute the 15 nodes on 8 processors with incurred communication delays (shaded areas). Note that the communication delays have slowed down the parallel execution significantly, resulting in many processors idling (indicated by I), except for P1, which produces the final sum. A speedup factor of 864/741 = 1.16 is observed.
Fig. 2.10 Scheduling of the fine-grain program graph in Example 2.5 (I: idle time; shaded areas: communication delays): (a) a sequential schedule on one processor, completing in 864 cycles; (b) a parallel schedule on eight processors (P1 to P8), completing in 741 cycles.
Next we show how to use grain packing (Step 3) to reduce the communication overhead. As shown in Fig. 2.11, we group the nodes in the top two levels into four coarse-grain nodes labeled V, W, X, and Y. The remaining three nodes (N, O, P) then form the fifth node, Z. Note that only one level of interprocessor communication is required, as marked by d in Fig. 2.11a.
Fig. 2.11 Parallel scheduling for Example 2.5 after grain packing to reduce communication delays: (a) the coarse-grain program graph (nodes V, W, X, Y, Z); (b) a parallel schedule on four processors (P1 to P4).
Since the maximum degree of parallelism is now reduced to 4 in the program graph, we use only four processors to execute this coarse-grain program. A parallel schedule is worked out (Fig. 2.11) for this program in 446 cycles, resulting in an improved speedup of 864/446 = 1.94.
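The numbers above can be reproduced approximately with a small scheduling sketch. The Python code below (an illustration added here, not the authors' scheduling algorithm) is a greedy list scheduler that charges the d = 212-cycle delay only when a producer and its consumer run on different processors. It reproduces the 864-cycle sequential time and the 446-cycle packed schedule; for the fine-grain graph it yields 761 cycles, close to but not identical to the hand-constructed 741-cycle schedule of Fig. 2.10b, since the heuristic makes slightly different placement choices.

    # A minimal greedy list scheduler (a sketch, not the authors' algorithm).
    # nodes: name -> grain size (cycles); edges: (producer, consumer) -> delay.
    def makespan(nodes, edges, nprocs):
        preds = {n: [] for n in nodes}
        for (u, v), d in edges.items():
            preds[v].append((u, d))
        free = [0] * nprocs            # time at which each processor becomes free
        fin = {}                       # node -> (finish time, processor)
        for n in nodes:                # insertion order assumed topological
            start, proc = min(
                (max([free[p]] + [fin[u][0] + (0 if fin[u][1] == p else d)
                                  for u, d in preds[n]]), p)
                for p in range(nprocs))
            fin[n] = (start + nodes[n], proc)
            free[proc] = fin[n][0]
        return max(t for t, _ in fin.values())

    # Fine-grain graph of Example 2.5: 8 multiplies (101 cycles) feeding a
    # 3-level tree of 7 adds (8 cycles); every edge costs d = 212 cycles.
    mul = {f"m{i}": 101 for i in range(1, 9)}
    add = {f"a{i}": 8 for i in range(1, 8)}
    fine = {**mul, **add}
    fine_edges = {(f"m{2*i-1}", f"a{i}"): 212 for i in range(1, 5)}
    fine_edges.update({(f"m{2*i}", f"a{i}"): 212 for i in range(1, 5)})
    fine_edges.update({("a1", "a5"): 212, ("a2", "a5"): 212,
                       ("a3", "a6"): 212, ("a4", "a6"): 212,
                       ("a5", "a7"): 212, ("a6", "a7"): 212})

    # Coarse-grain graph after packing (Fig. 2.11): V, W, X, Y each pack two
    # multiplies and one add (210 cycles); Z packs the last three adds (24).
    coarse = {"V": 210, "W": 210, "X": 210, "Y": 210, "Z": 24}
    coarse_edges = {(n, "Z"): 212 for n in "VWXY"}

    print(sum(fine.values()))                # 864-cycle sequential time
    print(makespan(fine, fine_edges, 8))     # 761 cycles, fine grain on 8 processors
    print(makespan(coarse, coarse_edges, 4)) # 446 cycles, packed graph on 4 processors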
In a dataflow computer, the execution of an instruction is driven by data availability instead of being guided by a program counter. In theory, any instruction should be ready for execution whenever operands become available. The instructions in a data-driven program are not ordered in any way. Instead of being stored separately in a main memory, data are directly held inside instructions.
Computational results (data tokens) are passed directly between instructions. The data generated by an instruction will be duplicated into many copies and forwarded directly to all needy instructions. Data tokens, once consumed by an instruction, will no longer be available for reuse by other instructions.
This data-driven scheme requires no program counter and no control sequencer. However, it requires special mechanisms to detect data availability, to match data tokens with needy instructions, and to enable the chain reaction of asynchronous instruction executions. No memory sharing between instructions results in no side effects.
Asynchrony implies the need for handshaking or token-matching operations. A pure dataflow computer exploits fine-grain parallelism at the instruction level. Massive parallelism would be possible if the data-driven mechanism could be cost-effectively implemented with low instruction execution overhead.
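The data-driven firing rule can be sketched in a few lines of Python (an illustration only, not a description of any particular machine): an instruction fires as soon as tokens are present on all of its inputs, deposits a result token for its consumers, and never consults a program counter. The three-instruction program here is hypothetical, similar in spirit to one iteration of the loop in Example 2.6 below.

    import operator
    ops = {"+": operator.add, "*": operator.mul, "/": operator.truediv}

    # A hypothetical data-driven program: instructions carry no ordering.
    # Each entry is (destination, operation, source1, source2).
    program = {
        ("a", "/", "d", "e"),
        ("b", "*", "a", "f"),
        ("c", "+", "b", "c0"),
    }
    tokens = {"d": 6.0, "e": 2.0, "f": 5.0, "c0": 0.0}   # initial input tokens

    pending = set(program)
    while pending:
        # Fire every instruction whose input tokens have all arrived.
        ready = {ins for ins in pending if all(s in tokens for s in ins[2:])}
        if not ready:
            break                          # some input was never produced
        for dst, op, s1, s2 in ready:
            tokens[dst] = ops[op](tokens[s1], tokens[s2])
        pending -= ready

    print(tokens["c"])                     # 15.0: fired in data-driven order a, b, c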
A Dataflow Architecture There have been quite a few experimental dataflow computer projects. Arvind and his associates at MIT developed a tagged-token architecture for building dataflow computers. As shown in Fig. 2.12, the global architecture consists of n processing elements (PEs) interconnected by an n × n routing network. The entire system supports pipelined dataflow operations in all n PEs. Inter-PE communications are done through the pipelined routing network.
Fig. 2.12 The MIT tagged-token dataflow computer (adapted from Arvind and Iannucci, 1986, with permission): (a) the global architecture; (b) the interior design of a processing element.
Within each PE, the machine provides a low-level token-matching mechanism which dispatches only those instructions whose input data (tokens) are already available. Each datum is tagged with the address of the instruction to which it belongs and the context in which the instruction is being executed. Instructions are stored in the program memory. Tagged tokens enter the PE through a local path. The tokens can also be passed to other PEs through the routing network. All internal token circulation operations are pipelined without blocking.
One can think of the instruction address in a dataflow computer as replacing the program counter, and the context identifier as replacing the frame base register in a control-flow computer. It is the machine's job to match up data with the same tag to needy instructions. In so doing, new data will be produced with a new tag indicating the successor instruction(s). Thus, each instruction represents a synchronization operation. New tokens are formed and circulated along the PE pipeline for reuse, or sent to other PEs through the global path, which is also pipelined.
Another synchronization mechanism, called the I-structure, is provided within each PE. The I-structure is a tagged memory unit for overlapped usage of a data structure by both the producer and consumer processes. Each word of an I-structure uses a 2-bit tag indicating whether the word is empty, is full, or has pending read requests. The use of I-structures is a retreat from the pure dataflow approach; the purpose is to reduce excessive copying of large data structures in dataflow operations.
Example 2.6 Comparison of dataflow and control-flow computers (Gajski, Padua, Kuck, and Kuhn, 1982)
The dataflow graph in Fig. 2.13a shows that 24 instructions are to be executed (8 divides, 8 multiplies, and 8 adds). A dataflow graph is similar to a dependence graph or program graph; the only difference is that data tokens are passed around the edges in a dataflow graph. Assume that each add, multiply, and divide requires 1, 2, and 3 cycles to complete, respectively. Sequential execution of the 24 instructions on a control-flow uniprocessor takes 48 cycles to complete, as shown in Fig. 2.13b.
On the other hand, a dataflow multiprocessor completes the execution in 14 cycles in Fig. 2.13c. Assume that all the external inputs (di, ei, fi for i = 1, 2, ..., 8, and c0) are available before entering the loop. With four processors, instructions a1, a2, a3, and a4 are all ready for execution in the first three cycles. The results produced then trigger the execution of a5, b1, a6, and a7 starting from cycle 4. The data-driven chain reactions are shown in Fig. 2.13c. The output c8 is the last one to be produced, due to its dependence on all the previous ci's.
Figure 2.13d shows the execution of the same set of computations on a conventional multiprocessor using shared memory to hold the intermediate results (si and ti for i = 1, 2, 3, 4). Note that no shared memory is used in the dataflow implementation. The example does not show any time advantage of dataflow execution over control-flow execution.
The theoretical minimum time is 13 cycles along the critical path a1 b1 c1 c2 ... c8. The chain-reaction control in dataflow is more difficult to implement and may result in longer overhead, as compared with the uniform operations performed by all the processors in Fig. 2.13d.
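As a check on these figures: each of the eight iterations costs 3 + 2 + 1 = 6 cycles for its divide, multiply, and add, so sequential execution takes 8 × 6 = 48 cycles; along the critical path a1 → b1 → c1 → ... → c8, one divide (3 cycles) and one multiply (2 cycles) are followed by eight dependent adds (1 cycle each), giving the 13-cycle minimum.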
Fig. 2.13 Comparison between dataflow and control-flow computers (adapted from Gajski, Padua, Kuck, and Kuhn, 1982; reprinted with permission from IEEE Computer, Feb. 1982): (a) the sample program (c0 = 0; for i from 1 to 8 do: ai = di ÷ ei; bi = ai × fi; ci = bi + ci−1; end; output a, b, c) and its dataflow graph; (b) sequential execution on a uniprocessor in 48 cycles; (c) data-driven execution on a 4-processor dataflow computer in 14 cycles; (d) parallel execution on a shared-memory 4-processor system in 14 cycles.
One advantage of tagging each datum is that data from different contexts can be mixed freely in the instruction execution pipeline. Thus, instruction-level parallelism of dataflow graphs can absorb the communication latency and minimize the losses due to synchronization waits. Besides token matching and I-structures, compiler technology is also needed to generate dataflow graphs for tagged-token dataflow computers. The dataflow architecture offers, in theory, a promising model for massively parallel computations because all far-reaching side effects are removed. However, implementation of these concepts on a commercial scale has proved to be very difficult.
2.3.2 Demand-Driven Mechanisms
In a reduction machine, the computation is triggered by the demand for an operation's result. Consider the evaluation of a nested arithmetic expression a = ((b + 1) × c − (d ÷ e)). The data-driven computation seen above chooses a bottom-up approach, starting from the innermost operations b + 1 and d ÷ e, then proceeding to the × operation, and finally to the outermost operation −. Such a computation has been called eager evaluation because operations are carried out immediately after all their operands become available.
A demand-driven computation chooses a top-down approach by first demanding the value of a, which triggers the demand for evaluating the next-level expressions (b + 1) × c and d ÷ e, which in turn triggers the demand for evaluating b + 1 at the innermost level. The results are then returned to the nested demander in the reverse order before a is evaluated.
A demand-driven computation corresponds to lazy evaluation, because operations are executed only when their results are required by another instruction. The demand-driven approach matches naturally with the functional programming concept. The removal of side effects in functional programming makes programs easier to parallelize. There are two types of reduction machine models, both having a recursive control mechanism as characterized below.
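Lazy evaluation can be mimicked in a few lines of Python by wrapping each subexpression in a thunk (a zero-argument function), so that no arithmetic happens until the value of a is actually demanded. This is only an illustration of the idea, with hypothetical values for b, c, d, and e; it is not how a reduction machine is implemented.

    b, c, d, e = 4, 3, 10, 5

    # Building the thunks performs no computation; each one is evaluated only
    # when some other thunk demands its result.
    inner1 = lambda: b + 1              # demanded by level2
    inner2 = lambda: d / e              # demanded directly by a
    level2 = lambda: inner1() * c
    a = lambda: level2() - inner2()     # top-level demand

    print(a())   # demanding a triggers the nested demands top-down: 15 - 2.0 = 13.0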
Reduction Machine Models In a string reduction model, each demander gets a separate copy of the expression for its own evaluation. A long string expression is reduced to a single value in a recursive fashion. Each reduction step has an operator followed by an embedded reference to demand the corresponding input operands. The operator is suspended while its input arguments are being evaluated. An expression is said to be fully reduced when all the arguments have been replaced by literal values.
In a graph reduction model, the expression is represented as a directed graph. The graph is reduced by evaluation of branches or subgraphs. Different parts of a graph or subgraphs can be reduced or evaluated in parallel upon demand. Each demander is given a pointer to the result of the reduction. The demander manipulates all references to that graph.
Graph manipulation is based on sharing the arguments using pointers. The traversal of the graph and reversal of the references are continued until constant arguments are encountered. This proceeds until the value of a is determined and a copy is returned to the original demanding instruction.
Table: Comparison of machine models — control flow (control-driven), dataflow (data-driven), and reduction (demand-driven). (Courtesy of Wah, Lowrie, and Li; reprinted with permission from Computers for Artificial Intelligence Processing, edited by Wah and Ramamoorthy, Wiley and Sons, Inc., 1990.)
In this book, we study mostly control-flow parallel computers. But dataflow and multithreaded architectures will be further studied in Chapter 9. Dataflow or hybrid von Neumann and dataflow machines offer design alternatives; stream processing (see Chapter 13) can be considered an example.
As far as innovative computer architecture is concerned, the dataflow and hybrid models cannot be ignored. Both the Electrotechnical Laboratory (ETL) in Japan and the Massachusetts Institute of Technology have paid attention to these approaches. The book edited by Gaudiot and Bic (1991) provides details of some developments on dataflow computers in that period.