0% found this document useful (0 votes)

13 views23 pages

ACA UNIT-1 B Kai Hwang

The document discusses fundamental properties of program behavior and interconnection networks, focusing on computational granularity, program partitioning, and parallelism. It introduces various types of data and resource dependencies that affect parallel execution, including flow, anti-dependence, output dependence, and control dependence. The text also outlines Bernstein's conditions for parallel execution and emphasizes the need for analyzing dependencies to identify opportunities for parallelization in computing environments.

Uploaded by

ashishkumari

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

13 views23 pages

ACA UNIT-1 B Kai Hwang

Uploaded by

ashishkumari

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 23

FM Mnfﬁrirli-' Hflllmminnm

Program and Network

Properties
This daapter covers fundamental properties of program behavior and introduces major classes of
interconnection networlts.V5e begin with a study of computational granularity. conditions for program
partitioning. matching software with hardware. program ﬂow mechanisms. and compilation support for
parallelism. interconnection architectures introduced include static and dynamic networks. Network
complexity communication bandwidth. and data-routing capabilities are discussed.

CONDITIONS OF FARALLELISM
i
Thc cxploitation ofparrllclism has crcatcd a ncw dimension in computcr scicncc. ln ortlcrto
move parallel processing into the mainstream of computing, ll.'I'. Kung H991) has identiﬁed
the need to make significant progress in three lrey areas: c-mnpumrrmr l'i'NJtl'c’ll1‘ ibr parallcl computing,
inrcrprrx-::s.s'0r c-ommurrinnrion in parallcl architoclurcs, and sjrsrorrr irrregrariori lbr incorporating parallel
systems into general computing cnvirorunents.
A thcorctical trcatmcnt ofparallclism is thus ncc-dc-tl to build a basis for thc above challcngcs. ln practice,
parallelism appears in various ibrms in a computing cnvironmcnt. All forms can bc attributcd to lcvcls oi
parallelism, computational granularity. time and space complexities, communication latencies, scheduling
policies. and load balancing. Very often. tradcoffs exist among time, space. performance. and cost factors.

2.1.1 Data and Resource Dependence:

The ability to execute several program segments in parallel requires each segrnent to be independent of the
other segments. Thc independence comes in various forms as deﬁned below separately. For simplicity. to
illustratc thc idca, we corttidcr thc clcpcndcncc relations among instructions in a program. In gcncral, each
code segment may contain one or more statements.
We use a deyieneferrc-r.' graph to dcscribc thc relation s. Thc nodcs ofa clcpcndcncc graph correspond to thc
program statcmcnts [instructions], and thc dircctcd ctlgcs with dificrcnt labels show tl1c ordcrcd rclations
among the statements. The analysis of dependence graphs shows where opportunity exists for parallellzation
and vectorizalion.

Data Dependence Thc ordering rclalionship bctwecn stalcrncnls is indicated by thc data dependence.
Five types of data dependence are defined below:
,.,W,mm,,,,k,.,,,,,,,. 45
['1] Ffon-' dependence: A statement S2 is_fiot1-'-d't*pendt*nr on statement S1 if an execution path exists from
S1 to S2 and if at least one output {variables as-signedj of S1 feeds in as input {operands to be used] to
S2. Flow dependence is denoted as S1 -3» S2.
(21 .-intidepertdertce: Statement S2 is anridependenr on statement S1 if S2 follows S1 in program order and
ifthc output ofS2 overlaps the input to S]. A direct arrow crossed with a bar as in S1 -l—ZI> S2 dicates
antidependence from 5-1 to S2.
(31 Output dependertce: Two statements are oropur-depenrienr if they produce [write] the same output
variable. S1 0-Eb S2 indicates output dependence from Si to S2.
(41 HO d-epertdeneer. Read and write are [IO statements. l.~"O dependence occurs not because the same
variable is involved but because the same file is referenced by both l.-‘O statements.
(5) Unknon-"n dependent-¢'. The dependence relation between two statements cannot be determined in the
following situations:

' The subscript of a variable is itself subscribed.

' The subscript does not contain the loop index variable.
v A variable appears more than once with subscripts having diﬁerent coeﬂicients of the loop
variable.
' The subscript is nonlinear in the loop index variable.
When one or more ofthesc condition s ex ist, a conservative assumption is to c laim tmknown dependence
among the statements involved.

I»)
g Example 2.1 Data dependence in programs
Consider the following code fiagment of four instructions:
SI: Load R1, A IRI 1- Men1ory(A).-'
S2: Add R2. RI IRS! <— (RI) + (R2)!
S3: Move Rl,R3 IR] <—[R3]-I
S4: Store B. R1 fMemorv(B) 1- (R1)!
As illustrated in Fig. 2,la, S2 is flow—dep-endent on SI because the variable A is passed via the register
R1. S3 is antidcpcndent on S1 because of potential conflicts in register content in R1. S3 is output-dependent
on S1 because they both modify the same register R1. Other data dependence relationships can be similarly
revealed on a pairwise basis. Note that dependence is a partial ordering relation; that is, the members ofnot
every pair of statements are related. For example, the statements S2 and S4 in the above program are totally
t'rte2'e']Je’rte2'e'Hf.
Nest, we consider a code fiagrnent involving HO operations:
SI: Read (4), A(T) fllead array A from file 4?
S2: Process fProcess data.’
S3: Write [4]. BU) fwrite array B into file 41‘
S4: Close (4) K.‘-lose file 4.-'
: _ PM‘ I Ifllli t'm'rIq|r_.\.I|n*\ _
4% i Adwmced Corr|pu'tae|'Architecbtiv12

As shown in Fig. 2.11:-, the rcadiwrite statements, S1 and S3, are UCI-dependent on each other because
they both access the same file. The above data dependence relations should not be arbitrarily violated during
program execution. Otherwise, erroneous results may be produced with changed program order. The order in
which staternents are executed in a sequential program is well defined and repetitive runs produce identical
results. On a multiprocessor system, the program order may or may not be preserved, depending on the
memory model used. Dctcrrninisrn yielding predictable results can be controlled by a programmer as well as
by constrained modification of writable data in a shared memory.

(3) Dgpgndanog graph

@—~@
my |!Ct dependence caused try
accessing tho same filo by
the road and wrtto state-
merits

Flg.2.1 Dara and IID dependence: in the ping:-an ofE>-temple 11

Control Dependence This refers to the situation where the order of execution of statements cannot be
determined before run time. For example, conditional statements will not be resolved until run time. Different
paths taken aﬁer a conditional branch may introduce or eliminate data dependence among instructions.
Dependence may also exist between operations performed in successive iterations of a looping procedure.
[n the following, we show one loop example with and another without control-dependent iterations. The
successive iterations ofthe following loop are control-ino'ependt'nr:

[lo 201 = 1, N
Ail) = Ctll
IF tau) .LT. oi Ail) = 1
2'0 Continue

The following loop has eonnnf-dqrendenr iterations:

[I-It llll = l, N
IF [Ail — 1) 1'11. D) Ail) = 0
I0 Continue

Control dependence often prohibits parallelism from being exploited. Compiler techniques or hardware
branch prediction techniques are needed to get around the control dependence in order to exploit more
parallelism.

R-an urce Dependence This isdiifetent from data orcontrol dependence, which demands the independence
of the work to be done. Resource dependence is concerned with the conflicts in using shared resources,
,.,,,,,,,,,,,,,,,,,,,,,,,,,,, 4,
such as integer units, l]oating—poinl units, registers, and memory areas, among parallel events. When the
conflicting resource is anALU. We call it ALL’ a’epenrir-nee.
Ifthe conflicts involve workplace storage, we call it srorngerfepertderiee. [n the case of storage dependence,
each task must work on independem storage locations or use protected access (such as locks or monitors to
be described in Chapter ll] to shared writable data.
The detection of parallelism in programs requires a check of the various dependence relations.
Bernstein’: Condition: In 1966, Bernstein revealed a set of conditions based on which two processes can
execute in parallel. A pr0ce.ss' is a software entity corresponding to the abstraction of a program fragment
defined at various processing levels. We define the input set I, of a process P, as the set ofall input variables
needed to execute the process.
Similarly, the onninr sor O‘; consists of all output variables generated ailer execution oi" the process P,-.
[nput variables are essentially operands which can be fetched from memory or registers, and output variables
are the results to be stored in working registers or memory locations.
Now, consider two processes P, and P; with theirinput sets 1| and I; and output sets O I and O3, respectively.
These two processes can execute in parallel and are denoted P| P; if they are independent and therefore
create deterministic results.
Formally, these conditions are stated as follows:
1, rs 0: = pl
12rwO,=::i} (2.1)
O, rw 03 = p t
These three conditions are known as Bernstein is eonriirrimis. The input set If is also call-od the mod set‘ or
the .n’om.nin of P; by other authors. Similarly, the output set O; has been called the \\1--l"t'l!'r.’.i‘-r.’I or the range of a
process P,-. in terms of data dependences, Bernstein's conditions simply imply that two processes can execute
in parallel if they are flow-independent, antiindcpendent, and output-independent.
The parallel execution of sueh two processes produces the same results regardless of whether they are
executed sequentially in any order or in parallel. This is because the output of one process will not be used
as input to the other process. Furthermore, die two processes do not modify (write) the same set of variables,
either in memory or in the registers.
In general, a set ofprocesscs, P I, P3, , Pg can execute in parallel if Bomstein's conditions are satisfied
on apairwise basis; that is,P| | Pg |P_q | |P;,_ ifand only il"P, |P_,- for all iafij. This is citeirlpliliotl by the
following program illustrated in Fig. 2.2.

P)
g Example 2.2 Detection of parallelism in a program
using Bernstein's conditions
Consider the simple casein which each process is a single HLL statement. We want to detect the parallelism
embedded in the following ﬁve statements labeled P I, Pg, P3, P4, and P, in program order
W _ H‘-r Mclinrw HJ'lI|:'im-.-;n_.-.-|-rs _
43 Ii mama Cnu'npu1JerArchitecuue

.F‘|:C=D><E]
l=5;.w=c;+c
@;,1=s+c} (2.2)
1=:,;c=r.+.w-
.F_',:F=G'+E j
Assume that each statement requires one step to execute. No pipelining is considered here. The depenclenee
graph shown in Fig. 2.2a dernenstrates ﬂow dependence as well as resouree dependence. In sequential
execution. ﬁve steps are needed (Flg. 2.2b).

0P~
0 Pi'*®
P3

{aj A dependenee graph showing both data dependence [solid arrows)

and resource deperdence[dasl1ed snows)

D
E Time

BG C
+

Pg as
1'. A
G Pdlﬂl
s P15 as
I

E 3 LP2 Paps
P5 M
P4
F 1 C A F

[bi Sequential execution in five sens, [c] Parallel execution in three steps,
assuming one step perstatement assuming two adders are available
{no pipelining) per step

Fig.1.1 Denecnbn of parallelism in the prugrarn of Example 11

If two adders are available simultaneously, the parallel execution requires only three steps as shmvn in
Fig. 22¢. Pairwise, there are ll] pairs of statements to check against Bernstein's eoitditicms. Only 5 pairs, P,
| P5, P; | P3, P; | P5, P5 | P3, and P, '_| P5, can execute in parallel as revealed in Fig. 2.2a if there are no
,.,,,,,,,,,,,,_,,,,,.,,,,,,,, 4,
resource conﬂicts. Collectively, only P; I P3 I P, is possible (Fig. 2.20] because P; I P3, P3 I P5, and P, I
P; arc all possible.

In gcncral, thc parallelism relation I is commutative; i.c., P, I implies Pf; I P,-. Bill the relation is not
transitive; i.c., P, I 111,- and 11,- I Pk do not necessarily guarantee P, I Pg For example, we have P, I P, and F,
I P3, but P, II P3, whcrc II mcans P, and P3 cannot cxccutc in parallcl. ln other words, thc ordcr in which P,
and P; arc crtccutcd will make a diifcrcncc in thc computational results.
Thcrciorc, I is not an equivalence relation. l-Iowcver,.P,- I Ff, I Pk implies assoclativity; i.c. (P, I Pi] I P, =
P, I t P,- I P1], since the order in which the parallel executable processes are executed should not make any
diffcrcncc in thc output scts. lt should bc notcd that thc condition 1, {"1 I,-as cl docs not prcvcnt parallelism
bctwccn P, and i
Violations ofany one or more ofthe three conditions in 2.1 prohibits parallelism between two processes.
In general. violation of any one or more ofthe 3n('n W2 Bernstein's conditions among n processes prohibits
parallelism collectively or partially.
ln gcncral, data dcpcndcncc, control dcpcndcncc, and rcsourcc depctrdcn-cc all prcvcnt parallelism from
bcirrg cxplo itab lc.
The statement-level dependence can be generalized to higher levels, such as code segment, subroutine,
process, task, and program lcvcls. Thc dcpcndcncc of two highcr lcvcl objects can bc inicrrcd from thc
dcpcnidcnoc of statements in the corresponding objects. Thc goals of analyzing thc data dcp-cndcncc,
control dependence, and resource dependence in a code an": to identify opportunities for parallelization or
vectorization. Hardware techniques for detecting instruction-level parallelism in a running program are
dcscribod in Chapter 12.
Very often program restructuring or code transformations need to be performed before such opportunities
can bc rcvcalcd. Thc dcpcndcncc tclations arc also used in instruction issuc and pipclinc scheduling operations
dcscribod in Chapter 6 and 12.

2.1.2 Hardware and Soﬁnuare Parallelism

For implementation ofparallclism, we need spccia] hardware and software support. In this section, we address
these support issues. We first distinguish between hardware and soflware parallelism. The mismatch problem
bctwccn hardware and sofiwarc is discusscd. Thcn wc dcscribc thc fundamental conocpt of compilation
support nccdod to closc thc gap bctwccn hardware and software.
Details of special hardware functions and software support for parallelism will be treated in the remaining
chapters. The key idea being conveyed is that parallelism cannot be achieved flee. Besides theoretical
conditioning, joint cfibrts bctwccn hardware dcsigncrs and swoihvarc programmers arc nccdcd to cxploit
parallelism in upgrading computer pcrformancc.

Hardware Parallelism This refers to the type of parallelism deﬁned by the machine architecture and
hardware multiplicity. Hardware parallelism is ottcn a ﬁrnction of cost and performance tradeoffs. It displays
thc resource utilization patterns of simultaneously cxccutablc operations. It can also indicate the pcalr
performance of thc proocs sor rcsourccs.
One way to characterize the parallelism in a processor is by the number of instruction issues per machine
cycle. If a processor issues It instructions pcr machine cycle, then it is called a It-i.ssrrc process-o r.
_ F|'>r'MfGJ'|Ili' H“ I'mt!I;|(1rHt\ _
50 i Aohioinccd Computer architecture

A conventional pipelinod processor takes one machine cycle to issue a single instruction. These types of
processors an: called one-issue machines, with a single instruction pipeline in the processor. ln a modem
processor, two or more instructions can be issued per machine cycle.
For example, the lntel i96OCA was a three-issue processor with one arithmetic, one memory access, and
one branch instruction issued per cycle. The LBM FJSC!System 6000 is a four-issue processor capable of
issuing one arithmetic, one memory access, one floating-point. and one branch operation per cycle.
Sofhtnnre Hnra.llcl.i:m This type of parallelism is revealed in the program profile or in the program flow
graph. Software parallelism is it function of slgoritlun, programming style, and program design. The program
flow graph displays the partems of simultaneously executable operations.

.39 Example 2.3 Mismatch between software parallelism and

hardware parallelism (Wen-Mel Hwu,1991)
Consider the example program graph in Fig. 2.3a. There are eight instructions {four loans and fourririrhrrierie
operations) to be executed in three consecutive machine cycles. Four load operations are peribrrned in the
first cycle, followed by two nmlri;J{i- operations in the second cycle and two addrlsubrrric-I operations in the
third cycle. Therefore. the parallelism varies fi'om 4 to 2 in three cycles. The average software parallelism is
equal to SE3 = 2.6? instructions per cycle in this example prograrn.
New consider execution of the same program by a two-issue processor which can execute one memory
access (load or write) and one arithmetic {aid subrraer. mrrlriflrilt-1 etc.) operation simultaneously. Willi this
hardware restriction, the program must execute in seven machine cycles as shown in Fig. 2.3b. Therefore.
the hernia-‘are par'oHel'r'.sm displays an average value of Sf? = 1.14 instmctions executed per cycle. This
demonstrates a mismatch between the software parallelism and the hardware parallelism.

Q Cyda1
{Q Cycla2

--"G ‘1Y='='3
es-1 e e e “ Q, Cydo4
Cyda 2 Q Q G Cyde 5
Cydo 2. Q. ‘Q 9 Cycle s

A El
Q Cycle 7
Ly: Load opsation
Xy: Multiply operation 5
{ai snﬁwaa pad“-Em {bl Hadwae paalslism

Fig. 2.3 E:-reetrclng an in-rarrtp-lc program by a 1:we~lssue swperscalar processor

yaw, mMm,yyy _ 5|
Let us try to match the software parallelism shown i11 Fig. 2.3a in a hardware platform oi‘ a dual-processor
wstem, where single-issue processors are used. The achievable hardware parallelism is shown in Fig. 2.4.
where US stands for lonrfflsrore operations. Note that six processor cycles are needed to execute the I2
instructions by two processors. .S'| and .5‘; are two inserted stone operations, and I5 and lg, are two inserted
form‘ operations. These added instructions are needed for interproccssor communication through the shared
memory.

Q Cydoi

99 Q Clyde-2
US: Loaoilﬁ-‘ore op-orainn
JIC Mtlqaly operation
Gilda 3 +.l—: Aodﬁubioet operation

so Cyde 4}
Added
I‘tEtlI't.|Glol'E
or iPe
I’ Cycle s

1.-ﬁes 0 Cycle 6

A s
Fig.1.4 Dual-processor omcutloo niche program in Fig. 13:.

Of the many types of sottware parallelism, two are most frequently cited as important to parallel
programming: The ﬁrst is mnrrol parol'!ch's.-rt, which allows two or more operations to be performed
simultaneously. The second type has been called darn porrillclisrri, in which almost the same operation is
perfomied over many data elements by many processors simultaneously.
Control parallelism, appearing in the form of pipelining or multiple functional units, is limited by the
pipeline length and by the lmlltiplicity of functional units. Both pipelining and functional parallelism are
handled by thc hardware; programmers need take no special actions to invoke thorn.
Data parallelism offers thc highest potential for concurrency. It is practiced inboth SIMD and MIM.D modes
on MPP systems. Data parallel code is easier to write and to debug than control parallel code. Synchronization
in SIMD data parallelism is handled by the hardware. Data parallelism exploits parallelism in proportion to the
quantity ofdata involved. Thus data parallel computations appeal to scaled problems, in which the pcrforniancc
of an MPP system does not drop sharply with the possibly small sequential fraction in the program.
To solve the mismatch problem bctwccn software parallelism and hardware parallelism, one approach is
to develop compilation support, and the other is through hardware redesign for more efficient exploitation of
parallelism. These two approaches must cooperate with each othcrto produce the best result.
FM Mtﬁruw titmpwtns _
51 i Aahtcvtced Con'tpu'terAt'chtitectut'e

Hardware processors can be better designed to exploit parallelism by an optimizing compiler. Pioneer work
in processor technology with this objective was seen in the IBM 801, Stanford MIPS, and Berkeley IUSC.
Such processors use a large register filc and sustained instruction pipelining to cxceutc nearly one instruction
per cycle. The large register file supports fast access to temporary values generated by an optimizing compiler.
The registers are exploited by the code optimizer and global register allo-cater in such a compiler.
The instruction scheduler exploits the pipeline hardware by filling brunch and Juan’ delay slots. [rt
superscalar processors, hardware and sofiware branch prediction, multiple instruction issue, speculative
execution, high bandwidth instruction cache, and support for dynamic scheduling are needed to facilitate the
detection of parallelism opportunities. Further discussion on these topics can be found in Chapters 6 and 12.

1.1.3 The Role of Compilers

Compiler techniques are used to exploit hardware featttres to improve pcrfortnartee. The pioneer work on the
IBM PL.8 and Stanford sttss compilers aimed for this goal. Other early optimizing compilers for exploiting
parallelism included the CDC STACK.LlB, Cray CFT, Illinois Parafrase, Rice PFC. Yale Bulldog, and Illinois
IMPACT.
In Chapter ll], we will study loop transformation, software pipelining, and features developed in existing
optimizing compilers for supporting parallelism. Interaction between compiler and arcllitecttue design is a
necessity in modern computer development. Conventional scalar processors issue at most one instruction
per cycle and provide a few registers. This may causc excessive spilling of temporary results from the
available registers. Therefore, more software parallelism may not improve performance in conventional
scalarprocessors.
There exists a vicious cycle of limited hardware support and the use of a naive compiler. To break the
cycle, ideally one must design the compiler and the hardware jointly at the same time. Interaction between
the two can lcad to a better solution to the mismatch problem between software and hardware parallelism.
The general guideline is to increase the flexibility in hardware parallelism and to exploit software
parallelism in control-intensive programs. Hardware and softtvaie design tradcoffs also exist in terms of cost.
complexity, expandability, compatibility, and perfort-nancc_ Compiling for multiproccssors is much more
challenging than for uniprocessors. Both granularity and communication latency play important roles in the
code optimization and scheduling process.

PROGRAM PART ITIOHI NG AND SCHEDULING

_ This section introduces the basic definitions of computational granularity or level of
parallelism in programs. Communication latency and scheduling issues are illustrated
with programming examples.

2.1.1 Grain Sizes and Latency

Grrtirt size or grrmtdririrtr is a measure of the amount ofcomputation involved in a software process. The
simplest measure is to count the number of instructions in a grain [program segment). Grain size determines
thc basic program segment chosen for parallel processing. Grain sizes are commonly dcscribcd as fine.
me:-fiuni or course, depending on thc processing levels involvod.
,,,,W, ,,,,_,,_,,,,,.,,,,,,, 5,
Latency is a time measure of the communication overhead incurred between machine subsystems. For
example, the mcrrrcry infers-_r-' is the time required by a processor to access the memory. The time required for
two processes to synehroniioe with each other is called l.l'lB s'r=nc.irronr':orri'Jn l'are'nc_t-'. Computational granularity
and communication latency are closely related, as we shall see below.
Parallelism has been exploited at various processing levels. As illustrated in Fig. 2.5, five levels of
program execution represent different computational grain sizes and changing communication and control
requirements. The lower thc level, the finer the granularity of the software processes.
[n general, thc execution of a program may involve a combination ofthesc levels. The actual combination
depends on the application, formulation, algorithm, language, program, compilation support, and hardware
characteristics. We characterize below the parallelism levels and review their implementation issues from the
viewpoints ofa programmer and of a compiler writcr.

Instruction Level At the lowest level, a typical grain contains less than 20 instructions, calledfine grain
in Fig. 2.5. Depending on individual programs, fine-grain parallelism at this level may range from two to
thousands. Butler et al. (1991) has shown that single-inst1'uetion-stream parallelism is greater than two. Wall
{I991} finds that the average parallelism at instruction level is around five, rarely exceeding seven. in an
ordinary program. For scientific applications. Kumar (1988) has measured the average parallelism in the
range of 500 to 3000 Forl:ra.n statements executing concurrently in an idealized environment.
,

|_m,-9| 5 Jobs or programs

-Coarse grain
Suop-rograms, job T
Level 4 stops or rotated
parts of a progam
>~ lvlacllum grain
Increasing
conrnunleatlon Procodires,
Laval 3 suh-routlrrn-s, taslas, Higher dagoa
demand and of parallelism
schaclullng L) or no-routines
overhead
il
Nonrneurslvo to-ops or
Lovol 2
unfolded ltaratlons

t» Flna gra In

Instructions or
Level 1 statements
Y
i}
Fig. 2.5 Levels of parallelism in program execution on modem cornprueers {Reprinted from l-twang. Prue. JEEE,
October 1937}

The exploitation of ﬁne-grain parallelism can he assisted by an optimizing compiler which should be able
to automatically detect parallelism and translate the source code to a parallel form which can be recognized
by the run-time system. Instruction-level parallelism can be detected and exploited within the processors, as
we shall sec in Chapter 12.
ma if run! |:'nm;u;um1
54 i AoManccdCon'|po'tcr.luchritectusc

Loop Level This corresponds to the iterative loop operations. A typical loop contains less than SUD
instructions. Some loop operations, if independent in successive iterations, can be vectorlzed for pipelined
execution or for lock-step ertocution on SIMD machines. Some loop operations can be self-scheduled for
parallel execution on MIMD machines.
Loop-level parallelism is often the most optimized program construct to execute on a parallel or vector
-computer. However, recursive loops are rather diﬂieult to parallclizc. Vector processing is mostly exploited
at the loop level {level 2 in Fig. 2.5) by a vectorizing compiler. The loop level may also he considered a line
grain of computation.

Procedure Level This level corresponds to medium-grain parallelism at the task, procedural, subroutine,
and coroutine levels. A typical grain at this level contains less than 2000 instructions. Detection ofparallelism
at this level is much more diflicult than at the finer-grain levels. lnterprocedural dependence analysis is much
more involved and history-sensitive.
{Iommunication requirement is often less compared with that required in MIMD execution mode. SPlv'I.D
execution mode is a special case at this level. Multitasking also belongs in this category. Significant efforts
by programmers may be nccdod to restructure a program at this level, and some compiler assistance is also
needed.
Subprogram Level This corresponds to the level ofjob steps and related subprograms. The grain size may
typically contain tens or hundreds of thousands of instructions. Job steps can overlap across different jobs.
Subprograms can be scheduled for dii1"erent processors in SPMD or MPMD mode, often on message-passing
multico mp uters.
Multiprogramrning on a uniprocessor or on a multiprocessor is conducted at this level. Traditionally,
parallelism at this level has been exploited by algorithm designers or progranuners, rather than by compilers.
Good compilers for exploiting medium- or coarse-grain parallelism require suitably designed parallel
programming languages.

job |‘Pnogrnmj Level This corresponds to the parallel execution oi‘ essentially independent jobs (programs)
on a parallel computer. The grain size can be as bigh as millions of instructions in a single program. For
supercomputers with a small number of very powerful processors, sueh coarse-grain parallelism is practical.
Job—level parallelism is handled by the program loader and by the operating system in general. Time-sharing
or space-sharing multiproccssors explore this level of parallelism. In fact, both time and space sharing are
extensions of multipmgramm ing.
To surnmarizc, line-grain parallelism is often exploited at instruction or loop levels, preferably assisted by
a parallelizing or vectorizing compiler. Medium-grain parallelism at the task orjob step demands significant
roles for the programrner as well as compilers. Coarse-grain parallelism at thc program level relics heavily
on an eilective OS and on the efficiency ofthe algorithm used. Shared-variable communication is often used
to support fine-grain and medium-grain computations.
Message-passing multicomputers have been used for medium- and coarse—grain computations. Massive
parallelism is ofien explored at the fine-grain level, such as data parallelism on SIMD or Milt-ID computers.
Communiontion Lntency By balancing granularity and latency, one can achieve better performance of
a computer system. Various latencies are attributed to machine architecture, implementing technology, and
communication pottcms involved. The architecture and tcclmology afiiect the design choices for latency
tolerance between subsystems. 1n fact, latency imposes a limiting factor on the scalability of the machine
Pro-gum and Network Props ' i 55

sine. For example, over the years memory latency has increased with respect to processor cycle time. ‘Various
latency hiding or tolerating techniques will be studied in Chapters 9' and I2.
The latency incurred with interproeessor communication is another important parameter for a system
designer to minimize. Besides signal delays in the data path, LPC latency is also affected by the communication
pattems involved. In gcnctal, n tasks communicating with each other may require ntn - l)f2 communication
links among them. Thus the complexity grows quadratically. This leads to a communication bound which
limits the number of processors allowed in a large computer system.
Communication patterns are determined by the algorithms used as well as by the architectural support
provided. Frequently encountered patterns include permurruions and br0on‘c¢r.sr. rrtrrfricttsr, and c-oryierenc-e
{many-to-many) eommunications. The communication demand may limit the granularity or parallelism. Very
oilen tradoo ﬁ's do exist between the two.
The communication issue thus involves the reduction of latency or complexity, the prevention ofdeadlock,
tninintizing blocking in communication patterns, and the tradeoff between parallelism and communication
overhead. We will study techniques that minimize communication latency, prevent deadlock, and optimize
grain size in latter chapters of the book.

2.1.2 Grain Packing and Scheduling

Two ﬁtndamental questions to ask in parallel programming are: (i) How can we partition a program into
parallel branehes, program modules, mierotasks, or grains to yield the shortest possible execution time? and
(ii) What is the optimal size ot" concurrent grains in a computation‘?
This grain-size problem demands determination of both the number and the size of grains {or microtasks)
in a parallel program. Of course, the solution is both problem-dependent and machinedependent. The goal is
to produce a short schedule for fast execution of subdivided program modules.
There exists a trradeoffbetwcen parallelism and scheduling." synchronization overhead. The time oomplexity
involves both computation and eommtmication overheads. The program partitioning invoh-es the algorithm
designer, programmer, compiler, operating system support, etc. We describe below a grain ;mt-It-ing approach
introduced by Kruatraehue and Lewis (I938) for parallel programming applications.

I»)
8] Example 2.4 Program graph before and after grain packing
(Kruatrachue and Lewis, 1988)
The basic concept of program partitioning is introduced below. In Fig. 2.6, we show an example program
graph in two different grain sizes. A program graph shows the strucnirc of a program. It is very similar
to the dependence graph introduced in Section 2.1.1. Each node in the program graph corresponds to a
computational unit in the program. The grain .st':e is measured by the number of basic machine cycle-5
-[including both pro-eessor and memory cycles) needed to execute all the operations within the node.
We denote each node in Fig. 2.6 by a pair (rt. st, where rt is the node mime {id} and s is the grain size of the
node. Thus grain size reflects the number of computations involved in a program segment. Fine-grain nodes
have a smaller grain size. and coarse-grain nodes have a larger grain size.
1 _ . 1 I.‘ IBM‘ ln¢r.q|r_.u|»rs

Sﬁ i Aahvmced Compuoernrchitecuue

The edge label (rt, cl) between two end nodes speciﬁes the output variable rt from the souroe node or the
input variable to the destination node. and the communication delay d between them. This delay includes all
the path delays and memory latency involved.
There are 1? nodes in the ﬁne-grain program graph (Fig. 2.6a) and 5 in the coarse-grain program graph
(Fig. lob]. The coarse-grain node is obtained by combining {grouping} multiple line-grain nodes. The line
grain corresponds to the following program:

Q 6""
tsﬂsd
. , was
C‘! 11"_o.-t _o.'5orF.‘"
"en
o"*
"'4 at Yd

[x,t'_| {lnp-it, delay]

mg. [u.k) - (output, delay) ‘M W,‘ mg

[bl Coarse-grain program graph

{at Fine-grain program graph before packing aﬁgf pagkjng,

Fig. 2.! A program graph before and alter grain paddrt; in Example 1.4 {Modiﬁed from Kmatradwe and Lewis.
lEE.E Seflvm.ns,]an. 1933]

Var rt, b,c-, d,e,_,i", g, Fr, i,_,r', k, I, m, n,o,p, q

Begin
. rr:= :=e><f
_. {J := :=d><_f
:= =j><k
= :=4>< l
_ = :=3>-Cm
= :=n><r'
g:=.n><h :=o><ft
'= P:ovu-.pt-.:~.1-— >4 E. i li in-mi :~1:'="t*:'=-E"F‘~":-P ~e"t:o:5- >v-" -:=p><q
Home
-:.~.—
.‘°P .“'~?"*. '~‘=*!-" “ r:=n‘><e
End
,,_,W,, mMN,k,¢ 5,
Nodes l, 2, 3, 4, 5, and 6 are memory reference (data fetch} operations. Each takes one cycle to address
and six cycles to fetch from memory. All remaining nodes (7 to 1?) are CPU operations. each requiring two
cycles to complete. After packing, thc coarse-grain nodes have larger grain sizes ranging from 4 to R as
shown.
The node (A, 8) in Fig. 2.61) is obtained by combining the nodes (1, 1), (2, 1), (3, 1}, [4, I), (5, 1], (6, 1),
and (ll. 2) in Fig. 2.6a. The grain size, 8. of nodeAis the summation of all grain sizes (1 + l + l + 1 * 1 +
1 + 2 = 8) being combined.

The idea of grain packing is to apply fine grain 1"n'st in order to achieve a higher degree of parallelism.
Then one combines (packs) multiple fine~g1'ain nodes into a coarsegrain node if it can eliminate unnecessary
communications delays or reduce thc overall scheduling overhead.
Usually, all fine-grain operations within a single cc-arse-grain node arc assigned to the same processor for
execution. Fine-grain partition of a program often demands more interprocessor communication than that
required in a coarse-grain partition. Thus grain packing offers a tradeofi‘between parallelism and scheduling!
communication overhead.
Internal delays among fine-grain operations within thc same coarse-gtain node are negligible because thc
communication delay is contributed mainly by interprocessor delays rather than by delays within the same
processor. The choice of the optimal grain size is meant to achieve the shortest schedule for the nodes on a
parallel computer system.

11}
11
12
7 14 t 14*
T|m a 13: _. 15

2°'— 21
22-— F3
I _..._-3
_=-=.
24 24
2a 2!'f__§ I
:.=.2=

‘I
"' _-mi
E

_.
es-4
#33 Iii
[3] Flno grain [Fig. 2.63] [h-) Coarse grain [Flg. 2.61:]

Fig. 2.‘! Scheduling of the ﬁns-grain and coarse-grain programs (arrows: idle dme; shaded area: communication
ﬁche!
_ War MIGIIILH H“ I'mr!I;|(1rtnr\ _
5-B i Aoktolnced Colrnputer Architecture

With respect to the line-grain versus coarse-grain program graphs in Fig. 2.6, two multiprocessor schedules
are shown in Fig. 2.7. The ﬁne—grain schedule is longer (42 time units) because more communication delays
were included as shown by the shaded area. The coarse-grain schedule is shorter (38 time units) because
communication delays among nodes 12, 13, and 14 within the same node D (and also the delays among 15,
115, and 17 within the node E) are eliminated after grain packing.

2.1.3 Static Multiprocessor Scheduling

Grain packing may not always produce a shorter schedule. In general, dynarnic multiprocessor scheduling
is an NP-hard problem. Very olien heuristics are used to yield suboptimal solutions. We introduce below the
basic concepts behind multiprocessor scheduling using static schemes.

Node Duplication In order to eliminate the idle time and to further reduce the communication delays
among processors, one can duplicate some ofthe nodes in more than one processor.
Figure 2.3a shows a schedule without duplicating any of the ﬁve nodes. This schedule contains idle time
as well as long interproccssor delays {S units) between Pl and P2. In Fig. 2.3b, node A is duplicated into A’
and assigned to P2 besides retaining the original copy A in P1. Similarly, a duplicated node C’ is copied into
Pl besides the original node C in P2. The new schedule shown in Fig. 2.Sb is almost 50% shorter than that
in Fig. 2.3a. Thc reduction in schedule time is causcd by elimination ofthe (a, 8] and (c, B] delays between
the two processors.

P, P2 P, P2 P2 P1 P2

4
5 a,1 a.1
e 4
5 4

o o re H
3,1 III B

l D 9

‘U
P’
Q u,1 e,1 P ah.-5
'i‘ 13
.4?"B
‘o
o1* ~5- _@4»_n-
2“ ca:- if
5
II
‘
86
-"= a
l\Jto
-n InIt-I
'-I

[a] Schedule without node dupd lcaion [bl Schooua with node clu plleatlon {A-1 A and A’; C -1» C and -C’)

Fig. 2.8 Node-duplication scheduling to eliminate cmtamutlcarlon delays between processors (I: idle rims;
shaded areas: communication delays}

Grain packing and node duplication are often usedjointly to determine the best grain size and corresponding
schedule. Four mqor steps are involved in the grain determination and the process ofscheduling optimization:
Step l . Construct a fine-grain program graph.
Step 2. Schedule the fine-grain computation.
Step 3. Perform grain packing to produce the coarse grains.
Step 4. Generate a parallel schedule based on the packed graph.
Pm, mMm,k,¢ 5,
The purpose ofmultiproeessor scheduling i5 to obtain a minimal time schedule for the eompulations
involved. The following example clarifies this concept.

5*) Example 2.5 Program decomposition for static multipro-

cessor scheduling (Kruatrachue and Lewis, 1 988)
Figure 2.9 shows an example ofhow to calculate the grain size and communication latency. ln this example,
two 2 >< 2 manic-es A and B are multiplied to compute the sum ofthe four elements in the n:.a"|.|lting product
matrbt C = .-I >< B. There are eight multiplications and sea-'-en additions to be perfonned in this pmgrzmi, as
written below:
[A11/412] X[51131:] =[c11C12]
1411142 321322 C2152
5“ -“+1 *1 B11 A12 *1 33

lc=1er2 G-rmiazeiﬂi Gyg-15;-93

‘t1t><5'-g G1g‘*11><511"-‘\:2><32;
CPU CYCLE CPU CYCLE
Mme w An. D1 15 Move L PARLD1 213
Move W B1101, D2 15 Move L PARZU2 20
MPTYD1, D2 T1 ADDL DI, D2 B
MOVE L D2. PAR 219 MOVE L D2. P5-UM 20

{a)GII"1 alze cdulaim I‘| M6-B000 aasemlily code e120-MHz eyele-

E {S I-in E d=T'l+T2+T3l'T4+T5+TG
=ZO+2D+32+2‘D+2CI+10ﬂ
T1 T2T3_m T5 =212cydea
T3 = 32-on ianarllssen tine at 20 Mops
nom1aizedtoM68ﬁDC|cydea120 MI-lz.
T6 = oehy one ta eoilware proboeh {aaeune 5
Mow Iemiotlene, 1130;
{h}1Calc1ulauen oi oemmuueaton delay d

A 4:» C D E F -1-1,1
d d *

N G 5 u '5'
P
Sun
(<>1FIw~;:I-I1 WWW each
Fig. 2.! Caicu-laeion of graln size and comrmmdcadon delay for the pang:-nan graph in Example 15 {Cotnecsy of
Kruatnehue and Lewis: reprinted witli permission irmn IEEE Sofhmie. 19$}
H _ l'h1'Ml.'I;IﬂH1I' HI" l'nrr.q|r_.u||rs 5
HI i Aohlolnced Corrtpu'ae|'At'chitecbt11re

C11:-’l|1><3|1'l'/l|2><32|
C12 =fl|1><3|2‘l'fl|;'><B22
C21:-"i21><311+-"i2:><B:1
C22 = 11:1 >< 5'12 + -"122 >< 322
5Hm =C11+C1:+C:1+C-'2:
As shown in Fig. 2.9*a., the eight multiplications are performed in eight 431 nodes, each ofwh ich has a grain
size of 101 CPU cycles. The remaining seven additions are performed in a 3-level binary tree consisting of
seven $ nodes. Each additional node requires S CPU cycles.
The interprocessor cornrnunication latency along all edges in the program graph is eliminated as rt = 212
cycles by adding all path delays between two communicating processors (Fig. 2.9b).
A fine-grain program graph is thus obtained in Fig. 2.9c. Note that the grain size and communication delay
may vary with the difierent processors and communication links used in the system.
Figure 2.10 shows scheduling of the fine—grain program first on a sequential uniprocessor (Pl) and then
on an eigltt-processor [Pl to P8) system (Step 2}. Based on the fine-grain graph (Fig. 2.9c), the sequential
execution requires 364 cycles to complete without incurring any communication delay.
Figure 2.lDb shows the reduced schedule of T41 cycles needed to execute the I5 nodes on E processors
with incurred communication delays (shaded areas}. Note that the communication delays have slowed clown
the parallel execution significantly, resulting in many processors idling {indicated by 1), except for Pl which
produces the final sum. A speedup factor ol‘li64l"1‘4l = 1.16 is observed.
P
__ Q il _ P1 P2 P3 P4 P5 Pa P? Pa
A Cl
o E F o 1-1 C
"11 1
B 101 . . 1 . '
2o2—
303-i‘
D 313
45¢ i 321 me
" E
“M sos T 522.. 1
G

ﬁlﬁ
T41 &\\W%/
B24
%/
117,:
F*—"-{ft
"2 Ill]
M
8-48- Ir]
M Isl
r 554 Iil
[a] Asoquontlal schedule [b] A parallel schedule

Fig. 2.10 Seqoertrlal versus parallel sehed1.tI1-tgin Exarnple 15

Next we show how to use grain packing (Step 3] to reduce the communication overhead. As shown in
Fig. 2.11, we group the nodes in the top two levels into four coarse-grain nodes labeled V, W, X, and Y. The
,.,,E,,,,,,,,,._,,,,,,,,,,,,,, ‘I
remaining three nodes (N, O, P) then form the ﬁfth node Z. Note that there is only one level of interprocessor
communication required as marked by rt in Fig. 2.lla.

O P1 P2 P‘3 F’4

co <-:.
°
ca
E
E ><
G -<

o e_ o _
'l'l I
202 J
210 -
,1 . ' <1 Time If

, Z

@210 ' v=w=x=~r=1o1+1o1+s=21o {M5

z=a+a+a=24 '
[oi Grain packing of 15 small nodes |nto5 bigger no-dos [bi Parallel s1::i'1eciutoforti'1o packed program

Hg. 2.11 Parallel seheduiing for Exarrnpie ‘1.5 atrer grain pedcirtg to reduce eornnnmieariort delays

Since the maximum degree of parallelism is now reduced to 4 in the program graph, we use only four
processors to cxccutc this ooarsc-grain program. Aparallcl schedule is worked out (Fig. 2. 1 1] for this program
in 4415 cycles, resulting in an improved speedup of8t'1-H446 — 1.94.

PROGRAM FLOW MECHANISMS

2
Conventional computers arc based on il control ﬂow mechanism by which thc order ofprogram
execution is explicitly stated in the user programs. Datailow computers are based on a data-
driven mechanism which allows the execution of any instruction to be driven by data (operand) availability.
[Jatatiow computers emphasize a high dcgrcc of parallelism at the fine-grain instructional lcvcl. Reduction
computers are based on a demand-driven mechanism which initiates an operation based on the demand for
its results by other computations.

2.3.1 C-orrtml Flow\i"ersus Data Flow

Conventional yon Ncu mann computers use a progmrrr -:-onmer (PC) to sequence the execution of insir|.tctio1:1s
in a program. The PC is sequenced by instruction ﬂow in a program. This sequential execution style has been
called cm1rroi-dr1'w_v1, as program ﬂow is explicitly controlled by programmers.
A uniprocessor computer is inherently sequential, due to use of the cone-oi driven mechanism. However,
control flow can he made parallel by using parallel language constructs or parallel compilers. In this book,
we study primarily parallel control—ilow computers and their programming techniques. Until the data-driven
or demand-driven mechanism is proven to be cost-effective, the control-flow approach will continue to
dominate thc computer industry.
I _ Par I J11!!!‘ l'mrJI||r_.u|n¢\
BI i Aahmtced Corr|pn'te|'Ar'chritecbtrn2

ln a daraflow computer, the execution of an instruction is driven by data availability instead of being
guided by a program counter, ln theory, any instruction should be ready for execution whenever operands
become available. The instructions in a data-driven program are not ordered in any way. instead of being
stored separately in a main memory, data are directly held inside instructions.
Computational results. {rrnra mirens} are passed directly between instructions. The data generated by an
instruction will be duplicated into many copies and forwarded directly to all needy instructions. Data tokens,
once consumed by an instruction, will no longer he available for reuse by other instructions.
This data-driven scheme requires no program counter, and no eonlrol sequencer. However, it requires
special mechanisms to detect data availability, to match data tokens with needy instructions, and to enable
the chain reaction of asynchronous instruction executions. No memory sharing between instructions results
in no side effects.
Asynchrony implies the need for handshaking or token-matching operations. A pure dataflow computer
exploits fine—grain parallelism at the instruction level. Massive parallelism would be possible if the data-
driven mechanism could be cost-effectively implemented with low instruction execution overhead.
A Dataflow Architecture There have been quite a few experimental datafiow computer projects. Anrind
and his associates at MIT developed a tagged-token architeetme for building dataflow computers. As shown
in Fig. 2.12, the global architecture consists ofn processing elements {PEs} interconnected by an n >< n routing
network. The entire system supports pipclined dataflow operations in all n PEs. Inter-PE eontnntnieations are
done through the pipelinod muting network.

From routing network

O’
Local path "5
I P
D §
Global pan

I n i< n Rotting Hatwortr 1

s-
gal Compute
Tao

-
H 81¢
Processing Etamo 3

To Routing Network
[3] Tho global architecture [bi lntorla design of a pro-coaxing olomont

Flg.I.1‘Z The l*'lIT uggnd-'|tolten clamﬂow oornputier {aehptted from Arvind and lannuoel. 1936 with permission)

Within each PE, the machine provides a low-level Iokm-nmfr-hing mechanism which dispatches only
those instructions whose input data [tokens] are already available. Each datum is tagged with the address of
.,W, mmN,kkg _._ H
thc instruction to which it belongs and the context in which thc instntction is being executed. Instructions
are stored in the program memory. Tagged tokens enter the PE through a local path. The tokens can also
be passed to other PEs through the routing network. All internal token circulation operations are pipclincd
without blocking.
{Jae can think oi‘ the instruction address in a datafiow computer as replacing the program counter, and
the context identifier replacing the frame base register in a control flow computer. It is the machine’s job to
match up data with the same tag to needy instructions. In so doing, new data will be produced with a new tag
indicating the successor instructiontsl. Thus, each instruction represents a synchronization operation. New
tokens are formed and circulated along the PE pipeline for reuse or to other PEs through the global path,
which is also pip-clined.
Another synchronization mechanism, called the 1-so-in-runs, is provided within each PE. Tl1e 1'-structure is
a tagged memory urtit for overlapped usage ofa data structure by both the producer and consumer processes.
Each word Of I-stt't.tCtt.tDE uses a I-bit tag indicating whether the word is £'.|'flpI_'1-‘, isjirli, or has pending react‘
requests. The use ofI-structure is a retreat fi'om the pure dataflow approach. The purpose is to reduce excessive
copying of large data structures in datallow operations.

Ir)
g Example 2.6 Comparison of dataﬂow and control-ﬂow
computers (Gajski,Padua, Kuel-r,and K|.|hn,1982)

The dataflow graph in Fig. 2.l3a shows that 24 instructions are to be executed (8 niw'des, 8 mulriplres, and
8 adds). A dataflow graph is similar to a dependence graph or program graph. The only difference is that data
tokens are passed around the edges in a dataflow graph. Assume that each rrriri rmri'rr}Jl_t-', and dit-'r'rr’c requires
1, 2, and 3 cycles to complete, respectively. Sequential execution oi‘ the 24 instructions on a control [low
uniprocessor takes 48 cycles to complete, as shown in Fig. 2.l3b.
On the other hand, a dataflow multiprocessor completes the execution in 14 cycles in Fig. 2.13c. Assume
that all the cxtcmal inputs (d,-, cl-,_f; ihri = l, 2, . ..,8 and q-,'j are available bcibrc entering the loop. With i'ot.n'
processors, instructions. 0|, Hg, 113, and or are all ready for execution in the l‘n'st three cycles. The results
prod uccd then triggcr the -execution of115, in | , rig, and .n-_|- starting finm cycle 4, The data-driven chain reactions
arc shown in Fig. 2. l 3e. The output ca is the last one to produce, due to itsdcpcndcncc on all thcprcvious cl-‘s.
Figure 2. l 3d shows the execution of the same sct of computations on a conventional multiprocessor using
shared memory to hold the intermediate results [s,- and r,- for i = 1, 2, 3, 4). Note that no shared memory is
used in the dataflow implementation. The example does not show any time advantage of datafiow execution
over control flow execution.
The theoretical minimum time is I3 cycles along the critical path rt | h |c|r-3 . “C8. The chain reaction control
in datallow is more ditlicult to implement and may result in longer overhead, as compared with the uniform
operations performed by all the processors in Fig. llid.
H _ rhr I.‘ IBM!‘ I l'nrr.q|r_.u||r\ 5
'54 i Aohtotnced Cornptmer Architecture

1"P"1d~ 9-f ‘*1 "152 9243 93d-ti ‘*4 do "5 dc “ed? “Wis °s
cD=0
iorlfromltoﬂdo
h I
232.; 312 33 34 35 36 3? 36
bl; .-. f1 f f f4 f fa f? f

e
"cg: .l?_~_p- f-
_¢-
+ —-In
'1-|'*2"-Baht: be babtbe
Otllﬂl-.l'l3, U, C ' f1 1:3 Cd 13-5 gs‘ QB’

[ajﬁtsampieptogramandttsdataflowgraph

1 4 6 7? 1|} 12 #3 46 4-B
I *1 I b1l*=1l as I be Isl
[bi Sequential execution on a unlpro-oeesor In #8 cycles

4 (7 1491011121314
*1 I *5 l°1|°2|°al°#l°5l°sl°t|e°B|

£1ti’ f.
K J
.%.
E‘
Zr .%.
gr_
ha‘ mg‘E
at . _.

[c] Data-driven exotatﬂon on a 4-1:-1'0-oemor dataﬂow computer In 14 cyctes

T 9 11 121314
31 as '11 I *5 |$1|l1|'=1|°5| $1=t’2*t*1-*1='*3"$1-‘=1=b1*“o-°5=*’s*°4
no ‘*6 l [_$2[l2|_°;|°t-§~| @'2=b4*b3-*2==-1"52-¢2=51*'=o-¢c=$s*°=1
‘*3;.N 3? D3] ll? l"_3[_‘a|°3]°?| 53:56‘be-la=t'?"5a~°3=‘1*°o-°?=‘a*°4
-‘viviji1
‘*4 ‘iilk- “B ?if mu’ |s4ll4|°4|°B| *4=be*t'?-*4=s=1*$3'°=1=l2"°o-‘=e=l4"°4
tjd] Parallel execution on a shared-memory 4-pro-oossor system in H cycles

Fig. 2.13 Comparison between datiﬂerw and control-ﬂow computers [adapted from Grajsltl. Pacl1.n,K.u|:lt. and
Kuhn, ‘E952; reprinted with permission from IEEE Computer. Feb. 1931}

Cine advantage of tagging each datum is that data from different contexts can be mixed freely in the
instruction execution pipeline. Thus, instruction-level parallelism of datafiow graphs can absorb the
communication latency and minimize the losses due to synchronization waits. Besides token matching
and l-strllctttre, compiler technology is also needed to generate datafiow graphs for tagged-token dataflow
computers. The dataflow architecture ol‘l'e1s in theory a promising model For massively parallel computations
because all for-reaching side effects are removed. However. implementation ofthese concepts onacommercial
scale has proved to be very diflicult.
Pm mmN,kkg _._ is
2.3.2 Demand-Driven Mechanisms
In a reduction rnnt-hinc, the computation is triggered by the demand for an operation's result. Consider the
evaluation ofa nested arithmetic expression tr = {U1 + 1} '>< c — {rt + cl]. The data-driven computation Seen
above chooses a bottom-up approach, starting from the innermost operations b + I and tr’ + c, then proceeding
to the >< operation, and finally to the outermost operation —_ Such a computation has been called eager
ct-mtinrion because operations are carried out immediately after all their operands become available.
A dcmmiri’o‘rit-on computation chooses a top-down approach by first demanding the value ofn, which
triggers the demand for evaluating the next-level expressions [in + l}>< c and n‘ + c, which in tum triggers tl1c
demand tor evaluating b + 1 at the innermost level. The resttlts are then returned to the nested dectnandcr in
the reverse order before tr is evaluated.
Adetnand-driven computation corresponds to hrzy e-t-nlmrinn. because operations are executed only when
their results are required by another instruction. "The demand driven approach matches naturally with the
functional programming concept. The removal of side efiircts in functional programming makes programs
easier to parallelize. There are two types of reduction machine models, both having a recursive control
mechanism as characterized below.
Reduction Machine Model: In a string reduction model, each demander gets a separate copy of the
expression for its o\vt| evaluation. A long string expression is reduced to a single value in a recursive fashion.
Each reduction step has an operator followed by an embedded reference to demand the corresponding input
operands. The operator is suspended while its input arguments are being evaluated. An expression is said to
be fully reduced when all the arguments have been replaced by literal values.
In a graph rcducrimi model, the expression is represented as a directed graph. The graph is reduced by
evaluation of branches or subgraphs. Diiilicreot parts of a graph or subgraphs can be reduced or evaluated
in parallel upon demand. Each demander is given a pointer to the result of the reduction. The demander
manipulates all rciercnccs to that graph.
Graph manipulation is based on sharing the atgtrtncnts using pointers. This traversal of the graph and
reversal ofthe rcilercnccs are continued tmtil constant arguments are encountered. This proceeds until the
value of ti is determined and a copy is rettuned to the original demanding insuuction.

2.3.3 Comparison of Flow Mechanisms

Control-flow. dataflow, and reduction computer architectures are compared in Table 2.1. The degree of
explicit control decreases fi'om control-driven to demand-driven to data-driven. Highlighted in the table are
the dillcrences between eager ct-'.trlu.trIi7r;tn and ltrzy ct-‘nitration in data-driven and demand-driven computers.
respectively.
Furthermore. control tokens are used in control-flow computers and reduction machines, respectively. The
listed advantages and disadvantages of the dataflow and reduction machine models are based on research
findings rather than on extensive operational experience.
liven though conventional von Neutnann model has many disadvantages, the industry is still building
computers following the control-flow model. The choice was based on cost—efl'ectiveness, marketability, and
the narrow windows of competition used by the industry. Program flow mechanisms dictate architectural
choices. Both dataflow and reduction models, despite a higher potential for parallelism, are still concepts in
the research stage. Control-flow machines still dominate the market.
'55 N
.
Aohioinced Computer Ai'chitecture

Table 2.1 Corrtrol-Flow, Datoﬂow, and Reduction Computers

."t-fuclrin-t.' .'H0dt'f I.’_'o.rr.|'m-1' Firm‘ fconrmf-dri ten) Dalaﬂan' |"da.t.r.t-drr'ven) Re'dr.rc.rion {J£'rru.rnuLd'ri\=e'n}

Coiiveiitioual computation; token Eager evaluation; statements Leary evaltiatio-J1; statements

Basic of control indicate: when a are esecuted when all ot'tl1ei|' are executed only when
Definition statement should he executed operands are available their result is required for
another computation
Full control
Very high potential for Only required instructions
The most sueoensliil model parallelism are executed
for commercial products
fidvaiitage-i —
l [iph throughput lligli degree of para llelisrn
Complex data and control
structures are easily implemented Free from side effects Easy manipulation of data
structures
lri theory, less cflicieirit than the Time lost waiting for Docs not support shining ct‘
other two unneeded arguments objects with cluutgiug local
state
Disadvamagcs Difficult in preventing run-time lligh control overhead
t".‘.'l'l'O‘l’S
Time needed to propagate
Difiicult in manipulating demand tokens.
data structures

(Courtesy olwali, Lowrie, and Li; reprinI.e=d with p-eimission from Computersfor A‘rnfﬁciol intelligence Pmcersing edited
by Wah and Ra.maJ:noorthy, Wiley and Sons. l.uc., 1990]

In this book, we study mostly control-flow parallel computers. But dataflow and rnultithreadcd architectures
will be further studied in Chapter 9. Dataflow or hybrid von Neumann and dataflow machines otter design
alternatives; .i‘fl"-EH1’?! proccs.-rr'ng { see C hapter 13) can be considered an example.
As far as innovative computer architecture is conccmed thc dataflow or hybrid models cannot he ignored.
Both thc Electroteclmical Laboratory (ETL) in Japan and the Massachusetts institute of Technology have
paid attention to these approaches. The book edited by Gaudiot and Bic (1991) provides details of some
development on dataflow computers in that period.

SYSTEM INTERCONNECT ARCHITECTURES

_ Static and dynamic networks for interconnecting computer subsystems or for constructing
multiproccssors or multicomputers are introduced below. We study ﬁrst the distinctions
between direct networks for static connections and indirect networks for dynamic connections. These
networks can be used for internal connections among processors, memory modules, and HO adaptors in a
centralized system, or for distributed networking of rnulticoniputcr nodes.
Various topologies for building networks are speciﬁed below. Theo we focus on thc communication
properties ofinterconnection networks. These include latency analysis, bisection bandwidth, and data-routing
functions. Finally, we analyze the scalability ofparallcl architecture in solving scaled problems.

Program and Network Properties 2.1 Conditions of Parallelism 2.2 Program Partitioning and Scheduling
No ratings yet
Program and Network Properties 2.1 Conditions of Parallelism 2.2 Program Partitioning and Scheduling
47 pages
Parallel Computing Dependencies
No ratings yet
Parallel Computing Dependencies
27 pages
Program and Network Properties
No ratings yet
Program and Network Properties
27 pages
Parallel Computing Essentials
No ratings yet
Parallel Computing Essentials
71 pages
Parallel Computing for Experts
No ratings yet
Parallel Computing for Experts
36 pages
Understanding Parallelism in Computing
No ratings yet
Understanding Parallelism in Computing
29 pages
Grain Packing & Scheduling Ch2 Hwang
No ratings yet
Grain Packing & Scheduling Ch2 Hwang
80 pages
15CS72 ACA Module1 Chapter2Final
No ratings yet
15CS72 ACA Module1 Chapter2Final
28 pages
ACA Chapter2
No ratings yet
ACA Chapter2
66 pages
Performnce Metrics and Measures
No ratings yet
Performnce Metrics and Measures
24 pages
Module 1 Chapter2
No ratings yet
Module 1 Chapter2
98 pages
Module 1 Chapter2
No ratings yet
Module 1 Chapter2
100 pages
Chapter 2: Program and Network Properties
No ratings yet
Chapter 2: Program and Network Properties
94 pages
Hardware vs. Software Parallelism
50% (2)
Hardware vs. Software Parallelism
55 pages
MCP Unit 1
No ratings yet
MCP Unit 1
41 pages
Instruction Level Parallelism: Soner Onder
No ratings yet
Instruction Level Parallelism: Soner Onder
25 pages
2 TypesofParallelism
No ratings yet
2 TypesofParallelism
69 pages
Dependence Alanysis and Loop Normalization
No ratings yet
Dependence Alanysis and Loop Normalization
23 pages
U3.1 Concepts and Challenges
No ratings yet
U3.1 Concepts and Challenges
12 pages
Parallelism in Algorithms - Bernstein's Conditions
100% (2)
Parallelism in Algorithms - Bernstein's Conditions
2 pages
Parallel Programming 1
No ratings yet
Parallel Programming 1
32 pages
Capp 1
No ratings yet
Capp 1
38 pages
Dependency Graph and Bernstein Conditions
No ratings yet
Dependency Graph and Bernstein Conditions
39 pages
Week 5
No ratings yet
Week 5
41 pages
Unit 3 Complete APP
No ratings yet
Unit 3 Complete APP
49 pages
CSE 820 Graduate Computer Architecture Week 5 - Instruction Level Parallelism
No ratings yet
CSE 820 Graduate Computer Architecture Week 5 - Instruction Level Parallelism
38 pages
Unit 3
No ratings yet
Unit 3
49 pages
StudM1p2CSA Conditions of Parallelismppt2
No ratings yet
StudM1p2CSA Conditions of Parallelismppt2
25 pages
Instruction Level Parallelism and Its Exploitation: Unit Ii by Raju K, Cse Dept
No ratings yet
Instruction Level Parallelism and Its Exploitation: Unit Ii by Raju K, Cse Dept
201 pages
Module 5 Instruction Level Parallelism and Pipelining
No ratings yet
Module 5 Instruction Level Parallelism and Pipelining
54 pages
03 Dynamic Sched
No ratings yet
03 Dynamic Sched
84 pages
CompanionAsset 9780128119051 Chapter03
No ratings yet
CompanionAsset 9780128119051 Chapter03
67 pages
Parallel Programming: Aaron Bloomfield CS 415 Fall 2005
No ratings yet
Parallel Programming: Aaron Bloomfield CS 415 Fall 2005
24 pages
4-Independent and Cooperating Processes
No ratings yet
4-Independent and Cooperating Processes
3 pages
c3 Dependence Analysis p1
No ratings yet
c3 Dependence Analysis p1
32 pages
Cosc530 Ch3all6up
No ratings yet
Cosc530 Ch3all6up
8 pages
CCS358-Evaluation Shecme & Answer Key
No ratings yet
CCS358-Evaluation Shecme & Answer Key
8 pages
Instruction-Level Parallelism: Stalls Control Stalls WAW Stalls WAR Stalls RAW Stalls Structural CPI CPI
No ratings yet
Instruction-Level Parallelism: Stalls Control Stalls WAW Stalls WAR Stalls RAW Stalls Structural CPI CPI
50 pages
Inter-Process Communication Guide
No ratings yet
Inter-Process Communication Guide
16 pages
Lec5 - ILP Issues in Pipeline Design
No ratings yet
Lec5 - ILP Issues in Pipeline Design
38 pages
Layers of Implementing An Application in Software or Hardware Using Parallel Computers
No ratings yet
Layers of Implementing An Application in Software or Hardware Using Parallel Computers
46 pages
Dependencies, Instruction Scheduling, Optimization, and Parallelism
No ratings yet
Dependencies, Instruction Scheduling, Optimization, and Parallelism
49 pages
COA UNIT 5 (AutoRecovered)
No ratings yet
COA UNIT 5 (AutoRecovered)
14 pages
Unit 3 Parallel Programming: Structure Nos
No ratings yet
Unit 3 Parallel Programming: Structure Nos
26 pages
Lecture 5
No ratings yet
Lecture 5
80 pages
Parallelism I: Inside The Core
No ratings yet
Parallelism I: Inside The Core
61 pages
Parallelism
No ratings yet
Parallelism
67 pages
Unit 1
No ratings yet
Unit 1
34 pages
Parallel Algorithms: Theory and Practice: Deterministi C Parallelism
No ratings yet
Parallel Algorithms: Theory and Practice: Deterministi C Parallelism
51 pages
Unit 6
No ratings yet
Unit 6
22 pages
Parallel Programming Essentials
No ratings yet
Parallel Programming Essentials
40 pages
Parallel Programming Models Guide
No ratings yet
Parallel Programming Models Guide
32 pages
Unit 4
No ratings yet
Unit 4
42 pages
Parallel Computing
100% (1)
Parallel Computing
241 pages
Concurrency
No ratings yet
Concurrency
99 pages
Con Currency
No ratings yet
Con Currency
99 pages
Lec18-Synchronization Issues-Online
No ratings yet
Lec18-Synchronization Issues-Online
17 pages
Lab 3 Synchronization v2-1
No ratings yet
Lab 3 Synchronization v2-1
19 pages
Multiprocessor Concepts
No ratings yet
Multiprocessor Concepts
40 pages
ACA UNIT-1 A Kai Hwang
No ratings yet
ACA UNIT-1 A Kai Hwang
32 pages
CNNs in 5G Intelligent Transport Systems
No ratings yet
CNNs in 5G Intelligent Transport Systems
20 pages
CNNs in 5G Intelligent Transport
No ratings yet
CNNs in 5G Intelligent Transport
10 pages
10 Switches
No ratings yet
10 Switches
36 pages
Excel To SAS
No ratings yet
Excel To SAS
6 pages
Dimensionless Number
No ratings yet
Dimensionless Number
14 pages
NTDLL Unhooking - From KnownDlls Directory
No ratings yet
NTDLL Unhooking - From KnownDlls Directory
9 pages
Class 12 - Assignment 3 (AOD) (Rate Measure, Tangent and Normal)
No ratings yet
Class 12 - Assignment 3 (AOD) (Rate Measure, Tangent and Normal)
5 pages
Fe C 2si PDF
No ratings yet
Fe C 2si PDF
1 page
Course Logistics and Introduction To Machine Learning
No ratings yet
Course Logistics and Introduction To Machine Learning
34 pages
Changi Airport Sand Compaction
No ratings yet
Changi Airport Sand Compaction
4 pages
The Nuts and Bolts of First-Principles Simulation: 6: Plane Waves, Unit Cells, K-Points and All That
No ratings yet
The Nuts and Bolts of First-Principles Simulation: 6: Plane Waves, Unit Cells, K-Points and All That
33 pages
LESSON 1 - Operations of Signed Numbers
No ratings yet
LESSON 1 - Operations of Signed Numbers
4 pages
Grade 9 Physics Chapter 1 Test
No ratings yet
Grade 9 Physics Chapter 1 Test
10 pages
Bioinformatics Workshop LDH Worksheet-1
No ratings yet
Bioinformatics Workshop LDH Worksheet-1
4 pages
A2 Mybmrchuc: VDP-1507-G0011-D01-01
No ratings yet
A2 Mybmrchuc: VDP-1507-G0011-D01-01
1 page
C Lab Manual
No ratings yet
C Lab Manual
36 pages
Progecad LT 2006 Tutorial: 1. Download and Install The Program
No ratings yet
Progecad LT 2006 Tutorial: 1. Download and Install The Program
11 pages
Form Two Biology 1 QWCQW
No ratings yet
Form Two Biology 1 QWCQW
63 pages
October Physics Test
No ratings yet
October Physics Test
2 pages
Understanding Process Conditions
100% (1)
Understanding Process Conditions
41 pages
Installation Manual-ADSS Tension Set
No ratings yet
Installation Manual-ADSS Tension Set
2 pages
Geography Form 2 Schemes of Work
No ratings yet
Geography Form 2 Schemes of Work
14 pages
2G PS Call Setup
No ratings yet
2G PS Call Setup
23 pages
Kistler Balistyka nn180
No ratings yet
Kistler Balistyka nn180
125 pages
DMM 2, Q&a
No ratings yet
DMM 2, Q&a
25 pages
Differential Amplifiers Guide
No ratings yet
Differential Amplifiers Guide
38 pages
Algebra IDiagnostic Multiple Choice (D.lee)
No ratings yet
Algebra IDiagnostic Multiple Choice (D.lee)
3 pages
Run Generator Without Fossil Fuel
No ratings yet
Run Generator Without Fossil Fuel
10 pages
Lexical Issues
100% (2)
Lexical Issues
2 pages
SPEC. FOR ELECTRIC HEATER SYSTEM rg6s911388003 - 001
No ratings yet
SPEC. FOR ELECTRIC HEATER SYSTEM rg6s911388003 - 001
11 pages
Part 3. Supplemental Document - v2.01.08 - Sup
No ratings yet
Part 3. Supplemental Document - v2.01.08 - Sup
8 pages
Speed Control of Induction Motor
100% (1)
Speed Control of Induction Motor
33 pages

ACA UNIT-1 B Kai Hwang

Uploaded by

ACA UNIT-1 B Kai Hwang

Uploaded by

FM Mnfﬁrirli-' Hflllmminnm

Program and Network

2.1.1 Data and Resource Dependence:

' The subscript of a variable is itself subscribed.

(3) Dgpgndanog graph

Flg.2.1 Dara and IID dependence: in the ping:-an ofE>-temple 11

The following loop has eonnnf-dqrendenr iterations:

{aj A dependenee graph showing both data dependence [solid arrows)

Fig.1.1 Denecnbn of parallelism in the prugrarn of Example 11

2.1.2 Hardware and Soﬁnuare Parallelism

.39 Example 2.3 Mismatch between software parallelism and

Fig. 2.3 E:-reetrclng an in-rarrtp-lc program by a 1:we~lssue swperscalar processor

1.1.3 The Role of Compilers

PROGRAM PART ITIOHI NG AND SCHEDULING

2.1.1 Grain Sizes and Latency

|_m,-9| 5 Jobs or programs

2.1.2 Grain Packing and Scheduling

[x,t'_| {lnp-it, delay]

[bl Coarse-grain program graph

Var rt, b,c-, d,e,_,i", g, Fr, i,_,r', k, I, m, n,o,p, q

2.1.3 Static Multiprocessor Scheduling

5*) Example 2.5 Program decomposition for static multipro-

lc=1er2 G-rmiazeiﬂi Gyg-15;-93

{a)GII"1 alze cdulaim I‘| M6-B000 aasemlily code e120-MHz eyele-

Fig. 2.10 Seqoertrlal versus parallel sehed1.tI1-tgin Exarnple 15

@210 ' v=w=x=~r=1o1+1o1+s=21o {M5

PROGRAM FLOW MECHANISMS

2.3.1 C-orrtml Flow\i"ersus Data Flow

From routing network

I n i< n Rotting Hatwortr 1

[c] Data-driven exotatﬂon on a 4-1:-1'0-oemor dataﬂow computer In 14 cyctes

2.3.3 Comparison of Flow Mechanisms

Table 2.1 Corrtrol-Flow, Datoﬂow, and Reduction Computers

Coiiveiitioual computation; token Eager evaluation; statements Leary evaltiatio-J1; statements

SYSTEM INTERCONNECT ARCHITECTURES

You might also like