ACA UNIT-1A: Kai Hwang, Advanced Computer Architecture
Obviously, the fact that computing and communication were carried out with moving mechanical parts greatly limited the computing speed and reliability of mechanical computers. Modern computers were marked by the introduction of electronic components. The moving parts in mechanical computers were replaced by high-mobility electrons in electronic computers. Information transmission by mechanical gears or levers was replaced by electric signals traveling almost at the speed of light.
Computer Generations Over the past several decades, electronic computers have gone through roughly five generations of development. Table 1.1 provides a summary of the five generations of electronic computer development. Each of the first three generations lasted about 10 years. The fourth generation covered a time span of 15 years. The fifth generation today has processors and memory devices with more than 1 billion transistors on a single silicon chip.

The division of generations is marked primarily by major changes in hardware and software technologies. The entries in Table 1.1 indicate the new hardware and software features introduced with each generation. Most features introduced in earlier generations have been passed to later generations.
Progress in Hardware As far as hardware technology is concerned, the first generation (1945-1954) used vacuum tubes and relay memories interconnected by insulated wires. The second generation (1955-1964)
was marked by the use of discrete transistors, diodes, and magnetic ferrite cores, interconnected by printed
circuits.
The third generation (1965-1974) began to use integrated circuits (ICs) for both logic and memory in small-scale or medium-scale integration (SSI or MSI) and multilayered printed circuits. The fourth generation (1974-1991) used large-scale or very-large-scale integration (LSI or VLSI). Semiconductor memory replaced core memory as computers moved from the third to the fourth generation.

The fifth generation (1991-present) is highlighted by the use of high-density and high-speed processor and memory chips based on advanced VLSI technology. For example, 64-bit GHz-range processors are now available on a single chip with over one billion transistors.
The First Generation From the architectural and software points of view, first-generation computers were built with a single central processing unit (CPU) which performed serial fixed-point arithmetic using a program counter, branch instructions, and an accumulator. The CPU had to be involved in all memory access and input/output (I/O) operations. Machine or assembly languages were used.

Representative systems include the ENIAC (Electronic Numerical Integrator and Calculator) built at the Moore School of the University of Pennsylvania in 1950; the IAS (Institute for Advanced Studies) computer based on a design proposed by John von Neumann, Arthur Burks, and Herman Goldstine at Princeton in 1946; and the IBM 701, the first electronic stored-program commercial computer built by IBM in 1953. Subroutine linkage was not implemented in early computers.
The Second Generation Index registers, floating-point arithmetic, multiplexed memory, and I/O processors were introduced with second-generation computers. High-level languages (HLLs), such as Fortran, Algol, and Cobol, were introduced along with compilers, subroutine libraries, and batch processing monitors. Register transfer language was developed by Irving Reed (1957) for the systematic design of digital computers.

Representative systems include the IBM 7030 (the Stretch computer) featuring instruction lookahead and error-correcting memories built in 1962, the Univac LARC (Livermore Atomic Research Computer) built in 1959, and the CDC 1604 built in the 1960s.
The Third Generation The third generation was represented by the IBM 360/370 Series, the CDC 6600/7600 Series, the Texas Instruments ASC (Advanced Scientific Computer), and Digital Equipment's PDP-8 Series from the mid-1960s to the mid-1970s.

Microprogrammed control became popular with this generation. Pipelining and cache memory were introduced to close up the speed gap between the CPU and main memory. The idea of multiprogramming was implemented to interleave CPU and I/O activities across multiple user programs. This led to the development of time-sharing operating systems (OS) using virtual memory with greater sharing or multiplexing of resources.
The Fourth Generation Parallel computers in various architectures appeared in the fourth generation of computers using shared or distributed memory or optional vector hardware. Multiprocessing OS, special languages, and compilers were developed for parallelism. Software tools and environments were created for parallel processing or distributed computing.

Representative systems include the VAX 9000, Cray X-MP, IBM 3090/VF, BBN TC-2000, etc. During these 15 years (1975-1990), the technology of parallel processing gradually became mature and entered the production mainstream.
The Fifth Generation These systems emphasize superscalar processors, cluster computers, and massively parallel processing (MPP). Scalable and latency-tolerant architectures are being adopted in MPP systems using advanced VLSI technologies, high-density packaging, and optical technologies.

Fifth-generation computers achieved Teraflops (10^12 floating-point operations per second) performance by the mid-1990s, and have now crossed the Petaflop (10^15 floating-point operations per second) range. Heterogeneous processing is emerging to solve large-scale problems using a network of heterogeneous computers. Early fifth-generation MPP systems were represented by several projects at Fujitsu (VPP500), Cray Research (MPP), Thinking Machines Corporation (the CM-5), and Intel (the Paragon). For present-day examples of advanced processors and systems, see Chapter 13.
Computing Problems It has long been recognized that the concept of computer architecture is no longer restricted to the structure of the bare machine hardware. A modern computer is an integrated system consisting of machine hardware, an instruction set, system software, application programs, and user interfaces. These system elements are depicted in Fig. 1.1. The use of a computer is driven by real-life problems demanding cost-effective solutions. Depending on the nature of the problems, the solutions may require different computing resources.
"W19
PS’ “ms Operatl ng
System
A‘and Data
"ms Manning Hardware
5
Strum
H%
Arelitoeture
Pro-gramrntng
Performance
Evsl uatlon
For numerical problems in science and technology, the solutions demand complex mathematical formulations and intensive integer or floating-point computations. For alphanumerical problems in business
and government, the solutions demand efficient transaction processing, large database management, and information retrieval operations.

For artificial intelligence (AI) problems, the solutions demand logic inferences and symbolic manipulations. These computing problems have been labeled numerical computing, transaction processing, and logical reasoning. Some complex problems may demand a combination of these processing modes.
Algorithms and Data Structures Special algorithms and data structures are needed to specify the computations and communications involved in computing problems. Most numerical algorithms are deterministic, using regularly structured data. Symbolic processing may use heuristics or nondeterministic searches over large knowledge bases.

Problem formulation and the development of parallel algorithms often require interdisciplinary interactions among theoreticians, experimentalists, and computer programmers. There are many books dealing with the design and mapping of algorithms or heuristics onto parallel computers. In this book, we are more concerned with the resource mapping problem than with the design and analysis of parallel algorithms.
Hardware Resources The system architecture of a computer is represented by three nested circles on the right in Fig. 1.1. A modern computer system demonstrates its power through coordinated efforts by hardware resources, an operating system, and application software. Processors, memory, and peripheral devices form the hardware core of a computer system. We will study instruction-set processors, memory organization, multiprocessors, supercomputers, multicomputers, and massively parallel computers.

Special hardware interfaces are often built into I/O devices such as display terminals, workstations, optical page scanners, magnetic ink character recognizers, modems, network adaptors, voice data entry, printers, and plotters. These peripherals are connected to mainframe computers directly or through local or wide-area networks.

In addition, software interface programs are needed. These software interfaces include file transfer systems, editors, word processors, device drivers, interrupt handlers, network communication programs, etc. These programs greatly facilitate the portability of user programs on different machine architectures.
Operating System An effective operating system manages the allocation and deallocation of resources during the execution of user programs. Beyond the OS, application software must be developed to benefit the users. Standard benchmark programs are needed for performance evaluation.

Mapping is a bidirectional process matching algorithmic structure with hardware architecture, and vice versa. Efficient mapping will benefit the programmer and produce better source codes. The mapping of algorithmic and data structures onto the machine architecture includes processor scheduling, memory maps, interprocessor communications, etc. These activities are usually architecture-dependent.

Optimal mappings are sought for various computer architectures. The implementation of these mappings relies on efficient compiler and operating system support. Parallelism can be exploited at algorithm design time, at program time, at compile time, and at run time. Techniques for exploiting parallelism at these levels form the core of parallel processing technology.
System Software Support Software support is needed for the development of efficient programs in high-level languages. The source code written in an HLL must first be translated into object code by an optimizing compiler. The compiler assigns variables to registers or to memory words, and generates machine operations corresponding to HLL operators, to produce machine code which can be recognized by the machine hardware. A loader is used to initiate the program execution through the OS kernel.
Resource binding demands the use of the compiler, assembler, loader, and OS kernel to commit physical machine resources to program execution. The effectiveness of this process determines the efficiency of hardware utilization and the programmability of the computer. Today, programming parallelism is still difficult for most programmers due to the fact that existing languages were originally developed for sequential computers. Programmers are sometimes forced to program hardware-dependent features instead of programming parallelism in a generic and portable way. Ideally, we need to develop a parallel programming environment with architecture-independent languages, compilers, and software tools.

To develop a parallel language, we aim for efficiency in its implementation, portability across different machines, compatibility with existing sequential languages, expressiveness of parallelism, and ease of programming. One can attempt a new language approach or try to extend existing sequential languages gradually. A new language approach has the advantage of using explicit high-level constructs for specifying parallelism. However, new languages are often incompatible with existing languages and require new compilers or new passes to existing compilers. Most systems choose the language extension approach; one way to achieve this is by providing appropriate function libraries.
Compiler Support There are three compiler upgrade approaches: preprocessor, precompiler, and parallelizing compiler. A preprocessor uses a sequential compiler and a low-level library of the target computer to implement high-level parallel constructs. The precompiler approach requires some program flow analysis, dependence checking, and limited optimizations toward parallelism detection. The third approach demands a fully developed parallelizing or vectorizing compiler which can automatically detect parallelism in source code and transform sequential codes into parallel constructs. These approaches will be studied in Chapter 10.
The efficiency of the binding process depends on the effectiveness of the preprocessor, the precompiler, the parallelizing compiler, the loader, and the OS support. Due to unpredictable program behavior, none of the existing compilers can be considered fully automatic or fully intelligent in detecting all types of parallelism. Very often compiler directives are inserted into the source code to help the compiler do a better job. Users may interact with the compiler to restructure the programs. This has been proven useful in enhancing the performance of parallel computers.
Legends: I/E: Instruction Fetch and Execute; SIMD: Single Instruction stream and Multiple Data streams; MIMD: Multiple Instruction streams and Multiple Data streams.

Fig. 1.2 Tree showing architectural evolution from sequential scalar computers to vector processors and parallel computers
Lookahead, Parallelism, and Pipelining Lookahead techniques were introduced to prefetch instructions in order to overlap I/E (instruction fetch/decode and execution) operations and to enable functional parallelism. Functional parallelism was supported by two approaches: one is to use multiple functional units simultaneously, and the other is to practice pipelining at various processing levels.

The latter includes pipelined instruction execution, pipelined arithmetic computations, and memory-access operations. Pipelining has proven especially attractive in performing identical operations repeatedly over vector data strings. Vector operations were originally carried out implicitly by software-controlled looping using scalar pipeline processors.
Flynn's Classification Michael Flynn (1972) introduced a classification of various computer architectures based on notions of instruction and data streams. As illustrated in Fig. 1.3a, conventional sequential machines are called SISD (single instruction stream over a single data stream) computers. Vector computers are equipped with scalar and vector hardware or appear as SIMD (single instruction stream over multiple data streams) machines (Fig. 1.3b). Parallel computers are reserved for MIMD (multiple instruction streams over multiple data streams) machines.

An MISD (multiple instruction streams and a single data stream) machine is modeled in Fig. 1.3d. The same data stream flows through a linear array of processors executing different instruction streams. This
architecture is also known as a systolic array (Kung and Leiserson, 1978) for pipelined execution of specific algorithms.
CU = Control Unit; PU = Processing Unit; MU = Memory Unit; IS = Instruction Stream; DS = Data Stream; PE = Processing Element; LM = Local Memory
(a) SISD uniprocessor architecture; (b) SIMD architecture (with distributed memory); (c) MIMD architecture (with shared memory); (d) MISD architecture (systolic array)

Fig. 1.3 Flynn's classification of computer architectures (Derived from Michael Flynn, 1972)
Of the four machine models, most parallel computers built in the past assumed the MIMD model for general-purpose computations. The SIMD and MISD models are more suitable for special-purpose computations. For this reason, MIMD is the most popular model, SIMD next, and MISD the least popular model being applied in commercial machines.

Parallel/Vector Computers Intrinsic parallel computers are those that execute programs in MIMD mode. There are two major classes of parallel computers, namely, shared-memory multiprocessors and message-passing multicomputers. The major distinction between multiprocessors and multicomputers lies in memory sharing and the mechanisms used for interprocessor communication.
The processors in a multiprocessor system communicate with each other through shared variables in a common memory. Each computer node in a multicomputer system has a local memory, unshared with other nodes. Interprocessor communication is done through message passing among the nodes.

Explicit vector instructions were introduced with the appearance of vector processors. A vector processor is equipped with multiple vector pipelines that can be concurrently used under hardware or firmware control. There are two families of pipelined vector processors:

Memory-to-memory architecture supports the pipelined flow of vector operands directly from the memory to pipelines and then back to the memory. Register-to-register architecture uses vector registers to interface between the memory and functional pipelines. Vector processor architectures will be studied in Chapter 8.
Another important branch of the architecture tree consists of the SIMD computers for synchronized vector processing. An SIMD computer exploits spatial parallelism rather than temporal parallelism as in a pipelined computer. SIMD computing is achieved through the use of an array of processing elements (PEs) synchronized by the same controller. Associative memory can be used to build SIMD associative processors. SIMD machines will be treated in Chapter 8 along with pipelined vector computers.
Development Layers A layered development of parallel computers is illustrated in Fig. 1.4, based on a classification by Lionel Ni (1990). Hardware configurations differ from machine to machine, even those of the same model. The address space of a processor in a computer system varies among different architectures. It depends on the memory organization, which is machine-dependent. These features are up to the designer and should match the target application domains.
(Layers, from top to bottom: Applications; Programming Environment; Languages Supported; Communication Model; Addressing Space; Hardware Architecture. The top two layers are machine-independent; the lower four are machine-dependent.)

Fig. 1.4 Six layers for computer system development (Courtesy of Lionel Ni, 1990)
On the other hand, we want to develop application programs and programming environments which are machine-independent. Independent of machine architecture, the user programs can be ported to many computers with minimum conversion costs. High-level languages and communication models depend on the architectural choices made in a computer system. From a programmer's viewpoint, these two layers should be architecture-transparent.

Programming languages such as Fortran, C, C++, Pascal, Ada, Lisp and others can be supported by most computers. However, the communication models, shared variables versus message passing, are mostly machine-dependent. The Linda approach using tuple space offers an architecture-transparent communication model for parallel computers. These language features will be studied in Chapter 10.

Application programmers prefer more architectural transparency. However, kernel programmers have to explore the opportunities supported by hardware. As a good computer architect, one has to approach the problem from both ends. The compilers and OS support should be designed to remove as many architectural constraints as possible from the programmer.
New Challenges The technology of parallel processing is the outgrowth of several decades of research and industrial advances in microelectronics, printed circuits, high-density packaging, advanced processors, memory systems, peripheral devices, communication channels, language evolution, compiler sophistication, operating systems, programming environments, and application challenges.

The rapid progress made in hardware technology has significantly increased the economical feasibility of building a new generation of computers adopting parallel processing. However, the major barrier preventing parallel processing from entering the production mainstream is on the software and application side. To date, it is still fairly difficult to program parallel and vector computers. We need to strive for major progress in the software area in order to create a user-friendly environment for high-power computers. A whole new generation of programmers needs to be trained to program parallelism effectively. High-performance computers provide fast and accurate solutions to scientific, engineering, business, social, and defense problems.
Representative real-life problems include weather forecast modeling, modeling of physical, chemical and biological processes, computer-aided design, large-scale database management, artificial intelligence, crime control, and strategic defense initiatives, just to name a few. The application domains of parallel processing computers are expanding steadily. With a good understanding of scalable computer architectures and mastery of parallel programming techniques, the reader will be better prepared to face future computing challenges.
Clock Rate and CPI The CPU (or simply the processor) of today's digital computer is driven by a clock with a constant cycle time τ. The inverse of the cycle time is the clock rate (f = 1/τ). The size of a program is determined by its instruction count (Ic), in terms of the number of machine instructions to be executed in the program. Different machine instructions may require different numbers of clock cycles to execute. Therefore, the cycles per instruction (CPI) becomes an important parameter for measuring the time needed to execute each instruction.

For a given instruction set, we can calculate an average CPI over all instruction types, provided we know their frequencies of appearance in the program. An accurate estimate of the average CPI requires a large amount of program code to be traced over a long period of time. Unless specifically focusing on a single instruction type, we simply use the term CPI to mean the average value with respect to a given instruction set and a given program mix.
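As a quick illustration of this weighted average, the following sketch computes an average CPI from a hypothetical instruction mix; the frequencies and per-type cycle counts are illustrative assumptions, not figures from the text:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical instruction mix: frequencies must sum to 1.0 */
        const double freq[] = {0.45, 0.25, 0.20, 0.10};  /* ALU, load, store, branch */
        const double cpi[]  = {1.0,  2.0,  2.0,  3.0};   /* cycles for each type */
        double avg_cpi = 0.0;
        for (int i = 0; i < 4; i++)
            avg_cpi += freq[i] * cpi[i];                 /* frequency-weighted sum */
        printf("average CPI = %.2f\n", avg_cpi);         /* prints 1.65 */
        return 0;
    }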
Performance Factors Let Ic be the number of instructions in a given program, or the instruction count. The CPU time (T, in seconds/program) needed to execute the program is estimated by finding the product of three contributing factors:

T = Ic × CPI × τ   (1.1)

The execution of an instruction requires going through a cycle of events involving the instruction fetch, decode, operand(s) fetch, execution, and store of results. In this cycle, only the instruction decode and execution phases are carried out in the CPU. The remaining three operations may require access to the memory. We define a memory cycle as the time needed to complete one memory reference. Usually, a memory cycle is k times the processor cycle τ. The value of k depends on the speed of the cache and memory technology and the processor-memory interconnection scheme used.

The CPI of an instruction type can be divided into two component terms corresponding to the total processor cycles and memory cycles needed to complete the execution of the instruction. Depending on the instruction type, the complete instruction cycle may involve one to as many as four memory references (one for instruction fetch, two for operand fetch, and one for storing results). Therefore we can rewrite Eq. 1.1 as follows:

T = Ic × (p + m × k) × τ   (1.2)

where p is the number of processor cycles needed for the instruction decode and execution, m is the number of memory references needed, k is the ratio between memory cycle and processor cycle, Ic is the instruction count, and τ is the processor cycle time. Equation 1.2 can be further refined once the CPI components (p, m, k) are weighted over the entire instruction set.
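A minimal sketch of Eq. 1.2 with assumed parameter values may help fix the units; none of the numbers below come from the text:

    #include <stdio.h>

    int main(void) {
        /* Assumed values, only to exercise Eq. 1.2: T = Ic * (p + m*k) * tau */
        double Ic  = 200e6;     /* instruction count: 200 million instructions */
        double p   = 2.0;       /* processor cycles per instruction */
        double m   = 1.2;       /* memory references per instruction */
        double k   = 4.0;       /* memory cycle / processor cycle ratio */
        double tau = 0.5e-9;    /* cycle time: 0.5 ns, i.e. a 2 GHz clock */

        double T = Ic * (p + m * k) * tau;               /* CPU time in seconds */
        printf("effective CPI = %.2f\n", p + m * k);     /* 6.80 cycles */
        printf("CPU time T    = %.3f s\n", T);           /* 0.680 s */
        return 0;
    }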
System Attributes The above five performance factors (Ic, p, m, k, τ) are influenced by four system attributes: instruction-set architecture, compiler technology, CPU implementation and control, and cache and memory hierarchy, as specified in Table 1.2.

The instruction-set architecture affects the program length (Ic) and the processor cycles needed (p). The compiler technology affects the values of Ic, p, and the memory reference count (m). The CPU implementation and control determine the total processor time (p · τ) needed. Finally, the memory technology and hierarchy design affect the memory access latency (k · τ). The above CPU time can be used as a basis in estimating the execution rate of a processor.
Table 1.2 Performance Factors versus System Attributes

System Attributes                        Ic    p    m    k    τ
Instruction-set architecture             x     x
Compiler technology                      x     x    x
Processor implementation and control           x              x
Cache and memory hierarchy                               x    x

(Ic = instruction count; p = processor cycles per instruction; m = memory references per instruction; k = memory-access latency in processor cycles; τ = processor cycle time. Together, p, m, and k determine the average cycles per instruction, CPI.)
MIPS Rate Let C be the total number of clock cycles needed to execute a given program. Then the CPU time in Eq. 1.2 can be estimated as T = C × τ = C/f. Furthermore, CPI = C/Ic and T = Ic × CPI × τ = Ic × CPI/f. The processor speed is often measured in terms of million instructions per second (MIPS). We simply call it the MIPS rate of a given processor. It should be emphasized that the MIPS rate varies with respect to a number of factors, including the clock rate (f), the instruction count (Ic), and the CPI of a given machine, as defined below:

MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6) = (f × Ic) / (C × 10^6)   (1.3)

Based on Eq. 1.3, the CPU time in Eq. 1.2 can also be written as T = Ic × 10^-6 / MIPS. Based on the system attributes identified in Table 1.2 and the above derived expressions, we conclude by indicating the fact that the MIPS rate of a given computer is directly proportional to the clock rate and inversely proportional to the CPI. All four system attributes, instruction set, compiler, processor, and memory technologies, affect the MIPS rate, which varies also from program to program because of variations in the instruction mix.
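The following sketch evaluates Eq. 1.3 for an assumed machine (the clock rate, instruction count, and CPI are illustrative, not from the text) and cross-checks the result against the CPU-time form:

    #include <stdio.h>

    int main(void) {
        /* Assumed machine/program parameters for Eq. 1.3 */
        double f   = 500e6;     /* clock rate: 500 MHz */
        double Ic  = 150e6;     /* instruction count: 150 million */
        double cpi = 2.5;       /* average cycles per instruction */

        double T    = Ic * cpi / f;        /* CPU time from Eq. 1.1: 0.75 s */
        double mips = f / (cpi * 1e6);     /* MIPS rate from Eq. 1.3: 200 */

        printf("CPU time  = %.3f s\n", T);
        printf("MIPS rate = %.1f\n", mips);
        printf("check     = %.3f s\n", Ic * 1e-6 / mips);  /* same 0.75 s */
        return 0;
    }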
Floating Point Operations per Second Most compute-intensive applications in science and engineering make heavy use of floating-point operations. Compared to instructions per second, for such applications a more relevant measure of performance is floating-point operations per second, which is abbreviated as flops. With prefix mega (10^6), giga (10^9), tera (10^12) or peta (10^15), this is written as megaflops (mflops), gigaflops (gflops), teraflops or petaflops.
Throughput Rate Another important concept is related to how many programs a system can execute per unit time, called the system throughput Ws (in programs/second). In a multiprogrammed system, the system throughput is often lower than the CPU throughput Wp, defined by:

Wp = f / (Ic × CPI)   (1.4)
Note that Wp = (MIPS) × 10^6 / Ic, from Eq. 1.3. The unit for Wp is also programs/second. The CPU throughput is a measure of how many programs can be executed per second, based only on the MIPS rate and average program length (Ic). Usually Ws < Wp due to the additional system overheads caused by the I/O, compiler, and OS when multiple programs are interleaved for CPU execution by multiprogramming or time-sharing operations. If the CPU is kept busy in a perfect program-interleaving fashion, then Ws = Wp. This will probably never happen, since the system overhead often causes an extra delay and the CPU may be left idle for some cycles.
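Reusing the illustrative numbers from the MIPS sketch above (again assumptions, not data from the text), the CPU throughput of Eq. 1.4 works out as follows:

    #include <stdio.h>

    int main(void) {
        /* Same assumed values as the MIPS sketch above */
        double f   = 500e6;    /* clock rate: 500 MHz */
        double Ic  = 150e6;    /* instructions per program */
        double cpi = 2.5;      /* average cycles per instruction */

        double Wp = f / (Ic * cpi);                 /* Eq. 1.4: CPU throughput */
        printf("Wp = %.3f programs/s\n", Wp);       /* 1.333 programs/s */
        /* A measured system throughput Ws would normally be lower than Wp
           because of I/O, compiler, and OS overheads. */
        return 0;
    }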
These data indicate that the measured CPU time on S1 is 12 times longer than that measured on S2. The object codes running on the two machines have different lengths due to the differences in the machines and compilers used. All other overhead times are ignored.

Based on Eq. 1.3, we can see that the instruction count of the object code running on S2 must be 1.5 times longer than that of the code running on S1. Furthermore, the average CPI on S1 is seen to be 5, while that on S2 is 1.39 executing the same benchmark program.

S1 has a typical CISC (complex instruction set computing) architecture, while S2 has a typical RISC (reduced instruction set computing) architecture to be characterized in Chapter 4. This example offers a simple comparison between the two types of computers based on a single program run. When a different program is run, the conclusion may not be the same.

We cannot calculate the CPU throughput Wp unless we know the program length and the average CPI of each code. The system throughput Ws should be measured across a large number of programs over a long observation period. The message being conveyed is that one should not draw a sweeping conclusion about the performance of a machine based on one or a few program runs.
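The figures quoted above also fix the ratio of the two machines' clock rates. The sketch below is just arithmetic on Eq. 1.1 using those figures; no additional data are assumed:

    #include <stdio.h>

    int main(void) {
        /* Figures quoted in the example: S1 is the CISC machine, S2 the RISC machine */
        double time_ratio = 12.0;   /* T1 / T2 */
        double ic_ratio   = 1.5;    /* Ic2 / Ic1 */
        double cpi1 = 5.0, cpi2 = 1.39;

        /* From T = Ic * CPI / f:  T1/T2 = (Ic1*CPI1/f1) / (Ic2*CPI2/f2),
           hence f2/f1 = (T1/T2) * (Ic2/Ic1) * (CPI2/CPI1). */
        double clock_ratio = time_ratio * ic_ratio * cpi2 / cpi1;
        printf("implied f2/f1 = %.2f\n", clock_ratio);   /* about 5: S2's clock is ~5x faster */
        return 0;
    }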
When using a parallel computer, one desires a parallel environment where parallelism is automatically exploited. Language extensions or new constructs must be developed to specify parallelism or to facilitate easy detection of parallelism at various granularity levels by more intelligent compilers.

Besides parallel languages and compilers, the operating systems must also be extended to support parallel processing. The OS must be able to manage the resources behind parallelism. Important issues include parallel scheduling of concurrent processes, inter-process communication and synchronization, shared memory allocation, and shared peripheral and communication links.
Implicit Parallelism An implicit approach uses a conventional language, such as C, C++, Fortran, or Pascal, to write the source program. The sequentially coded source program is translated into parallel object code by a parallelizing compiler. As illustrated in Fig. 1.5a, this compiler must be able to detect parallelism and assign target machine resources. This compiler approach has been applied in programming shared-memory multiprocessors.

With parallelism being implicit, success relies heavily on the "intelligence" of a parallelizing compiler. This approach requires less effort on the part of the programmer.

Explicit Parallelism The second approach (Fig. 1.5b) requires more effort by the programmer to develop a source program using parallel dialects of C, C++, Fortran, or Pascal. Parallelism is explicitly specified in the user programs. This reduces the burden on the compiler to detect parallelism. Instead, the compiler needs to preserve parallelism and, where possible, assign target machine resources. The new programming language Chapel (see Chapter 13) is in this category.
Fig. 1.5 Two approaches to parallel programming: (a) implicit parallelism, in which a parallelizing compiler produces parallel object code from a sequential source; (b) explicit parallelism, in which the compiler produces concurrent object code from an explicitly parallel source (Courtesy of Charles Seitz; adapted with permission from "Concurrent Architectures", p. 51 and p. 53, VLSI and Parallel Computation, edited by Suaya and Birtwistle, Morgan Kaufmann Publishers, 1990)
Special software tools are needed to make an environment more friendly to user groups. Some of the tools are parallel extensions of conventional high-level languages. Others are integrated environments which include tools providing different levels of program abstraction, validation, testing, debugging, and tuning; performance prediction and monitoring; and visualization support to aid program development, performance measurement, and graphics display and animation of computational results.
The UMA Model In a UMA multiprocessor model (Fig. 1.6), the physical memory is uniformly shared by all the processors. All processors have equal access time to all memory words, which is why it is called uniform memory access. Each processor may use a private cache. Peripherals are also shared in some fashion.
Fig. 1.6 The UMA multiprocessor model: processors (with private caches) share the memory modules through a system interconnect (bus, crossbar, or multistage network)
Multiprocessors are called tightly coupled systems due to the high degree of resource sharing. The system interconnect takes the form of a common bus, a crossbar switch, or a multistage network to be studied in Chapter 7.

Some computer manufacturers have multiprocessor (MP) extensions of their uniprocessor (UP) product line. The UMA model is suitable for general-purpose and time-sharing applications by multiple users. It can be used to speed up the execution of a single large program in time-critical applications. To coordinate parallel events, synchronization and communication among processors are done by using shared variables in the common memory.
When all processors have equal access to all peripheral devices, the system is called a symmetric multiprocessor. In this case, all the processors are equally capable of running the executive programs, such as the OS kernel and I/O service routines.

In an asymmetric multiprocessor, only one or a subset of processors are executive-capable. An executive or master processor can execute the operating system and handle I/O. The remaining processors have no I/O capability and thus are called attached processors (APs). Attached processors execute user code under the supervision of the master processor. In both MP and AP configurations, memory sharing among master and attached processors is still in place.
Example 1.2 Approximated performance of a multiprocessor

This example exposes the reader to parallel program execution on a shared-memory multiprocessor system. Consider a Fortran program written for sequential execution on a uniprocessor system: a first loop computes A(I) = B(I) + C(I) for I = 1 to N, and a second loop accumulates SUM = SUM + A(J) for J = 1 to N. All the arrays, A, B, and C, are assumed to have N elements.
Suppose each of the assignment statements (the element-wise addition, the initialization of SUM, and each accumulation into SUM) takes 1 machine cycle to execute. The time required to execute the loop-control statements is ignored to simplify the analysis. Assume that k cycles are needed for each interprocessor communication operation via the shared memory.

Initially, all arrays are assumed already loaded in the main memory and the short program fragment already loaded in the instruction cache. In other words, instruction fetch and data loading overhead is ignored. Also, we ignore bus contention and memory access conflict problems. In this way, we can concentrate on the analysis of CPU demand.
The above program can be executed on a sequential machine in 2N cycles under the above assumptions. N cycles are needed to execute the N independent iterations in the I loop. Similarly, N cycles are needed for the J loop, which contains N recursive iterations.

To execute the program on an M-processor system, we partition the looping operations into M sections with L = N/M elements per section. In the following parallel code, Doall declares that all M sections be executed by M processors in parallel.

For M-way parallel execution, the sectioned I loop can be done in L cycles. The sectioned J loop produces M partial sums in L cycles. Thus 2L cycles are consumed to produce all M partial sums. Still, we need to merge these M partial sums to produce the final sum of N elements.
      Doall K = 1, M                     ! M sections executed in parallel
      Do 10 I = (K-1)*L + 1, K*L         ! sectioned I loop: L = N/M additions
         A(I) = B(I) + C(I)
10    Continue
      SUM(K) = 0
      Do 20 J = 1, L                     ! sectioned J loop: partial sum of section K
         SUM(K) = SUM(K) + A((K-1)*L + J)
20    Continue
      Endall
The addition of each pair of partial sums requires k cycles through the shared memory. An l-level binary adder tree can be constructed to merge all the partial sums, where l = log2 M. The adder tree takes l(k + 1) cycles to merge the M partial sums sequentially from the leaves to the root of the tree. Therefore, the multiprocessor requires 2L + l(k + 1) = 2N/M + (k + 1) log2 M cycles to produce the final sum.

Suppose N = 2^20 elements in the array. Sequential execution of the original program takes 2N = 2^21 machine cycles. Assume that each IPC synchronization overhead has an average value of k = 200 cycles. Parallel execution on M = 256 processors requires 2^13 + 1608 = 9800 machine cycles.

Comparing the above timing results, the multiprocessor shows a speedup factor of 214 out of the maximum value of 256. Therefore, an efficiency of 214/256 = 83.6% has been achieved. We will study the speedup and efficiency issues in Chapter 3.
The above result was obtained under favorable assumptions about overhead. In reality, the resulting speedup might be lower after considering all software overhead and potential resource conflicts. Nevertheless, the example shows the promising side of parallel processing if the interprocessor communication overhead can be kept at a sufficiently low level, represented here by the value of k.
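The cycle counts, speedup, and efficiency quoted in this example can be reproduced with a few lines of arithmetic; the sketch below simply re-evaluates the formulas 2N and 2N/M + (k + 1) log2 M derived above:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double N = 1048576.0;   /* 2^20 array elements */
        double M = 256.0;       /* number of processors */
        double k = 200.0;       /* IPC synchronization overhead in cycles */

        double T_seq = 2.0 * N;                              /* 2N = 2^21 cycles */
        double T_par = 2.0 * N / M + (k + 1.0) * log2(M);    /* 8192 + 1608 cycles */

        printf("sequential: %.0f cycles\n", T_seq);          /* 2097152 */
        printf("parallel  : %.0f cycles\n", T_par);          /* 9800 */
        printf("speedup %.0f, efficiency %.1f%%\n",
               T_seq / T_par, 100.0 * (T_seq / T_par) / M);  /* 214, 83.6% */
        return 0;
    }

A build such as cc example.c -lm suffices, since log2 comes from the math library.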
The NUMA Model A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word. Two NUMA machine models are depicted in Fig. 1.7. The shared memory is physically distributed to all processors, called local memories. The collection of all local memories forms a global address space accessible by all processors.

It is faster to access a local memory with a local processor. The access of remote memory attached to other processors takes longer due to the added delay through the interconnection network. The BBN TC-2000 Butterfly multiprocessor had the configuration shown in Fig. 1.7a.

Besides distributed memories, globally shared memory can be added to a multiprocessor system. In this case, there are three memory-access patterns: the fastest is local memory access; the next is global memory access; the slowest is access of remote memory, as illustrated in Fig. 1.7b. As a matter of fact, the models shown in Figs. 1.6 and 1.7 can be easily modified to allow a mixture of shared memory and private memory with prespecified access rights.
A hierarchically structured multiprocessor is modeled in Fig. 1.7b. The processors are divided into several clusters*. Each cluster is itself a UMA or a NUMA multiprocessor. The clusters are connected to global shared-memory modules. The entire system is considered a NUMA multiprocessor. All processors belonging to the same cluster are allowed to uniformly access the cluster shared-memory modules.

* The word 'cluster' is used in a different sense in cluster computing, as we shall see later.
All clusters have equal access to the global memory. However, the access time to the cluster memory is shorter than that to the global memory. One can specify the access rights among inter-cluster memories in various ways. The Cedar multiprocessor, built at the University of Illinois, had such a structure in which each cluster was an Alliant FX/80 multiprocessor.
Fig. 1.7 Two NUMA models for multiprocessor systems: (a) shared local memories (e.g. the BBN Butterfly); (b) a hierarchical cluster model (e.g. the Cedar system at the University of Illinois)
The COMA Model A multiprocessor using cache-only memory assumes the COMA model. Early examples of COMA machines include the Swedish Institute of Computer Science's Data Diffusion Machine (DDM, Hagersten et al., 1990) and Kendall Square Research's KSR-1 machine (Burkhardt et al., 1992). The COMA model is depicted in Fig. 1.8. Details of the KSR-1 are given in Chapter 9.
Fig. 1.8 The COMA model of a multiprocessor (P: Processor, C: Cache, D: Directory; e.g. the KSR-1)
The COMA model is a special case of a NUMA machine, in which the distributed main memories are converted to caches. There is no memory hierarchy at each processor node. All the caches form a global
address space. Remote cache access is assisted by the distributed cache directories (D in Fig. 1.8). Depending on the interconnection network used, sometimes hierarchical directories may be used to help locate copies of cached blocks. Initial data placement is not critical because data will eventually migrate to where it will be used.
Besides the UMA, NUMA, and COMA models specified above, other variations exist for multiprocessors. For example, a cache-coherent non-uniform memory access (CC-NUMA) model can be specified with distributed shared memory and cache directories. Early examples of the CC-NUMA model include the Stanford Dash (Lenoski et al., 1990) and the MIT Alewife (Agarwal et al., 1990) to be studied in Chapter 9. A cache-coherent COMA machine is one in which all cache copies must be kept consistent.
The S-81 was a transaction processing multiprocessor consisting of 30 386/i486 microprocessors tied to a common backplane bus. The IBM ES/9000 models were the latest IBM mainframes having up to 6 processors with attached vector facilities. The TC-2000 could be configured to have 512 M88100 processors interconnected by a multistage Butterfly network. This was designed as a NUMA machine for real-time or time-critical applications.
Multiprocessor systems are suitable for general-purpose multiuser applications where programmability is the major concern. A major shortcoming of multiprocessors is the lack of scalability. It is rather difficult to build MPP machines using a centralized shared-memory model. Latency tolerance for remote memory access is also a major limitation.

Packaging and cooling impose additional constraints on scalability. We will study scalability and programmability in subsequent chapters.
Fig. 1.9 Generic model of a message-passing multicomputer: each node is a processor (P) with its own local memory (M), and the nodes communicate through a message-passing interconnection network (mesh, ring, torus, hypercube, cube-connected cycles, etc.)
The message-passing network provides point-to-point static connections among the nodes. All local memories are private and are accessible only by local processors. For this reason, traditional multicomputers have also been called no-remote-memory-access (NORMA) machines. Internode communication is carried out by passing messages through the static connection network. With advances in interconnection and network technologies, this model of computing has gained importance, because of its suitability for certain applications, scalability, and fault tolerance.
Multicomputer Generations Modern multicomputers use hardware routers to pass messages. A computer node is attached to each router. The boundary router may be connected to I/O and peripheral devices. Message passing between any two nodes involves a sequence of routers and channels. Mixed types of nodes are allowed in a heterogeneous multicomputer. The internode communication in a heterogeneous multicomputer is achieved through compatible data representations and message-passing protocols.

Early message-passing multicomputers were based on processor board technology using hypercube architecture and software-controlled message switching. The Caltech Cosmic and Intel iPSC/1 represented this early development.

The second generation was implemented with mesh-connected architecture, hardware message routing, and a software environment for medium-grain distributed computing, as represented by the Intel Paragon and the Parsys SuperNode 1000.
Subsequent systems of this type are fine-grain multicomputers, early examples being the MIT J-Machine and Caltech Mosaic, implemented with both processor and communication gear on the same VLSI chip. For further discussion, see Chapter 13.

In Section 2.4, we will study various static network topologies used to construct multicomputers. Commonly used topologies include the ring, tree, mesh, torus, hypercube, cube-connected cycles, etc. Various communication patterns are demanded among the nodes, such as one-to-one, broadcasting, permutations, and multicast patterns.

Important issues for multicomputers include message-routing schemes, network flow control strategies, deadlock avoidance, virtual channels, message-passing primitives, and program decomposition techniques.
The Paragon system had a mesh architecture, and the nCUBE/2 had a hypercube architecture. The Intel i860s and some custom-designed VLSI processors were used as building blocks in these machines. All three OSs were UNIX-compatible with extended functions to support message passing.

Most multicomputers can be upgraded to yield a higher degree of parallelism with enhanced processors. We will study various massively parallel systems in Part III, where the tradeoffs between scalability and programmability are analyzed.
Fig. 1.10 Bell's taxonomy of MIMD computers (Courtesy of Gordon Bell; reprinted with permission from the Communications of the ACM, August 1991)
Multicomputers use distributed memories with multiple address spaces. They are scalable with distributed memory. The evolution of fast LAN (local area network)-connected workstations has created "commodity supercomputing". Bell was the first to advocate high-speed workstation clusters interconnected by high-speed switches in lieu of special-purpose multicomputers. The CM-5 development was an early move in that direction.

The scalability of MIMD computers will be further studied in Section 3.4 and Chapter 9. In Part III, we will study distributed-memory multiprocessors (KSR-1, SCI, etc.); central-memory multiprocessors (Cray, IBM, DEC, Fujitsu, Encore, etc.); multicomputers by Intel, TMC, and nCUBE; fast LAN-based workstation clusters; and other exploratory research systems.
Fig. 1.11 The architecture of a vector supercomputer: a scalar processor with scalar functional pipelines, a vector processor with vector control unit, vector registers and vector functional pipelines, a main memory holding program and data, mass storage, a host computer, and user I/O
If the instruction is decoded as a vector operation, it will be sent to the vector control unit. This control unit will supervise the flow of vector data between the main memory and vector functional pipelines. The vector data flow is coordinated by the control unit. A number of vector functional pipelines may be built into a vector processor. Two pipelined vector supercomputer models are described below.
Vector Processor Models Figure 1.11 shows a register-to-register architecture. Vector registers are used to hold the vector operands, and intermediate and final vector results. The vector functional pipelines retrieve operands from and put results into the vector registers. All vector registers are programmable in user instructions. Each vector register is equipped with a component counter which keeps track of the component registers used in successive pipeline cycles.

The length of each vector register is usually fixed, say, sixty-four 64-bit component registers in a vector register in a Cray Series supercomputer. Other machines, like the Fujitsu VP2000 Series, use reconfigurable vector registers to dynamically match the register length with that of the vector operands.

In general, there are fixed numbers of vector registers and functional pipelines in a vector processor. Therefore, both resources must be reserved in advance to avoid resource conflicts between different vector operations. Some early vector-register based supercomputers are summarized in Table 1.5.
Representative Supercomputers Over a dozen pipelined vector computers have been manufactured, ranging from workstations to mini- and supercomputers. Notable early examples include the Stardent 3000 multiprocessor equipped with vector pipelines, the Convex C3 Series, the DEC VAX 9000, the IBM 390/VF, the Cray Research Y-MP family, the NEC SX Series, the Fujitsu VP2000, and the Hitachi S-810/20. For further discussion, see Chapters 8 and 13.

The Convex C1 and C2 Series were made with ECL/CMOS technologies. The later C3 Series was based on GaAs technology.

The DEC VAX 9000 was Digital's largest mainframe system providing concurrent scalar/vector and multiprocessing capabilities. The VAX 9000 processors used a hybrid architecture. The vector unit was an optional feature attached to the VAX 9000 CPU. The Cray Y-MP family offered both vector and multiprocessing capabilities.
An operational model of an SIMD computer (a control unit driving an array of processing elements through an interconnection network) is specified by a 5-tuple M = (N, C, I, M, R),
where

(1) N is the number of processing elements (PEs) in the machine. For example, the Illiac IV had 64 PEs and the Connection Machine CM-2 had 65,536 PEs.
(2) C is the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions.
(3) I is the set of instructions broadcast by the CU to all PEs for parallel execution. These include arithmetic, logic, data routing, masking, and other local operations executed by each active PE over data within that PE.
(4) M is the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.
(5) R is the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.
One can describe a particular SIMD machine architecture by specifying the 5-tuple. An example SIMD machine is partially specified below.

Example 1.3 Operational specification of the MasPar MP-1 computer
We will study the detailed architecture of the MasPar MP-1 in Chapter 7. Listed below is a partial specification of the 5-tuple for this machine:

(1) The MP-1 was an SIMD machine with N = 1024 to 16,384 PEs, depending on which configuration is considered.
(2) The CU executed scalar instructions, broadcast decoded vector instructions to the PE array, and controlled inter-PE communications.
(3) Each PE was a register-based load/store RISC processor capable of executing integer operations over various data sizes and standard floating-point operations. The PEs received instructions from the CU.
(4) The masking scheme was built within each PE and continuously monitored by the CU, which could set and reset the status of each PE dynamically at run time.
(5) The MP-1 had an X-Net mesh network plus a global multistage crossbar router for inter-CU-PE, X-Net nearest 8-neighbor, and global router communications.
Representative SIMD Computers Three early commercial SIMD supercomputers are summarized in Table 1.6. The number of PEs in these systems ranges from 4096 in the DAP610 to 16,384 in the MasPar MP-1 and 65,536 in the CM-2. Both the CM-2 and DAP610 were fine-grain, bit-slice SIMD computers with attached floating-point accelerators for blocks of PEs*.

Each PE of the MP-1 was equipped with a 1-bit logic unit, 4-bit integer ALU, 64-bit mantissa unit, and 16-bit exponent unit. Multiple PEs could be built on a single chip due to the simplicity of each PE. The MP-1

* With rapid advances in VLSI technology, the use of bit-slice processors in systems has disappeared.
implemented 32 PEs per chip with forty 32-bit registers per PE. The 32 PEs were interconnected by an X-Net mesh, which was a 4-neighbor mesh augmented with diagonal dual-stage links.

The CM-2 implemented 16 PEs as a mesh on a single chip. Each 16-PE mesh chip was placed at one vertex of a 12-dimensional hypercube. Thus 16 × 2^12 = 2^16 = 65,536 PEs formed the entire SIMD array.

The DAP610 implemented 64 PEs as a mesh on a chip. Globally, a large mesh (64 × 64) was formed by interconnecting these small meshes on chips. Fortran 90 and modified versions of C, Lisp, and other sequential programming languages have been developed to program SIMD machines.
Table 1.6 Three early commercial SIMD supercomputers

MasPar Computer Corporation, MP-1 Family: Designed for configurations from 1024 to 16,384 processors with 26,000 MIPS or 1.3 Gflops. Each PE was a RISC processor with 16 Kbytes of local memory. An X-Net mesh plus a multistage crossbar interconnect. Software: Fortran 77, MasPar Fortran (MPF), and the MasPar Parallel Application Language; UNIX/OS with X-window, symbolic debugger, visualizers and animators.

Thinking Machines Corporation, CM-2: A bit-slice array of up to 65,536 PEs arranged as a 10-dimensional hypercube with a 4 × 4 mesh on each vertex, up to 1M bits of memory per PE, with an optional FPU shared between blocks of 32 PEs; 28 Gflops peak and 5.6 Gflops sustained. Software: driven by a host VAX, Sun, or Symbolics 3600; Lisp compiler, Fortran 90, C, and *Lisp supported by PARIS.

Active Memory Technology, DAP600 Family: A fine-grain, bit-slice SIMD array of up to 4096 PEs interconnected by a square mesh with 1K bits per PE, orthogonal and 4-neighbor links; 20 GIPS and 560 Mflops peak performance. Software: provided by the host (VAX/VMS or UNIX); Fortran-plus or APAL on the DAP, Fortran 77 or C on the host.
also useful in scalability and programmability analysis, when real machines are compared with an idealized parallel machine without worrying about communication overhead among processing nodes.
NP-Completeness An algorithm has a polynomial complexity if there exists a polynomial p(s) such that the time complexity is O(p(s)) for problem size s. The set of problems having polynomial-complexity algorithms is called the P-class (for polynomial class). The set of problems solvable by nondeterministic algorithms in polynomial time is called the NP-class (for nondeterministic polynomial class).

Since deterministic algorithms are special cases of the nondeterministic ones, we know that P ⊂ NP. The P-class problems are computationally tractable, while the NP − P-class problems are intractable. But we do not know whether P = NP or P ≠ NP. This is still an open problem in computer science.

To simulate a nondeterministic algorithm with a deterministic algorithm may require exponential time. Therefore, intractable NP-class problems are also said to have exponential-time complexity.
Example 1.4 Polynomial- and exponential-complexity algorithms

Polynomial-complexity algorithms are known for sorting n numbers in O(n log n) time and for multiplication of two n × n matrices in O(n^3) time. Therefore, both problems belong to the P-class.
Nonpolynomial algorithms have been developed for the traveling salesperson problem with complexity O(n^2 2^n) and for the knapsack problem with complexity O(2^(n/2)). These complexities are exponential, greater than the polynomial complexities. So far, deterministic polynomial algorithms have not been found for these problems. Therefore, these exponential-complexity problems belong to the NP-class.

Most computer scientists believe that P ≠ NP. This leads to the conjecture that there exists a subclass, called NP-complete (NPC) problems, such that NPC ⊂ NP but NPC ∩ P = ∅ (Fig. 1.13). In fact, it has been proved that if any NP-complete problem is polynomial-time solvable, then one can conclude P = NP. Thus NP-complete problems are considered the hardest ones to solve. Only approximation algorithms can be derived for solving the NP-complete problems in polynomial time.
Fig. 1.13 The relationships conjectured among the NP, P, and NPC classes of computational problems
PRAM Models Conventional uniprocessor computers have been modeled as random-access machines (RAM) by Sheperdson and Sturgis (1963). A parallel random-access machine (PRAM) model has been developed by Fortune and Wyllie (1978) for modeling idealized parallel computers with zero synchronization or memory access overhead. This PRAM model will be used for parallel algorithm development and for scalability and complexity analysis.
An n-processor PRAM (Fig. 1.14) has a globally addressable memory. The shared memory can be distributed among the processors or centralized in one place. The n processors, also called processing elements (PEs), operate on a synchronized read-memory, compute, and write-memory cycle. With shared memory, the model must specify how concurrent read and concurrent write of memory are handled. Four memory-update options are possible:
Fig. 1.14 PRAM model of a multiprocessor system with shared memory, on which all n processors operate in lockstep in memory access and program execution operations; each processor can access any memory location in unit time
• Exclusive read (ER): This allows at most one processor to read from any memory location in each cycle, a rather restrictive policy.
• Exclusive write (EW): This allows at most one processor to write into a memory location at a time.
• Concurrent read (CR): This allows multiple processors to read the same information from the same memory cell in the same cycle.
• Concurrent write (CW): This allows simultaneous writes to the same memory location. In order to avoid confusion, some policy must be set up to resolve the write conflicts.
Various combinations of the above options lead to several variants of the PRAM model as specified below. Since CR does not create a conflict problem, variants differ mainly in how they handle the CW conflicts.

PRAM Variants Described below are four variants of the PRAM model, depending on how the memory reads and writes are handled.

(1) EREW-PRAM model: This model forbids more than one processor from reading or writing the same memory cell simultaneously (Snir, 1982; Karp and Ramachandran, 1988). This is the most restrictive PRAM model proposed.
(2) CREW-PRAM model: The write conflicts are avoided by mutual exclusion. Concurrent reads to the same memory location are allowed.
(3) ERCW-PRAM model: This allows exclusive reads or concurrent writes to the same memory location.
(4) CRCW-PRAM model: This model allows either concurrent reads or concurrent writes to the same memory location.
Step 2

1. ℓ ← n
2. Repeat
      ℓ ← ℓ/2
      if (k < ℓ) then
         begin
            Read C(i, j, k)
            Read C(i, j, k + ℓ)
            Compute C(i, j, k) + C(i, j, k + ℓ)
            Store in C(i, j, k)
         end
   until (ℓ = 1)
To reduce the number of PEs to n^3/log n, use a PE array of size n × n × n/log n. Each PE is responsible for computing log n product terms and summing them up. Step 1 can be easily modified to produce n/log n partial sums, each consisting of log n multiplications and (log n − 1) additions. Now we have an array C(i, j, k), 0 ≤ i, j ≤ n − 1, 0 ≤ k ≤ n/log n − 1, which can be summed up in log(n/log n) time. Combining the time spent in step 1 and step 2, we have a total execution time 2 log n − 1 + log(n/log n) = O(log n) for large n.
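For readers who want to see the Step-2 reduction in executable form, here is a minimal sequential simulation in C of the binary-tree summation for one fixed (i, j) pair; on a PRAM every iteration of the inner loop would run concurrently, so each halving of ℓ corresponds to one of the log2 n parallel time steps (the array contents are arbitrary test values):

    #include <stdio.h>

    #define N 8   /* n; a power of two for this sketch */

    int main(void) {
        /* c[k] plays the role of C(i, j, k) for one fixed (i, j). */
        double c[N] = {1, 2, 3, 4, 5, 6, 7, 8};

        /* Binary-tree reduction: each halving of l is one parallel time step. */
        for (int l = N / 2; l >= 1; l /= 2)
            for (int k = 0; k < l; k++)          /* all k < l update concurrently on a PRAM */
                c[k] = c[k] + c[k + l];          /* C(i,j,k) <- C(i,j,k) + C(i,j,k+l) */

        printf("C(i,j) = %.0f after log2(%d) = 3 steps\n", c[0], N);   /* 36 */
        return 0;
    }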
Discrepancy with Physical Models PRAM models idealize parallel computers, in which all memory references and program executions by multiple processors are synchronized without extra cost. In reality, such parallel machines do not exist. An SIMD machine with shared memory is the closest architecture modeled by PRAM. However, PRAM allows different instructions to be executed on different processors simultaneously. Therefore, PRAM really operates in synchronized MIMD mode with a shared memory.

Among the four PRAM variants, the EREW and CREW are the most popular models used. In fact, every CREW algorithm can be simulated by an EREW algorithm. The CREW algorithm runs faster than an equivalent EREW algorithm. It has been proved that the best n-processor EREW algorithm can be no more than O(log n) times slower than any n-processor CRCW algorithm.

The CREW model has received more attention in the literature than the ERCW model. For our purposes, we will use the CRCW-PRAM model unless otherwise stated. This particular model will be used in defining scalability in Chapter 3.

For complexity analysis or performance comparison, various PRAM variants offer an ideal model of parallel computers. Therefore, computer scientists use the PRAM model more often than computer engineers. In this book, we design parallel/vector computers using physical architectural models rather than PRAM models. The PRAM model will be used for scalability and performance studies in Chapter 3 as a theoretical reference machine. PRAM models can indicate upper and lower bounds on the performance of real parallel computers.
is presented below, based on the work of Clark Thompson (1980). Three lower bounds on VLSI circuits are interpreted by Jeffrey Ullman (1984). The bounds are obtained by setting limits on memory, I/O, and communication for implementing parallel algorithms with VLSI chips.

The AT^2 Model Let A be the chip area and T be the latency for completing a given computation using a VLSI circuit chip. Let s be the problem size involved in the computation. Thompson stated in his doctoral thesis that for certain computations, there exists a lower bound f(s) such that

A × T^2 ≥ O(f(s))
Memory Bound on Chip Area There are many computations which are memory-bound, due to the need to process large data sets. To implement this type of computation in silicon, one is limited by how densely information (bit cells) can be placed on the chip. As depicted in Fig. 1.15a, the memory requirement of a computation sets a lower bound on the chip area A.

The amount of information processed by the chip can be visualized as information flow upward across the chip area. Each bit can flow through a unit area of the horizontal chip slice. Thus, the chip area bounds the amount of memory bits stored on the chip.

I/O Bound on Volume AT The volume of the rectangular cube is represented by the product AT. As information flows through the chip for a period of time T, the number of input bits cannot exceed the volume AT, as demonstrated in Fig. 1.15a.