ACA UNIT-1 A Kai Hwang

The document discusses the evolution and architecture of parallel computer models, highlighting the significance of parallel processing in modern computing. It outlines the historical milestones in computer development, detailing the five generations of electronic computers and their respective hardware and software advancements. Additionally, it emphasizes the integration of hardware, software, and programming elements in addressing complex computing problems across various applications.



Parallel Computer Models


Over the last two decades, computer and communication technologies have literally transformed the world we live in. Parallel processing has emerged as the key enabling technology in modern computers, driven by the ever-increasing demand for higher performance, lower costs, and sustained productivity in real-life applications.
Parallelism appears in various forms, such as lookahead, pipelining, vectorization, concurrency, simultaneity, data parallelism, partitioning, interleaving, overlapping, multiplicity, replication, time sharing, space sharing, multitasking, multiprogramming, multithreading, and distributed computing at different processing levels.
In this chapter, we model physical architectures of parallel computers, vector supercomputers, multiprocessors, multicomputers, and massively parallel processors. Theoretical machine models are also presented, including the parallel random-access machine (PRAM) and the complexity model of VLSI (very large-scale integration) circuits. Architectural development tracks are identified with case studies in the book. Hardware and software subsystems are introduced to pave the way for detailed studies in subsequent chapters.

1.1 THE STATE OF COMPUTING


Modern computers are equipped with powerful hardware facilities driven by extensive software packages. To assess state-of-the-art computing, we first review historical milestones in the development of computers. Then we take a grand tour of the crucial hardware and software elements built into modern computer systems. We then examine the evolutional relations in milestone architectural development. Basic hardware and software factors are identified in analyzing the performance of computers.

1.1.1 Computer Development Milestones


Prior to 1945, computers were made with mechanical or electromechanical parts. The earliest mechanical computer can be traced back to 500 BC in the form of the abacus used in China. The abacus is manually operated to perform decimal arithmetic with carry propagation digit by digit.
Blaise Pascal built a mechanical adder/subtractor in France in 1642. Charles Babbage designed a difference engine in England for polynomial evaluation in 1827. Konrad Zuse built the first binary mechanical computer in Germany in 1941. Howard Aiken proposed the very first electromechanical decimal computer, which was built as the Harvard Mark I by IBM in 1944. Both Zuse's and Aiken's machines were designed for general-purpose computations.

Obviously, the fact that computing and communication were carried out with moving mechanical parts greatly limited the computing speed and reliability of mechanical computers. Modern computers were marked by the introduction of electronic components. The moving parts in mechanical computers were replaced by high-mobility electrons in electronic computers. Information transmission by mechanical gears or levers was replaced by electric signals traveling almost at the speed of light.

Computer Generations Over the past several decades, electronic computers have gone through roughly five generations of development. Table 1.1 provides a summary of the five generations of electronic computer development. Each of the first three generations lasted about 10 years. The fourth generation covered a time span of 15 years. The fifth generation today has processors and memory devices with more than 1 billion transistors on a single silicon chip.
The division of generations is marked primarily by major changes in hardware and software technologies. The entries in Table 1.1 indicate the new hardware and software features introduced with each generation. Most features introduced in earlier generations have been passed to later generations.

Table 1.1 Five Generations of Electronic Computers

First generation (1945-54). Technology and architecture: vacuum tubes and relay memories, CPU driven by PC and accumulator, fixed-point arithmetic. Software and applications: machine/assembly languages, single user, no subroutine linkage, programmed I/O using CPU. Representative systems: ENIAC, Princeton IAS, IBM 701.

Second generation (1955-64). Technology and architecture: discrete transistors and core memories, floating-point arithmetic, I/O processors, multiplexed memory access. Software and applications: HLL used with compilers, subroutine libraries, batch processing monitor. Representative systems: IBM 7090, CDC 1604, Univac LARC.

Third generation (1965-74). Technology and architecture: integrated circuits (SSI/MSI), microprogramming, pipelining, cache, and lookahead processors. Software and applications: multiprogramming and time-sharing OS, multiuser applications. Representative systems: IBM 360/370, CDC 6600, TI-ASC, PDP-8.

Fourth generation (1975-90). Technology and architecture: LSI/VLSI and semiconductor memory, multiprocessors, vector supercomputers, multicomputers. Software and applications: multiprocessor OS, languages, compilers, and environments for parallel processing. Representative systems: VAX 9000, Cray X-MP, IBM 3090, BBN TC2000.

Fifth generation (1991-present). Technology and architecture: advanced VLSI processors, memory, and switches, high-density packaging, scalable architectures. Software and applications: superscalar systems on a chip, massively parallel processing, grand challenge applications, heterogeneous processing. Representative systems: see Tables 1.3 and 1.4, and Chapter 13.

Progress in Hardware As far as hardware technology is concerned, the first generation (1945-1954) used vacuum tubes and relay memories interconnected by insulated wires. The second generation (1955-1964)
was marked by the use of discrete transistors, diodes, and magnetic ferrite cores, interconnected by printed circuits.
The third generation (1965-1974) began to use integrated circuits (ICs) for both logic and memory in small-scale or medium-scale integration (SSI or MSI) and multilayered printed circuits. The fourth generation (1974-1991) used large-scale or very-large-scale integration (LSI or VLSI). Semiconductor memory replaced core memory as computers moved from the third to the fourth generation.
The fifth generation (1991-present) is highlighted by the use of high-density and high-speed processor and memory chips based on advanced VLSI technology. For example, 64-bit GHz-range processors are now available on a single chip with over one billion transistors.

The First Generation From the architectural and software points of view, first-generation computers were built with a single central processing unit (CPU) which performed serial fixed-point arithmetic using a program counter, branch instructions, and an accumulator. The CPU must be involved in all memory access and input/output (I/O) operations. Machine or assembly languages were used.
Representative systems include the ENIAC (Electronic Numerical Integrator and Calculator) built at the Moore School of the University of Pennsylvania in 1950; the IAS (Institute for Advanced Studies) computer based on a design proposed by John von Neumann, Arthur Burks, and Herman Goldstine at Princeton in 1946; and the IBM 701, the first electronic stored-program commercial computer built by IBM in 1953. Subroutine linkage was not implemented in early computers.

The Second Generation Index registers, floating-point arithmetic, multiplexed memory, and I/O processors were introduced with second-generation computers. High-level languages (HLLs), such as Fortran, Algol, and Cobol, were introduced along with compilers, subroutine libraries, and batch processing monitors. Register transfer language was developed by Irving Reed (1957) for the systematic design of digital computers.
Representative systems include the IBM 7030 (the Stretch computer) featuring instruction lookahead and error-correcting memories built in 1962, the Univac LARC (Livermore Atomic Research Computer) built in 1959, and the CDC 1604 built in the 1960s.

The Third Generation The third generation was represented by the IBM 360/370 Series, the CDC 6600/7600 Series, the Texas Instruments ASC (Advanced Scientific Computer), and Digital Equipment's PDP-8 Series from the mid-1960s to the mid-1970s.
Microprogrammed control became popular with this generation. Pipelining and cache memory were introduced to close up the speed gap between the CPU and main memory. The idea of multiprogramming was implemented to interleave CPU and I/O activities across multiple user programs. This led to the development of time-sharing operating systems (OS) using virtual memory with greater sharing or multiplexing of resources.

The Fourth Generation Parallel computers in various architectures appeared in the fourth generation of computers using shared or distributed memory or optional vector hardware. Multiprocessing OS, special languages, and compilers were developed for parallelism. Software tools and environments were created for parallel processing or distributed computing.
Representative systems include the VAX 9000, Cray X-MP, IBM 3090/VF, BBN TC-2000, etc. During these 15 years (1975-1990), the technology of parallel processing gradually became mature and entered the production mainstream.

The Fifth Generation These systems emphasize superscalar processors, cluster computers, and massively parallel processing (MPP). Scalable and latency-tolerant architectures are being adopted in MPP systems using advanced VLSI technologies, high-density packaging, and optical technologies.
Fifth-generation computers achieved Teraflops (10^12 floating-point operations per second) performance by the mid-1990s, and have now crossed the Petaflop (10^15 floating-point operations per second) range. Heterogeneous processing is emerging to solve large-scale problems using a network of heterogeneous computers. Early fifth-generation MPP systems were represented by several projects at Fujitsu (VPP500), Cray Research (MPP), Thinking Machines Corporation (the CM-5), and Intel (the Paragon). For present-day examples of advanced processors and systems, see Chapter 13.

1.1.2 Elements of Modern Computers


Hardware, software, and programming elements of a modern computer system are briefly introduced below in the context of parallel processing.

Computing Problems It has long been recognized that the concept of computer architecture is no longer restricted to the structure of the bare machine hardware. A modern computer is an integrated system consisting of machine hardware, an instruction set, system software, application programs, and user interfaces. These system elements are depicted in Fig. 1.1. The use of a computer is driven by real-life problems demanding cost-effective solutions. Depending on the nature of the problems, the solutions may require different computing resources.

"W19
PS’ “ms Operatl ng
System

A‘and Data
"ms Manning Hardware
5
Strum
H%
Arelitoeture

Pro-gramrntng

Bmdlng Applications Software


[com pits, toad]
High-level
Languages

Performance
Evsl uatlon

Fig. 1.1 Elements ofa modern oon1pu1:e1- system

For numerical problems in science and technology, the solutions demand complex mathematical formulations and intensive integer or floating-point computations. For alphanumerical problems in business
and government, the solutions demand efficient transaction processing, large database management, and information retrieval operations.
For artificial intelligence (AI) problems, the solutions demand logic inferences and symbolic manipulations. These computing problems have been labeled numerical computing, transaction processing, and logical reasoning. Some complex problems may demand a combination of these processing modes.

Algorithms and Data Structures Special algorithms and data structures are needed to specify the computations and communications involved in computing problems. Most numerical algorithms are deterministic, using regularly structured data. Symbolic processing may use heuristics or nondeterministic searches over large knowledge bases.
Problem formulation and the development of parallel algorithms often require interdisciplinary interactions among theoreticians, experimentalists, and computer programmers. There are many books dealing with the design and mapping of algorithms or heuristics onto parallel computers. In this book, we are more concerned about the resources mapping problem than about the design and analysis of parallel algorithms.

Hardware Resources The system architecture of a computer is represented by three nested circles on the right in Fig. 1.1. A modern computer system demonstrates its power through coordinated efforts by hardware resources, an operating system, and application software. Processors, memory, and peripheral devices form the hardware core of a computer system. We will study instruction-set processors, memory organization, multiprocessors, supercomputers, multicomputers, and massively parallel computers.
Special hardware interfaces are often built into I/O devices such as display terminals, workstations, optical page scanners, magnetic ink character recognizers, modems, network adaptors, voice data entry, printers, and plotters. These peripherals are connected to mainframe computers directly or through local or wide-area networks.
In addition, software interface programs are needed. These software interfaces include file transfer systems, editors, word processors, device drivers, interrupt handlers, network communication programs, etc. These programs greatly facilitate the portability of user programs on different machine architectures.

Operating System An effective operating system manages the allocation and deallocation of resources during the execution of user programs. Beyond the OS, application software must be developed to benefit the users. Standard benchmark programs are needed for performance evaluation.
Mapping is a bidirectional process matching algorithmic structure with hardware architecture, and vice versa. Efficient mapping will benefit the programmer and produce better source codes. The mapping of algorithmic and data structures onto the machine architecture includes processor scheduling, memory maps, interprocessor communications, etc. These activities are usually architecture-dependent.
Optimal mappings are sought for various computer architectures. The implementation of these mappings relies on efficient compiler and operating system support. Parallelism can be exploited at algorithm design time, at programming time, at compile time, and at run time. Techniques for exploiting parallelism at these levels form the core of parallel processing technology.

System Software Support Software support is needed for the development of efficient programs in high-level languages. The source code written in an HLL must first be translated into object code by an optimizing compiler. The compiler assigns variables to registers or to memory words, and generates machine operations corresponding to HLL operators, to produce machine code which can be recognized by the machine hardware. A loader is used to initiate the program execution through the OS kernel.
B i _ Admrrcad Compunerfirchitectn-.re

Resource binding demands the use of the compiler, assembler, loader, and OS kernel to commit physical machine resources to program execution. The effectiveness of this process determines the efficiency of hardware utilization and the programmability of the computer. Today, programming parallelism is still difficult for most programmers due to the fact that existing languages were originally developed for sequential computers. Programmers are sometimes forced to program hardware-dependent features instead of programming parallelism in a generic and portable way. Ideally, we need to develop a parallel programming environment with architecture-independent languages, compilers, and software tools.
To develop a parallel language, we aim for efficiency in its implementation, portability across different machines, compatibility with existing sequential languages, expressiveness of parallelism, and ease of programming. One can attempt a new language approach or try to extend existing sequential languages gradually. A new language approach has the advantage of using explicit high-level constructs for specifying parallelism. However, new languages are often incompatible with existing languages and require new compilers or new passes to existing compilers. Most systems choose the language extension approach; one way to achieve this is by providing appropriate function libraries.

Compiler Support There are three compiler upgrade approaches: preprocessor, precompiler, and parallelizing compiler. A preprocessor uses a sequential compiler and a low-level library of the target computer to implement high-level parallel constructs. The precompiler approach requires some program flow analysis, dependence checking, and limited optimizations toward parallelism detection. The third approach demands a fully developed parallelizing or vectorizing compiler which can automatically detect parallelism in source code and transform sequential codes into parallel constructs. These approaches will be studied in Chapter 10.
The efficiency of the binding process depends on the effectiveness of the preprocessor, the precompiler, the parallelizing compiler, the loader, and the OS support. Due to unpredictable program behavior, none of the existing compilers can be considered fully automatic or fully intelligent in detecting all types of parallelism. Very often compiler directives are inserted into the source code to help the compiler do a better job. Users may interact with the compiler to restructure the programs. This has been proven useful in enhancing the performance of parallel computers.

1.1.3 Evolution of Computer Architecture


The study of computer architecture involves both hardware organization and programming/software requirements. As seen by an assembly language programmer, computer architecture is abstracted by its instruction set, which includes opcodes (operation codes), addressing modes, registers, virtual memory, etc. From the hardware implementation point of view, the abstract machine is organized with CPUs, caches, buses, microcode, pipelines, physical memory, etc. Therefore, the study of architecture covers both instruction-set architectures and machine implementation organizations.
Over the past decades, computer architecture has gone through evolutional rather than revolutional changes. Sustaining features are those that were proven performance deliverers. As depicted in Fig. 1.2, we started with the von Neumann architecture built as a sequential machine executing scalar data. The sequential computer was improved from bit-serial to word-parallel operations, and from fixed-point to floating-point operations. The von Neumann architecture is slow due to sequential execution of instructions in programs.

Fig. 1.2 Tree showing architectural evolution from sequential scalar computers to vector processors and parallel computers. (Legends: I/E: instruction fetch and execute; SIMD: single instruction stream and multiple data streams; MIMD: multiple instruction streams and multiple data streams. The tree branches from scalar lookahead through functional parallelism, multiple functional units, and pipelining to memory-to-memory and register-to-register vector processors, SIMD associative processors and processor arrays, MIMD multiprocessors and multicomputers, and massively parallel processors (MPP).)

Lookahead, Parallelism, and Pipelining Lookahead techniques were introduced to prefetch instructions in order to overlap I/E (instruction fetch/decode and execution) operations and to enable functional parallelism. Functional parallelism was supported by two approaches: One is to use multiple functional units simultaneously, and the other is to practice pipelining at various processing levels.
The latter includes pipelined instruction execution, pipelined arithmetic computations, and memory-access operations. Pipelining has proven especially attractive in performing identical operations repeatedly over vector data strings. Vector operations were originally carried out implicitly by software-controlled looping using scalar pipeline processors.

Flynn's Classification Michael Flynn (1972) introduced a classification of various computer architectures based on notions of instruction and data streams. As illustrated in Fig. 1.3a, conventional sequential machines are called SISD (single instruction stream over a single data stream) computers. Vector computers are equipped with scalar and vector hardware or appear as SIMD (single instruction stream over multiple data streams) machines (Fig. 1.3b). Parallel computers are reserved for MIMD (multiple instruction streams over multiple data streams) machines.
An MISD (multiple instruction streams and a single data stream) machine is modeled in Fig. 1.3d. The same data stream flows through a linear array of processors executing different instruction streams. This
architecture is also known as systolic arrays (Kung and Leiserson, 1978) for pipelined execution of specific algorithms.

Fig. 1.3 Flynn's classification of computer architectures (derived from Michael Flynn, 1972): (a) SISD uniprocessor architecture; (b) SIMD architecture (with distributed memory); (c) MIMD architecture (with shared memory); (d) MISD architecture (the systolic array). Legends: CU = control unit; PU = processing unit; MU = memory unit; PE = processing element; LM = local memory; IS = instruction stream; DS = data stream.

Of the four machine models, most parallel computers built in the past assumed the MIMD model for general-purpose computations. The SIMD and MISD models are more suitable for special-purpose computations. For this reason, MIMD is the most popular model, SIMD next, and MISD the least popular model being applied in commercial machines.

Parallel/Vector Computers Intrinsic parallel computers are those that execute programs in MIMD mode. There are two major classes of parallel computers, namely, shared-memory multiprocessors and message-passing multicomputers. The major distinction between multiprocessors and multicomputers lies in memory sharing and the mechanisms used for interprocessor communication.

The processors in a multiprocessor system communicate with each other through shared variables in a common memory. Each computer node in a multicomputer system has a local memory, unshared with other nodes. Interprocessor communication is done through message passing among the nodes.
Explicit vector instructions were introduced with the appearance of vector processors. A vector processor is equipped with multiple vector pipelines that can be concurrently used under hardware or firmware control. There are two families of pipelined vector processors:
Memory-to-memory architecture supports the pipelined flow of vector operands directly from the memory to pipelines and then back to the memory. Register-to-register architecture uses vector registers to interface between the memory and functional pipelines. Vector processor architectures will be studied in Chapter 8.
Another important branch of the architecture tree consists of the SIMD computers for synchronized vector processing. An SIMD computer exploits spatial parallelism rather than temporal parallelism as in a pipelined computer. SIMD computing is achieved through the use of an array of processing elements (PEs) synchronized by the same controller. Associative memory can be used to build SIMD associative processors. SIMD machines will be treated in Chapter 8 along with pipelined vector computers.
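As a small illustrative sketch (not an example from the text), the loop below is the kind of data-parallel kernel that explicit vector instructions target: on a register-to-register machine, strips of x and y are loaded into vector registers, a vector multiply-add is issued over each strip, and the result register is stored back, whereas a memory-to-memory machine would stream the operands directly between memory and the arithmetic pipeline.

! Hypothetical kernel: each iteration is independent, so one vector
! instruction can replace many scalar iterations.
subroutine saxpy(n, a, x, y)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i
  do i = 1, n                  ! a vectorizing compiler processes this loop
     y(i) = a * x(i) + y(i)    ! in strips of the hardware vector length
  end do
end subroutine saxpy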

Development Layers A layered development of parallel computers is illustrated in Fig. 1.4, based on a classification by Lionel Ni (1990). Hardware configurations differ from machine to machine, even those of the same model. The address space of a processor in a computer system varies among different architectures. It depends on the memory organization, which is machine-dependent. These features are up to the designer and should match the target application domains.

Fig. 1.4 Six layers for computer system development (courtesy of Lionel Ni, 1990). From top to bottom: applications, programming environment, languages supported, communication model, addressing space, hardware architecture; the upper layers are machine-independent and the lower layers are machine-dependent.

On the other hand, we want to develop application programs and programming environments which are machine-independent. Independent of machine architecture, the user programs can be ported to many computers with minimum conversion costs. High-level languages and communication models depend on the architectural choices made in a computer system. From a programmer's viewpoint, these two layers should be architecture-transparent.
Programming languages such as Fortran, C, C++, Pascal, Ada, Lisp and others can be supported by most computers. However, the communication models, shared variables versus message passing, are mostly machine-dependent. The Linda approach using tuple space offers an architecture-transparent communication model for parallel computers. These language features will be studied in Chapter 10.
Application programmers prefer more architectural transparency. However, kernel programmers have to explore the opportunities supported by hardware. As a good computer architect, one has to approach the problem from both ends. The compilers and OS support should be designed to remove as many architectural constraints as possible from the programmer.

New Challenges The technology of parallel processing is the outgrowth of several decades of research and industrial advances in microelectronics, printed circuits, high-density packaging, advanced processors, memory systems, peripheral devices, communication channels, language evolution, compiler sophistication, operating systems, programming environments, and application challenges.
The rapid progress made in hardware technology has significantly increased the economical feasibility of building a new generation of computers adopting parallel processing. However, the major barrier preventing parallel processing from entering the production mainstream is on the software and application side.
To date, it is still fairly difficult to program parallel and vector computers. We need to strive for major progress in the software area in order to create a user-friendly environment for high-power computers. A whole new generation of programmers need to be trained to program parallelism effectively. High-performance computers provide fast and accurate solutions to scientific, engineering, business, social, and defense problems.
Representative real-life problems include weather forecast modeling, modeling of physical, chemical and biological processes, computer aided design, large-scale database management, artificial intelligence, crime control, and strategic defense initiatives, just to name a few. The application domains of parallel processing computers are expanding steadily. With a good understanding of scalable computer architectures and mastery of parallel programming techniques, the reader will be better prepared to face future computing challenges.

1.1.4 System Attributes to Performance


The ideal performance of a computer system demands a perfect match between machine capability and program behavior. Machine capability can be enhanced with better hardware technology, innovative architectural features, and efficient resources management. However, program behavior is difficult to predict due to its heavy dependence on application and run-time conditions.
There are also many other factors affecting program behavior, including algorithm design, data structures, language efficiency, programmer skill, and compiler technology. It is impossible to achieve a perfect match between hardware and software by merely improving only a few factors without touching other factors.
Besides, machine performance may vary from program to program. This makes peak performance an impossible target to achieve in real-life applications. On the other hand, a machine cannot be said to have an average performance either. All performance indices or benchmarking results must be tied to a program mix. For this reason, the performance should be described as a range or a distribution.
We introduce below fundamental factors for projecting the performance of a computer. These performance indicators are by no means conclusive in all applications. However, they can be used to guide system architects in designing better machines or to educate programmers or compiler writers in optimizing the codes for more efficient execution by the hardware.
Consider the execution of a given program on a given computer. The simplest measure of program performance is the turnaround time, which includes disk and memory accesses, input and output activities, compilation time, OS overhead, and CPU time. In order to shorten the turnaround time, one must reduce all these time factors.
In a multiprogrammed computer, the I/O and system overheads of a given program may overlap with the CPU times required in other programs. Therefore, it is fair to compare just the total CPU time needed for program execution. The CPU is used to execute both system programs and user programs, although often it is the user CPU time that concerns the user most.

Clock Rate and CPI The CPU (or simply the processor) of today's digital computer is driven by a clock with a constant cycle time τ. The inverse of the cycle time is the clock rate (f = 1/τ). The size of a program is determined by its instruction count (I_c), in terms of the number of machine instructions to be executed in the program. Different machine instructions may require different numbers of clock cycles to execute. Therefore, the cycles per instruction (CPI) becomes an important parameter for measuring the time needed to execute each instruction.
For a given instruction set, we can calculate an average CPI over all instruction types, provided we know their frequencies of appearance in the program. An accurate estimate of the average CPI requires a large amount of program code to be traced over a long period of time. Unless specifically focusing on a single instruction type, we simply use the term CPI to mean the average value with respect to a given instruction set and a given program mix.
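As a purely illustrative calculation (the instruction mix below is hypothetical, not taken from the text), the average CPI is the frequency-weighted sum of the per-type cycle counts. For a mix of 60% one-cycle ALU instructions, 30% two-cycle load/store instructions, and 10% three-cycle branches:

    average CPI = 0.6 × 1 + 0.3 × 2 + 0.1 × 3 = 1.5 cycles per instruction.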

Performance Factors Let I_c be the number of instructions in a given program, or the instruction count. The CPU time (T, in seconds/program) needed to execute the program is estimated by finding the product of three contributing factors:

    T = I_c × CPI × τ    (1.1)

The execution of an instruction requires going through a cycle of events involving the instruction fetch, decode, operand(s) fetch, execution, and store results. In this cycle, only the instruction decode and execution phases are carried out in the CPU. The remaining three operations may require access to the memory. We define a memory cycle as the time needed to complete one memory reference. Usually, a memory cycle is k times the processor cycle τ. The value of k depends on the speed of the cache and memory technology and the processor-memory interconnection scheme used.
The CPI of an instruction type can be divided into two component terms corresponding to the total processor cycles and memory cycles needed to complete the execution of the instruction. Depending on the instruction type, the complete instruction cycle may involve one to as many as four memory references (one for instruction fetch, two for operand fetch, and one for store results). Therefore we can rewrite Eq. 1.1 as follows:

    T = I_c × (p + m × k) × τ    (1.2)

where p is the number of processor cycles needed for the instruction decode and execution, m is the number of memory references needed, k is the ratio between memory cycle and processor cycle, I_c is the instruction count, and τ is the processor cycle time. Equation 1.2 can be further refined once the CPI components (p, m, k) are weighted over the entire instruction set.
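For illustration only (all values below are hypothetical), plugging representative numbers into Eq. 1.2: with p = 4 processor cycles, m = 2 memory references per instruction, k = 10 (a memory cycle equal to ten processor cycles), I_c = 10^8 instructions, and τ = 1 ns,

    T = 10^8 × (4 + 2 × 10) × 10^-9 s = 2.4 seconds,

corresponding to an effective CPI of p + m × k = 24 cycles per instruction.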
System Attributes The above five performance factors (I_c, p, m, k, τ) are influenced by four system attributes: instruction-set architecture, compiler technology, CPU implementation and control, and cache and memory hierarchy, as specified in Table 1.2.
The instruction-set architecture affects the program length (I_c) and processor cycles needed (p). The compiler technology affects the values of I_c, p, and the memory reference count (m). The CPU implementation and control determine the total processor time (p × τ) needed. Finally, the memory technology and hierarchy design affect the memory access latency (k × τ). The above CPU time can be used as a basis in estimating the execution rate of a processor.

Table 1.2 Performance Factors versus System Attributes

Performance factors: instruction count I_c (instructions per program); average cycles per instruction CPI, made up of processor cycles per instruction p and memory references per instruction m with memory access latency k; and processor cycle time τ.

System attribute                          Performance factors affected
Instruction-set architecture              I_c, p
Compiler technology                       I_c, p, m
Processor implementation and control      p, τ
Cache and memory hierarchy                k, τ

MIPS Rate Let C be the total number of clock cycles needed to execute a given program. Then the CPU time in Eq. 1.2 can be estimated as T = C × τ = C/f. Furthermore, CPI = C/I_c and T = I_c × CPI × τ = I_c × CPI/f. The processor speed is often measured in terms of million instructions per second (MIPS). We simply call it the MIPS rate of a given processor. It should be emphasized that the MIPS rate varies with respect to a number of factors, including the clock rate (f), the instruction count (I_c), and the CPI of a given machine, as defined below:

    MIPS rate = I_c / (T × 10^6) = f / (CPI × 10^6) = (f × I_c) / (C × 10^6)    (1.3)

Based on Eq. 1.3, the CPU time in Eq. 1.2 can also be written as T = I_c × 10^-6 / MIPS. Based on the system attributes identified in Table 1.2 and the above derived expressions, we conclude by indicating the fact that the MIPS rate of a given computer is directly proportional to the clock rate and inversely proportional to the CPI. All four system attributes, instruction set, compiler, processor, and memory technologies, affect the MIPS rate, which varies also from program to program because of variations in the instruction mix.
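As a hypothetical illustration of Eq. 1.3 (the clock rate and CPI below happen to match machine S1 of Example 1.1; the instruction count is arbitrary): a processor clocked at f = 500 MHz running a program mix with an average CPI of 5 delivers

    MIPS rate = f / (CPI × 10^6) = (500 × 10^6) / (5 × 10^6) = 100 MIPS,

and by T = I_c × 10^-6 / MIPS, a program of I_c = 2 × 10^8 instructions would take T = (2 × 10^8 × 10^-6) / 100 = 2 seconds of CPU time.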
Floating Point Operations per Second Most compute-intensive applications in science and engineering make heavy use of floating point operations. Compared to instructions per second, for such applications a more relevant measure of performance is floating point operations per second, which is abbreviated as flops. With prefix mega (10^6), giga (10^9), tera (10^12) or peta (10^15), this is written as megaflops (mflops), gigaflops (gflops), teraflops or petaflops.

Throughput Rate Another important concept is related to how many programs a system can execute per unit time, called the system throughput W_s (in programs/second). In a multiprogrammed system, the system throughput is often lower than the CPU throughput W_p, defined by:

    W_p = f / (I_c × CPI)    (1.4)

Note that W_p = (MIPS) × 10^6 / I_c, from Eq. 1.3. The unit for W_p is also programs/second. The CPU throughput is a measure of how many programs can be executed per second, based only on the MIPS rate and average program length (I_c). Usually W_s < W_p due to the additional system overheads caused by the I/O, compiler, and OS when multiple programs are interleaved for CPU execution by multiprogramming or time-sharing operations. If the CPU is kept busy in a perfect program-interleaving fashion, then W_s = W_p. This will probably never happen, since the system overhead often causes an extra delay and the CPU may be left idle for some cycles.
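Continuing the same hypothetical numbers used above (f = 500 MHz, CPI = 5, I_c = 2 × 10^8 instructions per program), Eq. 1.4 gives

    W_p = (500 × 10^6) / (2 × 10^8 × 5) = 0.5 programs/second,

i.e. the CPU alone could complete one such program every 2 seconds; the measured system throughput W_s would normally be lower once I/O, compiler, and OS overheads are included.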

33"? Example 1.1 MIPS ratings and performance measurement


Consider the use of two systems S1 and S2 to execute a hypothetical benchmark program. Machine characteristics and claimed performance are given below:

Machine     Clock       Performance     CPU Time
S1          500 MHz     100 MIPS        12x seconds
S2          2.5 GHz     1800 MIPS       x seconds

These data indicate that the measured CPU time on S1 is 12 times longer than that measured on S2. The object codes running on the two machines have different lengths due to the differences in the machines and compilers used. All other overhead times are ignored.
Based on Eq. 1.3, we can see that the instruction count of the object code running on S2 must be 1.5 times longer than that of the code running on S1. Furthermore, the average CPI on S1 is seen to be 5, while that on S2 is 1.39 executing the same benchmark program.
S1 has a typical CISC (complex instruction set computing) architecture, while S2 has a typical RISC (reduced instruction set computing) architecture to be characterized in Chapter 4. This example offers a simple comparison between the two types of computers based on a single program run. When a different program is run, the conclusion may not be the same.
We cannot calculate the CPU throughput W_p unless we know the program length and the average CPI of each code. The system throughput W_s should be measured across a large number of programs over a long observation period. The message being conveyed is that one should not draw a sweeping conclusion about the performance of a machine based on one or a few program runs.
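The quoted figures can be checked directly from Eq. 1.3 and the relation T = I_c / (MIPS × 10^6):

    CPI(S1) = f1 / (MIPS1 × 10^6) = (500 × 10^6) / (100 × 10^6) = 5
    CPI(S2) = f2 / (MIPS2 × 10^6) = (2500 × 10^6) / (1800 × 10^6) ≈ 1.39
    I_c(S1) = T1 × MIPS1 × 10^6 = 12x × 10^8 instructions
    I_c(S2) = T2 × MIPS2 × 10^6 = 18x × 10^8 instructions = 1.5 × I_c(S1).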

Programming Environments The programmability of a computer depends on the programming environment provided to the users. In fact, the marketability of any new computer system depends on the creation of a user-friendly environment in which programming becomes a productive undertaking rather than a challenge. We briefly introduce below the environmental features desired in modern computers.
Conventional uniprocessor computers are programmed in a sequential environment in which instructions are executed one after another in a sequential manner. In fact, the original UNIX/OS kernel was designed to respond to one system call from the user process at a time. Successive system calls must be serialized through the kernel.

When using a parallel computer, one desires a parallel environment where parallelism is automatically exploited. Language extensions or new constructs must be developed to specify parallelism or to facilitate easy detection of parallelism at various granularity levels by more intelligent compilers.
Besides parallel languages and compilers, the operating systems must also be extended to support parallel processing. The OS must be able to manage the resources behind parallelism. Important issues include parallel scheduling of concurrent processes, inter-process communication and synchronization, shared memory allocation, and shared peripheral and communication links.
Implicit Parallelism An implicit approach uses a conventional language, such as C, C++, Fortran, or Pascal, to write the source program. The sequentially coded source program is translated into parallel object code by a parallelizing compiler. As illustrated in Fig. 1.5a, this compiler must be able to detect parallelism and assign target machine resources. This compiler approach has been applied in programming shared-memory multiprocessors.
With parallelism being implicit, success relies heavily on the "intelligence" of a parallelizing compiler. This approach requires less effort on the part of the programmer.

Explicit Parallelism The second approach (Fig. 1.5b) requires more effort by the programmer to develop a source program using parallel dialects of C, C++, Fortran, or Pascal. Parallelism is explicitly specified in the user programs. This reduces the burden on the compiler to detect parallelism. Instead, the compiler needs to preserve parallelism and, where possible, assign target machine resources. The newer programming language Chapel (see Chapter 13) is in this category.

Fig. 1.5 Two approaches to parallel programming: (a) implicit parallelism, in which source code written in sequential languages (C, C++, Fortran, or Pascal) is translated by a parallelizing compiler into parallel object code executed by the runtime system; (b) explicit parallelism, in which source code written in concurrent dialects of C, C++, Fortran, or Pascal is translated by a concurrency-preserving compiler into concurrent object code executed by the runtime system. (Courtesy of Charles Seitz; adapted with permission from "Concurrent Architectures", p. 51 and p. 53, VLSI and Parallel Computation, edited by Suaya and Birtwistle, Morgan Kaufmann Publishers, 1990)
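As a minimal sketch of the explicit approach in Fig. 1.5b (an illustration assumed here, not an example from the text), the loop below uses an OpenMP directive, a standard parallel extension of Fortran that postdates this discussion. The programmer, not the compiler, declares that the iterations may run in parallel and that the partial sums are to be combined.

! Hypothetical illustration: explicit loop-level parallelism in Fortran
! via OpenMP (compile with an OpenMP-capable compiler, e.g. gfortran -fopenmp).
program explicit_sum
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n), c(n), total
  integer :: i
  b = 1.0
  c = 2.0
  total = 0.0
!$omp parallel do reduction(+:total)
  do i = 1, n
     a(i) = b(i) + c(i)      ! independent iterations run on different processors
     total = total + a(i)    ! per-thread partial sums merged by the reduction clause
  end do
!$omp end parallel do
  print *, 'sum = ', total
end program explicit_sum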

Special software tools are needed to make an environment more friendly to user groups. Some of the tools are parallel extensions of conventional high-level languages. Others are integrated environments which include tools providing different levels of program abstraction, validation, testing, debugging, and tuning; performance prediction and monitoring; and visualization support to aid program development, performance measurement, and graphics display and animation of computational results.

1.2 MULTIPROCESSORS AND MULTICOMPUTERS


Two categories of parallel computers are architecturally modeled below. These physical models are distinguished by having a shared common memory or unshared distributed memories. Only architectural organization models are described in Sections 1.2 and 1.3. Theoretical and complexity models for parallel computers are presented in Section 1.4.

1.2.1 Shared-Memory Multiprocessors


We describe below three shared-memory multiprocessor models: the uniform-memory-access (UMA) model, the nonuniform-memory-access (NUMA) model, and the cache-only memory architecture (COMA) model. These models differ in how the memory and peripheral resources are shared or distributed.

The UMA Model In a UMA multiprocessor model (Fig. 1.6), the physical memory is uniformly shared by all the processors. All processors have equal access time to all memory words, which is why it is called uniform memory access. Each processor may use a private cache. Peripherals are also shared in some fashion.
Fig. 1.6 The UMA multiprocessor model: processors with private caches connected through a system interconnect (bus, crossbar, or multistage network) to shared memory modules and shared I/O.

Multiprocessors are called tightly coupled systems due to the high degree of resource sharing. The system interconnect takes the form of a common bus, a crossbar switch, or a multistage network to be studied in Chapter 7.
Some computer manufacturers have multiprocessor (MP) extensions of their uniprocessor (UP) product line. The UMA model is suitable for general-purpose and time-sharing applications by multiple users. It can be used to speed up the execution of a single large program in time-critical applications. To coordinate parallel events, synchronization and communication among processors are done through using shared variables in the common memory.

When all processors have equal access to all peripheral devices, the system is called a symmetric multiprocessor. In this case, all the processors are equally capable of running the executive programs, such as the OS kernel and I/O service routines.
In an asymmetric multiprocessor, only one or a subset of processors are executive-capable. An executive or a master processor can execute the operating system and handle I/O. The remaining processors have no I/O capability and thus are called attached processors (APs). Attached processors execute user codes under the supervision of the master processor. In both MP and AP configurations, memory sharing among master and attached processors is still in place.

Example 1.2 Approximated performance of a multiprocessor

This example exposes the reader to parallel program execution on a shared-memory multiprocessor system. Consider the following Fortran program written for sequential execution on a uniprocessor system. All the arrays, A(I), B(I), and C(I), are assumed to have N elements.

L1:       Do 10 I = 1, N
L2:          A(I) = B(I) + C(I)
L3:  10   Continue
L4:       SUM = 0
L5:       Do 20 J = 1, N
L6:          SUM = SUM + A(J)
L7:  20   Continue

Suppose each line of code L2, L4, and L6 takes 1 machine cycle to execute. The time required to execute the program control statements L1, L3, L5, and L7 is ignored to simplify the analysis. Assume that k cycles are needed for each interprocessor communication operation via the shared memory.
Initially, all arrays are assumed already loaded in the main memory and the short program fragment already loaded in the instruction cache. In other words, instruction fetch and data loading overhead is ignored. Also, we ignore bus contention or memory access conflict problems. In this way, we can concentrate on the analysis of CPU demand.
The above program can be executed on a sequential machine in 2N cycles under the above assumptions. N cycles are needed to execute the N independent iterations in the I loop. Similarly, N cycles are needed for the J loop, which contains N recursive iterations.
To execute the program on an M-processor system, we partition the looping operations into M sections with L = N/M elements per section. In the following parallel code, Doall declares that all M sections be executed by M processors in parallel.
For M-way parallel execution, the sectioned I loop can be done in L cycles.
The sectioned J loop produces M partial sums in L cycles. Thus 2L cycles are consumed to produce all M partial sums. Still, we need to merge these M partial sums to produce the final sum of N elements.

      Doall K = 1, M
        Do 10 I = (K-1)*L + 1, K*L
           A(I) = B(I) + C(I)
10      Continue
        SUM(K) = 0
        Do 20 J = 1, L
           SUM(K) = SUM(K) + A((K-1)*L + J)
20      Continue
      Endall
The addition of each pair of partial sums requires k cycles through the shared memory. An l-level binary adder tree can be constructed to merge all the partial sums, where l = log2 M. The adder tree takes l(k + 1) cycles to merge the M partial sums sequentially from the leaves to the root of the tree. Therefore, the multiprocessor requires 2L + l(k + 1) = 2N/M + (k + 1)log2 M cycles to produce the final sum.
Suppose N = 2^20 elements in the array. Sequential execution of the original program takes 2N = 2^21 machine cycles. Assume that each IPC synchronization overhead has an average value of k = 200 cycles. Parallel execution on M = 256 processors requires 8192 + 1608 = 9800 machine cycles.
Comparing the above timing results, the multiprocessor shows a speedup factor of 214 out of the maximum value of 256. Therefore, an efficiency of 214/256 = 83.6% has been achieved. We will study the speedup and efficiency issues in Chapter 3.
The above result was obtained under favorable assumptions about overhead. In reality, the resulting speedup might be lower after considering all software overhead and potential resource conflicts. Nevertheless, the example shows the promising side of parallel processing if the interprocessor communication overhead can be maintained at a sufficiently low level, represented here by the value of k.
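Written out, the quoted speedup follows directly from the two timing expressions above (same assumptions as in the example):

    speedup S = 2N / (2N/M + (k + 1) log2 M)
              = 2^21 / (2^21/2^8 + 201 × 8)
              = 2,097,152 / (8192 + 1608) = 2,097,152 / 9800 ≈ 214,

giving an efficiency of S/M = 214/256 ≈ 83.6%.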

The NUMA Model A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word. Two NUMA machine models are depicted in Fig. 1.7. The shared memory is physically distributed to all processors, called local memories. The collection of all local memories forms a global address space accessible by all processors.
It is faster to access a local memory with a local processor. The access of remote memory attached to other processors takes longer due to the added delay through the interconnection network. The BBN TC-2000 Butterfly multiprocessor had the configuration shown in Fig. 1.7a.
Besides distributed memories, globally shared memory can be added to a multiprocessor system. In this case, there are three memory-access patterns: The fastest is local memory access. The next is global memory access. The slowest is access of remote memory as illustrated in Fig. 1.7b. As a matter of fact, the models shown in Figs. 1.6 and 1.7 can be easily modified to allow a mixture of shared memory and private memory with prespecified access rights.
A hierarchically structured multiprocessor is modeled in Fig. 1.7b. The processors are divided into several clusters*. Each cluster is itself a UMA or a NUMA multiprocessor. The clusters are connected to global shared-memory modules. The entire system is considered a NUMA multiprocessor. All processors belonging to the same cluster are allowed to uniformly access the cluster shared-memory modules.

*The word 'cluster' is used in a different sense in cluster computing, as we shall see later.

All clusters have equal access to the global memory. However, the access time to the cluster memory is shorter than that to the global memory. One can specify the access rights among intercluster memories in various ways. The Cedar multiprocessor, built at the University of Illinois, had such a structure in which each cluster was an Alliant FX/80 multiprocessor.

Fig. 1.7 Two NUMA models for multiprocessor systems: (a) shared local memories (e.g. the BBN Butterfly); (b) a hierarchical cluster model (e.g. the Cedar system at the University of Illinois). Legends: P: processor; CSM: cluster shared memory; GSM: global shared memory; CIN: cluster interconnection network.

The COMA Model A multiprocessor using cache-only memory assumes the COMA model. Early examples of COMA machines include the Swedish Institute of Computer Science's Data Diffusion Machine (DDM, Hagersten et al., 1990) and Kendall Square Research's KSR-1 machine (Burkhardt et al., 1992). The COMA model is depicted in Fig. 1.8. Details of KSR-1 are given in Chapter 9.

Fig. 1.8 The COMA model of a multiprocessor (P: processor, C: cache, D: directory; e.g. the KSR-1): processor nodes, each with a cache and a cache directory, connected by an interconnection network.

The COMA model is a special case of a NUMA machine, in which the distributed main memories are converted to caches. There is no memory hierarchy at each processor node. All the caches form a global
address space. Remote cache access is assisted by the distributed cache directories (D in Fig. 1.8). Depending on the interconnection network used, sometimes hierarchical directories may be used to help locate copies of cache blocks. Initial data placement is not critical because data will eventually migrate to where it will be used.
Besides the UMA, NUMA, and COMA models specified above, other variations exist for multiprocessors. For example, a cache-coherent non-uniform memory access (CC-NUMA) model can be specified with distributed shared memory and cache directories. Early examples of the CC-NUMA model include the Stanford Dash (Lenoski et al., 1990) and the MIT Alewife (Agarwal et al., 1990) to be studied in Chapter 9. A cache-coherent COMA machine is one in which all cache copies must be kept consistent.

Representative Multiprocessors Several early commercially available multiprocessors are summarized in Table 1.3. They represent four classes of multiprocessors. The Sequent Symmetry S-81 belonged to a class called minisupercomputers. The IBM System/390 models were high-end mainframes, sometimes called near-supercomputers. The BBN TC-2000 represented the MPP class.

Table 1.3 Some Early Commercial Multiprocessor Systems

Sequent Symmetry S-81. Hardware and architecture: bus-connected with 30 i386 processors, IPC via SLIC bus; Weitek floating-point accelerator. Software and applications: DYNIX/OS, KAP/Sequent preprocessor, transaction multiprocessing. Remarks: latter models designed with faster processors of the family.

IBM ES/9000 Model 900/VF. Hardware and architecture: 6 ES/9000 processors with vector facilities, crossbar connected to I/O channels and shared memory. Software and applications: OS support: MVS, VM KMS, AIX/370, parallel Fortran, VSF V2.5 compiler. Remarks: fiber optic channels, integrated cryptographic architecture.

BBN TC-2000. Hardware and architecture: 512 M88100 processors with local memory connected by a Butterfly switch; a NUMA machine. Software and applications: ported Mach/OS with multiclustering, parallel Fortran, time-critical applications. Remarks: latter models designed with faster processors of the family.

The S-81 was a transaction processing multiprocessor consisting of 30 386/i486 microprocessors tied to a common backplane bus. The IBM ES/9000 models were the latest IBM mainframes having up to 6 processors with attached vector facilities. The TC-2000 could be configured to have 512 M88100 processors interconnected by a multistage Butterfly network. This was designed as a NUMA machine for real-time or time-critical applications.

Multiprocessor systems are suitable for general-purpose multiuser applications where programmability is the major concern. A major shortcoming of multiprocessors is the lack of scalability. It is rather difficult to build MPP machines using a centralized shared memory model. Latency tolerance for remote memory access is also a major limitation.
Packaging and cooling impose additional constraints on scalability. We will study scalability and programmability in subsequent chapters.

1.2.2 Distributed-Memory Multicomputers


A distributed-memory multicomputer system is modeled in Fig. 1.9. The system consists of multiple computers, often called nodes, interconnected by a message-passing network. Each node is an autonomous computer consisting of a processor, local memory, and sometimes attached disks or I/O peripherals.

Fig. 1.9 Generic model of a message-passing multicomputer: nodes, each with a processor P and local memory M, are interconnected by a message-passing network (mesh, ring, torus, hypercube, cube-connected cycles, etc.)

The message-passing network provides point-to-point static connections among the nodes. All local
memories are private and are accessible only by local processors. For this reason, traditional multicomputers
have also been called no-remote-memory-access (NORMA) machines. Internode communication is carried
out by passing messages through the static connection network. With advances in interconnection and
network technologies, this model of computing has gained importance, because of its suitability for certain
applications, scalability, and fault-tolerance.
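The NORMA property can be illustrated with a toy model; the node and message formats below are invented for illustration only. A node can obtain a remote value only by exchanging explicit messages, never by addressing another node's memory.

    # Toy model of a NORMA multicomputer: each node owns a private memory and
    # communicates only through explicit send/receive over the network.
    from collections import deque

    class Node:
        def __init__(self, node_id):
            self.node_id = node_id
            self.memory = {}                 # private local memory
            self.mailbox = deque()           # incoming messages

        def send(self, dest, payload):
            dest.mailbox.append((self.node_id, payload))

        def receive(self):
            return self.mailbox.popleft() if self.mailbox else None

    # Node 0 cannot address node 1's memory directly; it must request the value.
    n0, n1 = Node(0), Node(1)
    n1.memory["x"] = 42
    n0.send(n1, ("read-request", "x"))
    src, (tag, var) = n1.receive()
    n1.send(n0, ("read-reply", n1.memory[var]))
    print(n0.receive())                      # (1, ('read-reply', 42))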
Multicomputer Generations Modern multicomputers use hardware routers to pass messages. A computer
node is attached to each router. The boundary router may be connected to I/O and peripheral devices. Message
passing between any two nodes involves a sequence of routers and channels. Mixed types of nodes are
allowed in a heterogeneous multicomputer. The internode communication in a heterogeneous multicomputer
is achieved through compatible data representations and message-passing protocols.
Early message-passing multicomputers were based on processor board technology using hypercube
architecture and software-controlled message switching. The Caltech Cosmic and Intel iPSC/1 represented
this early development.
The second generation was implemented with mesh-connected architecture, hardware message routing,
and a software environment for medium-grain distributed computing, as represented by the Intel Paragon and
the Parsys SuperNode 1000.

Subsequent systems of this type are fine-grain multicomputers, early examples being the MIT J-Machine
and Caltech Mosaic, implemented with both processor and communication gears on the same VLSI chip. For
further discussion, see Chapter 13.
In Section 2.4, we will study various static network topologies used to construct multicomputers.
Commonly used topologies include the ring, tree, mesh, torus, hypercube, cube-connected cycle, etc. Various
communication patterns are demanded among the nodes, such as one-to-one, broadcasting, permutations, and
multicast patterns.
Important issues for multicomputers include message-routing schemes, network flow control strategies,
deadlock avoidance, virtual channels, message-passing primitives, and program decomposition techniques.
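As a small illustration of one such message-routing scheme, the sketch below implements the classic dimension-ordered (e-cube) routing rule for a binary hypercube: the route is obtained by correcting, one dimension at a time, the address bits in which source and destination differ. It is offered only as an example of the routing issues listed above, not as the scheme used by any particular machine in this chapter.

    # Dimension-ordered (e-cube) routing on an n-dimensional binary hypercube.
    def ecube_route(src, dst, dims):
        """Return the sequence of node addresses visited from src to dst."""
        path, node = [src], src
        for d in range(dims):                # correct differing bits, lowest dimension first
            if (node ^ dst) & (1 << d):
                node ^= (1 << d)             # traverse the link along dimension d
                path.append(node)
        return path

    # Route from node 0101 to node 0011 in a 4-dimensional hypercube.
    print([format(n, "04b") for n in ecube_route(0b0101, 0b0011, 4)])
    # ['0101', '0111', '0011'] -- two hops, one per differing dimension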

Representative Multicomputers Three early message-passing multicomputers are summarized in
Table 1.4. With distributed processor/memory nodes, such machines are better in achieving a scalable
performance. However, message passing imposes a requirement on programmers to distribute the
computations and data sets over the nodes or to establish efficient communication among nodes.

Table 1.4 Some Early Commercial Multicomputer Systems

Intel Paragon XP/S:
  Node Types and Memory: 50 MHz i860 XP computing nodes with 16-128 Mbytes per node, special I/O service nodes.
  Network and I/O: 2-D mesh with SCSI, HIPPI, VME, Ethernet, and custom I/O.
  OS and Software Task Parallelism Support: OSF conformance with 4.3 BSD, visualization and programming support.
  Application Drivers: General sparse matrix methods, parallel data manipulation, strategic computing.
  Performance Remarks: 5-300 Gflops peak, 64-bit results, 2.8-160 GIPS peak integer performance.

nCUBE/2 6480:
  Node Types and Memory: Each node contains a CISC 64-bit CPU, with FPU, 14 DMA ports, with 1-64 Mbytes/node.
  Network and I/O: 13-dimensional hypercube of 8192 nodes, 512-Gbyte memory, 64 I/O boards.
  OS and Software Task Parallelism Support: Vertex/OS or UNIX supporting message passing using wormhole routing.
  Application Drivers: Scientific number crunching with scalar nodes, database processing.
  Performance Remarks: 27 Gflops peak, 36 Gbytes/s I/O.

Parsys Ltd. SuperNode1000:
  Node Types and Memory: EC-funded Esprit supernode built with multiple T-800 Transputers per node.
  Network and I/O: Reconfigurable interconnect, expandable to have 1024 processors.
  OS and Software Task Parallelism Support: IDRIS/OS, UNIX-compatible.
  Application Drivers: Scientific and academic applications.
  Performance Remarks: 200 MIPS to 13 GIPS peak.

The Paragon system had a mesh architecture, and the nCUBE/2 had a hypercube architecture. The Intel
i860s and some custom-designed VLSI processors were used as building blocks in these machines. All three
OSs were UNIX-compatible with extended functions to support message passing.
Most multicomputers can be upgraded to yield a higher degree of parallelism with enhanced processors.
We will study various massively parallel systems in Part III where the tradeoffs between scalability and
programmability are analyzed.

1.2.3 A Taxonomy of MIMD Computers


Parallel computers appear as either SIMD or MIMD configurations. The SIMDs appeal more to special-
purpose applications. It is clear that SIMDs are not size-scalable, but unclear whether large SIMDs are
generation-scalable. The fact that CM-5 had an MIMD architecture, away from the SIMD architecture in
CM-2, represents the architectural trend (see Chapter 8). Furthermore, the boundary between multiprocessors
and multicomputers has become blurred in recent years.
The architectural trend for general-purpose parallel computers is in favor of MIMD configurations with
various memory configurations (see Chapter 13). Gordon Bell (1992) has provided a taxonomy of MIMD
machines, reprinted in Fig. 1.10. He considers shared-memory multiprocessors as having a single address
space. Scalable multiprocessors or multicomputers must use distributed memory. Multiprocessors using
centrally shared memory have limited scalability.
Fig. 1.10 Bell's taxonomy of MIMD computers (Courtesy of Gordon Bell; reprinted with permission from the Communications of the ACM, August 1992). The tree splits MIMD machines into single-address-space shared-memory multiprocessors, either scalable distributed-memory designs (dynamic or static address binding, as in KSR, DASH, and the cross-point or multistage machines of Cray, Fujitsu, Hitachi, IBM, NEC, and Tera) or non-scalable central-memory bus multis (DEC, Encore, NCR, Sequent, SGI, Sun), and multiple-address-space message-passing multicomputers, either scalable distributed multicomputers (mesh-connected Intel machines, the Butterfly/fat-tree CM-5, and nCUBE hypercubes) or central multicomputers built from LAN-connected workstations, PCs, and high-availability clusters (DEC, Tandem).

Multicomputers use distributed memories with multiple address spaces. They are scalable with distributed
memory. The evolution of fast LAN (local area network)-connected workstations has created "commodity
supercomputing". Bell was the first to advocate high-speed workstation clusters interconnected by high-
speed switches in lieu of special-purpose multicomputers. The CM-5 development was an early move in that
direction.
The scalability of MIMD computers will be further studied in Section 3.4 and Chapter 9. In Part III, we
will study distributed-memory multiprocessors (KSR-1, SCI, etc.); central-memory multiprocessors (Cray,
IBM, DEC, Fujitsu, Encore, etc.); multicomputers by Intel, TMC, and nCUBE; fast LAN-based workstation
clusters; and other exploratory research systems.

MULTIVECTOR AND SIMD COMPUTERS

In this section, we introduce supercomputers and parallel processors for vector processing
and data parallelism. We classify supercomputers either as pipelined vector machines using a
few powerful processors equipped with vector hardware, or as SIMD computers emphasizing massive data
parallelism.

1.3.1 Vector Supercomputers


A vector computer is often built on top of a scalar processor. As shown in Fig. 1.11, the vector processor
is attached to the scalar processor as an optional feature. Program and data are first loaded into the main
memory through a host computer. All instructions are first decoded by the scalar control unit. If the decoded
instruction is a scalar operation or a program control operation, it will be directly executed by the scalar
processor using the scalar functional pipelines.

Fig. 1.11 The architecture of a vector supercomputer: scalar instructions are executed by the scalar processor (scalar control unit and scalar functional pipelines), while decoded vector instructions are passed to the vector processor (vector control unit, vector registers, and vector functional pipelines); both share the main memory (program and data), which communicates with a host computer, mass storage, and user I/O.



If the instruction is decoded as a vector operation, it will be sent to the vector control unit. This control
unit will supervise the flow of vector data between the main memory and vector functional pipelines. The
vector data flow is coordinated by the control unit. A number of vector functional pipelines may be built into
a vector processor. Two pipelined vector supercomputer models are described below.

Vector Processor Models Figure 1.11 shows a register-to-register architecture. Vector registers are used
to hold the vector operands, intermediate and final vector results. The vector functional pipelines retrieve
operands from and put results into the vector registers. All vector registers are programmable in user
instructions. Each vector register is equipped with a component counter which keeps track of the component
registers used in successive pipeline cycles.
The length of each vector register is usually fixed, say, sixty-four 64-bit component registers in a vector
register in a Cray Series supercomputer. Other machines, like the Fujitsu VP2000 Series, use reconfigurable
vector registers to dynamically match the register length with that of the vector operands.
In general, there are fixed numbers of vector registers and functional pipelines in a vector processor.
Therefore, both resources must be reserved in advance to avoid resource conflicts between different vector
operations. Some early vector-register based supercomputers are summarized in Table 1.5.
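Before turning to Table 1.5, the effect of a fixed vector register length can be illustrated with a short strip-mining sketch: a long vector operation is broken into chunks no longer than the register length (64 is assumed here, in the style of the Cray Series), with each chunk corresponding to one load-operate-store pass through a vector register. The code is a conceptual sketch only, not any machine's instruction set.

    # Strip-mining a long vector add for a machine with 64-element vector registers.
    VLEN = 64                                # assumed register length (Cray-style)

    def vector_add(a, b):
        c = [0.0] * len(a)
        for start in range(0, len(a), VLEN): # one pass per vector-register load
            end = min(start + VLEN, len(a))
            va = a[start:end]                # "load" operands into vector registers
            vb = b[start:end]
            c[start:end] = [x + y for x, y in zip(va, vb)]   # pipelined vector add
        return c

    a = [float(i) for i in range(150)]
    b = [2.0] * 150
    assert vector_add(a, b)[149] == 151.0    # 149 + 2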

Table 1.5 Some Early Commercial Vector Supercomputers

Convex C3800 family:
  Vector Hardware Architecture and Capabilities: GaAs-based multiprocessor with 8 processors and 500-Mbyte/s access port; 4 Gbytes main memory; 1 Gflop peak performance with concurrent scalar/vector operations.
  Compiler and Software Support: Advanced C, Fortran, and Ada vectorizing and parallelizing compilers; also supported interprocedural optimization; POSIX 1003.1/OS plus I/O interfaces and visualization system.

Digital VAX 9000 System:
  Vector Hardware Architecture and Capabilities: Integrated vector processing in the VAX environment; 125-500 Mflops peak performance; 63 vector instructions; 16 x 64 x 64 vector registers; pipeline chaining possible.
  Compiler and Software Support: VMS or ULTRIX/OS, VAX Fortran and VAX Vector Instruction Emulator (VVIEF) for vectorized program debugging.

Cray Research Y-MP and C-90:
  Vector Hardware Architecture and Capabilities: Y-MP ran with 2, 4, or 8 processors, 2.67 Gflops peak with Y-MP8256; C-90 had 2 vector pipes/CPU built with 10K gate ECL, with 16 Gflops peak performance.
  Compiler and Software Support: CF77 compiler for automatic vectorization, scalar optimization, and parallel processing; UNICOS improved from UNIX/V and Berkeley BSD/OS.

A memory-to-memory architecture differs from a register-to-register architecture in the use of a vector
stream unit to replace the vector registers. Vector operands and results are directly retrieved from and stored
into the main memory in superwords, say, 512 bits as in the Cyber 205.
Pipelined vector supercomputers started with uniprocessor models such as the Cray 1 in 1976. Subsequent
supercomputer systems offered both uniprocessor and multiprocessor models such as the Cray Y-MP Series.

Representative Supercomputers Over a dozen pipelined vector computers have been manufactured,
ranging from workstations to mini- and supercomputers. Notable early examples include the Stardent 3000
multiprocessor equipped with vector pipelines, the Convex C3 Series, the DEC VAX 9000, the IBM 390/
VF, the Cray Research Y-MP family, the NEC SX Series, the Fujitsu VP2000, and the Hitachi S-810/20. For
further discussion, see Chapters 8 and 13.
The Convex C1 and C2 Series were made with ECL/CMOS technologies. The latter C3 Series was based
on GaAs technology.
The DEC VAX 9000 was Digital's largest mainframe system providing concurrent scalar/vector
and multiprocessing capabilities. The VAX 9000 processors used a hybrid architecture. The vector unit
was an optional feature attached to the VAX 9000 CPU. The Cray Y-MP family offered both vector and
multiprocessing capabilities.

1.3.2 SIMD Supercomputers


In Fig. 1.3b, we have shown an abstract model of SIMD computers having a single instruction stream over
multiple data streams. An operational model of SIMD computers is presented below (Fig. 1.12) based on the
work of H. J. Siegel (1979). Implementation models and case studies of SIMD machines are given in Chapter 8.

Fig. 1.12 Operational model of SIMD computers: a control unit drives N processing elements PE0, PE1, PE2, ..., PE N-1, each consisting of a processor (Proc. i) and a local memory (Mem. i), all connected through an interconnection network.

SIMD Machine Model An operational model of an SIMD computer is specified by a 5-tuple:

M = (N, C, I, M, R)     (1.5)

where

(1) N is the number of processing elements (PEs) in the machine. For example, the Illiac IV had 64 PEs
and the Connection Machine CM-2 had 65,536 PEs.
(2) C is the set of instructions directly executed by the control unit (CU), including scalar and program
flow control instructions.
(3) I is the set of instructions broadcast by the CU to all PEs for parallel execution. These include
arithmetic, logic, data routing, masking, and other local operations executed by each active PE over
data within that PE.
(4) M is the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled
subsets.
(5) R is the set of data-routing functions, specifying various patterns to be set up in the interconnection
network for inter-PE communications.

One can describe a particular SIMD machine architecture by specifying the 5-tuple. An example SIMD
machine is partially specified below.

Example 1.3 Operational specification of the MasPar MP-1 computer
We will study the detailed architecture of the MasPar MP-1 in Chapter 7. Listed below is a partial specification
of the 5-tuple for this machine:

(1) The MP-1 was an SIMD machine with N = 1024 to 16,384 PEs, depending on which configuration is
considered.
(2) The CU executed scalar instructions, broadcast decoded vector instructions to the PE array, and
controlled inter-PE communications.
(3) Each PE was a register-based load/store RISC processor capable of executing integer operations over
various data sizes and standard floating-point operations. The PEs received instructions from the CU.
(4) The masking scheme was built within each PE and continuously monitored by the CU which could set
and reset the status of each PE dynamically at run time.
(5) The MP-1 had an X-Net mesh network plus a global multistage crossbar router for inter-CU-PE,
X-Net nearest 8-neighbor, and global router communications.
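The lockstep broadcast and masking behavior described in items (2) and (4) can be sketched as follows. The data structures are invented for illustration and do not correspond to MasPar's actual instruction set: the CU broadcasts one operation, and only PEs whose mask bit is set apply it to their local data.

    # Sketch of SIMD lockstep execution: the CU broadcasts one instruction,
    # and only PEs whose mask bit is set execute it on their local data.
    class PE:
        def __init__(self, value):
            self.reg = value                 # local register/memory of this PE
            self.enabled = True              # mask bit set by the CU

    def broadcast(pes, op):
        for pe in pes:                       # conceptually simultaneous
            if pe.enabled:
                pe.reg = op(pe.reg)

    pes = [PE(v) for v in range(8)]
    for pe in pes:                           # CU masks off PEs holding odd values
        pe.enabled = (pe.reg % 2 == 0)
    broadcast(pes, lambda x: x * 10)         # executed only by the enabled subset
    print([pe.reg for pe in pes])            # [0, 1, 20, 3, 40, 5, 60, 7]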

Representative SIMD Computers Three early commercial SIMD supercomputers are summarized in
Table 1.6. The number of PEs in these systems ranges from 4096 in the DAP610 to 16,384 in the MasPar
MP-1 and 65,536 in the CM-2. Both the CM-2 and DAP610 were fine-grain, bit-slice SIMD computers with
attached floating-point accelerators for blocks of PEs*.
(* With rapid advances in VLSI technology, use of bit-slice processors in systems has disappeared.)
Each PE of the MP-1 was equipped with a 1-bit logic unit, 4-bit integer ALU, 64-bit mantissa unit, and
16-bit exponent unit. Multiple PEs could be built on a single chip due to the simplicity of each PE. The MP-1

implemented 32 PEs per chip with forty 32-bit registers per PE. The 32 PEs were interconnected by an X-Net
mesh, which was a 4-neighbor mesh augmented with diagonal dual-stage links.
The CM-2 implemented 16 PEs as a mesh on a single chip. Each 16-PE mesh chip was placed at one
vertex of a 12-dimensional hypercube. Thus 16 × 2¹² = 2¹⁶ = 65,536 PEs formed the entire SIMD array.
The DAP610 implemented 64 PEs as a mesh on a chip. Globally, a large mesh (64 × 64) was formed
by interconnecting these small meshes on chips. Fortran 90 and modified versions of C, Lisp, and other
sequential programming languages have been developed to program SIMD machines.

Table 1.6 Some Early Commercial SIMD Supercomputers

MasPar Computer Corporation MP-1 Family:
  SIMD Machine Architecture and Capabilities: Designed for configurations from 1024 to 16,384 processors with 26,000 MIPS or 1.3 Gflops. Each PE was a RISC processor with 16 Kbytes local memory. An X-Net mesh plus a multistage crossbar interconnect.
  Languages, Compilers and Software Support: Fortran 77, MasPar Fortran (MPF), and MasPar Parallel Application Language; UNIX/OS with X-window, symbolic debugger, visualizers and animators.

Thinking Machines Corporation, CM-2:
  SIMD Machine Architecture and Capabilities: A bit-slice array of up to 65,536 PEs arranged as a 10-dimensional hypercube with a 4 × 4 mesh on each vertex, up to 1M bits of memory per PE, with optional FPU shared between blocks of 32 PEs. 28 Gflops peak and 5.6 Gflops sustained.
  Languages, Compilers and Software Support: Driven by a host of VAX, Sun, or Symbolics 3600; Lisp compiler, Fortran 90, C, and *Lisp supported by PARIS.

Active Memory Technology DAP600 Family:
  SIMD Machine Architecture and Capabilities: A fine-grain, bit-slice SIMD array of up to 4096 PEs interconnected by a square mesh with 1 K bits per PE, orthogonal and 4-neighbor links, 20 GIPS and 560 Mflops peak performance.
  Languages, Compilers and Software Support: Provided by host VAX/VMS or UNIX; Fortran-plus or APAL on DAP, Fortran 77 or C on host.

PRAM AND VLSI MODELS

Theoretical models of parallel computers are abstracted from the physical models studied in
previous sections. These models are often used by algorithm designers and VLSI device/chip
developers. The ideal models provide a convenient framework for developing parallel algorithms without
worrying about the implementation details or physical constraints.
The models can be applied to obtain theoretical performance bounds on parallel computers or to estimate
VLSI complexity on chip area and execution time before the chip is fabricated. The abstract models are

also useful in scalability and programmability analysis, when real machines are compared with an idealized
parallel machine without worrying about communication overhead among processing nodes.

1.4.1 Parallel Random-Access Machines


Theoretical models of parallel computers are presented below. We define first the time and space complexities.
Computational tractability is reviewed for solving difficult problems on computers. Then we introduce the
random-access machine (RAM), parallel random-access machine (PRAM), and variants of PRAMs. These
complexity models facilitate the study of asymptotic behavior of algorithms implementable on parallel
computers.
Time and Space Complexities The complexity of an algorithm for solving a problem of size s on a
computer is determined by the execution time and the storage space required. The time complexity is a
function of the problem size. The time complexity function in order notation is the asymptotic time complexity
of the algorithm. Usually, the worst-case time complexity is considered. For example, a time complexity g(s)
is said to be O(f(s)), read "order f(s)", if there exist positive constants c₁, c₂, and s₀ such that
c₁ f(s) ≤ g(s) ≤ c₂ f(s) for all nonnegative values of s > s₀.
The space complexity can be similarly defined as a function of the problem size s. The asymptotic space
complexity refers to the data storage of large problems. Note that the program (code) storage requirement and
the storage for input data are not considered in this.
The time complexity of a serial algorithm is simply called serial complexity. The time complexity of a
parallel algorithm is called parallel complexity. Intuitively, the parallel complexity should be lower than
the serial complexity, at least asymptotically. We consider only deterministic algorithms, in which every
operational step is uniquely defined in agreement with the way programs are executed on real computers.
A nondeterministic algorithm contains operations resulting in one outcome from a set of possible outcomes.
There exist no real computers that can execute nondeterministic algorithms. Therefore, all algorithms (or
machines) considered in this book are deterministic, unless otherwise noted.

NP-Completeness An algorithm has a polynomial complexity if there exists a polynomial p(s) such that the
time complexity is O(p(s)) for problem size s. The set of problems having polynomial-complexity algorithms
is called P-class (for polynomial class). The set of problems solvable by nondeterministic algorithms in
polynomial time is called NP-class (for nondeterministic polynomial class).
Since deterministic algorithms are special cases of the nondeterministic ones, we know that P ⊂ NP. The
P-class problems are computationally tractable, while the NP − P-class problems are intractable. But we do
not know whether P = NP or P ≠ NP. This is still an open problem in computer science.
To simulate a nondeterministic algorithm with a deterministic algorithm may require exponential time.
Therefore, intractable NP-class problems are also said to have exponential-time complexity.

Example 1.4 Polynomial- and exponential-complexity algorithms

Polynomial-complexity algorithms are known for sorting n numbers in O(n log n) time and for multiplication
of two n × n matrices in O(n³) time. Therefore, both problems belong to the P-class.

Nonpolynomial algorithms have been developed for the traveling salesperson problem with complexity
O(n²2ⁿ) and for the knapsack problem with complexity O(2^(n/2)). These complexities are exponential, greater
than the polynomial complexities. So far, deterministic polynomial algorithms have not been found for these
problems. Therefore, these exponential-complexity problems belong to the NP-class.
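The practical gap between the two classes is easy to see numerically. The short sketch below simply tabulates the polynomial bound n³ against the exponential bound n²·2ⁿ quoted above; the values are computed, not taken from the text.

    # Growth of a polynomial bound versus an exponential bound.
    for n in (10, 20, 30, 40):
        poly = n ** 3                        # e.g. matrix multiplication, O(n^3)
        expo = n ** 2 * 2 ** n               # e.g. traveling salesperson, O(n^2 * 2^n)
        print(f"n={n:2d}  n^3={poly:>8,}  n^2*2^n={expo:>20,}")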

Most computer scientists believe that P ≠ NP. This leads to the conjecture that there exists a subclass,
called NP-complete (NPC) problems, such that NPC ⊂ NP but NPC ∩ P = ∅ (Fig. 1.13). In fact, it has been
proved that if any NP-complete problem is polynomial-time solvable, then one can conclude P = NP. Thus
NP-complete problems are considered the hardest ones to solve. Only approximation algorithms can be
derived for solving the NP-complete problems in polynomial time.

Fig. 1.13 The relationships conjectured among the NP, P, and NPC classes of computational problems (NP: nondeterministic polynomial-time class; P: polynomial-time class; NPC: NP-complete class)

PRAM Models Conventional uniprocessor computers have been modeled as random-access machines
(RAM) by Sheperdson and Sturgis (1963). A parallel random-access machine (PRAM) model has been
developed by Fortune and Wyllie (1978) for modeling idealized parallel computers with zero synchronization
or memory access overhead. This PRAM model will be used for parallel algorithm development and for
scalability and complexity analysis.
An n-processor PRAM (Fig. 1.14) has a globally addressable memory. The shared memory can be
distributed among the processors or centralized in one place. The n processors, also called processing
elements (PEs), operate on a synchronized read-memory, compute, and write-memory cycle. With shared
memory, the model must specify how concurrent read and concurrent write of memory are handled. Four
memory-update options are possible:

Fig. 1.14 PRAM model of a multiprocessor system with shared memory, on which all n tightly synchronized processors operate in lockstep in memory access and program execution operations; each processor can access any memory location in unit time.

• Exclusive read (ER)—This allows at most one processor to read from any memory location in each
cycle, a rather restrictive policy.
• Exclusive write (EW)—This allows at most one processor to write into a memory location at a time.
• Concurrent read (CR)—This allows multiple processors to read the same information from the same
memory cell in the same cycle.
• Concurrent write (CW)—This allows simultaneous writes to the same memory location. In order to
avoid confusion, some policy must be set up to resolve the write conflicts.

Various combinations of the above options lead to several variants of the PRAM model as specified below.
Since CR does not create a conflict problem, variants differ mainly in how they handle the CW conflicts.

PRAM Variants Described below are four variants of the PRAM model, depending on how the memory
reads and writes are handled.

(1) EREW-PRAM model—This model forbids more than one processor from reading or writing the same
memory cell simultaneously (Snir, 1982; Karp and Ramachandran, 1988). This is the most restrictive
PRAM model proposed.
(2) CREW-PRAM model—The write conflicts are avoided by mutual exclusion. Concurrent reads to the
same memory location are allowed.
(3) ERCW-PRAM model—This allows exclusive read or concurrent writes to the same memory location.
(4) CRCW-PRAM model—This model allows either concurrent reads or concurrent writes to the same
memory location.
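Concurrent writes in a CRCW model require a resolution policy. Commonly cited choices include requiring a common value, accepting an arbitrary writer, letting the highest-priority processor win, or combining the written values (for example, by summation). The sketch below illustrates these policies for a single memory cell; it is a conceptual illustration not tied to any particular machine, and the function names are invented.

    # Resolution policies for concurrent writes (CW) to one shared memory cell.
    def resolve(writes, policy):
        """writes: list of (processor_id, value) issued in the same PRAM cycle."""
        if policy == "common":               # all writers must agree on the value
            values = {v for _, v in writes}
            assert len(values) == 1, "common-CRCW requires identical values"
            return values.pop()
        if policy == "arbitrary":            # any one write succeeds
            return writes[0][1]
        if policy == "priority":             # lowest-numbered processor wins
            return min(writes)[1]
        if policy == "combining":            # combine values, e.g. by summation
            return sum(v for _, v in writes)
        raise ValueError(policy)

    writes = [(3, 5), (1, 7), (2, 5)]
    print(resolve(writes, "priority"))       # 7  (processor 1 has priority)
    print(resolve(writes, "combining"))      # 17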

Example 1.5 Multiplication of two n × n matrices in O(log n) time on a PRAM with n³/log n processors (Viktor Prasanna, 1992)
Let A and B be the input matrices. Assume n³ PEs are available initially. We later reduce the number of PEs
to n³/log n. To visualize the algorithm, assume the memory is organized as a three-dimensional array with
inputs A and B stored in two planes. Also, for the sake of explanation, assume a three-dimensional indexing
of the PEs. PE(i, j, k), 0 ≤ k ≤ n − 1, are used for computing the (i, j)th entry of the output matrix, 0 ≤ i, j ≤
n − 1, and n is a power of 2.
In step 1, n product terms corresponding to each output are computed using n PEs in O(1) time. In step 2,
these are added to produce an output in O(log n) time.
The total number of PEs used is n³. The result is available in C(i, j, 0), 0 ≤ i, j ≤ n − 1. Listed below are
programs for each PE(i, j, k) to execute. All n³ PEs operate in parallel for n³ multiplications. But at most n³/2
PEs are busy for (n³ − n²) additions. Also, the PRAM is assumed to be CREW.
Step 1
1. Read A(i, k)
2. Read B(k, j)
3. Compute A(i, k) × B(k, j)
4. Store in C(i, j, k)

Step 2
1. ℓ ← n
2. Repeat
       ℓ ← ℓ/2
       if (k < ℓ) then
           begin
               Read C(i, j, k)
               Read C(i, j, k + ℓ)
               Compute C(i, j, k) + C(i, j, k + ℓ)
               Store in C(i, j, k)
           end
   until (ℓ = 1)
To reduce the number of PEs to n³/log n, use a PE array of size n × n × n/log n. Each PE is responsible for
computing log n product terms and summing them up. Step 1 can be easily modified to produce n/log n partial
sums, each consisting of log n multiplications and (log n − 1) additions. Now we have an array C(i, j, k), 0 ≤
i, j ≤ n − 1, 0 ≤ k ≤ n/log n − 1, which can be summed up in log(n/log n) time. Combining the time spent in
step 1 and step 2, we have a total execution time 2 log n − 1 + log(n/log n) = O(log n) for large n.
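The two steps of the n³-PE version can be simulated sequentially to check the algorithm. The sketch below forms all product terms "in parallel" (step 1) and then sums each C(i, j, ·) line by pairwise halving (step 2), which takes log n combining rounds; it is only a simulation of the PRAM program above, not a parallel implementation.

    # Sequential simulation of the PRAM matrix-multiplication algorithm above
    # (n^3 virtual PEs): step 1 forms all product terms, step 2 does a log n
    # pairwise reduction along the k dimension.
    def pram_matmul(A, B):
        n = len(A)                           # n is assumed to be a power of 2
        C = [[[A[i][k] * B[k][j] for k in range(n)]      # step 1: n^3 products
              for j in range(n)] for i in range(n)]
        ell = n
        while ell > 1:                       # step 2: log n halving rounds
            ell //= 2
            for i in range(n):
                for j in range(n):
                    for k in range(ell):     # PE(i, j, k) with k < ell is active
                        C[i][j][k] += C[i][j][k + ell]
        return [[C[i][j][0] for j in range(n)] for i in range(n)]

    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(pram_matmul(A, B))                 # [[19, 22], [43, 50]]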

Discrepancy with Physical Models PRAM models idealize parallel computers, in which all memory
references and program executions by multiple processors are synchronized without extra cost. In reality,
such parallel machines do not exist. An SIMD machine with shared memory is the closest architecture
modeled by PRAM. However, PRAM allows different instructions to be executed on different processors
simultaneously. Therefore, PRAM really operates in synchronized MIMD mode with a shared memory.
Among the four PRAM variants, the EREW and CREW are the most popular models used. In fact,
every CREW algorithm can be simulated by an EREW algorithm. The CREW algorithm runs faster than an
equivalent EREW algorithm. It has been proved that the best n-processor EREW algorithm can be no more
than O(log n) times slower than any n-processor CRCW algorithm.
The CREW model has received more attention in the literature than the ERCW model. For our purposes,
we will use the CRCW-PRAM model unless otherwise stated. This particular model will be used in defining
scalability in Chapter 3.
For complexity analysis or performance comparison, various PRAM variants offer an ideal model of parallel
computers. Therefore, computer scientists use the PRAM model more often than computer engineers. In this
book, we design parallel/vector computers using physical architectural models rather than PRAM models.
The PRAM model will be used for scalability and performance studies in Chapter 3 as a theoretical reference
machine. PRAM models can indicate upper and lower bounds on the performance of real parallel computers.

1.4.2 VLSI Complexity Model


Parallel computers rely on the use of VLSI chips to fabricate the major components such as processor
arrays, memory arrays, and large-scale switching networks. An AT² model for two-dimensional VLSI chips
is presented below, based on the work of Clark Thompson (1980). Three lower bounds on VLSI circuits
are interpreted by Jeffrey Ullman (1984). The bounds are obtained by setting limits on memory, I/O, and
communication for implementing parallel algorithms with VLSI chips.
The AT² Model Let A be the chip area and T be the latency for completing a given computation using a
VLSI circuit chip. Let s be the problem size involved in the computation. Thompson stated in his doctoral
thesis that for certain computations, there exists a lower bound f(s) such that

A × T² ≥ O(f(s))     (1.6)

The chip area A is a measure of the chip's complexity. The latency T is the time required from when inputs
are applied until all outputs are produced for a single problem instance. Figure 1.15 shows how to interpret
the AT² complexity results in VLSI chip development. The chip is represented by the base area in the two
horizontal dimensions. The vertical dimension corresponds to time. Therefore, the three-dimensional solid
represents the history of the computation performed by the chip.
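As a numerical illustration of the tradeoff implied by Eq. 1.6, the sketch below compares two hypothetical chip designs for the same problem: one using more area and less time, the other less area and more time. The lower bound f(s) and all numbers are invented for the example.

    # Comparing hypothetical VLSI designs under the AT^2 figure of merit.
    def at2(area, time):
        return area * time ** 2

    lower_bound = 1.0e6                      # assumed lower bound f(s) for the problem
    designs = {
        "wide, fast chip":   (4.0e4, 5.0),   # (area in unit squares, latency in cycles)
        "narrow, slow chip": (1.0e3, 40.0),
    }
    for name, (A, T) in designs.items():
        metric = at2(A, T)
        print(f"{name:18s} A*T^2 = {metric:,.0f}  feasible: {metric >= lower_bound}")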

Memory Bound on Chip Area There are many computations which are memory-bound, due to the need
to process large data sets. To implement this type of computation in silicon, one is limited by how densely
information (bit cells) can be placed on the chip. As depicted in Fig. 1.15a, the memory requirement of a
computation sets a lower bound on the chip area A.
The amount of information processed by the chip can be visualized as information flow upward across the
chip area. Each bit can flow through a unit area of the horizontal chip slice. Thus, the chip area bounds the
amount of memory bits stored on the chip.

I/O Bound on Volume AT The volume of the rectangular cube is represented by the product AT. As
information flows through the chip for a period of time T, the number of input bits cannot exceed the volume
AT, as demonstrated in Fig. 1.15a.

Fig. 1.15 The AT² complexity model of two-dimensional VLSI chips: (a) the memory-limited bound on chip area A and the I/O-limited bound on the chip history, represented by the volume AT; (b) the communication-limited bound on the bisection, represented by the volume of the cross section cut through the chip history
