Introduction to Multiprocessor Concepts
Introduction
Why parallel processing?
-to meet the increasing demand for higher performance, lower cost, and sustained
productivity in real-life applications
How is it achieved?
-multiprogramming, multiprocessing or multicomputing
-Parallelism can appear in various forms: lookahead, pipelining,
vectorization, concurrency, simultaneity, data parallelism, partitioning,
interleaving, overlapping, multiplicity, replication, time sharing,
multithreading, and distributed computing, at various processing levels.
Multiprocessors and Multicomputers
Parallel computers can be modeled as physical machines that either share a
common memory or use unshared, distributed memories
Shared-Memory Multiprocessors
Three main shared-memory multiprocessor models:
-Uniform memory access (UMA)
-Nonuniform memory access (NUMA)
-Cache-only memory architecture (COMA)
They differ in how the memory and peripheral resources are shared or
distributed
The UMA model
Physical memory is uniformly shared by all the processors
All processors have equal access time to all memory words
Peripherals are usually shared as well, though each processor may have a
private cache
Multiprocessors are called tightly coupled systems because of the high degree
of resource sharing
The system interconnect can be a common bus, a crossbar switch, or a
multistage network
UMA machines are suitable for general-purpose and time-sharing applications
by multiple users
Cont..
They are also used to speed up the execution of large time-critical programs
Parallel events are coordinated through shared variables in the common
memory (a minimal sketch follows after this list)
Symmetric multiprocessor: all processors have equal access to all peripheral
devices and are equally capable of running the OS kernel and I/O service
routines
Asymmetric multiprocessor: only one or a subset of processors are
executive-capable (master processors)
-The master processor executes the OS and handles I/O
-The other processors have no I/O capability and are called attached
processors (APs)
-APs execute user code under the supervision of the master processor
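The idea of coordinating parallel events through shared variables can be illustrated with a small POSIX-threads sketch; this is an assumed example rather than anything from the slides, but it shows two threads on a shared-memory (UMA-style) machine updating a variable that lives in the common memory, with a mutex serializing the updates.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    static long shared_sum = 0;                          /* shared variable in common memory */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        long my_part = (long)(intptr_t)arg;              /* this thread's contribution */
        pthread_mutex_lock(&lock);                       /* serialize access to shared_sum */
        shared_sum += my_part;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)(intptr_t)10);
        pthread_create(&t2, NULL, worker, (void *)(intptr_t)32);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared_sum = %ld\n", shared_sum);        /* prints 42 */
        return 0;
    }

Both threads see the same physical shared_sum; on a UMA machine either thread reaches it in the same access time.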
Cont..
Approximate performance of a uniprocessor
Arrays A(I), B(I), and C(I) are each assumed to have N elements
Statements L2, L4, and L6 are assumed to take 1 machine cycle each; the time
for L1, L3, L5, and L7 is ignored
K cycles are needed for each interprocessor communication operation through
the shared memory
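The program fragment that the labels L1-L7 refer to is not reproduced on the slide; the sketch below is a plausible reconstruction in C (the array names come from the slide, the loop structure and everything else are assumed). L2 and L6 each execute N times at one cycle per execution, which is where the 2N-cycle uniprocessor estimate below comes from.

    #include <stdio.h>
    #define N 1024

    double A[N], B[N], C[N];

    int main(void) {
        double sum;
        int i, j;
        /* L1 */ for (i = 0; i < N; i++) {
        /* L2 */     A[i] = B[i] + C[i];      /* executed N times, 1 cycle each */
        /* L3 */ }
        /* L4 */ sum = 0.0;
        /* L5 */ for (j = 0; j < N; j++) {
        /* L6 */     sum = sum + A[j];        /* executed N times, 1 cycle each */
        /* L7 */ }
        printf("sum = %f\n", sum);
        return 0;
    }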
Cont..
Ignoring bus contention and memory-access conflicts:
The uniprocessor takes
-about 2N cycles (N for the I loop and N for the J loop)
A multiprocessor with M processors:
-the loop is partitioned into M sections, with L = N/M elements per section
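One way to complete the multiprocessor estimate, under the additional assumption (not stated on the slide) that the M partial sums are merged in a binary tree, each merge step costing K cycles of shared-memory communication plus one cycle for the addition:
-each processor spends L = N/M cycles on its slice of the I loop and another L cycles forming its partial sum, i.e. 2N/M cycles in total
-merging the M partial sums takes log2(M) steps, roughly (K + 1) * log2(M) cycles
-so the multiprocessor needs about 2N/M + (K + 1) * log2(M) cycles, and the speedup over the uniprocessor's 2N cycles is roughly 2N / (2N/M + (K + 1) * log2(M))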
NUMA model
A NUMA machine is a shared-memory system in which the access time varies with
the location of the memory word
The shared memory is physically distributed among the processors as local
memories
The collection of all local memories forms a global address space accessible
by all processors
Cont..
COMA
A special case of the NUMA model in which the distributed main memories are
converted into caches
There is no memory hierarchy at each processor node
All the caches together form the global address space
Remote cache access is assisted by distributed cache directories
Cont..
Multiprocessor systems are suitable for general-purpose multiuser
applications where programmability is the major concern
The main shortcoming of multiprocessors is their limited scalability
It is difficult to build a large machine around a centralized shared memory,
and latency tolerance for remote memory access is a further limitation
Hence distributed-memory multicomputers can be used instead: they scale
better but offer less programmability
Distributed memory computers
A distributed-memory multicomputer consists of multiple computers (nodes)
interconnected by a message-passing network
Each node is an independent computer with a processor and local memory;
disks and I/O peripherals are optional
The message-passing network provides point-to-point static connections
among the nodes
Since a processor can access only its private local memory, multicomputers
are called no-remote-memory-access (NORMA) machines
Internode communication is carried out by passing messages over the static
connection network (a minimal sketch follows below)
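A minimal message-passing sketch, using MPI as an assumed example (the slides do not name a particular message-passing library): node 0 sends an integer to node 1; each node touches only its own local memory, and all data exchange happens through explicit point-to-point messages.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);            /* which node am I? */
        if (rank == 0) {
            value = 42;
            /* point-to-point send from node 0 to node 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* node 1 receives the message into its own local memory */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node 1 received %d from node 0\n", value);
        }
        MPI_Finalize();
        return 0;
    }

Run with at least two processes (e.g. mpirun -np 2); there is no shared address space, so the value can move between nodes only inside a message.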
Cont..
Conditions for Parallelism
To exploit parallelism in computing, the following key areas have been
identified:
-Computation models for parallel computing
-Interprocessor communication in parallel architectures
-System integration
Data dependencies
Executing several program segments in parallel requires each segment to be
independent of the others
Dependence graphs are used to describe the dependence relations among
statements
The nodes of a dependence graph correspond to instructions, and the directed
edges show the ordering relations between them
Analysis of the dependence graph shows where opportunities for
parallelization and vectorization exist
Data Dependence
Types of data dependence: flow dependence, antidependence, output
dependence, I/O dependence, and unknown dependence (the first three are
illustrated in the sketch below)
Data dependence in programs
Dependence is a partial ordering relation, i.e., not every pair of statements
is related
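A small illustrative fragment (an assumed example, not taken from the slides) showing the first three dependence types among ordinary assignment statements:

    #include <stdio.h>

    int main(void) {
        int a, b = 2, c = 3, d, e = 7;
        a = b + c;   /* S1 */
        d = a * 2;   /* S2: flow dependence on S1 (S2 reads a, which S1 writes)  */
        b = d - 1;   /* S3: antidependence on S1 (S3 writes b, which S1 reads)   */
        a = e + 5;   /* S4: output dependence on S1 (S1 and S4 both write a)     */
        printf("%d %d %d %d %d\n", a, b, c, d, e);
        return 0;
    }

Reordering any of these dependent pairs would change the values produced, which is exactly what the dependence relation forbids.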
I/O dependence arises from I/O operations
In the referenced example, statements S1 and S3 are I/O dependent because
both access tape unit 4
Hence data dependence relations must not be violated arbitrarily during
program execution, or erroneous results may be produced
In a uniprocessor system, repeated runs yield the same results because the
order of execution is fixed
Cont..
In a multiprocessor system, the program order may or may not be preserved,
depending on the memory model used
Determinism is obtained by
-programmer control
-constrained modification of writable data in shared memory
Control Dependencies
Control dependence arises when the order of execution of statements cannot
be determined before run time
The different paths taken after a conditional branch may introduce or
eliminate data dependences among instructions
Dependence may also exist between operations performed in successive
iterations of a loop
Cont..
In the referenced example, the successive iterations of the loop are
control-independent
Cont..
Control dependence example (see the sketch below)
Control dependence often restricts parallelism
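An illustrative pair of loops (an assumed example; the slide's own example is not reproduced here): in the first loop the branch in each iteration depends only on that iteration's data, so the iterations are control-independent; in the second loop the branch depends on a value the previous iteration may have written, which restricts parallel execution.

    #include <stdio.h>
    #define N 8

    int main(void) {
        int a[N] = {3, -1, 4, -1, 5, -9, 2, 6};
        int i;
        /* control-independent iterations: they can run in parallel */
        for (i = 0; i < N; i++) {
            if (a[i] < 0)
                a[i] = 0;
        }
        /* control-dependent iterations: iteration i tests a[i-1],
           which iteration i-1 may have just written */
        for (i = 1; i < N; i++) {
            if (a[i - 1] < 0)
                a[i] = 0;
        }
        for (i = 0; i < N; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }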
Resource Dependence
Resource dependence is concerned with conflicts in using shared resources
such as integer units, floating-point units, registers, memory areas, and
the ALU
Bernstein’s Conditions
A process is a software entity corresponding to the abstraction of a program
fragment defined at various processing levels
Ii: the set of all input variables needed to execute Pi (operands fetched
from memory or registers)
Oi: the set of all output variables generated by executing Pi (results to be
stored in registers or memory locations)
If P1 and P2 have no input/output dependences on each other, they can
execute in parallel, written P1 || P2; this is established by Bernstein's
conditions, i.e., flow independence, anti-independence, and output
independence
Cont..
P1 || P2 if and only if I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅
(each process's input set is disjoint from the other's output set, and the
two output sets are disjoint)
Example
Consider five processes P1, P2, P3, P4, P5 in program order
Assume each statement takes one step and no pipelining is used
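The slide's five statements are not shown, so the sketch below uses five hypothetical processes; it encodes each process's input and output sets as bitmasks over the variables a..g and checks Bernstein's conditions for every pair.

    #include <stdio.h>

    typedef struct { unsigned in, out; } Proc;   /* input/output sets as bitmasks */

    static int parallel(Proc p, Proc q) {
        return (p.in  & q.out) == 0 &&           /* q writes nothing that p reads */
               (q.in  & p.out) == 0 &&           /* p writes nothing that q reads */
               (p.out & q.out) == 0;             /* no common output variable     */
    }

    int main(void) {
        enum { a = 1 << 0, b = 1 << 1, c = 1 << 2, d = 1 << 3,
               e = 1 << 4, f = 1 << 5, g = 1 << 6 };
        Proc P[5] = {
            { b | c, a },   /* P1: a = b + c */
            { a,     d },   /* P2: d = a * 2 */
            { f | g, e },   /* P3: e = f - g */
            { d | e, b },   /* P4: b = d + e */
            { c,     f },   /* P5: f = c - 1 */
        };
        for (int i = 0; i < 5; i++)
            for (int j = i + 1; j < 5; j++)
                if (parallel(P[i], P[j]))
                    printf("P%d || P%d\n", i + 1, j + 1);
        return 0;
    }

Here P1 || P3, P1 || P5, P2 || P3, P2 || P5, and P4 || P5, while a pair such as P1 and P2 is not parallel because P2 reads the variable a that P1 writes.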
Cont..
The parallelism relation || is commutative but not transitive (Pi || Pj and
Pj || Pk do not imply Pi || Pk), so it is not an equivalence relation; it is,
however, associative
Detection of parallelism
Cont..
In general, a set of n processes can execute in parallel, P1 || P2 || ... || Pn,
if and only if Pi || Pj holds for every pair i ≠ j
A violation of Bernstein's conditions may prohibit parallelism for the whole
set or only for particular pairs
Hardware and Software Parallelism
Hardware parallelism is defined by the machine architecture and hardware
multiplicity
It is often a function of cost and performance trade-offs
It indicates the peak performance of the processor resources
Cont..
It can be characterized by the number of instructions issued per machine
cycle: a processor that issues k instructions per machine cycle is called a
k-issue processor
A conventional processor may take one or two machine cycles to issue a
single instruction
One Intel processor variant is a 3-issue processor, issuing, for example, one
arithmetic, one memory-access, and one branch instruction per machine cycle
An IBM processor variant issues 4: one arithmetic, one memory-access, one
floating-point, and one branch instruction per cycle
A multiprocessor system built with n k-issue processors should be able to
handle a maximum of nk threads of instructions simultaneously
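For example, a system of four 3-issue processors (n = 4, k = 3) could issue at
most 4 × 3 = 12 instructions in a single machine cycle.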
Software Parallelism
Software parallelism is defined by the control and data dependences in
programs
It is a function of the algorithm, the programming style, and compiler
optimization
The parallelism in a program varies during execution
It often limits the sustained performance of the processor
Mismatch between software parallelism and hardware parallelism