Motivation for Parallelism

The speed of an application is determined by more than just processor speed:
    Memory speed
    Disk speed
    Network speed
    ...
Multiprocessors typically improve the aggregate speeds:
    Memory bandwidth is improved by separate memories.
    Multiprocessors usually have more aggregate cache memory.
    Each processor in a cluster can have its own disk and network adapter, improving aggregate speeds.

Communication enables parallel applications:
    Harnessing the computing power of distributed systems over the Internet is a popular example of parallel processing (SETI, Folding@home, ...).
Constraints on the location of data:
    Huge data sets could be difficult, expensive, or otherwise infeasible to store in a central location.
    Distributed data and parallel processing is a practical solution.
Types of Parallelism

Instruction Level Parallelism (ILP)
    Instructions near each other in an instruction stream could be independent.
    These can then execute in parallel:
        either partially (pipelining),
        or fully (superscalar).
    Hardware is needed for dependency tracking.
    The amount of ILP available per instruction stream is limited.
ILP is usually considered an implicit form of parallelism, since the hardware exploits it automatically without programmer/compiler intervention.
Programmers/compilers could transform applications to expose more ILP.

ILP Example: Loop Unrolling

    for( i = 0; i < 64; i++ ) {
        vec[i] = a * vec[i];
    }

This loop has one sequence of load, multiply, store per iteration. The amount of ILP is very limited.

    for( i = 0; i < 64; i += 4 ) {
        vec[i+0] = a * vec[i+0];
        vec[i+1] = a * vec[i+1];
        vec[i+2] = a * vec[i+2];
        vec[i+3] = a * vec[i+3];
    }

4-fold loop unrolling increases the amount of ILP exploitable by the hardware: four independent load, multiply, store sequences per iteration.
Types of Parallelism

Task Level Parallelism (TLP)
    Several instruction streams are independent.
    Parallel execution of the streams is possible.
    Much coarser-grained parallelism than ILP.
Typically requires the programmer and/or compiler to:
    decompose the application into tasks,
    enforce dependencies,
    and expose parallelism.
Some experimental techniques, such as speculative multithreading, are aimed at removing these burdens from the programmer.

TLP Example: Quicksort

    QuickSort( A, B ):
        if( B - A < 10 ) {
            /* Base Case: Use fast sort */
            FastSort( A, B );
        } else {
            /* Continue Recursively */
            Partition [A,B] into [A,C] and [C+1,B];   /* Task X */
            QuickSort( A, C );                        /* Task Y */
            QuickSort( C+1, B );                      /* Task Z */
        }

Both Y and Z depend on X but are mutually independent.
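
As a hedged sketch of how this task structure could be expressed in practice, the recursive calls can be spawned as tasks, here using OpenMP tasks (OpenMP 3.0 or later, compiled with e.g. -fopenmp); the helpers partition() and insertion_sort() below are illustrative stand-ins for the Partition step and FastSort, not code from the slides:

    /* Sketch only: task-parallel quicksort with OpenMP tasks. */
    #include <omp.h>

    static void insertion_sort(int *v, int lo, int hi)    /* "FastSort" base case */
    {
        for (int i = lo + 1; i <= hi; i++) {
            int x = v[i], j = i - 1;
            while (j >= lo && v[j] > x) { v[j + 1] = v[j]; j--; }
            v[j + 1] = x;
        }
    }

    static int partition(int *v, int lo, int hi)           /* Task X */
    {
        int pivot = v[(lo + hi) / 2];
        int i = lo - 1, j = hi + 1;
        for (;;) {
            do { i++; } while (v[i] < pivot);
            do { j--; } while (v[j] > pivot);
            if (i >= j) return j;
            int t = v[i]; v[i] = v[j]; v[j] = t;
        }
    }

    static void quicksort(int *v, int lo, int hi)
    {
        if (hi - lo < 10) {
            insertion_sort(v, lo, hi);         /* base case: small range */
        } else {
            int c = partition(v, lo, hi);      /* Task X: runs first */

            /* Tasks Y and Z depend on X but not on each other. */
            #pragma omp task shared(v)
            quicksort(v, lo, c);               /* Task Y */
            #pragma omp task shared(v)
            quicksort(v, c + 1, hi);           /* Task Z */
            #pragma omp taskwait               /* join before returning */
        }
    }

    void parallel_sort(int *v, int n)
    {
        #pragma omp parallel                   /* create the thread team */
        {
            #pragma omp single nowait          /* one thread starts the recursion */
            quicksort(v, 0, n - 1);
        }
    }

The partition call (task X) finishes before tasks Y and Z are created, which enforces the dependence in the diagram above; the taskwait joins Y and Z before the parent returns, while the runtime is free to run them on different cores.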
Types of Parallelism

Data Parallelism (DP)
    In many applications a collection of data is transformed in such a way that the operations on each element are largely independent of the others.
    A typical scenario is when we apply the same instruction to a collection of data.
    Example: adding two arrays

            3 7 4 3 6 5 4
        +   4 1 5 6 2 4 3
        =   7 8 9 9 8 9 7

    The same operation (+) is applied to a collection of data.

Superscalar and OoO execution

Scalar processors can issue one instruction per cycle.
Superscalar processors can issue more than one instruction per cycle.
    A common feature in most modern processors.
    By replicating functional units and adding hardware to detect and track instruction dependencies, a superscalar processor takes advantage of Instruction Level Parallelism.
A related technique (applicable to both scalar and superscalar processors) is out-of-order (OoO) execution.
    Instructions are reordered (by hardware) for better utilization of the pipeline(s).
An excellent example of the use of extra transistors to speed up execution without programmer intervention.
    Naturally limited by the available ILP.
    Also severely limited by the hardware complexity of dependency checking.
    In practice: 2-way superscalar architectures are common; more than 4-way is unlikely.
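
Returning to the array-addition example above, a minimal data-parallel sketch in C (the function name add_arrays and the optional OpenMP directive are illustrative assumptions):

    /* Element-wise addition of two arrays: every iteration is independent. */
    void add_arrays(const int *a, const int *b, int *c, int n)
    {
        #pragma omp parallel for       /* optional: distribute iterations over cores */
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];        /* the same operation (+) on every element */
    }

Since every iteration touches different elements, the iterations can be distributed over cores and/or mapped onto SIMD units without any synchronization.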
Vector Processors

Vector processors refer to a previously common supercomputer architecture where a vector is a basic memory abstraction.
    Examples: Cray 1, IBM 3090/VF.
Vector: a 1D array of numbers.
Example: add two vectors
    Scalar solution:
        Loop through the vectors and add each scalar element
        Repeated address translations
        Branches
    Vector solution:
        Add the vectors via a vector addition instruction
        Address translations done once
        No branches
Independent operations enable:
    deeper pipelines,
    higher clock frequencies, and
    multiple concurrent functional units (e.g., SIMD units).

Vector processors are suitable for a certain range of applications.
Traditional, scalar processor designs have been successful over a larger spectrum of applications.
Economic realities favor large clusters of commodity processors or small-scale SMPs.
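
To make the scalar-versus-vector contrast above concrete, here is a hedged sketch of vector addition using x86 SSE intrinsics, one of the SIMD extensions discussed on the next slide (the 32-bit float element type and unaligned loads are assumptions):

    /* Sketch: the "add two vectors" example with x86 SSE intrinsics. */
    #include <xmmintrin.h>

    void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = 0;
        for (; i + 4 <= n; i += 4) {                  /* one instruction adds 4 floats */
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
        }
        for (; i < n; i++)                            /* scalar cleanup for the tail */
            c[i] = a[i] + b[i];
    }

A single _mm_add_ps adds four elements at once, so loop control and address arithmetic are amortized over each four-element packet, echoing the vector-processor advantages listed above.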
Single Instruction Multiple Data (SIMD)

Several functional units execute the same instruction on different data streams, simultaneously and synchronized.
A suitable architecture for many data parallel applications:
    Matrix computations
    Graphics processing
    Image analysis
    ...
Found primarily in common microprocessors and GPUs:
    SIMD instruction extensions such as MMX, SSE, AltiVec.

Shared Memory Multiprocessor

Multiprocessors where all processors share a single address space are commonly called Shared Memory Multiprocessors.
They can be classified based on the access time different processors have to different memory areas:
    Uniform Memory Access (UMA): each processor has the same access time.
    Non-Uniform Memory Access (NUMA): some memory is closer to a processor; access time is higher for distant memory.
Furthermore, their caches may or may not be coherent.
    CC-UMA: Cache-Coherent Uniform Memory Access
    Also known as a Symmetric MultiProcessor (SMP).
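
A minimal sketch of what a single shared address space means to the programmer, using POSIX threads (the shared counter, the lock, and the choice of 4 threads are illustrative assumptions):

    /* Threads on different processors access the very same variables. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                          /* lives in the shared address space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);                /* coordinate access to shared data */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);           /* 400000: all threads saw the same memory */
        return 0;
    }

All threads update the very same counter through ordinary loads and stores; on a NUMA machine the code is unchanged, only the access times differ.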
Bus-Based UMA MP

[Figure: four chips, each with a processor and a cache, connected over a shared bus to a single memory.]

NUMA MP

[Figure: processors with local memories, connected to each other by an interconnect; remote memory is reached over the interconnect.]
Multicore

When several processor cores are physically located in the same processor socket we refer to it as a multicore processor.
Both Intel and AMD now have quad-core (4 cores) processors in their product portfolios.
A new desktop computer today is definitely a multiprocessor.
Multicores usually have a single address space and are cache-coherent.
They are very similar to SMPs, but they
    typically share one or more levels of cache, and
    have more favorable inter-processor/core communication speed.

[Figure: a single chip with four cores, each with a private L1 cache, sharing an L2 cache and the connection to memory.]
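
As a small illustration of the claim that a new desktop computer is a multiprocessor, a program can query how many processors/cores the runtime sees; the sketch below uses omp_get_num_procs() and assumes an OpenMP-enabled compiler (e.g. -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Number of processors (cores) the OpenMP runtime can use. */
        printf("available processors: %d\n", omp_get_num_procs());
        return 0;
    }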
Multicore

Multicore chips have multiple benefits:
    Higher peak performance
    Power consumption control
        Some cores can be turned off.
    Production yield increase
        8-core chips with a defective core are sold with one core disabled.
...but also some potential drawbacks:
    Memory bandwidth per core is limited
        Physical limits such as the number of pins.
    Lower peak performance per thread
        Some inherently sequential applications may actually run slower.

Distributed Memory Multiprocessor

In contrast to a single address space, machines with multiple private memories are commonly called Distributed Memory Machines.
Data is exchanged between memories via messages communicated over a dedicated network.
When the processor/memory pairs are physically separated, such as on different boards or in different casings, such machines are called Clusters.
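
A hedged sketch of such message passing using MPI (MPI is not named on the slide, but it is the de facto standard for programming distributed memory machines; run with e.g. mpirun -np 2):

    /* Sketch: exchanging data between private memories with MPI messages. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);         /* which process/node am I? */

        if (rank == 0) {
            value = 42;                               /* exists only in node 0's memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node 1 received %d\n", value);    /* a copy now in node 1's memory */
        }

        MPI_Finalize();
        return 0;
    }

The variable value exists separately in each process's private memory; the only way for node 1 to see node 0's value is to have it copied in a message over the network.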
Cluster

[Figure: four nodes, each with its own processor, cache, and memory, connected by a network.]

Clusters: Past and Present

In the past, clusters were exclusively high-end machines with
    custom supercomputer processors, and
    custom high-performance interconnects.
These machines were very expensive and therefore limited to research and big corporations.
From the '90s onwards it has become increasingly common to build clusters from off-the-shelf components:
    commodity processors, and
    commodity interconnects (e.g. Ethernet with switches).
The economic benefits and programmer familiarity with commodity components far outweigh the performance issues.
This has helped "democratize" supercomputing: many corporations and universities have clusters today.
Networks
Access to memory on other nodes is very expensive.
    Data must be transferred over a relatively high-latency, low-bandwidth network.
    Algorithms with low data locality will suffer.
    High synchronization requirements will also degrade performance for the same reason.
The network design is a tradeoff between conflicting goals:
    Maximum bandwidth and low latency: full connectivity
    Low cost and power consumption: tree network
Switch-based networks are common today; other examples of topologies include rings, meshes, hypercubes, and trees.