
Taxonomy of parallel computing paradigms:

A widely used taxonomy for describing the number of concurrent control and data
streams present in a parallel architecture was proposed by Flynn [R38]. The dominant
concepts today are the SIMD and MIMD variants:

SIMD Single Instruction, Multiple Data. A single instruction stream, either on a single
processor (core) or on multiple compute elements, provides parallelism by operating on
multiple data streams concurrently. Examples are vector processors, the SIMD
capabilities of modern superscalar microprocessors, and graphics processing units
(GPUs). Historically, large-scale multiprocessor SIMD parallelism, now all but extinct,
was implemented in Thinking Machines’ Connection Machine supercomputers.
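
As a minimal sketch of this idea (the function name and the use of the OpenMP simd directive are illustrative assumptions, not taken from the text), a single C loop can expose data parallelism that the compiler maps onto SIMD (vector) instructions:

#include <stddef.h>

/* One instruction stream (the loop body) operates on many data elements;
   a vectorizing compiler can execute several iterations per instruction. */
void axpy(size_t n, double a, const double *x, double *y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

Compiled with vectorization enabled (e.g., gcc -O3 -fopenmp-simd), groups of iterations are processed by single vector instructions.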

MIMD Multiple Instruction, Multiple Data. Multiple instruction streams on multiple
processors (cores) operate on different data items concurrently. The shared-memory and
distributed-memory parallel computers described in this chapter are typical examples of
the MIMD paradigm.

Shared-memory computers:

A shared-memory parallel computer is a system in which a number of CPUs work on a
common, shared physical address space.

There are two varieties of shared-memory systems that have very different performance
characteristics in terms of main memory access:

Uniform Memory Access (UMA) systems exhibit a “flat” memory model: Latency and
bandwidth are the same for all processors and all memory locations. This is also called
symmetric multiprocessing (SMP). At the time of writing, single multicore processor
chips are “UMA machines.” However, “cluster on a chip” designs that assign separate
memory controllers to different groups of cores on a die are already beginning to appear.

In cache-coherent Nonuniform Memory Access (ccNUMA) machines, memory is
physically distributed but logically shared. The physical layout of such systems is quite
similar to the distributed-memory case, but network logic makes the aggregated
memory of the whole system appear as one single address space. Due to the distributed
nature, memory access performance varies depending on which CPU accesses which
parts of memory (“local” vs. “remote” access).
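
The local-vs.-remote distinction matters for programming. The sketch below assumes the common first-touch page placement policy (an assumption beyond this text, typical of Linux-based ccNUMA systems) and uses OpenMP to initialize an array in parallel, so that each part of it ends up in the memory local to the cores that will later use it:

#include <stddef.h>

/* Assumption: a memory page is placed in the locality domain of the core
   that first writes ("touches") it. Parallel initialization therefore
   spreads the array across locality domains, turning later parallel
   accesses into mostly "local" ones. */
void numa_aware_init(double *a, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
        a[i] = 0.0;   /* the first touch decides where each page lives */
}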

Cache coherence:

Cache coherence mechanisms are required in all cache-based multiprocessor systems,
whether they are of the UMA or the ccNUMA kind. This is because copies of the same
cache line could potentially reside in several CPU caches.

In the MESI protocol, the four possible states a cache line can assume are:

M modified: The cache line has been modified in this cache, and it resides in no other
cache than this one. Only upon eviction will memory reflect the most current state.

E exclusive: The cache line has been read from memory but not (yet) modified. However,
it resides in no other cache.
S shared: The cache line has been read from memory but not (yet) modified. There may
be other copies in other caches of the machine.

I invalid: The cache line does not reflect any sensible data. Under normal circumstances
this happens if the cache line was in the shared state and another processor has
requested exclusive ownership.

The following figure shows an example with two processors P1 and P2 and their respective
caches C1 and C2. Each cache line holds two items.
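
The four states can be summarized in code; the following C enum is purely an illustrative labeling of the states described above, not the hardware protocol logic:

/* Illustrative sketch only -- a labeling of the MESI cache line states. */
enum cache_line_state {
    MODIFIED,   /* M: modified in this cache only; memory is updated on eviction      */
    EXCLUSIVE,  /* E: read from memory, unmodified, present in no other cache         */
    SHARED,     /* S: read from memory, unmodified, copies may exist in other caches  */
    INVALID     /* I: no sensible data, e.g., after another processor requested       */
                /*    exclusive ownership of a formerly shared line                   */
};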

UMA:

The simplest implementation of a UMA system is a dual-core processor, in which two
CPUs on one chip share a single path to memory. It is very common in high performance
computing to use more than one chip in a compute node, be they single-core or
multicore.

In the following figure, two (single-core) processors, each in its own socket, communicate
and access memory over a common bus, the so-called frontside bus (FSB). All
arbitration protocols required to make this work are already built into the CPUs. The
chipset (often termed the “northbridge”) is responsible for driving the memory modules and
connects to other parts of the node like I/O subsystems. This kind of design is outdated
and no longer used in modern systems.
ccNUMA:

In ccNUMA, a locality domain (LD) is a set of processor cores together with locally
connected memory. This memory can be accessed in the most efficient way, i.e., without
resorting to a network of any kind. Multiple LDs are linked via a coherent interconnect,
which allows transparent access from any processor to any other processor’s memory.

The following figure illustrates a ccNUMA system.

Distributed-memory computers:

The following figure shows a simplified block diagram of a distributed-memory parallel
computer. Each processor P is connected to exclusive local memory, i.e., no other CPU
has direct access to it. Nowadays there are actually no distributed-memory systems
that still implement exactly such a layout; in this respect, the sketch is to be seen as a
programming model only.

The distributed-memory architecture outlined here is also named No Remote Memory
Access (NORMA). Some vendors provide libraries, and sometimes hardware support, for
limited remote memory access functionality even on distributed-memory machines.
Since such features are strongly vendor-specific and there is no widely accepted
standard available, they are not covered further here.
Hierarchical (hybrid) systems:

Large-scale parallel computers are neither of the purely shared-memory nor of the purely
distributed-memory type but a mixture of both, i.e., shared-memory building
blocks connected via a fast network.

The following figure illustrates such hybrid systems:

Two-socket building blocks are currently the “sweet spot” for inexpensive commodity
clusters, i.e., systems built from standard components that were not specifically
designed for high performance computing. Depending on which applications are run on
the system, this compromise may lead to performance limitations due to the reduced
available network bandwidth per core. Moreover, it is per se unclear how the complex
hierarchy of cores, cache groups, sockets and nodes can be utilized efficiently. The only
general consensus is that the optimal programming model is highly application- and
system-dependent.

Parallel computers with hierarchical structures as described above are also called
hybrids. The concept is actually more generic and can also be used to categorize any
system with a mixture of available programming paradigms on different hardware layers.
Prominent examples are clusters built from nodes that contain, besides the “usual”
multicore processors, additional accelerator hardware, ranging from application-specific
add-on cards to GPUs (graphics processing units), FPGAs (field-programmable gate
arrays), ASICs (application-specific integrated circuits), coprocessors, etc.

Networks:

The characteristics of the network that connects the “execution units,” “processors,” or
“compute nodes” play a dominant role here. A large variety of network
technologies and topologies are available on the market, some proprietary and some
open.
Basic performance characteristics of networks:

There are various options for the choice of a network in a parallel computer. The simplest
and cheapest solution to date is Gigabit Ethernet, which will suffice for many throughput
applications but is far too slow for parallel programs with any need for fast
communication.

Point-to-point communications:

A point-to-point connection refers to a communication link between two nodes or
endpoints.
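
A minimal sketch of a point-to-point message using MPI (the message size, tag, and ranks are arbitrary illustrative choices):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* one endpoint sends ...     */
        buf = 42;
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {     /* ... the other one receives */
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", buf);
    }

    MPI_Finalize();
    return 0;
}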

Bisection bandwidth:

The bisection bandwidth is the sum of the bandwidths of the minimal number of links that
must be cut to split the system into two equal halves.

Suppose that the nodes in one half of the machine can inject data into the network at an
aggregate rate of B bytes/sec. The network has full bisection bandwidth if its bisection
bandwidth is at least B.
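
As a worked example with assumed numbers: in a 16-node cluster where each node can inject 10 Gbyte/s into the network, the 8 nodes in one half can inject up to 8 x 10 = 80 Gbyte/s in aggregate. The network has full bisection bandwidth if the links cut by the bisection together provide at least 80 Gbyte/s.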

Buses:

A bus is a shared medium that can be used by exactly one communicating device at a
time.

A switched network subdivides all communicating devices into groups. The devices in
one group are all connected to a central network entity called a switch in a star-like
manner. Switches are then connected to each other, possibly using additional switch layers.
In such a network, the distance between two communicating devices varies according to
how many “hops” a message has to traverse before it reaches its destination. Therefore, a
multiswitch hierarchy is necessarily heterogeneous with respect to latency. The
maximum number of hops required to connect two arbitrary devices is called the
diameter of the network.

The fat tree network is a universal network for provably efficient communication. It
was invented by Charles E. Leiserson of the Massachusetts Institute of Technology in
1985.

In a fat tree, branches nearer the top of the hierarchy are "fatter" (thicker) than
branches further down the hierarchy. In a telecommunications network, the branches
are data links; the varied thickness (bandwidth) of the data links allows for more
efficient and technology-specific use.

Mesh networks:

A mesh network is a local network topology in which the infrastructure nodes (i.e.
bridges, switches and other infrastructure devices) connect directly, dynamically and
non-hierarchically to as many other nodes as possible and cooperate with one another to
efficiently route data from/to clients. This lack of dependency on one node allows for
every node to participate in the relay of information. Mesh networks dynamically self-
organize and self-configure, which can reduce installation overhead. The ability to self-
configure enables dynamic distribution of workloads, particularly in the event that a few
nodes should fail. This in turn contributes to fault-tolerance and reduced maintenance
costs.

If a network is built as a combination of at least two of the topologies described above, it
is called hybrid.

Basics of parallelization:

Why parallelize?

 A single core may be too slow to perform the required task(s) in a “tolerable”
amount of time. The definition of “tolerable” certainly varies, but “overnight” is
often a reasonable estimate. Depending on the requirements, “over lunch” or
“duration of a PhD thesis” may also be valid.

 The memory requirements cannot be met by the amount of main memory which
is available on a single system, because larger problems (with higher resolution,
more physics, more particles, etc.) need to be solved

Data parallelism:
Many problems in scientific computing involve processing of large quantities of
data stored on a computer. If this manipulation can be performed in parallel, i.e.,
by multiple processors working on different parts of the data, then it is called
data parallelism.

Example: (Medium-grained loop parallelism)

Processing of array data by loops or loop nests is a central component in most scientific
codes.
The following diagram depicts the medium-grained loop parallelism.

OpenMP, a compiler extension based on directives and a simple API, supports, among
other things, data parallelism on loops.
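
A minimal sketch of such medium-grained loop parallelism with an OpenMP directive (the array names and the loop body are illustrative assumptions):

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    /* The iterations are independent, so OpenMP distributes them
       across the threads of a shared-memory node. */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        a[i] = b[i] + c[i];

    printf("a[0] = %f\n", a[0]);
    return 0;
}

Compiled with, e.g., gcc -O2 -fopenmp, the loop runs on as many threads as the runtime provides.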
Example: Coarse-grained parallelism by domain decomposition:

In coarse-grained parallelism, a program is split into large tasks, so that a large
amount of computation takes place within each processor between communication events.
This might result in load imbalance, wherein certain tasks process the bulk of the data
while others are idle. Furthermore, coarse-grained parallelism may fail to exploit all the
parallelism in the program, as much of the computation within each task is performed
sequentially on one processor. The advantage of this type of parallelism is low
communication and synchronization overhead. Message-passing architectures take a
comparatively long time to communicate data among processes, which makes them
suitable for coarse-grained parallelism. The Cray Y-MP is an example of a coarse-grained
parallel computer, with a grain size of about 20s.
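
A hedged sketch of domain decomposition with message passing: each MPI process owns a contiguous chunk of a 1-D domain and exchanges one boundary ("ghost") value with each neighbor before doing a large amount of local work. The chunk size and the data are illustrative assumptions:

#include <mpi.h>

#define LOCAL_N 1000   /* grid points owned by each process (assumed) */

int main(int argc, char **argv)
{
    /* u[0] and u[LOCAL_N+1] are ghost cells for the neighbors' boundary values */
    double u[LOCAL_N + 2] = {0.0};
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Infrequent boundary exchange (coarse-grained communication) ... */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... followed by a large amount of purely local computation on u[1..LOCAL_N]. */

    MPI_Finalize();
    return 0;
}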

Functional Parallelism:
Sometimes the solution of a “big” numerical problem can be split into more or less
disparate subtasks, which work together by data exchange and synchronization. In this
case, the subtasks execute completely different code on different data items, which is
why functional parallelism is also called MPMD (Multiple Program Multiple Data).

Example: Master-worker scheme:


Reserving one compute element for administrative tasks while all others solve the actual
problem is called the master-worker scheme. The master distributes work and collects
results. A typical example is a parallel ray tracing program: A ray tracer computes a
photorealistic image from a mathematical representation of a scene. For each pixel to be
rendered, a “ray” is sent from the imaginary observer’s eye into the scene, hits surfaces,
gets reflected, etc., picking up color components. If all compute elements have a copy of
the scene, all pixels are independent and can be computed in parallel.

A drawback of the master-worker scheme is the potential communication and
performance bottleneck that may appear with a single master when the number of
workers is large.
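
A hedged sketch of the master-worker scheme with MPI; the work items and the process() placeholder are illustrative assumptions (in a ray tracer they would be pixels or image tiles):

#include <mpi.h>

#define NTASKS 100   /* number of independent work items (assumed) */

static double process(int task) { return 2.0 * task; }   /* placeholder work */

int main(int argc, char **argv)
{
    int rank, size, stop = -1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                         /* master: distribute work, collect results */
        int sent = 0, received = 0;
        double result;
        MPI_Status st;
        for (int w = 1; w < size && sent < NTASKS; ++w) {   /* one initial task each */
            MPI_Send(&sent, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
            ++sent;
        }
        while (received < sent) {            /* hand out the rest as workers finish */
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &st);
            ++received;
            if (sent < NTASKS) {
                MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
                ++sent;
            }
        }
        for (int w = 1; w < size; ++w)       /* negative task id = stop signal */
            MPI_Send(&stop, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
    } else {                                 /* worker: receive, compute, reply */
        int task;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (task < 0) break;
            double r = process(task);
            MPI_Send(&r, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}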
Example: Functional decomposition:
Divide the overall task into separate sub-tasks and assign each sub-task to a different CPU.
As each CPU finishes its sub-task, it passes data to the next CPU and receives data from
the previous CPU.
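
A compact sketch of this pipeline pattern in MPI, where rank i performs sub-task i (the stage() computation is a placeholder assumption):

#include <mpi.h>
#include <stdio.h>

static double stage(int rank, double x) { return x + rank; }   /* placeholder sub-task */

int main(int argc, char **argv)
{
    int rank, size;
    double data = 1.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank > 0)            /* receive intermediate data from the previous CPU */
        MPI_Recv(&data, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    data = stage(rank, data);                /* this CPU's sub-task */

    if (rank < size - 1)     /* pass the result on to the next CPU */
        MPI_Send(&data, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
    else
        printf("final result after %d stages: %f\n", size, data);

    MPI_Finalize();
    return 0;
}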
Functional Decomposition Drawbacks:
1. Startup cost
2. Load balancing
3. Scalability
Parallel scalability:
 Factors that limit parallel execution
o load imbalance;
o communication cost;
o costs of creating and scheduling processes; and
o I/O operations (mostly sequential in nature).
 Scalability metrics:
o Algorithmic limitations.
 Operations that cannot be done in parallel because of, e.g.,
mutual dependencies, can only be performed one after another,
or even in a certain order.
o Bottlenecks.
 Shared resources are common in computer systems: Execution
units in the core, shared paths to memory in multicore chips,
I/O devices. Access to a shared resource serializes execution.
Even if the algorithm itself could be performed completely in
parallel, concurrency may be limited by bottlenecks.
o Startup overhead.
 Starting a parallel program, regardless of the technical details,
takes time. Of course, system designs try to minimize startup
time, especially in massively parallel systems, but there is
always a nonvanishing serial part. If a parallel application’s
overall runtime is too short, startup will have a strong impact.
o Communication.
 Fully concurrent communication between different parts of a
parallel system cannot be taken for granted. If solving a
problem in parallel requires communication, some serialization
is usually unavoidable.
 Simple scalability laws
o Assume the work consists of a serial (nonparallelizable) part s and a
perfectly parallelizable part p, normalized so that s + p = 1. Serial runtime
is T_s = s + p, so serial performance for fixed problem size (work) s + p is
thus P_s = (s + p)/T_s = 1.
o On N workers the runtime shrinks to T_p = s + p/N, giving parallel
performance P_p = (s + p)/T_p = 1/(s + (1 - s)/N), which is also the
achievable speedup S (Amdahl's Law).
 Parallel efficiency
o Parallel efficiency is then defined as e = S/N = 1/(sN + 1 - s), which
tends toward 1/(sN) as N grows (see the sketch after this list).
 Refined Models
o blocking network
o nonblocking network, constant communication cost
o nonblocking network, domain decomposition with ghost layer
communication
 Choosing the right scaling baseline
o Today’s high performance computers are all massively parallel. In the
previous sections we have described the different ways a parallel
computer can be built: There are multicore chips, sitting in multisocket
shared-memory nodes, which are again connected by multilevel
networks. Hence, a parallel system always comprises a number of
hierarchy levels. Scaling a parallel code from one to many CPUs can
lead to false conclusions if the hierarchical structure is not taken into
account.
 Load imbalance
o Load imbalance occurs when synchronization points are reached by
some workers earlier than by others, leading to at least one worker
idling while others still do useful work.
o As a consequence, resources are underutilized.
o The consequences of load imbalance are hard to characterize in a
simple model without further assumptions about the work distribution.
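
A small sketch in C that evaluates the simple fixed-size scaling model above for an assumed serial fraction s (the values of s and N are arbitrary examples):

#include <stdio.h>

/* Amdahl speedup for serial fraction s on N workers: S = 1 / (s + (1 - s)/N) */
static double speedup(double s, int N) { return 1.0 / (s + (1.0 - s) / N); }

int main(void)
{
    const double s = 0.05;                     /* assumed serial fraction */
    for (int N = 1; N <= 1024; N *= 4) {
        double S = speedup(s, N);
        printf("N = %4d  speedup = %7.2f  efficiency = %.3f\n", N, S, S / N);
    }
    return 0;
}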
