Motivation for Parallelism

The speed of an application is determined by more than just processor speed:
    Memory speed
    Disk speed
    Network speed
    ...
Multiprocessors typically improve the aggregate speeds:
    Memory bandwidth is improved by separate memories.
    Multiprocessors usually have more aggregate cache memory.
    Each processor in a cluster can have its own disk and network adapter, improving aggregate speeds.

Communication enables parallel applications:
    Harnessing the computing power of distributed systems over the Internet is a popular example of parallel processing (SETI, Folding@home, ...).
Constraints on the location of data:
    Huge data sets could be difficult, expensive, or otherwise infeasible to store in a central location.
    Distributed data and parallel processing is a practical solution.
Types of Parallelism

Instruction Level Parallelism (ILP)
    Instructions near each other in an instruction stream could be independent.
    These can then execute in parallel:
        either partially (pipelining),
        or fully (superscalar).
    Hardware is needed for dependency tracking.
    The amount of ILP available per instruction stream is limited.
ILP is usually considered an implicit form of parallelism, since the hardware exploits it automatically without programmer/compiler intervention.
Programmers/compilers could transform applications to expose more ILP.

ILP Example: Loop Unrolling

    for( i = 0; i < 64; i++ ) {
        vec[i] = a * vec[i];
    }

This loop has one sequence of load, multiply, store per iteration. The amount of ILP is very limited.

    for( i = 0; i < 64; i += 4 ) {
        vec[i+0] = a * vec[i+0];
        vec[i+1] = a * vec[i+1];
        vec[i+2] = a * vec[i+2];
        vec[i+3] = a * vec[i+3];
    }

4-fold loop unrolling increases the amount of ILP exploitable by the hardware: four independent load, multiply, store sequences per iteration.
Types of Parallelism

Task Level Parallelism (TLP)
    Several instruction streams are independent.
    Parallel execution of the streams is possible.
    Much coarser-grained parallelism than ILP.
Typically requires the programmer and/or compiler to:
    decompose the application into tasks,
    enforce dependencies,
    and expose parallelism.
Some experimental techniques, such as speculative multithreading, are aimed at removing these burdens from the programmer.

TLP Example: Quicksort

    QuickSort( A, B ):
        if( B - A < 10 ) {
            /* Base Case: Use fast sort */
            FastSort( A, B );
        } else {
            /* Continue Recursively */
            Partition [A,B] into [A,C] and [C+1,B];   /* Task X */
            QuickSort( A, C );                        /* Task Y */
            QuickSort( C+1, B );                      /* Task Z */
        }

Both Y and Z depend on X but are mutually independent.
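
As a hedged sketch of how this task structure could be expressed in practice, the recursive calls can be spawned as tasks, here using OpenMP tasks (OpenMP 3.0 or later, compiled with e.g. -fopenmp); the helpers partition() and insertion_sort() below are illustrative stand-ins for the Partition step and FastSort, not code from the slides:

    /* Sketch only: task-parallel quicksort with OpenMP tasks. */
    #include <omp.h>

    static void insertion_sort(int *v, int lo, int hi)    /* "FastSort" base case */
    {
        for (int i = lo + 1; i <= hi; i++) {
            int x = v[i], j = i - 1;
            while (j >= lo && v[j] > x) { v[j + 1] = v[j]; j--; }
            v[j + 1] = x;
        }
    }

    static int partition(int *v, int lo, int hi)           /* Task X */
    {
        int pivot = v[(lo + hi) / 2];
        int i = lo - 1, j = hi + 1;
        for (;;) {
            do { i++; } while (v[i] < pivot);
            do { j--; } while (v[j] > pivot);
            if (i >= j) return j;
            int t = v[i]; v[i] = v[j]; v[j] = t;
        }
    }

    static void quicksort(int *v, int lo, int hi)
    {
        if (hi - lo < 10) {
            insertion_sort(v, lo, hi);         /* base case: small range */
        } else {
            int c = partition(v, lo, hi);      /* Task X: runs first */

            /* Tasks Y and Z depend on X but not on each other. */
            #pragma omp task shared(v)
            quicksort(v, lo, c);               /* Task Y */
            #pragma omp task shared(v)
            quicksort(v, c + 1, hi);           /* Task Z */
            #pragma omp taskwait               /* join before returning */
        }
    }

    void parallel_sort(int *v, int n)
    {
        #pragma omp parallel                   /* create the thread team */
        {
            #pragma omp single nowait          /* one thread starts the recursion */
            quicksort(v, 0, n - 1);
        }
    }

The partition call (task X) finishes before tasks Y and Z are created, which enforces the dependence in the diagram above; the taskwait joins Y and Z before the parent returns, while the runtime is free to run them on different cores.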
Types of Parallelism

Data Parallelism (DP)
    In many applications a collection of data is transformed in such a way that the operations on each element are largely independent of the others.
    A typical scenario is when we apply the same instruction to a collection of data.
    Example: adding two arrays

            3 7 4 3 6 5 4
        +   4 1 5 6 2 4 3
        =   7 8 9 9 8 9 7

    The same operation (+) is applied to a collection of data.

Superscalar and OoO execution

Scalar processors can issue one instruction per cycle.
Superscalar processors can issue more than one instruction per cycle.
    A common feature in most modern processors.
    By replicating functional units and adding hardware to detect and track instruction dependencies, a superscalar processor takes advantage of Instruction Level Parallelism.
A related technique (applicable to both scalar and superscalar processors) is out-of-order (OoO) execution.
    Instructions are reordered (by hardware) for better utilization of the pipeline(s).
An excellent example of the use of extra transistors to speed up execution without programmer intervention.
    Naturally limited by the available ILP.
    Also severely limited by the hardware complexity of dependency checking.
    In practice: 2-way superscalar architectures are common; more than 4-way is unlikely.
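
Returning to the array-addition example above, a minimal data-parallel sketch in C (the function name add_arrays and the optional OpenMP directive are illustrative assumptions):

    /* Element-wise addition of two arrays: every iteration is independent. */
    void add_arrays(const int *a, const int *b, int *c, int n)
    {
        #pragma omp parallel for       /* optional: distribute iterations over cores */
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];        /* the same operation (+) on every element */
    }

Since every iteration touches different elements, the iterations can be distributed over cores and/or mapped onto SIMD units without any synchronization.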
Vector Processors

Vector processors refer to a previously common supercomputer architecture where a vector is a basic memory abstraction.
    Examples: Cray 1, IBM 3090/VF.
Vector: a 1D array of numbers.
Example: add two vectors
    Scalar solution:
        Loop through the vectors and add each scalar element
        Repeated address translations
        Branches
    Vector solution:
        Add the vectors via a vector addition instruction
        Address translations done once
        No branches
Independent operations enable:
    deeper pipelines,
    higher clock frequencies, and
    multiple concurrent functional units (e.g., SIMD units).

Vector processors are suitable for a certain range of applications.
Traditional, scalar processor designs have been successful over a larger spectrum of applications.
Economic realities favor large clusters of commodity processors or small-scale SMPs.
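
To make the scalar-versus-vector contrast above concrete, here is a hedged sketch of vector addition using x86 SSE intrinsics, one of the SIMD extensions discussed on the next slide (the 32-bit float element type and unaligned loads are assumptions):

    /* Sketch: the "add two vectors" example with x86 SSE intrinsics. */
    #include <xmmintrin.h>

    void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = 0;
        for (; i + 4 <= n; i += 4) {                  /* one instruction adds 4 floats */
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
        }
        for (; i < n; i++)                            /* scalar cleanup for the tail */
            c[i] = a[i] + b[i];
    }

A single _mm_add_ps adds four elements at once, so loop control and address arithmetic are amortized over each four-element packet, echoing the vector-processor advantages listed above.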
Single Instruction Multiple Data (SIMD)

Several functional units execute the same instruction on different data streams, simultaneously and synchronized.
A suitable architecture for many data parallel applications:
    Matrix computations
    Graphics processing
    Image analysis
    ...
Found primarily in common microprocessors and GPUs:
    SIMD instruction extensions such as MMX, SSE, AltiVec.

Shared Memory Multiprocessor

Multiprocessors where all processors share a single address space are commonly called Shared Memory Multiprocessors.
They can be classified based on the access time different processors have to different memory areas:
    Uniform Memory Access (UMA): each processor has the same access time.
    Non-Uniform Memory Access (NUMA): some memory is closer to a processor; access time is higher for distant memory.
Furthermore, their caches may or may not be coherent.
    CC-UMA: Cache-Coherent Uniform Memory Access
    Also known as a Symmetric MultiProcessor (SMP).
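
A minimal sketch of what a single shared address space means to the programmer, using POSIX threads (the shared counter, the lock, and the choice of 4 threads are illustrative assumptions):

    /* Threads on different processors access the very same variables. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                          /* lives in the shared address space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);                /* coordinate access to shared data */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);           /* 400000: all threads saw the same memory */
        return 0;
    }

All threads update the very same counter through ordinary loads and stores; on a NUMA machine the code is unchanged, only the access times differ.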
Bus-Based UMA MP

[Figure: four chips, each with a processor and a cache, connected over a shared bus to a single memory.]

NUMA MP

[Figure: processors with local memories, connected to each other by an interconnect; remote memory is reached over the interconnect.]
Multicore

When several processor cores are physically located in the same processor socket we refer to it as a multicore processor.
Both Intel and AMD now have quad-core (4 cores) processors in their product portfolios.
A new desktop computer today is definitely a multiprocessor.
Multicores usually have a single address space and are cache-coherent.
They are very similar to SMPs, but they
    typically share one or more levels of cache, and
    have more favorable inter-processor/core communication speed.

[Figure: a single chip with four cores, each with a private L1 cache, sharing an L2 cache and the connection to memory.]
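
As a small illustration of the claim that a new desktop computer is a multiprocessor, a program can query how many processors/cores the runtime sees; the sketch below uses omp_get_num_procs() and assumes an OpenMP-enabled compiler (e.g. -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Number of processors (cores) the OpenMP runtime can use. */
        printf("available processors: %d\n", omp_get_num_procs());
        return 0;
    }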
Multicore

Multicore chips have multiple benefits:
    Higher peak performance
    Power consumption control
        Some cores can be turned off.
    Production yield increase
        8-core chips with a defective core are sold with one core disabled.
...but also some potential drawbacks:
    Memory bandwidth per core is limited
        Physical limits such as the number of pins.
    Lower peak performance per thread
        Some inherently sequential applications may actually run slower.

Distributed Memory Multiprocessor

In contrast to a single address space, machines with multiple private memories are commonly called Distributed Memory Machines.
Data is exchanged between memories via messages communicated over a dedicated network.
When the processor/memory pairs are physically separated, such as on different boards or in different casings, such machines are called Clusters.
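
A hedged sketch of such message passing using MPI (MPI is not named on the slide, but it is the de facto standard for programming distributed memory machines; run with e.g. mpirun -np 2):

    /* Sketch: exchanging data between private memories with MPI messages. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);         /* which process/node am I? */

        if (rank == 0) {
            value = 42;                               /* exists only in node 0's memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node 1 received %d\n", value);    /* a copy now in node 1's memory */
        }

        MPI_Finalize();
        return 0;
    }

The variable value exists separately in each process's private memory; the only way for node 1 to see node 0's value is to have it copied in a message over the network.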
Cluster

[Figure: four nodes, each with its own processor, cache, and memory, connected by a network.]

Clusters: Past and Present

In the past, clusters were exclusively high-end machines with
    custom supercomputer processors, and
    custom high-performance interconnects.
These machines were very expensive and therefore limited to research and big corporations.
From the '90s onwards it has become increasingly common to build clusters from off-the-shelf components:
    commodity processors, and
    commodity interconnects (e.g. Ethernet with switches).
The economic benefits and programmer familiarity with commodity components far outweigh the performance issues.
This has helped "democratize" supercomputing: many corporations and universities have clusters today.
Networks
Access to memory on other nodes is very expensive.
    Data must be transferred over a relatively high-latency, low-bandwidth network.
    Algorithms with low data locality will suffer.
    High synchronization requirements will also degrade performance for the same reason.
The network design is a tradeoff between conflicting goals:
    Maximum bandwidth and low latency: full connectivity
    Low cost and power consumption: tree network
Switch-based networks are common today; other examples of topologies include rings, meshes, hypercubes, and trees.