III Semester
[CSE]
Computer Architecture and Organization (CST222)
UNIT-6
PROF. W.H. BISEN,
DEPARTMENT OF CSE,
SHRI RAMDEOBABA COLLEGE OF ENGINEERING &
MANAGEMENT, NAGPUR
Pipelining
CONTENTS: PIPELINING: BASIC CONCEPTS OF PIPELINING,
THROUGHPUT AND SPEEDUP, INTRODUCTION OF PARALLEL
COMPUTING: SISD, MISD, SIMD, MIMD
Overview
Pipelining is widely used in modern processors.
Pipelining improves system performance in terms of throughput.
Pipelined organization requires sophisticated compilation techniques.
Basic Concepts
Making the Execution of Programs Faster
Use faster circuit technology to build the processor and the main
memory.
Arrange the hardware so that more than one operation can be
performed at the same time.
With the latter approach, the number of operations performed per second is
increased, even though the elapsed time needed to perform any one
operation is not changed.
Traditional Concept
Laundry Example
Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes
to wash, dry, and fold.
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
Traditional Concept
[Figure: sequential laundry timeline from 6 PM to midnight; each load takes 30 + 40 + 20 minutes, one load after another]
Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?
Traditional Pipeline Concept
[Figure: pipelined laundry timeline from 6 PM onward; after the first 30-minute wash, a new load finishes every 40 minutes, limited by the dryer]
Pipelined laundry takes 3.5 hours for 4 loads.
Traditional Pipeline Concept
Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload.
Pipeline rate is limited by the slowest pipeline stage.
Multiple tasks operate simultaneously using different resources.
Potential speedup = number of pipe stages.
Unbalanced lengths of pipe stages reduce speedup.
Time to “fill” the pipeline and time to “drain” it reduces speedup.
Stall for Dependences
Role of Cache Memory
Each pipeline stage is expected to complete in one clock
cycle.
The clock period should be long enough to let the slowest
pipeline stage complete.
Faster stages can only wait for the slowest one to
complete.
Since main memory is very slow compared to the
execution units, if every instruction had to be fetched from
main memory, the pipeline would be almost useless.
Fortunately, we have cache.
Pipeline Performance
The potential increase in performance resulting from pipelining is
proportional to the number of pipeline stages.
However, this increase would be achieved only if all pipeline stages
require the same time to complete, and there is no interruption
throughout program execution.
Unfortunately, this is not true.
Pipeline Performance
A pipeline that is forced to wait in this way is said to have been stalled, for example for two clock cycles.
Any condition that causes a pipeline to stall is called a hazard.
Data hazard – any condition in which either the source or the
destination operands of an instruction are not available at the time
expected in the pipeline. So some operation has to be delayed, and the
pipeline stalls.
Instruction (control) hazard – a delay in the availability of an instruction
causes the pipeline to stall.
Structural hazard – the situation when two instructions require the use
of a given hardware resource at the same time.
Pipeline Performance
Again, pipelining does not result in individual instructions
being executed faster; rather, it is the throughput that
increases.
Throughput is measured by the rate at which instruction
execution is completed.
Pipeline stall causes degradation in pipeline performance.
We need to identify all hazards that may cause the
pipeline to stall and to find ways to minimize their impact.
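The standard model behind the examples that follow: with k stages of cycle time Tp, the first task takes k cycles to fill the pipeline, and each subsequent task completes one cycle later, so n tasks take (k + n − 1)·Tp, versus n·Tn without pipelining. A small sketch (function names are illustrative):

```python
# Sketch of the ideal k-stage pipeline timing model (no stalls).
def pipeline_time(n_tasks, stages, cycle_time):
    # First task fills the pipeline (stages cycles); the remaining
    # n - 1 tasks then complete one per cycle.
    return (stages + n_tasks - 1) * cycle_time

def speedup(n_tasks, non_pipeline_time, stages, cycle_time):
    # Ratio of sequential time (n * Tn) to pipelined time ((k + n - 1) * Tp).
    return (n_tasks * non_pipeline_time) / pipeline_time(n_tasks, stages, cycle_time)

# As n grows, the speedup approaches non_pipeline_time / cycle_time.
print(round(speedup(10**6, 280, 4, 100), 2))  # approaches 280 / 100 = 2.8
```

Note that for small n the fill time matters, which is why the worked examples below get speedups below this limit.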
Example-1
Non-pipelined execution of one task takes 40 ns. A 4-stage pipeline has a
cycle time of 5 ns. What is the speed-up ratio of the pipeline for 100 tasks?
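Working the numbers (a quick check, using speedup = n·Tn / ((k + n − 1)·Tp) with Tn = 40 ns, k = 4 stages, Tp = 5 ns, n = 100 tasks):

```python
n, k = 100, 4           # tasks, pipeline stages
t_non, t_cyc = 40, 5    # ns: non-pipelined time per task, pipeline cycle time
sequential = n * t_non            # 4000 ns without the pipeline
pipelined = (k + n - 1) * t_cyc   # 515 ns with the pipeline
print(round(sequential / pipelined, 2))  # ≈ 7.77
```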
Example-2
Consider a pipeline having 4 phases with duration 60, 50, 90 and 80 ns.
Given latch delay is 10 ns. Calculate-
Pipeline cycle time
Non-pipeline execution time
Speed up ratio
Pipeline time for 1000 tasks
Sequential time for 1000 tasks
Throughput
Example-2
Pipeline Cycle Time-
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 60, 50, 90, 80 } + 10 ns
= 90 ns + 10 ns
= 100 ns
Non-Pipeline Execution Time-
Non-pipeline execution time for one instruction
= 60 ns + 50 ns + 90 ns + 80 ns
= 280 ns
Example-2
Speed Up Ratio-
Speed up
= Non-pipeline execution time / Pipeline execution time
= 280 ns / Cycle time
= 280 ns / 100 ns
= 2.8
Pipeline Time For 1000 Tasks-
Pipeline time for 1000 tasks
= Time taken for 1st task + Time taken for remaining 999 tasks
= 1 x 4 clock cycles + 999 x 1 clock cycle
= 4 x cycle time + 999 x cycle time
= 4 x 100 ns + 999 x 100 ns
= 400 ns + 99900 ns
= 100300 ns
Example-2
Sequential Time For 1000 Tasks-
Non-pipeline time for 1000 tasks
= 1000 x Time taken for one task
= 1000 x 280 ns
= 280000 ns
Throughput-
Throughput for pipelined execution
= Number of instructions executed per unit time
= 1000 tasks / 100300 ns
≈ 9.97 × 10⁶ instructions per second
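The arithmetic for Example-2 can be checked in a few lines (a sketch; all times in nanoseconds):

```python
stage_delays = [60, 50, 90, 80]          # ns
latch_delay = 10                         # ns
cycle = max(stage_delays) + latch_delay  # 100 ns
non_pipe = sum(stage_delays)             # 280 ns for one task
pipe_1000 = (len(stage_delays) + 1000 - 1) * cycle  # 100300 ns
seq_1000 = 1000 * non_pipe                          # 280000 ns
throughput = 1000 / (pipe_1000 * 1e-9)   # tasks per second
print(cycle, non_pipe, pipe_1000, seq_1000)
print(round(throughput / 1e6, 2), "million tasks/second")  # ≈ 9.97
```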
Example-3
A four stage pipeline has the stage delays as 150, 120, 160 and 140 ns respectively.
Registers are used between the stages and have a delay of 5 ns each. Assuming
constant clocking rate, the total time taken to process 1000 data items on the
pipeline will be?
Cycle Time-
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 150, 120, 160, 140 } + 5 ns
= 160 ns + 5 ns
= 165 ns
Pipeline Time To Process 1000 Data Items-
Pipeline time to process 1000 data items
= Time taken for 1st data item + Time taken for remaining 999 data items
= 1 x 4 clock cycles + 999 x 1 clock cycle
= 4 x cycle time + 999 x cycle time
= 4 x 165 ns + 999 x 165 ns
= 660 ns + 164835 ns
= 165495 ns
= 165.5 μs
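The same pattern verifies Example-3 (a sketch; times in nanoseconds):

```python
stage_delays = [150, 120, 160, 140]   # ns
register_delay = 5                    # ns
cycle = max(stage_delays) + register_delay      # 165 ns
total = (len(stage_delays) + 1000 - 1) * cycle  # (4 + 999) x 165 ns
print(cycle, total)  # 165 165495 (about 165.5 microseconds)
```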
Example-4
Consider a non-pipelined processor with a clock rate of 2.5 gigahertz and
average cycles per instruction of 4. The same processor is upgraded to a
pipelined processor with five stages but due to the internal pipeline
delay, the clock speed is reduced to 2 gigahertz. Assume there are no
stalls in the pipeline. The speed up achieved in this pipelined processor
is?
Cycle Time in Non-Pipelined Processor-
Frequency of the clock = 2.5 gigahertz
Cycle time
= 1 / frequency
= 1 / (2.5 gigahertz)
= 1 / (2.5 x 10⁹ hertz)
= 0.4 ns
Non-Pipeline Execution Time-
Non-pipeline execution time to process 1 instruction
= Number of clock cycles taken to execute one instruction
= 4 clock cycles
= 4 x 0.4 ns
= 1.6 ns
Example-4
Cycle Time in Pipelined Processor-
Frequency of the clock = 2 gigahertz
Cycle time
= 1 / frequency
= 1 / (2 gigahertz)
= 1 / (2 x 10⁹ hertz)
= 0.5 ns
Pipeline Execution Time-
Since there are no stalls in the pipeline, ideally one instruction is executed per clock cycle. So,
Pipeline execution time
= 1 clock cycle
= 0.5 ns
Example-4
Speed Up-
Speed up
= Non-pipeline execution time / Pipeline
execution time
= 1.6 ns / 0.5 ns
= 3.2
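Checking Example-4 (a sketch; clock rates in hertz, so the per-instruction times come out in seconds):

```python
cpi_non, f_non = 4, 2.5e9    # non-pipelined: 4 cycles/instruction at 2.5 GHz
cpi_pipe, f_pipe = 1, 2.0e9  # pipelined, no stalls: 1 cycle/instruction at 2 GHz
t_non = cpi_non / f_non      # 1.6e-9 s = 1.6 ns per instruction
t_pipe = cpi_pipe / f_pipe   # 0.5e-9 s = 0.5 ns per instruction
print(round(t_non / t_pipe, 2))  # 3.2
```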
Example-5
Consider an instruction pipeline with four stages (S1, S2, S3 and S4),
each with combinational circuit only. The pipeline registers are required
between each stage and at the end of the last stage. Delays for the
stages and for the pipeline registers are as given in the figure.
What is the approximate speed up of the pipeline in steady state under ideal
conditions when compared to the corresponding non-pipeline implementation?
PARALLEL COMPUTING
Parallel computing is computing in which a job is broken into
discrete parts that can be executed concurrently.
Each part is further broken down into a series of instructions.
Instructions from each part execute simultaneously on different CPUs.
It is the simultaneous use of multiple computer resources, which can include a
single computer with multiple processors, a number of computers
connected by a network to form a parallel processing cluster, or a
combination of both.
The objective of parallel processing is execution speed improvement:
● The manner is by doing several things at once.
● The instruments are
– Architecture,
– Languages,
– Algorithms.
Classification/Flynn’s classification
Based on the number of instruction and data streams that can be processed
simultaneously, computing systems are classified into four major categories:
Flynn’s Taxonomy
In 1966, Flynn proposed a taxonomy based on Instruction Streams
and Data Streams:
◦ SISD: Single Instruction Stream, Single Data Stream – this is a conventional
sequential computer
◦ SIMD: Single Instruction Stream, Multiple Data Stream – a single control
unit dispatches instructions to multiple processing units
◦ MISD: Multiple Instruction Stream, Single Data Stream – no architectures in
this group (although some people include pipelined architectures)
◦ MIMD: Multiple Instruction Stream, Multiple Data Stream – each
processing unit is able to execute instructions from a different control unit
Single-instruction, single-data (SISD) systems
An SISD computing system is a uniprocessor machine which is capable
of executing a single instruction, operating on a single data stream.
Machine instructions are processed in a sequential manner, and
computers adopting this model are popularly called sequential
computers.
All the instructions and data to be processed have to be stored in
primary memory.
Advantages of SISD
•It requires less power.
•There is no issue of a complex communication protocol between multiple cores.
Disadvantages of SISD
•The speed of SISD architecture is limited, just like single-core processors.
•It is not suitable for larger applications.
Single-instruction, multiple-data (SIMD) systems
An SIMD system is a multiprocessor machine capable of executing the same
instruction on all the CPUs but operating on different data streams.
Machines based on an SIMD model are well suited to scientific computing
since they involve lots of vector and matrix operations.
Advantages of SIMD
•The same operation on multiple elements can be performed using one instruction only.
•Throughput of the system can be increased by increasing the number of cores of
the processor.
•Processing speed is higher than SISD architecture.
Disadvantages of SIMD
•There is complex communication between the cores of the processor.
•The cost is higher than SISD architecture.
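A toy illustration of the SISD/SIMD distinction (plain Python, purely conceptual — real SIMD hardware uses vector registers and a single vector instruction, not library calls):

```python
data_a = [1, 2, 3, 4]
data_b = [10, 20, 30, 40]

# SISD style: one instruction stream walks the data one element at a time.
sisd = []
for a, b in zip(data_a, data_b):
    sisd.append(a + b)

# SIMD style: a single "add" operation is conceptually applied to whole
# data streams at once; all four additions happen under one instruction.
simd = list(map(lambda ab: ab[0] + ab[1], zip(data_a, data_b)))

print(sisd == simd)  # both compute [11, 22, 33, 44]
```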
Multiple-instruction, single-data (MISD) systems
An MISD computing system is a multiprocessor machine capable of executing
different instructions on different processors, but all of them operating on the same
data set.
The system performs different operations on the same data set. Machines built using
the MISD model are not useful in most applications; a few machines have been built,
but none of them are available commercially.
Multiple-instruction, multiple-data (MIMD) systems
An MIMD system is a multiprocessor machine which is capable of executing
multiple instructions on multiple data sets.
The MIMD model has separate instruction and data streams; therefore, machines
built using this model are suited to any kind of application.
MIMD machines are broadly categorized into
shared-memory MIMD and distributed-memory MIMD.
In the shared-memory MIMD model, all the processing elements (PEs) are connected
to a single global memory and they all have access to it. The communication between
PEs in this model takes place through the shared memory; a modification of the data
stored in the global memory by one PE is visible to all other PEs.
In distributed-memory MIMD machines, all PEs have a local memory. The
communication between PEs in this model takes place through the interconnection
network (the inter-process communication channel, or IPC).
HAPPY LEARNING