
Chapter4 2

The document discusses the architecture and implementation of a pipelined processor design, detailing the roles of various units such as the ALU, data memory, and registers. It covers the execution of different instruction types (R-type, load/store, branch) and the impact of pipelining on performance, including potential hazards like structural, data, and control hazards. Additionally, it explores the benefits of bypassing to reduce stalls and improve throughput in a multi-stage pipeline system.

Uploaded by

Namit Jain


View from 30,000 Feet

Note: we haven’t bothered showing multiplexors

Source: H&P textbook

• What is the role of the Add units?
• Explain the inputs to the data memory unit
• Explain the inputs to the ALU
• Explain the inputs to the register unit
Clocking Methodology

Source: H&P textbook

• Which of the above units need a clock?
• What is being saved (latched) on the rising edge of the clock?
Keep in mind that the latched value remains there for an entire cycle
Implementing R-type Instructions

• Instructions of the form add $t1, $t2, $t3

• Explain the role of each signal

Source: H&P textbook

Implementing Loads/Stores

• Instructions of the form lw $t1, 8($t2) and sw $t1, 8($t2)

Where does this input come from?

Source: H&P textbook
Implementing Branch Instructions

• Instructions of the form beq $t1, $t2, offset

Source: H&P textbook

View from 10,000 Feet

Source: H&P textbook
View from 5,000 Feet

Source: H&P textbook
Latches and Clocks in a Single-Cycle Design

[Figure: PC → Instr Mem → Reg File → ALU → Data Memory (Addr)]

• The entire instruction executes in a single cycle
• Green blocks are latches
• At the rising edge, a new PC is recorded
• At the rising edge, the result of the previous cycle is recorded
• At the falling edge, the address of LW/SW is recorded so we can access the data memory in the 2nd half of the cycle
Multi-Stage Circuit

Instead of executing the entire instruction in a single cycle (a single stage), let’s break up the execution into multiple stages, each separated by a latch

[Figure: PC → Instr Mem → L2 → Reg File → L3 → ALU → L4 → Data Memory → L5 → Reg File]
The Assembly Line
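Unpipelined: start and finish a job before moving to the next

Pipelined: break the job into smaller stages (A, B, C), so that successive jobs overlap in time

[Figure: jobs vs. time, with each job split into stages A, B, C and each new job starting one stage after the previous one]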
Performance Improvements?

• Does it take longer to finish each individual job?

• Does it take shorter to finish a series of jobs?

• What assumptions were made while answering these questions?

• Is a 10-stage pipeline better than a 5-stage pipeline?
A 5-Stage Pipeline

Register write happens in the first half of the clock cycle (dotted part); register read happens in the second half (solid part).

As a result, in CC5, I4 can read the register written by I1. Otherwise, I5 would be the first instruction able to read it.
Source: H&P textbook
A 5-Stage Pipeline

Use the PC to access the I-cache and increment PC by 4

All instructions go through all stages.

For example, an add instruction does not need DM (data memory), but it still takes 5 clock cycles: it passes through the DM stage without using it.
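The point that every instruction occupies every stage can be sketched as a small diagram builder. This is only an illustrative sketch: the stage names follow the abbreviations used in these slides, while `pipeline_diagram` and its output layout are hypothetical helpers, and hazards are ignored.

```python
# Stage names as abbreviated in these slides.
STAGES = ["IF", "D/R", "ALU", "DM", "RW"]

def pipeline_diagram(names):
    """Return one (name, cells) row per instruction for an ideal pipeline.

    cells[c] is the stage occupied in cycle c+1, or "" if the instruction
    is not in the pipeline. Assumes no stalls (no hazards).
    """
    total_cycles = len(names) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(names):
        cells = [""] * total_cycles
        # Instruction i enters IF in cycle i+1 and then visits every stage,
        # even a stage (such as DM for an add) in which it does no work.
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage
        rows.append((name, cells))
    return rows

if __name__ == "__main__":
    for name, cells in pipeline_diagram(["add", "lw", "sw"]):
        print(f"{name:4}", " ".join(f"{c:4}" for c in cells))
```

Each row has the same five stages, shifted by one cycle per instruction, which is why no instruction can overtake an earlier one.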
A 5-Stage Pipeline

Pipelining improves throughput: ideally, throughput becomes 5 times higher with 5 stages.

Read registers, compare registers, compute branch target; for now, assume branches take 2 cycles (there is enough work that branches can easily take more)

Branches don't work well with pipelines.
A 5-Stage Pipeline

ALU computation, effective address computation for load/store

A 5-Stage Pipeline

Memory access to/from data cache, stores finish in 4 cycles

A 5-Stage Pipeline

Write result of ALU computation or load into register file

Because each cycle is split into a write half (dotted) and a read half (solid), the value written by I1 can be read by I4 in the same cycle. Otherwise, I5 would be the first instruction able to read what I1 wrote.
Pipeline Summary

Note: no stage is ever skipped, so a faster instruction can never overtake a slower one. Every instruction still takes 5 cycles, even if its DM stage does nothing. The IM (fetch) stage is not shown; only the last 4 stages appear below.

                   RR                          ALU     DM         RW
ADD R1, R2 → R3    Rd R1,R2                    R1+R2   --         Wr R3
BEQ R1, R2, 100    Rd R1,R2; compare, set PC   --      --         --
LD 8[R3] → R6      Rd R3                       R3+8    Get data   Wr R6
ST 8[R3] ← R6      Rd R3,R6                    R3+8    Wr data    --

In the LD and ST rows, R3 holds the base address.
Performance Improvements?

• Does it take longer to finish each individual job?

Yes, possibly, due to additional latch delays.

• Does it take shorter to finish a series of jobs?

• What assumptions were made while answering these questions?
– No dependences between instructions
– Easy to partition circuits into uniform pipeline stages
– No latch overhead

• Is a 10-stage pipeline better than a 5-stage pipeline?
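One way to explore the last question is to drop the "no latch overhead" assumption. The sketch below assumes a fixed total logic delay split evenly across stages, plus a fixed per-stage latch delay. The function names and the numbers (10 ns of logic, 0.5 ns per latch) are invented for illustration and do not come from the slides.

```python
def cycle_time_ns(total_logic_ns, stages, latch_ns):
    """Cycle time when logic is split evenly and each stage adds one latch delay."""
    return total_logic_ns / stages + latch_ns

def instruction_latency_ns(total_logic_ns, stages, latch_ns):
    """Time for one instruction to traverse all stages."""
    return stages * cycle_time_ns(total_logic_ns, stages, latch_ns)

if __name__ == "__main__":
    for k in (1, 5, 10):
        cyc = cycle_time_ns(10.0, k, 0.5)
        lat = instruction_latency_ns(10.0, k, 0.5)
        print(f"{k:2} stages: cycle = {cyc} ns, latency = {lat} ns")
```

Under these assumed numbers, going from 5 to 10 stages shortens the cycle from 2.5 ns to 1.5 ns, a 1.67x gain rather than 2x, while the latency of each instruction grows. Deeper pipelines also tend to expose more stall cycles per hazard, which this sketch does not model.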

Quantitative Effects

• As a result of pipelining:
  – Time in ns per instruction goes up
  – Each instruction takes more cycles to execute
  – But… average CPI remains roughly the same
  – Clock speed goes up (ideally 5 times for a 5-stage pipeline)
  – Total execution time goes down, resulting in lower average time per instruction
  – Under ideal conditions, speedup
    = ratio of elapsed times between successive instruction completions
    = number of pipeline stages = increase in clock speed
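The ideal-case claim can be checked with a little arithmetic. The sketch below assumes no hazards, no latch overhead, and perfectly balanced stages; the function names are hypothetical. With k stages, the first instruction completes after k cycles and each later one completes one cycle after its predecessor.

```python
def unpipelined_time_ns(n_instr, instr_latency_ns):
    """Each instruction starts only after the previous one finishes."""
    return n_instr * instr_latency_ns

def pipelined_time_ns(n_instr, instr_latency_ns, stages):
    """Ideal pipeline: the cycle time is the latency divided by the stage count."""
    cycle_ns = instr_latency_ns / stages
    # k cycles to fill the pipeline, then one completion per cycle.
    return (stages + n_instr - 1) * cycle_ns

def speedup(n_instr, instr_latency_ns, stages):
    return (unpipelined_time_ns(n_instr, instr_latency_ns)
            / pipelined_time_ns(n_instr, instr_latency_ns, stages))

if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, round(speedup(n, 10.0, 5), 3))
```

For a single instruction there is no gain, and the speedup approaches the number of stages only as the instruction count grows.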

Conflicts/Problems

• I-cache and D-cache are accessed in the same cycle – it helps to implement them separately. Since IM and DM accesses can fall in the same clock cycle, separate hardware is needed for the two.

• Registers are read and written in the same cycle – easy to deal with if register read/write time equals cycle time/2

• Branch target changes only at the end of the second stage -- what do you do in the meantime?
Hazards

• Structural hazards: different instructions in different stages (or the same stage) conflicting for the same resource
  – E.g., a read and a write of the register file in the same clock cycle. Solution: split the cycle into two halves.
  – E.g., IM and DM accesses in the same clock cycle. Solution: keep separate IM and DM.

• Data hazards: an instruction cannot continue because it needs a value that has not yet been generated by an earlier instruction. These are dependences: an instruction may need the output of an instruction that has not yet completed its 5 stages.

• Control hazard: fetch cannot continue because it does not know the outcome of an earlier branch – special case of a data hazard – separate category because they are treated in different ways
Structural Hazards

• Example: a unified instruction and data cache → stage 4 (MEM) and stage 1 (IF) can never coincide

• The later instruction and all its successors are delayed until a cycle is found when the resource is free → these are pipeline bubbles

• Structural hazards are easy to eliminate – increase the number of resources (for example, implement a separate instruction and data cache, add more register ports)
Data Hazards

• An instruction produces a value in a given pipeline stage

• A subsequent instruction consumes that value in a pipeline stage

• The consumer may have to be delayed so that the time of consumption is later than the time of production
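The producer/consumer relationship described above is a read-after-write dependence, and finding such pairs in a short instruction sequence can be sketched as follows. The `Instr` record and `raw_dependences` function are hypothetical names introduced only for this illustration.

```python
from typing import List, NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    name: str
    dest: Optional[str]        # register written, or None (e.g. a store or branch)
    srcs: Tuple[str, ...]      # registers read

def raw_dependences(program: List[Instr]) -> List[Tuple[int, int, str]]:
    """Return (producer_index, consumer_index, register) for each source
    operand whose most recent writer is an earlier instruction in the list."""
    last_writer = {}
    deps = []
    for i, ins in enumerate(program):
        for reg in ins.srcs:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        if ins.dest is not None:
            last_writer[ins.dest] = i
    return deps

if __name__ == "__main__":
    program = [
        Instr("I1", "R3", ("R1", "R2")),   # R1 + R2 -> R3
        Instr("I2", "R5", ("R3", "R4")),   # R3 + R4 -> R5
        Instr("I3", "R9", ("R7", "R8")),   # R7 + R8 -> R9
    ]
    print(raw_dependences(program))
```

Whether a detected dependence actually costs stall cycles depends on how far apart the two instructions are and on whether bypassing is available.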
Example 1 – No Bypassing

Note: I1 and I2 have a data hazard (I2 reads R3, which I1 writes).

• Show the instruction occupying each stage in each cycle (no bypassing) if I1 is R1+R2→R3 and I2 is R3+R4→R5 and I3 is R7+R8→R9

[Empty grid to fill in: rows IF, D/R, ALU, DM, RW; columns CYC-1 through CYC-8]
Example 1 – No Bypassing (solution)

• Show the instruction occupying each stage in each cycle (no bypassing) if I1 is R1+R2→R3 and I2 is R3+R4→R5 and I3 is R7+R8→R9

       CYC-1  CYC-2  CYC-3  CYC-4  CYC-5  CYC-6  CYC-7  CYC-8
IF     I1     I2     I3     I3     I3     I4     I5
D/R           I1     I2     I2     I2     I3     I4
ALU                  I1     --     --     I2     I3
DM                          I1     --     --     I2     I3
RW                                 I1     --     --     I2

(-- marks a bubble; L2, L3, L4, L5 are the latches between successive stages.)

Notes:
– I3 stays in IF during CYC-3 to CYC-5, waiting for I2 to proceed.
– I2 finally completes D/R in the second half of CYC-5, after I1 writes R3 in the first half.
– The empty DM slot in CYC-5 is a bubble.
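The no-bypassing schedule can be reproduced with a short simulation. This is a sketch under the assumptions used in these slides: in-order execution, one instruction per stage, register write in the first half of RW and register read in the second half of D/R. The function name and the data layout are hypothetical.

```python
def schedule_no_bypass(program):
    """program: list of (dest, (src, ...)) in program order.

    Returns one dict per instruction giving the cycles it spends in each
    stage; IF and D/R are (first_cycle, last_cycle) ranges because an
    instruction can be held there while it waits.
    """
    last_rw = {}   # register -> cycle in which its latest writer is in RW
    prev = None
    schedule = []
    for dest, srcs in program:
        if prev is None:
            if_start, dr_start = 1, 2
        else:
            if_start = prev["D/R"][0]   # IF frees up when the previous instr enters D/R
            dr_start = prev["ALU"]      # D/R frees up when the previous instr enters ALU
        # The read happens in the second half of the last D/R cycle, so it
        # may share a cycle with the producer's RW (write in the first half).
        dr_end = max([dr_start] + [last_rw[s] for s in srcs if s in last_rw])
        entry = {
            "IF": (if_start, dr_start - 1),
            "D/R": (dr_start, dr_end),
            "ALU": dr_end + 1,
            "DM": dr_end + 2,
            "RW": dr_end + 3,
        }
        if dest is not None:
            last_rw[dest] = entry["RW"]
        schedule.append(entry)
        prev = entry
    return schedule

if __name__ == "__main__":
    example1 = [("R3", ("R1", "R2")), ("R5", ("R3", "R4")), ("R9", ("R7", "R8"))]
    for i, row in enumerate(schedule_no_bypass(example1), start=1):
        print(f"I{i}", row)
```

For Example 1 this places I2 in D/R for cycles 3 to 5 and holds I3 in IF for the same cycles, matching the filled-in grid.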
Example 2 – Bypassing

• Show the instruction occupying each stage in each cycle (with bypassing) if I1 is R1+R2→R3 and I2 is R3+R4→R5 and I3 is R3+R8→R9. Identify the input latch for each input operand.

[Empty grid to fill in: rows IF, D/R, ALU, DM, RW; columns CYC-1 through CYC-8]
Example 2 – Bypassing (solution)

• Show the instruction occupying each stage in each cycle (with bypassing) if I1 is R1+R2→R3 and I2 is R3+R4→R5 and I3 is R3+R8→R9. Identify the input latch for each input operand.

       CYC-1  CYC-2  CYC-3  CYC-4  CYC-5  CYC-6  CYC-7  CYC-8
IF     I1     I2     I3     I4     I5
D/R           I1     I2     I3     I4
ALU                  I1     I2     I3
DM                          I1     I2     I3
RW                                 I1     I2     I3

ALU input latches (first operand, second operand):
I1: L3, L3
I2: L4, L3
I3: L5, L3

Notes:
– I1's result is latched in L4 at the end of CYC-3, so I2 can use it directly in CYC-4 without waiting for I1 to finish all 5 stages.
– I3 reads R3 from L5 rather than L4 because, by the end of CYC-4, the ALU has overwritten L4 with I2's result, and I1's result has moved on to L5.
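The latch chosen for each ALU operand follows from how far back its producer is. The sketch below assumes the 5-stage pipeline of these slides, that every producer is an ALU instruction (no loads), and that no stalls occur; `alu_input_latches` is a hypothetical helper name.

```python
def alu_input_latches(program):
    """program: list of (dest, (src, ...)) in program order.

    Returns, for each instruction, the latch that feeds each ALU operand:
    L4 if the producer is the immediately preceding instruction,
    L5 if it is two instructions back, and L3 (the normal register-read
    path) otherwise.
    """
    last_writer = {}
    result = []
    for i, (dest, srcs) in enumerate(program):
        latches = []
        for reg in srcs:
            distance = i - last_writer[reg] if reg in last_writer else None
            if distance == 1:
                latches.append("L4")   # producer's result was just computed by the ALU
            elif distance == 2:
                latches.append("L5")   # producer's result has moved one latch further
            else:
                latches.append("L3")   # value was read from the register file in D/R
        result.append(tuple(latches))
        if dest is not None:
            last_writer[dest] = i
    return result

if __name__ == "__main__":
    example2 = [("R3", ("R1", "R2")), ("R5", ("R3", "R4")), ("R9", ("R3", "R8"))]
    print(alu_input_latches(example2))
```

A producer three instructions back needs no bypass in this design, because its register write (first half of the cycle) precedes the consumer's register read (second half of the same cycle).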
Problem 1

add $1, $2, $3
lw $4, 8($1)

            CYC-1  CYC-2  CYC-3  CYC-4  CYC-5  CYC-6
i1 (add)    IF     D/R    ALU    DM     RW
i2 (lw)            IF     D/R    ALU    DM     RW

i1's ALU takes its inputs from L3; i2's ALU takes $1 from L4, so no stall is needed.
Problem 2

lw $1, 8($2)
lw $4, 8($1)

            CYC-1  CYC-2  CYC-3  CYC-4  CYC-5  CYC-6  CYC-7
i1 (lw)     IF     D/R    ALU    DM     RW
i2 (lw)            IF     D/R    D/R    ALU    DM     RW

(L2, L3, L4, L5 are the latches between successive stages.)

i1's DM is in CYC-4, so i2 is delayed by one cycle (still faster than without bypassing).
Problem 3

lw $1, 8($2)
sw $1, 8($3)

            CYC-1  CYC-2  CYC-3  CYC-4  CYC-5  CYC-6
i1 (lw)     IF     D/R    ALU    DM     RW
i2 (sw)            IF     D/R    ALU    DM     RW

Notes:
1) The store data is read from L5.
2) The write happens in the first half of the cycle, so DM can access the written value in the second half.
Problem 4

A 7 or 9 stage pipeline, RR and RW take an entire stage

7 stages:  IF IF Dec Dec RR ALU RW
9 stages:  IF IF Dec Dec RR ALU DM DM RW

(IF = instruction fetch, Dec = decode)

lw $1, 8($2)
add $4, $1, $3

Problem 4 (solution)

Without bypassing: 4 stalls

lw $1, 8($2)      IF:IF:DE:DE:RR:AL:DM:DM:RW
add $4, $1, $3       IF:IF:DE:DE:DE:DE:DE:DE:RR:AL:RW

With bypassing: 2 stalls

lw $1, 8($2)      IF:IF:DE:DE:RR:AL:DM:DM:RW
add $4, $1, $3       IF:IF:DE:DE:DE:DE:RR:AL:RW
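The stall counts can be derived mechanically from the two stage sequences. The sketch below assumes the consumer issues one cycle after the producer, that a value becomes usable in the cycle after the stage that produces it, and that RR and RW each take a full stage (so there is no half-cycle write/read trick). The `stalls` function and its parameters are hypothetical names.

```python
LW = ["IF", "IF", "DE", "DE", "RR", "AL", "DM", "DM", "RW"]   # 9-stage path
ADD = ["IF", "IF", "DE", "DE", "RR", "AL", "RW"]              # 7-stage path

def stalls(producer, consumer, produce_stage, consume_stage, gap=1):
    """Stall cycles the consumer needs when it issues `gap` cycles later.

    produce_stage: stage of the producer at whose end the value is available
                   (its LAST occurrence is used, e.g. the second DM).
    consume_stage: stage of the consumer that needs the value
                   (its FIRST occurrence is used).
    """
    last_idx = len(producer) - 1 - producer[::-1].index(produce_stage)
    ready_cycle = 1 + last_idx                          # producer starts in cycle 1
    wanted_cycle = 1 + gap + consumer.index(consume_stage)
    return max(0, ready_cycle + 1 - wanted_cycle)

if __name__ == "__main__":
    # Without bypassing: the add's RR must follow the lw's RW.
    print("no bypassing:", stalls(LW, ADD, "RW", "RR"))
    # With bypassing: the add's ALU must follow the lw's last DM stage.
    print("bypassing:   ", stalls(LW, ADD, "DM", "AL"))
```

The same function gives 1 stall for Problem 2 and 0 for Problem 1 when applied to the 5-stage sequence with bypassing, which agrees with the diagrams for those problems.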
