Processor Structure and Reduced Instruction Set
MODULE 5
Module 5: Processor Structure and Reduced Instruction Set
• Processor organization
• Register organization
• Instruction cycle
• Instruction pipelining
• Processor Organization for Pipelining
• Instruction Execution Characteristics
• The Use of a Large Register File
• Compiler-Based Register Optimization
• Reduced Instruction Set Architecture
Processor Functions
• A processor must perform several functions:
• Fetch instruction: Reads instructions from memory.
• Interpret instruction: Decodes the instruction to
determine the operation.
• Fetch data: Reads data from memory or I/O devices.
• Process data: Performs arithmetic or logical operations.
• Write data: Stores results in memory or sends them to an
I/O device.
• To achieve these tasks efficiently, the processor contains
registers, an ALU (Arithmetic and Logic Unit), and a Control
Unit (CU).
Register Organization
• Registers act as high-speed memory within the processor. They can be
categorized into:
1. User-Visible Registers:
• Used in assembly-level programming to minimize memory access.
• Types:
• General-purpose registers: Store operands for any operation.
• Data registers: Hold integer or floating-point values.
• Address registers: Hold memory addresses used by instructions (e.g., index or segment registers).
• Condition code registers: Store flags like zero, carry, overflow, etc.
Register Organization
2. Control and Status Registers:
• Used by the processor and OS to manage execution.
• Common registers:
• Program Counter (PC): Stores the address of the next instruction.
• Instruction Register (IR): Holds the fetched instruction.
• Memory Address Register (MAR): Holds the address of data in memory.
• Memory Buffer Register (MBR): Temporarily stores data read from or written to memory.
• Program Status Word (PSW): Holds condition codes, interrupt enable/disable bits, and execution mode.
Instruction Cycle
The instruction cycle consists of fetch, decode, execute, and interrupt handling stages:
1. Fetch: Reads the instruction from memory into the Instruction Register (IR).
2. Decode: Determines the operation and required operands.
3. Execute: Performs the operation using the ALU or registers.
4. Interrupt Handling (if needed): Saves the current state and executes the interrupt
service routine.
• If indirect addressing is used, an indirect cycle fetches the actual operand address.
• The cycle repeats for each instruction in the program.
Data Flow in the Processor
Data moves between the PC, MAR, MBR, IR, and ALU during execution:
1. Fetch Cycle:
• PC → MAR → Address Bus
• Memory → MBR → IR (Instruction is fetched)
• PC is incremented for the next instruction
2. Indirect Cycle (if needed):
• MBR (holds the indirect address) → MAR → memory read
• Memory → MBR (MBR now holds the actual operand address)
Data Flow in the Processor
Data moves between the PC, MAR, MBR, IR, and ALU during execution:
3. Execute Cycle:
• ALU operates on data from registers or memory.
• The result is stored in a register or memory location.
4. Interrupt Cycle:
• Current PC is saved to memory; PC is then loaded with the address of the interrupt service routine
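• A toy Python sketch of the fetch-cycle data flow (step 1 above) through PC, MAR, MBR, and IR; the dictionary used as memory and the string instruction encoding are invented for illustration:

    memory = {0: "LOAD R1, 100", 1: "ADD R1, R2", 2: "HALT"}  # toy program

    class DataPath:
        def __init__(self):
            self.PC, self.MAR, self.MBR, self.IR = 0, None, None, None

        def fetch(self):
            self.MAR = self.PC           # PC -> MAR -> address bus
            self.MBR = memory[self.MAR]  # memory -> MBR
            self.IR = self.MBR           # MBR -> IR (instruction fetched)
            self.PC += 1                 # PC incremented for the next instruction

    cpu = DataPath()
    while True:
        cpu.fetch()
        print(f"PC={cpu.PC}  IR={cpu.IR}")
        if cpu.IR == "HALT":
            break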
Instruction Pipelining
Pipelining Strategy
• Instruction pipelining is a technique used to improve CPU performance by overlapping
instruction execution, much like an assembly line in a factory.
• Instead of executing one instruction at a time, the processor breaks the execution into stages,
with multiple instructions being processed simultaneously.
• Basic Two-Stage Pipeline
1. Fetch Instruction (FI): Reads the instruction from memory.
2. Execute Instruction (EI): Decodes and executes the instruction.
• This approach improves speed by allowing a new instruction to be fetched while another is
being executed. However, execution time varies, and branch instructions cause delays.
Instruction Pipelining
Six-Stage Instruction Pipeline
To optimize performance, instruction processing can be broken into more stages:
1. Fetch Instruction (FI): Reads the instruction into a buffer.
2. Decode Instruction (DI): Determines the opcode and operands.
3. Calculate Operands (CO): Computes effective addresses.
4. Fetch Operands (FO): Retrieves operands from memory or registers.
5. Execute Instruction (EI): Performs the computation.
6. Write Operand (WO): Stores the result.
• With a six-stage pipeline, multiple instructions are in different stages simultaneously. If each stage takes the same amount of time, total execution time is significantly reduced.
• Challenges: Memory conflicts, branch instructions, and interrupts can stall the pipeline.
Pipeline Performance
• For a k-stage pipeline with cycle time τ, the time needed to execute n instructions is:
  T_k = [k + (n - 1)]τ
• The first instruction takes k cycles to fill the pipeline; each remaining instruction completes one cycle later.
• The speedup over a nonpipelined processor that takes k cycles per instruction is:
  S_k = T_1 / T_k = nkτ / [k + (n - 1)]τ = nk / (k + n - 1)
• As n grows large, S_k approaches k, the number of pipeline stages.
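• A minimal sketch of these formulas in Python (k, n, and τ are the only inputs; hazards and stalls are not modeled):

    def pipeline_time(k, n, tau):
        # Time for n instructions on a k-stage pipeline with cycle time tau:
        # the first instruction takes k cycles; each later one adds one cycle.
        return (k + (n - 1)) * tau

    def pipeline_speedup(k, n):
        # Speedup over a nonpipelined processor taking k cycles per instruction.
        return (n * k) / (k + n - 1)

    print(pipeline_speedup(5, 100))   # ~4.81: approaches k = 5 for large n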
Pipeline Hazards
Hazards occur when instruction dependencies prevent continuous execution.
Three types of hazards exist:
1. Resource Hazards (Structural Hazards)
• Occurs when multiple instructions require the same hardware resource.
• Example: If a memory read and an instruction fetch cannot occur simultaneously, the
pipeline must stall.
• Solution: Increase hardware resources (e.g., multiple memory ports, multiple ALUs).
Pipeline Hazards
2. Data Hazards
• Occurs when an instruction depends on the result of a previous instruction still in the pipeline
• Types:
• Read After Write (RAW) Hazard: A register read occurs before the previous instruction writes to it.
• Write After Read (WAR) Hazard: A write occurs before a previous instruction reads from the same location.
• Write After Write (WAW) Hazard: Two instructions write to the same location out of order.
• Example of RAW Hazard:
• ADD EAX, EBX ; EAX = EAX + EBX
• SUB ECX, EAX ; ECX = ECX - EAX (EAX is not ready)
• The pipeline must stall for EAX to be updated before being used.
• Solution:
• Forwarding (Bypassing): Pass data directly to dependent instructions.
• Pipeline Stalling: Delay execution until data is ready.
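• A minimal Python sketch that detects the RAW hazard in the example above (the instruction tuples and the two-instruction hazard window are illustrative assumptions, not a real pipeline model):

    # Each instruction: (mnemonic, destination register, source registers)
    program = [
        ("ADD", "EAX", ("EAX", "EBX")),   # EAX = EAX + EBX
        ("SUB", "ECX", ("ECX", "EAX")),   # reads EAX -> RAW on the ADD
        ("MOV", "EDX", ("EBX",)),
    ]

    def find_raw_hazards(instrs, window=2):
        # Report source registers read within `window` instructions of the
        # instruction that writes them: close enough to still be in the
        # pipeline, so forwarding or a stall is required.
        hazards = []
        for i, (_, dest, _) in enumerate(instrs):
            for j in range(i + 1, min(i + 1 + window, len(instrs))):
                _, _, srcs = instrs[j]
                if dest in srcs:
                    hazards.append((i, j, dest))
        return hazards

    for w, r, reg in find_raw_hazards(program):
        print(f"RAW: instruction {r} reads {reg} written by instruction {w}")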
Pipeline Hazards
3. Control Hazards (Branch Hazards)
• Occur when the pipeline fetches the wrong instruction after a branch (jump, if-else, loops,
etc.)
• Until the branch is executed, the pipeline does not know which instruction to fetch next.
• Penalty: Flushing incorrect instructions from the pipeline causes delays.
Handling Branch Hazards
1. Multiple Streams: Fetch both possible branch targets.
• Problem: Wastes resources and increases complexity.
2. Prefetch Branch Target: Fetch the next instruction and the branch target in
parallel.
• Used in IBM 360/91.
3. Loop Buffer: A small cache that stores recently executed instructions.
• If a loop repeats, it fetches instructions from the buffer instead of memory.
• Used in CDC Star-100, CRAY-1.
Handling Branch Hazards
4. Branch Prediction: The CPU predicts whether a branch will be taken.
• Static Prediction:
• Always Not Taken: Assume the branch is never taken.
• Always Taken: Assume the branch is always taken.
• Opcode-based Prediction: Certain opcodes predict branch behavior.
• Dynamic Prediction:
• Uses history to make better guesses.
• Taken/Not Taken Switch: Stores whether a branch was taken previously.
• Branch History Table (BHT): A cache that stores past branch decisions (see the sketch after this list).
5. Delayed Branching:
• The CPU reorders instructions to execute useful instructions before resolving the branch.
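• A minimal sketch of the branch history table described above, using 2-bit saturating counters (the table size and address indexing are arbitrary assumptions):

    class BranchHistoryTable:
        # 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken.
        # Indexed by the low-order bits of the branch address.
        def __init__(self, size=16):
            self.counters = [1] * size   # start weakly not-taken

        def predict(self, addr):
            return self.counters[addr % len(self.counters)] >= 2

        def update(self, addr, taken):
            i = addr % len(self.counters)
            if taken:
                self.counters[i] = min(3, self.counters[i] + 1)
            else:
                self.counters[i] = max(0, self.counters[i] - 1)

    # A loop branch at address 0x40, taken 9 times and then falling through:
    bht = BranchHistoryTable()
    outcomes = [True] * 9 + [False]
    correct = 0
    for taken in outcomes:
        correct += (bht.predict(0x40) == taken)
        bht.update(0x40, taken)
    print(f"{correct}/{len(outcomes)} branches predicted correctly")  # 8/10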
Problem
• A pipelined processor has a clock rate of 2.5 GHz and executes a program with 1.5 million instructions. The pipeline has five stages, and instructions are issued at a rate of one per clock cycle. Ignore penalties due to branch instructions and out-of-sequence executions.
• a. What is the speedup of this processor for this program compared to a nonpipelined
processor?
• b. What is throughput (in MIPS) of the pipelined processor?
• Solution:
• Given:
• clock_rate_ghz = 2.5 # GHz
• instructions = 1.5e6 # 1.5 million instructions
• pipeline_stages = 5 # 5-stage pipeline
• instruction_issue_rate = 1 # One instruction per clock cycle
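• A sketch of the computation in Python, applying the speedup formula S_k = nk / (k + n - 1) from the Pipeline Performance slide:

    clock_rate_hz = 2.5e9   # 2.5 GHz
    n = 1.5e6               # 1.5 million instructions
    k = 5                   # pipeline stages

    # a. Speedup vs. a nonpipelined processor taking k cycles per instruction
    speedup = (n * k) / (k + n - 1)
    print(f"Speedup = {speedup:.4f}")        # ~5: approaches k for large n

    # b. One instruction completes per cycle once the pipeline is full
    print(f"Throughput = {clock_rate_hz / 1e6:.0f} MIPS")   # 2500 MIPS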
Problem
• A nonpipelined processor has a clock rate of 2.5 GHz and an average CPI (cycles per instruction) of 4. An upgrade to the processor introduces a five-stage pipeline. However, due to internal pipeline delays, such as latch delay, the clock rate of the new processor has to be reduced to 2 GHz.
• a. What is the speedup achieved for a typical program?
• b. What is the MIPS rate for each processor?
• Solution:
• Given:
• clock_rate_non_pipelined = 2.5e9 # 2.5 GHz
• CPI_non_pipelined = 4
• clock_rate_pipelined = 2.0e9 # 2 GHz
• pipeline_stages = 5
• CPI_pipelined = 1 # Ideal pipeline CPI
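• A sketch of the computation in Python, comparing the time per instruction (CPI / clock rate) of the two processors:

    f_old, cpi_old = 2.5e9, 4   # nonpipelined: 2.5 GHz, CPI = 4
    f_new, cpi_new = 2.0e9, 1   # pipelined: 2 GHz, ideal CPI = 1

    # a. Speedup = (time per instruction, old) / (time per instruction, new)
    speedup = (cpi_old / f_old) / (cpi_new / f_new)
    print(f"Speedup = {speedup}")                               # 3.2

    # b. MIPS rate = clock rate / (CPI x 10^6)
    print(f"Nonpipelined: {f_old / (cpi_old * 1e6):.0f} MIPS")  # 625 MIPS
    print(f"Pipelined:    {f_new / (cpi_new * 1e6):.0f} MIPS")  # 2000 MIPS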
Instruction Execution Characteristics
• They are the patterns and behaviors observed during the execution of
high-level language (HLL) programs when compiled to machine-level code.
These characteristics help architects understand:
• What types of instructions occur most frequently
• How operands are used
• How control flows (e.g., branches, loops)
• Which parts of programs consume the most time
1. Operations Performed:
• Studies show that the most frequently executed operations in compiled HLL programs involve data movement and control flow, not complex arithmetic.
Instruction Execution Characteristics
2. Operand Usage
• Operand = the data on which operations are performed.
• Findings:
• Most operands are simple scalar variables (e.g., integers, chars)
• Around 80% are local to the procedure/function
• Arrays, structures, and pointers are used less frequently
• Since most data is local, registers are ideal for holding them.
3. Execution Sequencing
• Most instructions are simple (e.g., add, load, branch)
• Procedure calls and returns are frequent and expensive
• Branch instructions (like if, for, while) are common and affect pipeline flow
• Efficient support for procedure calls, register use, and branch prediction is critical.
Instruction Execution Characteristics
• Example insight (from studies such as those by Patterson and Hennessy): Even though CALL/RETURN instructions occur relatively infrequently, they consume a disproportionate amount of time due to saving and restoring context.
Instruction Execution Characteristics
• Instruction execution characteristics help architects design better CPUs:
• Add more registers for fast operand access
• Use simple instruction formats for fast decoding
• Design better pipelines and branch prediction
• Optimize hardware for realistic program behavior, not hypothetical workloads
Use of a Large Register File
• Large register file — a fast, on-chip storage space used to hold operands and
temporary results.
• This design choice is driven by the desire to minimize costly memory accesses
and maximize processor speed.
• Why Use a Large Register File?
1. Registers are faster than memory
• Accessing data in a register is much quicker than accessing cache or main memory.
• Keeping operands in registers significantly speeds up instruction execution.
2. High frequency of scalar and local variable access
• Studies show most high-level language (HLL) variables are:
• Scalars (e.g., integers, characters)
• Local to procedures (used within a function)
• Therefore, keeping these frequently used variables in registers is efficient.
Use of a Large Register File
• Why Use a Large Register File?
3. Reduces load/store operations
• RISC design limits memory access to only LOAD and STORE instructions.
• With enough registers, most operations can be done register-to-register, reducing memory traffic.
• Register Windows
• Each procedure needs its own set of registers.
• Calling another procedure (or returning from one) would typically require
saving/restoring registers to/from memory — slow.
• Hence use register windows — overlapping sets of registers assigned to each procedure.
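• A toy Python model of overlapping register windows (the window count, group size, and circular-buffer layout are simplifying assumptions in the spirit of SPARC):

    class RegisterWindows:
        # N overlapping windows share one physical register file. Each window
        # sees a param, local, and temp group; the temps of window i are the
        # same physical registers as the params of window i+1, so procedure
        # arguments pass between caller and callee with no memory traffic.
        def __init__(self, n_windows=8, group=6):
            self.n, self.g = n_windows, group
            self.file = [0] * (n_windows * 2 * group)   # circular buffer
            self.cwp = 0                                # current window pointer

        def reg(self, kind, i):
            # Physical index of register i in the current window's group.
            offset = {"param": 0, "local": 1, "temp": 2}[kind]
            return (self.cwp * 2 * self.g + offset * self.g + i) % len(self.file)

        def call(self):   # advance: callee's params alias the caller's temps
            self.cwp = (self.cwp + 1) % self.n

        def ret(self):
            self.cwp = (self.cwp - 1) % self.n

    rw = RegisterWindows()
    rw.file[rw.reg("temp", 0)] = 42      # caller writes an argument
    rw.call()
    print(rw.file[rw.reg("param", 0)])   # callee reads 42: same register

• In a real design, exceeding the window count causes a window overflow, spilling the oldest window to memory; this model omits that for brevity.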
Use of a Large Register File
• Global Variables
• Register windows are great for local variables, but global variables (shared across
functions) can't be held in these rotating windows.
• Solutions:
1. Assign global variables to memory (traditional)
2. Use fixed “global registers” — a small set of registers always accessible to all procedures
• Note that for frequently used local variables, a register file is faster and more efficient than a cache.
Use of a Large Register File
• Benefits of a Large Register File
• Reduced memory access = better performance
• Enables faster procedure calls with register windows
• Improves instruction pipelining efficiency
• Allows more operand storage for HLL programs
Compiler-Based Register Optimization
• In RISC architecture, the number of physical (hardware) registers is limited
(e.g., 16, 32). But high-level language (HLL) programs use many variables. So the
compiler is responsible for:
• Keeping as many frequently-used variables in registers as possible
• Minimizing load/store instructions to/from memory
• Reusing registers when possible without conflicts
• This is called register allocation and optimization.
Compiler-Based Register Optimization
• Graph Coloring Algorithm
• This is the most common algorithm used in compilers to perform register allocation.
• HLL programs refer to variables symbolically (e.g., a, b, sum)
• Compiler maps these symbolic variables to virtual registers
• Then it tries to assign virtual registers to physical registers in the most efficient way
• If registers run out, some variables must be "spilled" to memory (less efficient)
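• A minimal sketch of graph-coloring register allocation in Python (the interference graph and register count below are made-up examples):

    # Interference graph: variables live at the same time cannot share a register.
    interference = {
        "a":   {"b", "sum"},
        "b":   {"a", "sum"},
        "sum": {"a", "b", "tmp"},
        "tmp": {"sum"},
    }
    NUM_REGISTERS = 2   # physical registers available

    def color(graph, k):
        # Greedy coloring: give each node the lowest register not used by an
        # already-colored neighbor; spill to memory when none is free.
        assignment, spilled = {}, []
        for node in sorted(graph, key=lambda n: -len(graph[n])):  # hardest first
            taken = {assignment[n] for n in graph[node] if n in assignment}
            free = [r for r in range(k) if r not in taken]
            if free:
                assignment[node] = free[0]
            else:
                spilled.append(node)   # falls back to a memory slot
        return assignment, spilled

    regs, spills = color(interference, NUM_REGISTERS)
    print("register assignment:", regs)   # e.g. {'sum': 0, 'a': 1, 'tmp': 1}
    print("spilled to memory:  ", spills) # e.g. ['b']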
• Reason for Register Optimization
• Memory access is slow compared to registers
• Good register allocation = faster code, smaller binaries, less energy
• In RISC, since most operations are register-to-register, this becomes even more critical
Compiler-Based Register Optimization
• More Registers vs Compiler Optimization
• If you have many registers (like in some RISC CPUs), optimization becomes easier
• But even with few registers, a smart compiler can do a great job with optimization
• Studies show that performance gains taper off beyond 32–64 registers unless the compiler is very poor
Instruction Set Architecture
• Instruction Set Architecture (ISA) is the part of a computer architecture that
defines the interface between software and hardware. It includes:
• The set of instructions the processor can execute (e.g., arithmetic, logical, data transfer).
• Instruction formats, addressing modes, and data types.
• Registers, memory organization, and I/O mechanisms.
• Interrupts and exception handling mechanisms.
• ISA serves as the programmer’s view of the machine, defining what the
processor can do—not how it does it. It acts as a bridge between software and
the underlying hardware implementation.
RISC Architecture – Reduced Instruction Set Computer
• RISC is a computer architecture that uses a small, highly optimized set of
instructions, all designed to be executed very quickly — usually one instruction
per clock cycle.
Instruction Format – Reduced Instruction Set Computer
• RISC architectures typically use simple and fixed-length instruction formats to
facilitate fast decoding and efficient pipelining. Here's an overview of the
common formats:
• Typically, three instruction format types are used:
• R-Type (Register Type)
• I-Type (Immediate Type)
• J-Type (Jump Type)
• Simple, fixed instruction formats facilitate fast decoding
• Registers are used predominantly for operands
Instruction Format – Reduced Instruction Set Computer
• R-Type (Register Type)
• Used for arithmetic and logical operations.
Field      | Opcode | rs | rt | rd | shamt | funct
Bit Length |      6 |  5 |  5 |  5 |     5 |     6
• rs, rt: source registers
• rd: destination register
• shamt: shift amount
• funct: further specifies the operation
• Example:
• ADD R1, R2, R3 ; add the contents of R2 and R3, store the result in R1
Instruction Format – Reduced Instruction Set Computer
• I-Type (Immediate Type)
• Used for data transfer, arithmetic with constants, and branching.
Field      | Opcode | rs | rt | Immediate
Bit Length |      6 |  5 |  5 |        16
• rs: source register
• rt: destination register
• Immediate: constant value or address offset
• Example:
• ADDI R1, R2, #10 ; add immediate value 10 to R2, store the result in R1
Instruction Format – Reduced Instruction Set Computer
• J-Type (Jump Type)
• Used for unconditional jumps.
Field      | Opcode | Address
Bit Length |      6 |      26
• Address: jump target, usually combined with the upper bits of the PC
• Example:
• JMP 10000 ; jump to the instruction at address 10000
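• A minimal Python sketch that packs these fields into 32-bit words using the field widths shown above (the opcode and funct values below are illustrative, not a real ISA's encodings):

    def encode_r(opcode, rs, rt, rd, shamt, funct):
        # R-type: opcode(6) rs(5) rt(5) rd(5) shamt(5) funct(6)
        return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) \
               | (shamt << 6) | funct

    def encode_i(opcode, rs, rt, imm):
        # I-type: opcode(6) rs(5) rt(5) immediate(16)
        return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

    def encode_j(opcode, addr):
        # J-type: opcode(6) address(26)
        return (opcode << 26) | (addr & 0x3FFFFFF)

    # ADD R1, R2, R3 with made-up opcode 0 / funct 32:
    print(f"{encode_r(0, 2, 3, 1, 0, 32):032b}")
    # ADDI R1, R2, #10 with made-up opcode 8:
    print(f"{encode_i(8, 2, 1, 10):032b}")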
Functional Elements – Reduced Instruction Set Computer
1. Instruction Fetch Unit (IFU)
• Function: Fetches the next instruction from memory.
• Works with the Program Counter (PC) to determine the address of the next instruction.
• Uses instruction cache to reduce fetch time.
• First stage in the pipeline.
2. Instruction Decode Unit (IDU)
• Function: Decodes the fetched instruction into control signals.
• Identifies the operation (opcode), source and destination registers.
• Checks for operand readiness and forwards to the execution unit.
• Since instruction formats are simple and fixed, decoding is fast and efficient.
• Second stage in the pipeline.
Functional Elements – Reduced Instruction Set Computer
3. Register File
• Function: Stores general-purpose data for computation.
• Contains a large number of registers (e.g., 32 or more).
• Most operands are read/written directly to/from registers, not memory.
• Registers are used for storing:
• Operands for ALU
• Intermediate values
• Function parameters
• Reduces memory access, speeds up computation.
4. Arithmetic and Logic Unit (ALU)
• Function: Performs arithmetic and logic operations like:
• ADD, SUB, AND, OR, XOR
• Comparisons (for branches)
• Works on data from registers, not memory.
• Third stage of the pipeline (Execute).
Functional Elements – Reduced Instruction Set Computer
5. Control Unit
• Function: Directs the operation of the processor.
• Generates control signals based on the instruction type.
• Manages instruction flow, pipelining, branching, etc.
• Often hardwired instead of microprogrammed (unlike CISC).
• Enables fast instruction execution.
6. Load/Store Unit (Memory Access)
• Function: Handles memory access operations:
• LOAD: Read data from memory into a register
• STORE: Write data from a register to memory
• Only these instructions access memory.
• Separates memory from computation (Load/Store architecture).
Functional Elements – Reduced Instruction Set Computer
7. Pipeline Registers
• Function: Hold intermediate data between pipeline stages.
• Allow overlapping execution of multiple instructions.
• Boosts throughput via instruction pipelining.
8. Program Counter (PC)
• Function: Holds the address of the next instruction to fetch.
• Automatically updated after each instruction.
• Can be modified by branch/jump instructions.
9. Instruction and Data Cache
• Function: Stores frequently accessed instructions and data.
• Reduces latency of memory operations.
• Helps maintain performance despite slower main memory.
• Instructions are fixed length (usually 32 bits) and simple in format.
Functional Elements – Reduced Instruction Set Computer
Functional Element     | Functionality
Instruction Fetch Unit | Fetches instructions
Instruction Decoder    | Decodes and prepares instructions
Register File          | Holds operands and results
ALU                    | Performs arithmetic/logical operations
Control Unit           | Manages execution and pipelining
Load/Store Unit        | Handles memory reads/writes
Program Counter (PC)   | Tracks next instruction address
Pipeline Registers     | Buffer between pipeline stages
Instruction/Data Cache | Fast access to code/data
Branch Prediction Unit | Reduces branch penalties
Pipelining – Reduced Instruction Set Computer
• RISC designs are ideal for instruction pipelining, which overlaps execution stages of
different instructions. Typical RISC pipeline:
1. IF - Instruction Fetch
2. ID - Instruction Decode
3. EX – Execute
4. MEM - Memory Access (if needed)
5. WB - Write Back
• This allows RISC CPUs to execute 1 instruction per cycle after the pipeline fills up.
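• A small Python sketch that prints the space-time diagram for this five-stage pipeline (the instruction count is arbitrary; hazards and stalls are ignored):

    STAGES = ["IF", "ID", "EX", "MEM", "WB"]

    def diagram(n_instructions):
        # Show which stage each instruction occupies in each clock cycle,
        # assuming one instruction issued per cycle and no stalls.
        k = len(STAGES)
        total = k + n_instructions - 1     # cycles until the last WB
        print("     " + "".join(f"c{c + 1:<4}" for c in range(total)))
        for i in range(n_instructions):
            row = ["     "] * total
            for s, name in enumerate(STAGES):
                row[i + s] = f"{name:<5}"
            print(f"I{i + 1:<3} " + "".join(row))

    diagram(4)   # 4 instructions finish in 5 + 4 - 1 = 8 cycles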
Execution – Reduced Instruction Set Computer
1. Fixed-Length Instructions:
• All instructions are 32 bits wide, which simplifies the fetch stage and allows easy decoding.
2. Few Addressing Modes:
• Only register and immediate addressing modes are supported, which simplifies the effective
address calculation stage in the pipeline.
3. Register-to-Register Operations:
• Operands come from fast-access registers, reducing memory access time and improving
overall execution speed.
4. Simple Control Logic:
• Because of uniform instruction formats and simple operations, the control unit can be
hardwired rather than microprogrammed, improving performance.
Advantage – Reduced Instruction Set Computer
• Simpler control logic → faster execution
• Easier to implement pipelining → higher instruction throughput
• Compiler optimization is easier
• Lower power consumption
• Better performance with fewer transistors → cost-effective
• Examples:
• ARM (used in most smartphones & embedded devices)
• MIPS
• RISC-V (open-source architecture)
• SPARC (used in servers)
• PowerPC