Dpco Unit 3,4,5 New Notes
Functional Units of a Digital Computer: Von Neumann Architecture – Operations and Operands
of Computer Hardware – Instruction Set Architecture (ISA): Memory Location, Address
and Operation – Instruction and Instruction Sequencing – Addressing Modes, Encoding of Machine
Instruction – Interaction between Assembly and High-Level Language.
The structure described in the figure outlines the basic components of a computer system, particularly
focusing on the memory and processor. Here's a breakdown of the components:
Memory: This is where data and instructions are stored. It is a crucial part of the computer system
that allows for the storage and retrieval of information.
Control Unit: This component manages the operations of the computer. It directs the flow of data
between the CPU and other components.
Arithmetic Logic Unit (ALU): The ALU performs arithmetic and logical operations. It is
responsible for calculations and decision-making processes.
Input: This refers to the devices or methods through which data is entered into the computer system.
Output: This refers to the devices or methods through which data is presented to the user or other
systems.
Processor: The processor, or CPU, is the central component that carries out the instructions of a
computer program. It includes the ALU and Control Unit.
Accumulator: This is a register in the CPU that stores intermediate results of arithmetic and logic
operations.
Addressing Modes
1. Register addressing mode: The operand is the contents of a processor register, and the name of the
register is given in the instruction.
Example: MOV R2, R1: This instruction copies the contents of register R2 to register R1.
2. Absolute or direct addressing mode: The address of the location of the operand is given explicitly
as a part of the instruction.
Example: MOV 2000, A: This instruction copies the contents of memory location 2000 into the A
register. Here, the address of the operand is given explicitly in the instruction.
3. Immediate addressing mode: The operand is given explicitly in the instruction.
• Example: MOV #20, A: This instruction copies the operand 20 into register A. The sign # in front of
the value of an operand indicates that the value is an immediate operand.
4. Indirect addressing mode: In this addressing mode, the instruction contains the address of a memory
location which in turn holds the address of the operand.
5. Register indirect addressing mode: The effective address of the operand is the contents of a
register that is specified in the instruction.
• Example: MOV (R0), A: This instruction copies the contents of the memory location addressed by the
contents of register R0 into register A.
6. Displacement addressing mode: This addressing mode combines the capabilities of direct
addressing and register indirect addressing. In this addressing mode, instruction has two address
fields: a value and a referenced register. The effective address is computed by adding the contents of the
referenced register to the value:
EA = Value + (R)
Three common variations of displacement addressing are:
• Relative addressing
• Base register addressing
• Index addressing.
7. Relative addressing mode: Here, the referenced register is program counter (PC) and hence this
addressing mode is also known as PC-relative addressing.
The effective address is determined by adding the contents of PC to the address field. EA = PC +
Address part of instruction.
The address part is a signed number so that it is possible to have branch target location either before
or after the branch instruction. This addressing mode is commonly used to specify the target address
in branch instructions.
Example: JNZ BACK: This instruction causes program execution to go to the branch target location
identified by the name BACK, if the branch condition is satisfied.
8. Base register addressing: In this addressing mode, the referenced register contains a main
memory address and the address field contains the displacement. The displacement is usually an
unsigned integer. EA = (R) + Displacement.
Example: MOV [R+8], A: This instruction copies into register A the contents of the memory location
whose address is obtained by adding the displacement 8 to the contents of register R.
9. Index addressing mode: In this addressing mode, the address field references the main memory
and the referenced register contains a positive displacement from that address. EA = Memory address
+ (R).
Indexing is a technique that allows the programmer to refer to data (operands) stored in
sequential memory locations one by one. It is an efficient mechanism for performing iterative
operations.
Example: MOV [R1 + RI], R: In this instruction, the main memory address is given by register R1 and
the referenced register RI gives the positive displacement. The contents of the memory address
generated by the addition of the main memory address and the displacement are copied to register R.
10. Auto increment addressing mode: The effective address of the operand is the contents of a
register specified in the instruction. After accessing the operand, the contents of this register are
incremented to address the next location.
Example: MOV R0, (R2)+: This instruction copies the contents of register R0 into the memory
location whose address is specified by the contents of register R2. After the copy operation, the contents
of register R2 are automatically incremented by 1.
11. Auto decrement addressing mode: The contents of a register specified in the instruction are
decremented and then they are used as an effective address to access a memory location.
Example: MOV -(R0), R1: This instruction first decrements the contents of register R0, and the
decremented contents of register R0 are then used to address the memory location. Finally, the
contents of the addressed memory location are copied into register R1.
12. Stack addressing mode: A stack is a linear array of reserved memory locations. It is associated with
a pointer called the Stack Pointer (SP).
In stack addressing mode, stack pointer always contains the address of Top Of Stack (TOS) where
the operand is to be stored or located. Thus, the address of the operand (source or destination) is the
contents of stack pointer.
This addressing mode is the special case of register indirect addressing where referenced register is a
stack pointer.
Usually, a stack grows in the direction of descending addresses (a descending stack), starting from a
high address and progressing toward lower ones. In such a stack, SP is decremented before any item is
appended (pushed) onto the stack, and SP is incremented after any item is popped from the stack.
Example: PUSH R: This instruction decrements SP and copies the contents of register R onto the
top of the stack pointed to by the stack pointer.
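To make the effective-address (EA) rules above concrete, the following is a minimal Python sketch; the register names, memory contents and the increment step of 1 are illustrative assumptions, not part of any real instruction set:

```python
# Minimal sketch of effective-address (EA) computation for several of the
# addressing modes above. Memory contents and register values are invented.

memory = {2000: 55, 3000: 2000, 500: 77}
regs = {"R0": 500, "PC": 100, "SP": 996}

def ea_direct(address):            # EA is given explicitly in the instruction
    return address

def ea_indirect(address):          # memory holds the address of the operand
    return memory[address]

def ea_register_indirect(reg):     # EA is the contents of the named register
    return regs[reg]

def ea_displacement(reg, value):   # EA = Value + (R)
    return value + regs[reg]

def ea_autoincrement(reg):         # use (R) as EA, then increment R
    ea = regs[reg]
    regs[reg] += 1                 # step assumed to be 1, as in the notes
    return ea

def ea_autodecrement(reg):         # decrement R first, then use (R) as EA
    regs[reg] -= 1
    return regs[reg]

print(memory[ea_direct(2000)])            # direct: operand at 2000 -> 55
print(memory[ea_indirect(3000)])          # indirect: 3000 holds 2000 -> 55
print(memory[ea_register_indirect("R0")]) # (R0) = 500 -> 77
print(ea_displacement("R0", 8))           # 500 + 8 = 508
```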
MIPS Addressing Modes
The MIPS addressing modes are as follows:
1. Immediate addressing: In this addressing mode, the operand is a constant within the instruction
itself.
Example: lui $s0, 61 // Loads decimal 61 into the upper 16 bits of register $s0
2. Register addressing: In this addressing mode, the operand is a register.
Example: add $t1, $s0, $s1 // Adds contents of $s0 and $s1 and stores the result in $t1.
3. Base or displacement addressing: In this addressing mode, the operand is at the memory location
whose address is the sum of a register and a constant in the instruction.
Example: lw $t1, 4($t2) // where $t1 = rs, $t2 = base (memory address) and
4 = offset value. Thus, $t1 = Memory[$t2 + 4]
4. PC-relative addressing: In this addressing mode, the branch address is the sum of the PC and a
constant in the instruction.
Example: beq $0, $3, Label
5. Pseudo-direct addressing: In this addressing mode, the jump address is formed from the 26 bits of the
instruction concatenated with the upper bits of the PC. An address in pseudo-direct addressing must be a
multiple of four.
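The address computations behind the base/displacement, PC-relative and pseudo-direct modes can be sketched as follows; the field values used in the calls are invented examples, not decoded from real MIPS binaries:

```python
# Sketch of how the three address-forming MIPS modes compute an address.

def base_displacement(base_reg_value, offset16):
    # lw/sw: address = base register + sign-extended 16-bit offset
    if offset16 & 0x8000:                      # sign-extend the 16-bit field
        offset16 -= 1 << 16
    return base_reg_value + offset16

def pc_relative(pc, offset16):
    # beq/bne: target = (PC + 4) + (offset << 2); the offset counts words
    if offset16 & 0x8000:
        offset16 -= 1 << 16
    return (pc + 4) + (offset16 << 2)

def pseudo_direct(pc, target26):
    # j: upper 4 bits of PC+4, the 26-bit field, and two zero bits concatenated
    return ((pc + 4) & 0xF0000000) | (target26 << 2)

print(hex(base_displacement(0x1000, 4)))      # 0x1004
print(hex(pc_relative(0x0040_0000, 3)))       # 0x400010
print(hex(pseudo_direct(0x0040_0000, 0x10)))  # 0x40
```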
A computer programming language is a language in which the codes are written to instruct the computer
to perform tasks.
Computer programming languages are broadly classified into the following three main categories:
Machine Language
Assembly Language
High-level Language
The following sections give an overview of assembly language and high-level languages, and how they
are different from each other.
What is Assembly Language?
Assembly language is a type of low-level programming language that is used to write instructions
for computers and other programmable devices. Since it is a low-level language, it can
communicate with the computer hardware directly.
In assembly language, code is written using words and expressions that are easier for humans to
understand. The processor can execute only machine code, so assembly code must be converted
into machine code. The utility program that performs this conversion is called an assembler.
The major advantages of assembly language are a smaller memory requirement, shorter execution
time, and easier instruction writing than machine language.
What is High-level Language?
High-Level Language, also called HLL, is a category of computer programming languages that use
English-like statements to write computer instructions and code. These are the most widely used
programming languages because they are easy for humans to understand. However, as with
assembly language, the CPU cannot process high-level language code directly; it must first be
translated into executable machine code. For this, two types of language translators are used,
namely the interpreter and the compiler.
The major advantages of high-level languages include ease of writing, debugging, and understanding,
and machine independence. Common examples of high-level languages are C, C++, Java, Python, C#, etc.
Now, let us discuss the important differences between assembly language and high-level language.
Difference between Assembly Language and High-level Language
The following table highlights all the significant differences between assembly language and high-level
language:
| Parameter | Assembly Language | High-Level Language |
| --- | --- | --- |
| Definition | A programming language in which abbreviated keywords are used to write instruction codes. | A programming language in which English-like statements are used to write code. |
| Language translator | A language translator called an "assembler" is required to convert the assembly language code into machine code. | An interpreter or a compiler is required to convert high-level language code into executable machine code. |
| Level of language | Assembly language is a low-level language. | High-level language, as the name implies, is a high-level language. |
| Programmer friendliness | Less programmer-friendly. | Highly user-friendly. |
| Speed of execution | Instructions written in assembly language execute faster. | Instructions written in high-level language execute more slowly. |
| Machine dependency | Machine dependent. | Machine independent. |
| Prone to error | More prone to errors. | The chances of errors are lower. |
| Memory requirement | Requires less memory space. | Requires more memory space. |
| Code length | The executable code is shorter. | The executable code is longer. |
| Debugging | Relatively difficult to debug; more challenging and time-consuming. | Very easy to debug. |
| Complexity | Complex: to write instruction codes, the programmer must have a deep understanding of the hardware and system architecture. | Easy to write code without much knowledge of the computer hardware and architecture. |
| Efficiency | More efficient. | Less efficient, as the coder has less control over the underlying hardware. |
| Readability | Less readable. | More readable. |
| Development time & effort | Programs take more time and effort to develop. | Programs require less development time and effort. |
| Memory management | Manual memory management. | Automatic memory management. |
| Syntax | Uses symbolic representation of machine codes. | Uses keywords and statements similar to the English language. |
| Applications | Primarily used to program processors, microcontrollers, embedded systems, device drivers, etc. | Mainly used for developing software applications, web applications, etc. |
Both compilers and interpreters are language processors used to convert software code written in a
high-level language into machine language code. Compilers and interpreters are types of system
software. They are required because a computer cannot process code written in a high-level programming
language like C, C++, or Java; an HLL program must therefore be converted into machine code for execution.
In this article, we will highlight all the major differences between a compiler and an interpreter. Let's
start with some basics so that it will become easier to understand their differences.
What is a Compiler?
A language processor that converts a program written in a high-level language into machine language,
translating the entire program at once, is called a compiler. Thus, the input of a compiler is high-level
language code (called source code), while its output is machine language code (called object code).
A compiler scans the whole program and checks it for syntactic and semantic errors; once the code is
checked for errors, it is converted into object code. The object code can then be processed by the machine
to perform the corresponding task. Common programming languages that use a compiler are C, C++, C#,
etc.
Advantages of Compiler
There are various advantages of the compiler, which are as follows:
A compiler translates a program in a single run.
It consumes less time.
CPU utilization is higher.
Both syntactic and semantic errors can be checked concurrently.
It is easily supported by many high-level languages like C, C++, Java, etc.
What is an Interpreter?
A language translator that converts a high-level language program into a machine language program, one
line at a time, is referred to as an interpreter. An interpreter converts code more slowly than a compiler,
because it can scan and translate only one statement of the program at a time. Interpreters therefore
convert the source code into machine code during the execution of the program.
Interpreters do not generate an object code corresponding to the source code. However, they are
relatively easy to use and execute the code. Programming languages that use interpreters are Perl,
Ruby, Python, MATLAB, etc.
Advantages of Interpreter
There are various advantages of the interpreter, which are as follows:
An interpreter translates the program line by line.
The interpreter is smaller in size.
It is flexible.
Error localization is easier.
The interpreter facilitates the implementation of computer programming language constructs.
Difference between Compiler and Interpreter
The following table highlights all the significant differences between a compiler and an interpreter:
| Parameter | Compiler | Interpreter |
| --- | --- | --- |
| Program scanning | Scans the entire program in one go. | The program is interpreted/translated one line at a time. |
| Error detection | As scanning is performed, all the errors are shown together at the end, not line by line. | One line of code is scanned, and any errors encountered in it are shown. |
| Object code | Converts the source code to object code. | Does not convert the source code into object code. |
| Execution time | Execution time is less, hence a compiler is preferred. | Not preferred due to its slow speed; an interpreter usually takes more time to execute the code. |
| Need of source code | Does not require the source code for later execution. | Requires the source code for later execution. |
| Programming languages | Languages that use compilers include C, C++, C#, etc. | Languages that use interpreters include Python, Ruby, Perl, MATLAB, etc. |
| Types of errors detected | Can check syntactic and semantic errors simultaneously. | Checks syntactic errors only. |
| Size | Compilers are larger in size. | Interpreters are smaller in size. |
| Flexibility | Not flexible. | Relatively flexible. |
| Efficiency | More efficient. | Less efficient. |
Instruction fetch cycle: In this cycle, the instruction is fetched from the memory location whose address
is in the PC. This instruction is placed in the Instruction Register (IR) in the processor.
Instruction decode cycle: In this cycle, the opcode of the instruction stored in the instruction register is
decoded/examined to determine which operation is to be performed.
Instruction execution cycle: In this cycle, the specified operation is performed by the processor. This
often involves fetching operands from the memory or from processor registers, performing an arithmetic
or logical operation and storing the result in the destination location. During the instruction execution, PC
contents are incremented to point to the next instruction. After completion of execution of the current
instruction, the PC contains the address of the next instruction and a new instruction fetch cycle can
begin.
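The three cycles can be illustrated with a toy fetch-decode-execute loop in Python; the accumulator machine and its tuple-encoded instructions are invented for illustration and are not the MIPS datapath discussed next:

```python
# Toy fetch-decode-execute loop illustrating the three cycles described above.

memory = {0: ("LOAD", 100), 1: ("ADD", 101), 2: ("STORE", 102), 3: ("HALT",),
          100: 5, 101: 7, 102: 0}
pc, acc = 0, 0

while True:
    ir = memory[pc]          # fetch: instruction at the address in PC goes to IR
    pc += 1                  # PC now points to the next instruction
    opcode = ir[0]           # decode: examine the opcode
    if opcode == "LOAD":     # execute: perform the specified operation
        acc = memory[ir[1]]
    elif opcode == "ADD":
        acc += memory[ir[1]]
    elif opcode == "STORE":
        memory[ir[1]] = acc
    elif opcode == "HALT":
        break

print(memory[102])           # 12 = 5 + 7
```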
Basic MIPS Implementation
• In this chapter we will see the implementation of a subset of the core MIPS instruction set. These
instructions are divided into three classes:
• The memory-reference instructions: load word (lw) and store word (sw)
• The arithmetic-logical instructions: add, sub, AND, OR, and slt
• The branch instructions: branch equal (beq) and jump (j)
• The subset considered here does not include all the integer instructions (for example, shift, multiply,
and divide are missing), nor does it include any floating-point instructions. The key principles used in
creating a datapath and designing the control are discussed here. The implementation of the remaining
instructions is somewhat similar.
• For implementing every instruction, the first two steps are the same:
1. Fetch the instruction: Send the Program Counter (PC) contents (the address of the instruction) to the
memory that contains the instruction and fetch the instruction from that memory.
2. Fetch operand(s): Read one or two registers, using fields of the instruction to select the registers to
read. For the load word instruction we need to read only one register, but most other instructions
require reading two registers.
•The remaining actions required to complete the instruction depend on the instruction class. For each of
the three instruction classes (memory-reference, arithmetic-logical and branches), the actions are mostly
the same, independent of the exact instruction. This shows that the simplicity and regularity of the MIPS
instruction set simplifies the implementation by making the execution of many of the instruction classes
similar.
•For example, all instruction classes, except jump, use the Arithmetic-Logical Unit (ALU) after reading
the registers.
•Memory-reference instructions use the ALU for an address calculation
•Arithmetic-logical instructions use the ALU for the operation execution and
•Branches use the ALU for comparison.
• After using the ALU, the actions required to complete the various instruction classes are not the same.
•A memory-reference instruction needs to access the memory either to read data for a load or write data
for a store.
•An arithmetic-logical or load instruction must write the data from the ALU or memory back into a
register.
•A branch instruction may need to change the next instruction address based on the comparison;
otherwise, the PC should be incremented by 4 to get the address of the next instruction.
•Fig. 7.2.1 shows the block diagram of a MIPS implementation, showing the functional units and
interconnection between them.
Operation
• The program counter gives the instruction address to the instruction memory.
• After the instruction is fetched, the register operands required by an instruction are specified by fields of
that instruction.
• Once the register operands have been fetched, they can be used to compute a memory address (for a
load or store), to compute an arithmetic result (for an integer arithmetic-logical instruction), or to
perform a comparison (for a branch).
• If the instruction is an arithmetic-logical instruction, the result from the ALU must be written to a
register.
• If the operation is a load or store, the ALU result is used as an address to either store a value from the
registers or load a value from memory into the registers. The result from the ALU or memory is written
back into the register file.
• Branches require the use of the ALU output to determine the next instruction address, which comes
either from the ALU (where the PC and branch offset are summed) or from an adder that increments the
current PC by 4.
• Fig. 7.2.1 shows that data going to a particular unit is coming from two different sources. For example,
the value written into the PC can come from one of two adders, the data written into the register file can
come from either the ALU or the data memory, and the second input to the ALU can come from a register
or the immediate field of the instruction. The selection of appropriate source is done using multiplexer
(data selector). The multiplexer selects from among several inputs based on the setting of its control lines.
The control lines are set based primarily on information taken from the instruction being executed. This is
illustrated in Fig. 7.2.2.
• Fig. 7.2.2 also shows the control unit, which has the instruction as an input, is used to determine the
control signals for the functional units and two of the multiplexers.
• The third multiplexer, which determines whether PC+4 or the branch destination address is written into
the PC, is set based on the Zero output of the ALU, which is used to perform the comparison of a beq
instruction.
• Fig. 7.3.1 shows the combination of the three elements (instruction memory, program
counter and adder) from Fig. 7.3.2 to form a datapath that fetches instructions and
increments the PC to obtain the address of the next sequential instruction.
• The instruction memory stores the instructions of a program and gives instruction as an
output corresponding to the address specified by the program counter. The adder is used to
increment the PC by 4 to the address of the next instruction.
• Since the instruction memory is only read, the output at any time reflects the contents of
the location specified by the address input, and no read control signal is needed.
• The program counter is a 32-bit register that is written at the end of every clock cycle
and thus does not need a write control signal.
• The adder always adds its two 32-bit inputs and places the sum on its output.
Data path Segment for Arithmetic - Logic Instructions
• The arithmetic-logic instructions read operands from two registers, perform an ALU
operation on the contents of the registers, and write the result to a register. We call these
R-type instructions. This instruction class includes add, sub, AND, OR, and
slt. For example, OR $t1, $t2, $t3 reads $t2 and $t3, performs a logical OR operation and
saves the result in $t1.
• The processor's 32 general-purpose registers are stored in a structure called a register file.
A register file is a collection of registers in which any register can be read or written by
specifying the number of the register in the file. The register file contains the register state
of the computer.
• Fig. 7.3.2 shows a multiport register file (two read ports and one write port) and the ALU
section. We know that the R-format instructions have three register operands:
two source operands and one destination operand.
• For each data word to be read from the register file, we need to specify the register
number to the register file. On the other hand, to write a data word, we need two inputs:
One to specify the register number to be written and one to supply the data to be written
into the register.
• The register file always outputs the contents of whatever register numbers are on the Read
register inputs. Write operations, however, are controlled by the write control signal
(RegWrite), which must be asserted for a write to occur at the clock edge.
• Since writes to the register file are edge-triggered, it is possible to perform read and write
operation for the same register within a clock cycle: The read operation gives the value
written in an earlier clock cycle, while the value written will be available to a read in a
subsequent clock cycle.
• As shown in Fig. 7.3.2, the register number inputs are 5 bits wide to specify one of 32
registers, whereas the data input and two data output buses are each 32 bits wide.
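A minimal Python model of such a register file, with two always-active read ports and a write that takes effect only at the clock edge, might look as follows (the class and method names are illustrative assumptions):

```python
# Minimal register-file model: two read ports, one write port. The
# edge-triggered write is modelled by applying the write only when the
# clock_edge method is called with RegWrite asserted.

class RegisterFile:
    def __init__(self, n=32, width_mask=0xFFFFFFFF):
        self.regs = [0] * n            # 32 general-purpose registers
        self.mask = width_mask         # 32-bit data width

    def read(self, rs, rt):            # two read ports, always active
        return self.regs[rs], self.regs[rt]

    def clock_edge(self, reg_write, rd, data):
        if reg_write:                  # write happens only at the clock edge
            self.regs[rd] = data & self.mask

rf = RegisterFile()
rf.clock_edge(True, 9, 123)            # write register 9 ($t1)
print(rf.read(9, 0))                   # (123, 0)
```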
Data path Segment for Load Word and Store Word Instructions
• Now, consider the MIPS load word and store word instructions, which have the general
form lw $t1, offset_value($t2) or sw $t1, offset_value($t2).
• In these instructions $t1 is a data register and $t2 is a base register. The memory address
is computed by adding the contents of the base register ($t2) to the 16-bit signed offset value
specified in the instruction.
• In case of store instruction, the value from the data register ($t1) must be read and in case
of load instruction, the value read from memory must be written into the data register
($t1). Thus, we will need both the register file and the ALU from Fig. 7.3.2.
• We know that the offset value is 16 bits while the base register contents are 32 bits. Thus, we
need a sign-extend unit to convert the 16-bit offset field in the instruction to a 32-bit
signed value so that it can be added to the base register.
• In addition to sign extend unit, we need a data memory unit to read from or write to. The
data memory has read and write control signals to control the read and write operations. It
also has an address input, and an input for the data to be written into memory. Fig. 7.3.3
shows these two elements.
• Sign extension is implemented by replicating the high-order sign bit of the original data
item in the high-order bits of the larger, destination data item.
• Therefore, two units needed to implement loads and stores, in addition to the register file
and ALU of Fig. 7.3.2, are the data memory unit and the sign extension unit.
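Sign extension by replicating the high-order bit can be sketched in Python as follows, for the 16-bit offset field described above:

```python
# Sign extension: replicate the high-order (sign) bit of the 16-bit field
# into the upper bits of the 32-bit result.

def sign_extend(value16):
    value16 &= 0xFFFF
    if value16 & 0x8000:               # sign bit set: replicate 1s
        return value16 | 0xFFFF0000
    return value16                     # sign bit clear: upper bits stay 0

print(hex(sign_extend(0x0004)))        # 0x4
print(hex(sign_extend(0xFFFC)))        # 0xfffffffc (-4 as a 32-bit value)
```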
Data path Segment for Branch Instruction
• The beq instruction has three operands: two registers that are compared for equality, and
a 16-bit offset which is used to compute the branch target address relative to the branch
instruction address. It has the general form beq $t1, $t2, offset.
• To implement this instruction, it is necessary to compute the branch target address by
adding the sign-extended offset field of the instruction to the PC. The two important things
in the definition of branch instructions which need careful attention are:
• The instruction set architecture specifies that the base for the branch address calculation
is the address of the instruction following the branch (i.e., PC + 4, the address of the next
instruction).
• The architecture also states that the offset field is shifted left 2 bits so that it is a word
offset; this shift increases the effective range of the offset field by a factor of 4.
• Therefore, the branch target address is given by
Branch target address = PC + 4 + (offset shifted left 2 bits)
• In addition to computing the branch target address, we must also determine whether the two
operands are equal. If the two operands are not equal, the next instruction is the
instruction that follows sequentially (PC = PC + 4); in this case, we say that the branch is
not taken. On the other hand, if the two operands are equal (i.e., the condition is true), the branch
target address becomes the new PC, and we say that the branch is taken.
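The target computation and the taken/not-taken decision can be summarised in a short Python sketch (the PC and register values are illustrative):

```python
# The two branch-datapath operations: compute the target address and
# decide taken/not-taken from the ALU comparison.

def branch(pc, rs_val, rt_val, offset16):
    if offset16 & 0x8000:                  # sign-extend the offset field
        offset16 -= 1 << 16
    target = (pc + 4) + (offset16 << 2)    # PC+4 + offset shifted left 2 bits
    zero = (rs_val - rt_val) == 0          # ALU subtract; Zero output
    return target if zero else pc + 4      # taken vs. not taken

print(hex(branch(0x1000, 5, 5, 3)))        # equal: branch taken -> 0x1010
print(hex(branch(0x1000, 5, 6, 3)))        # not equal: PC+4 -> 0x1004
```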
• Thus, the branch datapath must perform two operations: compute the branch target address and compare the register contents.
• Fig. 7.3.5 shows the structure of the data path segment that handles branches.
• To compute the branch target address, the branch datapath includes a sign extension unit,
shifter and an adder.
• To perform the compare, we need to use the register file and the ALU shown in Fig.
7.3.2.
• Since the ALU provides a Zero signal that indicates whether the result is 0, we can
send the two register operands to the ALU with the control set to do a subtract operation.
If the Zero signal is asserted, we know that the two values are equal.
• For the jump instruction, the lower 28 bits of the PC are replaced by the lower 26 bits of the
instruction shifted left by 2 bits, making the two LSBs 0. This can be implemented by
simply concatenating 00 to the jump offset.
• In the MIPS instruction set, branches are delayed, meaning that the instruction
immediately following the branch is always executed, independent of whether the branch
condition is true or false. When the condition is false, the execution looks like a normal
branch. When the condition is true, a delayed branch first executes the instruction
immediately following the branch in sequential instruction order before jumping to the
specified branch target address.
Creating a Single Data path
• We can combine the data path components needed for the individual instruction classes
into a single data path and add the control to complete the implementation.
• This simplest data path will attempt to execute all instructions in one clock cycle. This
means that no data path resource can be used more than once per instruction, so any
element needed more than once must be duplicated. We therefore need a memory for
instructions separate from one for data. Functional units that are needed more than once are
duplicated, while many of the remaining elements can be shared by different instruction flows.
• To share a data path element between two different instruction classes, we allow
multiple connections to the input of an element and use a multiplexer and a
control signal to select among the multiple inputs.
Example 7.3.1 Show how to build a data path for the operational portion of the memory-
reference and arithmetic-logical instructions that uses a single register file and a single
ALU to handle both types of instructions, adding any necessary multiplexers.
Solution : To create a data path with only a single register file and a single ALU, we must
support two different sources for the second ALU input, as well as two different sources
for the data stored into the register file. Thus, one multiplexer is placed at the ALU input
and another at the data input to the register file. Fig. 7.3.6 shows the operational portion of
the combined data path.
• We can make a simple data path for the core MIPS architecture by combining the datapath
for instruction fetch, the data path for R-type and memory instructions, and the datapath
for branches, as shown in Fig. 7.3.7.
Designing a Control Unit
• Here, we restrict ourselves to implementing load word (lw), store word (sw), branch equal
(beq), and the arithmetic-logical instructions add, sub, AND, OR, and set on less than. We
will also extend the design to include a jump instruction (j).
The ALU Control
• The MIPS ALU defines the following six combinations of its four control inputs:
• Depending on the instruction class, the ALU will need to perform one of these first five
functions. (NOR function is needed for other parts of the MIPS instruction set. It is not
included in the subset we are implementing.)
• In case of load word and store word instructions, we use the ALU to compute the
memory address by addition.
• In case of the R-type instructions, the ALU needs to perform one of the five actions -
AND, OR, subtract, add, or set on less than.
• In case of branch equal, the ALU must perform a subtraction.
• We can control the operation of the ALU by the 4-bit ALU control input and the 2-bit
ALUOp field. The 2-bit ALUOp field is interpreted as shown in Table 7.4.1.
• Table 7.4.2 shows how to set the ALU control inputs based on the 2-bit ALUOp control
and the 6-bit function code.
• Here, a technique using multiple levels of decoding is employed.
Advantages of using multiple levels of decoding
• It reduces the size of the main control unit.
• Use of several smaller control units may also potentially increase the speed of the control
unit.
• Table 7.4.3 shows how the 4-bit ALU control is set depending on these two input fields: the
6-bit function field and the 2-bit ALUOp field.
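A sketch of this two-level decoding in Python is given below. The 4-bit codes follow the common textbook (Patterson and Hennessy style) assignment; the exact values in Tables 7.4.1 and 7.4.2 are not reproduced here, so treat these codes as assumptions:

```python
# Two-level decoding of the ALU control: ALUOp selects add/subtract
# directly, or defers to the funct field for R-type instructions.

FUNCT_TO_ALU = {0b100000: 0b0010,   # add
                0b100010: 0b0110,   # sub
                0b100100: 0b0000,   # AND
                0b100101: 0b0001,   # OR
                0b101010: 0b0111}   # slt

def alu_control(aluop, funct):
    if aluop == 0b00:               # lw/sw: address calculation -> add
        return 0b0010
    if aluop == 0b01:               # beq: comparison -> subtract
        return 0b0110
    return FUNCT_TO_ALU[funct]      # R-type: decode the funct field

print(bin(alu_control(0b10, 0b100010)))   # R-type sub -> 0b110
```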
• Once the truth table has been constructed, it can be optimized and can be implemented
using logic gates.
Designing the Main Control Unit
• Before looking at the rest of the control design, it is useful to review the formats
of the three instruction classes: The R-type, branch and load-store instructions. Fig. 7.4.1
shows these formats.
• Format for R-format instructions: Opcode is 0. These instructions have three register
operands: rs, rt, and rd. Fields rs and rt are sources, and rd is the destination. The funct
(function) field selects an ALU function, as discussed in the previous section. The shamt field is
used only for shifts.
• Format for load and store instructions: Load (opcode 35) or store (opcode 43). The register
rs is the base register that is added to the 16-bit address field to form the memory address.
For loads, rt is the destination register for the loaded value. For stores, rt is the source
register whose value should be stored into memory.
• Format for branch equal: Opcode is 4. The registers rs and rt are the source registers that
are compared for equality. The 16-bit address field is sign-extended, shifted, and added to
PC + 4 to compute the branch target address.
Important observations about this instruction format
• Bits 31:26 in the instruction format are the op field and give the opcode (operation code). We
will refer to this field as Op[5:0].
• Bits 25:21 and 20:16 in the instruction format always specify the rs and rt fields,
respectively.
• Bits 25:21 always give the base register (rs) for load and store instructions.
• Bits 15:0 give the 16-bit offset for branch equal, load, and store.
• The destination register is in one of two places. For a load it is in bit positions 20:16 (rt),
while for an R-type instruction it is in bit positions 15:11 (rd). Thus, we will need to add a
multiplexer to select which field of the instruction is used.
• From the above information, we can add the instruction labels and an extra multiplexer (for
the Write register number input of the register file) to the simple datapath. Fig. 7.4.2
shows these additions plus the ALU control block, the write signals for state elements, the
read signal for the data memory, and the control signals for the multiplexers. Since all the
multiplexers have two inputs, they each require a single control line.
• Fig. 7.4.2 shows seven single-bit control lines (RegDst, RegWrite, ALUSrc, MemWrite,
MemRead, PCSrc and MemtoReg) plus the 2-bit ALUOp control signal.
• Table 7.4.4 describes the function of single-bit control lines.
• These nine control signals (seven single-bit control lines and the 2-bit ALUOp control
signal) can be set according to six input signals to the control unit, namely the opcode bits
31 to 26. Fig. 7.4.3 shows the data path with the control unit and the control signals.
• As shown in Fig. 7.4.3, the input to the control unit is the 6-bit opcode field from the
instruction.
• The outputs of the control unit consist of three 1-bit signals that are used to control
multiplexers (RegDst, ALUSrc, and MemtoReg), three signals for controlling reads and
writes in the register file and data memory (RegWrite, MemRead, and MemWrite), a 1-bit
signal used in determining whether to possibly branch (Branch), and a 2-bit control
signal for the ALU (ALUOp).
• An AND gate is used to combine the branch control signal and the Zero output from the
ALU; the AND gate output controls the selection of the next PC.
• Table 7.4.5 defines whether each control signal should be 0, 1, or don't care (X) for each
of the opcode values.
• For all R-format instructions (add, sub, AND, OR, and slt) the source register fields are
rs and rt, and the destination register field is rd; this defines how the signals ALUSrc and
RegDst are set (See first row of Table 7.4.5).
• An R-type instruction also writes a register (RegWrite = 1), but neither reads nor writes data
memory.
• For all R-format instructions, the PC should be unconditionally replaced with PC + 4. Thus
the Branch control signal is 0; otherwise, the PC is replaced by the branch target only if the
Zero output of the ALU is also high.
• The ALUOp field for R-type instructions is set to 10 to indicate that the ALU control
should be generated from the funct field.
• The second and third rows of Table 7.4.5 give the control signal settings for lw and sw.
Their ALUSrc and ALUOp fields are set to perform the address calculation.
• MemRead and MemWrite are set to perform the memory access. Finally, RegDst and
RegWrite are set for a load to cause the result to be stored into the rt register.
• The branch instruction sends the rs and rt registers to the ALU. The ALUOp field for
branch is set for a subtract (ALU control = 01), which is used to test for equality.
• It is important to note that the MemtoReg field is irrelevant when the RegWrite signal is 0:
since the register is not being written, the value of the data on the register data write port is
not used. Thus, the MemtoReg entry in the last two rows of Table 7.4.5 is replaced
with X for don't care. Don't cares can also be added to RegDst when RegWrite is 0.
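The control settings of Table 7.4.5 can be captured as a simple lookup table; the values below follow the usual single-cycle MIPS design and should be checked against the notes' own table:

```python
# Main control unit as a lookup from instruction class to the nine control
# signals (seven 1-bit lines plus the 2-bit ALUOp). "X" marks don't-cares.

CONTROL = {
    "R-type": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),
    "lw":     dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),
    "sw":     dict(RegDst="X", ALUSrc=1, MemtoReg="X", RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),
    "beq":    dict(RegDst="X", ALUSrc=0, MemtoReg="X", RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),
}

print(CONTROL["lw"]["MemRead"])   # 1: a load must read data memory
```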
Operation of the Data path
Datapath for an R-type Instruction
• Fig. 7.4.4 shows the operation of the datapath for an R-type instruction. Here as an
example we have considered instruction: add $t1, $t2, $t3. The asserted control signals and
active datapath elements are highlighted. Although everything occurs in one clock cycle,
the execution of the instruction can be divided into sequential steps according to flow of
information as given below:
1. Fetch the instruction and increment the PC.
2. Read data from two registers, $t2 and $t3, and compute the settings of the control lines
using the main control unit.
3. Generate the ALU function using the function code for the addition operation (bits 5:0 of
the instruction, which are the funct field) and perform the addition on the data read from the
register file.
4. Write the result from the ALU into the register file, using bits 15:11 of the instruction
to select the destination register ($t1).
Note: The control lines, datapath units, and connections that are active are highlighted.
• Fig. 7.4.5 shows the operation of the datapath for the load word instruction. Here, as an
example, we have considered the instruction lw $t1, offset($t2). The execution of the load
instruction can be divided into sequential steps according to the flow of information as given
below:
1. Fetch the instruction and increment the PC.
2. Read data from register $t2 of register file.
3. Compute the sum of the value read from the register file and the sign-extended, lower
16 bits of the instruction (offset) using ALU.
4. Use the sum from the ALU as the address for the data memory.
5. Write the data from the memory unit into the register file; the destination register is
given by bits 20:16 of the instruction ($t1).
Note: The control lines, datapath units, and connections that are active are highlighted.
• Load and store instructions operate very similarly. The main differences are that for a store the
memory control signals indicate a write rather than a read, the value of the second
register is used as the data to store, and writing the data memory value to
the register file is not necessary.
Datapath for Branch - On - Equal Instruction
• Fig. 7.4.6 shows the operation of the datapath for branch-on-equal instruction. Here as an
example we have considered instruction: beq $t1, $t2, offset. It operates much like an R-
format instruction, but the ALU output is used to determine whether the PC is written with
PC + 4 or the branch target address. The execution of the branch-on-equal instruction can
be divided into sequential steps according to the flow of information as given below:
1. Fetch the instruction and increment the PC.
2. Read data from the two registers, $t1 and $t2.
3. Perform a subtraction in the ALU on the data read from the register file (the ALU
control for a branch is set for subtract). Add the value of PC + 4 to the sign-extended, lower
16 bits of the instruction (offset) shifted left by two to get the branch target address.
4. Use the Zero result from the ALU to decide which adder result to store into the PC.
Finalizing Control
• The control function can be precisely defined using the contents of Table 7.4.6. The
outputs are the control lines, and the input is the 6-bit opcode field, Op[5:0], i.e. bits
31:26 of the instruction. Thus, we can create a truth table for each of the outputs based on the
binary encoding of the opcodes, as shown in Table 7.4.6.
Implementing Jumps
• The jump instruction is not conditional. Fig. 7.4.7 shows the format of the jump instruction.
• Like a branch, the low-order 2 bits of a jump address are always 00two. The next lower 26
bits of this 32-bit address come from the 26-bit immediate (address) field in the
instruction. The upper 4 bits of the address that should replace the PC come from the PC
of the jump instruction plus 4. Thus, we can implement a jump by storing into the PC the
concatenation of:
• The upper 4 bits of the current PC + 4
• The 26-bit immediate field of the jump instruction
• The bits 00two
• Fig. 7.4.8 shows the datapath operation for the jump instruction. Here, additional control is
added to Fig. 7.4.3 for the jump instruction. An additional multiplexer is used to select the
source for the new PC value, which is either the incremented PC (PC + 4), the branch
target PC, or the jump target PC. One additional control signal is added for this
multiplexer. This control signal, called Jump, is asserted only when the instruction is a
jump; that is, when the opcode is 2.
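A Python sketch of the next-PC selection, combining the Branch AND Zero gate with the Jump multiplexer, is given below (all values are illustrative):

```python
# Next-PC selection: the Jump multiplexer takes priority, then the
# Branch AND Zero gate, otherwise sequential execution.

def next_pc(pc, branch, zero, jump, branch_target, instr26):
    if jump:                                  # opcode 2: unconditional jump
        return ((pc + 4) & 0xF0000000) | (instr26 << 2)
    if branch and zero:                       # beq with equal operands
        return branch_target
    return pc + 4                             # sequential execution

print(hex(next_pc(0x0040_0000, 0, 0, 1, 0, 0x3)))    # jump -> 0xc
print(hex(next_pc(0x0040_0000, 1, 1, 0, 0x1234, 0))) # branch taken -> 0x1234
```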
Reasons for not using Single - Cycle Implementation
1. It is inefficient, because the longest possible path in the processor determines the clock
cycle. Remember that the clock cycle must have the same length for every instruction in a
single-cycle design.
2. The overall performance of a single-cycle implementation is likely to be poor, since the
clock cycle is too long.
3. The penalty for using the single-cycle design with a fixed clock cycle is significant.
Single-cycle designs for floating-point unit or an instruction set with more complex
instructions do not work well at all.
4. It does not improve the worst-case cycle time. Thus it violates the great idea of making
the common case fast.
Hardwired Control
• In the hardwired control, the control units use fixed logic circuits to interpret instructions
and generate control signals from them.
• The fixed logic circuits use contents of the control step counter, contents of the
instruction register, contents of the condition code flag and the external input signals such
as MFC and interrupt requests to generate control signals.
• Fig. 7.5.1 shows the typical hardwired control unit. Here, the fixed logic circuit block
includes combinational circuit (decoder and encoder) that generates the required control
outputs, depending on the state of all its inputs.
• By separating the decoding and encoding functions, we can draw a more detailed block
diagram of the hardwired control unit, as shown in Fig. 7.5.2.
• The instruction decoder decodes the instruction loaded in the IR. If the IR is an 8-bit
register, then the instruction decoder generates 2^8, i.e. 256, output lines: one for each instruction.
According to the code in the IR, only one line among all the output lines of the decoder goes high,
i.e. is set to 1, while all other lines are set to 0.
• The step decoder provides a separate signal line for each step, or time slot, in a control
sequence. The encoder gets in the input from instruction decoder, step decoder, external
inputs and condition codes. It uses all these inputs to generate the individual control
signals.
• After execution of each instruction, an End signal is generated, which resets the control step
counter and makes it ready to generate the control steps for the next instruction.
• Let us see how the encoder generates the Yin signal for the single-bus processor organisation
shown in Fig. 7.5.3. The encoder circuit implements the following logic function to generate
Yin:
Yin = T1 + T6 · ADD + T4 · BRANCH + …
• The Yin signal is asserted during time interval T1 for all instructions, during T6 for an ADD
instruction, during T4 for an unconditional BRANCH instruction, and so on.
• As another example, the logic function to generate the Zout signal can be given by:
Zout = T2 + T7 · ADD + T6 · BRANCH + …
• The Zout signal is asserted during time interval T2 of all instructions, during T7 for an ADD
instruction, during T6 for an unconditional branch instruction, and so on.
• Figs. 7.5.3 and 7.5.4 show the hardware implementation of the logic functions for the Yin and
Zout control signals.
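These encoder equations translate directly into Boolean expressions over the one-hot timing signals and decoded instruction lines, as in the following sketch (only the terms shown in the notes are included):

```python
# Encoder equations for Yin and Zout as Boolean expressions over the
# step-decoder outputs T1..T7 and the decoded instruction lines.

def yin(T, instr):
    # Yin = T1 + T6.ADD + T4.BRANCH + ...
    return T[1] or (T[6] and instr == "ADD") or (T[4] and instr == "BRANCH")

def zout(T, instr):
    # Zout = T2 + T7.ADD + T6.BRANCH + ...
    return T[2] or (T[7] and instr == "ADD") or (T[6] and instr == "BRANCH")

T = {i: False for i in range(1, 8)}
T[6] = True                      # time slot T6 is active
print(yin(T, "ADD"))             # True: Yin asserted in T6 of ADD
print(zout(T, "BRANCH"))         # True: Zout asserted in T6 of BRANCH
```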
Example 7.5.1 Generate the logic circuit for the following function:
End = T7 · ADD + T5 · BR + (T5 · N + T4 · N̄) · BRN + …
Solution: Fig. 7.5.5 shows the circuit that generates the End control signal from this logic
function.
Advantages of hardwired control unit
• Hardwired control unit is fast because control signals are generated by combinational
circuits.
• The delay in generation of control signals depends upon the number of gates.
• It has greater chip area efficiency, since it uses less area on chip.
Disadvantages of hardwired control unit
• The more control signals the CPU requires, the more complex the design of the control
unit becomes.
• Modifying the control signals is very difficult, as it requires rearranging the wires in the
hardware circuit.
• It is difficult to correct mistakes in the original design or to add new features to an existing
design of the control unit.
Microprogrammed Control
• Every instruction in a processor is implemented by a sequence of one or more sets of
concurrent microoperations. Each microoperation is associated with a specific set of
control lines which, when activated, causes that microoperation to take place.
• Since the number of instructions and control lines is often in the hundreds, the
complexity of a hardwired control unit is very high. Thus, it is costly and difficult to design.
• Furthermore, the hardwired control unit is relatively inflexible because it is difficult to
change the design if one wishes to correct a design error or modify the instruction set.
• Microprogramming is a method of control unit design in which the control signal
selection and sequencing information is stored in a ROM or RAM called a control memory
(CM). The control signals to be activated at any time are specified by a microinstruction,
which is fetched from the CM in much the same way an instruction is fetched from main
memory.
• Each microinstruction also explicitly or implicitly specifies the next microinstruction to
be used, thereby providing the necessary information for sequencing.
• A sequence of one or more microoperations designed to control a specific operation, such
as addition or multiplication, is called a microprogram. The microprograms for all
instructions are stored in the control memory.
• The address where these microinstructions are stored in the CM is generated by the
microprogram sequencer/microprogram controller. The microprogram sequencer generates the
address for a microinstruction according to the instruction stored in the IR.
• Fig. 7.6.1 shows the microprogrammed control unit. It consists of the control memory,
control address register, microinstruction register and microprogram sequencer.
• The components of control unit work together as follows:
• The control address register (µPC) holds the address of the next microinstruction to be
read. Every time a new instruction is loaded into the IR, the output of the block labelled
"starting address generator" is loaded into the µPC.
• When the address is available in the control address register, the sequencer issues a READ
command to the control memory.
• After the READ command is issued, the word from the addressed location is read into the
microinstruction register.
• The µPC is then automatically incremented by the clock, causing successive
microinstructions to be read from the control memory.
• The contents of the microinstruction register generate the control signals, which are
delivered to various parts of the processor in the correct sequence.
• A number of times, the control unit is required to check the status of the condition codes or
external inputs to choose between alternative courses of action.
• In such situations, microprogrammed control uses conditional branch microinstructions. In
addition to the branch address, these instructions specify which of the external inputs,
condition codes, or possibly bits of the instruction register, should be checked as a
condition for branching to take place.
• Let us see the implementation of the instruction Branch<0, as shown in Fig. 7.6.2. When
this instruction is loaded into the IR, a branch microinstruction transfers control to the
corresponding microroutine, which is assumed to start at location 45 in the control
memory. This address is the output of the starting address generator block in Fig. 7.6.1.
• The microinstruction at location 45 tests the N bit of the condition codes. If this bit is
equal to 0, a branch takes place to location 0 to fetch a new machine instruction.
Otherwise, the microinstruction at location 46 is executed to put the branch target address
into register Z. The microinstruction in location 47 loads this address into the PC.
• To support microprogram branching, the starting and branch address generator block loads a
new address into the µPC when a microinstruction instructs it to do so.
• To allow implementation of conditional branching, the external inputs, condition codes and
the contents of the IR are given to the starting and branch address generator block.
• Let us see how the µPC is used in different situations:
1. When a new instruction is loaded into the IR, the µPC is loaded with the starting address
of the microroutine for that instruction.
2. When a Branch microinstruction is encountered and the branch condition is satisfied,
the µPC is loaded with the branch address.
3. When the End microinstruction is encountered, the µPC is loaded with the address of the first
Control Word (CW) in the microroutine for the instruction fetch cycle, i.e. address 0.
4. In any other situation, the µPC is incremented and the next microinstruction is fetched from
the control memory.
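The four sequencing rules above can be summarised in a small Python sketch; the microinstruction format and the starting-address table are invented placeholders:

```python
# Sketch of the microprogram sequencing rules: start address on a new
# instruction, branch when the condition holds, address 0 on End,
# otherwise increment the uPC.

STARTING_ADDRESS = {"BRANCH<0": 45}          # starting address generator

def sequencer(upc, microinstr, condition_met):
    kind = microinstr[0]
    if kind == "BRANCH" and condition_met:   # conditional microbranch
        return microinstr[1]                 # load the branch address
    if kind == "END":                        # back to the fetch microroutine
        return 0
    return upc + 1                           # default: increment the uPC

upc = STARTING_ADDRESS["BRANCH<0"]           # new instruction loaded into IR
upc = sequencer(upc, ("BRANCH", 0), condition_met=True)  # N == 0: go fetch
print(upc)                                   # 0
```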
Advantages of Microprogrammed Control
• It simplifies the design of the control unit. Thus it is both cheaper and less error-prone to
implement.
• Control functions are implemented in software rather than hardware.
• The design process is orderly and systematic.
• It is more flexible: it can be changed to accommodate new system specifications or to correct
design errors quickly and cheaply.
• Complex functions such as floating-point arithmetic can be realised efficiently.
• A new or modified instruction set for the CPU can be easily implemented by simply
rewriting or modifying the contents of the control memory.
• Faults can be easily diagnosed in the microprogrammed control unit using diagnostic
tools, by maintaining the contents of flags, registers and counters.
Disadvantages of Microprogrammed Control
• A microprogrammed control unit is somewhat slower than a hardwired control unit,
because time is required to access the microinstructions from the CM.
• The flexibility is achieved at some extra hardware cost due to the control memory and its
access circuitry.
• For a smaller CPU, the design duration of a microprogrammed control unit is greater than
that of a hardwired control unit.
Despite these disadvantages, microprogramming is the dominant technique for
implementing control units.
Example 7.6.1 Write a combined microroutine that can implement the BGT (Branch if >
0), BPL (Branch if plus), and BR (Branch Unconditionally) instructions. The branch
conditions for the BGT and BPL instructions are Z + (N XOR V) = 0 and N = 0,
respectively. What is the total number of microinstructions required? How many
microinstructions are needed if a separate microroutine is used for each machine instruction?
Solution: For a single-bus organisation, we write the microroutine for the implementation of the
BGT (Branch if > 0), BPL (Branch if plus) and BR (Branch unconditionally) instructions as
follows.
Combined Microroutine
• The first three microinstructions in the combined microroutine are used to fetch the
opcode. The total number of microinstructions required is 11. If a separate microroutine
is used for each machine instruction, 17 microinstructions are needed.
Microinstruction
• A simple way to structure microinstructions is to assign one bit position to each control
signal required in the CPU. However, this scheme has one serious drawback: assigning
individual bits to each control signal results in long microinstructions, because the number
of required signals is usually large. Moreover, only a few bits are used in any given
instruction. The solution to this problem is to group the control signals.
Grouping of Control Signals
• Grouping technique is used to reduce the number of bits in the microinstruction. This
technique is explained in the following section.
• Let us consider a single-bus CPU having different control signals, as shown in Fig. 7.6.3.
• The total number of bits after grouping is 18. Therefore, the 39-bit microinstruction is
reduced to an 18-bit microinstruction.
• Grouping of control signals results in a relatively small increase in the required hardware, as
it becomes necessary to use decoding circuits to translate the bit patterns of each group
into the actual control signals.
• Since fewer bits are required per microinstruction, less control memory space is required
for the microinstructions of an instruction.
Techniques of Grouping of Control Signals
• The grouping of control signals can be done either by using a technique called vertical
organisation or by using a technique called horizontal organisation.
• A highly encoded scheme that uses compact codes to specify only a small number of
control functions in each microinstruction is referred to as a vertical organisation.
• On the other hand, a minimally encoded scheme, in which many resources can be controlled
in parallel by a single microinstruction, is called a horizontal organisation.
• Table 7.6.1 shows the comparison between horizontal and vertical organisation.
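The saving from grouping can be illustrated by encoding one group of mutually exclusive signals and decoding it back; the group contents and widths below are illustrative:

```python
# One group of mutually exclusive control signals: an encoded field of
# ceil(log2(n)) bits replaces n one-hot control bits, and a decoder
# expands it back into individual signals.

import math

GROUP = ["none", "PCout", "MDRout", "Zout", "R0out", "R1out", "R2out", "R3out"]

def decode(field_value):
    signals = {name: False for name in GROUP[1:]}
    if GROUP[field_value] != "none":
        signals[GROUP[field_value]] = True   # at most one signal asserted
    return signals

bits_unencoded = len(GROUP) - 1              # one bit per signal
bits_encoded = math.ceil(math.log2(len(GROUP)))
print(bits_unencoded, "->", bits_encoded)    # 7 -> 3 bits for this group
print(decode(3))                             # only Zout asserted
```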
• Let us assume that the source operand can be specified in the following addressing
modes: indexed, autoincrement, autodecrement, register indirect and register direct. We
now use this instruction in conjunction with the CPU structure shown in Fig. 7.6.4 to
illustrate branching within a microprogram.
• Fig. 7.6.5 shows a flowchart of a microprogram for the ADD Rsrc, Rdst instruction.
Each box in the flowchart corresponds to a microinstruction that controls the transfers and
operations indicated within the box.
• Each microinstruction is located at the address indicated by the number above the upper
right-hand corner of its box. During the execution of the microinstruction, branching
takes place at point A. The branch address is determined by the addressing mode used
in the instruction.
• At point B, it is necessary to choose between actions required by direct and indirect
addressing modes. If the indirect mode is specified in the instruction, then the
microinstruction in location 170 is performed to fetch the operand from the memory. If the
direct mode is specified, this fetch must be bypassed by branching immediately to location
171. The remaining microoperations required to complete execution of the instruction are the
same, and hence are shared by the instructions whose operands use different addressing
modes.
• From the above discussion we can say that branching allows microinstructions to be shared
among different microprograms, and it reduces the size of the control memory.
Techniques for Modification or Generation of Branch Addresses
Bit-ORing
• In this technique, the branch address is determined by ORing a particular bit or bits with
the current address of the microinstruction.
• For example, if the current address is 170 and the branch address is 172 (both octal),
the branch address can be generated by ORing 02 (bit 1) with the current address.
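• A one-line check of the example, assuming the microinstruction addresses 170 and 172
are octal as in the flowcharts:

current = 0o170            # current microinstruction address (octal)
branch = current | 0b010   # OR bit 1 into the address
assert branch == 0o172
print(oct(branch))         # 0o172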
Using Condition Variables
• In this technique the condition variables are used to modify the contents of the CM
address register directly, thus eliminating, in whole or in part, the need for branch
addresses in the microinstructions.
• For example, let the condition variable CY indicate the occurrence of a carry when
CY = 1 and no carry when CY = 0. Suppose that we want to execute a SKIP_ON_CARRY
microinstruction. This can be done by logically connecting CY to the count-enable input of
the µPC at an appropriate point in the microinstruction cycle. This allows the carry
condition to increment the µPC an extra time, thus performing the desired skip operation.
Wide-Branch Addressing
• Generating branch addresses becomes more difficult as the number of branches
increases. In such situations a Programmable Logic Array (PLA) can be used to generate
the required branch addresses. This simple and inexpensive way of generating branch
addresses is known as wide-branch addressing.
• Here, the opcode of a machine instruction is translated into the starting address of the
corresponding micro-routine. This is achieved by connecting the opcode bits of the
instruction register as inputs to the PLA, which acts as a decoder. The output of the PLA is
the address of the desired microroutine.
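• Behaviourally, the PLA acts as a lookup from opcode bits to a microroutine starting
address. The sketch below models this with a dictionary; the opcodes and addresses are
made-up values for illustration only.

# PLA model: opcode bits -> starting address of that instruction's microroutine.
microroutine_start = {
    0b0000: 0o000,   # e.g. common fetch entry (illustrative)
    0b0001: 0o025,   # e.g. ADD microroutine (illustrative)
    0b0010: 0o040,   # e.g. MOV microroutine (illustrative)
}

opcode = 0b0001
upc = microroutine_start[opcode]   # wide branch: load the uPC in one step
print(oct(upc))                    # 0o25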
Example 7.6.1 For a single-bus organisation of the CPU, write a microprogram for the
instruction Add (Rsrc)+, Rdst.
Solution: Let us examine the path needed in the flowchart of Fig. 7.6.6 to execute the
instruction Add (Rsrc)+, Rdst.
In this instruction the source operand is accessed in the autoincrement mode, and Rsrc
and Rdst are general-purpose registers in the processor. We assume that the processor has
16 registers that can be used for addressing purposes, each specified using a 4-bit code. We
also assume that the instruction has a 3-bit field (bits 8-10) used to specify the addressing
mode for the source operand, as shown in Fig. 7.6.6. Bit patterns 11, 10, 01 and 00 located
in bits 10 and 9 denote the indexed, autodecrement, autoincrement and register modes
respectively. For each of these modes, bit 8 is used to specify the indirect version. For
example, 100 in the mode field specifies the direct version of the autodecrement mode,
whereas 101 specifies the indirect version.
As a part of execution, first the opcode and mode fields are decoded to determine which
Rsrc or Rdst register is involved. The decoded output is then used to gate the contents of the
Rsrc or Rdst fields in the IR into a second decoder, which produces the gating signals for the
actual registers R0 to R15.
The flowchart of a microprogram for the ADD Rsrc, Rdst instruction in Fig. 7.6.5 is drawn
by combining the microroutines for all possible values of the mode field, resulting in a
structure that requires many branch points. The instruction Add (Rsrc)+, Rdst requires two
branch microinstructions. In each branch microinstruction, the expression in brackets
indicates the branch address that is to be loaded into the µPC and how this address is
modified using the bit-ORing scheme. For example, the branch microinstruction at location
123 modifies the branch address 170 to 171 by ORing bit 8 of the IR into bit 0 of the µPC,
so that the operand fetch is bypassed when the direct rather than the indirect addressing
mode is specified, as shown in Fig. 7.6.7.
The address for the branch microinstruction at location 003 is generated as shown in Fig. 7.6.8.
Table 7.6.2 gives the microinstruction sequence for the execution of the Add (Rsrc)+, Rdst
instruction.
Example 7.6.2 For a single-bus organisation of the datapaths inside the CPU, write a
sequence of microinstructions and draw the flowchart of a microprogram for the following
instruction: MOV (Rsrc)+, Rdst.
Rsrc uses direct and Rdst autoincrement addressing.
Solution: Flowchart
Microinstruction Execution
• A microprogrammed computer has two distinct levels of control:
i) The instruction level and ii) The microinstruction level.
• At the instruction level, the CPU continuously executes instruction cycles that involve
the following steps:
1. The CPU fetches an instruction I from the main memory M, whose address is stored in
the program counter PC.
2. The opcode part of I is placed in the instruction register IR; the operation specified by IR
is then decoded and executed.
3. PC is altered to point to the next instruction to be fetched from M.
• A similar sequence of operations takes place at the lower microinstruction level, where
the control unit continuously executes microinstruction cycles as follows:
1. The addressing portion (microprogram sequencer) of the control unit fetches a
microinstruction MI from the control memory CM, whose address is stored in the
microprogram counter µPC.
2. MI is loaded into the microinstruction register MIR and is decoded to produce the
required control signals.
3. µPC is altered to point to the next microinstruction to be fetched from CM.
• A microinstruction cycle can be executed faster than an instruction cycle, since
microinstructions are stored within the CPU, whereas instructions must be fetched from an
external memory. Microinstructions also normally require less decoding than instructions.
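• A minimal Python sketch of the microinstruction-level loop described above. The control
memory contents, the symbolic control signals, and the end-of-routine flag are toy
assumptions, not a model of any real processor.

from dataclasses import dataclass

@dataclass
class MicroInstruction:
    signals: str          # control signals, kept symbolic here
    end_of_routine: bool  # last microinstruction of the routine?

# Toy control memory CM: one three-microinstruction routine starting at address 0.
CM = [MicroInstruction("PCout, MARin, Read", False),
      MicroInstruction("MDRout, IRin", False),
      MicroInstruction("R1out, ALU(add), Zin", True)]

def execute_routine(start_upc):
    upc = start_upc                 # microprogram counter
    while True:
        mi = CM[upc]                # 1. fetch microinstruction from CM
        print(mi.signals)           # 2. "decode" it: issue its control signals
        upc += 1                    # 3. point to the next microinstruction
        if mi.end_of_routine:       # return to the instruction level
            break

execute_routine(0)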
Pipelining
• We have seen the various steps involved in the instruction cycle. The fetch, decode and
execute steps for several instructions can be performed simultaneously to reduce overall
processing time. This process is referred to as instruction pipelining.
• To apply the concept of instruction pipelining, we must subdivide instruction processing
into a number of stages, as given below.
S1 - Fetch (F): Read instruction from the memory.
S2 - Decode (D): Decode the opcode and fetch source operand(s) if necessary.
S3 - Execute (E): Perform the operation specified by the instruction.
S4 - Store (S): Store the result in the destination.
• Here, instruction processing is divided into four stages, hence it is known as a four-stage
instruction pipeline. With this subdivision, and assuming equal duration for each stage, we
can reduce the execution time for 4 instructions from 16 time units to 7 time units. This is
illustrated in Fig. 7.8.1.
• In this instruction pipelining, four instructions are in progress at any given time. This
means that four distinct hardware units are needed, as shown in Fig. 7.8.2. These units are
implemented such that they are capable of performing their tasks simultaneously and
without interfering with one another. Information is passed from one stage to the next
with the help of buffers.
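• The 16-versus-7 figure follows from a simple count: a full pipeline needs k cycles to fill
and then retires one instruction per cycle. A quick sketch of that arithmetic (unit-time
stages, no stalls assumed):

def total_cycles(n_instructions, n_stages, pipelined=True):
    # Assumes every stage takes one time unit and no stalls occur.
    if pipelined:
        return n_stages + (n_instructions - 1)   # fill time + one result per cycle
    return n_instructions * n_stages             # strictly sequential

print(total_cycles(4, 4, pipelined=False))   # 16 time units
print(total_cycles(4, 4))                    # 7 time units, as in Fig. 7.8.1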
Solution: The six stages in the pipeline are:
1) Fetch Instruction (FI): Read the next expected instruction into a buffer.
2) Decode Instruction (DI): Determine the opcode and the operand specifiers.
3) Calculate Operands (CO): Calculate the effective address of each source operand.
4) Fetch Operands (FO): Fetch each operand from memory.
5) Execute Instruction (EI): Perform the indicated operation and store the result, if any in
the specified destination operand location.
6) Write Operand (WO): Store the result in memory.
Example 7.8.2 What is the ideal speed-up expected in a pipelined architecture with 'n'
stages? Justify your answer. AU May-07, Marks 2
Solution: A pipelined processor ideally completes the processing of one instruction in
each clock cycle, which means that the rate of instruction processing with an n-stage
pipeline is n times that of sequential operation. Therefore, the ideal speed-up factor is n.
(For m instructions, the speed-up is m·n / (n + m - 1), which approaches n as m grows
large.) However, such ideal performance is achieved only if every pipeline stage completes
its processing task for a given instruction in the time allotted. Unfortunately, this is not
always the case; pipeline operation cannot be sustained without interruption throughout
program execution.
Pipeline Stages in the MIPS Instructions
1. Fetch instruction from memory.
2. Read registers while decoding the instruction. The regular format of MIPS instructions
allows reading and decoding to occur simultaneously.
3. Execute the operation or calculate an address.
4. Access an operand in data memory.
5. Write the result into a register.
Designing Instruction Sets for Pipelining
• The following points show that the MIPS instruction set is designed for pipelined
execution.
1. All MIPS instructions are the same length. This restriction makes it much easier to fetch
instructions in the first pipeline stage and to decode them in the second stage.
2. MIPS has only a few instruction formats, with the source register fields being located in
the same place in each instruction. Due to this symmetry the second stage can begin
reading the register file at the same time that the hardware is determining what type of
instruction was fetched.
3. Memory operands only appear in loads or stores in MIPS. Due to this restriction we can
use the execute stage to calculate the memory address and then access memory in the
following stage.
4. Since operands are aligned in memory, data can be transferred between processor and
memory in a single pipeline stage.
Pipeline Hazards
• The timing diagram for instruction pipeline operation shown in Fig. 7.8.3 (b) completes
the processing of one instruction in each clock cycle. This means that the rate of
instruction processing is four times that of sequential operation.
• The potential increase in performance resulting from pipelining is proportional to the
number of pipeline stages. However, this increase would be achieved only if pipelined
operation shown in Fig. 7.8.3 (b) could be performed without any interruption throughout
program execution. Unfortunately, this is not the case.
• For a number of reasons, one of the pipeline stages may not be able to complete its
operation in the allotted time.
• Fig. 7.8.4 shows an example in which the operation specified in instruction 2 requires
three cycles to complete, from cycle 4 through cycle 6. Thus, in cycles 5 and 6, the
information in buffer B2 must remain intact until the instruction execution stage has
completed its operation. This means that stage 2, and in turn stage 1, are blocked from
accepting new instructions because the information in B1 cannot be overwritten. Thus the
decode step for instruction 4 and the fetch step for instruction 5 must be postponed, as
shown in Fig. 7.8.4.
• The instruction pipeline shown in Fig. 7.8.4 is said to have been stalled for two clock
cycles (clock cycles 5 and 6) and normal pipeline operation resumes in clock cycle 7.
• Any reason that causes the pipeline to stall is called a hazard.
Types of Hazards
1. Structural hazards: These hazards arise from resource conflicts, when the hardware
cannot support all possible combinations of instructions in simultaneous overlapped
execution.
2. Data or data-dependent hazards: These result when an instruction in the pipeline
depends on the result of a previous instruction which is still in the pipeline and not yet
completed.
3. Instruction or control hazards: These arise while pipelining branch and other
instructions that change the contents of the program counter. The simplest way to handle
these hazards is to stall the pipeline. Stalling the pipeline allows a few instructions to
proceed to completion while stopping the execution of those which would result in hazards.
Structural Hazards
• The performance of a pipelined processor depends on whether the functional units are
pipelined and whether there are multiple execution units to allow all possible combinations
of instructions in the pipeline. If, for some combination, the pipeline has to be stalled to
avoid a resource conflict, then there is a structural hazard.
• In other words, we can say that when two instructions require the use of a given
hardware resource at the same time, a structural hazard occurs.
• The most common case in which this hazard may arise is in access to memory. One
instruction may need to access memory to store its result while another instruction, or an
operand it needs, is being fetched. If instructions and data reside in the same cache unit,
only one instruction can proceed and the other instruction is delayed. To avoid this type of
structural hazard, many processors use separate caches for instructions and data.
Data Hazards
• When either the source or the destination operands of an instruction are not available at
the time expected in the pipeline and as a result pipeline is stalled, we say such a situation
is a data hazard.
• Consider a program with two instructions, I1 followed by I2. When this program is
executed in a pipeline, the execution of these two instructions can be performed
concurrently. In such a case, the result of I1 may not be available for the execution of I2.
If I2 depends on the result of I1, we may get an incorrect result when both are executed
concurrently. For example, assume A = 10 in the following two operations:
I1: A ← A + 5
I2: B ← A × 2
• When these two operations are performed in the given order, one after the other, we get
the result 30. But if they are performed concurrently, the value of A used in computing B
would be the original value, 10, leading to an incorrect result. Here the data used in I2
depend on the result of I1. The hazard due to such a situation is called a data hazard or
data-dependent hazard. To avoid incorrect results we have to execute dependent
instructions one after the other (in-order).
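• The 30-versus-20 outcome can be reproduced directly. A minimal sketch, where reading
A into a temporary before I1 writes it back stands in for the premature pipelined read:

# In-order execution (correct):
A = 10
A = A + 5        # I1
B = A * 2        # I2 reads the updated A
print(B)         # 30

# Overlapped execution with a stale operand (the hazard):
A = 10
old_A = A        # I2 reads A before I1 has written its result back
A = A + 5        # I1 completes
B = old_A * 2    # I2 uses the stale value
print(B)         # 20 -- incorrect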
Control (Instruction) Hazards
• The purpose of the instruction fetch unit is to supply the execution units with a steady
stream of instructions. This stream is interrupted when pipeline stall occurs either due to
cache miss or due to branch instruction. Such a situation is known as instruction hazard.
• Instruction hazards can cause greater degradation in performance than data
hazards.
Unconditional Branching
• Fig. 7.8.5 shows a sequence of instructions being executed in a two-stage pipeline. The
instruction I2 is a branch instruction and its target instruction is IK. In clock cycle 3, the
instruction I3 is fetched and at the same time the branch instruction (I2) is decoded and the
target address is computed. In clock cycle 4, the incorrectly fetched instruction I3 is
discarded and instruction IK is fetched. During this time the execution unit is idle and the
pipeline is stalled for one clock cycle.
Branch Penalty: The time lost as a result of a branch instruction is often referred to as the
branch penalty.
Factors affecting branch penalty
1. It is more for complex instructions.
2. For a longer pipeline, the branch penalty is more.
• In case of longer pipelines, the branch penalty can be reduced by computing the branch
address earlier in the pipeline.
Pipelined Datapath and Control
• Fig. 7.9.1 shows the general structure of a multistage pipeline. As shown in the figure,
a pipeline processor usually consists of a sequence of m data-processing circuits, called
elements, stages or segments.
• These stages collectively perform a single operation on a stream of data operands passing
through them. The processing is done part by part in each stage, but the final result is
obtained only after an operand set has passed through the entire pipeline.
• Each stage consists of two major blocks: a multiword input register and a datapath circuit.
• The multiword input registers Ri hold partially processed results as they move through
the pipeline; they also serve as buffers that prevent neighbouring stages from
interfering with one another. In each clock period, each stage processes its data and
transfers its result to the next stage.
• The m-stage pipeline processor shown in Fig. 7.9.1 can simultaneously process up to m
independent sets of data operands. Thus when the pipeline is full, m separate operations
are being executed concurrently, each in a different stage. This gives a new final result
from the pipeline every clock cycle.
• If the time required to perform a single suboperation in the pipeline is T seconds, then
for an m-stage pipeline the time required to complete a single operation is mT seconds.
This is called the delay or latency of the pipeline.
• The maximum number of operations completed per second is 1/T. This is called the
throughput of the pipeline.
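• A one-line numerical check of the two definitions, with an assumed 4-stage pipeline and
an assumed 10 ns stage time (illustrative values only):

m, T = 4, 10e-9      # 4 stages, 10 ns per stage (assumed)
latency = m * T      # time for one operand set to traverse the whole pipeline
throughput = 1 / T   # results per second once the pipeline is full
print(latency, throughput)   # 4e-08 s and 1e+08 operations/s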
Implementation of Two Stage Instruction Pipelining
• The simplest instruction pipelining breaks instruction processing into two parts: a fetch
stage S1 and an execute stage S2. When these two stages are overlapped, we get two-stage
pipelining with increased throughput.
• The Fig. 7.9.2 shows an implementation of a two-stage instruction pipeline. The fetch
stage S1 consists of the microprogram counter µPC, which is the source for
microinstruction addresses, and the control memory CM, which stores the
microinstructions.
• The execution stage S2 consists of the microinstruction register µIR, the decoders that
extract control signals from the microinstructions in µIR, and the logic for determining the
next address or the branch address.
• The microinstruction register acts as a buffer register for stage S2. With these two stages
it is possible that, while instruction Ii with address Ai is being executed by stage S2, the
instruction Ii+1 with the next consecutive address Ai+1 is fetched from memory by stage S1.
If, on executing Ii in S2, it is determined that a branch must be made to a non-consecutive
address, then the prefetched instruction Ii+1 in S1 has to be discarded. In such cases the
branch address is obtained directly from the microinstruction itself and fed back to S1. The
branch address is then loaded into the µPC and the next instruction is fetched from the
branch address.
Organization of CPU with Four Stage Instruction Pipelining
• The Fig. 7.9.3 shows the implementation of four-stage instruction pipelining. As shown
in the Fig. 7.9.3 the CPU is directly connected to a cache memory, which is split into
instruction and data parts, called the I-cache and D-cache, respectively. This splitting of
the cache permits both an instruction word and a memory data word to be accessed in the
same clock cycle.
• The four stages S1-S4 shown in Fig. 7.9.3 perform the following functions:
• S1: IF: Instruction fetching and decoding using the I-cache.
• S2: OL: Operand loading from the D-cache to the register file.
• S3: EX: Data processing using the ALU and register file.
• S4: OS: Operand storing to the D-cache from the register file.
• Stages S2 and S4 implement memory load and store operations, respectively.
• Stages S2, S3 and S4 share the CPU's local Register File (RF). The registers in the
register file act as interstage buffer registers.
• Stage S3 implements data transfer and data processing operations of the register-to-register
type using the ALU of the CPU.
Implementation of MIPS Instruction Pipeline
• Fig. 7.9.4 shows the single-cycle datapath with the pipeline stages. The instruction
execution is divided into five stages, giving a five-stage pipeline:
1. IF : Instruction fetch
2. ID: Instruction decode and register file read
3. EX: Execution or address calculation
4. MEM: Data memory access
5. WB: Write back.
• In this pipeline, all instructions advance during each clock cycle from one pipeline
register to the next. The registers are named for the two stages separated by that register.
For example, the pipeline register between the IF and ID stages is called IF/ID.
Pipeline Control
• Fig. 7.9.5 shows the single-cycle datapath with added control to the pipelined datapath.
• This datapath uses the control logic for PC source, register destination number, and ALU
control discussed in the previous section.
• Here, we need the 6-bit funct field (function code) of the instruction in the EX stage as
input to ALU control, so these bits must also be included in the ID/EX pipeline register.
• These 6 bits of the funct field are also the 6 least significant bits of the immediate field
in the instruction, so the ID/EX pipeline register can supply them from the immediate
field, since sign extension leaves these bits unchanged.
• We have already assumed that the PC is written on each clock cycle, so there is no
separate write signal for the PC. Similarly, there are no separate write signals for the
pipeline registers (IF/ID, ID/EX, EX/MEM, and MEM/WB), since the pipeline registers
are also written during each clock cycle.
• As shown in the Fig. 7.9.5, each control line is associated with a component active in
only a single pipeline stage. Thus, we can divide the control lines into five groups
according to the pipeline stage.
1. Instruction fetch: The control signals to read instruction memory and to write the PC
are always asserted, so there is nothing special to control in this pipeline stage.
2. Instruction decode/register file read: Nothing special to control in this pipeline stage.
3. Execution/address calculation: The signals to be controlled are RegDst, ALUOp,
and ALUSrc. These signals select the result register, the ALU operation, and either Read
data 2 or a sign-extended immediate as the second ALU input.
4. Memory access: The signals set in this stage are Branch, MemRead, and MemWrite.
The branch-equal, load, and store instructions set these signals, respectively. PCSrc
selects the next sequential address unless control asserts Branch and the ALU result is 0.
5. Write-back: The two control signals set in this stage are MemtoReg and RegWrite.
The MemtoReg signal decides between sending the ALU result or the memory value to
the register file, and the RegWrite signal writes the chosen value.
• The nine control lines we have seen in Fig. 7.9.6 are grouped here by pipeline stage.
Thus implementing control means setting the nine control lines in each stage for each
instruction.
• The simplest way to do this is to extend the pipeline registers to include control
information as shown in Fig. 7.9.6.
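• The idea of carrying control down the pipeline can be sketched with plain dictionaries.
The nine signal names are the ones listed above; the dictionary layout itself is only an
illustrative model, not the actual register format.

# Control values produced at decode, grouped by the stage that consumes them.
EX_CONTROLS  = {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 0}
MEM_CONTROLS = {"Branch": 0, "MemRead": 0, "MemWrite": 0}
WB_CONTROLS  = {"MemtoReg": 0, "RegWrite": 0}

# All nine values enter ID/EX; each later register keeps only what is still needed.
id_ex  = {"EX": EX_CONTROLS, "MEM": MEM_CONTROLS, "WB": WB_CONTROLS}
ex_mem = {"MEM": id_ex["MEM"], "WB": id_ex["WB"]}   # EX controls consumed in EX
mem_wb = {"WB": ex_mem["WB"]}                        # MEM controls consumed in MEM
print(mem_wb)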
• Fig. 7.9.7 shows the full datapath with the extended pipeline registers and with the
control lines connected to the control portions of the pipeline registers. The control values
for the last three stages are created during the instruction decode stage.
Handling Data Hazards
Operand Forwarding
• A simple hardware technique which can handle data hazards is called operand forwarding
or register bypassing.
• In this technique, ALU results are fed back as ALU inputs. When the forwarding logic
detects that the previous ALU operation has produced an operand needed by the current
instruction, it forwards the ALU output instead of the register-file contents.
• This is illustrated in Fig. 7.10.1. Fig. 7.10.1 (a) shows a portion of the processor datapath
involving the ALU and the register file.
• The source and result registers constitute the interstage buffers needed for pipelined
operation, as shown in Fig. 7.10.1 (b).
• Compared to the datapath shown in Fig. 7.10.1, multiplexers are added to provide inputs
to the ALU, along with the forwarding unit.
Handling Data Hazards in Software
• In this approach, software (the compiler) detects the data dependencies. If data
dependencies are found, it introduces the necessary delay between the two instructions by
inserting NOP (no-operation) instructions as follows:
I1: MUL R2, R3, R4
NOP
NOP
I2: ADD R4, R5, R6
Disadvantages of adding NOP instructions
• It leads to larger code size.
• A given processor may have several hardware implementations. NOP instructions
inserted to satisfy the requirements of one implementation may not be needed and hence
would lead to reduced performance on a different implementation.
• To achieve better performance, compilers are designed such that they can reorder
instructions to perform useful tasks in the NOP slots.
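• A minimal sketch of the compiler-side NOP insertion. It assumes instructions are tuples
of (opcode, destination, sources) with a 2-instruction result latency, matching the MUL/ADD
example above; the representation and latency are assumptions for illustration.

def insert_nops(program, latency=2):
    """program: list of (opcode, dest, (src1, src2)) tuples."""
    out = []
    for op, dest, srcs in program:
        # Look back at recently issued instructions for a producer of our sources.
        for dist, prev in enumerate(reversed(out), start=1):
            if dist > latency:
                break                      # far enough away: no delay needed
            if prev[0] != "NOP" and prev[1] in srcs:
                # Pad with NOPs until the producer is 'latency' slots behind.
                out.extend([("NOP", None, ())] * (latency - dist + 1))
                break
        out.append((op, dest, srcs))
    return out

prog = [("MUL", "R4", ("R2", "R3")),
        ("ADD", "R6", ("R4", "R5"))]
for ins in insert_nops(prog):
    print(ins)    # MUL, NOP, NOP, ADD -- same as the hand-inserted sequence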
Example 7.10.1 Convert the following code segment in C to MIPS instructions, assuming
all variables are in memory and are addressable as offsets from $t0:
a = b + e;
c = b+f; AU May-19, Marks 2
Solution :
lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1,$t2
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1,$t4
sw $t5, 16($t0)
Example 7.10.2 Find the hazards in the code segment of the previous example and reorder
the instructions to avoid any pipeline stalls.
Solution: Both add instructions have a hazard because of their respective dependence on
the immediately preceding lw instruction. It is important to note that bypassing eliminates
several other potential hazards, including the dependence of the first add on the first lw
and any hazards for store instructions. Moving up the third lw instruction to become the
third instruction eliminates both hazards:
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1,$t2
sw $t3, 12($t0)
add $t5, $t1,$t4
sw $t5, 16($t0)
Side Effects
• When the destination register of the current instruction is the source register of the next
instruction, there is a data dependency. Such a data dependency is explicit and is identified
by the register name. There are some instructions that change the contents of a register
other than the one named as the destination; for example, stack instructions (push or pop)
that use an autoincrement or autodecrement addressing mode.
• In auto increment or auto decrement addressing mode, the instruction changes the
contents of a source register used to access one of its operands. In such cases, we need to
check data dependencies for registers affected by an auto increment or auto decrement
operation along with the destination register.
• When a location other than one explicitly named in an instruction as a destination
operand is affected, the instruction is said to have a side effect.
• Another possible side effect involves the condition code flags, which are used by
instructions such as conditional branches and add-with-carry. Let us assume that R1 and
R2 hold one double-precision integer number and R3 and R4 hold another double-precision
integer number.
• The addition of these two numbers may be accomplished as follows:
ADD R1, R3
ADD-with-Carry R2, R4
• Even though the register names are different, an implicit dependency exists between
these two instructions through the carry flag. This flag is set or reset by the first instruction
and used in the second instruction, which performs the operation
R4 ← [R2] + [R4] + Carry
UNIT V MEMORY AND I/O
Memory Concepts and Hierarchy – Memory Management – Cache Memories: Mapping and
Replacement Techniques – Virtual Memory – DMA – I/O – Accessing I/O: Parallel and Serial
Interface – Interrupt – I/O Interconnection Standards: USB, SATA
Memory Concept
• Memories are made up of registers. Each register in the memory is one storage location,
also called a memory location.
• Each memory location is identified by an address. The number of storage locations can
vary from a few in some memories to hundreds of thousands in others. Each location can
accommodate one or more bits.
• Generally, the total number of bits that a memory can store is its capacity. Most of the
time the capacity is specified in terms of bytes (groups of eight bits).
• Each register consists of storage elements (flip-flops or capacitors in semiconductor
memories and magnetic domain in magnetic storage), each of which stores one bit of data.
A storage element is called a cell.
• Data are stored in a memory by a process called writing and are retrieved from the
memory by a process called reading. Fig. 8.1.1 illustrates in a very simplified way the
concepts of write, read, address and storage capacity for a generalized memory.
• As shown in Fig. 8.1.1, a memory unit stores binary information in groups of bits called
words. A word in memory is an entity of bits that moves in and out of storage as a unit. A
word consisting of a group of 8 bits is called a byte. Most computer memories use words
that are multiples of eight bits in length. Thus, a 16-bit word contains two bytes, and a
32-bit word is made up of 4 bytes.
• The communication between a memory and its environment is achieved through data
lines, address selection lines, and control lines that specify the direction of transfer.
• Fig. 8.1.2 shows the block diagram of a memory unit. The n data lines provide the
information to be stored in memory, and the k address lines specify the particular word
chosen among the many available. The two control inputs specify the direction of transfer.
• When there are k address lines we can access 2^k memory words. For example, if k = 10,
we can access 2^10 = 1024 memory words.
Illustrative Examples
Example 8.1.1 A bipolar RAM chip is arranged as 16 words of 8 bits each. How many bits
are stored in the chip?
Solution: One word = 8 bits; therefore the chip stores 16 × 8 = 128 bits.
Example 8.1.2 How many address bits are needed to operate a 2 K × 8 ROM?
Solution: 2 K memory locations = 2048 locations.
Since 2^11 = 2048, we need 11 address lines.
Example 8.1.3 How many locations are addressed using 18 address bits?
Solution: The number of locations addressed = 2^18 = 262144.
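The three examples reduce to the same relation, k = ceil(log2(number of locations)). A
short sketch of that arithmetic:

from math import ceil, log2

def address_bits(n_locations):
    return ceil(log2(n_locations))

print(address_bits(2048))   # 11 address lines for a 2 K x 8 ROM (Example 8.1.2)
print(2 ** 18)              # 262144 locations with 18 address bits (Example 8.1.3)
print(16 * 8)               # 128 bits in a 16-word x 8-bit RAM chip (Example 8.1.1)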
Characteristics of Memory
• The Table 8.1.1 lists the key characteristics of memory systems.
Physical characteristics:
Volatile/Nonvolatile: If a memory can hold data even when power is turned off, it is called
nonvolatile memory; otherwise it is called volatile memory.
Erasable/Nonerasable: Memories in which data, once programmed, cannot be erased are
called nonerasable memories. On the other hand, if the data in the memory is erasable,
then the memory is called erasable memory.
Table 8.1.2 shows the characteristics of some common memory technologies.
• The processor of a computer can usually process instructions and data faster than they
can be fetched from the memory unit. The memory cycle time, then, is the bottleneck in
the system. One way to avoid this bottleneck is to use a cache memory. Cache memory is
a small, fast memory that is inserted between the larger, slower main memory and the
processor. It usually holds the currently active segments of a program and their data.
• In most modern computers, the physical main memory is not as large as the address
space spanned by an address issued by the processor. Here, the virtual memory technique
is used to extend the apparent size of the physical memory. It uses secondary storage such
as disks, to extend the apparent size of the physical memory.
Classification of Primary Memory
• The memory devices can be classified based on following parameters:
• Principle of operation
• Physical characteristics
• Mode of access and
• Terminology used for fabrication.
• The Fig. 8.1.3 shows the classification of memory.
• Broadly semiconductor memories are classified as volatile memories and non-volatile
memories. Volatile memories can retain their state as long as power is applied. On the
other hand, non-volatile memories can hold data even if power is turned off.
• Read/Write Memories (RWMs) are those memories which allow both read and write
operations. They are used in applications where data has to change continuously. They are
also used for temporary storage of data. ROM memories allow only read operation. They
are used to store monitor programs and constants used in the program.
• The volatile memories which can hold data as long as power is ON are called Static
RAMs (SRAMs). Dynamic RAMs (DRAMs) store the data as a charge on a capacitor, and
they need refreshing of the charge on the capacitor every few milliseconds to hold the data
even while power is ON.
• EPROM and EEPROM are erasable memories in which the stored data can be erased and
new data can be stored.
• The semiconductor memories are also classified as Bipolar and MOS memories
depending upon the type of transistors used to construct the individual cell.
Classification of Secondary Memory
• A primary memory is costly and has a limited size. This memory is mainly used for
storing the currently processing data.
• Secondary storage is used to store data and instructions (programs) when they are not
being processed.
• The devices that are used as secondary storage are non-volatile and have a larger storage
capacity. Also, they are less expensive as compared to primary storage devices.
• However, they are slower in comparison. Examples are hard disks, floppies, CD-ROMs,
magnetic tapes, etc. This type of memory is also called secondary memory, auxiliary
memory or peripheral storage.
• Fig. 8.1.4 shows the classification of secondary storage devices. They can be categorized
broadly according to their access type as sequential and random (direct).
Memory Hierarchy
• Ideally, computer memory should be fast, large and inexpensive. Unfortunately, it is
impossible to meet all three of these requirements using one type of memory.
• Increased speed and size are achieved at increased cost.
• A very fast memory system can be achieved if SRAM chips are used. These chips are
expensive, and for cost reasons it is impractical to build a large main memory using
SRAM chips. The only alternative is to use DRAM chips for large main memories.
• The processor fetches code and data from the main memory to execute a program. The
DRAMs which form the main memory are slower devices, so it is necessary to insert wait
states in memory read/write cycles. This reduces the speed of execution.
• The solution to this problem comes from the fact that most computer programs work
with only small sections of code and data at a particular time. In the memory system, a
small section of SRAM is added along with the main memory, referred to as cache
memory.
• The program which is to be executed is loaded in the main memory, but the part of
program (code) and data that work at a particular time is usually accessed from the cache
memory.
• This is accomplished by loading the active part of the code and data from main memory
into the cache memory. The cache controller looks after this swapping between main
memory and cache memory with the help of the DMA controller.
• The cache memory just discussed is called secondary cache. Recent processors have the
built-in cache memory called primary cache.
• DRAMs along with a cache allow main memories in the range of tens of megabytes to
be implemented at a reasonable cost with better speed performance. But the size of this
memory is still small compared to the demands of large programs with voluminous data.
A solution is provided by using secondary storage, mainly magnetic disks and magnetic
tapes, to implement large memory spaces. Very large disks are available at a reasonable
price, sacrificing speed.
• From the above discussion, we can see that to make an efficient computer system it is
not possible to rely on a single memory component; instead we employ a memory
hierarchy, in which all the different types of memory units are combined to give an
efficient computer system. A typical memory hierarchy is shown in Fig. 8.2.1.
• In summary, we can say that a huge amount of cost-effective storage can be provided by
magnetic disks. A large, yet affordable, main memory can be built with DRAM
technology along with the cache memory to achieve better speed performance.
• Fig. 8.2.2 shows how a memory hierarchy can be employed in a computer system. As
shown in the figure, at the bottom of the hierarchy magnetic tapes and magnetic disks are
used as secondary memory. This memory is also known as auxiliary memory.
• The main memory occupies a central position by being able to communicate directly
with the CPU and with auxiliary memory devices through an I/O processor.
Fig. 8.2.3 shows common memory hierarchies with two, three and four levels.
Memory Technologies
RAM (Random Access Memories)
• Memories that consist of circuits capable of retaining their state as long as power is
applied are known as static memories.
• These are Random Access Memories (RAM) and hence are called static RAM (SRAM)
memories.
Static RAM Cell
• Fig. 8.3.1 shows the implementation of a static RAM cell. It consists of two cross-coupled
inverters forming a latch, and two transistors T1 and T2 which act as switches.
• The latch is connected to two bit lines by transistors T1 and T2. The word line controls
the opening and closing of transistors T1 and T2. When the word line is at logic 0 level
(ground level), the transistors are off and the latch retains its state.
Read operation
• For read operation, word line is made logic 1 (high) so that both transistors are ON. Now
if the cell is in state 1, the signal on bit line b is high and the signal on bit line b' is low.
The opposite is true if the cell is in state 0. The b and b' are complements of each other.
The sense/write circuits connected to the bit lines monitor the states of b and b' and set the
output accordingly.
Write operation
• For write operation, the state to be set is placed on the line b and its complement is
placed on line b' and then the word line is activated. This action forces the cell into the
corresponding state and write operation is completed.
CMOS Cell
• Fig. 8.3.2 shows the CMOS cell. Here, transistor pairs (T3, T5) and (T4, T6) form the
cross-coupled inverters and transistors T1 and T2 act as switches. The latch is connected
to the two bit lines by transistors T1 and T2.
• The word line controls the opening and closing of transistors T1 and T2. When the word
line is at logic 0 level (ground level), the transistors T1 and T2 are off and the latch retains
its state.
Read operation
• For the read operation, the word line is made high to switch on transistors T1 and T2. The
cell is in state 1 if the voltage at point X is maintained high by having transistors T3 and T6
on, while T4 and T5 are off. The cell is in state 0 if the voltage at point X is maintained low
by having transistors T3 and T6 off, while T4 and T5 are on.
Write operation
• For write operation, the state to be set is placed on the line b and its complement is
placed on line b', and then the word line is activated. This action forces the cell into the
corresponding state.
DRAM (Dynamic RAMs)
• Dynamic RAM stores the data as a charge on the capacitor. Fig. 8.3.3 shows the dynamic
RAM cell
• PROMs are one-time programmable. Once programmed, the information stored is
permanent.
EPROM (Erasable Programmable Read Only Memory)
• Erasable programmable ROMs use MOS circuitry. They store 1's and 0's as a
packet of charge in a buried layer of the IC chip.
• EPROMs can be programmed by the user with a special EPROM programmer.
• The important point is that we can erase the stored data in the EPROMS by exposing the
chip to ultraviolet light through its quartz window for 15 to 20 minutes, as shown in the
Fig. 8.3.6.
• It is not possible to erase selective information, when erased the entire information is
lost.
• The chip can be reprogrammed.
• This memory is ideally suitable for product development, experimental projects and
college laboratories, since this chip can be reused many times.
EPROM Programming :
• When erased, each cell in the EPROM contains 1. Data is introduced by selectively
programming 0's into the desired bit locations. Although only 0's will be programmed,
both 1's and 0's can be present in the data.
• During programming address and data are applied to address and data pins of the
EPROM. When the address and data are stable, program pulse is applied to the program
input of the EPROM. The program pulse duration is around 50 ms and its amplitude
depends on EPROM IC. It is typically 5.5 V to 25 V.
• In EPROM, it is possible to program any location at any time - either individually,
sequentially or at random.
EEPROM (Electrically Erasable Programmable Read Only Memory)
• Electrically erasable programmable ROMs also use MOS circuitry very similar to that of
EPROM.
• Data is stored as charge or no charge on an insulated layer or an insulated floating gate in
the device.
• The insulating layer is made very thin (<200 Å). Therefore, a voltage as low as 20 to 25
V can be used to move charges across the thin barrier in either direction for programming
or erasing.
• EEPROM allows selective erasing at the register level rather than erasing all the
information since the information can be changed by using electrical signals.
• The EEPROM also has a special chip-erase mode by which the entire chip can be erased
in 10 ms. This time is quite small compared to the time required to erase an EPROM. An
EEPROM can also be erased and reprogrammed with the device right in the circuit.
Disadvantage
• EEPROMs are the most expensive and the least dense ROMs.
Flash Memory
Flash memories are read/write memories. In flash memories it is possible to read the
contents of a single cell, but it is only possible to write an entire block of cells. A flash cell
is based on a single transistor controlled by trapped charge.
Flash devices have greater density than EEPROM memory. Due to this, flash devices have
higher capacity and a lower cost per bit. They require a single power supply voltage and
consume less power in their operation. The low power consumption of flash memory
makes it suitable for portable equipment such as hand-held computers, cell phones, digital
cameras, MP3 music players and so on. In hand-held computers and cell phones, flash
memory is used to store the software needed to operate the equipment. In digital cameras,
flash memory is used to store picture image data. In MP3 players, flash memory is used to
store the sound data.
The flash memories are available in modules. These modules are implemented in two
types: flash cards and flash drives.
Flash Cards: In this form, flash chips are mounted on a small card. Flash cards have a
standard interface that makes them usable in a variety of applications. These cards are
available in a variety of memory sizes, such as 8, 32 and 64 Mbytes.
Flash Drives: Flash drives are manufactured to replace hard-disk drives. These drives are
designed to fully emulate the hard disk, to the point that they can be fitted into standard
disk drive bays. However, the storage capacity of flash drives is significantly lower than
that of hard disk drives. Currently, the capacity of flash drives is less than 1 Gbyte. The
Table 8.3.1 shows the comparison between flash drives and hard disk drives.
Synchronous DRAM (SDRAM)
• DRAMS whose operation is directly synchronised with a clock signal are known as
synchronous DRAMS or simply SDRAMs.
• In a typical DRAM, the processor has to wait through the delay of the access time,
slowing system performance. To solve this problem, the SDRAM exchanges data with the
processor synchronized to an external clock signal.
• Advantage: This allows processor to read/write data with memory without imposing
wait states.
• In this, DRAM latches the address sent by the processor and then responds after a set
number of clock cycles.
• Advantage: Meanwhile, the processor can do other tasks while the SDRAM is
processing the request.
• Fig. 8.3.7 shows the internal architecture of the SDRAM. The cell organisation of the
SDRAM is the same as that of an asynchronous DRAM. The address and data lines are
buffered by means of buffer registers.
• It supports burst mode transfer.
• Advantage: In burst mode, after first access, no address setup or row/column line pre-
charge time is needed. This improves system performance.
• Its mode register allows burst length, burst type and latency to be customized to suit
specific system needs.
• Advantage: The SDRAMs have built in refresh circuitry.
• The Fig. 8.3.8 shows a timing diagram for a typical burst read cycle of length 4.
• In the first clock cycle, the row address is latched by activating the RAS control signal.
The memory typically takes 2 or 3 clock cycles to activate the selected row (Fig. 8.3.8
shows 2 clock cycles).
• Then, the column address is latched by activating the CAS control signal.
• After a delay of one clock cycle, the memory places the first set of data bits on the data
lines.
• The SDRAM then automatically increments the column address to access the next three
sets of data bits in the selected row in three successive clock cycles.
Double-Data-Rate SDRAMs
• The standard SDRAM performs its operations on the rising edge of the clock signal. On
the other hand, faster SDRAMs transfer data on both edges of the clock signal.
• The latency of these faster SDRAMs is the same as that of standard SDRAMs. However,
since they transfer data on both edges of the clock signal, their bandwidth is effectively
doubled for long burst transfers. Such faster SDRAMs are known as Double-Data-Rate
SDRAMs (DDR SDRAMs).
• In DDR SDRAMs, the cell array is organized in two banks as shown in Fig. 8.3.9.
• As shown in Fig. 8.3.9 it has dual bank architecture to support on-chip parallelism.
• It also supports burst mode transfer. This allows the sequential reading of a number of
bits from the same row, eliminating the address setup time for each bit.
Internal Organization of Memory Chips
• Memory cells are organised in the form of an array, in which each cell is capable of
storing one bit of information. Let us see the organisation of a memory chip using a single
decoder and using two-dimensional decoding.
Organization using Single Decoder
• Fig. 8.3.10 shows the organization of a 16 × 8 (16 words of 8 bits each) memory. Here,
the organisation is shown with detailed connections of the address lines, data lines and
control lines. The data input and the data output of each sense/write circuit are connected
to a single bidirectional data line that can be connected to the data bus of a computer. Two
control lines, R/W and CS, are provided in addition to the address and data lines. The R/W
(Read/Write) input specifies the required operation and the CS (Chip Select) input selects
a given chip in a multichip memory system.
As shown in Fig. 8.3.10, there are 16 words of 8 bits each. A memory with 16 words needs
four address lines. The four address inputs go through a 4 × 16 decoder to select one of the
sixteen words. The decoder is enabled with a memory-enable input. When the memory
enable is 0, all outputs of the decoder are 0 and none of the memory words are selected.
With the memory enable at 1, one of the sixteen words is selected, dictated by the value
on the four address lines. Once a word has been selected, the read/write input determines
the operation. During a write operation, the data available on the input lines are
transferred into the eight memory cells of the selected word. The memory cells that are not
selected are disabled and their previous binary values remain unchanged.
• Example 8.3.1 Draw the organization of a 4 K × 1 memory chip and explain.
Solution: Fig. 8.3.11 shows the organization of a 4 K × 1 memory chip. Such an
organization needs 12 address lines and one data line, and two control signals (R/W and
CS), resulting in 15 external connections. Here, the 12-bit address is divided into two
groups of 6 bits each to form the row and column addresses for the cell array. A row
address selects a row of 64 cells, all of which are accessed in parallel. However, according
to the column address, only one of these cells is connected to the external data line by the
output multiplexer and input demultiplexer.
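The row/column split is plain bit slicing. A short sketch with an arbitrary 12-bit address:

# 12-bit address split into a 6-bit row and 6-bit column, as in Fig. 8.3.11.
address = 0b101101_011010         # an arbitrary 12-bit cell address
row = (address >> 6) & 0x3F       # upper 6 bits select one of 64 rows
column = address & 0x3F           # lower 6 bits select one of 64 cells in the row
print(row, column)                # 45 26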
• In virtual memory, the addresses that the processor issues to access either instructions or
data are called virtual or logical addresses. The set of such addresses is called the address
space.
These addresses are translated into physical addresses by a combination of hardware and
software components.
• The set of physical addresses or locations is called the memory space.
Concept of Paging
• Fig. 8.5.1 shows a typical memory organization that implements virtual memory. The
memory management unit controls this virtual memory system. It translates virtual
addresses into physical addresses.
• A simple method for translating virtual addresses into physical addresses is to assume
that all programs and data are composed of fixed-length units called pages, as shown in
Fig. 8.5.2.
• Pages constitute the basic unit of information that is moved between the main memory
and the disk whenever the page translation mechanism determines that a swapping is
required.
Virtual to Physical Address Translation
• Address translation involves two phases: segment translation and page translation.
Segment translation :
A logical address (also known as a virtual address) consists of a selector and an offset. A
selector is the contents of a segment register.
• Every segment selector has a linear base address associated with it, which is stored in the
segment descriptor. A selector is used to point to a descriptor for the segment in a table of
descriptors. The linear base address from the descriptor is then added to the offset to
generate the linear address. This process is known as
segmentation or segment translation. If paging unit is not enabled then the linear address
corresponds to the physical address. But if paging unit is enabled, paging mechanism
translates the linear address space into the physical address space by paging translation.
Page translation
• Page translation is the second phase of address translation. It transforms a linear address
generated by segment translation into a physical address.
• When paging is enabled, the linear address is broken into a virtual page number and a
page offset.
Fig. 8.5.4 shows the translation of the virtual page number to a physical page number. The
physical page number constitutes the upper portion of the physical address, while the page
offset, which is not changed, constitutes the lower portion. The number of bits in the page
offset field decides the page size.
• The page table is used to keep the information about the main memory location of each
page. This information includes the main memory address where the page is stored and the
current status of the page.
• To obtain the address of the corresponding entry in the page table, the virtual page
number is added to the contents of the page table base register, which holds the starting
address of the page table. The entry in the page table gives the physical page number, to
which the page offset is appended to form the physical address in main memory.
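• A minimal sketch of the translation just described, with 4 K-word pages (12 offset bits,
as in the examples below) and a made-up page-table mapping:

PAGE_OFFSET_BITS = 12                     # 4 K-word pages

# Toy page table: virtual page number -> physical page number (illustrative).
page_table = {0: 5, 1: 2, 2: 9}

def translate(virtual_address):
    vpn = virtual_address >> PAGE_OFFSET_BITS                  # virtual page number
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)   # offset, unchanged
    ppn = page_table[vpn]                                      # page-table lookup
    return (ppn << PAGE_OFFSET_BITS) | offset

print(hex(translate(0x1ABC)))   # VPN 1 -> PPN 2, offset 0xABC, giving 0x2ABC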
Example 8.5.1 The logical address space in a computer system consists of 128 segments.
Each segment can have up to 32 pages of 4 K words each. Physical memory consists of
4 K blocks of 4 K words each. Formulate the logical and physical address formats.
Solution:
Number of bits for the segment address = log2 128 = log2 2^7 = 7 bits
Number of bits for the page address = log2 32 = log2 2^5 = 5 bits
Number of bits for the word address = log2 4096 = log2 2^12 = 12 bits
Number of bits for the block address = log2 4096 = log2 2^12 = 12 bits
Logical address format: segment (7 bits) + page (5 bits) + word (12 bits) = 24 bits.
Physical address format: block (12 bits) + word (12 bits) = 24 bits.
Example 8.5.2 An address space is specified by 32 bits and the corresponding memory
space by 24 bits.
i) How many words are there in the address space?
ii) How many words are there in the memory space?
iii) If a page consists of 4 K words, how many pages and blocks are there in the system?
Solution:
i) Words in the address space = 2^32 = 4 G words
ii) Words in the memory space = 2^24 = 16 M words
iii) Number of pages = Words in address space / Words per page = 4 G words / 4 K words
= 1 M pages
Number of blocks = Words in memory space / Words per block = 16 M words / 4 K words
= 4 K blocks
Example 8.5.3 An address space is specified by 24 bits and the corresponding memory
space by 16 bits. How many words are there in the virtual memory and in the main
memory?
Solution: Words in the address space, i.e., in the virtual memory = 2^24 = 16 M words
Words in the memory space, i.e., in the main memory = 2^16 = 64 K words
Translation Lookaside Buffer (TLB)
• To support demand paging and virtual memory, the processor has to access the page
table, which is kept in main memory. To reduce the access time and the degradation of
performance, a small portion of the page table is accommodated in the memory
management unit. This portion is called the Translation Lookaside Buffer (TLB).
• It is used to hold the page table entries that correspond to the most recently accessed
pages. When the processor finds a page table entry in the TLB, it does not have to access
the page table in memory, saving substantial access time.
• Fig. 8.5.6 shows the organization of a TLB in which the associative mapping technique
is used.
• As shown in Fig. 8.5.6, a given virtual address is compared with the TLB entries for the
referenced page by the MMU. If the page table entry for this page is found in the TLB, the
physical address is obtained immediately.
• If the entry is not found (a miss in the TLB), the required entry is obtained from the page
table in the main memory and the TLB is updated.
• It is important that the contents of the TLB be coherent with the contents of page tables
in the memory. When page table entries are changed by an operating system, it must
invalidate the corresponding entries in the TLB. One of the control bits in the TLB is
provided for this purpose. When an entry is invalidated, the TLB acquires the new
information as a part of the MMU's normal response to access misses.
Page Size
• The page size has a great effect on both storage utilization and the effective memory
data transfer rate. It is denoted by Sp.
• If Sp is too large, excessive internal fragmentation results; if it is too small, the page
tables become very large. Both tend to reduce space utilization (u), so there should be a
balance between these two effects.
• Let us denote the average segment size in words as Ss. If Ss >> Sp, the last page assigned
to a segment will on average be half empty, wasting about Sp/2 words. The size of the page
table associated with each segment is approximately Ss/Sp words, assuming each entry in
the table is a word. Hence the memory space overhead associated with each segment is
S = Sp/2 + Ss/Sp
and the space utilization is
u = Ss / (Ss + S)
• We can derive the expression for the optimum page size SpOPT by finding the value of Sp
that maximizes u or, equivalently, minimizes S. Differentiating S with respect to Sp and
setting the result to zero, we obtain
dS/dSp = 1/2 - Ss/Sp^2 = 0, hence SpOPT = √(2Ss)    ...(8.5.2)
• Substituting the value of SpOPT from equation (8.5.2), we get the optimum space
utilization as
uOPT = Ss / (Ss + √(2Ss)) = 1 / (1 + √(2/Ss))
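• A quick numerical check of the formulas above, assuming an average segment size of
64 K words (an illustrative value):

from math import sqrt

Ss = 64 * 1024                       # assumed average segment size: 64 K words
Sp_opt = sqrt(2 * Ss)                # optimum page size from equation (8.5.2)
S_min = Sp_opt / 2 + Ss / Sp_opt     # minimum overhead per segment
u_opt = Ss / (Ss + S_min)            # optimum space utilization
print(round(Sp_opt), round(u_opt, 4))   # ~362 words, utilization ~0.9945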
• Fig. 8.5.7 shows the effect of page size and segment size on space utilization. It shows
that the space utilization factor increases with increasing segment size and is maximum at
the optimum page size.
When a page fault occurs, the operating system takes the following steps:
1. It checks an internal table to determine whether the reference was a valid or an invalid
memory access.
2. If the reference was invalid, it terminates the process; if the reference was valid but the
page has not yet been brought in, it pages it in.
3. It finds a free frame.
4. It schedules a disk operation to read the desired page into the newly allocated frame.
5. When the disk read is complete, it modifies the internal table kept with the program and
the page table to indicate that the page is now in memory.
6. It restarts the instruction that was interrupted by the illegal address trap. The program
can now access the page as though it had always been in memory.
• The hardware to implement demand paging is the same as the hardware for paging and
swapping:
Page table: This table has the ability to mark an entry invalid through a valid-invalid bit
or a special value of the protection bits.
Secondary memory: This memory holds those pages that are not present in main
memory. The secondary memory is usually a hard-disk. It is known as the swap device,
and the section of disk used for this purpose is known as swap space or backing store.
Example 8.5.4: Calculate the effective access time if the average page-fault service time
is 20 milliseconds and the memory access time is 80 nanoseconds. Assume the probability
of a page fault is 10 %. Dec.-09, Marks 2
Solution: The effective access time is given as
= (1 - 0.1) × 80 ns + 0.1 × 20 ms
= (1 - 0.1) × 80 + 0.1 × 20,000,000 = 72 + 2,000,000 (nanoseconds)
= 2,000,072 nanoseconds ≈ 2 milliseconds
Page Replacement Algorithms
• If the required page is not found in the main memory, a page fault occurs and the
required page is loaded into the main memory from the secondary memory.
• However, if there is no vacant space in the main memory to copy the required page, it is
necessary to replace one of the existing pages in the main memory which is currently not
in use with the required page.
• Thus we can say that the page replacement is a basic requirement of demand paging.
• There are many different page-replacement algorithms used by various operating
systems. These are discussed in the following sections.
FIFO Algorithm
• It is the simplest page-replacement algorithm. A FIFO (First In First Out) replacement
algorithm replaces the oldest page in the main memory with the new page.
• Consider the reference string 6, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 6, 0, 1 with
three initially empty frames. The first three references (6, 0, 1) cause page faults, and the
required pages are brought into these empty frames.
The next reference (2) replaces page 6, because page 6 was brought in first. Since 0 is the
next reference and 0 is already in memory, we have no fault for this reference. The first
reference to 3 results in page 0 being replaced, since it was the first of the three pages in
memory (0, 1 and 2) to be brought in. This process continues as shown in Fig. 8.5.9.
• The FIFO page-replacement algorithm is easy to understand and program. However, its
performance is not always good. It may replace the most-needed page simply because it is
the oldest page.
Optimal algorithm
• Advantage: An optimal page-replacement algorithm has the lowest page-fault rate of all
algorithms. It has been called OPT or MIN.
• It simply states that "Replace the page that will not be used for the longest period of
time".
• For example, on our sample reference string, the optimal page replacement algorithm
would yield nine page faults, as shown in Fig. 8.5.10. The first three references cause
faults that fill the three empty frames.
• The reference to page 2 replaces page 6, because 6 will not be used until reference 18,
whereas page 0 will be used at 5, and page 1 at 14. The reference to page 3 replaces page
1, as page 1 will be the last of the three pages in memory to be referenced again. With only
nine page faults, optimal replacement is much better than a FIFO algorithm, which had 15
faults. (If we ignore the first three, which all algorithms must suffer, then optimal
replacement is twice as good as FIFO replacement.) In fact, no replacement algorithm can
process this reference string in three frames with less than nine faults.
• Disadvantage: Unfortunately, the optimal page-replacement algorithm is difficult to implement, because it requires future knowledge of the reference string.
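The "replace the page whose next use is farthest away" rule can also be simulated directly. The sketch below is mine (using the same assumed sample string as before); with three frames it yields the nine faults quoted above:

def opt_faults(ref_string, n_frames):
    """Count page faults under the optimal (OPT/MIN) policy: on a fault
    with full frames, evict the page whose next use is farthest away."""
    frames, faults = [], 0
    for i, page in enumerate(ref_string):
        if page in frames:
            continue
        faults += 1
        if len(frames) == n_frames:
            rest = ref_string[i + 1:]
            # Pages never referenced again sort after every other page
            victim = max(frames,
                         key=lambda p: rest.index(p) if p in rest else len(rest))
            frames.remove(victim)
        frames.append(page)
    return faults

refs = [6, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 6, 0, 1]
print(opt_faults(refs, 3))           # -> 9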
LRU algorithm
• The algorithm which replaces the page that has not been used for the longest period of
time is referred to as the Least Recently Used (LRU) algorithm.
• The result of applying LRU replacement to our example reference string is shown in Fig.
8.5.11.
• When the reference to page 4 occurs, LRU replacement sees that, of the three frames in
memory, page 2 was used least recently. The most recently used page is page 0, and just
before that, page 3 was used. Thus, the LRU algorithm replaces page 2, not knowing that page 2 is about to be used.
• When it then faults for page 2, the LRU algorithm replaces page 3 since, of the three
pages in memory {0, 3, 4}, page 3 is the least recently used.
• LRU replacement with 12 faults is still much better than FIFO replacement with 15.
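A minimal LRU simulator, again assuming the same sample string, reproduces the 12 faults mentioned above. The list order is used to track recency; this is a sketch, not an efficient implementation:

def lru_faults(ref_string, n_frames):
    """Count page faults under least-recently-used replacement."""
    frames, faults = [], 0           # last list entry = most recently used
    for page in ref_string:
        if page in frames:
            frames.remove(page)      # hit: refresh the page's recency
        else:
            faults += 1
            if len(frames) == n_frames:
                frames.pop(0)        # evict the least recently used page
        frames.append(page)
    return faults

refs = [6, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 6, 0, 1]
print(lru_faults(refs, 3))           # -> 12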
Counting algorithms
• In counting algorithms, a count of the number of references that have been made to each page is kept, and based on these counts the following two schemes work.
• LFU algorithm: The Least Frequently Used (LFU) page replacement algorithm requires
that the page with the smallest count be replaced.
• The reason for this selection is that an actively used page should have a large reference
count.
• This algorithm suffers from the situation in which a page is used heavily during the
initial phase of a process, but then is never used again.
• MFU algorithm: The Most Frequently Used (MFU) page replacement algorithm is
based on the argument that the page with the smallest count was probably just brought in
and has yet to be used.
Example 8.5.5: Explain page replacement algorithms. Find out the number of page faults for the following string using the LRU method:
6 0 1 2 0 3 0 4 2 3 0 3 2 1 2 0 1 5
Consider a page frame size of 3.
Solution: (The frame-by-frame LRU trace appears as a figure in the original notes; working this string through with 3 frames gives 12 page faults.)
• The cache controller implements the cache logic. If the processor finds that the addressed code or data is not available in the cache (the condition referred to as a cache miss), the desired memory block is copied from main memory to the cache by the cache controller. The cache
controller decides which memory block should be moved in or out of the cache and in or
out of main memory, based on the requirements. (The cache block is also known as cache
slot or cache line.)
• The percentage of accesses where the processor finds the code or data word it needs in
the cache memory is called the hit rate/hit ratio. The hit rate is normally greater than 90
percent.
Hit rate = (Number of hits / Total number of bus cycles) × 100 %
Example 8.4.1: The application program in a computer system with cache uses 1400 instruction acquisition bus cycles from cache memory and 100 from main memory. What is the hit rate? If the cache memory operates with zero wait states and the main memory bus cycles use three wait states, what is the average number of wait states experienced during the program execution?
Solution: Hit rate = 1400 / (1400 + 100) × 100 = 93.33 %
Total wait states = 1400 × 0 + 100 × 3 = 300
Average wait states = Total wait states / Number of memory bus cycles
= 300 / 1500 = 0.2
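The same arithmetic in Python (illustrative only):

cache_cycles, memory_cycles = 1400, 100
total_cycles = cache_cycles + memory_cycles

hit_rate = cache_cycles / total_cycles * 100              # percentage of hits
avg_wait = (cache_cycles * 0 + memory_cycles * 3) / total_cycles
print(f"hit rate = {hit_rate:.2f} %, average wait states = {avg_wait}")
# -> hit rate = 93.33 %, average wait states = 0.2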
Most Commonly used Cache Organizations
• Two most commonly used system organizations for cache memory are :
• Look-aside and
• Look-through
Look-aside system organization
• The Fig. 8.4.2 shows the look-aside cache organization. Here, the cache and the main memory are directly connected to the system bus.
• In this system, the CPU initiates a memory access by placing a physical address on the
memory address bus at the start of read or write cycle.
• The cache memory M1 immediately compares physical address to the tag addresses
currently residing in its tag memory. If a match is found, i.e., in case of cache hit, the
access is completed by a read or write operation executed in the cache. The main memory
is not involved in the process of read or write.
• If match is not found, i.e., in case of cache miss, the desired access is completed by a
read or write operation directed to M2. In response to cache miss, a block of data that
includes the target address is transferred from M2 to M1. The system bus is used for this
transfer and hence it is unavailable for other uses like I/O operations.
Look-through system organization
• The Fig. 8.4.3 shows look-through system of cache organization. Here, the CPU
communicates with cache via a separate (local) bus which is isolated from the main system
bus. Thus during cache accesses, the system bus is available for use by other units, such as
I/O controllers, to communicate with main memory.
• Unlike the look-aside system, look-through cache system does not automatically send all
memory requests to main memory; it does so only after a cache miss.
• A look-through cache system uses a wider local bus to link M1 and M2, thus speeding up cache-main-memory transfers (block transfers).
• Look-through cache system is faster.
Disadvantages:
• It is complex.
• It is costly.
• It takes longer time for M2 to respond to the CPU when a cache miss occurs.
Cache Read Operation
• The Fig. 8.4.4 shows a small cache system. Here each cache block is 4 bytes and each memory address is 10 bits long. The 8 high-order bits therefore form the tag or block address and the 2 low-order bits define a displacement address within the block.
• When a block is assigned to the cache data memory, its tag is also placed in the cache tag
memory.
• During read operation, the 8 high-order bits of an address are compared with stored tags
in the cache tag memory to find match (cache hit). The stored tag pinpoints the
corresponding block in cache data memory and the 2-bit displacement is used to read the
target word.
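The tag/displacement split for this 10-bit example amounts to a couple of shift-and-mask operations. A small Python sketch (names are mine, for illustration only):

# 10-bit address = 8-bit tag (block address) + 2-bit displacement
def split_address(addr):
    tag = addr >> 2                  # high-order 8 bits: block (tag) address
    displacement = addr & 0b11       # low-order 2 bits: byte within the block
    return tag, displacement

print(split_address(0b1011011110))   # -> (183, 2), i.e. tag 10110111, byte 2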
Cache Write Operation
• The Fig. 8.4.5 shows the execution of a cache write operation. It uses the same addressing technique as in the case of a read operation.
• When a cache hit occurs, the new data, in this case E6, is stored at the location pointed to by the address in the cache data memory, thereby overwriting the old data 5A.
• Now the data in the cache data memory and the data in the main memory for the given address differ. This causes the cache consistency problem.
Program Locality
• In a cache memory system, prediction of the memory location of the next access is essential. This is possible because computer systems usually access memory at consecutive locations. This prediction of the next memory address from the current memory address is known as program locality.
• Program locality enables cache controller to get a block of memory instead of getting just
a single address.
• The principle of program locality may not work properly when program executes JUMP
and CALL instructions. In case of these instructions, program code is not in sequence.
Locality of Reference
• We know that a program may contain a simple loop, nested loops or a few procedures that repeatedly call each other. The point is that many instructions in a localized area of the program are executed repeatedly during some time period while the remainder of the program is accessed relatively infrequently. This is referred to as locality of reference.
• It manifests itself in two ways: temporal and spatial.
• The temporal means that a recently executed instruction is likely to be executed again
very soon.
• The spatial means that instructions stored near the recently executed instruction are also likely to be executed soon.
• The temporal aspect of the locality of reference suggests that whenever an instruction or
data is first needed, it should be brought into the cache and it should remain there until it is
needed again.
• The spatial aspect suggests that instead of bringing just one instruction or data item from the main memory to the cache, it is wise to bring several instructions and data items that reside at adjacent addresses as well. We use the term block to refer to a set of contiguous addresses of some size.
Block Fetch
• Block fetch technique is used to increase the hit rate of cache.
• A block fetch can retrieve the data located before the requested byte (look-behind) or the data located after the requested byte (look-ahead) or both.
• When CPU needs to access any byte from the block, entire block that contains the
needed byte is copied from main memory into cache.
• The size of the block is one of the most important parameters in the design of a cache
memory system.
Block size
1. If the block size is too small, the look-ahead and look-behind are reduced and therefore
the hit rate is reduced.
2. Larger blocks reduce the number of blocks that fit into a cache. As the number of blocks decreases, block rewrites from main memory become more likely.
3. With a very large block, the ratio of required data to useless data fetched decreases.
4. Bus size between the cache and the main memory increases with block size to
accommodate larger data transfers between main memory and the cache, which increases
the cost of cache memory system.
Elements of Cache Design
• The cache design elements include cache size, mapping function, replacement algorithm, write policy, block size and number of caches.
Cache size: The size of the cache should be small enough so that the overall average cost
per bit is close to that of main memory alone and large enough so that the overall average
access time is close to that of the cache alone.
Mapping function: The cache memory can store a reasonable number of blocks at any
given time, but this number is small compared to the total number of blocks in the main
memory. Thus we have to use mapping functions to relate the main memory blocks and
cache blocks. There are two commonly used mapping functions: direct mapping and associative mapping. A detailed description of the mapping functions is given in section 8.4.6.
Replacement algorithm: When a new block is brought into the cache, one of the existing blocks must be replaced by the new block.
• There are four most common replacement algorithms:
• Least-Recently-Used (LRU)
• First-In-First-Out (FIFO)
• Least-Frequently-Used (LFU)
• Random
• The cache design changes according to the choice of replacement algorithm. A detailed description of the replacement algorithms is given in section 8.4.8.
Write policy: It is also known as cache updating policy. In cache system, two copies of
the same data can exist at a time, one in cache and one in main memory. If one copy is
altered and other is not, two different sets of data become associated with the same
address. To prevent this, the cache system has updating systems such as: write through
system, buffered write through system and write-back system. The choice of cache write policy also changes the design of the cache.
Block size: It should be optimum for cache memory system.
Number of caches: When the on-chip cache is insufficient, a secondary cache is used. The cache design changes as the number of caches used in the system changes.
Mapping
• Usually, the cache memory can store a reasonable number of memory blocks at any
given time, but this number is small compared to the total number of blocks in the main
memory. The correspondence between the main memory blocks and those in the cache is
specified by a mapping function.
• The mapping techniques are classified as :
1. Direct-mapping technique
2. Associative-mapping technique
• Fully-associative
• Set-associative techniques.
• To discuss these cache mapping techniques, we consider a cache consisting of 128 blocks of 16 words each, for a total of 2048 (2 K) words, and assume that the main memory has 64 K words. These 64 K words of main memory are addressable by a 16-bit address and can be viewed as 4 K blocks of 16 words each. A group of 128 blocks of 16 words each in main memory forms a page.
Direct-Mapping
• It is the simplest mapping technique.
• In this technique, each block from the main memory has only one possible location in the
cache organization. In this example, block i of the main memory maps onto block j (j = i modulo 128) of the cache, as shown in Fig. 8.4.6. Therefore, whenever one of the main memory blocks 0, 128, 256, ... is loaded into the cache, it is stored in cache block 0; blocks 1, 129, 257, ... map to cache block 1, and so on.
In general the mapping expression is
j = i modulo m
where
i =Main memory block number
j=Cache block (line) number
m= Number of blocks (lines) in the cache
• To implement such cache system, the address is divided into three fields, as shown in
Fig. 8.4.6.
• The lower order 4-bits select one of the 16 words in a block. This field is known as word
field.
• The second field known as block field is used to distinguish a block from other blocks.
Its length is 7 bits since 2^7 = 128.
• When a new block enters the cache, the 7-bit cache block field determines the cache
position in which this block must be stored.
• The third field is a tag field. It is used to store the high-order 5 bits of the memory address of the block. These 5 bits (tag bits) are used to identify which of the 32 pages the block mapped into the cache comes from.
• When memory is accessed, the 7-bit cache block field of each address generated by CPU
points to a particular block location in the cache. The high-order 5-bits of the address are
compared with the tag bits associated with that cache location. If they match, then the
desired word is in that block of the cache. If there is no match, then the block containing
the required word must first be read from the main memory and loaded into the cache.
• This means that to determine whether the requested word is in the cache, only the tag field needs to be compared. This requires only one comparison.
• The main drawback of a direct-mapped cache is that if the processor needs to access the same memory location from two different pages of the main memory frequently, the controller has to access main memory frequently, since only one of these locations can be in the cache at a time. For example, if the processor wants to access memory location 100H from page 0 and then from page 2, the cache controller has to access page 2 of the main memory. Therefore, we can say that a direct-mapped cache is easy to implement; however, it is not very flexible.
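A minimal sketch of the field extraction and single tag comparison for this example geometry (16-bit address: 5-bit tag, 7-bit block, 4-bit word) may make the lookup concrete. The function names and the tags table are mine, for illustration only:

# 16-bit address = 5-bit tag | 7-bit block | 4-bit word
def direct_map_fields(addr):
    word  = addr & 0xF               # low 4 bits: word within the block
    block = (addr >> 4) & 0x7F       # next 7 bits: cache block number
    tag   = addr >> 11               # high 5 bits: which of the 32 pages
    return tag, block, word

tags = [None] * 128                  # one stored tag per cache block

def is_hit(addr):
    tag, block, _ = direct_map_fields(addr)
    return tags[block] == tag        # a single comparison decides hit or miss

tags[0] = 2                          # pretend block 0 holds a block from page 2
print(is_hit((2 << 11) | 0x0005))    # -> True (page 2, block 0, word 5)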
Associative-Mapping (Fully-Associative Mapping)
• The Fig. 8.4.7 shows the associative-mapping technique. In this technique, a main memory block can be placed into any cache block position. As there is no fixed block position, the memory address has only two fields: word and tag. This technique is also referred to as fully-associative cache.
• The 12 tag bits are required to identify a memory block when it is resident in the cache.
The high-order 12-bits of an address received from the CPU are compared to the tag bits
of each block of the cache to see if the desired block is present.
• Once the desired block is present, the 4-bit word is used to identify the necessary word
from the cache.
• This technique gives complete freedom in choosing the cache location in which to place
the memory block. Thus, the memory space in the cache can be used more efficiently.
• A new block that has to be loaded into the cache has to replace (remove) an existing
block only if the cache is full.
• In such situations, it is necessary to use one of the possible replacement algorithm to
select the block to be replaced.
• Disadvantage: In an associative-mapped cache, it is necessary to compare the higher-order bits of the main memory address with all 128 tags, one for each block, to determine whether a given block is in the cache. This is the main disadvantage of the associative-mapped cache.
Set-Associative Mapping
• The set-associative mapping is a combination of both direct and associative mapping.
• It contains several groups of direct-mapped blocks that operate as several direct-mapped
caches in parallel.
• A block of data from any page in the main memory can go into a particular block
location of any direct-mapped cache. Hence the contention problem of the direct-mapped
technique is eased by having a few choices for block placement.
• The required address comparisons depend on the number of direct-mapped caches in the
cache system. These comparisons are always less than the comparisons required in the
fully-associative mapping.
• Fig. 8.4.8 shows a two-way set-associative cache. Each page in the main memory is organized in such a way that the size of each page is the same as the size of one direct-mapped cache. It is called a two-way set-associative cache because each block from main memory has two choices for block placement.
• In this technique, blocks 0, 64, 128, ..., 4032 of main memory can map into either of the two blocks of set 0; blocks 1, 65, 129, ..., 4033 of main memory can map into either of the two blocks of set 1, and so on.
• As there are two choices, it is necessary to compare address of memory with the tag bits
of corresponding two block locations of particular set. Thus for two-way set-associative
cache, we require two comparisons to determine whether a given block is in the cache.
• Since there are two direct-mapped caches, any two bytes having same offset from
different pages can be in the cache at a time. This improves the hit rate of the cache
system.
• To implement set-associative cache system, the address is divided into three fields, as
shown in Fig. 8.4.8.
• The 4-bit word field selects one of the 16 words in a block.
• The set field needs 6 bits to select one of the 64 sets. There are now 64 pages, so to identify which of the 64 pages a block belongs to, six tag bits are required.
Example 8.4.2 Consider a cache consisting of 256 blocks of 16 words each, for a total of
4096 (4 K) words and assume that the main memory is addressable by a 16-bit address and
it consists of 4 K blocks. How many bits are there in each of the TAG, BLOCK/SET and
word fields for different mapping techniques ?
Solution: We know that memory address is divided into three fields. We will now find the
exact bits required for each field in different mapping techniques.
a) Direct-mapping
Word bits: We know that each block consists of 16 words. Therefore, to identify each word we must have four bits (2^4 = 16) reserved for it.
Block bits: The cache memory consists of 256 blocks and, using the direct-mapped technique, block k of the main memory maps onto block k modulo 256 of the cache. This has a one-to-one correspondence and requires a unique address for each block. To address 256 blocks we require eight bits (2^8 = 256).
Tag bits: The remaining 4 (16 − 4 − 8) address bits are tag bits which store the higher-order address bits of the main memory.
The main memory address for direct-mapping technique is divided as shown below:
b) Associative-mapping
Word bits: The word length will remain same i.e. 4 bits.
In the associative-mapping technique, each block in the main memory is identified by the
tag bits and an address received from the CPU is compared with the tag bits of each block
of the cache to see if the desired block is present. Therefore, this type of technique does
not have block bits, but all remaining bits (except word bits) are reserved as tag bits.
Block bits: 0
Tag bits: To address each block in the main memory, 12 bits (2^12 = 4096) are required; therefore, there are 12 tag bits.
The main memory address for the associative-mapping technique is divided as shown below:
c) Set-associative mapping
Let us assume that there is a 2-way set-associative mapping. Here, the cache memory is mapped with two blocks per set. The set field of the address determines which set of the cache might contain the desired block.
Word bits: The word length will remain same i.e. 4 bits.
Set bits: There are 128 sets (256/2). To identify each set, seven bits (2^7 = 128) are required.
Tag bits: The remaining 5 (16 − 4 − 7) address bits are the tag bits which store the higher-order address bits of the main memory.
The main memory address for 2-way set associative mapping technique is divided as
shown below:
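All three field splits in this example follow one pattern: index bits = log2(number of sets), word bits = log2(words per block), and the tag takes whatever remains of the address. A small Python helper (my own, not from the text) checks the three answers above:

import math

def field_widths(cache_blocks, words_per_block, addr_bits, ways=None):
    """Tag/index/word widths; ways=None means fully associative,
    ways=1 direct-mapped, ways=k a k-way set-associative cache."""
    word = int(math.log2(words_per_block))
    index = 0 if ways is None else int(math.log2(cache_blocks // ways))
    return addr_bits - index - word, index, word     # (tag, index, word)

print(field_widths(256, 16, 16, ways=1))   # direct:          (4, 8, 4)
print(field_widths(256, 16, 16))           # associative:     (12, 0, 4)
print(field_widths(256, 16, 16, ways=2))   # 2-way set-assoc: (5, 7, 4)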
Example 8.4.3: A block set-associative cache consists of 64 blocks divided into sets of 4 blocks each. The main memory contains 4096 blocks, each consisting of 128 words of 16 bits length:
i) How many bits are there in main memory?
ii) How many bits are there in each of the TAG, SET and WORD fields?
Solution: i) Number of bits in main memory:
= Number of blocks x Number of words per block x Number of bits per word
= 4096 × 128 × 16
= 8388608 bits
ii) Number of bits in word field :
There are 128 words in each block. Therefore, to identify each word, 7 bits (2^7 = 128) are required.
iii) Number of bits in set field: There are 64 blocks and each set consists of 4 blocks. Therefore, there are 16 (64/4) sets. To identify each set, four bits (2^4 = 16) are required.
iv) Number of bits in tag field: The total words in the memory are :
4096 x 128 = 524288
To address these words we require 19 address lines (2^19 = 524288). Therefore, the tag bits are eight (19 − 7 − 4).
Example 8.4.4: A digital computer has a memory unit of 64 K × 16 and a cache memory of 1 K words. The cache uses direct mapping with a block size of four words. How many bits are there in the tag, block and word fields of the address format?
Solution: Word bits: Number of word bits = log2 4 = 2 bits
Block bits: Number of blocks = cache size / words in each block = 1K / 4 = 256
Number of block bits = log2 256 = 8 bits
Tag bits: The 64 K main memory needs a 16-bit address, so tag bits = 16 − 8 − 2 = 6 bits.
Example 8.4.5: A two-way set-associative cache memory uses blocks of four words. The cache can accommodate a total of 2048 words from main memory. The main memory size is 128 K × 32.
i) How many bits are there in the tag, set and word fields of the address format?
ii) What is the size of the cache memory?
Solution: i) Number of bits in main memory address = log2 128K = log2 2^17 = 17 bits
Number of blocks in the cache memory = 2048 / 4 = 512 blocks
Number of sets in the cache memory = 512 / 2 = 256 sets
Number of bits in set field = log2 256 = 8 bits
Number of bits in word field = log2 4 = 2 bits
Number of bits in tag field = 17 − 8 − 2 = 7 bits
ii) Size of the cache memory = 2048 words × 32 bits = 64 Kbits (8 Kbytes) of data, excluding tag storage.
Example 8.4.6: A direct-mapped cache has the following parameters: cache size = 1 K words, block size = 128 words and main memory size = 64 K words. Specify the number of bits in the TAG, BLOCK and WORD fields of the main memory address.
Solution: Word bits = log2 128 = 7 bits
Number of blocks = cache size / words in each block = 1K / 128 = 8
Number of block bits = log2 8 = 3 bits
Number of address bits to address main memory = log2 64K = log2 2^16 = 16 bits
Tag bits = 16 − 3 − 7 = 6 bits
Example 8.4.7: How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address?
Solution:
16 KB = 4K words = 2^12 words
With a block size of 4 words there are 2^10 blocks.
Each block has 4 × 32 = 128 bits of data + tag + valid bit
Tag + valid bit = (32 − 10 − 2 − 2) + 1 = 19
Total cache size = 2^10 × (128 + 19) = 2^10 × 147
Therefore, 147 Kbits are needed for the cache.
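The same count, spelled out in Python (illustrative only):

blocks     = 2 ** 10                    # 16 KB of data / 16 bytes per block
data_bits  = 4 * 32                     # 128 data bits per 4-word block
tag_bits   = 32 - 10 - 2 - 2            # address - index - block - byte offset
total_bits = blocks * (data_bits + tag_bits + 1)   # +1 for the valid bit
print(total_bits)                       # -> 150528 = 2**10 * 147 (147 Kbits)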
Example 8.4.8 You have been asked to design a cache with the following properties:
1) Data words are 32 bits each.
2) A cache block will contain 2048 bits of data.
3) The cache is direct mapped.
4) The address supplied from the CPU is 32 bits long.
5) There are 2048 blocks in the cache.
6) Addresses are to the word.
In Fig. 8.4.11 below, there are 8 fields (labeled a, b, c, d, e, f, g and h); you will need to indicate the proper name or number of bits for each portion of this cache configuration.
Solution:
f. (name) - You are being asked to show which parts of a physical address form the index, offset, and tag. < f > refers to the most significant bits of the address - so this is the tag.
g. (name) - It follows that the next part of the address is the index.
h. (name) - The least significant bits form the offset.
c. (number) - There are 2^11 bits / block and there are 2^5 bits / word. Thus there are 2^6 words / block, so we need 6 bits of offset.
b. (number) - There are 2^11 blocks and the cache is direct mapped (or "1-way set associative"). Therefore, we need 11 bits of index.
a. (number) - The remaining bits form the tag. Thus, 32 - 6 - 11 = 15 bits of tag.
d. (number) - Field < d > refers to the fact that a tag must be stored in each block. Thus, 15
bits are kept in each block.
e. (number) - Field < e > asks you to specify the total number of bits / block. This is 2048.
We need to compare the valid bit associated with the block, the tag stored in the block, and
the tag associated with the physical address to determine if the cache entry is useable or
not. The tags should be the same and the valid bit should be 1.
Cache Size
There are 2048 blocks in the cache and there are 2048 bits / block. There are 8 bits / byte. Thus, there are 256 bytes / block.
2048 blocks × 256 bytes / block = 2^19 bytes (or 0.5 MB)
Comparison between Mapping Techniques
Cache Coherency
• In a single-CPU system, two copies of the same data, one in cache memory and another in main memory, may become different. This data inconsistency is called the cache coherence problem.
• Cache updating systems eliminate data inconsistency in the main memory caused by cache write operations.
• In multiprocessor systems, another bus master can take over control of the system bus. This bus master could write data into main memory blocks which are already held in the cache of another processor. When this happens, the data in the cache no longer match those held in main memory, creating inconsistency.
• The 80386 supports four different approaches to prevent data inconsistency, that is to
protect cache coherency:
1. Bus watching (snooping) 2. Hardware transparency
3. Non-cacheable memory 4. Cache flushing
Bus watching: In bus watching, cache controller invalidates the cache entry, if another
master writes to a location in shared memory which also resides in the cache memory. Fig.
8.4.12 shows bus watching.
Hardware transparency: In hardware transparency, accesses of all devices to the main memory are routed through the same cache, or all cache writes are copied both to the main memory and to all other caches that share the same memory. Fig. 8.4.13 shows a hardware transparent system.
Non-cacheable memory: The 80386DX can partition its main memory into a cacheable
and non-cacheable memory. By designing shared memory as non-cacheable memory
cache coherency can be maintained, since shared memory is never copied into cache.
Cache flushing: To avoid data inconsistency, a cache flush writes any altered data to the
main memory and caches in the system are flushed before a device writes to shared
memory.
Cache Updating Policies
• In a cache system, two copies of same data can exist at a time, one in cache and one in
main memory. If one copy is altered and other is not, two different sets of data become
associated with the same address. To prevent this, the cache system has updating systems
such as: write through system, buffered write through system and write-back system.
Write through System
• The cache controller copies data to the main memory immediately after it is written to the cache. Due to this, main memory always contains valid data and thus any block in the cache can be overwritten immediately without data loss.
• The write through is a simple approach.
• This approach requires time to write data to main memory and increases bus traffic.
• This in effect reduces system performance.
Buffered Write through System
• In buffered write through system, the processor can start a new cycle before the write
cycle to the main memory is completed. This means that the write accesses to the main
memory are buffered.
• In such systems, a read access which is a "cache hit" can be performed simultaneously while the main memory is updated.
• However, two consecutive write operations to the main memory or read operation with
cache "miss" require the processor to wait.
Write-Back System
• In a write-back system, the alter (update) bit in the tag field is used to keep information
of the new data. If it is set, the controller copies the block to main memory before loading
new data into the cache.
• Due to the one-time write operation, the number of write cycles is reduced in a write-back system. But this system has the following disadvantages:
• Write-back cache controller logic is more complex.
• It is necessary that, all altered blocks must be written to the main memory before another
device can access these blocks in main memory.
• In case of power failure, the data in the cache memory is lost, so there is no way to tell
which locations of the main memory contain old data. Therefore, the main memory as well
as cache must be considered volatile and provisions must be made to save the data in the
cache.
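A minimal sketch of the write-back idea, with the alter (dirty) bit deciding whether the old block must be copied back before it is overwritten. The class and function names are mine, for illustration only:

class Block:
    def __init__(self):
        self.tag, self.data, self.dirty = None, None, False

def write_back_store(block, tag, data, memory):
    """Write into a cache block; copy the old contents back to main
    memory first if the alter (dirty) bit says they were modified."""
    if block.tag is not None and block.tag != tag and block.dirty:
        memory[block.tag] = block.data   # one deferred write to memory
    block.tag, block.data, block.dirty = tag, data, True

memory = {}
b = Block()
write_back_store(b, tag=7, data="AA", memory=memory)
write_back_store(b, tag=9, data="BB", memory=memory)  # evicts dirty block 7
print(memory)    # -> {7: 'AA'}: memory is written only on eviction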
Replacement Algorithms
• When a new block is brought into the cache, one of the existing blocks must be replaced by the new block.
• In case of direct-mapping cache, we know that each block from the main memory has
only one possible location in the cache, hence there is no choice. The previous data is
replaced by the data from the same memory location from new page of the main memory.
• For associative and set-associative techniques, there is a choice of which existing block to replace. The block chosen for replacement should be one whose probability of being accessed again is very low. The replacement algorithms do the task of selecting the existing block to be replaced.
• There are four most common replacement algorithms:
• Least-Recently-Used (LRU)
• First-In-First-Out (FIFO)
• Least-Frequently-Used (LFU)
• Random
• Least-Recently-Used: In this technique, the block in the set which has been in the cache longest with no reference to it is selected for replacement, on the assumption that more recently used memory locations are more likely to be referenced again. This technique can be easily implemented in the two-way set-associative cache organization.
• First-In-First-Out: This technique uses the same first-in-first-out concept as a queue. The block which was loaded into the cache first amongst the blocks presently in the cache is selected for replacement.
• Least-Frequently-Used: In this technique, the block in the set which has the fewest
references is selected for the replacement.
• Random: Here, there is no specific criterion for replacement of any block. Existing blocks are replaced randomly. Simulation studies have shown that random replacement provides only slightly inferior performance to the algorithms just discussed.
Example 8.4.9 Consider web browsing application. Assuming both client and server are
involved in the process of web browsing application, where can caches be placed to speed
up the process? Design a memory hierarchy for the system. Show the typical size and
latency at various levels of the hierarchy. What is the relationship between cache size and
its access latency? What are the units of data transfer between hierarchy levels? What is the relationship between the data location, data size, and transfer latency?
Solution: a) Assuming both client and server are involved in the process of web browsing
application, caches can be placed on both sides - web browser and server.
b) Memory hierarchy for the system is as follows:
1. Browser cache: size = fraction of the client computer's disk, latency = local disk latency.
2. Proxy cache: size = proxy disk, latency = LAN + proxy disk latencies.
3. Server-side cache: size = fraction of server disk, latency = WAN + server disk latencies.
4. Server storage: size = server storage, latency = WAN + server storage latencies. Latency is not directly related to cache size.
c) The units of data transfers between hierarchies are pages.
d) Latency grows with page size as well as distance.
Accessing I/O
• The important components of any computer system are CPU, memory and I/O devices
(peripherals). The CPU fetches instructions (opcodes and operands/data) from memory,
processes them and stores results in memory. The other components of the computer
system (I/O devices) may be loosely called the Input/Output system.
• The main function of I/O system is to transfer information between CPU or memory and
the outside world.
• The important point to be noted here is, I/O devices (peripherals) cannot be connected
directly to the system bus. The reasons are discussed here.
1. A variety of peripherals with different methods of operation are available. So it would
be impractical to incorporate the necessary logic within the CPU to control a range of
devices.
2. The data transfer rate of peripherals is often much slower than that of the memory or
CPU. So it is impractical to use the high speed system bus to communicate directly with
the peripherals.
3. Generally, the peripherals used in a computer system have different data formats and
word lengths than that of CPU used in it.
• So to overcome all these difficulties, it is necessary to use a module in between system
bus and peripherals, called I/O module or I/O system, or I/O interface.
The functions performed by an I/O interface are:
1. Handle data transfer between much slower peripherals and CPU or memory.
2. Handle data transfer between CPU or memory and peripherals having different data
formats and word lengths.
3. Match signal levels of different I/O protocols with computer signal levels.
4. Provide necessary driving capabilities - sinking and sourcing currents.
Requirements of I/O System
• The I/O system is nothing but the hardware required to connect an I/O device to the bus. It is also called an I/O interface. The major requirements of an I/O interface are:
1. Control and timing
2. Processor communication
3. Device communication
4. Data buffering
5. Error detection
• The important blocks necessary in any I/O interface are shown in Fig. 8.6.1.
• As shown in the Fig. 8.6.1, I/O interface consists of data register, status/control register,
address decoder and external device interface logic.
• The data register holds the data being transferred to or from the processor.
• The status/control register contains information relevant to the operation of the I/O
device. Both data and status/control registers are connected to the data bus.
• Address lines drive the address decoder. The address decoder enables the device to
recognize its address when address appears on the address lines.
• The external device interface logic accepts inputs from address decoder, processor
control lines and status signal from the I/O device and generates control signals to control
the direction and speed of data transfer between processor and I/O devices.
• The Fig. 8.6.2 shows the I/O interface for input device and output device. Here, for
simplicity block schematic of I/O interface is shown instead of detail connections.
• The address decoder enables the device when its address appears on the address lines.
• The data register holds the data being transferred to or from the processor.
• The status register contains information relevant to the operation of the I/O device.
• Both the data and status registers are assigned with unique addresses and they
are connected to the data bus.
I/O Interfacing Techniques
I/O devices can be interfaced to a computer system in two ways, which are called interfacing techniques:
• Memory mapped I/O
• I/O mapped I/O
Memory mapped I/O
• In this technique, the total memory address space is partitioned and part of this space is
devoted to I/O addressing as shown in Fig. 8.6.3.
• When this technique is used, a memory reference instruction that causes data to be fetched from or stored at the address specified automatically becomes an I/O instruction if that address is made the address of an I/O port.
Advantage
• The usual memory related instructions are used for I/O related operations. The special
I/O instructions are not required.
Disadvantage
• The memory address space is reduced.
I/O mapped I/O
• If we do not want to reduce the memory address space, we allot a different I/O address space apart from the total memory space; this is called the I/O mapped I/O technique, as shown in Fig. 8.6.4.
Advantage
• The advantage is that the full memory address space is available.
Disadvantage
• The memory related instructions do not work. Therefore, processor can only use this
mode if it has special instructions for I/O related operations such as I/O read, I/O write.
Memory Mapped I/O, I/O Mapped I/O Comparison
• The processor's software checks each of the I/O devices every so often. During this check, the processor tests to see if any device needs servicing. Fig. 8.7.1 shows the flowchart for this.
• This is a simple program which services I/O ports A, B and C. The routine checks the
status of I/O ports in proper sequence.
• If the request bit is set, the corresponding service routine is called to complete the I/O
request.
• This test and service procedure continues until all the I/O port status registers are tested
and all the I/O ports requesting service are serviced. Once this is done, the processor
continues to execute the normal programs.
• The routine assigns priorities to the different I/O devices. Once the routine is started, the
service request bit at port A is always checked first. Once port A is checked, port B is
checked, and then port C. However, the order can be changed by simply changing the
routine and thus the priorities.
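In software, this polling scheme amounts to a loop over the ports in priority order. A small Python sketch, where the Port class and its methods are hypothetical stand-ins for real status-register reads and service routines:

class Port:
    def __init__(self, name):
        self.name, self.request = name, False
    def request_bit_set(self):           # stand-in for a status-register read
        return self.request
    def service(self):                   # stand-in for the service routine
        print(f"servicing port {self.name}")
        self.request = False

def poll_devices(ports):
    # Priority is simply list order: port A is always checked first.
    for port in ports:
        if port.request_bit_set():
            port.service()

ports = [Port("A"), Port("B"), Port("C")]
ports[1].request = True
poll_devices(ports)                      # -> servicing port B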
Interrupt I/O
• Sometimes it is necessary to have the computer automatically execute one of a collection of special routines whenever certain conditions exist within a program or the computer system. For example, the computer system should respond to devices such as the keyboard, sensors and other components when they request service.
• This method provides an external asynchronous input that informs the processor that it should complete the instruction currently being executed and fetch a new routine (interrupt service routine) that will service the requesting device. Once this servicing is completed, the processor resumes exactly where it left off. The event that causes the interruption is called an interrupt and the special routine executed to service the interrupt is called the Interrupt Service Routine (ISR).
• The interrupt service routine is different from a subroutine because the address of the ISR is predefined or is available in the Interrupt Vector Table (IVT), whereas a subroutine address must be given in the subroutine CALL instruction. The IRET instruction is used to return from an ISR whereas the RET instruction is used to return from a subroutine. In the IA-32 architecture, IRET restores the flag contents along with CS and IP; RET restores only the CS and IP contents.
• An interrupt caused by an external signal is referred as a hardware interrupt.
• Conditional interrupts or interrupts caused by special instructions are called software
interrupts.
Enabling and disabling interrupts
• Most processors provide a masking facility. Those interrupts which can be masked under software control are called maskable interrupts.
• The interrupts which cannot be masked under software control are called non-maskable interrupts.
• Maskable interrupts are enabled and disabled under program control. By setting or
resetting particular flip-flops in the processor, interrupts can be masked or unmasked,
respectively.
• When masked, processor does not respond to the interrupt even though the interrupt is
activated.
Vectored interrupts
• When an external device interrupts the processor (interrupt request), the processor has to execute an interrupt service routine to service that interrupt. If the internal control circuit of the processor produces a CALL to a predetermined memory location which is the starting address of the interrupt service routine, then that address is called the vector address and such interrupts are called vectored interrupts. Vectored interrupts give the fastest and most flexible response, since such an interrupt causes a direct hardware-implemented transition to the correct interrupt-handling program. This technique is called vectoring. When the processor is interrupted, it reads the vector address and loads it into the PC.
Interrupt nesting
• For some devices, a long delay in responding to an interrupt request may cause an error in the operation of the computer. Such interrupts are acknowledged and serviced even though the processor is executing an interrupt service routine for another device.
• A system of interrupts that allows an interrupt service routine to be interrupted is known as nested interrupts.
Interrupt priority
• When interrupt requests arrive from two or more devices simultaneously, the processor
has to decide which request should be serviced first and which one should be delayed. The
processor takes the decision with the help of interrupt priorities.
• It accepts the request having the highest priority.
Recognition of Interrupt and Response to Interrupt
• The CPU recognizes the interrupt when the external asynchronous input (interrupt input)
is asserted (a signal is sent to the interrupt input) by an I/O device.
• In response to an interrupt, a special sequence of actions is performed, as follows:
• When a processor is interrupted, it stops executing its current program and calls a special routine which "services" the interrupt. The event that causes the interruption is called the interrupt and the special routine which is executed is called the interrupt service routine.
1. The processor completes its current instruction. No instruction is cut-off in the middle
of its execution.
2. The program counter's current contents are stored on the stack. Remember, during the
execution of an instruction the program counter is pointing to the memory location for the
next instruction.
3. The program counter is loaded with the address of an interrupt service routine.
4. Program execution continues with the instruction taken from the memory location
pointed by the new program counter contents.
5. The interrupt service routine continues to execute until a return instruction is executed.
6. After execution of the return instruction, the processor gets the old program counter contents (the address of the instruction following the one at which the interrupt occurred) from the stack and puts them back into the program counter. This allows the interrupted program to continue executing at the instruction following the one where it was interrupted. Fig. 8.8.2 shows the response to an interrupt with a flowchart and diagram.
Comparison between Programmed I/O and Interrupt Driven I/O
The Table 8.8.1 gives the comparison between programmed I/O and interrupt driven I/O.
Input Interface
• A commonly used input device is the keyboard. Fig. 8.9.1 shows the hardware components needed for connecting a keyboard to a processor.
• A typical keyboard consists of mechanical switches that are normally open. When a key is pressed, the corresponding signal changes and an encoder circuit generates the ASCII code for the corresponding key.
• For interfacing switches to microprocessor-based systems, push button keys are usually used. These push button keys, when pressed, bounce a few times, closing and opening the contacts before providing a steady reading, as shown in the Fig. 8.9.2. A reading taken during the bouncing period may be faulty. Therefore, the microprocessor must wait until the key reaches a steady state; this is known as key debounce.
• The key-debouncing circuit included in the block diagram of Fig. 8.9.3 eliminates the effect of key bouncing. The problem of key bounce can be eliminated using a key debounce circuit (hardware approach) or a software approach.
• When debouncing is implemented in software, the I/O routine that reads the ASCII code of a character from the keyboard waits long enough to ensure that bouncing has subsided.
• Fig. 8.9.3 shows the hardware approach to prevent key bouncing. It consists of a flip-flop. The output of the flip-flop shown in Fig. 8.9.3 is logic 1 when the key is at position A (unpressed) and logic 0 when the key is at position B, as shown in Table 8.9.1. It is important to note that when the key is in between A and B, the output does not change, preventing bouncing of the key output. In other words, the output does not change during the transition period, eliminating key bounce.
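A software debounce routine in the spirit described above might look like the following Python sketch, where read_raw is a hypothetical function returning the raw key state:

import time

def read_key_debounced(read_raw, settle_ms=20):
    """Accept a key state only after it has stayed unchanged for
    settle_ms milliseconds (restart the wait on every bounce)."""
    state = read_raw()
    t0 = time.monotonic()
    while (time.monotonic() - t0) * 1000 < settle_ms:
        current = read_raw()
        if current != state:         # contact bounced: start waiting again
            state = current
            t0 = time.monotonic()
    return state

print(read_key_debounced(lambda: 1))   # steady input -> returns 1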
• The output of encoder in Fig. 8.9.4 consists of the bits that represent the encoded
character and one control signal called valid, which indicates that a key is being pressed.
The encoder circuit sends this information to the interface circuit.
• The interface circuit consists of a data register, DATA IN, and a status flag, SIN.
• When a key is pressed, the valid signal is activated (changes from 0 to 1), causing the ASCII code to be loaded into DATA IN and SIN to be set to 1. The status flag SIN is cleared to 0 when the processor reads the contents of the DATA IN register.
• The interface circuit is connected to an asynchronous bus on which transfers are
controlled by using the handshake signals master-ready and slave-ready.
• The Fig. 8.9.4 shows the typical internal circuitry for input interface. Here, we receive
the data input from the keyboard input device.
• When the key is pressed, its switch closes and establishes a path for an electrical signal.
This signal is detected by an encoder circuit that generates ASCII code for the
corresponding character.
• The input interface consists of data register, DATA IN, and status flag, SIN. When a key
is pressed, the valid signal activates and causes the ASCII code to be loaded into DATA
IN and SIN to be set to 1.
• The status flag SIN is cleared to 0 when the processor reads the contents of the DATA
IN register.
• An address decoder is used to select the input interface when the high-order 31 bits of an address correspond to any of the addresses assigned to this interface.
• An address bit determines whether the status register or the data register is to be read when the Master-ready signal is active.
• The control handshake is accomplished by activating the slave-ready signal when either
Read-status or Read-data is equal to 1.
Output Interface
• The Fig. 8.9.5 shows a typical example of an output interface which is used to interface a parallel printer.
• The output interface contains a data register, DATAOUT, and a status flag, SOUT.
• The SOUT flag is set to 1 when the printer is ready to accept another character, and it is
cleared to 0 when a new character is loaded into DATAOUT by the processor.
• The Fig. 8.9.6 shows the detailed internal circuit for the output interface.
• Flags SIN and SOUT are in the status register.
• Labels RS1 and RS0 are used for inputs that determine the selection of the desired register.
Programmable Parallel Interface
• The input and output interfaces can be combined into a single interface whose direction of data flow is controlled by a data direction register.
• A single interface can be programmed to input data or output data. Such an interface is known as a programmable parallel interface.
• The Fig. 8.9.8 shows the simplified block diagram of 8-bit programmable parallel
interface.
• Data and interface lines are bidirectional. Their direction is controlled by the Data Direction Register (DDR).
• Two lines M0 and M1 are connected to the status and control block. These two lines decide the mode of operation of the parallel interface, i.e. whether the parallel interface operates as simple input/output or as handshake input/output.
• Ready and Accept signals are provided as handshaking signals.
• An interrupt signal is also provided to allow interrupt-driven I/O data transfer.
Serial Interface
• A serial interface is used to transmit / receive data serially, i.e., one bit at a time.
• A key feature of an interface circuit for a serial port is that it is capable of
communicating in a bit serial fashion on the device side and in a bit parallel fashion on the
processor side.
• A shift register is used to transform information between the parallel and serial formats.
The Fig. 8.9.9 shows the block diagram of typical internal circuit for serial interface.
• As shown in the Fig. 8.9.9, the input shift register accepts serial data bit by bit and
converts it into the parallel data. The converted parallel data is loaded in the data register
and it is then read by the processor using data bus.
• When it is necessary to send data serially, the data is loaded into DATAOUT register. It
is then loaded into output shift register. Output shift register converts this parallel data into
serial data.
• In serial interface one bit is transferred at a time over a single line.
• To read a block of data from the disk, the processor sends a series of commands to the disk controller telling it to find and read the desired block of data from the disk.
• When the disk controller is ready to transfer the first byte of data from the disk, it sends a DMA request (DRQ) signal to the DMA controller.
• The DMA controller then sends a hold request (HRQ) signal to the processor's HOLD input. The processor responds to this HOLD signal by floating its buses and sending out a hold acknowledge (HLDA) signal to the DMA controller.
• When the DMA controller receives the HLDA signal, it takes the control of system bus.
• When the DMA controller gets control of the buses, it sends out the memory address where the first byte of data from the disk is to be written. It also sends a DMA acknowledge (DACK) signal to the disk controller device telling it to get ready to output the byte.
• Finally, it asserts both the I/O read and memory write signals on the control bus.
Asserting the I/O read signal enables the disk controller to output the byte of data from the
disk on the data bus and asserting the memory write signal enables the addressed memory
to accept data from the data bus. In this technique data is transferred directly from the disk
controller to the memory location without passing through the processor or the DMA
controller.
• Thus, the CPU is involved only at the beginning and end of the transfer.
• After completion of data transfer, the HOLD signal is deasserted to give control of all
buses back to the processor.
• The Fig. 8.10.2 shows the interaction between processor and DMA discussed above.
Comparison of I/O Program Controlled Transfer and DMA Transfer
• DMA controller communicates with the CPU via the data bus and control lines.
• The registers in DMA are selected by the CPU through the address bus by enabling the
DS (DMA select) and RS (Register select) inputs.
• The RD (Read) and WR (write) inputs are bidirectional.
• When the BG (bus grant) input is 0, the CPU can communicate with the DMA registers through the data bus to read from or write to the DMA registers; RD and WR are then input signals to the DMA.
• When BG = 1, the CPU has relinquished the buses and the DMA can communicate
directly with the memory by specifying an address in the address bus and activating the
RD or WR signals (RD and WR are now output signals for the DMA). The DMA controller consists of a data counter, data register, address register and control logic.
• The data counter register stores the number of data transfers to be done in one DMA cycle. It is automatically decremented after each word transfer.
• Data register acts as buffer whereas address register initially holds the starting address of
the device. Actually, it stores the address of the next word to be transferred. It is
automatically incremented or decremented after each word transfer.
• After each transfer, data counter is tested for zero. When the data count reaches zero, the
DMA transfer halts.
• The DMA controller is normally provided with an interrupt capability, in which case it sends an interrupt to the processor to signal the end of the I/O data transfer.
Data Transfer Modes
• DMA controller transfers data in one of the following three modes :
• Single transfer mode (cycle stealing)
• Block transfer mode
• Demand or burst transfer mode
Single transfer mode (Cycle stealing mode)
In this mode the device can make only one transfer (byte or word) at a time. After each transfer the DMAC gives control of all buses back to the processor, so the processor can have access to the buses on a regular basis.
It allows the DMAC to time share the buses with the processor, hence this mode is most
commonly used.
The operation of the DMA in single transfer mode is as given below:
1. The I/O device asserts the DRQ line when it is ready to transfer data.
2. The DMAC asserts the HOLD line to request use of the buses from the processor.
3. The processor asserts HLDA, granting control of the buses to the DMAC.
4. The DMAC asserts DACK to the requesting I/O device and executes a DMA bus cycle, resulting in a data transfer.
5. The I/O device deasserts its DRQ after the data transfer of one byte or word.
6. The DMAC deasserts the DACK line.
7. The word/byte transfer count is decremented and the memory address is incremented.
8. The HOLD line is deasserted to give control of all buses back to the processor.
9. HOLD is reasserted to request the use of the buses when the I/O device is ready to transfer another byte or word. The same process is then repeated until the last transfer.
10. When the transfer count is exhausted, a terminal count is generated to indicate the end of the transfer.
Block transfer mode
In this mode the device can make the number of transfers programmed in the word count register. After each transfer the word count is decremented by 1 and the address is decremented or incremented by 1. The DMA transfer is continued until the word count "rolls over" from zero to FFFFH, or a Terminal Count (TC) or an external End Of Process (EOP) signal is encountered. Block transfer mode is used when the DMAC needs to transfer a block of data.
The operation of DMA in block transfer mode is as given below:
1. The I/O device asserts the DRQ line when it is ready to transfer data.
2. The DMAC asserts the HOLD line to request use of the buses from the microprocessor.
3. The microprocessor asserts HLDA, granting control of the buses to the DMAC.
4. The DMAC asserts DACK to the requesting I/O device and executes a DMA bus cycle, resulting in a data transfer.
5. The I/O device deasserts its DRQ after the data transfer of one byte or word.
6. The DMAC deasserts the DACK line.
7. The word/byte transfer count is decremented and the memory address is incremented.
8. If the transfer count is not exhausted, the data transfer is not complete and the DMAC waits for another DMA request from the I/O device, indicating that it has another byte or word to transfer. When the DMAC receives the DMA request, the steps above are repeated.
9. When the transfer count is exhausted, the data transfer is complete and the DMAC deasserts HOLD to tell the microprocessor that it no longer needs the buses.
10. The microprocessor then deasserts the HLDA signal to tell the DMAC that it has resumed control of the buses.
Demand transfer mode
In this mode the device is programmed to continue making transfers until a TC or an external EOP is encountered, or until DREQ goes inactive.
The operation of DMA in demand transfer mode is as given below:
1. The I/O device asserts the DRQ line when it is ready to transfer data.
2. The DMAC asserts the HOLD line to request use of the buses from the microprocessor.
3. The microprocessor asserts HLDA, granting control of the buses to the DMAC.
4. The DMAC asserts DACK to the requesting I/O device and executes a DMA bus cycle, resulting in a data transfer.
5. The I/O device deasserts its DRQ after the data transfer of one byte or word.
6. The DMAC deasserts the DACK line.
7. The word/byte transfer count is decremented and the memory address is incremented.
8. The DMAC continues to execute transfer cycles until the I/O device deasserts DRQ, indicating its inability to continue delivering data. The DMAC then deasserts the HOLD signal, giving the buses back to the microprocessor. It also deasserts DACK.
9. I/O device can re-initiate demand transfer by reasserting DRQ signal.
10. Transfer continues in this way until the transfer count has been exhausted.
The flowcharts in Fig. 8.10.4 summarize the three data transfer modes of DMA.
Use of DMA in a Computer System
• The Fig. 8.10.5 (a) shows the use of DMA in a computer system.
• The DMA is used to connect a high-speed network to the computer bus. The DMA controller handles the data transfer between the high-speed network and the computer system.
• It is also used to transfer data between processor and floppy disk with the help of floppy
disk controller.
• Let us see how the DMA controller performs the data transfer between the floppy disk and the
processor. The Fig. 8.10.5 (b) shows the interface required for such a transfer.
• The sequence of events that takes place during the data transfer is as follows:
• When the processor needs some data from the disk, it sends a series of command words to
registers inside the floppy disk controller.
• The floppy disk controller then proceeds to find the specified track and sector on the
disk.
• Meanwhile, the processor loads the DMA data counter and address register. The data
counter is loaded with a count equal to the number of bytes to be transferred to or from
memory. The address register is loaded with the starting address in memory.
• When the controller reads the first byte of data from a sector, it sends a DMA request
(DRQ) signal to the DMA controller. The DMA controller in turn sends a hold request signal
to the HOLD input of the processor.
• The processor then floats its buses and sends a hold-acknowledge signal to the DMA
controller.
• The DMA controller then sends out the first transfer address on the bus and asserts
the DACK input of the FDC to tell it that the DMA transfer is underway. When the number
of bytes specified in the DMA controller initialization has been transferred, the DMA
controller asserts the EOP (end of process) signal, which is connected to the TC (Terminal
Count) input of the FDC.
• This causes the FDC to generate an interrupt signal to tell the processor that the requested block
of data has been read from the disk to a buffer in memory.
A similar process is required to write data to the disk.
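The processor-side initialization described above (loading the data counter and address register, then waiting for terminal count) might look roughly like the following C sketch. The register addresses and names are invented for illustration; a real system would use the addresses from its chipset or DMAC datasheet.

#include <stdint.h>

/* Hypothetical memory-mapped DMAC registers; addresses invented
   for this sketch only.                                          */
#define DMA_ADDR_REG   (*(volatile uint16_t *)0xFF00) /* start address  */
#define DMA_COUNT_REG  (*(volatile uint16_t *)0xFF02) /* bytes to move  */
#define DMA_STATUS_REG (*(volatile uint8_t  *)0xFF04) /* bit 0 = TC     */

void dma_read_sector(uint16_t buffer_addr, uint16_t nbytes)
{
    DMA_COUNT_REG = nbytes;       /* data counter: bytes to transfer   */
    DMA_ADDR_REG  = buffer_addr;  /* starting address of memory buffer */

    /* ... command words to the floppy disk controller go here ... */

    while ((DMA_STATUS_REG & 0x01) == 0)
        ;   /* wait for terminal count (in practice the FDC's
               interrupt would signal completion instead)        */
}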
Interconnection Standards
Universal Serial Bus (USB)
• USB provides a fast and flexible interface for connecting all kinds of peripherals.
• USB plays a key role in fast-growing consumer areas such as digital imaging, PC
telephony, and multimedia games.
• The presence of USB in most new PCs and its plug-and-play capability mean that PCs
and peripherals (such as CD-ROM drives, tape and floppy drives, scanners, printers, video
devices, digital cameras, digital speakers, telephones, modems, keyboards, mice, digital
joysticks and others) will automatically configure and work together, with a high degree of
reliability, in these exciting new application areas.
• USB opens the door to new levels of innovation in input devices. There are
also brand-new opportunities for all types of peripherals, from printers and scanners to
high-speed connections such as Ethernet, DSL, cable, and satellite communications.
• USB has advantages that specifically benefit developers, including the hardware
designers who select components and design the circuits, the PC programmers who write
the software that communicates with USB peripherals, and the peripheral programmers who
write the code that resides inside USB peripherals.
USB Features
1. Simple connectivity
USB offers simple connectivity.
2. Simple cables
USB cable connectors are keyed so they cannot be plugged in incorrectly.
3. One interface for many devices
USB is versatile enough to be usable with many kinds of peripherals, without needing a
different connector and protocol for each peripheral. USB supports all kinds of
data, from slow mouse inputs to digitized audio and compressed video.
4. Automatic configuration
When a user connects a USB peripheral to a powered system, Windows automatically
detects the peripheral and loads the appropriate software driver for it. There is no need to
locate and run a setup program or restart the system before using the peripheral.
5. No user setting
USB peripherals do not have user-selectable settings such as port addresses and interrupt
request (IRQ) lines.
6. Frees hardware resources for other devices
Using USB for as many peripherals as possible frees up IRQ lines for the peripherals that
do require them.
7. Hot pluggable (Plug-and play)
You can install or remove a peripheral regardless of the power state.
8. Data transfer rates
USB supports three data transfer rates, 480 Mb/s (high-speed), 12 Mb/s (full-speed) and
1.5 Mb/s (low-speed).
9. Coexistence with IEEE 1394
USB 2.0 and IEEE 1394 offer similar data rates; they differ primarily in application
focus.
10. Reliability
Reliability of USB results from both the hardware design and data transfer protocols.
11. Low cost
Even though USB is more complex than earlier interfaces, its components and cables are
inexpensive. A device with a USB interface is likely to cost the same as or less than its
equivalent with an older interface.
12. Low power consumption
Power circuits and code automatically power down USB peripherals when not in use, yet
keep them ready to respond when needed.
13. Flexibility
USB's four transfer types and three speeds make it feasible for many types of peripherals.
14. Operating system support
Windows 98 was the first Windows operating system to reliably support USB, and its
successors such as Windows 2000 support USB as well. Other computers and operating
systems also have USB support. On Apple's iMac, the only peripheral connectors are
USB. Other Macintoshes also support USB, and support is in progress for Linux, NetBSD,
and FreeBSD.
USB System
• The Fig. 8.11.1 shows the basic components of a USB system. It consists of a USB host,
USB devices, and USB cables. The USB host is a personal computer (PC) and the devices are
scanners, printers, etc. There is only one host in a USB system; however, there can be
up to 127 devices in the USB system.
USB Connector
• The Fig. 8.11.2 shows the pin-out of the USB connector.
• There are two types of USB connectors.
• In either case, there are four signals, as indicated in Table 8.11.1. The 5.0 V and ground
signals can be used to power the device connected on the USB, as long as the current drawn
does not exceed 100 mA per device.
• The data signals are biphase signals: when the +data line is at 5.0 V, the −data line is at
0 V, and vice versa.
• To maintain the signal frequency in the specified range, i.e. to achieve synchronization,
a sync bit is inserted into the data stream if a logic 1 is transmitted for more than six
bits in a row. The process of inserting this sync bit is known as bit stuffing.
• Bit stuffing is illustrated in Fig. 8.11.4. It ensures that the receiver can maintain
synchronization across long strings of logic 1s.
• The Fig. 8.11.5 shows the flowchart to generate USB data from the raw digital serial
data. (See Fig. 8.11.5 on the next page.)
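A small C sketch of bit stuffing follows, assuming the usual USB rule that a 0 is stuffed after six consecutive 1s (the "sync bit" of the text). Bits are represented as an int array for clarity; real hardware does this serially.

#include <stdio.h>

/* Copy the raw bits to out[], inserting a 0 after every run of
   six consecutive 1s so the receiver always sees a transition.  */
int bit_stuff(const int *in, int n, int *out)
{
    int ones = 0, m = 0;
    for (int i = 0; i < n; i++) {
        out[m++] = in[i];
        ones = in[i] ? ones + 1 : 0;
        if (ones == 6) {        /* six 1s in a row: stuff a 0 */
            out[m++] = 0;
            ones = 0;
        }
    }
    return m;                   /* length after stuffing */
}

int main(void)
{
    int raw[] = {1,1,1,1,1,1,1,1}, stuffed[16];
    int m = bit_stuff(raw, 8, stuffed);
    for (int i = 0; i < m; i++)
        printf("%d", stuffed[i]);
    printf("\n");               /* prints 111111011 */
    return 0;
}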
USB Commands
• USB data is transmitted to a particular receptor with the use of USB commands.
• The communication begins with the transmission of the sync byte (80H).
• It is followed by the packet identification byte (PID). The PID contains eight bits, but
only the rightmost four bits specify the type of packet. The leftmost four bits of the PID
are the complement of the rightmost four bits.
• For example, if a command is 1001, the actual PID byte is 0110 1001. The Table 8.11.2
shows the available 4-bit PIDs and their 8-bit codes. The PIDs are also used as token
indicators, as data indicators, and for handshaking.
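The PID construction rule (low nibble = packet type, high nibble = its complement) is easy to express and check in C; this sketch reproduces the 1001 → 0110 1001 example from the text.

#include <stdint.h>
#include <stdio.h>

/* Build a PID byte from a 4-bit packet type: the high nibble is
   the one's complement of the low nibble.                        */
uint8_t make_pid(uint8_t type4)
{
    return (uint8_t)(((~type4 & 0x0F) << 4) | (type4 & 0x0F));
}

/* The receiver can detect a corrupted PID: the two nibbles must
   XOR to 1111.                                                   */
int pid_is_valid(uint8_t pid)
{
    return ((pid >> 4) ^ (pid & 0x0F)) == 0x0F;
}

int main(void)
{
    uint8_t pid = make_pid(0x9);    /* command 1001 from the text */
    printf("PID byte: %02X, valid: %d\n", pid, pid_is_valid(pid));
    /* prints "PID byte: 69, valid: 1", i.e. 0110 1001 */
    return 0;
}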
USB Packet Format
• The Fig. 8.11.6 shows the formats of the data, token, handshaking, and start-of-frame
packets used on the USB.
• The data packet begins with the PID, followed by the data bytes, and ends with a CRC
(cyclic redundancy check) code.
• Packets use two types of CRC codes: one is a 5-bit CRC and the other (used for data
packets) is a 16-bit CRC.
• The 5-bit CRC is generated with the x^5 + x^2 + 1 polynomial.
• The 16-bit CRC is generated with the x^16 + x^15 + x^2 + 1 polynomial.
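As an illustration, the 5-bit CRC can be computed bitwise in C. This sketch follows the common USB convention of an all-ones initial value, LSB-first bit order, and a complemented result; consult the USB specification for the authoritative definition.

#include <stdint.h>
#include <stdio.h>

/* Bitwise CRC-5 over the polynomial x^5 + x^2 + 1 (taps 0b00101). */
uint8_t usb_crc5(uint16_t data, int nbits)
{
    uint8_t crc = 0x1F;                    /* initial value 11111 */
    for (int i = 0; i < nbits; i++) {
        uint8_t bit = (data >> i) & 1;     /* LSB first */
        uint8_t msb = (crc >> 4) & 1;
        crc = (uint8_t)((crc << 1) & 0x1F);
        if (bit ^ msb)
            crc ^= 0x05;                   /* x^2 + 1 feedback taps */
    }
    return (uint8_t)(~crc & 0x1F);         /* complemented on output */
}

int main(void)
{
    /* 11-bit token field (7-bit address + 4-bit endpoint) */
    printf("CRC5 = %02X\n", usb_crc5(0x710, 11));
    return 0;
}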
• The USB uses the ACK (acknowledge) and NAK (not acknowledge) tokens to coordinate
the transfer of data packets between the host system and the USB device. (The host is a PC
or other computer that contains two components: a controller and a root hub. A hub is a
device that contains one or more connectors or internal connections to USB devices, along
with the hardware to enable communication with each device.)
• Once a data packet is transferred from the host to the USB device, the USB device
transmits either an ACK or a NAK token back to the host.
• If the data and CRC are received without error, the ACK is sent; otherwise, the NAK is
sent.
• If the host receives a NAK token, it retransmits the data packet until the receiver receives
it without error.
• This method of data transfer is known as stop-and-wait flow control. In this method, the
host has to wait for the client to send an ACK or NAK before transferring additional data
packets.
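The stop-and-wait behaviour can be modelled with a short C sketch. The send_packet() function is a hypothetical stand-in that randomly "corrupts" packets to mimic bus errors.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Pretend to send one packet; report ACK (true) or NAK (false).
   Roughly one in four packets "fails" to simulate a bus error.   */
static bool send_packet(int seq)
{
    bool ok = rand() % 4 != 0;
    printf("host: sent packet %d -> %s\n", seq, ok ? "ACK" : "NAK");
    return ok;
}

int main(void)
{
    for (int seq = 0; seq < 3; seq++) {
        while (!send_packet(seq))
            ;   /* NAK received: retransmit the same packet */
    }
    return 0;
}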
Data Transfer Types
• The USB is designed to handle many types of peripherals with varying requirements for
transfer rate, response time, and error correcting. There are four types of data transfers
each handle different needs and a peripheral can support the transfer types that are best
suited for its purpose.
1. Control transfer
• Control transfers are the only type with functions defined by the USB specification.
These transfers enable the host to read and select configurations and other settings on the
devices being enumerated. Control transfers may also send custom requests that send and
receive blocks of data for any purpose. All USB devices must support control transfers.
• This data transfer exchanges configuration, setup, and command information between the
device and the host. CRCs check the data and initiate retransmissions when needed to
guarantee the correctness of these packets.
• Control transfers use message pipes. In a message pipe, each transfer begins with a Setup
transaction containing a request. To complete the transfer, the host and device may
exchange data and status information, or the device may just send status information.
There is always at least one transaction that sends information in each direction.
• If the request is one that the device supports, it takes the requested action. A device may
also respond with a code that indicates that it does not support the request.
2. Bulk transfer
• Bulk transfers are intended for situations where the rate of transfer isn't critical, such as
sending a file to a printer or receiving data from a scanner. In these cases quick transfers
are nice, but the data can wait if necessary. If the bus is very busy with other transfers that
have guaranteed transfer rates, bulk transfers must wait, but if the bus is idle, bulk
transfers are very fast. Only full-speed devices can do bulk transfers. Devices are not
required to support bulk transfers, but a specific device class might require it.
• This data transfer moves large amounts of data when timely delivery is not critical.
Typical applications include printers and scanners. Bulk transfers are fillers, claiming
unused USB bandwidth when nothing more important is going on. CRCs protect these
packets.
3. Interrupt transfer
• Interrupt transfers are for devices that must receive the host's or device's attention
quickly. Other than control transfers, interrupt transfers are the only way that low speed
devices can transfer data. A keyboard or mouse can use interrupt transfers to send
keypress or mouse movement data. Both full and low speed devices can do interrupt
transfers. Devices aren't required to support interrupt transfers, but a specific device class
might require it.
• These data transfers, though not interrupts in the CPU-diverting sense, poll devices to see if
they need service. Peripherals exchanging small amounts of data that need immediate
attention (such as mice and keyboards) use interrupt transfers. Error checking validates the
data.
4. Isochronous transfer (USB for audio devices)
• Isochronous transfers are for devices that must transfer data at a constant rate, such as
audio files to be played in real time, or other data that needs a guaranteed delivery rate or
time. This is the only transfer type that doesn't support automatic re-transmitting of data
received with errors, so occasional errors must be acceptable. Only full-speed devices can
do isochronous transfers. Devices are not required to support isochronous transfers, but a
specific device class might require it.
• This data transfer handles streaming data, such as that from an audio or video device. The
information is time-sensitive, so within limits it has guaranteed access to the USB bus. No
error checking occurs, so the system must tolerate occasional scrambled bytes. The above
three transfer types, namely bulk, interrupt, and isochronous transfers, use stream
pipes.
• In a stream pipe, the data has no format defined by the USB specification. The receiving
device just accepts whatever arrives. The device firmware or host software can then
process the data in whatever way is appropriate for the application. Of course, even with
stream data, the sending and receiving devices will need to agree on some type of format.
SATA
• A serial advanced technology attachment (serial ATA, SATA or S-ATA) is a computer
bus interface that connects host bus adapters with mass storage devices like optical drives
and hard drives.
• As its name implies, SATA is based on serial signaling technology, where data is
transferred as a sequence of individual bits.
• This interface is commonly used to connect hard disk drives to a host system such as a
computer motherboard.
• The first version of SATA communicated at 150 megabytes per second (MBps). The
standard was upgraded to 300 MBps in 2004 and to 600 MBps in 2009, estimated to be
sufficient to accommodate 10 years of advances in device throughput.
Features
• Low Voltage Requirement: SATA operates on 500 mV (0.5 V) peak-to-peak signaling.
This greatly reduces interference and crosstalk between conductors.
• Hot Plugging: This feature helps users to change or remove storage devices even when
the computer is running.
• Staggered Spin-Up: Allows sequential hard disk drive startup, which helps even out
power load distribution during system booting.
• Native Command Queuing (NCQ): Usually, the commands reaching a disk read or
write data at different locations on the disk. When the commands are carried out in
the order in which they arrive, a substantial amount of mechanical overhead is generated
because of the constant repositioning of the read/write head. SATA II drives use an
algorithm to identify the most efficient order in which to carry out commands, which reduces
mechanical overhead and improves performance (see the sketch after this list).
• Port Multipliers: Allows the connection of up to 15 drives to a SATA controller. This
facilitates the building of disk enclosures.
• Port Selectors: Facilitates redundancy for two hosts connected to a single drive,
allowing the second host to take over in the event of a primary host failure.
• Simplified construction: PATA used a 40-pin, 80-wire ribbon cable, which was
complex in structure. In comparison, SATA uses a single 7-pin data cable and a 15-pin
power cable. This simpler cable supports a higher signaling rate, which translates to faster
data throughput.
• Differential Signaling: SATA uses differential signaling. Differential signaling is a
technology which uses two adjacent wires to simultaneously carry the in-phase and out-of-phase
signals. Thus, it is possible to transfer high-speed data with low operating voltage and low
power consumption by detecting the phase difference between the two signals at the
receiver's end.
• High data transfer rate: SATA has a high data transfer rate of 150/300/600
MB/s. This capability allows for faster program loading, picture
loading, and document loading.
• Large Cable Length: A SATA cable can be up to 1 meter long, whereas a PATA cable
can have a maximum length of only 18 inches.
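To illustrate the idea behind NCQ mentioned above, the sketch below reorders a queue of disk commands by logical block address (LBA) instead of servicing them in arrival order. Real NCQ firmware also accounts for rotational position, so this simple LBA sort is only the crudest approximation of what a drive actually does.

#include <stdio.h>
#include <stdlib.h>

/* Comparator for qsort: ascending by LBA, so the head sweeps in
   one direction instead of bouncing back and forth.              */
static int by_lba(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return x < y ? -1 : x > y;
}

int main(void)
{
    long queue[] = { 7200, 100, 5500, 90, 6000 };  /* arrival order */
    size_t n = sizeof queue / sizeof queue[0];

    qsort(queue, n, sizeof queue[0], by_lba);      /* reordered */
    for (size_t i = 0; i < n; i++)
        printf("service LBA %ld\n", queue[i]);
    return 0;
}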
Operating Modes
• SATA operates in two modes:
• IDE mode: IDE stands for Integrated Drive Electronics. This mode is used to provide
backward compatibility with older hardware, which runs on PATA, at low performance.
• AHCI mode: AHCI is an abbreviation for Advanced Host Controller Interface. AHCI is
a high-performance mode that also provides support for hot-swapping.
• The Serial ATA [SATA] bus is defined over two separate connectors, one connector for the
data lines and one for the power lines.
SATA Data pinout