HW3S24 Sol

University of Southern California

Department of Electrical Engineering


EE557 Spring 2024
Instructor: Murali Annavaram
Homework #3, Due: 11:59PM, Tuesday, March 19th
TOTAL SCORE:

Problem 1 [VLIW] (30 pts.)


Consider the following VLIW machine and program. (The machine issues one VLIW instruction per cycle with three slots: LD/ST, FP, and INT/branch, as in part b.)

Loop:  L.D   F0,0(R1)    O1
       ADD.D F4,F0,F2    O2
       S.D   0(R1),F4    O3
       ADDI  R1,R1,#8
       BNE   R1,R2,Loop

Branches are delayed by two instructions and executed in the EX3 stage. Jumps are delayed by one
instruction and are executed in the ID3 stage by the branch pipeline. Results are forwarded to the
EX stages with Full-forwarding.

(a) Compute the following operation latencies for three cases:

Source          Destination     No forwarding   Register Forwarding   Full Forwarding
FP ALU op       FP ALU op
Load Double     FP ALU op
Store Double    Load Double
Load Double     Store Double
INT op/Branch   INT

(b) Show a VLIW code with Rotating registers (in Figure 3.32 of the textbook) for the kernel using
software pipelining for the program. Follow the procedure of “Software pipelining algorithm” on
page 149 in the book. Assume full forwarding is used in the VLIW machine.

LABEL Slot 1(LD/ST) Slot 2(FP) Slot 3(int/branch)


KERNEL

(c) The VLIW machine's computation power is increased by adding more slots. It now has 2
LD/ST slots, 2 FP slots, 1 INT slot, and 1 branch slot. The operation latencies are kept the same
as in the previous machine. Show VLIW code with rotating registers (in Figure 3.32 of the
textbook) for the kernel, using software pipelining for the same program.

LABEL    Slot 1(LD/ST)  Slot 2(LD/ST)  Slot 3(FP)  Slot 4(FP)  Slot 5(INT)  Slot 6(Branch)
KERNEL
Sol.>
(a)

Source          Destination     No forwarding   Register Forwarding   Full Forwarding
FP ALU op       FP ALU op             4                  3                   2
Load Double     FP ALU op             3                  2                   1
Store Double    Load Double           0                  0                   0
Load Double     Store Double          3                  2                   1
INT op/Branch   INT                   2                  1                   0

(b) The kernel is rows INST5-INST6 (shown shaded in the original table).

         ITE1   ITE2   ITE3   ITE4
INST1    O1
INST2    --
INST3    O2     O1
INST4    --     --
INST5    --     O2     O1
INST6    O3     --     --
INST7           --     O2     O1
INST8           O3     --     --
INST9                  --     O2
INST10                 O3     --
INST11                        --
INST12                        O3

LABEL   Slot 1(LD/ST)      Slot 2(FP)          Slot 3(int/branch)
KER     L.D RR3,0(R1)      ADD.D RR1,RR2,F2    ADDI R1,R1,#8
        S.D -24(R1),RR0    NOOP                BNE R1,R2,KER

(c) The kernel is a single row (INST6, or equivalently INST7) in the table.

         ITE1   ITE2   ITE3   ITE4   ITE5   ITE6   ITE7
INST1    O1
INST2    --     O1
INST3    O2     --     O1
INST4    --     O2     --     O1
INST5    --     --     O2     --     O1
INST6    O3     --     --     O2     --     O1
INST7           O3     --     --     O2     --     O1
INST8                  O3     --     --     O2     --
INST9                         O3     --     --     O2
INST10                               O3     --     --
INST11                                      O3     --
INST12                                             O3

LABEL  Slot 1(LD/ST)   Slot 2(LD/ST)     Slot 3(FP)          Slot 4(FP)   Slot 5(INT)     Slot 6(Branch)
KER    L.D RR6,0(R1)   S.D -40(R1),RR0   ADD.D RR3,RR4,F2    NOOP         ADDI R1,R1,#8   BNE R1,R2,KER

Problem 2 [Predicated instructions] (30 pts.)
Consider the code in Exercise 3.24, and do the following:
(a) Identify all basic blocks and all possible traces in the code. For traces, answer in terms of
basic blocks.
(b) Schedule the code as best you can using local scheduling only (i.e., within basic blocks)
on the VLIW machine of Problem 1(b) with full forwarding.
(c) Translate this code into new assembly code with predicated and/or non-predicated
instructions, using no conditional branch instructions or unconditional jumps. Make sure
that you do not create unwanted exceptions.

Sol.>
(a)
BB1: I1~ I5
BB2: I6 ~ I7
BB3: I8 ~ I10
BB4: I11 ~ I13
BB5: I14 ~ I15
BB6: I16 ~ I17
Trace 1: BB1 -> BB2 -> BB3: (A>=C && A>=B)
Trace 2: BB1 -> BB2 -> BB4 -> BB5 -> BB6: (A>=C && A<B) and (B>D && A==C+D)
Trace 3: BB1 -> BB2 -> BB4 -> BB5: (A>=C && A<B) and (B>D && A!=C+D)
Trace 4: BB1 -> BB2 -> BB4 -> BB6: (A>=C && A<B) and (B<=D)
Trace 5: BB1 -> BB4 -> BB5 -> BB6: (A<C) and (B>D && A==C+D)
Trace 6: BB1 -> BB4 -> BB5: (A<C) and (B>D && A!=C+D)
Trace 7: BB1 -> BB4 -> BB6: (A<C) and (B<=D)

(b) (Rows are VLIW instruction slots; empty rows are branch/jump delay slots.)

BB   INST   LABEL   LD/ST           FP op   INT/BRANCH
1    I1             LW R5,0(R1)
1    I2             LW R7,0(R3)
1    I3             LW R6,0(R2)
1    I4                                     SLT R8,R5,R7
1    I5                                     BNEZ R8,else
1    I6
1    I7
2    I8                                     SLT R9,R5,R6
2    I9                                     BNEZ R9,else
2    I10
2    I11
3    I12                                    ADD R10,R6,R7
3    I13            SW R10,0(R1)            J exit
3    I14
4    I15    else    LW R11,0(R4)
4    I16
4    I17                                    SLT R12,R11,R6
4    I18                                    BEQZ R12,else1
4    I19
4    I20
5    I21                                    ADD R13,R7,R11
5    I22                                    BNE R5,R13,exit
5    I23
5    I24
6    I25    else1                           SUB R14,R5,R7
6    I26            SW R14,0(R2)

(c)

Problem 3 [Loop-carried dependency] (35 pts.)


Consider the following code running on the VLIW machine in Problem 1(b):

Loop:  L.S   F0,0(R2)     O1
       ADD.S F4,F3,F0     O2
       S.S   F4,16(R2)    O3
       ADDI  R2,R2,#8
       BNE   R2,R4,Loop

a. Draw the data dependency graph.


b. Write a VLIW program using local scheduling only.
c. Write a VLIW program using software pipelining with Rotating registers (in Figure 3.32 of the
textbook). Include prologue, kernel and epilogue. Follow and show the procedure of “Software
pipelining algorithm” on page 149 in the book.

d. What is the speedup (in cycles) of the software pipelining in c over the local scheduling in b if
the number of iterations of the loop is one hundred?

Sol.>
a. Data dependence graph, with edges labeled (distance, latency):

O1 --(0,2)--> O2 --(0,3)--> O3
O3 --(2,1)--> O1    (loop-carried: O3 writes 16(R2), which O1 loads two iterations later)
b.

LABEL   Slot 1(memory)   Slot 2(FP)        Slot 3(int/branch)
LOOP    L.S F0,0(R2)
        --
                         ADD.S F4,F3,F0
                                           BNE R2,R4,LOOP
                                           ADDI R2,R2,#8
        S.S F4,8(R2)

One iteration takes 6 cycles; the ADDI and the S.S fill the two branch delay slots, and the
store offset becomes 8(R2) because R2 has already been incremented.

c.
Resource (slot) utilization and data dependencies (DDG):
The dependence cycle C = O1 -> O2 -> O3 -> O1 gives MIN(C) = 2 + 3 + 1 = 6 and
DIFF(C) = 0 + 0 + 2 = 2. Therefore II1 = ceil(6/2) = 3.
Op slot conflicts:
II2 = MAX(II(O_MemOp), II(O_FPOp), II(O_int/brOp)) = 2

Therefore MINII = MAX(II1, II2) = 3


Schedule: INST1-3 is the prologue, INST4-6 is the kernel, and INST13-15 is the epilogue.

         ITE1   ITE2   ITE3   ITE4
INST1    O1
INST2
INST3    O2
INST4           O1
INST5
INST6    O3     O2
INST7                  O1
INST8
INST9           O3     O2
INST10                        O1
INST11
INST12                 O3     O2
INST13
INST14
INST15                        O3

VLIW program:

LABEL      Slot 1(memory)    Slot 2(FP)          Slot 3(int/branch)
PROLOGUE   L.S F0,0(R2)
           --
                             ADD.S RR1,F3,F0     ADDI R2,R2,#8
KERNEL     L.S RR2,0(R2)                         BNE R2,R4,KERNEL
           --
           S.S RR0,8(R2)     ADD.S RR1,F3,RR0    ADDI R2,R2,#8
EPILOGUE   --
           --
           S.S RR1,8(R2)

d.
Execution Time:
For local scheduling only, 6 × 100 = 600 cycles
For software pipelining, 3(prologue) + 99 × 3(kernel) + 3(epilogue) = 303
Speedup = 600/303 = 1.98

Problem 4 [Vector memory] (20 pts.)
Exercise 3.27 in the notes, but consider the following modifications.
a) Assume 16 banks and a bank access time of 6 cycles. Which vector strides (from 1 to 16)
avoid conflicts? Generalize this result to any stride by considering the stride modulo 16.
b) Assume 17 banks and a bank access time of 6 cycles. Which vector strides (from 1 to 17)
avoid conflicts? Generalize this result to any stride by considering the stride modulo 17.
c) Discuss the advantages and inconveniences in using 17 banks rather than 16 banks.

Sol.>
a. The bank number where component Xi of a vector is stored is (Stride * i) mod 16. To see
whether there is a conflict, we find the minimum gap K between two components returned from
the same bank, for stride S. For example, if the stride is 1, the same bank number repeats every
16 components, so K = 16.

S K
1 16
2 8
3 16
4 4
5 16
6 8
7 16
8 2
9 16
10 8
11 16
12 4
13 16
14 8
15 16
16 1

As shown in the table above, K can be generalized. Depending on the stride, the bank numbers
accessed by successive components form a pattern that repeats every K accesses. Because the
length of the pattern is the number of banks divided by GCD(S, 16) (the greatest common divisor
of the stride and the number of banks, 16 in this problem), K = 16 / GCD(S, 16). Because a bank
access takes 6 cycles, there is a conflict if K is less than 6:

If 16 / GCD(S, 16) < 6, then there is a conflict.

Hence all strides except multiples of 4 (i.e., 4, 8, 12, and 16) are permissible. If S is greater
than 16, the bank number pattern of S is the same as the pattern of (S mod 16), and so is its K.
b. We can use the same method as in part a, with K = 17 / GCD(S, 17).
Because 17 is a prime number, GCD(S, 17) = 1 for every stride except multiples of 17, so the
same bank is accessed only every 17 components of a vector. If the stride is 17, memory bank
0 is accessed for every component and there are conflicts. Hence every stride except a multiple
of 17 is permissible. As discussed in part a, a stride S greater than 17 accesses the same banks
as (S mod 17).
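The conflict conditions of parts a and b can be cross-checked with a short script (a sketch, not part of the original solution; `min_gap` is our name for the quantity K, and BANK_TIME is the 6-cycle bank access time from the problem statement):

```python
from math import gcd

BANK_TIME = 6  # cycles per bank access

def min_gap(stride, banks):
    """Minimum gap K between two accesses to the same bank: banks / gcd(stride, banks)."""
    return banks // gcd(stride, banks)

# 16 banks: conflicts exactly when K = 16/gcd(S,16) < 6, i.e. for multiples of 4.
conflicting_16 = [s for s in range(1, 17) if min_gap(s, 16) < BANK_TIME]
print(conflicting_16)   # [4, 8, 12, 16]

# 17 banks: gcd(S,17) = 1 except for multiples of 17, so only stride 17 conflicts.
conflicting_17 = [s for s in range(1, 18) if min_gap(s, 17) < BANK_TIME]
print(conflicting_17)   # [17]
```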

c. With 17 banks there are fewer conflicts per memory bank than with 16 banks: bank numbers
are computed modulo 17, and with a prime number of banks successive accesses are more likely
to land on different banks. However, computing an address modulo 17 is much more complex
than computing an address modulo 16. Even if the computation is pipelined, it is inefficient.

Problem 5 [Vector loop] (20 pts.)


Exercise 3.26 in the book, but assume an architecture similar to the one in Figure 3.41, with 8 vector
registers of 128 components each and a very large number of memory banks (say 1024) so that
conflicts for banks never happen. The bank access time is 10 cycles, the stride is 1, the latency of
the multiply pipeline is 10, and the latency of the add pipeline is 5.
Assume that memory addresses point to vector elements and that there are multiple adders and
multipliers. Branches are delayed by two instructions.

Sol.>
a. The code for processing each strip of 128 components is given below:
LOOP: L.V   V1,0(R2),R6   /load X; R6 contains the stride of 1
      L.V   V2,0(R3),R6   /load Y; 0(R3) is the base address of Y
      MUL.V V3,V2,V1      /multiply two vector registers
      ADD.V V4,V4,V3      /partial sums accumulate in V4
      ADDI  R2,R2,#128    /this assumes that memory
      ADDI  R3,R3,#128    /addresses point to vector elements
      ADDI  R4,R4,#1
      BNE   R4,R5,LOOP

Partial vector sums accumulate in V4. At the end we simply have to add the components of V4. To
simplify we have assumed that memory addresses point to vector elements. If vector components
have 8 bytes (double precision) and memory is byte-addressable, then the increments on the address
registers should be 1024.
Given that vector operations are chained, one iteration of the loop takes:
T_iteration = latency(L.V) + latency(MUL.V) + latency(ADD.V) + VL - 1 = 10 + 10 + 5 + 128 - 1 = 152 cycles.
Since the loop has to iterate 8 times (= 1024/128), the total number of cycles taken by the
dot product is 152 x 8 = 1216 cycles (ignoring the scalar phase at the end).

b.
For the multiplication of two matrices, a component of the result matrix is obtained by a dot-product
of two vectors. Therefore, we need 1024 × 1024 = 1048576 dot-products to multiply two
1024 × 1024 matrices. The matrix multiply takes 1216 × 1048576 = 1275068416 𝑐𝑦𝑐𝑙𝑒𝑠.

c. Unrolling the vector loop twice (after register renaming):

LOOP: L.V   V1,0(R2),R6   /load X
      L.V   V2,0(R3),R6   /load Y
      MUL.V V3,V2,V1      /multiply two vector registers
      ADD.V V4,V4,V3      /partial sum
      ADDI  R2,R2,#128
      ADDI  R3,R3,#128
      L.V   V5,0(R2),R6   /load X
      L.V   V6,0(R3),R6   /load Y
      MUL.V V7,V5,V6      /multiply two vector registers
      ADD.V V8,V8,V7      /partial sum
      ADDI  R2,R2,#128
      ADDI  R3,R3,#128
      ADDI  R4,R4,#2
      BNE   R4,R5,LOOP

Load and multiplication on each strip of 128 components of the vectors are chained and run in
parallel. Partial sums are written in two different vector registers, V4 and V8. The scalar
processor accumulates the components of the partial-sum vectors at the end. Ignoring the scalar
phases, one iteration of the unrolled loop takes 152 cycles. Hence we compute 256 elements of
the vectors in each iteration, and the number of iterations for vectors of size 1024 is halved
to 4 (= 1024/256). Therefore the total number of cycles to execute the dot product is
152 x 4 = 608 cycles.
For the multiplication of two 1024 x 1024 matrices, we need 1024 x 1024 dot products. Hence
the total number of cycles for this computation is 608 x 1024 x 1024 = 637,534,208 cycles.
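The cycle counts of parts a-c follow from the chained-timing model above; a quick numeric check (not part of the original solution):

```python
# Chained vector timing model from part a: one strip takes
# latency(load) + latency(mul) + latency(add) + VL - 1 cycles.
VL, LOAD, MUL, ADD = 128, 10, 10, 5

t_iter = LOAD + MUL + ADD + VL - 1          # 152 cycles per strip of 128 elements
dot_1024 = t_iter * (1024 // VL)            # 8 strips -> 1216 cycles per dot product
matmul = dot_1024 * 1024 * 1024             # one dot product per result element

dot_unrolled = t_iter * (1024 // (2 * VL))  # unrolled twice: 4 iterations -> 608 cycles
print(t_iter, dot_1024, matmul, dot_unrolled * 1024 * 1024)
```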

Problem 6 [Memory mapping] (20 pts.)


Exercise 4.3 in the book, but consider the following modifications:
Assume a 42-bit virtual address space per process in a 64-bit machine and 8 GB of main memory.
The page size is 16 KB. Page table entries are 4 bytes in every table. Various hierarchical page
table organizations are envisioned: 1, 2 and 3 levels. The virtual space to map is populated as shown
in Figure 1. Kernel space addresses are not translated because physical addresses are identical to
virtual addresses. However virtual addresses in all other segments must be translated.

[Figure 1 (virtual address space layout, from address 0 to 2^42 - 1) is not reproduced here.]

(a) Assume a single level page table. All 28 bits of virtual page number are used to index the page
table. How many tables would we have? What would be their total size?
Sol.>
There is only one table. The size of the table is 2^28 * 4 B = 1 GB.

(b) Assume first a 2-level page table. We split the 28 bits of virtual page number into a
15-bit level-1 field and a 13-bit level-2 (bottom) field. How many tables would we have? What
would be their total size?
Sol.>
The first-level (root) page table must be allocated. It is accessed with 15 bits and therefore
has 2^15 entries of 4 bytes each, for a total of 128 KB. Each entry of the first-level page
table covers 2^(13+14) = 2^27 = 128 MB of virtual memory. We don't need second-level tables for
the kernel since it is not mapped. But we need second-level page tables for the code segment and
the two data segments. The code segment lies in the virtual memory area covered by the first
entry of the first-level page table. The two data segments lie in the virtual memory area covered
by the last 6 entries of the first-level page table. Therefore we need a total of 7 second-level
page tables, which occupy 32 KB each. Thus the number of page tables is 8 (1 level-1 + 7 level-2)
and the total physical memory occupied by the page tables is 128 KB + 7 * 32 KB = 352 KB, i.e.
a huge reduction as compared to the single-level page table.

(c) For a 3-level page table splitting the 28 bits of virtual page number into 3 fields of
10, 9, and 9 bits.
Sol.>
The first-level table is accessed with 10 bits and therefore has 2^10 entries of 4 bytes each,
for a total of 4 KB. Each entry of the first-level page table covers 2^(18+14) = 2^32 = 4 GB of
virtual memory. A second-level table is accessed with 9 bits and therefore has 2^9 entries of
4 bytes each, for a total of 2 KB. Each entry of a second-level page table covers
2^(9+14) = 2^23 = 8 MB of virtual memory. The code segment lies in the virtual memory area
covered by the first entry of the first-level page table and the first 2 entries of its
second-level page table. The two data segments lie in the virtual memory area covered by the
last entry of the first-level page table and the last 96 entries of its second-level page table.
Therefore we need a total of 98 third-level page tables, which occupy 2 KB each. Thus the number
of page tables is 1 + 2 + 98 = 101 and the total physical memory occupied by the page tables is
1 * 4 KB + (2 + 98) * 2 KB = 204 KB.
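A short script cross-checks the sizes in (a)-(c). The counts of second- and third-level tables (7, and 2 + 98) come from the segment layout in Figure 1 and are taken as given here:

```python
PTE = 4                   # bytes per page-table entry
VPN_BITS = 42 - 14        # 16 KB pages -> 14 offset bits -> 28 VPN bits

# (a) single level: one table indexed by all 28 VPN bits
single = (1 << VPN_BITS) * PTE               # 1 GB

# (b) 15 + 13 split: 1 root table + 7 leaf tables (1 code entry + 6 data entries)
root = (1 << 15) * PTE                       # 128 KB
leaf = (1 << 13) * PTE                       # 32 KB each
two_level = root + 7 * leaf                  # 352 KB

# (c) 10 + 9 + 9 split: 1 + 2 + 98 tables
l1 = (1 << 10) * PTE                         # 4 KB
l23 = (1 << 9) * PTE                         # 2 KB each (levels 2 and 3)
three_level = l1 + (2 + 98) * l23            # 204 KB

print(single, two_level // 1024, three_level // 1024)
```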

Problem 7 [Cache organization] (20 pts.)


Exercise 4.5 in the book, but consider the following modifications:
The address string is (0, 4, 1, 2, 1)^4, i.e. the five-address pattern repeated four times.
Show the cache contents after each memory access until a pattern emerges, but for at least four
iterations.
Sol.>
a.
DIRECT-MAPPED:
Table 1: Direct-Mapped

Iteration        |----1----|  |----2----|  |----3----|  |----4----|
Block Address    0 4 1 2 1    0 4 1 2 1    0 4 1 2 1    0 4 1 2 1
Line 0           0 4 4 4 4    0 4 4 4 4    0 4 4 4 4    0 4 4 4 4
Line 1           . . 1 1 1    1 1 1 1 1    1 1 1 1 1    1 1 1 1 1
Line 2           . . . 2 2    2 2 2 2 2    2 2 2 2 2    2 2 2 2 2
Line 3           (never used)
Misses           * * * *      * *          * *          * *

Total number of misses = 4 + (2 * 3) = 10 misses
FA with LRU REPLACEMENT:
Table 2: FA with LRU (top line is MRU)

Iteration        |----1----|  |----2----|  |----3----|  |----4----|
Block Address    0 4 1 2 1    0 4 1 2 1    0 4 1 2 1    0 4 1 2 1
MRU              0 4 1 2 1    0 4 1 2 1    0 4 1 2 1    0 4 1 2 1
                 . 0 4 1 2    1 0 4 1 2    1 0 4 1 2    1 0 4 1 2
                 . . 0 4 4    2 1 0 4 4    2 1 0 4 4    2 1 0 4 4
LRU              . . . 0 0    4 2 2 0 0    4 2 2 0 0    4 2 2 0 0
Misses           * * * *

Total number of misses = 4 misses

FA with FIFO REPLACEMENT:
Table 3: FA with FIFO (top line is the last block brought in)

Iteration        |----1----|  |----2----|  |----3----|  |----4----|
Block Address    0 4 1 2 1    0 4 1 2 1    0 4 1 2 1    0 4 1 2 1
Last in          0 4 1 2 2    2 2 2 2 2    2 2 2 2 2    2 2 2 2 2
                 . 0 4 1 1    1 1 1 1 1    1 1 1 1 1    1 1 1 1 1
                 . . 0 4 4    4 4 4 4 4    4 4 4 4 4    4 4 4 4 4
First in         . . . 0 0    0 0 0 0 0    0 0 0 0 0    0 0 0 0 0
Misses           * * * *

Total number of misses = 4 misses

FA with LIFO REPLACEMENT:
Table 4: FA with LIFO (top line is the last block brought in; identical to FIFO here, since
the four distinct blocks all fit and no replacement ever occurs)

Iteration        |----1----|  |----2----|  |----3----|  |----4----|
Block Address    0 4 1 2 1    0 4 1 2 1    0 4 1 2 1    0 4 1 2 1
Last in          0 4 1 2 2    2 2 2 2 2    2 2 2 2 2    2 2 2 2 2
                 . 0 4 1 1    1 1 1 1 1    1 1 1 1 1    1 1 1 1 1
                 . . 0 4 4    4 4 4 4 4    4 4 4 4 4    4 4 4 4 4
First in         . . . 0 0    0 0 0 0 0    0 0 0 0 0    0 0 0 0 0
Misses           * * * *

Total number of misses = 4 misses

2-way SA with LRU REPLACEMENT:
Table 5: 2-way SA with LRU per set (Lines 0-1 form Set 0, Lines 2-3 form Set 1; the upper
line of each set is MRU)

Iteration        |----1----|  |----2----|  |----3----|  |----4----|
Block Address    0 4 1 2 1    0 4 1 2 1    0 4 1 2 1    0 4 1 2 1
Line 0           0 4 4 2 2    0 4 4 2 2    0 4 4 2 2    0 4 4 2 2
Line 1           . 0 0 4 4    2 0 0 4 4    2 0 0 4 4    2 0 0 4 4
Line 2           . . 1 1 1    1 1 1 1 1    1 1 1 1 1    1 1 1 1 1
Line 3           (never used)
Misses           * * * *      * *   *      * *   *      * *   *

Total number of misses = 4 + 3 * 3 = 13 misses
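The miss totals of part a can be reproduced with a small simulator (a sketch, not part of the original solution; the function names are ours):

```python
def direct_mapped_misses(trace, lines):
    """Direct-mapped cache: block b maps to line b % lines."""
    cache = [None] * lines
    misses = 0
    for b in trace:
        idx = b % lines
        if cache[idx] != b:
            cache[idx] = b
            misses += 1
    return misses

def fa_lru_misses(trace, lines):
    """Fully associative LRU: keep an MRU-first stack of blocks."""
    stack, misses = [], 0
    for b in trace:
        if b in stack:
            stack.remove(b)
        else:
            misses += 1
            if len(stack) == lines:
                stack.pop()          # evict LRU (back of the stack)
        stack.insert(0, b)
    return misses

def sa_lru_misses(trace, sets, ways):
    """Set-associative LRU: one MRU-first stack per set, set = b % sets."""
    cache, misses = [[] for _ in range(sets)], 0
    for b in trace:
        s = cache[b % sets]
        if b in s:
            s.remove(b)
        else:
            misses += 1
            if len(s) == ways:
                s.pop()
        s.insert(0, b)
    return misses

trace = [0, 4, 1, 2, 1] * 4
print(direct_mapped_misses(trace, 4))   # 10
print(fa_lru_misses(trace, 4))          # 4
print(sa_lru_misses(trace, 2, 2))       # 13
```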

b.
Sol.>
COLD MISSES:
The number of cold misses is independent of the cache organization and replacement policy. The
number of cold misses is equal to the number of blocks in the trace:
Nr_cold_misses = 4 misses
CAPACITY MISSES:
The number of capacity misses is also independent of the cache organization and replacement
policy. It is given by the number of misses in an FA LRU cache minus the number of cold misses:
Nr_capacity_misses = 4 - 4 = 0 misses
CONFLICT MISSES:
Conflict misses must be evaluated for each cache by subtracting both cold and capacity misses from
the total number of misses of the corresponding cache:
i. Direct-mapped : Nr_conflict_misses = 10 - 4 = 6 misses
ii. FA with LRU : Nr_conflict_misses = 4 - 4 = 0
iii. FA with FIFO : Nr_conflict_misses = 4 - 4 = 0
iv. FA with LIFO : Nr_conflict_misses = 4 - 4 = 0
v. 2-way SA with LRU : Nr_conflict_misses = 13 - 4 = 9
Sometimes, using an FA cache with LRU to calculate the capacity misses yields negative numbers
of conflict misses for some cache organizations, which of course is not acceptable. The problem
is that an FA cache with LRU is not the best possible cache and hence cannot be used as the
baseline; instead we should use an FA cache with an optimal replacement algorithm, as we do in
parts c and d.
c.
FA with THE OPTIMAL REPLACEMENT:
Table 6: FA with OPT

Iteration        |----1----|  |----2----|  |----3----|  |----4----|
Block Address    0 4 1 2 1    0 4 1 2 1    0 4 1 2 1    0 4 1 2 1
Line 0           0 4 1 2 1    0 4 1 2 1    0 4 1 2 1    0 4 1 2 1
Line 1           . 0 4 1 2    1 0 4 1 2    1 0 4 1 2    1 0 4 1 2
Line 2           . . 0 4 4    2 1 0 4 4    2 1 0 4 4    2 1 0 4 4
Line 3           . . . 0 0    4 2 2 0 0    4 2 2 0 0    4 2 2 0 0
Misses           * * * *

Total number of misses = 4 misses


d.
First we need to evaluate the number of capacity misses using FA with OPT replacement. It is given
by the number of misses in an FA OPT cache minus the number of cold misses:
Nr_capacity_misses = 4 - 4 = 0 misses
CONFLICT MISSES:
Conflict misses must be evaluated for each cache by subtracting both cold and capacity misses from
the total number of misses of the corresponding cache:
i. Direct-mapped : Nr_conflict_misses = 10 - 4 = 6 misses
ii. FA with LRU : Nr_conflict_misses = 4 - 4 = 0
iii. FA with FIFO : Nr_conflict_misses = 4 - 4 = 0
iv. FA with LIFO : Nr_conflict_misses = 4 - 4 = 0
v. 2-way SA with LRU : Nr_conflict_misses = 13 - 4 = 9 misses

Problem 8 [Cache policies] (20 pts.)


Exercise 4.6 in the book, but with the following trace:
ABCDDEACEBCADA
Show the cache contents after each memory access of the trace. In PLRU, the bits that keep track
of the replacement state are initialized to 0. In LFU, keep access counts only for blocks
currently in the cache.

Sol.>
a.
LRU
Table 7: LRU (top line is MRU)

trace        A  B  C  D  D  E  A  C  E  B  C  A  D  A
Miss/Hit     M  M  M  M  H  M  M  H  H  M  H  H  M  H
MRU          A  B  C  D  D  E  A  C  E  B  C  A  D  A
             .  A  B  C  C  D  E  A  C  E  B  C  A  D
             .  .  A  B  B  C  D  E  A  C  E  B  C  C
LRU          .  .  .  A  A  B  C  D  D  A  A  E  B  B

LRU miss rate = (number of misses) / (number of cache accesses) = 8/14

FIFO
Table 8: FIFO (top line is the last block brought in)

trace        A  B  C  D  D  E  A  C  E  B  C  A  D  A
Miss/Hit     M  M  M  M  H  M  M  H  H  M  M  H  M  H
Last in      A  B  C  D  D  E  A  A  A  B  C  C  D  D
             .  A  B  C  C  D  E  E  E  A  B  B  C  C
             .  .  A  B  B  C  D  D  D  E  A  A  B  B
First in     .  .  .  A  A  B  C  C  C  D  E  E  A  A

FIFO miss rate = 9/14

Pseudo-LRU (PLRU)
Table 9: PLRU (tree bits initialized to 0; contents of the four lines after each access)

trace        A  B  C  D  D  E  A  C  E  B  C  A  D  A
Miss/Hit     M  M  M  M  H  M  M  H  H  M  H  H  M  H
Line 0       A  A  A  A  A  E  E  E  E  E  E  E  D  D
Line 1       .  .  C  C  C  C  C  C  C  C  C  C  C  C
Line 2       .  B  B  B  B  B  A  A  A  A  A  A  A  A
Line 3       .  .  .  D  D  D  D  D  D  B  B  B  B  B

PLRU miss rate = 8/14

LFU
Table 10: LFU (contents with access counts, kept only while a block is cached; ties are broken
by evicting the least recently used among the least frequently used)

trace        A   B   C   D   D   E   A   C   E   B   C   A   D   A
Miss/Hit     M   M   M   M   H   M   M   H   H   M   H   M   H   H
Line 0       A1  A1  A1  A1  A1  E1  E1  E1  E2  E2  E2  E2  E2  E2
Line 1       .   B1  B1  B1  B1  B1  A1  A1  A1  B1  B1  A1  A1  A2
Line 2       .   .   C1  C1  C1  C1  C1  C2  C2  C2  C3  C3  C3  C3
Line 3       .   .   .   D1  D2  D2  D2  D2  D2  D2  D2  D2  D3  D3

LFU miss rate = 8/14

Optimum
Table 11: Optimum (Belady's OPT; contents of the four lines after each access)

trace        A  B  C  D  D  E  A  C  E  B  C  A  D  A
Miss/Hit     M  M  M  M  H  M  H  H  H  H  H  H  M  H
Line 0       A  A  A  A  A  A  A  A  A  A  A  A  A  A
Line 1       .  B  B  B  B  B  B  B  B  B  B  B  B  B
Line 2       .  .  C  C  C  C  C  C  C  C  C  C  C  C
Line 3       .  .  .  D  D  E  E  E  E  E  E  E  D  D

(On the miss to E, the block reused furthest in the future, D, is evicted; on the final miss
to D, any of B, C, or E may be evicted since none is reused.)
Opt miss rate = 6/14
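A small simulator reproduces the LRU, FIFO, and OPT miss counts above (a sketch, not part of the original solution; PLRU and LFU need extra replacement state and are omitted here):

```python
def misses(trace, policy, lines=4):
    """Fully associative cache. For LRU/FIFO the victim is the front of the list;
    LRU moves a hit block to the back, FIFO does not reorder on hits.
    OPT evicts the resident block whose next use is furthest in the future."""
    cache, m = [], 0
    for i, b in enumerate(trace):
        if b in cache:
            if policy == "LRU":
                cache.remove(b)
                cache.append(b)      # most recently used goes to the back
            continue
        m += 1
        if len(cache) == lines:
            if policy in ("LRU", "FIFO"):
                victim = cache[0]
            else:  # OPT
                future = trace[i + 1:]
                victim = max(cache, key=lambda x: future.index(x)
                             if x in future else len(future) + 1)
            cache.remove(victim)
        cache.append(b)
    return m

trace = list("ABCDDEACEBCADA")
print(misses(trace, "LRU"), misses(trace, "FIFO"), misses(trace, "OPT"))  # 8 9 6
```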

b.
COLD MISSES:
The number of cold misses is independent of the cache organization and replacement policy. The
number of cold misses is equal to the number of blocks in the trace:
Nr_cold_misses = 5 misses

i. Using FA with LRU to calculate capacity misses:


The number of capacity misses is also independent of the cache organization and replacement
policy. It is given by the number of misses in an FA LRU cache minus the number of cold
misses:
Nr_capacity_misses = 8 - 5 = 3 misses

CONFLICT MISSES:
1. FA with LRU  : Nr_conflict_misses = 8 - 8 = 0
2. FA with FIFO : Nr_conflict_misses = 9 - 8 = 1
3. FA with PLRU : Nr_conflict_misses = 8 - 8 = 0
4. FA with LFU  : Nr_conflict_misses = 8 - 8 = 0
5. FA with OPT  : Nr_conflict_misses = 6 - 8 = -2 misses

ii. Using FA with OPT to calculate capacity misses:


The number of capacity misses is also independent of the cache organization and replacement
policy. It is given by the number of misses in an FA OPT cache minus the number of cold misses:
Nr_capacity_misses = 6 - 5 = 1 miss
CONFLICT MISSES:
1. FA with LRU  : Nr_conflict_misses = 8 - 6 = 2
2. FA with FIFO : Nr_conflict_misses = 9 - 6 = 3
3. FA with PLRU : Nr_conflict_misses = 8 - 6 = 2
4. FA with LFU  : Nr_conflict_misses = 8 - 6 = 2
5. FA with OPT  : Nr_conflict_misses = 6 - 6 = 0 misses

Problem 9 [Cache optimization] (20 pts.)


Exercise 4.7 in the book, but consider the following modifications:
The matrix sizes are 512x512 and the cache size is 128 KB.

Sol.>
1. The pseudocode for Matrix Multiplication X = Y*Z is
// X = Y*Z
initialize X = 0;
for (i = 0; i<N; i++)
for (j = 0; j<N; j++)
for (k = 0; k<N; k++)
X[i][j] += Y[i][k] * Z[k][j]; // Statement S1

There are N^3 FP multiplications and N^3 FP additions, thus a total of 2N^3 = O(N^3) FP
operations in this matrix multiplication problem. With N = 512, the total number of FP ops is
2^28 = 256M. Because each element of the matrix is 8 bytes and the cache block size is 8 bytes
as well, there is no spatial locality within a cache line. Because the cache can easily contain
a matrix row, X and Y suffer no capacity misses. Each element X[i][j] is computed in a register
and then stored once it is computed; thus X[i][j] is accessed only once and can only have a
single cold miss. Y is accessed row by row. The reuse distance (the number of distinct blocks
accessed between two consecutive accesses to the same block of Y) is 1 column of Z plus one row
of Y, i.e. 2N = 2 x 512 = 2^10 blocks, or 8 x 2^10 = 8 KB. This fits well within the cache size
of 128 KB. The only misses on X and Y are cold misses. However Z (of size 8N^2 = 2^21 B = 2 MB)
is too big to be held in the cache and must be reloaded for the computation of every row of X.
The number of misses on elements of X is N^2, on Y it is N^2, and on Z it is N^3. Thus the total
number of misses is N^3 + 2N^2 = 2^27 + 2^19 = 134,742,016.

2. The blocked (tiled) code for matrix multiplication X = Y*Z:

We need to choose k such that 3 submatrices (of X, Y and Z) fit well within the data cache:

3 x (512/k)^2 x 8 B <= 128 KB  =>  k >= 6.93

To minimize cache misses, k should be at least 8, so that all the tiles can be kept in cache
while a tile is being computed. The code for k = 8 is:
// X = Y*Z, blocked with tile size N/8 = 64
initialize X = 0;
for (i = 0; i < N; i += N/8)
  for (j = 0; j < N; j += N/8)
    for (k = 0; k < N; k += N/8)
      for (ii = i; ii < min(i + N/8, N); ii++)
        for (jj = j; jj < min(j + N/8, N); jj++)
          for (kk = k; kk < min(k + N/8, N); kk++)
            X[ii][jj] += Y[ii][kk] * Z[kk][jj];
A total of 2 x (N/8)^3 x 8^3 = 2N^3 FP operations are involved in this matrix multiplication,
the same number as before. Now let's count the misses. We must multiply two 8x8 grids of
submatrices. The 3 submatrices accessed in each tile multiply (each of size
64 x 64 x 8 B = 2^15 B = 32 KB) now all fit in the cache. For instance the computation of the
first tile of X is given by:
X11 = Y11 x Z11 + Y12 x Z21 + ... + Y17 x Z71 + Y18 x Z81
The first term for each tile of X must bring three tiles (including those of X, Y and Z) into
the cache. For the 7 remaining terms we only have misses for 2 tiles each, as the X tile stays
in cache. Thus the misses per tile of X amount to the misses for 3 + 7 x 2 = 17 tile loads.
Since there are 64 tiles of X, the total is 17 x 64 tile loads, and each tile load takes
N^2/64 misses to bring the tile into the cache. Thus the total number of misses is
17 x 64 x N^2/64 = 17N^2 = 4,456,448.

3.
Tiling (blocking) dramatically reduces the total number of cache misses when the cache size is
small compared to the data (problem) size. The total number of FP operations remains the same,
i.e. 2N^3. The total number of misses is N^3 + 2N^2 for part 1, but only 17N^2 for part 2. With
tiling, the loop overhead is larger since the loop nest is twice as deep; this overhead involves
branch handling and some extra integer calculations for loop indices. However, it is usually
much smaller than the penalty caused by long-latency cache misses.
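The two miss totals can be checked numerically (a quick sanity check, not part of the original solution):

```python
N = 512
naive = N**3 + 2 * N**2    # Z reloaded for every row of X, plus cold misses on X and Y
blocked = 17 * N**2        # 17 tile loads per X tile, N^2/64 misses each, 64 tiles
print(naive, blocked, naive / blocked)
```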

Problem 10 [Non-blocking cache] (20 pts.)


Problem 4.8 in the book, but consider that the data cache is direct-mapped, is empty at the
start, and has a block size of 32 bytes. The word size is 4 bytes. A cache miss by a load or
store takes 31 cycles total: one cycle for the initial try plus 30 cycles to load the missed
block. A prefetch takes 1 cycle to execute plus an additional 30 cycles on a miss. After a
prefetch miss, a load or store retries once the block has been brought into the cache after
30 cycles.
In part a, consider the following code:
LOOP: LW   R4,0(R3)
      ADD  R1,R1,R4
      ADDI R3,R3,stridex4
      BNE  R3,R5,LOOP
In part b, consider the following code:
LOOP: LW   R4,0(R3)
      ADD  R1,R1,R4
      PW   stridex4(R3)
      ADDI R3,R3,stridex4
      BNE  R3,R5,LOOP

Sol.>
a. Each block has 8 words of 4 bytes (32 bits).
The time taken by an iteration with a hit or a miss is 7 or 37 cycles respectively:
(i) cache hit:
LOOP: LW   R4,0(R3)        (1)
      ADD  R1,R1,R4        (1+1)
      ADDI R3,R3,stridex4  (1)
      BNE  R3,R5,LOOP      (1+2 flushes)

(ii) cache miss:
LOOP: LW   R4,0(R3)        (1 + 30 cache latency)
      ADD  R1,R1,R4        (1+1)
      ADDI R3,R3,stridex4  (1)
      BNE  R3,R5,LOOP      (1+2 flushes)

If the stride is greater than or equal to 8, every load incurs a cache miss and the average
execution time per iteration is 37 cycles. For each stride less than 8, there is a miss pattern
that repeats periodically every LCM(8, stride)/stride memory accesses, with LCM(8, stride)/8
misses per period. For instance, if the stride is one, the pattern period is 8 accesses
(= LCM(8,1)/1) with 1 cache miss (= LCM(8,1)/8). The number of accesses is also the number of
loop iterations, so the average execution time per iteration is
(37 x 1 + 7 x 7) / 8 = 10.75 cycles/iteration.
The average execution time per iteration as a function of the stride is as follows:
i) stride >= 8:
Every iteration misses, so the average execution time is 37 cycles per iteration.
ii) stride < 8:
Average cycles per iteration = (#misses x 37 + #hits x 7) / #accesses
= [ (LCM(8,stride)/8) x 37 + (LCM(8,stride)/stride - LCM(8,stride)/8) x 7 ] / (LCM(8,stride)/stride)
= (30/8) x stride + 7

b.
(i) cache hit:
LOOP: LW   R4,0(R3)        (1)
      ADD  R1,R1,R4        (1+1)
      PW   stridex4(R3)    (1)
      ADDI R3,R3,stridex4  (1)
      BNE  R3,R5,LOOP      (1+2 flushes)

(ii) cache miss (the prefetch of the missing block was issued in the previous iteration):
      PW   stridex4(R3)    (1)
      ADDI R3,R3,stridex4  (1)
      BNE  R3,R5,LOOP      (1+2 flushes)
LOOP: LW   R4,0(R3)        (1 + 25 cache latency + 1 retry)
      ADD  R1,R1,R4        (1+1)
      PW   stridex4(R3)    (1)
      ADDI R3,R3,stridex4  (1)
      BNE  R3,R5,LOOP      (1+2 flushes)

The number of cycles taken by a loop iteration with a cache hit increases to 8 because the PW
instruction consumes one additional cycle. Because the prefetch instruction runs only one
iteration ahead of the load, with the same stride, its benefit comes when it prefetches the next
cache block: it reduces the miss penalty by the number of cycles between the prefetch and the
load of the next iteration (5 cycles). Therefore the miss cost is 25 cycles (= 30 - 5). With one
retry cycle, an iteration with a miss takes 34 cycles. We plug these new numbers into the
equation obtained in part a.
The average execution time per iteration as a function of the stride is as follows:
i) stride >= 8:
Every iteration misses, so the average execution time is 34 cycles per iteration.
ii) stride < 8:
Average cycles per iteration = (#misses x 34 + #hits x 8) / #accesses
= [ (LCM(8,stride)/8) x 34 + (LCM(8,stride)/stride - LCM(8,stride)/8) x 8 ] / (LCM(8,stride)/stride)
= (26/8) x stride + 8

c.
i) stride >= 8:
For strides greater than or equal to 8, prefetching is always more effective: the average
execution time per iteration is 34 cycles, lower than the 37 cycles without prefetching.
ii) stride < 8:
(26/8) x stride + 8 < (30/8) x stride + 7  =>  1 < (1/2) x stride  =>  stride > 2

So prefetching pays off only for strides greater than 2.
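The closed-form averages of parts a and b, and the crossover at stride 2, can be checked with a short script (hit/miss costs of 7/37 and 8/34 cycles taken from the analysis above; needs Python 3.9+ for math.lcm):

```python
from math import lcm

def avg_cycles(stride, hit, miss):
    """Average cycles per iteration, with 8 words per 32 B block."""
    if stride >= 8:
        return miss                          # every access misses
    period = lcm(8, stride) // stride        # accesses per repeating pattern
    nmiss = lcm(8, stride) // 8              # misses per pattern
    return (nmiss * miss + (period - nmiss) * hit) / period

for s in range(1, 9):
    no_pf = avg_cycles(s, 7, 37)             # part a: no prefetch
    pf = avg_cycles(s, 8, 34)                # part b: with prefetch
    print(s, no_pf, pf, pf < no_pf)          # prefetch wins only for stride > 2
```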
