HW3S24 Sol
Branches are delayed by two instructions and executed in the EX3 stage. Jumps are delayed by one
instruction and are executed in the ID3 stage by the branch pipeline. Results are forwarded to the
EX stages with Full-forwarding.
(b) Show a VLIW code with Rotating registers (in Figure 3.32 of the textbook) for the kernel using
software pipelining for the program. Follow the procedure of “Software pipelining algorithm” on
page 149 in the book. Assume full forwarding is used in the VLIW machine.
(c) The VLIW machine's computation power is increased by adding more slots. Now it has 2
LD/ST slots, 2 FP op slots, 1 INT slot, and 1 branch slot. The operation latencies are kept the same
as in the previous machine. Show a VLIW code with Rotating registers (in Figure 3.32 of the
textbook) for the kernel using software pipelining for the same program.
LABEL   Slot 1 (LD/ST)  Slot 2 (LD/ST)  Slot 3 (FP)  Slot 4 (FP)  Slot 5 (INT)  Slot 6 (Branch)
KERNEL
Sol.>
(a)
Source          Destination     No forwarding   Register forwarding   Full forwarding
FP ALU op       FP ALU op       4               3                     2
Load Double     FP ALU op       3               2                     1
Store Double    Load Double     0               0                     0
Load Double     Store Double    3               2                     1
INT op/Branch   INT op          2               1                     0
LABEL  Slot 1 (LD/ST)  Slot 2 (LD/ST)   Slot 3 (FP)       Slot 4 (FP)  Slot 5 (INT)   Slot 6 (Branch)
KER    L.D RR6,0(R1)   S.D -40(R1),RR0  ADD.D RR3,RR4,F2               ADDI R1,R1,#8  BNE R1,R2,KER
       NOOP
       NOOP
Problem 2 [Predicated instructions] (30 pts.)
Consider the code in Exercise 3.24, and do the following:
(a) Identify all basic blocks and all possible traces in the code. For traces, answer in terms of
basic blocks.
(b) Schedule the code as best you can by using local scheduling only (i.e., within basic blocks)
on the VLIW machine used in problem 1(b) with full forwarding.
(c) Translate this code into the new assembly code with predicated and/or non-predicated
instructions, but using no conditional branch instructions or unconditional jumps. Make sure
that you do not create unwanted exceptions.
Sol.>
(a)
BB1: I1~ I5
BB2: I6 ~ I7
BB3: I8 ~ I10
BB4: I11 ~ I13
BB5: I14 ~ I15
BB6: I16 ~ I17
Trace 1: BB1 -> BB2 -> BB3: (A>=C && A>=B)
Trace 2: BB1 -> BB2 -> BB4 -> BB5 -> BB6: (A>=C && A<B) and (B>D && A==C+D)
Trace 3: BB1 -> BB2 -> BB4 -> BB5: (A>=C && A<B) and (B>D && A!=C+D)
Trace 4: BB1 -> BB2 -> BB4 -> BB6: (A>=C && A<B) and (B<=D)
Trace 5: BB1 -> BB4 -> BB5 -> BB6: (A<C) and (B>D && A==C+D)
Trace 6: BB1 -> BB4 -> BB5: (A<C) and (B>D && A!=C+D)
Trace 7: BB1 -> BB4 -> BB6: (A<C) and (B<=D)
(b)
BB  INST  LABEL   LD/ST          FP op  INT/BRANCH
1   I1            LW R5, 0(R1)
1   I2            LW R7, 0(R3)
1   I3            LW R6, 0(R2)
1   I4                                  SLT R8,R5,R7
1   I5                                  BNEZ R8, else
1   I6
1   I7
2   I8                                  SLT R9,R5,R6
2   I9                                  BNEZ R9, else
2   I10
3   I11                                 ADD R10,R6,R7
3   I12           SW R10,0(R1)          J exit
3   I13
4   I14   else    LW R11,0(R4)
4   I15
4   I16                                 SLT R12,R11,R6
4   I17                                 BEQZ R12, else1
4   I18
4   I19
5   I20                                 ADD R13,R7,R11
5   I21                                 BNE R5,R13,exit
5   I22
5   I23
6   I24   else1                         SUB R14,R5,R7
6   I25           SW R14,0(R2)
(c)
d. What is the speedup (in cycles) of the software pipelining in c over the local scheduling in b if
the number of iterations of the loop is one hundred?
Sol.>
a.
[DDG figure: O1 --(0,2)--> O2 --(0,3)--> O3 --(2,1)--> O1, where each edge is labeled (distance, latency)]
b.
LABEL  Slot 1 (memory)  Slot 2 (FP)  Slot 3 (int/branch)
LOOP   L.S F0, 0(R2)
c.
Data dependences (DDG):
There is one cycle C in the DDG: MIN(C) = 2 + 3 + 1 = 6 and DIFF(C) = 0 + 0 + 2 = 2.
Therefore, II1 = ⌈6/2⌉ = 3.
Resource (slot) utilization (op slot conflicts):
II2 = MAX(II(O_MemOp), II(O_FPOp), II(O_br/intOp)) = 2
Hence the initiation interval is II = MAX(II1, II2) = 3.
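As a quick check of this arithmetic, here is a small C sketch (an illustration, not part of the original solution) that computes both lower bounds and takes their maximum; the MIN/DIFF sums and the resource bound of 2 are the values stated above.

#include <stdio.h>

int main(void) {
    /* Recurrence bound: II1 = ceil(MIN(C) / DIFF(C)) for the DDG cycle */
    int min_c = 2 + 3 + 1;                     /* sum of latencies along the cycle */
    int diff_c = 0 + 0 + 2;                    /* sum of iteration distances */
    int ii1 = (min_c + diff_c - 1) / diff_c;   /* ceiling division = 3 */

    /* Resource bound: the most heavily used slot type needs 2 issues */
    int ii2 = 2;

    int ii = ii1 > ii2 ? ii1 : ii2;
    printf("II1=%d II2=%d II=%d\n", ii1, ii2, ii);   /* II1=3 II2=2 II=3 */
    return 0;
}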
VLIW program
LABEL     Slot 1 (memory)  Slot 2 (FP)  Slot 3 (int/branch)
PROLOGUE  L.S F0, 0(R2)
d.
Execution Time:
For local scheduling only, 6 × 100 = 600 cycles
For software pipelining, 3(prologue) + 99 × 3(kernel) + 3(epilogue) = 303
Speedup = 600/303 = 1.98
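The comparison is simple enough to script; this C sketch (illustrative only, not part of the solution) reproduces the cycle totals for 100 iterations.

#include <stdio.h>

int main(void) {
    int n = 100;
    int local = 6 * n;               /* 6 cycles per locally scheduled iteration */
    int swp = 3 + (n - 1) * 3 + 3;   /* prologue + 99 kernel iterations + epilogue */
    printf("local=%d swp=%d speedup=%.2f\n",
           local, swp, (double)local / swp);   /* 600, 303, 1.98 */
    return 0;
}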
Problem 4 [Vector memory] (20 pts.)
Exercise 3.27 in the notes, but consider the following modifications.
a) Assume 16 banks and a bank access time of 6 cycles. What are the permissible vector strides
(from 1 to 16) that avoid conflicts? Try to generalize this result to any stride by considering
the stride modulo 16.
b) Assume 17 banks and a bank access time of 6 cycles. What are the permissible vector strides
(from 1 to 17) that avoid conflicts? Try to generalize this result to any stride by considering
the stride modulo 17.
c) Discuss the advantages and inconveniences in using 17 banks rather than 16 banks.
Sol.>
a. The bank number where component X_i of a vector is stored is given by (stride × i) mod 16. To
see whether there is a conflict, we find the minimum gap between components that are returned
from the same bank. Call this gap K and the stride S. For example, if the stride is 1, the same bank
number repeats every 16 components of the vector, so K is 16.
S K
1 16
2 8
3 16
4 4
5 16
6 8
7 16
8 2
9 16
10 8
11 16
12 4
13 16
14 8
15 16
16 1
As shown in the table above, K can be generalized. Depending on the stride, the bank numbers
accessed by successive components follow a pattern, and the pattern repeats every K accesses.
Because the pattern is determined by the GCD (the greatest common divisor) of the stride and the
total number of banks (16 in this problem), K is the number of banks divided by the GCD of the
stride and the total number of banks. Because the bank access time is 6, there is a conflict if K is
less than 6:

If 16/GCD(S,16) < 6, then there is a conflict.
Hence, all strides except multiples of 4 (4, 8, 12, and 16) are permissible and avoid conflicts. In
case S is greater than 16, the bank number pattern of S is the same as the bank number pattern
of (S mod 16), and its K is the same as well.
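The condition above is easy to check mechanically. The following C sketch (an illustration, not part of the exercise) computes K = banks/GCD(S mod banks, banks) for both part a (16 banks) and part b (17 banks), using the 6-cycle bank access time given in the problem.

#include <stdio.h>

static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

/* K = banks / GCD(S mod banks, banks); a conflict exists when K < access time */
static int has_conflict(int stride, int banks, int access_time) {
    int s = stride % banks;
    if (s == 0) s = banks;            /* a multiple of banks behaves like K = 1 */
    return banks / gcd(s, banks) < access_time;
}

int main(void) {
    for (int s = 1; s <= 16; s++)     /* part a: 16 banks, 6-cycle access */
        printf("16 banks, stride %2d: %s\n", s,
               has_conflict(s, 16, 6) ? "conflict" : "permissible");
    for (int s = 1; s <= 17; s++)     /* part b: 17 banks, 6-cycle access */
        printf("17 banks, stride %2d: %s\n", s,
               has_conflict(s, 17, 6) ? "conflict" : "permissible");
    return 0;
}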
b. We can use the same method as in part a, with K = 17/GCD(S,17).
Because 17 is a prime number, the GCD of 17 and any stride that is not a multiple of 17 is 1, so the
same bank is accessed once every 17 components of a vector. If the stride is a multiple of 17, then
memory bank 0 is accessed for every component of the vector and there are conflicts. Hence, every
stride except multiples of 17 is permissible and avoids conflicts. As discussed in part a, an S greater
than 17 accesses the same banks as (S mod 17).
c. Seventeen banks suffer fewer conflicts per memory bank than 16 banks, because the bank number
is computed modulo 17 and bank numbers are more likely to differ when the total number of banks
is a prime number. However, computing an address modulo 17 is much more complex than
computing an address modulo 16. Even if the computation is pipelined, it is inefficient.
Sol.>
a. The code for processing each strip of 128 components is given below:
LOOP: L.V V1,0(R2),R6 /load X; R6 contains the stride of 1.
L.V V2,0(R3),R6 /load Y; 0(R3) is the base address of Y
MUL.V V3,V2,V1 /Multiply two vector registers
ADD.V V4,V4,V3 /Partial sums accumulate in V4
ADDI R2,R2,#128 /This assumes that memory
ADDI R3,R3,#128 /addresses point to vector elements
ADDI R4,R4,#1
BNE R4,R5,LOOP
Partial vector sums accumulate in V4. At the end we simply have to add the components of V4. To
simplify we have assumed that memory addresses point to vector elements. If vector components
have 8 bytes (double precision) and memory is byte-addressable, then the increments on the address
registers should be 1024.
Given that vector operations are chained, one iteration of the loop takes:
T_iteration = latency(L.V) + latency(MUL.V) + latency(ADD.V) + VL − 1 = 10 + 10 + 5 + 128 − 1 = 152 cycles.
Since the loop has to iterate 8 times (= 1024/128), the total number of cycles taken by the dot
product is 152 × 8 = 1216 cycles (ignoring the scalar phase at the end).
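As a sanity check, the strip timing can be written out directly. This C sketch (illustrative only) plugs in the latencies used above (10 cycles for L.V, 10 for MUL.V, 5 for ADD.V) and a vector length of 128.

#include <stdio.h>

int main(void) {
    int lat_lv = 10, lat_mul = 10, lat_add = 5;   /* latencies assumed above */
    int vl = 128, n = 1024;
    int t_strip = lat_lv + lat_mul + lat_add + vl - 1;  /* 152 cycles per strip */
    int strips = n / vl;                                /* 8 strips */
    printf("per strip: %d, dot product: %d cycles\n",
           t_strip, t_strip * strips);                  /* 152 and 1216 */
    return 0;
}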
b.
For the multiplication of two matrices, each component of the result matrix is obtained by a dot
product of two vectors. Therefore, we need 1024 × 1024 = 1048576 dot products to multiply two
1024 × 1024 matrices. The matrix multiply takes 1216 × 1048576 = 1275068416 cycles.
c.
Load and multiplication on each strip of 128 components of the vector are chained and run in
parallel. Partial sums are written into two different vector registers, V4 and V8. The scalar processor
accumulates the components of the partial-sum vectors at the end. Ignoring the scalar phases, one
iteration of the unrolled loop takes 152 cycles. Hence we can compute 256 elements of the vector
in each iteration, so the number of iterations for a vector of size 1024 is halved to 4 (= 1024/256).
Therefore the total number of cycles to execute the dot product is 152 × 4 = 608 cycles.
For the multiplication of the two 1024 × 1024 matrices, we need 1024 × 1024 dot products.
Hence, the total number of cycles for this computation is 608 × 1024 × 1024 = 637534208 cycles.
(a) Assume a single level page table. All 28 bits of virtual page number are used to index the page
table. How many tables would we have? What would be their total size?
Sol.>
There is only one table. The size of the table is 2^28 × 4 bytes = 1 GB.
(b) Assume first a 2-level page table. We split the 28 bits of virtual page number into two fields:
15 bits for level 1 and 13 bits for level 2 (bottom). How many tables would we have? What would
be their total size?
Sol.>
The first level (root) page table must be allocated. It is accessed with 15 bits and therefore has 2^15
entries of 4 bytes each, for a total of 128 KB. Each entry of the first-level page table covers 2^(13+14) =
2^27 = 128 MB of virtual memory. We don't need a second-level table for the kernel since it is not
mapped. But we need second-level page tables for the code segment and the two data segments.
The code segment lies in the virtual memory area covered by the first entry of the first-level page
table. The two data segments lie in the virtual memory area covered by the last 6 entries of the first-
level page table. Therefore we need a total of 7 second-level page tables, which occupy 32 KB each.
Thus the number of page tables is 8 (1 level 1 + 7 level 2) and the total physical memory occupied
by the page tables is 128 KB + 7 × 32 KB = 352 KB, i.e., a huge reduction compared to the single-
level page table.
(c) For a 3-level page table splitting the 28 bits of virtual page number into 3 fields of 10, 9, 9 bits
each.
Sol.>
The first-level table is accessed with 10 bits and therefore has 2^10 entries of 4 bytes each, for a total
of 4 KB. Each entry of the first-level page table covers 2^(18+14) = 2^32 = 4 GB of virtual memory. Each
second-level table is accessed with 9 bits and therefore has 2^9 entries of 4 bytes each, for a total of
2 KB. Each entry of a second-level page table covers 2^(9+14) = 2^23 = 8 MB of virtual memory. The
code segment lies in the virtual memory area covered by the first entry of the first-level page table
and the first 2 entries of its second-level page table. The two data segments lie in the virtual
memory area covered by the last entry of the first-level page table and the last 96 entries of its
second-level page table. Therefore we need a total of 98 third-level page tables, which occupy 2
KB each. Thus the number of page tables is 1 + 2 + 98 and the total physical memory occupied by the
page tables is 1 × 4 KB + (2 + 98) × 2 KB = 204 KB.
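The sizing arithmetic for all three schemes follows one pattern: a table indexed with b bits has 2^b four-byte entries, and the total is the per-table size times the number of tables actually allocated. Here is a C sketch of that computation (illustrative only; the allocated-table counts 1, {1, 7}, and {1, 2, 98} come from the analysis above):

#include <stdio.h>

int main(void) {
    const long pte = 4;                    /* 4-byte page-table entries */
    int bits[3][3]    = { {28, 0, 0}, {15, 13, 0}, {10, 9, 9} };
    long tables[3][3] = { { 1, 0, 0}, { 1,  7, 0}, { 1, 2, 98} };
    for (int s = 0; s < 3; s++) {
        long total = 0;
        for (int l = 0; l < 3; l++)
            if (bits[s][l] > 0)            /* per-table size = 2^bits * pte */
                total += tables[s][l] * ((1L << bits[s][l]) * pte);
        printf("scheme %d: %ld KB\n", s + 1, total >> 10);
        /* prints 1048576 KB (1 GB), 352 KB, 204 KB */
    }
    return 0;
}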
FA with FIFO REPLACEMENT:
Table 3: FA with FIFO
Iteration        1          2          3          4
Block address    0 4 1 2 1  0 4 1 2 1  0 4 1 2 1  0 4 1 2 1
Line (last in)   0 4 1 2 2  2 2 2 2 2  2 2 2 2 2  2 2 2 2 2
Line             - 0 4 1 1  1 1 1 1 1  1 1 1 1 1  1 1 1 1 1
Line             - - 0 4 4  4 4 4 4 4  4 4 4 4 4  4 4 4 4 4
Line (first in)  - - - 0 0  0 0 0 0 0  0 0 0 0 0  0 0 0 0 0
Misses           * * * *
The first line is the last in.
Total number of misses = 4 misses
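A miss count like this is easy to verify by simulation. Below is a minimal fully associative FIFO simulator in C (an illustration, not part of the solution) that replays the trace 0, 4, 1, 2, 1 for four iterations and reports the 4 misses.

#include <stdio.h>

int main(void) {
    int pattern[5] = {0, 4, 1, 2, 1};
    int lines[4], n = 0, head = 0, misses = 0;

    for (int it = 0; it < 4; it++)          /* 4 iterations of the trace */
        for (int j = 0; j < 5; j++) {
            int blk = pattern[j], hit = 0;
            for (int k = 0; k < n; k++)
                if (lines[k] == blk) hit = 1;
            if (hit) continue;
            misses++;
            if (n < 4) lines[n++] = blk;    /* fill an empty line */
            else { lines[head] = blk; head = (head + 1) % 4; }  /* FIFO evict */
        }
    printf("FIFO misses: %d\n", misses);    /* 4 */
    return 0;
}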
b.
Sol.>
COLD MISSES:
The number of cold misses is independent of the cache organization and replacement policy. The
number of cold misses is equal to the number of blocks in the trace:
Nr_cold_misses = 4 misses
CAPACITY MISSES:
The number of capacity misses is also independent of the cache organization and replacement
policy. It is given by the number of misses in an FA LRU cache minus the number of cold misses:
Nr_capacity_misses = 4 - 4 = 0 misses
CONFLICT MISSES:
Conflict misses must be evaluated for each cache by subtracting both cold and capacity misses from
the total number of misses of the corresponding cache:
i. Direct-mapped : Nr_conflict_misses = 10 - 4 = 6 misses
ii. FA with LRU : Nr_conflict_misses = 4 - 4 = 0
iii. FA with FIFO : Nr_conflict_misses = 4 - 4 = 0
iv. FA with LIFO : Nr_conflict_misses = 4 - 4 = 0
v. 2-way SA with LRU : Nr_conflict_misses = 13 - 4 = 9
Sometimes using a FA cache with LRU to calculate the capacity misses ends up yielding negative
numbers of conflict misses for some cache organizations, which of course is not acceptable. The
problem is that a FA cache with LRU is not the best possible cache and hence cannot serve as the
baseline; instead we should use a FA cache with an optimal replacement algorithm, as we do in
parts c and d.
c.
FA with THE OPTIMAL REPLACEMENT:
Table 6: FA with OPT
Iteration      1          2          3          4
Block address  0 4 1 2 1  0 4 1 2 1  0 4 1 2 1  0 4 1 2 1
Line 0         0 4 1 2 1  0 4 1 2 1  0 4 1 2 1  0 4 1 2 1
Line 1         - 0 4 1 2  1 0 4 1 2  1 0 4 1 2  1 0 4 1 2
Line 2         - - 0 4 4  2 1 0 4 4  2 1 0 4 4  2 1 0 4 4
Line 3         - - - 0 0  4 2 2 0 0  4 2 2 0 0  4 2 2 0 0
Misses         * * * *
Sol.>
a.
LRU
Table 7: LRU
trace              A B C D D E A C E B C A D A
Miss/Hit           M M M M H M M H H M H H M H
LRU priority list  A B C D D E A C E B C A D A
(MRU first)        - A B C C D E A C E B C A D
                   - - A B B C D E A C E B C C
                   - - - A A B C D D A A E B B
LRU miss rate = (number of misses) / (number of cache accesses) = 8 / 14
FIFO
Table 8: FIFO
trace                A B C D D E A C E B C A D A
Miss/Hit             M M M M H M M H H M M H M H
FIFO replacement     A B C D D E A A A B C C D D
priority list        - A B C C D E E E A B B C C
(last in first)      - - A B B C D D D E A A B B
                     - - - A A B C C C D E E A A
FIFO miss rate = 9 / 14
Pseudo-LRU (PLRU)
Table 9: PLRU
trace               A B C D D E A C E B C A D A
Miss/Hit            M M M M H M M H H M H H M H
PLRU priority list  A B C D D E A C E B C A D A
                    - A B B B C D E C A E B C B
                    - - A C C D E A A E B C B D
                    - - - A A B C D D C A E A C
PLRU miss rate = 8 / 14
LFU
Table 10: LFU
trace                  A B C D D E A C E B C A D A
Miss/Hit               M M M M H M M H H M H M H H
LFU priority list      A B C D D D D C E E C C D D
(most frequent first)  - A B C C E A D C C E E C C
                       - - A B B C E A D D D D E A
                       - - - A A B C E A B B A A E
LFU miss rate = 8 / 14
Optimum
Table 11: Optimum
trace                     A B C D D E A C E B C A D A
Miss/Hit                  M M M M H M H H H H H H M H
OPT priority list         A A A D A A C E B C A A A A
(soonest next use first)  - B C A C C E B C A C C C C
                          - - B C B E B C A B B B B B
                          - - - B D B A A E E E E D D
OPT miss rate = 6 / 14
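Belady's optimal policy is the only one of these that needs to look at the future of the trace, so it is worth sketching. The C fragment below (illustrative only, not part of the solution) evicts the resident block whose next reference is farthest away and reproduces the 6/14 miss rate above.

#include <stdio.h>

#define WAYS 4
#define LEN 14

int main(void) {
    const char trace[LEN] = {'A','B','C','D','D','E','A','C','E','B','C','A','D','A'};
    char cache[WAYS];
    int n = 0, misses = 0;

    for (int i = 0; i < LEN; i++) {
        int hit = 0;
        for (int j = 0; j < n; j++)
            if (cache[j] == trace[i]) hit = 1;
        if (hit) continue;
        misses++;
        if (n < WAYS) { cache[n++] = trace[i]; continue; }
        int victim = 0, farthest = -1;
        for (int j = 0; j < WAYS; j++) {     /* evict the block used farthest away */
            int next = LEN;                  /* LEN means never referenced again */
            for (int k = i + 1; k < LEN; k++)
                if (trace[k] == cache[j]) { next = k; break; }
            if (next > farthest) { farthest = next; victim = j; }
        }
        cache[victim] = trace[i];
    }
    printf("OPT misses: %d/%d\n", misses, LEN);   /* 6/14 */
    return 0;
}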
b.
COLD MISSES:
The number of cold misses is independent of the cache organization and replacement policy. The
number of cold misses is equal to the number of blocks in the trace:
Nr_cold_misses = 5 misses
CAPACITY MISSES:
The number of capacity misses is given by the number of misses in a FA LRU cache minus the
number of cold misses:
Nr_capacity_misses = 8 - 5 = 3 misses
CONFLICT MISSES:
1. FA with LRU : Nr_conflict_misses = 8 - 8 = 0
2. FA with FIFO : Nr_conflict_misses = 9 - 8 = 1
3. FA with PLRU : Nr_conflict_misses = 8 - 8 = 0
4. FA with LFU : Nr_conflict_misses = 8 - 8 = 0
5. FA with OPT : Nr_conflict_misses = 6 - 8 = -2 misses
Sol.>
1. The pseudocode for matrix multiplication X = Y*Z is:
// X = Y*Z
initialize X = 0;
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      X[i][j] += Y[i][k] * Z[k][j]; // Statement S1
There are N^3 FP multiplications and N^3 FP additions. Thus a total of 2N^3 = O(N^3) FP operations
are involved in this matrix multiplication problem. With N = 512, the total number of FP ops is
2^28 = 256M. Because each element of the matrix is 8 bytes and the cache block size is 8 bytes as
well, there is no spatial locality on the cache line. Because the cache can easily contain a
matrix row, X and Y suffer no capacity misses. Take each element of X. Each element X[i][j] is
computed in a register and then stored once it is computed. Thus X[i][j] is accessed only once and
can only have a single cold miss. Y is accessed row by row. The reuse distance (the number of
distinct blocks accessed between two consecutive accesses to the same block of Y) is 1 column
of Z plus one row of Y, i.e., 2N = 2 × 512 = 2^10 blocks, of size 8 × 2^10 = 8 KB. This fits
well within the cache size of 128 KB. The only misses on X and Y are cold misses. However Z
(of size 8N^2 = 2^21 bytes = 2 MB) is too big to be held in the cache and must be reloaded for
the computation of every row of X. The total number of misses on elements of X is N^2, on Y it is N^2,
and on Z it is N^3. Thus the total number of misses is N^3 + 2N^2 = 2^27 + 2^19 = 134742016.
2. The three tiles (each of size 64 × 64 × 8 bytes = 2^15 bytes = 32 KB) now all fit in the cache. For
instance the computation of the first tile of X is given by:
X_11 = Y_11 × Z_11 + Y_12 × Z_21 + ... + Y_17 × Z_71 + Y_18 × Z_81
The first term for each tile of X must bring three tiles (those of X, Y, and Z) into the cache. So for
each tile of X we first have misses for 3 submatrices; then for the 7 remaining terms we only have
misses for 2 submatrices (as the tile of X stays in cache). Thus the total number of tile loads per
tile of X is 3 plus 7 × 2, i.e., 17 tile loads. Since there are 64 tiles in X, the total is 17 × 64 tile
loads. Each tile (submatrix) needs N^2/64 = 4096 misses to be brought into the cache. Thus the
total number of misses is 17 × 64 × 4096 = 17N^2 = 4456448.
3.
Tiling (blocking) dramatically reduces the total number of cache misses when the cache size is
limited compared to the data (problem) size. The total number of FP operations remains the
same, i.e., 2N^3. The total number of misses is N^3 + 2N^2 for part 1, while it is 17N^2 for part 2.
With tiling, the loop overhead is larger, as the loop nest is now twice as deep. This overhead
involves branch handling and some extra integer calculations for loop indexes. However, it is
usually much smaller than the penalty caused by long-latency cache misses.
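For concreteness, the tiled loop nest assumed by the analysis in parts 2 and 3 would look like the C sketch below (a plausible form, not code given in the exercise), with tile size B = 64 so that one 64 × 64 tile of each matrix, 32 KB apiece, fits in the 128 KB cache.

#define N 512
#define B 64   /* 64x64 tiles of 8-byte doubles: 32 KB per tile */

/* X is assumed zero-initialized; computes X = Y*Z tile by tile */
void matmul_tiled(double X[N][N], double Y[N][N], double Z[N][N]) {
    for (int i0 = 0; i0 < N; i0 += B)
        for (int j0 = 0; j0 < N; j0 += B)
            for (int k0 = 0; k0 < N; k0 += B)          /* one tile product */
                for (int i = i0; i < i0 + B; i++)
                    for (int j = j0; j < j0 + B; j++)
                        for (int k = k0; k < k0 + B; k++)
                            X[i][j] += Y[i][k] * Z[k][j];
}

The nest is twice as deep as the untiled version, which is the loop overhead that part 3 refers to.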
Sol.>
a. Each block has 8 words of 32 bits (4 bytes).
The time taken by an iteration with a hit or a miss is 7 or 37 cycles, respectively:
(i) cache hit:
LOOP : LW R4,0(R3) (1)
ADD R1,R1,R4 (1+1)
ADDI R3,R3,stridex4 (1)
BNE R3,R5,LOOP (1+2flushes)
If the stride is greater than or equal to 8, every load incurs a cache miss and the average execution
time per iteration is 37 cycles. For each stride less than 8, there is a miss pattern that repeats
periodically every LCM(8, stride)/stride memory accesses. The number of misses in each period
is LCM(8, stride)/8. For instance, if the stride were one, the pattern period (number of memory
accesses) would be 8 (= LCM(8,1)/1), and only 1 cache miss would occur (= LCM(8,1)/8). The
number of accesses is also the number of loop iterations, i.e., 8. So the average execution time
per iteration is (37 × 1 + 7 × 7)/8 = 10.75 cycles/iteration.
The average execution time for each loop as a function of the stride is as follows:
i) stride ≥ 8:
The cache misses on every iteration. The average execution time per iteration is 37 cycles.
ii) stride < 8:
The average iteration execution time (in cycles)
= (#misses × 37 + #hits × 7) / (#accesses)
= [(LCM(8,stride)/8) × 37 + (LCM(8,stride)/stride − LCM(8,stride)/8) × 7] / (LCM(8,stride)/stride)
= (30/8) × stride + 7
b.
(i) cache hit:
LOOP : LW R4,0(R3) (1)
ADD R1,R1,R4 (1+1)
PW stridex4(R3) (1)
ADDI R3,R3,stridex4 (1)
BNE R3,R5,LOOP (1+2flushes)
The number of cycles taken by a loop iteration with a cache hit increases to 8 because the PW
instruction consumes one additional cycle. Because the prefetch instruction is only one iteration
ahead of the load instruction and the stride of the prefetch instruction is the same as that of the
load instruction, the benefit of the prefetch comes when it accesses the next cache block. It
reduces the penalty of a cache miss by the total number of cycles between the prefetch instruction
and the load instruction of the next iteration (i.e., 5 cycles). Therefore, the miss cost is 25 cycles
(= 30 − 5). Because there is one retry cycle, the time taken by an iteration with a miss is 34 cycles.
We just plug these new numbers into the equation obtained in part a.
The average execution time for each loop as a function of the stride is as follows:
i) stride ≥ 8:
The cache misses on every iteration. The average execution time per iteration is 34 cycles.
ii) stride < 8:
The average iteration execution time (in cycles)
= (#misses × 34 + #hits × 8) / (#accesses)
= [(LCM(8,stride)/8) × 34 + (LCM(8,stride)/stride − LCM(8,stride)/8) × 8] / (LCM(8,stride)/stride)
= (26/8) × stride + 8
c.
i) stride ≥ 8:
For strides greater than or equal to 8, prefetching is always more effective, as the average
execution time per iteration is 34 cycles, which is lower than the average execution time per
iteration without prefetching (37 cycles).
ii) stride < 8:
(26/8) × stride + 8 < (30/8) × stride + 7 ⇒ 1 < (1/2) × stride ⇒ stride > 2
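The crossover at stride 2 can be confirmed numerically. This C sketch (illustrative only) evaluates both per-iteration cost formulas for strides 1 through 7; for strides of 8 or more the comparison is simply 37 against 34 cycles.

#include <stdio.h>

int main(void) {
    for (int s = 1; s < 8; s++) {
        double no_pf = 30.0 / 8 * s + 7;   /* part a, without prefetching */
        double pf    = 26.0 / 8 * s + 8;   /* part b, with prefetching */
        printf("stride %d: %5.2f vs %5.2f cycles -> %s\n", s, no_pf, pf,
               pf < no_pf ? "prefetch wins" : "no benefit");
    }
    /* for stride >= 8 every load misses: 37 vs 34 cycles, prefetch wins */
    return 0;
}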