HW4S24 - Sol

University of Southern California

Department of Electrical Engineering


EE557 Spring 2024
Instructor: Murali Annavaram
Homework #4, Due: 11:59PM, Tuesday, April 10th
TOTAL SCORE: 127

Problem 1 [Cache Protocols] (20 pts.)


Assume a shared-memory multiprocessor with a number of processor/private-cache units
connected by a shared single-transaction bus. The cache access trace to execute in this problem is
given below (the notation Ri/X and Wi/X means a read and a write operation, respectively, by
processor/cache unit i to block X). The first read request to X, by processor/cache 1, will have
to be serviced by the next level.
R1/X, W1/X, R2/X, W3/X, W2/X, W3/X, W3/X, R3/X, W3/X, R1/X, R2/X
We are going to explore three write-back cache coherence protocols: MSI-invalidate,
MSI-update, and MESI-invalidate. Using the timing model for protocol actions given in Table
1, compare the performance (total number of cycles taken to execute the given trace) of the three
cache coherence protocols and explain what makes one protocol more beneficial than the others.
Assume that all caches are initially empty and that a flush is done before a new access is made. One
snoop action is needed for multiple bus accesses by the same transaction.
Table 1

Request type                                      Time to carry out protocol action
Read/write hit                                    1 cycle
Bus Read                                          30 cycles
Read request serviced by next level               80 cycles
Bus upgrade request                               10 cycles (including read/write hit)
Bus update                                        25 cycles (including read/write hit)
Read block and invalidate other copies (BusRdX)   35 cycles
Flush                                             0 cycles
Snoop action                                      5 cycles

Solution:

MSI-invalidate (Figure 5.10):


Every bus transaction involves a snoop action. The first request has to be serviced by the
next level. Therefore, it first issues a snoop action (5 cycles) and gets the block from the
next level (80 cycles); in total, the block is installed after 85 cycles. The second request
is a bus upgrade (10 cycles) plus a snoop action (5 cycles), for a total of 15 cycles. The
third request reads X, which is serviced by P1's cache since it holds the only copy, and
takes 35 cycles (read from another processor + snoop). The fourth request writes X from P3
and takes 120 cycles (read request serviced by next level + BusRdX + snoop). When processor
2 tries to write X at the fifth request, W2/X takes 40 cycles (BusRdX + snoop). The
following write request by processor 3 also takes 40 cycles (BusRdX + snoop). The next
three requests, W3/X, R3/X, and W3/X, take 1 cycle each (cache hit). At the 10th request,
the only valid copy of block X is in processor 3's local cache. Therefore, the request is
serviced by the private cache of processor 3 and takes 35 cycles (BusRd, read from another
processor, + snoop). While the block is supplied to processor 1, memory is also updated.
Thus, when the 11th request is issued, there are three valid copies of X: in processor 1,
processor 3, and memory. As MSI does not support ownership, there is no way to select one
processor to forward the copy from its private cache to processor 2. Therefore, the 11th
request is serviced by the next level and takes 85 cycles.
In summary, the total number of cycles is 85 + 15 + 35 + 120 + 40 + 40 + 1 + 1 + 1 + 35 +
85 = 458 cycles.
MSI-update (Figure 7.10 (b)):
One snoop action covers multiple bus accesses by the same transaction. A PrWr(S) from the
I state takes two transitions, I->S and S->S. The protocol has the following events for the
memory accesses:
R1/X: BusRd (read request serviced by next level + snoop action = 85); W1/X: write hit (1);
R2/X: BusRd (read request serviced by P1, which has the only copy, + snoop action = 35);
W3/X: BusRd (read request serviced by next level + snoop action + bus update = 110); W2/X:
bus update + snoop action = 30; W3/X: bus update + snoop action = 30; W3/X: bus update +
snoop action = 30; R3/X: read hit = 1; W3/X: bus update + snoop action = 30; R1/X: read
hit = 1; R2/X: read hit = 1. In summary, the entire sequence takes 354 cycles.
MESI-invalidate:
The only difference between MESI and MSI appears at the 2nd request (W1/X). When the 2nd
request is executed, processor 1 has the only copy of X, so the block is in the E
(Exclusive) state. In the E state, a write by the block's own processor can be done without
a bus upgrade. Therefore, the 2nd request takes 1 cycle. Note that this does not happen in
the subsequent accesses, as there are always multiple copies of X in the system. In
summary, the entire sequence takes 444 cycles.

Problem 2 [True/False Sharing] (20 pts.)


Now, for the access sequence given below, assume that each request actually accesses one of
three different variables (A, B, C) within block X (the notation Ri/A and Wi/A means a read
and a write operation, respectively, by processor/cache unit i to variable A in block X):

R1/A, W1/A, W3/B, R1/A, W2/B, W3/B, W3/A, R3/C, W3/C, R1/C, R2/B

Assume that the cache coherence protocol is MSI invalidate.

(a) Refer to Chapter 5.4.6 to learn about true sharing and false sharing miss and Chapter 4.3.5 for
cold miss. Then classify the misses with respect to cold, true sharing, and false sharing misses and
fill the 5th column of Table 2.

(b) Which of the misses could be ignored and still guarantee that the execution is correct?

(c) The data traffic (bytes to be transferred) made by each protocol action is shown in Table 3 and
the block size is 64 B. Calculate the traffic generated by the given access sequence. Then, determine
the fraction of essential traffic. The essential traffic is the data traffic that is made by cache misses
that cannot be ignored.

Solution: (a)
Table 2

Req #   Processor 1   Processor 2   Processor 3   Miss type
1       R1/A                                      Cold
2       W1/A
3                                   W3/B          Cold
4       R1/A                                      False-sharing
5                     W2/B                        Cold
6                                   W3/B          True-sharing
7                                   W3/A
8                                   R3/C
9                                   W3/C
10      R1/C                                      True-sharing
11                    R2/B                        True-sharing

(b) Req 4 can be ignored since it’s false-sharing.


(c) Under the MSI protocol, the traffic made by the sequence, from the 1st to the 11th
access, is (6 + 64) + 10 + (6 + 64) + (6 + 64) + (6 + 64) + (6 + 64) + 0 +
0 + 0 + (6 + 64) + (6 + 64) = 500 bytes.
Since Req 4 can be ignored, the fraction of essential traffic is
430/500 = 0.86 (86%).
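The same arithmetic as a short tally. Table 3 is not reproduced in this excerpt, so the 6-byte command/address overhead per data transfer and the 10-byte upgrade cost are assumptions taken from the per-request terms above.

```python
BLOCK = 64    # block size in bytes (given)
CMD = 6       # assumed command/address bytes per data transfer (from Table 3)
UPGRADE = 10  # assumed bytes for a bus upgrade (from Table 3)

# Traffic per request for the 11 accesses; 0 means a cache hit.
traffic = [CMD + BLOCK, UPGRADE, CMD + BLOCK, CMD + BLOCK, CMD + BLOCK,
           CMD + BLOCK, 0, 0, 0, CMD + BLOCK, CMD + BLOCK]
total = sum(traffic)             # 500 bytes
essential = total - traffic[3]   # drop request 4, the false-sharing miss
print(total, essential, essential / total)   # 500 430 0.86
```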

Problem 3 [Synchronization] (20 pts.)


Exercise 7.2 except (e) in the textbook.
Solution:
(a) That’s because BAR is a monotonically increasing value. If the statement reads BAR
while it is being updated, it might miss the current increment, but will
eventually observe BAR reaching N.
(b)
lock:    LB   R1,bar_lock    // test: ordinary load of the lock byte
         BNEZ R1,lock        // spin while the lock appears taken
         T&S  R1,bar_lock    // test-and-set: attempt to acquire the lock
         BNEZ R1,lock        // retry if another processor got it first
         LW   R2,BAR
         ADDI R2,R2,#1       // increment the barrier counter
         SW   R2,BAR
         SB   R0,bar_lock    // release the lock
barrier: LW   R2,BAR
         BNE  R2,R3,barrier  // spin until BAR reaches N; assume (R3)=N
(c)
F&A R1,BAR
wait: LW R2,BAR
BNE R2,R3,wait //assume R3 contains the value of N
(d) If R3 contains the value of N, then CAS(R3,R0,BAR) implements the condition
atomically. There is actually no need to implement the condition statement atomically,
because the processor that reads "BAR=N" is the one that increments BAR for the last time
(which can only be done one at a time). By that time all other processors will be in their
while loops.
The code can deadlock. Remember that we cannot make any assumption about the relative
speed of processors. For example, the process that arrives at the barrier last and resets
BAR to 0 could be suspended right before starting its while loop. During that time, other
processes may execute the next iteration all the way to the next barrier execution and
start incrementing BAR. When the suspended process is awakened, it will see that BAR is
not zero and will therefore busy-wait on the first barrier. Since it will never reach the
second barrier to increment BAR, all processes will wait at the second barrier forever.
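The single-use fetch-and-add barrier from part (c) can be sketched in Python, modeling the atomic F&A primitive with a lock. The names (`BAR`, `fetch_and_add`, `worker`) and the use of threads are illustrative assumptions, not from the textbook; note this sketch is single-use, which is exactly why it avoids the reuse deadlock described in part (d).

```python
import threading

N = 4        # number of processes that must reach the barrier
BAR = 0      # shared barrier counter
fa_lock = threading.Lock()

def fetch_and_add():
    """Models the atomic F&A primitive: increments BAR, returns the old value."""
    global BAR
    with fa_lock:
        old = BAR
        BAR += 1
    return old

def worker(seen, i):
    fetch_and_add()       # announce arrival at the barrier
    while BAR != N:       # busy-wait until all N processes have arrived
        pass
    seen[i] = BAR         # past the barrier: everyone observes BAR == N

seen = [0] * N
threads = [threading.Thread(target=worker, args=(seen, i)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(seen)   # [4, 4, 4, 4]
```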

Problem 4 [Cache Coherence] (15 pts.)


This problem considers the simple MSI, bus-based snooping protocol for cache
coherence discussed in class. There are several processor/cache nodes connected to the
bus, along with main memory. Each processor has a private L1 cache that is direct
mapped and contains 4 blocks of two words each. The initial state of main memory
and each cache block is shown in the figure below. To simplify the figure, the cache
address tag contains the full address and each word shows only two hex characters
(the rest of the data word is all zeroes, not shown). The lower addressed word of the
block is on the right; i.e., for block B0 of cache P0, the data for address x100 is on the
right, and the data for x104 is on the left. The coherence states are denoted M, S, and I
for Modified, Shared, and Invalid, respectively. Addresses and data are represented in
hexadecimal.

List the cache block states that change for each of the actions below. Treat each as an
independent action applied to the initial state shown; do not accumulate changes
from one part to the next. Simply list the blocks that change state in any cache and
in memory for each action. List your answer in the form P0.B0: (I, 100, 00, 10) to mean
processor cache P0's block B0 is Invalid, the tag holds 100, and the data words are 00 and 10;
for memory, use M:100 (00,00). The action "write addr←data" means write data word data
to address addr. When a block changes to Invalid, assume only the state bit changes, so
your answer should include the tag and data fields' actual (current) values.

(1) P15: read 118
P15.B3: (S, 118, 00, 18) (1 point)

(2) P15: write 100 ← 48


P15.B0: (M, 100, 00, 48) (1 point)

(3) P15: write 118 ← 80


P15.B3: (M, 118, 00, 80) (1 point)
P1.B3: (I, 118, 00, 18) (1 point)

(4) P15: write 108 ← 80


P15.B1: (M, 108, 00, 80) (1 point)
P0.B1: (I, 108, 00, 08) (1 point)

(5) P15: read 110


P0.B2: (S, 110, 00, 30) (1 point)
P15.B2: (S, 110, 00, 30) (1 point)
M: 110 (00, 30) (1 point)

(6) P15: read 128


P1.B1: (S, 128, 00, 68) (1 point)
P15.B1: (S, 128, 00, 68) (1 point)

M: 128 (00, 68) (1 point)

(7) P15: write 110 ← 40


P0.B2: (I, 110, 00, 30) (1 point)
P15.B2: (M, 110, 00, 40) (1 point)
M: 110 (00, 30) (1 point)

Problem 5 [Locks] (12 pts.)


compare-and-swap(R1,R2,L) is an atomic synchronization primitive which atomically
compares the value in memory location L with R1, and if and only if they are equal,
exchanges the values in R2 and L. compare-and-swap(R1,R2,L) can be used to
efficiently emulate many other primitives.

Part A [2 points]
Implement an atomic test-and-set on memory address L in assembly using
compare-and-swap(R1,R2,L) as the only atomic primitive. Let L = 1 when the lock is taken and
L = 0 when it is free, and these will be the only values present at L. You may use any
registers you like.
(HINT: If the location is in a certain state, you do not need to do anything.)

R1 <- 0
R2 <- 1
compare-and-swap(R1,R2,L)

Part B [5 points]
Implement the test-and-test-and-set semantics on memory address L in assembly
using compare-and-swap as the only atomic primitive. Let L=1 when the lock is
taken, and L=0 when it is free. You can use any registers you like as well as ordinary
loads and stores. Include any instructions needed to ensure that the operation
eventually completes successfully, as if you are actually trying to acquire a lock.

      R2 <- 1                     // value to install; R2 stays 1 if the CAS fails
LOCK: R1 <- Load(L)               // test: ordinary load
      BNEZ R1, LOCK               // spin while the lock appears taken
      compare-and-swap(R1,R2,L)   // R1 = 0 here; on success, L<->R2 makes R2 = 0
      BNEZ R2, LOCK               // R2 != 0 means the CAS failed; retry

Part C [5 points]
Use compare-and-swap to implement an atomic fetch-and-increment(R1,L) in
assembly, which atomically copies the old value in L to R1 and then increments the
value in L by 1. Again, you can use any registers you like as well as ordinary loads and
stores. Include any instructions needed to ensure that the operation eventually
completes successfully; i.e., the increment must be guaranteed to occur atomically.
FINC: R1 <- Load(L)               // ordinary load of the old value
      R2 <- R1 + 1
      compare-and-swap(R1,R2,L)   // on success, R2 gets the old L value (= R1)
      BNE R1, R2, FINC            // R2 != R1 means the CAS failed; retry

Note: Syntactical changes are allowed, e.g., instead of R2 <- R1+1 you can use an ADDI
instruction, etc.

Problem 6 [Sequential Consistency] (20 pts.)


For this problem, assume sequential consistency. Consider the following code
fragments executed on two processors:
Initially P=Q=R=S=0
P1 P2
P=5 L=Q
Q=12 M=P
R=8 N=S
S=6 O=R

Part A [4 points]
The processors execute instructions independent of each other. Thus, the order of
execution of instructions cannot be determined a priori if no constraint is placed on the
execution of the instructions. What synchronization is required to ensure that all of P1’s
instructions are to be executed before any of P2’s instructions, as given above? Ensure
your solution makes clear the constituent memory operations used for the synchronization;
i.e., don’t insert just a call to a function or library without the code for that function/library.
Unnecessarily inefficient solutions will not get full credit.

Solution:

Initially P = Q = R = S = 0 and V = 0.

P1          P2
P = 5       while (V == 0);
Q = 12      L = Q
R = 8       M = P
S = 6       N = S
V = 1       O = R

We also accept solutions that use S as a test variable instead of V, e.g. while (S!=6).
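The flag scheme above can be sketched with two Python threads. CPython's global interpreter lock happens to give sequentially consistent behavior for this pattern, so the sketch illustrates the ordering argument; on real hardware the flag write and the spin read would need memory fences. The names `p1`, `p2`, and `out` are illustrative.

```python
import threading

P = Q = R = S = V = 0
out = {}

def p1():
    global P, Q, R, S, V
    P, Q, R, S = 5, 12, 8, 6
    V = 1                  # signal: all of P1's writes are complete

def p2():
    while V == 0:          # spin until P1 signals
        pass
    out["L"], out["M"], out["N"], out["O"] = Q, P, S, R

t2 = threading.Thread(target=p2)
t1 = threading.Thread(target=p1)
t2.start(); t1.start()
t1.join(); t2.join()
print(out)   # {'L': 12, 'M': 5, 'N': 6, 'O': 8}
```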

Part B [4 points]
Suppose we don’t care about the relative order of execution of the sets of instructions, but
we want atomicity in execution; i.e., we want all instructions of one processor to complete
before any instructions are executed in the other processor. What synchronization is
required to ensure this?

Solution:

This can be achieved by placing the code fragments between lock and unlock operations.
These operations can be implemented using Test & Set or any other technique but must
all be to the same location.

P1 P2
Lock(x) Lock(x)
P=5 L=Q
Q = 12 M=P
R=8 N=S
S=6 O=R
Unlock(x) Unlock(x)

Part C [6 points]
On a sequentially consistent system, what values can L, M, N, and O have at the end of
execution for the code in your solution of problem 6, part A? You must explain your
answer for full credit.

Solution:
L = 12, M = 5, N = 6, O = 8. On a sequentially consistent processor, we can assume that
P2’s reads of P, Q, R, and S won’t occur until its while loop terminates; i.e., until P2
sees the value 1 for V. Once P2 sees the value 1 for V, we know that P1’s write to V has
occurred and so we can assume that P1’s writes to P, Q, R, and S are complete. We are
therefore guaranteed that P2’s reads of P, Q, R, and S will see the values written to those
variables by P1.

We also accept solutions that for part A use S as a test variable instead of V.

Part D [6 points]
On a system that is not known to be sequentially consistent, what possible values can
variables L, M, N, and O have at the end of execution, for the code in your solution of
problem 6 part A? You must explain your answer for full credit.

Solution:
L = 0 or 12, M = 0 or 5, N = 0 or 6, O = 0 or 8. Since the system is not known to be
sequentially consistent, it is possible that some or all of P2’s reads of P, Q, R, S actually
occur before its while loop terminates, or some or all of P1’s writes to P, Q, R, S actually
occur after P1’s write to V. So P2 is not guaranteed to see P1’s writes to P, Q, R, S, and
may see only a subset of those writes.

We also accept solutions that for part A use S as a test variable instead of V.

Problem 7 [Cache organization] (20 pts.)
Exercise 7.8 in the textbook with the following modifications:
In the semi-static strategy, if a processor is idle, then it "steals" the task with the highest
task index from another processor that still has tasks pending in its task queue. The original
schedule has higher priority when alternative schedules exist, and a lower-numbered
processor always has higher priority. Find speedups with respect to a single-processor
machine with the same processor.

Solution:
1. T(i) = 1+i
Strategy Execution Time Speedup
Static 28 3.25
Semi-static 28 3.25
Dynamic 28 3.25
Optimum 23 3.96

2. T(i) = 13-i
Strategy Execution Time Speedup
Static 28 3.25
Semi-static 26 3.5
Dynamic 24 3.79
Optimum 23 3.96

To compute the speedup, the execution time on a single processor, which is 91, is divided by
the maximum execution time taken to complete the given tasks across the processors. Examples
of task schedules for each strategy are given in the tables below. Note that the schedules for
the dynamic and optimum strategies can vary. (1) and (2) indicate the cases where the execution
time of each task is T(i) = 1+i and T(i) = 13-i, respectively. The maximum execution time
among the threads is in bold.
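The dynamic entries can be reproduced with a greedy list-scheduling sketch. The task set (13 tasks, i = 0..12, which matches the 91-cycle sequential total) and the 4-processor count are assumptions inferred from the numbers above, since Exercise 7.8 itself is not reproduced here.

```python
def dynamic_time(times, nproc=4):
    """Greedy dynamic scheduling: each task, in index order, goes to the
    processor that frees up earliest (lowest-numbered processor on ties)."""
    finish = [0] * nproc
    for t in times:
        p = finish.index(min(finish))   # index() returns the lowest tied index
        finish[p] += t
    return max(finish)

inc = [1 + i for i in range(13)]        # (1) T(i) = 1 + i, sums to 91
dec = [13 - i for i in range(13)]       # (2) T(i) = 13 - i, sums to 91
print(dynamic_time(inc), dynamic_time(dec))   # 28 24
print(91 / dynamic_time(inc))                 # 3.25 speedup for case (1)
```

Running this reproduces the dynamic rows of both tables: 28 cycles for T(i) = 1+i and 24 cycles for T(i) = 13-i, i.e., the decreasing task sizes let the greedy scheduler pack the tail better.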

