Roll No.
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY, KARNATAKA
II Sem M.Tech (VL and SPML) - Mid semester Examinations 2022
VL 792-HIGH PERFORMANCE COMPUTER ARCHITECTURES
Date: 28-3-2022    Time: 9:00 AM - 10:30 AM
Duration: 1½ hrs    Max. Marks: 25
Note: i) Assume any missing data suitably (state your assumption clearly)
ii) Answer all the questions
1 The following table gives instruction frequencies as well as how many cycles the 03
instructions take, for the different classes of instructions for a processor running at a
frequency of 200 MHz
Instruction Type | Frequency | Cycles
Loads & Stores | 30% | 6 cycles
Arithmetic Instructions | 50% | 4 cycles
All Others | 20% | 3 cycles
a) Calculate the CPI for the processor
b) The compiler expert says that if you double the number of registers, then the compiler will
generate code that requires only half the number of Loads & Stores. What would the new
CPI be?
c) Assuming that increased number of registers does not affect the clock rate, what is the
speed up due to increased number of registers?
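A minimal sketch for checking the arithmetic in question 1. It assumes that halving the Loads & Stores leaves the other instruction counts and the 200 MHz clock unchanged, so the new instruction mix has to be renormalised before the new CPI is read off. The names `MIX`, `NEW_MIX`, `cpi` and `cycles_per_original_instruction` are illustrative only.

```python
# Instruction mix from the table: (fraction of instructions, cycles each).
MIX = {
    "load_store": (0.30, 6),
    "arithmetic": (0.50, 4),
    "other": (0.20, 3),
}


def cpi(mix):
    """Weighted-average CPI; fractions are renormalised so they sum to 1."""
    total = sum(frac for frac, _ in mix.values())
    return sum(frac * cycles for frac, cycles in mix.values()) / total


def cycles_per_original_instruction(mix):
    """Total cycles, measured per instruction of the ORIGINAL program."""
    return sum(frac * cycles for frac, cycles in mix.values())


# Assumption: doubling the registers halves the Loads & Stores count and
# leaves every other class (and the clock rate) unchanged.
NEW_MIX = dict(MIX, load_store=(0.15, 6))

old_cpi = cpi(MIX)
new_cpi = cpi(NEW_MIX)
# With the same clock, speedup is the ratio of total cycle counts.
speedup = cycles_per_original_instruction(MIX) / cycles_per_original_instruction(NEW_MIX)
print(round(old_cpi, 3), round(new_cpi, 3), round(speedup, 3))
```

The speedup is taken from total cycles rather than from the CPI ratio because the instruction count also shrinks when loads and stores are removed.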
2 a) ...
b) What percentage of the original execution time was occupied by floating point operations?
(Hint: Amdahl's law depends on the fraction of the original, unenhanced execution time that
could make use of enhanced mode. Thus, this 50% measurement cannot be used to compute
speedup with Amdahl's law.)
3 Consider a history based branch predictor with 3 bits for branch history and 2 bits for 04
prediction. A branch outcome has the pattern T T T T T N N T T T T T N N T T T T T N N
which repeats. Calculate the steady state prediction accuracy of the branch predictor for the
given branch (show the branch history table contents at steady state)
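One way to check a hand-worked answer to question 3 is to simulate the predictor. The sketch below assumes the repeating outcome is T T T T T N N, that each 3-bit history indexes its own 2-bit saturating counter, that counters start at 0 and predict taken at 2 or above, and that the history register starts as all not-taken. None of these starting conditions is given in the question; the function name `steady_state_accuracy` is illustrative.

```python
def steady_state_accuracy(pattern, history_bits=3, warmup_periods=50):
    """Simulate a table of 2-bit saturating counters indexed by branch history.

    Returns (accuracy over one period after warm-up, final table contents).
    """
    table = {}                      # history string -> counter value 0..3
    history = "N" * history_bits    # assumed initial history
    correct = 0
    # Run the warm-up periods first, then measure exactly one more period.
    for period in range(warmup_periods + 1):
        for outcome in pattern:
            counter = table.get(history, 0)
            prediction = "T" if counter >= 2 else "N"
            if period == warmup_periods and prediction == outcome:
                correct += 1
            # Update the counter, saturating at 0 and 3.
            if outcome == "T":
                table[history] = min(counter + 1, 3)
            else:
                table[history] = max(counter - 1, 0)
            # Shift the newest outcome into the history register.
            history = history[1:] + outcome
    return correct / len(pattern), table


accuracy, table = steady_state_accuracy("TTTTTNN")
print(f"accuracy = {accuracy:.3f}")
for hist, counter in sorted(table.items()):
    print(hist, counter)
```

Under these assumptions only the all-taken history sees mixed outcomes (taken, taken, not taken), so it accounts for the single misprediction per period. The `NNN` entry in the printed table is a leftover of the assumed initial history and is not part of the steady state.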
4 Consider a program with 15% branch instructions. This program is executed on a processor 03
where branch is resolved in the 11th stage. There are two branch predictor alternatives both
of which alter the clock cycle time(CCT). Predictor A has 10% misprediction rate with 20%
increase in CCT. Predictor B is 80% accurate increasing the CCT by 5%. Which branch
predictor is better? Justify your answer quantitatively (Assume ideal CPI to be 1)
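The comparison in question 4 reduces to CPI times clock cycle time for each predictor. The sketch below is one possible reading: it assumes a misprediction wastes 10 cycles because the branch resolves in stage 11, and it takes predictor B's accuracy as the 80% printed above. Both the penalty value and the helper name `time_per_instruction` are assumptions.

```python
def time_per_instruction(branch_frac, mispredict_rate, penalty_cycles,
                         cct_scale, ideal_cpi=1.0):
    """Relative execution time per instruction = CPI x clock cycle time."""
    cpi = ideal_cpi + branch_frac * mispredict_rate * penalty_cycles
    return cpi * cct_scale


# Assumption: a branch resolved in stage 11 costs 10 wasted cycles when it
# is mispredicted (the 10 earlier stages are flushed).
PENALTY = 10

t_a = time_per_instruction(0.15, 0.10, PENALTY, cct_scale=1.20)
t_b = time_per_instruction(0.15, 0.20, PENALTY, cct_scale=1.05)
better = "A" if t_a < t_b else "B"
print(f"A: {t_a:.3f}  B: {t_b:.3f}  better: {better}")
```

A different penalty assumption (for example 11 cycles) changes both numbers, so the assumed value should be stated alongside the answer.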
5 Consider the following loop 12
Loop: LD   R2, 0(R10)
      SUB  R2, R2, R3
      SD   R2, 0(R10)
      ADDI R10, R10, #4
      BNE  R10, R6, Loop
This code snippet is executed in a MIPS pipelined processor having 5-stages (IF ID EX MEM
WB). Assume that branch is resolved in ID stage. Identify all data dependencies in the given
code snippet. Assume the loop takes exactly one iteration to complete. Classify the data
dependence as true, output or anti dependence. Also list hazard if any due to these
dependencies. Hint: fill the table (You can number the instructions)
... and schedule the loop optimally.
e) Assume that the processor has dynamic scheduling using Tomasulo's algorithm. Write the
contents of following tables at the end of 5th cycle of operation. Assume there are two
reservation stations each for LD/SD, add/sub and mul/div instructions. (Issue - 1, LD/SD - 2
cycles; other EX - 1, write - 1)
Instruction Status
Instruction | Issue | Execute | Write
Field | R0 | R1
Reservation Stations
Name | Busy | Op | Vj | Vk | Qj | Qk | A
Load1 |
Load2 |
Add1 |
Add2 |
Mult1 |
Mult2 |
6 The following code does computation over two vectors
      ADDI R4, R1, #800
Loop: LD   F2, 0(R1)
      MULD ...
      LD   F6, 0(R2)
      DIVD F6, F6, F8
      SUBD F6, F2, F6
      SD   F6, 0(R1)
      ADDI R1, R1, #8
      ADDI R2, R2, #8
      SUB  R3, R4, R1
      BNEZ R3, Loop
Execution Stage
Floating Point Multiplication
Floating Point Division
Floating Point Subtraction
Integer Calculation
Memory Access
Assume a single-issue pipeline with data forwarding, not using Tomasulo's algorithm. Unroll
the loop as many times as necessary and schedule it without any stalls, collapsing the loop
overhead instructions. How many cycles does the single loop take to execute? (Show the
execution of the scheduled unrolled code filling table)
Clock cycle (issue) | Instruction
Assume a dual-issue pipeline using Tomasulo's algorithm with speculation. Show the
execution of three iterations (Fill the table showing iteration number, Instruction, Issue cycle
number, Execute/Memory, Write CDB, Commit. How many cycles does the single loop take
to execute?) Assume branches are predicted to be taken. Assume 5 reservation stations for
integer operations, 3 reservation stations for load, 3 reservation stations for store, 2
reservation stations for floating point addition/subtraction, 2 reservation stations for floating
point multiplication/division. Also, assume two function units of each type.
Consider a processor with 32-bit address line. Physical address is 24 bit wide. Page size is
256 B. There is a fully-associative TLB with 2 entries using true LRU replacement policy.
Each process has a 1-level linear page table for address translations. The processor has 1KB
unified cache with 4 way associative with each cache line of size 16 Bytes. A virtually indexed
physically tagged cache is used. A virtual address 0xEA0207B3 maps to physical frame
number 0x3071
12
103
 
 
 
 
 
Roll No.
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY, KARNATAKA
II Sem M.Tech (VL) - End semester Examinations 2019
VL 792-HIGH PERFORMANCE COMPUTER ARCHITECTURES
Date: 27-4-2019    Time: 3:00 PM - 6:00 PM
Duration: 3 hrs    Max. Marks: 50
Note: i) Assume any missing data suitably (state your assumption clearly)
ii) Answer all the questions
Compare VLIW and superscalar processors
Write a brief note on NetBurst architecture (clearly stating pipeline structure, memory
management, branch handling and execution order).
Suppose that the code to run a transaction is 80% parallelizable (such that performance scales
linearly with the number of cores), but 20% is serial (can only run on one core) and only one
transaction can be run at a time. Suppose an application runs at 10000 transactions per second
on the single-core chip, how many transactions per second does it achieve on quad-core?
(given quad core has a speed up of 70% per core as compared to single core)
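The transaction question can be sketched as an Amdahl-style estimate. The wording "speed up of 70% per core" is ambiguous; the sketch below reads it as each quad-core core running at 0.7 times the speed of the single-core chip, which is an assumption and not something the question confirms. The function name `quad_core_throughput` is illustrative.

```python
def quad_core_throughput(single_core_tps, parallel_frac, cores, per_core_speed):
    """Amdahl-style estimate of transactions per second.

    per_core_speed is the speed of one core relative to the single-core chip.
    """
    serial_frac = 1.0 - parallel_frac
    # Time for one transaction, in units of the single-core transaction time.
    relative_time = (serial_frac + parallel_frac / cores) / per_core_speed
    return single_core_tps / relative_time


# Assumption: "70% per core" means each core runs at 0.7x single-core speed.
tps = quad_core_throughput(10_000, parallel_frac=0.80, cores=4,
                           per_core_speed=0.70)
print(round(tps))
```

If the phrase is instead read as each core being 70% faster, `per_core_speed` becomes 1.7 and the result scales accordingly.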
What is the contribution to CPI of conditional branch stalls in a program with 20% branches
given that BTB with 10% miss rate, 4 cycle miss penalty, 92% prediction accuracy and 8 cycle
misprediction penalty is used for branch prediction?
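A short sketch of one common model for the branch-stall contribution: a BTB miss always pays the miss penalty, and a BTB hit pays the misprediction penalty only when the prediction is wrong. The question does not say which model is intended, so this model and the name `branch_stall_cpi` are assumptions.

```python
def branch_stall_cpi(branch_frac, btb_miss_rate, btb_miss_penalty,
                     prediction_accuracy, mispredict_penalty):
    """Stall cycles per instruction caused by conditional branches.

    Assumed model: every BTB miss costs btb_miss_penalty cycles; a BTB hit
    costs mispredict_penalty cycles only when the prediction is wrong.
    """
    miss_cost = btb_miss_rate * btb_miss_penalty
    hit_cost = (1 - btb_miss_rate) * (1 - prediction_accuracy) * mispredict_penalty
    return branch_frac * (miss_cost + hit_cost)


# Values from the question: 20% branches, 10% BTB miss rate, 4 cycle miss
# penalty, 92% prediction accuracy, 8 cycle misprediction penalty.
stall = branch_stall_cpi(0.20, 0.10, 4, 0.92, 8)
print(round(stall, 4))
```

The same function can be reused for other parameter sets by changing the arguments; adding the result to a base CPI gives the overall CPI.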
The L2 cache write data bus is 16 B wide and can perform a write to an independent cache
address every 4 processor cycles
If a write buffer between a write-through L1 cache and a write-back L2 cache ...
what would be the width of each write buffer entry?
Consider symmetric shared memory multiprocessor with three processors (P1, P2, P3) with 10
write-invalidate MSI snooping cache coherence. Each first level cache memory is write-back
direct-mapped, with four blocks each holding two words.
What is the resulting state (i.e., coherence state, tags, and data) of the caches and memory
after the given action? Assume initial state of all caches to be I. (Fill given tables for each cache
and memory after all instructions are executed. Show only the blocks that change; for example,
P0.B0: (I, 120, 00 01) after each instruction execution)
P0: read 120
P?: write 120 <- 60
P?: write 120 <- 80
P?: read 110
P3: write 130 <- 78
Block | Coherence state | tag | data
Memory
Address (tag) | data
110 | 00 10
120 | 00 20
130 | 00 25
Roll No.
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY, KARNATAKA
II Sem M.Tech (VL and SPML) - End semester Examinations 2022
VL 792-HIGH PERFORMANCE COMPUTER ARCHITECTURES
Date: 31-05-2022    Time: 2:30 PM - 4:30 PM
Duration: 2 hrs    Max. Marks: 30
Note: i) Assume any missing data suitably (state your assumption clearly)
ii) Answer all the questions
1 The table Q1 shows the execution time of five routines of a program running on a processor. 02
Find the total execution time and speed up if the execution time for the routines A, C and E
improved by 15% each.
Table Q1
RA | RB | RC | RD | RE
4 ms | 14 ms | 2 ms | 12 ms | 2 ms
2 What is the contribution to CPI of conditional branch stalls in a program with 20% branches 02
given that BTB with 5% miss rate, 4 cycle miss penalty, 92% prediction accuracy and 6 cycle
misprediction penalty is used for branch prediction? (Assume base CPI to be 1)
3 Write the instruction timing diagram for the following code snippet in 5-stage DLX pipeline 04
without and with data forwarding hardware
LD  R1, 0(R5)
LD  R2, 4(R5)
ADD R4, R1, R2
SD  R4, 0(R5)
AND R4, R1, R2
SD  R4, 2(R5)
4 ... 4-way set associative cache is used in ...
5 A program has instruction mix given in table Q5:
Table Q5
Instruction class | Branch | Load | Store | FP
Frequency | 16% | 15% | 10% | 20%
This program is run on a processor with direct mapped cache having four word block
Instruction miss rate is 2% and data miss rate is 16%. Cache miss penalty is 4 cycles + 1
cycle for each word. Calculate the main memory access time per instruction. (Hint: main
memory is accessed only on cache miss)
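A minimal sketch of the per-instruction memory time asked for in question 5. It assumes every instruction is fetched once, that only loads and stores make data references, and that a store miss costs the same block fill as a load miss. The function name `memory_stall_cycles_per_instruction` is illustrative.

```python
def memory_stall_cycles_per_instruction(instr_miss_rate, data_miss_rate,
                                        data_refs_per_instr, words_per_block,
                                        base_penalty=4, per_word_penalty=1):
    """Average main-memory cycles spent per instruction on cache misses."""
    # Miss penalty: 4 cycles plus 1 cycle for each word in the block.
    miss_penalty = base_penalty + per_word_penalty * words_per_block
    # Every instruction is fetched once; only loads and stores touch data.
    misses_per_instr = instr_miss_rate + data_refs_per_instr * data_miss_rate
    return misses_per_instr * miss_penalty


# Loads (15%) and stores (10%) are the data references in table Q5.
stall = memory_stall_cycles_per_instruction(
    instr_miss_rate=0.02, data_miss_rate=0.16,
    data_refs_per_instr=0.15 + 0.10, words_per_block=4)
print(round(stall, 3))
```

The result is in processor cycles per instruction; converting it to time needs a clock period, which the question does not give.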
6 Consider a machine with a page size of 1KB. There are 4KB of physical memory and 8KB of 06
virtual memory. The TLB is a fully associative cache with space for 4 entries that is currently
empty (use LRU replacement policy). Assume that TLB is reset on page fault. Current page
table entry is given in table Q61.
The memory address accesses for a program is 0x0294, 0x1A76, 0x05A4, 0x1923, 0x0CFF,
0x1A12, 0x0F9F, 0x0392, 0x0341.
Table Q61
Valid Bit | Physical Frame Number
...
Table Q62
Address | TLB (hit/miss) | Page table (Valid/page fault)
0x0294 | |
0x1A76 | |
0x05A4 | |
0x1923 | |
0x0CFF | |
0x1A12 | |
0x0F9F | |
0x0392 | |
0x0341 | |
a) What is the physical address corresponding to virtual address 0x1685?
b) Populate table Q62 and show the contents of page table and TLB after last memory access.
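The translation in part a) starts by splitting the virtual address into a page number and an offset. The sketch below shows that split for 1 KB pages; the frame number passed to `physical_address` is a purely hypothetical placeholder, since the real value has to be read from table Q61. Both helper names are illustrative.

```python
PAGE_SIZE = 1024                            # 1 KB pages, from the question
OFFSET_BITS = PAGE_SIZE.bit_length() - 1    # 10 offset bits


def split_virtual_address(vaddr):
    """Return (virtual page number, page offset) for a virtual address."""
    return vaddr >> OFFSET_BITS, vaddr & (PAGE_SIZE - 1)


def physical_address(frame_number, offset):
    """Join a physical frame number and a page offset."""
    return (frame_number << OFFSET_BITS) | offset


vpn, offset = split_virtual_address(0x1685)
print(vpn, hex(offset))

# HYPOTHETICAL frame number, only to show how the pieces are recombined;
# the actual frame for this page comes from the page table in Q61.
example_frame = 2
print(hex(physical_address(example_frame, offset)))
```

The same split applied to each address in the access trace gives the page numbers needed to track TLB hits and misses by hand.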
Consider symmetric shared memory multiprocessor with three processors (P1, P2, P3) with 07
write-invalidate MESI snooping cache coherence. Each first level cache memory is write-back
direct-mapped, with four blocks each holding two words.
What is the resulting state (i.e., coherence state, tags, and data) of the caches and memory
after the given action? Assume initial state of all caches to be I. (Fill given tables for each cache
and memory after all instructions are executed. Show only the blocks that change; for
example, P1.B0: (I, 120, 00 01) after each instruction execution)
P1: read 120
P1: write 120 <- 80
P3: write 120 <- 80
P1: read 110
P1: write 108 <- 48
P1: write 130 <- 48
Block | Coherence state | tag | data
B0 |
B1 |
B2 |
B3 |
Memory
Address (tag) | data
108 | ...
...
130 | 00 25
Roll No.
 
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY, KARNATAKA
II Sem M.Tech (VL and SPML) - Mid semester Examinations 2022
VL 792-HIGH PERFORMANCE COMPUTER ARCHITECTURES
Date: 28-3-2022    Time: 9:00 AM - 10:30 AM
Duration: 1½ hrs    Max. Marks: 25
Note: i) Assume any missing data suitably (state your assumption clearly)
ii) Answer all the questions
1 The following table gives instruction frequencies as well as how many cycles the 03
instructions take, for the different classes of instructions for a processor running at a
frequency of 200 MHz
Instruction Type | Frequency | Cycles
Loads & Stores | 30% | 6 cycles
Arithmetic Instructions | 50% | 4 cycles
All Others | 20% | 3 cycles
a) Calculate the CPI for the processor
b) The compiler expert says that if you double the number of registers, then the compiler will
generate code that requires only half the number of Loads & Stores. What would the new
CPI be?
c) Assuming that increased number of registers does not affect the clock rate, what is the
speed up due to increased number of registers?
2 a) ...
b) What percentage of the original execution time was occupied by floating point operations?
(Hint: Amdahl's law depends on the fraction of the original, unenhanced execution time that
could make use of enhanced mode. Thus, this 50% measurement cannot be used to compute
speedup with Amdahl's law.)
3 Consider a history based branch predictor with 3 bits for branch history and 2 bits for
prediction. A branch outcome has the pattern T T T T T N N T T T T T N N T T T T T N N
which repeats. Calculate the steady state prediction accuracy of the branch predictor for the
given branch (show the branch history table contents at steady state)
4 Consider a program with 15% branch instructions. This program is executed on a processor
where branch is resolved in the 11th stage. There are two branch predictor alternatives both
of which alter the clock cycle time (CCT). Predictor A has 10% misprediction rate with 20%
increase in CCT. Predictor B is 60% accurate increasing the CCT by 5%. Which branch
predictor is better? Justify your answer quantitatively (Assume ideal CPI to be 1)
5 Consider the following loop
LOOP: LD   R2, 0(R10)
      SUB  R2, R2, R3
      SD   R2, 0(R10)
      ADDI R10, R10, #4
      BNE  R10, R6, LOOP
This code snippet is executed in a MIPS pipelined processor having 5-stages (IF ID EX MEM
WB). Assume that branch is resolved in ID stage. Identify all data dependencies in the given
code snippet. Assume the loop takes exactly one iteration to complete. Classify the data
dependence as true, output or anti dependence. Also list hazard if any due to these
dependencies. Hint: fill the table (You can number the instructions)
Source Instruction | Destination Instruction | Dependency | Hazard
Assume a 5-stage pipeline (IF ID EX MEM WB) without ... support for a register read and write
in the same cycle. Assume ... stage. Write the instruction timing diagram ... the loop take to
execute one iteration (till branch resolution)? ... and schedule the loop optimally.
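The dependence table for the loop can be cross-checked mechanically. The sketch below lists register dependences within a single iteration only: it ignores the memory dependence between the LD and SD on 0(R10) and anything carried across iterations. The encoding of each instruction as (text, destination, sources) and the helper names are assumptions made for illustration.

```python
# Each instruction: (text, destination register or None, source registers).
LOOP = [
    ("LD   R2, 0(R10)",    "R2",  ["R10"]),
    ("SUB  R2, R2, R3",    "R2",  ["R2", "R3"]),
    ("SD   R2, 0(R10)",    None,  ["R2", "R10"]),
    ("ADDI R10, R10, #4",  "R10", ["R10"]),
    ("BNE  R10, R6, LOOP", None,  ["R10", "R6"]),
]


def written_between(program, reg, start, end):
    """True if any instruction strictly between start and end writes reg."""
    return any(program[k][1] == reg for k in range(start + 1, end))


def find_dependences(program):
    """List (kind, register, earlier index, later index), 1-based indices."""
    found = []
    for i, (_, dest_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            _, dest_j, srcs_j = program[j]
            # True (RAW): j reads the value that i produced.
            if dest_i in srcs_j and not written_between(program, dest_i, i, j):
                found.append(("true", dest_i, i + 1, j + 1))
            # Output (WAW): i and j write the same register in succession.
            if dest_i and dest_i == dest_j and not written_between(program, dest_i, i, j):
                found.append(("output", dest_i, i + 1, j + 1))
            # Anti (WAR): j overwrites a register that i still has to read.
            if dest_j in srcs_i and not written_between(program, dest_j, i, j):
                found.append(("anti", dest_j, i + 1, j + 1))
    return found


for kind, reg, src, dst in find_dependences(LOOP):
    print(f"{kind:6} on {reg:3}: I{src} -> I{dst}")
```

Which of these dependences turn into actual hazards depends on the pipeline assumptions (forwarding, register file timing), so the hazard column still has to be filled in by hand.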
Assume that the processor has dynamic scheduling using Tomasulo's algorithm. Write the
contents of following tables at the end of 5th cycle of operation. Assume there are two
reservation stations each for LD/SD, add/sub and mul/div instructions. (Issue - 1, LD/SD - 2
cycles; other EX - 1, write - 1)
Instruction Status
Instruction | Issue | Execute | Write
Roll No.
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY, KARNATAKA
II Sem M.Tech (VL and SPML) - End semester Examinations 2022
VL 792-HIGH PERFORMANCE COMPUTER ARCHITECTURES
Date: 31-05-2022    Time: 2:30 PM - 4:30 PM
Duration: 2 hrs    Max. Marks: 30
Note: i) Assume any missing data suitably (state your assumption clearly)
ii) Answer all the questions
1 The table Q1 shows the execution time of five routines of a program running on a processor. 02
Find the total execution time and speed up if the execution time for the routines A, C and E
improved by 15% each.
Table Q1
RA | RB | RC | RD | RE
4 ms | 14 ms | 2 ms | 12 ms | 2 ms
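A quick sketch for the total time and speedup in question 1. It assumes "improved by 15%" means the execution time of each of routines A, C and E drops by 15%; reading it as a 15% higher speed would give slightly different numbers. The names `TIMES_MS` and `improved_total` are illustrative.

```python
# Execution time of each routine in milliseconds, from Table Q1.
TIMES_MS = {"A": 4, "B": 14, "C": 2, "D": 12, "E": 2}


def improved_total(times, improved, improvement):
    """Total time after cutting the listed routines' time by `improvement`."""
    return sum(t * (1 - improvement) if name in improved else t
               for name, t in times.items())


old_total = sum(TIMES_MS.values())
# Assumption: "improved by 15%" means each routine's time drops by 15%.
new_total = improved_total(TIMES_MS, improved={"A", "C", "E"}, improvement=0.15)
speedup = old_total / new_total
print(old_total, round(new_total, 2), round(speedup, 4))
```

Because routines A, C and E account for only 8 ms of the 34 ms total, the overall speedup stays small, which is the Amdahl's law point the question is probing.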
 
 
 
 
 
 
 
2 What is the contribution to CPI of conditional branch stalls in a program with 20% branches 02
given that BTB with 5% miss rate, 4 cycle miss penalty, 92% prediction accuracy and 8 cycle
misprediction penalty is used for branch prediction? (Assume base CPI to be 1)
3 Write the instruction timing diagram for the following code snippet in 5-stage DLX pipeline 04
without and with data forwarding hardware.
LD  R1, 0(R5)
LD  R2, 4(R5)
ADD R4, R1, R2
SD  R4, 0(R5)
AND R4, R1, R2
SD  R4, 2(R5)
5 A program has instruction mix given in table Q5:
Table Q5
Instruction class | Branch | Load | Store | FP | Integer
Frequency | 16% | 15% | 10% | 20% | ...
This program is run on a processor with direct mapped cache having four word block.
Instruction miss rate is 2% and data miss rate is 16%. Cache miss penalty is 4 cycles + 1
cycle for each word. Calculate the main memory access time per instruction. (Hint: main
memory is accessed only on cache miss)
6 Consider a machine with a page size of 1KB. There are 4KB of physical memory and 8KB of 06
virtual memory. The TLB is a fully associative cache with space for 4 entries that is currently
empty (use LRU replacement policy). Assume that TLB is reset on page fault. Current page
table entry is given in table Q61.
The memory address accesses for a program is 0x0294, 0x1A76, 0x05A4, 0x1923, 0x0CFF,
0x1A12, 0x0F9F, 0x0392, 0x0341.
Table Q61
Valid Bit | Physical Frame Number
...
Table Q62
Address | TLB (hit/miss) | Page table (Valid/page fault)
0x0294 | |
0x1A76 | |
0x05A4 | |
0x1923 | |
0x0CFF | |
0x1A12 | |
0x0F9F | |
0x0392 | |
0x0341 | |
a) What is the physical address corresponding to virtual address 0x1686?
b) Populate table Q62 and show the contents of page table and TLB after last memory access.
7 A system with virtual memory has a two level paging scheme and a cache. The PIPT cache 03
has a 90% hit ratio with a lookup time of 20 ns. The TLB has a 96% hit ratio with a lookup time of 16
ns. Main memory access time is 50 ns. What is the average time to read a location from
memory?
8 Consider symmetric shared memory multiprocessor with three processors (P1, P2, P3) with 07
write-invalidate MESI snooping cache coherence. Each first level cache memory is write-back
direct-mapped, with four blocks each holding two words.
What is the resulting state (i.e., coherence state, tags, and data) of the caches and memory
after the given action? Assume initial state of all caches to be I. (Fill given tables for each cache
and memory after all instructions are executed. Show only the blocks that change; for
example, P1.B0: (I, 120, 00 01) after each instruction execution)
Block | Coherence state | tag | data