
Hpca Pyqp

The document outlines the mid-semester examinations for M.Tech students in High-Performance Computer Architectures at the National Institute of Technology, Karnataka, including various questions on instruction frequencies, CPI calculations, branch prediction, and MIPS pipelined processor execution. It includes problems related to processor performance metrics, branch prediction accuracy, and loop optimization techniques. Assumptions are to be clearly stated for any missing data in the calculations and analyses.
Roll No.: [illegible]
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY, KARNATAKA
II Sem M.Tech (VL and SPML) - Mid Semester Examinations 2022
VL 792 - HIGH PERFORMANCE COMPUTER ARCHITECTURES
Time: 9:00 AM - 10:30 AM    Max. Marks: 25
Date: 28-3-2022    Duration: 1½ hrs
Note: i) Assume any missing data suitably (state your assumption clearly). ii) Answer all the questions.

1. [03 marks] The following table gives instruction frequencies as well as how many cycles the instructions take, for the different classes of instructions, for a processor running at a frequency of 200 MHz.

    Instruction Type           Frequency    Cycles
    Loads & Stores             30%          6 cycles
    Arithmetic Instructions    50%          4 cycles
    All Others                 20%          3 cycles

a) Calculate the CPI for the processor.
b) The compiler expert says that if you double the number of registers, then the compiler will generate code that requires only half the number of Loads & Stores. What would be the new CPI?
c) Assuming that the increased number of registers does not affect the clock rate, what is the speedup due to the increased number of registers?

2. [Question stem and part a) are not legible in the scan.]
b) What percentage of the original execution time was occupied by floating point operations? (Hint: Amdahl's law depends on the fraction of the original, unenhanced execution time that could make use of enhanced mode. Thus, this 50% measurement cannot be used to compute speedup with Amdahl's law.)

3. [04 marks] Consider a history based branch predictor with 3 bits for branch history and 2 bits for prediction. A branch outcome has the pattern T T T T T N N T ... which repeats. Calculate the steady state prediction accuracy of the branch predictor for the given branch (show the branch history table contents at steady state).

4. [03 marks] Consider a program with 15% branch instructions. This program is executed on a processor where the branch is resolved in the 11th stage. There are two branch predictor alternatives, both of which alter the clock cycle time (CCT). Predictor A has a 10% misprediction rate with a 20% increase in CCT.
Predictor B is 80% accurate, increasing the CCT by 5%. Which branch predictor is better? Justify your answer quantitatively. (Assume ideal CPI to be 1.)

5. [12 marks] Consider the following loop:

    LOOP: LD    R2, 0(R10)
          SUB   R2, R2, R3
          SD    R2, 0(R10)
          ADDI  R10, R10, #4
          BNE   R10, R6, LOOP

This code snippet is executed in a MIPS pipelined processor having 5 stages (IF ID EX MEM WB). Assume that the branch is resolved in the ID stage.

Identify a[...] Assume the loop takes [...] as true, [...] Hint: [...] and schedule the loop optimally.

e) Assume that the processor has dynamic scheduling using Tomasulo's algorithm. Write the contents of the following tables at the end of the 5th cycle of operation. Assume there are two reservation stations each for LD/SD, add/sub and mul/div instructions. (Issue - 1, LD/SD - [illegible] cycles; other EX - 1, write - 1)

    Instruction Status:     Instruction | Issue | Execute | Write
    Register fields:        Field | R0 | R1 | [...]
    Reservation Stations:   Name | Busy | Op | Vj | Vk | Qj | Qk | A
                            (rows: Load1, Load2, Add1, Add2, Mult1, Mult2)

6. The following code does computation over two vectors:

          ADDI  R4, R1, #800
    Loop: LD    F2, 0(R1)
          [illegible]
          LD    F6, 0(R2)
          DIVD  F6, F6, F8
          SUBD  F6, F2, F6
          SD    F6, 0(R1)
          ADDI  R1, R1, #8
          ADDI  R2, R2, #8
          SUB   R3, R4, R1
          BNEZ  R3, Loop

    Execution Stage                  Cycles
    Floating Point Multiplication    [illegible]
    Floating Point Division          [illegible]
    Floating Point Subtraction       [illegible]
    Integer Calculation              [illegible]
    Memory Access                    [illegible]

Assume a single-issue pipeline with data forwarding, not using Tomasulo's algorithm. Unroll the loop as many times as necessary and schedule it without any stalls, collapsing the loop overhead instructions. How many cycles does the single loop take to execute? (Show the execution of the scheduled unrolled code by filling the table.)

    Clock cycle (issue) | Instruction

Assume a dual-issue pipeline using Tomasulo's algorithm with speculation.
Show the execution of three iterations. (Fill the table showing iteration number, instruction, issue cycle number, Execute/Memory, Write CDB, Commit.) How many cycles does the single loop take to execute? Assume branches are predicted to be taken. Assume 5 reservation stations for integer operations, 3 reservation stations for load, 3 reservation stations for store, 2 reservation stations for floating point addition/subtraction, and 2 reservation stations for floating point multiplication/division. Also, assume two functional units of each type.

Consider a processor with a 32-bit address line. The physical address is 24 bits wide. Page size is 256 B. There is a fully-associative TLB with 2 entries using true LRU replacement policy. Each process has a 1-level linear page table for address translations. The processor has a 1 KB unified cache, 4-way associative, with each cache line of size 6 Bytes. A virtually indexed, physically tagged cache is used. A virtual address 0xEA0207B3 maps to physical frame number 0x3071 [remainder of the question is not legible in the scan].

Roll No.:
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY, KARNATAKA
II Sem M.Tech (VL) - End Semester Examinations 2019
VL 792 - HIGH PERFORMANCE COMPUTER ARCHITECTURES
Time: 3:00 PM - 6:00 PM    Date: 27-4-2019
Duration: 3 hrs    Max. Marks: 50
Note: i) Assume any missing data suitably (state your assumption clearly). ii) Answer all the questions.

Compare VLIW and superscalar processors.

Write a brief note on NetBurst architecture (clearly stating pipeline structure, memory management, branch handling and execution order).

Suppose that the code to run a transaction is 80% parallelizable (such that performance scales linearly with the number of cores), but 20% is serial (can only run on one core) and only one transaction can be run at a time. Suppose an application runs at 10000 transactions per second on the single-core chip. How many transactions per second does it achieve on quad-core?
(Given: the quad core has a speedup of 70% per core as compared to the single core.)

What is the contribution to CPI of conditional branch stalls in a program with 20% branches, given that a BTB with 10% miss rate, 4 cycle miss penalty, 92% prediction accuracy and 8 cycle misprediction penalty is used for branch prediction?

The L2 cache write data bus is 16 B wide and can perform a write to an independent cache address every 4 processor cycles. If a write buffer between a write-through L1 cache and a write-back L2 cache [...], what would be the width of each write buffer entry?

[10 marks] Consider a symmetric shared memory multiprocessor with three processors (P1, P2, P3) with write-invalidate MSI snooping cache coherence. Each first level cache memory is write-back direct-mapped, with four blocks each holding two words. What is the resulting state (i.e., coherence state, tags, and data) of the caches and memory after the given actions? Assume the initial state of all caches to be I. (Fill the given tables for each cache and memory after all instructions are executed. Show only the blocks that change; for example, P0.B0: (I, 120, 00 01) after each instruction execution.)

    Actions (processor labels partly illegible in the scan):
    P0: read 120
    P?: write 120 <- 60
    P?: write 120 <- 80
    P?: read 110
    [illegible]
    P3: write 130 <- 78

    Cache table:   Block | Coherence state | Tag | Data

    Memory:
    Address (tag)    Data
    110              00 10
    120              00 20
    130              00 25

Roll No.:
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY, KARNATAKA
II Sem M.Tech (VL and SPML) - End Semester Examinations 2022
VL 792 - HIGH PERFORMANCE COMPUTER ARCHITECTURES
Time: 2:30 PM - 4:30 PM    Date: 31-05-2022
Duration: 2 hrs    Max. Marks: 30
Note: i) Assume any missing data suitably (state your assumption clearly). ii) Answer all the questions.

1. [02 marks] Table Q1 shows the execution time of five routines of a program running on a processor. Find the total execution time and speedup if the execution time for the routines A, C and E is improved by 15% each.

    Table Q1
    RA      RB       RC      RD       RE
    4 ms    14 ms    2 ms    12 ms    2 ms

2. [02 marks] What is the contribution to CPI of conditional branch stalls in a program with 20% branches, given that a BTB with 5% miss rate, 4 cycle miss penalty, 92% prediction accuracy and 6 cycle misprediction penalty is used for branch prediction? (Assume base CPI to be 1.)

3. [04 marks] Write the instruction timing diagram for the following code snippet in a 5-stage DLX pipeline without and with data forwarding hardware.

    LD   R1, 0(R5)
    LD   R2, 4(R5)
    ADD  R4, R1, R2
    SD   R4, 0(R5)
    AND  R4, R1, R2
    SD   R4, 2(R5)

4. [...] 4-way set associative cache is used in [remainder of the question is not legible in the scan].

5. A program has the instruction mix given in Table Q5.

    Table Q5
    Instruction class    Branch    Load    Store    FP
    Frequency            16%       15%     10%      20%

This program is run on a processor with a direct mapped cache having four-word blocks. Instruction miss rate is 2% and data miss rate is 16%. Cache miss penalty is 4 cycles + 1 cycle for each word. Calculate the main memory access time per instruction. (Hint: main memory is accessed only on a cache miss.)

6. [06 marks] Consider a machine with a page size of 1 KB. There are 4 KB of physical memory and 8 KB of virtual memory. The TLB is a fully associative cache with space for 4 entries that is currently empty (use LRU replacement policy). Assume that the TLB is reset on a page fault. The current page table is given in Table Q61. The memory address accesses for a program are 0x0294, 0x1A76, 0x05A4, 0x1923, 0x0CFF, 0x1A12, 0x0F9F, 0x0392, 0x0341.
    Table Q61 (page table):   Valid Bit | Physical Frame Number   [entries not legible in the scan]

    Table Q62
    Address    TLB (hit/miss)    Page table (valid/page fault)
    0x0294
    0x1A76
    0x05A4
    0x1923
    0x0CFF
    0x1A12
    0x0F9F
    0x0392
    0x0341

What is the physical address corresponding to virtual address 0x1685?
Populate Table Q62 and show the contents of the page table and TLB after the last memory access.

[07 marks] Consider a symmetric shared memory multiprocessor with three processors (P1, P2, P3) with write-invalidate MESI snooping cache coherence. Each first level cache memory is write-back direct-mapped, with four blocks each holding two words. What is the resulting state (i.e., coherence state, tags, and data) of the caches and memory after the given actions? Assume the initial state of all caches to be I. (Fill the given tables for each cache and memory after all instructions are executed. Show only the blocks that change; for example, P1.B0: (I, 120, 00 01) after each instruction execution.)

    Actions (processor labels partly illegible in the scan):
    P1: read 120
    P?: write 120 <- 80
    P3: write 120 <- 80
    P?: read 110
    P?: write 108 <- 48
    P?: write 130 <- 48

    Cache table:   Block (B0, B1, B2, B3) | Coherence state | Tag | Data

    Memory:
    Address (tag)    Data
    [illegible]      [illegible]
    130              00 25

Roll No.: [illegible]
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY, KARNATAKA
II Sem M.Tech (VL and SPML) - Mid Semester Examinations 2022
VL 792 - HIGH PERFORMANCE COMPUTER ARCHITECTURES
Time: 9:00 AM - 10:30 AM    Max. Marks: 25
Date: 28-3-2022    Duration: 1½ hrs
Note: i) Assume any missing data suitably (state your assumption clearly). ii) Answer all the questions.

1. [03 marks] The following table gives instruction frequencies as well as how many cycles the instructions take, for the different classes of instructions, for a processor running at a frequency of 200 MHz.

    Instruction Type           Frequency    Cycles
    Loads & Stores             30%          6 cycles
    Arithmetic Instructions    50%          4 cycles
    All Others                 20%          3 cycles

a) Calculate the CPI for the processor.
b) The compiler expert says that if you double the number of registers, then the compiler will generate code that requires only half the number of Loads & Stores. What would be the new CPI?
c) Assuming that the increased number of registers does not affect the clock rate, what is the speedup due to the increased number of registers?

2. [Question stem and part a) are not legible in the scan.]
What percentage of the original execution time was occupied by floating point operations? (Hint: Amdahl's law depends on the fraction of the original, unenhanced execution time that could make use of enhanced mode. Thus, this 50% measurement cannot be used to compute speedup with Amdahl's law.)

3. Consider a history based branch predictor with 3 bits for branch history and 2 bits for prediction. A branch outcome has the pattern T T T T T N N T T T T T N N T T T T T N N which repeats. Calculate the steady state prediction accuracy of the branch predictor for the given branch (show the branch history table contents at steady state).

4. Consider a program with 15% branch instructions. This program is executed on a processor where the branch is resolved in the 11th stage. There are two branch predictor alternatives, both of which alter the clock cycle time (CCT). Predictor A has a 10% misprediction rate with a 20% increase in CCT. Predictor B is 60% accurate, increasing the CCT by 5%. Which branch predictor is better? Justify your answer quantitatively. (Assume ideal CPI to be 1.)

5. Consider the following loop:

    LOOP: LD    R2, 0(R10)
          SUB   R2, R2, R3
          SD    R2, 0(R10)
          ADDI  R10, R10, #4
          BNE   R10, R6, LOOP

This code snippet is executed in a MIPS pipelined processor having 5 stages (IF ID EX MEM WB). Assume that the branch is resolved in the ID stage.

Identify all data dependencies in the given code snippet. Assume the loop takes exactly one iteration to complete. Classify each data dependence as true, output or anti dependence. Also list the hazards, if any, due to these dependencies. Hint: fill the table (you can number the instructions).

    Source Instruction | Destination Instruction | Dependency | Hazard

Assume a 5-stage pipeline (IF ID EX MEM WB) without support for [...] a regi[...] and write in the [...]. Assu[...] stage. Write the instruction timing diagram [...] the loop take to execute one iteration (till branch resolution)?

[...] and schedule the loop optimally.

Assume that the processor has dynamic scheduling using Tomasulo's algorithm. Write the contents of the following tables at the end of the 5th cycle of operation. Assume there are two reservation stations each for LD/SD, add/sub and mul/div instructions. (Issue - 1, LD/SD - 2 cycles; other EX - 1, write - 1)

    Instruction Status:   Instruction | Issue | Execute | Write

Roll No.: [illegible]
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY, KARNATAKA
II Sem M.Tech (VL and SPML) - End Semester Examinations 2022
VL 792 - HIGH PERFORMANCE COMPUTER ARCHITECTURES
Date: 31-05-2022    Time: 2:30 PM - 4:30 PM
Duration: 2 hrs    Max. Marks: 30
Note: i) Assume any missing data suitably (state your assumption clearly). ii) Answer all the questions.

1. [02 marks] Table Q1 shows the execution time of five routines of a program running on a processor. Find the total execution time and speedup if the execution time for the routines A, C and E is improved by 15% each.
    Table Q1
    RA      RB       RC      RD       RE
    4 ms    14 ms    2 ms    12 ms    2 ms

2. [02 marks] What is the contribution to CPI of conditional branch stalls in a program with 20% branches, given that a BTB with 5% miss rate, 4 cycle miss penalty, 92% prediction accuracy and 8 cycle misprediction penalty is used for branch prediction? (Assume base CPI to be 1.)

3. [04 marks] Write the instruction timing diagram for the following code snippet in a 5-stage DLX pipeline without and with data forwarding hardware.

    LD   R1, 0(R5)
    LD   R2, 4(R5)
    ADD  R4, R1, R2
    SD   R4, 0(R5)
    AND  R4, R1, R2
    SD   R4, 2(R5)

5. A program has the instruction mix given in Table Q5.

    Table Q5
    Instruction class    Branch    Load    Store    FP     Integer
    Frequency            16%       15%     10%      20%    [illegible]

This program is run on a processor with a direct mapped cache having four-word blocks. Instruction miss rate is 2% and data miss rate is 16%. Cache miss penalty is 4 cycles + 1 cycle for each word. Calculate the main memory access time per instruction. (Hint: main memory is accessed only on a cache miss.)

6. [06 marks] Consider a machine with a page size of 1 KB. There are 4 KB of physical memory and 8 KB of virtual memory. The TLB is a fully associative cache with space for 4 entries that is currently empty (use LRU replacement policy). Assume that the TLB is reset on a page fault. The current page table is given in Table Q61. The memory address accesses for a program are 0x0294, 0x1A76, 0x05A4, 0x1923, 0x0CFF, 0x1A12, 0x0F9F, 0x0392, 0x0341.

    Table Q61 (page table):   Valid Bit | Physical Frame Number   [entries not legible in the scan]
    Table Q62:                Address | TLB (hit/miss) | Page table (valid/page fault)

a) What is the physical address corresponding to virtual address 0x1686?
b) Populate Table Q62 and show the contents of the page table and TLB after the last memory access.

7. [03 marks] A system with virtual memory has a two level paging scheme and a cache. The PIPT cache has a 90% hit ratio with a lookup time of 20 ns. The TLB has a [illegible]% hit ratio with a lookup time of 16 ns.
Main memory access time is 50 ns. What is the average time to read a location from memory?

8. [07 marks] Consider a symmetric shared memory multiprocessor with three processors (P1, P2, P3) with write-invalidate MESI snooping cache coherence. Each first level cache memory is write-back direct-mapped, with four blocks each holding two words. What is the resulting state (i.e., coherence state, tags, and data) of the caches and memory after the given actions? Assume the initial state of all caches to be I. (Fill the given tables for each cache and memory after all instructions are executed. Show only the blocks that change; for example, P1.B0: (I, 120, 00 01) after each instruction execution.)

    Cache table:   Block | Coherence state | Tag | Data   [rest of the page is not legible in the scan]
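For the branch-stall questions in the end semester papers (20% branches, a BTB miss rate and miss penalty, 92% prediction accuracy, and a misprediction penalty), one common way to set up the arithmetic is sketched below. It is only a sketch under an assumed model that the question text does not spell out: a BTB miss always costs the miss penalty, and a BTB hit costs the misprediction penalty only when the prediction is wrong. The function name `branch_stall_cpi` is hypothetical.

```python
def branch_stall_cpi(branch_freq, btb_miss_rate, miss_penalty,
                     accuracy, mispredict_penalty):
    """Stall cycles per instruction caused by branches (assumed model)."""
    # Assumption: every BTB miss pays the full miss penalty.
    miss_cost = btb_miss_rate * miss_penalty
    # Assumption: a BTB hit pays the misprediction penalty only when wrong.
    hit_cost = (1 - btb_miss_rate) * (1 - accuracy) * mispredict_penalty
    return branch_freq * (miss_cost + hit_cost)


# 2019 paper: 10% BTB miss rate, 8-cycle misprediction penalty.
cpi_2019 = branch_stall_cpi(0.20, 0.10, 4, 0.92, 8)
# 2022 paper: 5% BTB miss rate; the scans show a 6- or an 8-cycle penalty.
cpi_2022_8 = branch_stall_cpi(0.20, 0.05, 4, 0.92, 8)
cpi_2022_6 = branch_stall_cpi(0.20, 0.05, 4, 0.92, 6)

print(f"2019 paper:           {cpi_2019:.4f}")
print(f"2022 paper (8-cycle): {cpi_2022_8:.4f}")
print(f"2022 paper (6-cycle): {cpi_2022_6:.4f}")
```

If the paper intends a different model (for example, a misprediction penalty that also applies on BTB misses), only the two cost lines need to change; the assumption should be stated in the answer, as the paper's note requires.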

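The CPI question of the mid semester paper (loads and stores 30% at 6 cycles, arithmetic 50% at 4 cycles, all others 20% at 3 cycles) can be checked with a few lines of arithmetic. The sketch below assumes that halving the loads and stores removes those instructions from the program, so the instruction count shrinks to 85% of the original, and that the clock rate is unchanged. The variable and function names are illustrative only.

```python
# Instruction mix from the question: class -> (frequency, cycles).
mix = {
    "loads_stores": (0.30, 6),
    "arithmetic": (0.50, 4),
    "others": (0.20, 3),
}


def cycles_per_original_instruction(m):
    """Total cycles, normalised to one instruction of the original program."""
    return sum(freq * cycles for freq, cycles in m.values())


# Part a): CPI of the original program (frequencies sum to 1).
base_cpi = cycles_per_original_instruction(mix)

# Part b): assumption - doubling the registers halves only the loads/stores.
new_mix = dict(mix)
new_mix["loads_stores"] = (0.15, 6)
new_count = sum(freq for freq, _ in new_mix.values())  # fraction of original count
new_cycles = cycles_per_original_instruction(new_mix)
new_cpi = new_cycles / new_count

# Part c): same clock rate, so speedup is the ratio of total cycles.
speedup = base_cpi / new_cycles

print(f"base CPI = {base_cpi:.3f}")
print(f"new CPI  = {new_cpi:.3f}")
print(f"speedup  = {speedup:.3f}")
```

Note that the new CPI is computed over the reduced instruction count, which is why the speedup is taken from total cycles and not from the ratio of the two CPI values.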