Answer:: Remark
Answer:: Remark
單選題
1.   Pipelining in CPU design is aimed to provide the optimal
     (A) throughput (B) latency (C) space parallelism (D) caching of instruction execution.
Answer: (A)
2.   Given a non-pipelined CPU operating at 50 MHz, we are optimizing the CPU with a 4-stage
     pipeline design. In an ideal case, what is the operating clock frequency for the 4-stage pipeline
     CPU? (A) 50 MHz (B) 100 MHz (C) 150 MHz (D) 200 MHz (E) 400 MHz
Answer: (D)
3.   Direct Memory Access (DMA) is (A) worse (B) better (C) the same
     in performance for small transfers than interrupt driven I/O.
Answer: (A)
4.   Suppose a 32-bit CPU with physical address bus A31 A30 ..... A1 A0 and assume that the data cache
     has the following structure:
              Cache structure is set associative with 2 lots per set
              Cache size is 128 KBytes
              Cache block size is 16 Bytes and a word is 32-bit (4 Bytes)
              Cache is indexed with physical address
     The address bits (A) A16 A15 ...A4 (B) A15 A14 ...A4 (C) A14 A13 ...A4 (D) A13 A12 ...A4 (E) A15 A14 ...A3
     are used as index address for the cache.
Answer: (B)
Remark: Number of sets = 128KB / (16B  2) = 4K
Answer: (A)
Remark: Number of sets = 128KB / (16B) = 8K
Answer: (B)
Remark: Execution time = [(5 – 1) + 10G + 2.2  10G] /1G = 32
7.   What instruction in ARM processors does not affect the conditional code?
     (A) ADDS r0, r1, r2 (B) ADD r0, r1, r2 (C) CMP r1, r2 (D) TST r1, r2
Answer: (B)
8.   An application spends 80% of its time doing multiply instructions. If the multiplier is sped up by
     4 times, the application will run (A) 5 (B) 4 (C) 3 (D) 2.5 (E) 2 times faster.
Answer: (D)
9.   An application spends 80% of its time doing multiply instructions. If the multiplier is sped up by
     infinite times, the application will run (A) 5 (B) 4 (C) 3 (D) 2.5 (E) 2 times faster.
Answer: (A)
複選題
10. (A) instruction pre- fetch (B) instruction buffer (C) branching (D) cache misses (B) resource
    constraints will keep the pipeline from being full.
12. If we run the following program on a 32-bit machine, what outputs might be generated?
       #include <stdio.h>
       int main ( )
       {
            int A[3] = {1, 2, 3};
            int *ptr;
            ptr A;
            printf(" %p : %d \n", ptr, *ptr) ;
            ptr++;
            printf(" %p : %d \n", ptr, *ptr) ;
            return 0;
       }
     (A) 0xbfe5a870:1
           0xbfe5a874:2
     (B) 0xbfela5e0:l
           0xbfela5e4:2
     (C) 0xbfe5a870:2
           0xbfe5a874:3
     (D) 0xbfela5e0 :2
           0xbfela5e4:3
     (E) 0xbfe5a870:1
           0xbfe5a874:3
Answer
   (1) The average CPI of the original processor = 6  0.2 + 9  0.3 + 15  0.5 = 11.4
                                     X                 X                  X
           1× 11.4       2×   6−           × 0.6+ 9−         × 0.2+ 15−         × 0.2
     (2)             >              100                50                 20
                                                                                         X > 109.09
            100                                        X
     (3) The average CPI after adding cache for original processor = 11.4 – 2.5 + 0.5 = 9.4
                                    X                 X               X
           1× 9.4       2×    6−          × 0.6+ 9−         × 0.2+ 15 −        × 0.2
                    >              100                50              20
                                                                                         X > 125.37
            100                                   X
Answer
   (1)
                Instruction       Opcode         rs         rt           rd     shamt    funct
           addi $15, $0, -8                    00000      01111          1111 1111 1111 1000
           sltu $16, $15, $0                   01111      00000        10000 00000
           jalr $16, $15                       10000      00000        01111 00000
   (2) addi instruction datapath:
                            4               31-26     To control
             32 bits                                  unit                        ALUOP
                                  Add                                       32 bits
                                             25-21 (15)
                                                           RR1        RD1
                                             20-16 (0)
                        P                                  WR
                                                                                      ALU
                        C         Instr.                   WD
                                  Mem.                                                       32 bits
                                                                 SE
                                              15-0 (-8)                     32 bits
                             4               31-26   To main
              32 bits                                control unit
                                      Add                                        ALUOP
                                                            RegWrite
                                             25-21 (15)                     32 bits
                                                           RR1      RD1
                         P                   20-16 (0)
                                                           RR2
                         C        Instr.     15-11 (16)                               ALU
                                                           WR       RD2
                                  Mem.                                                      32 bits
                                                           WD               32 bits
                                              5-0
                                                     To ALU control unit
         jalr instruction datapath:
                                4
                                      Add                31-26 To main
                                                               control unit
                                                         25-21 (16)
                                                                      RR1     RD1
                            P                            15-11 (15)
                            C         Instr.                          WR
                                                                              RD2
                                      Mem.
                                                                      WD
                                               32 bits
                                                                              32 bits
3.   Suppose there exists a 12-bit IEEE 754 floating point format, with 1 sign bit, 6 exponent bits,
     and 5 mantissa bits.
     (1) How would - be represented in this 12-bit format? And what is the smallest positive
         normalized number. Give the value in decimal of the second number, and show both
         either as 12 bits or as 3 hexadecimal digits.
     (2) Give the nearest representation n of 5.612 in this format.
     (3) What is the actual value of n? Hence, work out its relative error r, to 3 significant digits.
         You may use the fact that a / 5.612 ≈ a  0.1782.
     (4) Calculate n2 using binary floating point multiplication. Show rounding, normalization
         and where you might check for overflow. Give the result as a 12-bit IEEE 754 number.
Answer
   (1)
                                     Decimal value      Binary format               Hexadecimal format
          Smallest positive                -          1 111111 00000                      FE0
          normalized number            1.0  2 -30
                                                       0 000001 00000                      020
     (2) 5.61210 = 101.10011….  2 = 1.0110011….  22
                                     0
          0 100001 01101
     (3) The actual value of n = 101.1012 = 5.625
         r = (5.625 − 5.612) / 5.612 ≈ 0.013 × 0.1782 ≈ 2.32 × 10−3
     (4)
         No normalization necessary; 6-bit excess-31 exponent can accommodate 4 = 1000112 , so no
         overflow. Result: 0 100011 111112 = 11111.12 = 31.510
4.   The simple datapath with the control unit for the MIPS processor is shown below. The input
     to the ALU control unit is the 6-bit opcode field from the instruction.
Add
4 Add Sum
                                                                                    Shift                         PCSrc
                                                                                    left 2
                                  Instruction [31-26]
                                                         Control
Instruction [5-0]
Answer
   (1)
                                               Memto-      Reg-      Mem-     Mem-
           Instruction     RegDst   ALUSrc                                              Branch
                                                Reg        Write     Read     Write
               add            1        0         0          1         0        0           0
               lw             0        1         1          1         1        0           0
               sw                     1                   0         0        1           0
               beq                    0                   0         0        0           1
   (2) If the instruction beq is taken, the new PC address = 52 + 8  4 = 84
       If the instruction beq is not taken, the new PC address = 52
   (3) The clock cycle time is 2220 ps.
            Instruction                                 Latency
                add         500 + 220 + 100 + 180 + 100 + 220 = 1300 (ps)
                beq         500 + 220 + 100 + 180 + 100 =1100 (ps)
                 lw         500 + 220 + 180 + 1000 + 100 + 220 = 2220 (ps)
                 sw         500 + 220 + 180 + 1000 =1900 (ps)
   (4) 2220 – 500 – 1000 = 720 (ps)
   (5) The RAW data dependency between instruction pairs (add, sw) and (add, beq) will affect the
       correctness of the given instructions execution.
   (6) Delayed branch is a type of branch where the instruction immediately following the branch is
          always executed, independent of whether the branch condition is true or false.
          The following shows the code after “from before” scheduling.
                   lw    $1, 50($7)
                   add   $4, $5, $6
                   beq   $1, $4, 8
                   sw    $4, 50($7)
5.   In the original processor shown in problem 4, some actions will be taken when an exception occurs.
     Please pickup the right things from the following table and give correct order about those actions.
     (ex: (a)  (b)  (c))
     (a) stopping the execution of the program     (b) automatically execute a jal instruction
     (c) save all register values into the stack   (d) execute the predefined actions for exceptions
     (e) flushing all instructions                 (f) transfer the control to OS at some specified address
     (g) automatically stall one cycle             (h) save the address of the offending instruction in EPC
6.   In this problem, we assume that the processor in problem 4 has been pipelined into 5 stages: IF,
     ID, EX, MEM, and WB with necessary pipeline registers.
     (1) For the datapath shown in problem4, please modify the datapath to implement the forwarding
          capability that can eliminate the stalls due the data dependency in the code sequence given
          in problem 4. In order to simplify the answers, only the datapath of EX and MEM stages
          are required to be redrawn on your answer sheet.
     (2) Please explain the meanings of the added signals in (1) with the forwarding capability. The
          triggering conditions of the added control signals are also required.
     (3) With the forwarding capability in (1), please give the values of the control
          signals (RegDst, ALUSrc, MemtoReg, RegWrite, Branch) at the fifth cycle while
          executing the given code sequence in problem 4.
Answer
   (1)
                      ID/EX
                                                    EX/MEM
                                                                          MEM/WB
                                                            RegWrite
                                                                 Data
                                                  ALU           memory
FB FA
                             Rs
                             Rt
                             Rt                              EX/MEM.RegisterRd
                             Rd
                                             Forwarding
                                                unit
     (2)
                     Input            Bits                        Usage
              ID/EX.RegisterRs         5     operand reg number, compare to see if match
              ID/EX.RegisterRt         5     operand reg number, compare to see if match
            EX/MEM.RegisterRd          5     destination reg number, compare to see if match
             EX/MEM.RegWrite           1     TRUE if writes to the destination reg
                  Output              Bits                        Usage
                    FA                 2                     forwarding signal
                    FB                 2                     forwarding signal
     (3)
            Control signal        RegDst     ALUSrc       MemtoReg      RegWrite    Branch
               Value                X          1             1             1           0
7.   State whether the following techniques are associated primarily with a software- or
     hardware-based approach to exploiting ILP (Instruction-Level Parallelism). If it is associated
     with a software-based approach, please give an 'S' as the answer. If it is associated with a
     hardware-based approach, please give an 'H' as the answer. In some cases, the answer may be
     both.
     (1) Branch prediction
     (2) Register renaming
     (3) Speculation
     (4) Superscalar
     (5) VLIW
Answer
                  (1)              (2)              (3)              (4)               (5)
                   B                B                B                H                 S
8.   Memory Hierarchy:
     (1) What is the advantage of separating instruction cache and data cache when comparing to
         the unified cache?
     (2) Please state why it is "virtual" in the so-called virtual memory?
     (3) A computer system has 1G bytes main memory and 512K bytes cache with 2-way set
         associativity and 4-byte block size. The partial content of the cache is listed as follows at
         a specific time instance. Please identify the possible two lowest addresses of the data that
         is shaded (i.e. D2).
         (Note: 1... 1 denotes consecutive 1 s and 0.. .0 denotes consecutive 0s)
                Index        Tag      Data(11)      Data(10)      Data(01)   Data(00)
                          100…0           81            82           83          84
                  0
                           011…1          91            92           93          94
                          000…0          A1             A2          A3           A4
                  1
                           011…1         B1             B2          B3           B4
                           111…1         C1             C2          C3           C4
                  2
                          100…0          D1             D2          D3           D4
                          100…0           E1            E2           E3          E4
                  3
                          000…0           F1            F2           F3          F4
Answer
   (1) Split cache compares with unified cache has the advantage of doubling the cache bandwidth.
   (2) In virtual memory system, all physical memory is controlled by the operating system. Some
       memory can be stored in physical RAM chips while other memory is stored on a hard drive.
       When a program needs memory, it requests it from the operating system. Computer
       programmers no longer need to worry about where the memory is physically stored or whether
       the user's computer will have enough memory. The word “virtual” is used to represent the fact
       that some memory is actually stored on a hard drive.
   (3)
         Address format                Tag                      Index               offset
           Field length               12 bits                   16 bits             2 bits
           Binary value           100000000000           0000000000000010             10
102 交大資聯
單選題
1. Which one of the following statements is TRUE?
   (a) From the point of view of a network service provider, latency is a suitable metric to
       measure the performance of the network service server.
   (b) Mini-computer has disappeared from the computer market since main- frame computers
       have been well designed to have much more computing power than mini-computers.
   (c) The CPU operating clock rate has not been continuously and significantly improved these
       years because Moor's law has been invalidate since several years ago.
   (d) CPU and RF/analog designs are generally regarded as two main streams/representatives
       that can represent the semiconductor technology level of a nation.
   (e) With the fixed number of defects per area, a small die-area IC design has better
       manufacture yield than the big die-area one.
Answer: (e)
Answer: (e)
Remark(a): Same ISA means same IC and same circuit design means same CPI.
3.   Given the values of registers $t0 and $t1 in the table below, which of following statements is
     correct.
                                       $t0 = 0xAAAAAAAA
                                       $t1 = 0x22222222
     (a) For the initial values of registers in the table above, the value of$t2 will become
          0x55555552 after the execution of code sequence below.
                                         sll $t2, $t0, 3
                                         sr $t2, $t2, $t1
     (b) For the initial values of registers in the table above, the value of$t2 will become
          0x37777777 after the execution of code sequence below.
                                         srl $t2, $t0, 3
                                         or $t2, $t2, $t1
     (c) For the initial values of registers in the table above, the value of $t2 will become
          0xF7777777 after the execution of code sequence below.
                                         srl $t2, $t0, 3
                                         or $t2, $t2, $t1
     (d) For the initial values of registers in the table above, the value of$t2 will become 0 after
          the execution of instruction below.
                                         slt $t2, $t0, $t1
     (e) For the initial values of registers in the table above, the value of$t2 will become 1 after
          the execution of instruction below.
                                         sltu $t2, $t0, $t1
Answer: (b)
4.   Suppose the program counter (PC) is at address 0x80000000, which of following statements
     is incorrect.
     (a) It is possible to use one branch instruction to get to address 0x7fffff00
     (b) It is possible to use one jump instruction to get to address 0x7fffff00
     (c) It is possible to use one branch instruction to get to address 0x80000040
     (d) It is possible to use one branch instruction to get to address 0x80010000
     (e) It is possible to use one jump instruction to get to address 0x8fff0000
Answer: (b)
Answer: (c)
Remark(c): 5 times
複選題
6. Each bit’s carry out of a carry lookahead adder (CLA) is a function of each bit’s generate,
   propagate, and carry in signals. Consider a two- level 9-bit CLA that is composed of several
   3-bit CLA units with c0 as the carry in and c1 to c9 as the carry outs. Each CLA unit realizes
   3-bit group generate and group propagate functions based on 3-bit inputs, A signal is regarded
   validate if its value is correct; or, it is regarded as invalidate. Signals gi and pi are the generate
   and propagate signals at bit i while G i-j and Pi-j are the group generate and group propagate
   signals associated with bits i to j. Assume all input signals have been available, which ones of
   the following statements are correct ?
   (a) A number of 4 CLA units in total are required to assemble the two- level 9-bit CLA.
   (b) Carry outs c1 , c2 and c3 become validate earlier than other carry out signals.
   (c) G0-2 = g2 + p2 g1 + p2 p1 g0 and P0-2 = p2 p1 p0 .
   (d) Carry outs c4 and c5 become validate earlier than c6 .
   (e) Assume the delays of each NOT, isolated AND, XOR, AND-OR structure gates are 1, 2, 2,
        and 3 time units respectively, and the sum is realized by 2 XOR gates while the propagate
        is realized by a XOR gate. Then the longest delay of the 2- level 9-bit CLA is the 12 time
        units.
                                                  m3
                                        m2                 m3
                   m1
                                          m4
Remark(d):
            Word        Block
                                     Tag         Index     Hit/Miss             3C
           address     address
               0          0            0           0         Miss          Compulsory
              40         10            1           2         Miss          Compulsory
              80         20            2           4         Miss          Compulsory
             340         85           10           5         Miss          Compulsory
              56         14            1           6         Miss          Compulsory
              68         17            2           1         Miss          Compulsory
             172         43            5           3         Miss          Compulsory
             348         87           10           7         Miss          Compulsory
              48         12            1           4         Miss          Compulsory
              80         20            2           4         Miss           Conflict
Remark(e):
            Word        Block
                                     Tag         Index     Hit/Miss             3C
           address     address
               0          0           0            0         Miss          Compulsory
              40         10           1            1         Miss          Compulsory
              80         20           2            2         Miss          Compulsory
             340         85           9            4         Miss          Compulsory
              56         14           1            5         Miss          Compulsory
              68         17           1            8         Miss          Compulsory
             172         43           4            7         Miss          Compulsory
             348         87           9            6         Miss          Compulsory
              48         12           1            3         Miss          Compulsory
              80         20           2            2         Hit
9. A system has a 16-bit virtual address size with 256B pages and a 16-bit byte addressing
   physical memory with physically index 2-way set associative data cache (LRU replacement
   policy) of512B cache size and 16B blocks. The first six accesses to the page table are: 0x1332,
   0xad58, 0x0162, 0x2ea9, 0x813d, 0x9f5a. The system uses a single level page table. Assume
   the TLB and data cache are initially empty and their contents have been updated accordingly
   by the first six accesses (page table is shown below; VPN: virtual page number, PPN: physical
   page number). The CPU includes a fully associative TLB with 2 entries and an LRU
   replacement policy. Data reads from the following virtual addresses are performed (in the
   order listed): 0x813f, 0xl36f, 0x2e5f, 0x9f50, 0x1370. Which ones of following statements are
   correct?
                             VPN         PPN         VPN         PPN
                             0x01        0x27        0x81        0x8a
                             0x13        0x45        0x9f        0xcd
                             0x2e        0xe8        0xad        0x7e
    (a) The required numbers of bits for virtual page number and page offset in virtual address
        are all 8 bits.
    (b) The required number of bits for index and tag in physical address are 4 and 8 bits
        respectively.
    (c) Data access to 0x813fhas access hit in TLB and cache.
    (d) Data access to 0X9f50 has access miss in TLB and access hit in cache.
    (e) Data access to 0X1370 has access miss in TLB and cache.
Remark:
                           Virtual address      Physical address
                                                                       TLB      Cache
                           VPN        PO      tag    index offset
                (a), (b)   8bits     8bits   8bits    4bits   4bits    H/M       H/M
                             13       32      45       3        2       M         M
                             ad       58      7e       5        8       M         M
                             01       62      27       6        2       M         M
                             2e       a9      e8       a        9       M         M
                             81       3d      8a       3        d       M         M
                             9f       5a      cd       5        a       M         M
                  (c)        81       3f      8a       3        f       H         H
                             13       6f      45       6        f       M         M
                             2e       5f      e8       5        f       M         M
                  (d)        9f       50      cd       5        0       M         H
                  (e)        13       70      45       7        0       M         M
10. For the Single-Cycle MIPS CPU shown below, which of following statements are correct?
                                                                                   PCSrc
Add
4 Add Sum
                                               RegWrite         Shift
                                                                left 2
                       Instruction [25-21]
         Read                                Read                                  MemWrite
  PC
         address      Instruction [20-16]    register 2 Read
        Instruction                          Read data 1       ALUSrc                           MemtoReg
             [31-0]                          register 1                  ALU       Address
                                                        Read
        Instruction                          Write data 2                                Read
         memory                              Register                                    data
                                             Write
                      Instruction [15-11]
                                             data Registers              A         Write
                                                                                   data Data
                                  RegDst
                                                                                      memory
                      Instruction [15-0]      16    Sign 32     B
                                                   extend
                                                                                   MemRead
    (a) If the path labeled A has been cut, the instructions add, slt, and sw still can run correctly.
    (b) If the path labeled B has been cut, the instructions lw, sw and beq may fail.
    (c) If the control signal ALUsrc is stuck on 0, the instructions lw, sw, and beq may fail to run
        correctly.
    (d) If the control signal RegDst is stuck on 0, the instructions lw and sw still can run
        correctly.
    (e) If the control signal MemToReg is stuck on 1, the instructions add, sub, lw, and slt may
        fail to run correctly.
                           IF/IDWrite
                                                                                         ID/EX
                                                                                                                EX/MEM
                                                             Control                                                                  MEM/WB
  PCWrite
                           IF/ID                                             0
                                        Instruction
                                                                       Registers
             Instruction                                                                                     ALU
     PC                                                                                                                     Data
              memory
                                                                                                                           memory
                                                                                                     A
                                                                  IF/ID.RegisterRs
                                                                  IF/ID.RegisterRt
                                                                  IF/ID.RegisterRt                                       EX/MEM.
                                                                                                     Rt                  RegisterRd
                                                                  IF/ID.RegisterRd                   Rd
                                                                  ID/EX.RegisterRt
                                                                                                     Rt
                                                                                              B
                                                                                                          Forwarding     MEM/WB.RegisterRd
                                                                                                     Rs
                                                                                                             unit
            (a) Assume the path labeled A has been cut, the following code snippet will fail.
                     add $s2, $s0, $s1
                     add $s3, $s0, $s2
            (b) Assume the path labeled A has been cut, the following code snippet will fail.
                     add $t1, $t1, $s2
                     add $t1, $t0, $t2
                     add $s2, $s1, $t1
            (c) Assume the path labeled B has been cut, the following code snippet will fail.
                     add $t1, $t1, $s2
                     add $t1, $t0, $t2
                     add $s2, $s1, $t1
            (d) Assume the path labeled A has been cut, the following code snippet will fail.
                     lw $s0, 4($t1)
                     add $s2, $s0, $s1
            (e) Assume the path labeled B has been cut, the following code snippet will fail.
                     lw $s0, 4($t1)
                     add $s2, $s0, $s1
Answer: (d)
Remark(a): (5 – 1) + 4 + 1 + 1 = 10
Remark(d): (5 – 1) + 4 + 1 + 2 = 11
102清大資工
1.   Given the table of latencies of individual stages of a MIPS instruction, please answer the
     following questions:
        Instruction fetch Register read ALU operation Data Access Register write
              200ps            100ps             250ps            400ps             100ps
     (a) What is the clock cycle time in a nonpipelined and pipelined processor?
     (b) What is the total latency of a lw instruction in a nonpipelined and pipelined processor?
     (c) If the time for an ALU operation can be shortened/increased by 25%, how much
          speedup/slowdown will it be from pipelining? Explain the reason.
     (d) If you can split one stage into two, each with half the origina l latency, to improve the
          pipelining performance. Which stage would you split and what is the new clock cycle
          time in a pipelined processor?
Answer
    (a) The clock cycle time in a nonpipelined processor is 1050ps
        The clock cycle time in a pipelined processor is 400ps
    (b) The total latency of a lw instruction in a nonpipelined processor is 1050ps
        The total latency of a lw instruction in a pipelined processor is 2000ps
    (c) Shortening/Increasing the ALU operation will not affect the speedup /slowdown obtained
        from pipelining. It would not affect the clock cycle.
    (d) Data Access stage
        New clock cycle time = 250ps
2.   Please answer the following questions about interfacing I/O devices to the processor and
     memory:
     (a) Describe the following terminologies:
         • Memory- mapped I/O.
         • I/O instruction.
         • Device polling.
         • Interrupt-driven communication.
     (b) Which communication pattern is most appropriate for a "Video Game Controller"?
         Explain.
     (c) Prioritize the following interrupts caused by different devices:
         • Mouse Controller.
         • Reboot.
         • Overheat.
Answer
    (a)
          Memory-mapped I/O: An I/O scheme in which portions of address space are assigned to
            I/O devices and reads and writes to those addresses are interpreted as commands to the
              I/O device.
         I/O instruction: A dedicated instruction that is used to give a command to an I/O device and
              that specifies both the device number and the command word (or the location of the
              command word in memory).
         Device polling: The process of periodically checking the status of an I/O device to determine
              the need to service the device.
         Inte rrupt-driven communication: An I/O scheme that employs interrupts to indicate to the
              processor that an I/O device needs attention.
     (b) Device polling
     (c) Mouse Controller  3
         Reboot  2
         Overheat  1
3.   In general, a fully associative cache is better than a direct-mapped cache in term of miss
     rate. However, it is not always the case for a cache, especially for a small cache. Please
     design an example to demonstrate that a direct- mapped cache outperforms a fully
     associative cache in term of miss rate under the replacement policy. Least Recently Used
     (LRU).
Answer: Suppose the cache has 2 blocks and CPU has the follow references of block address
                                   1, 2, 3, 4, 5, 1, 2, 5, 1, 2, 3, 4, 5
Answer
           IC old              IC old
     (a)            =                          = 1.01523
           IC new       IC old × 1−0.05× 0.3
Machine A before adding the new addressing mode is faster by 1.01455 times.
Answer
    (a)
                                A’
                                B
                                C’
                                D’
                                A
                                B’
                                C
                                                                          O1
                                F
                                G
                                H
E’
                                 0
     (b) Since there are total 8 Boolean variables included in the 3 functions, the length of the ROM
         address is 8 bits. The ROM size would be 28  3 bits = 768 bits
6.   Assume the floating point is stored in a 32 bits wide format. The leftmost bit is the sign bit,
     the exponent of 16 in excess-64 format is 7 bits wide, and the mantissa is 24 bits long and a
     hidden 1 is assumed.
     (a) What is the smallest positive number representable in this format (write the answer in
          hex and decimal)?
     (b) What is the largest positive number representable (write the answer in hex and
          decimal)?
     (c) Assume A = 2.34375  l0-2 and B = 2.9015625  102 . Write down the bit pattern using
          this format for A and B.
     (d) Using the format presented in (c), calculate the A – B.
Answer
Answer
   Automatic garbage collection by hardware: The execution time of the new machine is (1 − 0.3) ×
   1.4 = 0.98 times that of the original.
   Special garbage collection instructions: The execution time of the new machine is (1 – 2  0.3/3)
   × 1.2 = 0.96 times that of the original.
   Therefore, the second option is the best choice.
102成大電機
Choose the correct answers for the following multiple choice problems. Each question may have more
than one answer, 10 points each, no partial point, no penalty.
1. Which of the following is (are) true for a 128KB cache with a line size of 64 bytes? Assume that
   the cacheable memory is 4 GB. Address bits are numbered from A0 to A31
   (a) In a direct- mapped implementation, the index field uses address bit A6 to A16 for cache line
       selection.
   (b) In a direct- mapped implementation, the tag length is 13 bits.
   (c) In a direct- mapped implementation, the tag length is 15 bits; the field determining the line
       size is 6 bits in length.
   (d) In a direct- mapped implementation, the total tag size is 30720 bits.
Answer: (c)
Answer: (b)
4.   Which of the following is (are) true for virtual memory system?
     (a) Virtual memory function can be enabled through software control.
     (b) Virtual memory technique treats part of the main memory as a fully-set associative
         write-back cache for program execution.
     (c) A translation lookaside buffer can be seen as the cache of a page table.
     (d) A page table is shared among the programs in execution.
Answer: (d)
Remark: When the PC powers on, the processor's registers are set to some predefined values. One of
   the registers is the instruction pointer (or program counter) register and its value after a power on
   is well defined: it is a 32-bit value of 0xfffffff0. The instruction pointer register points to code to
   be executed by the processor. The computer's hardware translates this address so that it points to a
   BIOS memory block.
   BIOS is a chip on the motherboard that has a relatively small amount of read-only memory (ROM).
   This memory contains various low- level routines that are specific to the hardware supplied with
   the motherboard. So, the processor will first jump to the address 0xfffffff0, which really resides in
   the BIOS's memory. Usually this address contains a jump instruction to the BIOS's POST routines.
   POST stands for Power On Self Test. This is a set of routines including the memory check, system
   bus check and other low-level stuff so that the CPU can initialize the computer properly. The
   important step on this stage is determining the boot device. All modern BIOS's allow the boot
   device to be set manually, so you can boot from a floppy, CD-ROM, harddisk etc.
Remark(c): The BIOS contains the first instruction being fetched.
6.   Which of the following is (are) true about instruction set architecture (ISA)?
     (a) ISA is an abstraction which is the interface between the hardware and the low- level software
         (assembly instructions).
     (b) ISA enables the different implementations of a processor using the same ISA to run identical
         software.
     (c) ISA specifies how a processor pipeline should be implemented.
     (d) MIPS and ARM are both RISC-type ISA and use the same instructions set architecture.
Answer: (b)
Remark: ISA is an abstraction which is the interface between the hardware and the low- level software
       (Machine language).
7.   Which of the following is (are) true about page fault and TLB exception?
     (a) A page fault is signaled by operating system so that the OS can fetch the missing page.
     (b) A TLB exception is handled by the DMA controller in a hard drive.
     (c) Both a page fault and a TLB miss are exceptions and signaled by the processor hardware.
     (d) When a requested page is not found in the main memory, this causes a page fault.
Remark: A page fault is a trap to the software raised by the hardware. The hardware that detects a
   page fault is the memory management unit in a processor. The exception handling software that
   handles the page fault is generally part of the operating system.
9.   Which of the following is (are) true about the label specified in a branch instruction, for instance,
     beq r1, r2, label?
     (a) The label is a pseudo- instruction and the compiler compiles it into a memory location.
     (b) The label is transformed into an offset specifying the difference between the branch
          instruction (or the instruction after the branch) and the branch target.
     (c) The label is transformed into an absolute address in memory for the branch target.
     (d) If the label is an external reference, the linker can figure out its value and finalize the binary
          format for the branch instruction.
Answer: (a)
102成大電通
1.   Read the following paragraph and choose the correct answers from the following multiple
     choice questions. Each question may have more than one answer. No partial point, no penalty.
     “Just as CPU programmers were forced to explicitly manage CPU memories in the days before
     virtual memory, for almost a decade, GPU programmers directly and explicitly managed the
     GPU memory hierarchy. The recent release of NVIDIA's Fermi architecture, however, has
     brought GPUs to an inflection point: it implements a unified address space that eliminates the
     need for explicit memory movement to and from GPU memory structures. Yet, without demand
     paging, something taken for granted in the CPU space, programmers must still explicitly reason
     about available memory. The drawbacks of exposing physical memory size to programmers are
     well known. Which of the followings are true?”
     (a) Without virtual memory technology, CPU programmers must explicitly manage the
          physical memory.
     (b) There is no need for the programmers of the Fermi architecture to wo rry about the available
          memory to use.
     (c) With virtual memory, a CPU programmer implements demand paging policy in his/her
          program.
     (d) GPU stands for General Processing Unit.
Answer: (a)
Answer
     (a) The even when an accessed page in not present in main memory.
     (b) The even when there is no matching entry in the TLB for a page.
     (c) Each active process has its own page table.
3. For CPUs, the problem of exception support was solved at a relatively early stage. This support
   was a key enabler to their success, and instrumental in this success was the definition of precise
   exception handling. In a pipelined processor, multiple exceptions may occur at the same clock
   cycle. Assume this is a five-stage pipeline. Write down which exception (if any) may occur at
   the IF, ID, EXE, MEM, WB stage.
Answer
Stage                                   Exception
 IF     (1) Instruction TLB miss (2) Hardware malfunction
 ID     (1) Undefined Instruction (2) Hardware malfunction
EXE     (1) Arithmetic Overflow (2) Hardware malfunction
MEM     (1) Memory protection error (2) Data TLB miss (3) Hardware malfunction
WB      (1) Hardware malfunction
102 成大資工
  1. Please describe the steps involved in developing and executing assembly language programs?
Answer
   Step1: Write an assembly language program.
   Step2: The assembler translates assembly language statements to their binary equivalents (object
          code).
   Step3: Separately assembled modules are combined into one single load module, by the linker.
   Step4: The program loader copies the program into the computer's main memory, and at execution
          time, program execution begins.
2. Please write the MIPS code to compute the mathematic formula a × b + 3c – 10.
Answer:
   Assume that the values of variables a, b, and c are contained in registers $s0, $s1, and $s2,
   respectively, and the computation result will be included in register $t0 and is smaller than 32 bit
   binary number can represent.
           mul $t0, $s0, $s1
           sll     $t1, $s2, 1
           add     $t1, $t1, $s2
           add     $t0, $t0, $t1
           addi $t0, $t0, -10
  3. (a) If processor A has a higher clock rate than processor B, and processor A also has a higher
         MIPS rating than processor B, explain whether processor A will always execute faster than
         processor B.
     (b) Computer A has an overall CPI of 1.5 and can be run at a clock rate of 700MHz. Computer
         B has a CPI of 2.0 and can be run at a clock rate of 650MHz. We have a particular
         program to run and this program has exactly 120,000 instructions when compiled for
         computer A. How many instructions would the program need to have when compiled for
         Computer B if we want the two computers to have exactly the same execution time for this
         program?
Answer
   (a) We cannot differentiate which machine is faster from the measure of MIPS before the
       capabilities of the ISA of these two machines are given.
   (b) Suppose ICB represents instruction count for Computer B
       Execution time for Computer A= Execution time for Computer B  (120000  1.5) / 700M =
       (ICB  2) / 650M  ICB = 83571.429
  4. (a) What are the two characteristics of program memory accesses that caches exploit?
     (b) What are three types of cache misses?
Answer
    (a) Temporal locality: if an item is referenced, it will tend to be referenced again soon.
        Spatial locality: if an item is referenced, items whose addresses are close by will tend to be
        referenced soon.
    (b) Compulsory misses: a cache miss caused by the first access to a block that has never been in
        the cache.
        Capacity misses: a cache miss that occurs because the cache even with fully associativity, can
        not contain all the block needed to satisfy the request.
        Conflict misses: a cache miss that occurs in a set-associative or direct- mapped cache when
        multiple blocks compete for the same set.
Answer
 2.   About the performance analysis, which of the following statement(s) should be true?
      (a) Assume that a C program is compiled into 1000 machine instructions, which are related to
          the size of the executable file. Then, the average execution time is usually equal to
          multiplying 1000 (instructions) by its average CPI and the clock cyc le time.
      (b) In evaluating the performance by using the benchmark tests, we usually use the geometric
          mean to calculate the average value among various test results.
      (c) MIPS is not a reliable metric since it provides the wrong results when we compare the
          performances of a compiled program running on two machines with the same instruction set
          architecture.
      (d) It is usually a preferred approach to consider only one of the three factors: clock rate, CPI or
          the instruction count, and then try to improve it to decrease the execution time. That is
          called a divide-and-conquer methodology.
      (e) None of the above is true.
Answer: (e)
 3.   About the memory hierarchy, which of the following statement(s) should be true?
      (a) A page fault happens if a page that is not in the physical memory is accessed.
      (b) TLB is a data cache of main memory.
      (c) Adding a secondary cache can help to reduce the miss rate.
      (d) We can always increase the size of cache to improve the cache performance.
      (e) None of the above is true.
Answer: (a)
 4.   What are the advantages of register-register ALU instruction type compared with
      register- memory ALU instruction type and memory- memory ALU instruction type?
      (a) Data can be accessed without loading first.
      (b) Fixed- length
      (c) Simple code-generation
      (d) Instructions take similar Clock Cycle per Instruction (CPI) to execute.
      (e) Large variation in instruction size
 5.   What are the properties critical to program correctness that are normally preserved by control
      dependency?
      (a) Temporal locality
      (b) Spatial locality
      (c) Exception behavior
      (d) Dataflow
      (e) None of the above
單選題
 7. Assume that a computer uses 32-bit addressing and has a 1MB cache with the block size equal to
    32 bytes. The size of the Tag field in the address for a direct mapped cache is L. The size of the
    Tag field in the address for an 8 way set associative cache is M. The size of the Tag field in the
    address for a fully associative cache is N. L + M + N = K and
    (a) K < 50
    (b) 50 ≤ K < 55
    (c) 55 ≤ K < 60
    (d) 60 ≤ K < 65
    (e) 65 ≤ K
Answer: (b)
Remark: L = 12; M = 15; N = 27
 8.   A cache with 64 blocks and a block size of 32 bytes. The byte address 2473537 maps to the
      block number K (0 ≤ K < 64). K mod 5 equals to?
      (a) 0
      (b) 1
      (c) 2
      (d) 3
      (e) 4
Answer: (a)
Remark: K = 50
 9.   The following table shows the information about a machine executing instructions. There are 5
      classes of instructions, each of which takes 3-5 steps.
                                             Register                  Memory          Register
               Instruction Class Fetch                       ALU
                                               Read                     access          Write
              Load Word (LW)         2ns         Ins          2ns        3ns             Ins
              Store Word (SW)        2ns         Ins          2ns        3ns
              R-Format (R)           2ns         Ins          2ns                        Ins
              Branch (B)             2ns         Ins          2ns
              Jump (J)               2ns         Ins          2ns
      For a certain program, 25% of the instructions are LW, 10% are SW, 45% are R, 15% are B, 5%
      are J. Consider the following three cases: (i) If there is no pipelining, the average time required
      for executing one instruction is L (ns). (ii) If a 5-stage pipeline is adopted, the average time
      required for executing one instruction in the ideal case is M (ns). (iii) Assume that the
      destinations of Branch (B) and Jump (J) will not be known until the end of the ALU stage. If
      every B or J instruction incurs two pipeline bubbles, the average time required for executing one
      instruction is N (ns). L + M + N = K. Then
      (a) K < 13
      (b) 13 ≤ K < 14
      (c) 14 ≤ K < 15
      (d) 15 ≤ K < 16
      (e) 16 ≤ K.
Answer: (e)
Remark: L = 9 ns; M = 3 ns; N = 4.2 ns
 10. Suppose that there is a machine with 32-bit addresses and a two- level page table. Note that the
     first 10 bits of an address is an index into the first level page table and the next 10 bits are an
     index into a second level page table. The page tables are stored in memory and each entry in the
     page tables is 4 bytes. Suppose the space allocated to a specific process is 64 Mbytes. How many
     pages are occupied by this process?
     (a) 226 pages
     (b) 214 pages
     (c) 212 pages
     (d) 210 pages
     (e) None of the above
Answer: (b)
Remark: Page offset = 32 – 10 – 10 = 12  page size = 4KB  number of pages = 226 / 212 = 214
 11. (Continued from the previous question.) How much space is occupied in memory by the page
     tables for the specific process?
     (a) 68 kbytes
     (b) 64 kbytes
     (c) 16 kbytes
     (d) 4 kbytes
     (e) None of the above
Answer: (e)
Remark: The size of a small page table = 210  4B = 4KB
        Since only one small page table in each level will reside in memory, the space occupied for
        the page table is 2  4KB = 8KB
 12. Assume that an un-pipelined machine has 9ns clock cycles. The machine uses four cycles for
     ALU operations, four cycles for branches, and five cycles for memory operations. The relative
     frequencies of these operations are 30%, 20%, and 50%, respectively. Suppose that pipelining
     the machines adds Ins of overhead to the clock cycle time. Assume that the ideal CPI is one.
     Ignore any other impact. What is the average clock cycle per instruction (CPI) of the
     un-pipelined machine?
     (a) 5.5
     (b) 4
     (c) 3.5
     (d) 4.5
     (e) 5
Answer: (d)
 13. (Continued from the previous question.) What is the speedup in the instruction execution rate
      gained from pipelining the machine?
      (a) 4.05
      (b) 4.75
      (c) 5.15
      (d) 5.85
      (e) 6.28
Answer: (a)
102中山資工
NOTE: If some questions are unclear or not well defined to you, you can make your own
      assumptions and state them clearly in the answer sheet.
1.   You are the lead designer of a new processor. The processor design and compiler are
     complete, and now you must decide whether to produce the current design as it stands or
     spend additional time to improve it. Please consider and answer the following problems.
     (1) You discuss this problem with your hardware engineering team and arrive at the
         following options:
         (a) Leave the design as it stands. Call this base machine Mbase. It has a clock rate of 500
              MHz.
         (b) Optimize the hardware. The hardware team claims that it can improve the processor
              design to give it a clock rate of 600 MHz. Call this machine Mopt.
         The following measurements have been made using a simulator for Mbase and Mopt.
                                                                    CPI
              Instruction class      Frequency
                                                         Mbase              Mopt
                      A                 40%                2                 2
                      B                 25%                3                 2
                      C                 25%                3                 3
                      D                 10%                5                 4
         What are the CPI and MIPS (million instructions per second) rate for Mbase and Mopt
         Copy Table 1-1 to your answer sheet and fill in the blanks with your answers.
     (2) The compiler team proposes to improve the compiler for the machine to further enhance
         performance. Call this combination of the improved compiler and the base machine
         Mcomp . The instruction improvements from this enhanced compiler have been estimated
         as follows.
         For example, if the base machine executed 500 class A instructions, Mcomp would execute
         0.9  500 = 450 class A instructions for the same program. What is the CPI for Mcomp ?
         How much faster is Mcomp than Mbase? Copy Table 1-2 to your answer sheet and fill in the
         blanks with your answers.
   Table 1-1:
       Machine                 CPI                                 MIPS rate
         Mbase
          Mopt
   Table 1-2:
       Machine                 CPI           Execution time Mbase 1 Execution time Mcomp
         Mcomp
Answer
 (1)
   Table 1-1:
          Machine                CPI                                 MIPS rate
           Mbase                  2.8                                 178.57
            Mopt                 2.35                                 255.32
   CPI for Mbase = 0.4  2 + 0.25  3 + 0.25  3 + 0.1  5 = 2.8
   CPI for Mopt = 0.4  2 + 0.25  2 + 0.25  3 + 0.1  2 = 2.35
   MIPS for M base = 500 / 2.8 = 178.57
   MIPS for Mopt = 600 / 2.35 = 255.32
 (2)
   Table 1-2:
          Machine                CPI              Execution time Mbase /Execution time Mcomp
           Mcomp                 2.81                                1.12
   CPI for Mcomp = (0.4  0.9  2 + 0.25  0.9  3 + 0.25  0.85  3 + 0.1  0.95  5) / (0.4  0.9 +
                0.25  0.9 + 0.25  0.85 + 0.1  0.95) = 2.5075 / 0.8925 = 2.81
   Execution Time for Mbase           (IC  2.8) / 500M
                              =                               = 1.12
   Execution Time for Mcomp     (IC  0.8925  2.81) / 500M
2.   Here is a loop in C:
              Loop:       g = g + A[i];
                          i = i + j;
                         if (i != h) goto Loop;
     Assume that A is an array of 100 elements and the variables g, h, i, and j are in registers $s1,
     $s2, $s3, and $s4, respectively, In addition, assume that the base of the array A is in $s5.
     Other registers that can be used are $t0 and $t1. A possible MIPS assembly code
     corresponding to this C loop is showed as follows:
              Loop:      add $t1, $s3, OPA
                         add $t1, $t1, $t1
                         add $t1, $t1, OPB
                         lw $t0, 0(OPC)
                         add OPD, $s1, $t0
                         add $s3, $s3, $s4
                         bne $s3, OPE, Loop
     Please determine proper values for the operands (OPA, OPB, OPC, OPD, OPE). Copy the
     Table 2 to your answer sheet and fill in the operand values.
     Table 2:
      Operand           OPA              OPB           OPC          OPD                OPE
        Value
Answer
    Table 2:
       Operand            OPA             OPB            OPC             OPD             OPE
         Value            $s3             $s5             $t1            $s1             $s2
3.   Figure 1 shows a 16-bit adder consisting of four 4-bit ALUs using carry lookahead. Assume
     that gi = ai‧bi (generate) and p; = ai + bi (propagate). Determine the value of P0, P1, P2, P3,
     G0, G1, G2, G3, and C4 (CarryOut) when adding two 16-bit numbers 0001101000110011
     and 1110010111101011 with Carry In bit equal to 0. Please copy Table 3 to your answer
     sheet and fill in the blanks with your answers.
     Table 3:
       Signal      P0        P1       P2      P3      G0       G1      G2       G3        C4
       Value
                                    CarryIn
                          a0        CarryIn
                          b0                                        Result0--3
                          a1
                          b1        ALU0
                          a2                          pi
                          b2           P0
                                       G0             gi
                          a3
                          b3                                    Carry-lookahead unit
                                               C1
                                                      ci + 1
                          a4        CarryIn
                          b4                                        Result4--7
                          a5
                          b5        ALU1
                          a6           P1             pi + 1
                          b6           G1             gi + 1
                                                      gi
                          a7
                          b7
                                               C2
                                                      ci + 2
                          a8        CarryIn
                          b8                                        Result8--11
                          a9
                          b9        ALU2
                         a10           P2             pi + 2
                         b10           G2             gi + 2
                         a11
                         b11
                                               C3
                                                      ci + 3
                         a12        CarryIn
                         b12                                        Result12--15
                         a13
                         b13        ALU3
                         a14           P3             pi + 3
                         b14           G3             gi + 3
                         a1
                         a15                   C4
                         b15                          ci + 4
CarryOut
Figure 1
Answer
   Table 3:
      Signal        P0         P1        P2          P3        G0        G1        G2   G3   C4
       Value        0           1         1          1          0         1         0    0    1
4. Compare the single-cycle implementation, in which all instructions take 1 clock cycle, with
   the five-stage (IF, ID, EX, MEM, WB) pipelined implementation using the following eight
   instructions: load word (lw), store word (sw), subtract (sub), and (and), or (or), set- less-than
   (slt) and branch-on-equal (beq). The operation times for the major functional units are 2 ns for
   memory access, 2 ns for ALU operation, and 1 ns for register file read or write.
   (1) Please complete the Table 4-1 by filling in the time for each component to calculate the
         total time of each instruction executed in the single-cycle implementation. Assume that
         the multiplexors, control unit, PC accesses, and sign extension unit have no delay. Please
         copy Table 4-1 to your answer sheet and fill in the blanks with your answers.
   (2) What is the clock cycle time for the single-cycle implementation and the pipelined
         implementation? Please copy Table 4-2 to your answer sheet and fill in the blanks with
         your answers.
   (3) Consider a program consisting of 100 lw instructions and in which each instruction is
         dependent upon the instruction before it. What would the actual CPI be if the program
         were run on single-cycle implementation and the pipelined implementation with a
         forwarding unit and a hazard detection unit? Please copy Table 4-2 to your answer sheet
         and fill in the blanks with your answers.
         Table 4-1:
                               Instruction Register        ALU        Data    Register Total
       Instruction class
                                  fetch        read      operation access      write       time
        Load word (lw)
       Store word (sw)
            R-format
                                   2ns         1 ns         2ns                 1 ns        6 ns
    (add, sub, and, or, slt)
          Branch (beq)
   Table 4-1:
                 Design                    Clock cycle time                Actual CPI
    Single-cycle implementation
      Pipelined implementation
Answer
    (1)
    Table 4-1:
                                  Instruction   Register     ALU        Data    Register   Total
          Instruction class
                                     fetch       read      operation   access    write     time
          Load word (lw)              2ns        1 ns        2ns        2ns      1 ns      8 ns
          Store word (sw)             2ns        1 ns        2ns        2ns                7 ns
              R-format
                                     2ns         1 ns        2ns                  1 ns     6 ns
       (add, sub, and, or, slt)
           Branch (beq)              2ns         1 ns        2ns                           5 ns
     Table 4-2:
                                                  (2)                            (3)
                  Design                    Clock cycle time                  Actual CPI
        Single-cycle implementation               8 ns                            1
         Pipelined implementation                 2 ns                            2
      Remark: Actual CPI for pipeline = [(5 – 1) + 100 + 99] / 100 ≈ 2
5.   Increasing associativity requires more comparators, as well as more tag bits per cache
     block. Assuming a cache of 4K blocks, a four-word block size, and a 32-bit address, find
     the total number of sets and the total number of tag bits for caches that are direct mapped,
     two-way and four-way set associative, and fully associative. Please copy Table 5 to your
     answer sheet and fill in
     Table 5:
                 Cache                The total number of sets       The total number of tag bits
      Direct mapped
      Two-way set associative
      Four-way set associative
      Fully associative
Answer
   Table 5:
                  Cache               The total number of sets       The total number of tag bits
        Direct mapped                            4K                ( 32 – 12 – 4)  4K = 64K bits
        Two-way set associative             4K / 2 = 2K            ( 32 – 11 – 4)  4K = 68K bits
        Four-way set associative            4K / 4 = 2K            ( 32 – 10 – 4)  4K = 72K bits
        Fully associative                         1                ( 32 – 4)  4K = 112K bits
6.   Here is a series of address references given as word addresses: 1, 4, 8, 5, 20, 17, 19, 56,
     11, 4, 43, 5, 6, 9, 17. Assuming a direct- mapped cache with four-word blocks and a total
     size of 16 words that is initially empty, label each reference in the list as a hit or a miss
     and show the final contents of the cache. Please copy Table 6-1 and Table 6-2 to your
     answer sheet and fill in your answers.
     Table 6-1:
     Ref.           1       4       8       5       20       17        19       56       9       11        4        43        5       6       9        17
     Hit/Miss
     Table 6-2:
     Block#                                     0                               1                              2                           3
     Starting Address
Answer
     Table 6-1:
            Ref.        1       4       8       5        20       17     19         56       9        11        4        43       5       6        9        17
       Hit/Miss M               M       M       H        M        M         H        M       M        H        M         M        H       H       M         H
     Table 6-2:
       Block#                                        0                              1                               2                             3
       Starting Address                             16                              4                               8
Remark:
         Word
        address         1       4       8       5        20       17     19         56       9        11        4        43       5       6        9        17
         Block
                        0       1       2       1        5        4         4       14       2        2         1        10       1       1        2        4
        address
          Tag           0       0       0       0        1        1         1        3       0        0         0        2        0       0        0        1
            Index       0       1       2       1        1        0         0        2       2        2         1        2        1       1        2        0
       Hit/Miss M               M       M       H        M        M         H        M       M        H        M         M        H       H       M         H
        Block#                                 0                                 1                              2                                 3
        Final content                   16, 17 , 18, 19                     4, 5 , 6 , 7                  8, 9 ,10 ,11
7.   Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the
     primary cache, and a clock rate of 1 GHz. Assume a main memory access time of 100 ns,
     including all the miss handling.
     (1) Suppose the miss rate per instruction at the primary cache is 5%. What is the effective
          CPI for the machine with one level of caching?
     (2) If we add a secondary cache that has a 10- ns access time for either a hit or a miss and is
          large enough to reduce the miss rate to main memory to 2%. What is the effective CPI for
          the machine with the two- level cache?
Answer
     (1) Effective CPI for one level cache = 1 + 0.05  100 = 6
     (2) Effective CPI for two level cache = 1 + 0.05  10 + 0.02  100 = 3.5
8.   Suppose you want to achieve a speedup of 50 with 100 processors. What fraction of the
     original computation can be sequential?
Answer
                           1
     Speedup = 50 =     𝑓            f = 0.9898  98.98%
                           +(1−𝑓)
                       100
102中興電機
1.   To compare the performance of two different computers: M1 and M2. Table 1 shows the
     measurements on these computers.
                                               Table 1
                Program                    Time on M1                 Time on M2
               Program 1                      4 second                   3 second
               Program 2                     10 second                  20 second
     Table 2 shows the additional measurements on these computers.
                                                 Table 2
            Program           Instructions executed on M1         Instructions executed on M2
           Program 1                     5 × 109                             6 × 109
     (a) If the clock rates of computers M1 and M2 are 4GHz and 6GHz, respectively, find the clock
         cycles per instruction (CPI) for program 1 on both computers ?
     (b) Assuming the CPI for program 2 on each computer is the same as the CPI for program 1,
         find the instruction counts for program 2 running on each computer using the execution time
         in Table 1 ?
Answer:
     (a) CPIM1 = (4  4G) / (5 × 109 ) = 3.2; CPIM2 = (3  6G) / (6 × 109 ) = 3
     (b) ICM1 = (10  4G) / 3.2 = 12.5 × 109 ; ICM2 = (20  6G) / 3 = 40 × 109
2.   For a given instruction set architecture, please list three sources to increase in the CPU
     performance?
Answer:
3.   Convert the following MIPS codes to C codes. The parameter variable n corresponds to the
     argument register $a0. The compiled program starts with the label of the procedure and then
     saves two registers on the stack, the return address and $a0. The MIPS assembly codes are listed
     as follows:
             test:
                    addi $sp, $sp, -8
                    sw $ra, 4($sp)
                    sw $a0, 0($sp)
                    slti $t0, $a0, 1
                    beq $t0, $zero, L1
                        addi   $v0, $zero, 1
                        addi   $sp, $sp, 8
                        jr     $ra
            L1:         addi   $a0, $a0, -l
                        jal    test
                        lw     $a0, 0($sp)
                        lw     $ra, 4($sp)
                        addi   $sp, $sp, 8
                        mul    $v0, $a0, $v0
                        jr     $ra
Answer:
                int test (int n)
                {
                        if (n < 1) return (1);
                            else return (n*fact(n – 1));
                    }
4. Please illustrate and describe the PC-relative addressing mode in MIPS machines.
Answer:
     4. PC-relative addressing
      op   rs     rt      Address                          Memory
PC Word
Answer:
                                                                                          ID/EX
                                                                                           WB
                                                                                                                             EX/MEM
                                                                                                                                                         MEM/WB
                                                                                Control    M                                 WB
                                                                                           EX                                                                   WB
                                     IF/ID                                                                                   M
                          Add
                                                                                                             Add
                                                                                                Shift                             Branch
                                                                     RegWrite
                                                                                                left 2
                                                                                                                   ALUSrc
MemWrite
                                                                                                                                                                     MemtoReg
                                             Instruction
              PC                                               Read
                   Address                                                 Read
                                                               register 1 data 1
                                                               Read                                                  Zero
                                                               register 2                                          ALU
                     Instruction
                                                               Write
                                                                           Read
                                                                                                                    Result         Address Read
                      memory                                   register
                                                                          data 2                                                              data
                                                                                                                                         Data
                                                               Write Registers                                                         memory
                                                               data
                                                                                                                                   Write
                                                                                                                                   data
                                                           Instruction
                                                           [15-0]                Sign                     ALU
                                                                                extend                   control
                                                           Instruction                                                                                MemRead
                                                           [20-16]                                             ALUOp
                                                           Instruction
                                                           [15-11]
                                                                                                         RegDst
Answer:
     (a) At the end of the fourth cycle of execution, registers $3 and $7 are being read.
     (b) At the end of the sixth cycle of execution, register $4 is being written and registers $6 and $1
         are being read.
     (c) Execute this program needs (5 – 1) + 6 = 10 clock cycles
     Remark(c): there are data hazards in the given code but no forwarding scheme is used in the
     datapath. Hence, the code cannot execute correctly.
7.   Considering the following datapath shown below, draw paths (use color pens if possible) to show
     the flow of both data and the program counter for the following instructions.
     (a) add $t0, $t0, $t1          # single cycle datapath
     (b) lw     $a0, 0($t1)         # multicycle datapath
Add
4 Add Sum
                                                                                            Shift                              PCSrc
                                                                                            left 2
                                 Instruction [31-26]
                                                        Control
Instruction [5-0]
(a)
Outputs
Control
Op[5-0]
                                                                                                                                 Jump
                                                                                                                                 address
                                           Instruction [5-0]                                            26   Shift 28            [31-0]
                                                                                                             left 2
                                     Instruction
                                         [31-26]                                                                               8000 0180
                                                                                                             PC[31-28]
        PC                           Instruction                          Read
                    Address              [25-21]                          register 1
                                                                                   Read       A
                     Memory          Instruction
                                         [20-16]                          Read     data 1
                     MemData                                              register 2                            ALU
                                     Instruction                                                                                ALUOut
                                          [15-0]     Instruction          Write Read                                                        EPC
                    Write                                  [5-0]          Registerdata 2      B
                    data              Instruction
                                       Register                           Write
                                                                          data Registers
                                       Instruction
                                            [15-0]
                                                                                                                                           Cause
                                     Memory                         16              32                        ALU
                                       data                               Sign           Shift
                                     register                            extend          left 2              control
Instruction [5-0]
                                                                            (b)
Answer:
   (a)
Add
4 Add Sum
                                                                                            Shift                             PCSrc
                                                                                            left 2
                                 Instruction [31-26]
                                                        Control
Outputs
Control
Op[5-0]
                                                                                                                                        Jump
                                                                                                                                      address
                                                             Instruction [25-0]                                     26 Shift 28        [31-0]
                                                                                                                       left 2
                                                       Instruction
                                                           [31-26]                                                     PC [31-28]
                       PC
                                                       Instruction                     Read
                                     Address               [25-21]                     register 1
                                                                                                 Read      A
                                     Memory            Instruction                     Read
                                                            [20-16]                              data 1
                                                                                       register 2                        ALU
                                     MemData
                                                       Instruction Instruction         Write Read
                                     Write                   [15-0]    [15-11]         Register data 2     B
                                     data               Instruction
                                                          Register                     Write
                                                                                       data Registers
                                                         Instruction
                                                              [15-0]
                                                       Memory                                                           ALU
                                                         data                           Sign          Shift
                                                                                       extend         left 2           control
                                                       register                                 32
                                                                                  16
Answer: (d)
2.   A compiler designer is trying to decide between two code sequences for a particular computer.
     The hardware designers have supplied the following facts:
                                          CPI for each instruction class
                                            A           B           C
                                 CPI         1          2           3
     For a particular high- level language statement, the compiler writer is considering two code
     sequences that require the following instruction counts:
Answer:
   (a) Total clock cycle for code sequence 1 = 1  2 + 2  1 + 3  2 = 10
       Total clock cycle for code sequence 2 = 1  4 + 2  1 + 3  1 = 9
       Code sequence 1 is faster
   (b) CPI for code sequence 1 = 10 / 5 = 2
       CPI for code sequence 2 = 9 / 6 = 1.5
3.   Convert 16-bit binary version of 210 and -210 to 32-bit binary number.
                            1111 1111 1111 1111 1111 1111 1111 11002
Answer:
   210 = 0000 0000 0000 0000 0000 0000 0000 00102
   -210 = 1111 1111 1111 1111 1111 1111 1111 11102
Remark: 題目給的二進位數為4
4.   In ARM, suppose register r0 has the binary number
                               1111 1111 1111 1111 1111 1111 1111 11112
     and that register r1 has the binary number
                               0000 0000 0000 0000 0000 0000 0000 00012
     and the following instruction is executed.
        CMP r0, r1
     Which conditional branch is taken? Please explain your answer.
       BLO     L1 : unsigned branch (branch on lower)
       BLT     L2 : signed branch (branch on less than)
Answer:
       Instruction BLO is not taken to L1
       Instruction BLT is taken to L2
5.   In ARM, the branch instruction is encoded using 24-bit address. You might think that the 24-bit
     address would extend the program limit to 224 or 16MB, which would be fine for many programs
     but constrain some large ones. How to solve this problem? Hint: specify a register that would
     always be added to the branch address.
Answer:
       We can specify the program counter register (PC) that would always be added to the branch
       address, so that a branch instruction would calculate the following:
       Branch target address = PC + branch address
       This sum allows the program to be as large as 2 32 , solving the branch address size problem.
6.   There are situations in pipelining when the next instruction cannot execute in the following clock
     cycle. These events are called hazards. In the following situation, please point out what kind of
     hazards is occurred.
     Suppose you found a sock (短襪) at the folding station for which no match existed. One possible
     strategy is to run down to your room and search through your clothes bureau (衣櫃) to see if you
     can find the match. Obviously, while you are doing the search, loads that have completed drying
     and are ready to fold and those that have finished washing and are ready to dry must wait.
Answer:
   (a)
                         Instruction sequence
                  lw $1, 40($6)
                  add $6, $2, $2
                                                       Delay I3 to avoid RAW hazard on
             a.   nop
                                                       $1 and $6 from I1 and I2
                  nop
                  sw $6, 50($1)
                  lw $5, -16($5)
                  nop
                                                       Delay I2 to avoid RAW hazard on
             b.   nop
                                                       $5 from I1.
                  sw $5, -16($5)
                  add $5, $5, $5
   (b)
                         Instruction sequence
                  lw $1, 40($6)
                                                       No RAW hazard on $1 from I1
             a.   add $6, $2, $2
                                                       (forwarded)
                  sw $6, 50($1)
                  lw $5, -16($5)
                  nop                                  Delay I2 to avoid RAW hazard on
             b.
                  sw $5, -16($5)                       $5 from I1
                  add $5, $5, $5
102 台科大電子
1.   What is the power wall? How does it affect CPU development?
Answer:
     Power Wall is the increasing power consumption and resulting heat generation of the
     processing unit. The power wall can be mitigated by "shrinking" the processor by using
     smaller traces for the same logic. This poses manufacturing, system design, and deployment
     problems.
     The power wall has forced a dramatic change in the design of microprocessor. Rather than
     continuing to decrease the response time of a single program running on the single processor,
     as currently all desktop and server companies are shipping microprocessors with
     multiprocessors per chip, where benefit is often more on throughput than on response time.
Answer:
     (a) Structural hazards: hardware cannot support the instructions executing in the same clock cycle
            (limited resources)
     (b) Data hazards: attempt to use item before it is ready (Data dependency)
         Example:
                1        lw $5, 50($2)
                2        add $2, $5, $4
                3        add $4, $2, $5
                4        beq $8, $9, L1
                5        sub $16, $17, $18
                6        sw $5, 100($2)
                7    L1:
        Type                                             Example
                   假設上列指令是在只有單一記憶體的 datapath 中執行,則在第 4 個時脈週
     Structure     期,指令 1 讀取記憶體資料同時指令 4 也在從同一個記憶體中擷取指令,
     hazard        也就是兩個指令同時對一記憶體進行存取。在這樣情形下就會發生
                   structural hazard
                   指令 1 在第 5 個時脈週期時才會將計算結果寫回暫存器$5,但指令 2 和 3
     Data hazard   分別在時脈週期 3 跟 4 時便需要使用暫存器$5 的內容。此時指令 2 和 3 擷
                   取到暫存器$5 的內容並非最新的,因此在這樣情形下就會發生 data hazard
     (c) Speculation: an approach whereby the compiler or processor guesses the outcome of an
         instruction to remove it as a dependence in executing other instructions.
         For example, the compiler can use speculation to move the lw instruction across sw, if we
         guess the memory addresses for the two instructions are different.
                   sw $1, 100($2)
                   lw $3, 100($4)
3.   Suppose we have a 32-bit virtual address, 4KB pages, and 4 bytes per page table entry. What
     is the total page table size?
Answer
   Number of entries in page table = 4GB / 4KB = 1M
   The page table size = 4B  1M = 4MB
4.   Please use CPU execution time formula to explain why superscalar and super pipeline could
     improve performance.
Answer: CPU execution time = Instruction count  CPI  clock cycle time
     Superscalar enables the processor to execute more than one instruction per clock cycle. This will
          increase IPC (or reduce the average CPI) and thus can improve performance.
     Superpipeline increases the depth of the pipeline (more pipeline stage) to overlap more
          instructions. Performance is potentially greater since the clock cycle time can be shorter.
5.   Please explain why the clock cycle/item in the right-hand side is not consistent with
     instruction/item in the left-hand side?
     1200                                                                       2000
      800
                                                                                1200
      600
                                                                                 800
      400
       0                                                                          0
            4     8     16   32     64    128    256       512 1024 2048 4096          4     8     16   32     64    128    256       512 1024 2048 4096
                                  Size (K items to sort)                                                     Size (K items to sort)
Answer: Quicksort consistently has many fewer misses per item to be sorted.
Remark:
          5
                  Radix sort
          4
          1
              Quicksort
          0
              4     8     16   32     64    128    256       512 1024 2048 4096
                                    Size (K items to sort)
102 台科大資工
1.   A compiler designer is trying to decide between two code sequences for a particular
     computer. The hardware designers have supplied the following facts:
                                                  CPI for instruction class
                                            A                  B                  C
                      CPI                   3                  4                  2
     For a particular high- level- language statement, the compiler writer is considering two code
     sequences that require the following instruction counts:
                 Code sequence             Instruction counts for instruction class
                                            A                  B                  C
                        1                   2                  2                  6
                        2                   2                  5                  1
     (a) Which code sequence executes the most instructions?
     (b) Which will be faster?
     (c) What is the CPI for each sequence?
     Suppose we measure the code for the same program from two different compilers and obtain
     the following data:
                                    Instruction counts (in billions) for instruction class
                  Code form
                                           A                  B                   C
                  Complier 1               1                  2                   7
                  Compiler 2               2                  1                   5
     Assume that the computer's clock rate is 5 GHz.
     (d) Which code sequence will execute faster according to execution time?
     (e) Which code sequence will execute faster according to MIPS?
Answer:
                                     5 10
                              (2  1  5) 109
         MIPS for compiler 2 =                 = 2000
                                   4 106
         According to MIPS, code sequence from compiler 1 is as fast as from compiler 2.
Answer:
Cache1: number of blocks = 4
ID.Flush
                                                A
                                                                               ID/EX
                                                                                                          EX/MEM
                                             Control                            M                                           MEM/WB
                                                                                       Cause
                                     IF/ID                                                                  M
                                                                                        EPC
                                                           Registers       C
                                                                                                   ALU
                       Instruction
 80000180         PC    memory                                                                                      Data
                                                                                                                   memory
IF.Flush
     (a) Please make the right binding between each of units "A", "B", "C", "D", "E" and one of
          following names.
          (1) AND array          (2) Hazard detection unit       (3) Forwarding unit
          (4) XOR array          (5) Shift left 2 unit           (6) Sign extend unit
          (7) MUX
     (b) For the following instruction sequence and assuming that branch (B) is taken. Please
          describe the instruction pair that causes data hazard, the corresponding depend register,
          and the unit required to solve the data hazard.
         (S)          sub     $5, $3, $4      # Reg5 = Reg3 - Reg4
         (A)          add     $1, $4, $5      # Reg1 = Reg4 + Reg5
         (D)          and     $3, $2, $5      # Reg3 = (Reg2 OR Reg5)
         (L)          lw      $7, 0($1)       # Load from MEM (compute base on $1) to $7
         (O)          ori     $9, $3, $7      # Reg9 = (Reg3 OR Reg7)
         (B)          beq     $7, $9, loop # Branch to loop if Reg7 = Reg9
         (R)          sra     $3, $9, 2       # Reg3 = Reg9 >> 2
         (T)          slt     $9, $3, S7      # Reg9 = 1 if Reg3 < Reg7
         (I) loop: addi $9, $7, 40            # Reg9 = Reg7 + 40
     (c) Continue with (b). How many cycles does it take to execute the above instruction sequence?
Answer:
(a)
            A                B               C                   D                E
           (2)              (3)             (4)                 (5)              (6)
         Hazard         Forwarding        XOR array         Shift left 2     Sign extend
      detection unit       unit                                unit              unit
(b)
                 Instruction pair                                Function
                                               Depend
                  that cause data                                   units
                                             register
                        hazard                                   required
                       (S & A)                    $5                   B
                       (S & D)                    $5                   B
                       (A & L)                    $1                   B
                       (D & O)                    $3                   B
                       (L & O)                    $7                 A, B
                       (L & B)                    $7                   A
                       (O & B)                    $9                   A
                       (O & R)                    $9                   B
                       (R & T)                    $3                   B
      Remark: Since the datapath cannot forward any data to ID stage, forwarding unit (B) is
      not needed for (O & B)
(c) Instructions between (L & O) and (O & B) required one clock delay for data hazard,
    respectively.
    Branch is taken imply the instruction (R) is fetched into the pipeline and will be flushed and
    also cause one clock delay. Hence, the total clock cycles required to execute the code is (5 – 1)
    + 7 + 3 = 14 clocks
102 台師大資工
1. Consider a program written in high level programming language is executed separately on
   two processors with different instruction set architectures. Processor A executes 10 billion
   instructions with clock rate 4GHz and CPI 1.0. Processor B executes 8 billion instructions
   with the same clock rate but CPI 1.1.
   (a) Find the MIPS rates of those processors.
   (b) Find the total execution time for each of the processors.
   (c) Compare their performance.
Answer
                     4G                               4G
    (a) MIPSA =               = 4000, MIPSB =                 = 3640
                   1.0× 106                        1.1× 106
                                              10× 109 × 1
    (b) Execution Time for processor A =                      = 2.5
                                                   4× 109
                                       8× 109 × 1.1
         Execution for processor B =                  = 2.2
                                         4 × 109
2. Assume a decimal value -1234 is stored in a 32-bit general purpose register. What is the value
   (in binary or hexadecimal format) stored in the register if it is represented as
   (a) a signed integer, and
   (b) a single-precision floating-point number in the IEEE 754 standard.
Answer: 123410 = 0000 0000 0000 0000 0000 0100 1101 00102
    (a) -123410 = 1111 1111 1111 1111 1111 1011 0010 11102 = FFFFFB2C16
    (b) -0000 0000 0000 0000 0000 0100 1101 00102 = -1.0011010010  210 
                                   S            E                           F
                  binary           1        01110101             00110100100000000000000
               hexadecimal                                    BA9A4000
3. Consider a computer datapath divides the execution of an instruction into five stages, i.e.
   instruction fetch, instruction decode, execution, memory access, and write back. The
   processing times of these stages are 400ps (10 -12 seconds), 200ps, 120ps, 350ps, and 200ps,
   respectively,
   (a) What is the highest operating frequency if it is a single cycle (not pipelined) datapath?
   (b) A pipeline register adds 20ps delay. What is the highest operating frequency if it is a
        five-stage pipeline datapath?
   (c) Assume an infinite code sequence is executed, and no hazard is found between
        instructions. Compare the performances of the single cycle datapath and the pipelined
        datapath.
Answer
    (a) 1 / (400 ps + 200 ps + 120 ps + 350 ps + 200 ps) = 787.4MHz
    (b) 1 / 420 ps = 2.38 GHz
    (c) Pipeline datapath is (400 + 200 + 120 + 350 + 200) / 420 = 3.02 times faster than single cycle
        datapath
4. Consider the code sequence executing in a five-stage (IF, DE, EX, MEM, and WB) pipelined
   datapath.
              lw $1, 0($6)            # $1  MEM[$6]
              lw $2, 1($6)            # $2  MEM [$6 + 1]
              add $3, $1, $2          # $3  $1 + $2
              sub $4, $3, $5          # $4  $3 + $5
              beq $3, $0, L1          # jump to L1 if $3 == $0
              ….
         L1:
   (a) Show all the possible hazards between instructions.
   (b) Describe how the hazards can be resolved in recent computer architectures.
Answer
   (a)
                                            Data hazard
                                 Register            Instruction pair
                                   $1                      (1, 3)
                                   $2                      (2, 3)
                                   $3                  (3, 4), (3, 5)
    (b) Data hazard can be resolved by forwarding technique in recent computer architectures.
        However, forwarding technique cannot save the day is when an instruction tries to read a
        register following a load instruction that writes the same register. Hence, the pipeline need a
        stall between the load and its use.
5. Consider a 128 bytes fully associative mapped cache with 4-word blocks (a word has 32 bits).
   Its replacement policy is LRU. Suppose the following memory word addresses (in decimal)
   are accessed in sequence:
   40, 41, 42, 400, 43, 400, 60, 61, 62, 63, 64, 800, 40, 41, 42, 800, 43, 44, 60, 61, 62, 120, 121,
   122, 123, 168, 169, 41, 42, and 400.
   (a) Show whether it is a hit or a miss in the cache for each of the memory accesses.
   (b) What is the miss rate?
                  128
Answer: There            = 8 blocks in the cache.
                  4× 4
                                   (a)
Word address   Block address    Hit/Miss
     40              10          Miss
     41              10            Hit
     42              10            Hit
    400             100          Miss
     43              10            Hit
    400             100            Hit
     60              15          Miss
     61              15            Hit
     62              15            Hit
     63              15            Hit
     64              16          Miss
    800             200          Miss
     40              10            Hit
     41              10            Hit
     42              10            Hit
    800             200            Hit
     43              10            Hit
     44              11          Miss
     60              15            Hit
     61              15            Hit
     62              15            Hit
    120              30          Miss
    121              30            Hit
    122              30            Hit
    123              30            Hit
    168              42          Miss
    169              42            Hit
     41              10          Miss
     42              10            Hit
    400             100          Miss
    (b)          Miss rate     10/30 = 1/3
102海洋電機
1.   Draw a schematic diagram for a 4-bit adder, using 1-bit full-adder as a basic unit
Answer
                                 a3 b3        a2 b2       a1 b1      a0 b0
c4 + + + + c0
s3 s2 s1 s0
2. Explain sequential logic and combinational logic. Give one example for each class.
Answer
     Sequential logic: sequential logic is a type of logic circuit whose output depends not only on the
         present value of its input signals but on the past history of its inputs. That is, sequential logic
         has state (memory).
         A common example of a circuit employing sequential logic is the flip- flop.
     Combinational logic: The elements that operate on data values are all combinational, which
         means that their outputs dependent only on the current inputs. Given the same input, a
         combinational element always produces the same output.
         The 4-bit ripple carry adder is an example of a combinational; element.
Answer
Answer
     (1) First Fit - A resource allocation scheme (usually for memory). First Fit fits data into memory
         by scanning from the beginning of available memory to the end, until the first free space which
         is at least big enough to accept the data is found. This space is then allocated to the data.
     (2) A resource allocation scheme (usually for memory). Best Fit tries to determine the best place to
         put the new data. One example might be to try and minimize the wasted space at the end of the
         block being allocated - i.e. use the smallest space which is big enough.
     Remark: This topic belongs to OS
5.   Explain the four conditions that must hold simultaneously in a system for causing a deadlock
     situation.
Answer
     (1) Mutual Exclusion Condition: At least one resource must be held in a non-shareable mode, that
         is, only one process at a time claims exclusive control of the resource. If another process
         requests that resource, the requesting process must be delayed until the resource has been
         released.
     (2) Hold and Wait Condition: There must exist a process that is holding a resource already
         allocated to it while waiting for additional resource that are currently being held by other
         processes.
     (3) No-Preemptive Condition: Resources cannot be removed from the processes are used to
         completion or released voluntarily by the process holding it.
     (4) Circular Wait Condition: The processes in the system form a circular list or chain where each
         process in the list is waiting for a resource held by the next process in the list.
     Remark: This topic belongs to OS
Answer
Answer
    (a) The south bridge handles all of a computer's I/O functions, such as USB, the system BIOS,
        and the interrupt controller.
        The north bridge typically handles communications among the CPU, in some cases RAM,
        and PCI Express (or AGP) video cards, and the south bridge.
    (b)
                                      NOR flash                               Nand flash
          名稱由來 Bit cell like a NOR gate                            Bit cell like a NAND gate
          存取方式 Random read/write access                            Block-at-a-time access
             價格         More expensive                             Cheaper
             用途         Instruction Mem. in embedded systems       USB keys, media storage, ...
     (c) A cache write miss refers to a failed attempt to write a piece of data in the cache, which
         results in a main memory access with much longer latency. There are two ways to handle
         write miss.
         Write allocate: the block is fetched from memory and then the appropriate portion of the
             block is overwritten in the cache.
         No write allocate: Update the portion of the block in memory but not put it in the cache.
     (d) 500  0.98  (10 + 2) + 500  0.02  (4 + 55 + 2 + 10) = 6590ns
     (e) Speedup = (7 + 8 + 4 + 4 + 6 + 5) / 8 = 4.25
     (f) Function of F1: if (A = B) then F1 = 1 else F1 = 0;
         Function of F2: if (A > B) then F2 = 1 else F2 = 0
                                8
                                           2
         1                                                            3
                                               7
             6
4 5
Answer
                             No. 6                   No. 7                            No. 8                 No. 9
         (a)
                              2                   0.00010001                           4                  000000001
                             No. 1                   No. 2                            No. 3                 No. 4                   No. 5
         (b)
                              1                        0                               1                      0                      0
3. Datapath問題
Add
4 Add Sum
                                                                                          Shift                           PCSrc
                                                                                          left 2
                                 Instruction [31-26]
                                                        Control
                                         1
                                         2
                                   Instruction [25-21]         Read
       PC      Read
               address
                                                               register 1
                                                                             Read
                                                                                              6
                                   Instruction [20-16]                      data 1
                                                               Read
                   Instruction                                 register 2
                        [31-0]           3                     Write     Read
                                                                                                    ALU
                                                                                                                  Address Read
                                                                                                                           data
               Instruction
                memory
                                         4                     Register data 2
                                                                                                                      Data
                                                               Write                                                 memory
                                   Instruction [15-11]         data Registers                  7
                                                                                                                  Write
                                                                                                                  data
                                   Instruction [15-0]          16                32
                                                                      Sign
                                                                                               ALU
                                                                     extend
                                         5                                                    control
                                                               Instruction [5-0]
                                                                                                                      8
                                                                            Instr.                                  Data
      Reg.            Reg.            Reg.               Reg.                              Instr. Mem.                            Data Mem.
                                                                            Mem.                                    Mem.
     number          content         number             content                              content                               content
                                                                            Addr.                                   Addr.
     00000            800h            10000                0                 100h       lw $t 9, 10($t1)            200h              0
     00001            200h            10001                0                 104h       lw $t 10, 14($t1)           204h              0
     00010         11111111h          10010                0                 108h       add $t3, $t9, $t10          208h              0
     00011             50h            10011                0                 10ch       sw $t3, 4($t1)              20ch              0
     00100               0            10100                0                 110h       beq $t3, $t4 100            210h          55555555h
     00101               0            10101                0                 114h                                   214h          aaaaaaaah
     00110               0            10110                0                 118h                                   218h              0
     00111               0            10111                0
     01000            100h            11000                0
   01001      0      11001       0
   01010      0      11010       0
   01011      0      11011       0
   01100      0      11100       0
   01101      0      11101       0
   01110      0      11110       0
   01111      0      11111       0
    注意:
    (1) PC內容為100h
    (2) 執行指令lw $t9, 10($t1),其中$t9以01001表示,$t1以00001表示,10為16進制
    (3) sw指令說明同(2)
    問題:
    (a) 執行第一個指令時(lw),上圖中箭號所指編號1, 2, 3, 4, 5, 6, 7, 8處之值為何?(5, 6, 7,
        8請以16進制表示)
    (b) 執行第三個指令時(add),上圖中箭號所指編號1, 2, 3, 4, 5, 6, 7, 8處之值為何?
    (c) 所有指令執行完畢後,PC內容為何?請以16進制表示。
Answer