Optional Hardware Solution
Optional Hardware Solution
1     Verilog
Please answer the following four questions about Verilog.
(a) Does the following code result in a sequential circuit or a combinational circuit? Explain why.
 1   module concat ( input clk , input data_in1 , input data_in2 ,
 2                                        output reg [1:0] data_out );
 3    always @ ( posedge clk , data_in1 )
 4      if ( data_in1 )
 5         data_out = { data_in1 , data_in2 };
 6      else if ( data_in2 )
 7         data_out = { data_in2 , data_in1 };
 8   endmodule
      Explanation.
      This code results in a sequential circuit because data_in2 is not in the sensitivity list, and thus a latch is
      inferred for data_out.
                                                      1/25
(b) The input clk is a clock signal. What is the hexadecimal value of the output c right after the third
    positive edge of clk if initially c = 8’hE3 and a = 4’d8 and b = 4’o2 during the entire time?
 1   module mod1 ( input clk , input [3:0] a , input [3:0] b , output reg [7:0] c );
 2   always @ ( posedge clk )
 3     begin
 4       c <= {c , &a , | b };
 5       c [0] <= ^ c [7:6];
 6     end
 7   endmodule
     Explanation.
     Cycle 1: c <= {c, &a, |b} → c <= {1110_0011, 0, 1} → c <= {1000_1101}
                c[0] <= ˆc[7:6] → c[0] <= ˆ{11} → c[0] <= 0
     At the first positive edge of clk, c = 80 b1000_1100
     Cycle 2: c <= {c, &a, |b} → c <= {1000_1100, 0, 1} → c <= {0011_0001}
                c[0] <= ˆc[7:6] → c[0] <= ˆ{10} → c[0] <= 1
     At the second positive edge of clk, c = 80 b0011_0001
     Cycle 3: c <= {c, &a, |b} → c <= {0011_0001, 0, 1} → c <= {1100_0101}
                c[0] <= ˆc[7:6] → c[0] <= ˆ{00} → c[0] <= 0
     At the third positive edge of clk, c = 80 b1100_0100 → c = 80 hC4
     Note that since the assignments to c are non-blocking, c[7 : 6] in line 5 is not affected by the assignment
     to c in line 4 in the same cycle.
(c) Is the following code syntactically correct? If not, please explain the mistake(s) and how to fix it/them.
 1   module 1 nn3r ( input [3:0] d , input op , output [1:0] s );
 2     assign s = op ? ( d [1:0] - d [3:2]) :
 3                             ( d [3:2] + d [1:0]);
 4   endmodule
 5
 6   module top ( input wire [6:0] instr , input wire op , output reg z );
 7
 8     reg [1:0] r1 , r2 ;
 9
14 endmodule
                                                    2/25
Answer and concise explanation:
 The code is not syntactically correct.
 Explanation.
  • Module names cannot start with a number → ’1nn3r’ is not a legal module name.
  • The output signal ’z’ has to be declared as a ’wire’ but not ’reg’.
  • ’r1’ and ’r2’ has to be declared as ’wire’s.
  • The module ’1nn3r’ does not have ports named ’instr’ and ’z’. Those need to be changed to ’d’ and
    ’s’, respectively.
                                                   3/25
(d) Does the following code correctly implement a counter that counts from 1 to 11 by increments of 2 (e.g.,
    1, 3, 5, 7, 9, 11, 1, 3 ...)? If so, say "Correct". If not, correct the code with minimal modification.
 1   module odd_counter ( clk , count );
 2     input wire clk ;
 3     output reg [2:0] count ;
 4     reg [2:0] count_next ;
 5
 6     always@ *
 7     begin
 8       count_next = count ;
 9       if ( count != 11)
10          count_next = count_next + 2;
11       else
12          count_next <= 1;
13     end
14
      Explanation.
      The correct implementation:
       always@* begin
        count_next = count;
        if (count != 11)
          count_next = count_next + 2;
        else
          count_next = 1;
       end
       always@(posedge clk)
        count <= count_next;
      endmodule
                                                   4/25
(e) Does the following code correctly instantiate a 4-bit adder? If so, say "Correct". If not, correct the code
    with minimal modification.
1    module adder ( input a , input b , input c , output sum , output carry );
2    assign sum = a ^ b ^ c ;
3    assign carry = ( a & b ) | ( b & c ) | ( c & a );
4    endmodule
5
7    module adder_4bits ( input [3:0] a , input [3:0] b , output [3:0] sum , output carry );
8    wire [2:0] s ;
9
                                                      5/25
2    Finite State Machine
You are given the following FSM with two one-bit input signals (TA and TB ) and one two-bit output signal
(O). You need to implement this FSM, but you are unsure about how you should encode the states. Answer
the following questions to get a better sense of the FSM and how the three different types of state encoding
we dicussed in the lecture (i.e., one-hot, binary, output) will affect the implementation.
                                                   TA
                        A                       __                       B
                     O: 10                      TB                    O: 11
                     TB                                  __                  __
                                                         TA                  TB
                         C                                               D
                     O: 01                                            O: 00
                                                                                       TB
(a) There is one critical component of an FSM that is missing in this diagram. Please write what is missing
    in the answer box below.
     The reset line or indication for initial state.
                                                       6/25
(c) List one major advantage of each type of state encoding below.
    One-hot encoding
     reduces next-state logic
    Output encoding
     reduces the output logic
(d) Fully describe the FSM with equations given that the states are encoded with one-hot encoding. Assign
    state encodings such that numerical values of states increase monotonically for states A through D while
    using the minimum possible number of bits to represent the states with one-hot encoding. Indicate the
    values you assign to each state and simplify all equations:
     State assignments: A: 0001, B: 0010, C: 0100, D: 1000
     NS[3] = TB * TS[3] + TS[2]
     NS[2] = TB * TS[0] + T A * TS[1]
     NS[1] = T B * (TS[0] + TS[3])
     NS[0] = TS[1] * TA
     O[1] = TS[0] + TS[1]
     O[0] = TS[1] + TS[2]
                                                   7/25
(e) Fully describe the FSM with equations given that the states are encoded with binary encoding. Assign
    state encodings such that numerical values of states increase monotonically for states A through D while
    using the minimum possible number of bits to represent the states with binary encoding. Indicate the
    values you assign to each state and simplify all equations:
     State assignments: A: 00, B: 01, C: 10, D: 11
     NS[1] = T S[1] * (T S[0] * TB + TS[0] T A) + TS[1] * (T S[0] + TS[0] * TB)
     NS[0] = T S[1] * T S[0] * T B + TS[1]
     O[1] = T S[1]
     O[0] = TS[1] XOR TS[0]
(f) Fully describe the FSM with equations given that the states are encoded with output encoding. Use
    the minimum possible number of bits to represent the states with output encoding. Indicate the values
    you assign to each state and simplify all equations:
     State assignments: A: 10, B: 11, C: 01, D: 00
     NS[1] = TS[1] * T S[0] * T B + TS[1] * TS[0] * TA + T S[1] * T S[0] * T B
     = T S[0] * T B + TS[1] * TS[0] * TA
     NS[0] = TS[1] * T S[0] + TS[1] * TS[0] * T A + T S[1] * T S[0] * T B
     = TS[1] * (T S[0] + TS[0] * T A) + T S[1] * T S[0] * T B
     O[1] = TS[1]
     O[0] = TS[0]
                                                   8/25
(g) Assume the following conditions:
     • We can only implement our FSM with 2-input AND gates, 2-input OR gates, and D flip-flops.
     • 2-input AND gates and 2-input OR gates occupy the same area.
     • D flip-flops occupy 3x the area of 2-input AND gates.
   Which state-encoding do you choose to implement in order to minimize the total area of this FSM?
     one-hot: 10 logic gates, 4 FFs
     binary: 16 logic gates, 2 FFs
     output: 10 logic gates, 2 FFs
                                                  9/25
3     Big versus Little Endian Addressing
Consider the 32-bit hexadecimal number 0xcafe2b3a.
    1. What is the binary representation of this number in little endian format? Please clearly mark the bytes
       and number them from low (0) to high (3).
         3a   2b    fe   ca
         0     1    2    3
    2. What is the binary representation of this number in big endian format? Please clearly mark the bytes
       and number them from low (0) to high (3).
         ca   fe   2b    3a
         0    1     2    3
                                                    10/25
4     The MIPS ISA
4.1    Warmup: Computing a Fibonacci Number
The Fibonacci number Fn is recursively defined as
F (n) = F (n − 1) + F (n − 2),
where F (1) = 1 and F (2) = 1. So, F (3) = F (2) + F (1) = 1 + 1 = 2, and so on. Write the MIPS assembly
for the fib(n) function, which computes the Fibonacci number F (n):
int fib(int n)
{
  int a = 0;
  int b = 1;
  int c = a + b;
  while (n > 1) {
    c = a + b;
    a = b;
    b = c;
    n--;
  }
  return c;
}
   Remember to follow MIPS calling convention and its register usage (just for your reference, you may not
need to use all of these registers):
 • The argument n is passed in register $4.
 • The result (i.e., c) should be returned in $2.
 • $8 to $15 are caller-saved temporary registers.
 • $16 to $23 are callee-saved temporary registers.
 • $29 is the stack pointer register.
 • $31 stores the return address.
    Note: A summary of the MIPS ISA is provided at the end of this handout.
                                                     11/25
  NOTE: More than one correct solution exists, this is just one potential solution.
fib:
addi   $sp,   $sp, -16   //   allocate stack space
sw     $16,   0($sp)     //   save r16
add    $16,   $4, $0     //   r16 for arg n
sw     $17,   4($sp)     //   save r17
add    $17,   $0, $0     //   r17 for a, init to 0
sw     $18,   8($sp)     //   save r18
addi   $18,   $0, 1      //   r18 for b, init to 1
sw     $31,   12($sp)    //   save return address
add    $2,    $17, $18   //   c = a + b
branch:
slti $3, $16,     2    //     use r3 as temp
bne $3, $0,       done
add $2, $17,      $18 //      c   =   a + b
add $17, $18,     $0 //       a   =   b
add $18, $2,      $0 //       b   =   c
addi $16, $16,    -1 //       n   =   n - 1
j    branch
done:
lw    $31,    12($sp)    //   restore r31
lw    $18,    8($sp)     //   restore r18
lw    $17,    4($sp)     //   restore r17
lw    $16,    0($sp)     //   restore r16
addi $sp,     $sp, 16    //   restore stack pointer
jr    $31                //   return to caller
                                                 12/25
4.2     MIPS Assembly for REP MOVSB
MIPS is a simple ISA. Complex ISAs—such as Intel’s x86—often use one instruction to perform the function
of many instructions in a simple ISA. Here you will implement the MIPS equivalent for a single Intel x86
instruction, REP MOVSB, which is specified as follows.
    The REP MOVSB instruction uses three fixed x86 registers: ECX (count), ESI (source), and EDI
(destination). The “repeat” (REP) prefix on the instruction indicates that it will repeat ECX times. Each
iteration, it moves one byte from memory at address ESI to memory at address EDI, and then increments
both pointers by one. Thus, the instruction copies ECX bytes from address ESI to address EDI.
(a) Write the corresponding assembly code in MIPS ISA that accomplishes the same function as this in-
    struction. You can use any general purpose register. Indicate which MIPS registers you have chosen to
    correspond to the x86 registers used by REP MOVSB. Try to minimize code size as much as possible.
      Assume: $1 = ECX, $2 = ESI, $3 = EDI
(b) What is the size of the MIPS assembly code you wrote in (a), in bytes? How does it compare to REP
    MOVSB in x86 (note: REP MOVSB occupies 2 bytes)?
       The size of the MIPS assembly code is 4 bytes × 7 = 28 bytes, as compared to 2 bytes for x86 REP
       MOVSB.
(c) Assume the contents of the x86 register file are as follows before the execution of the REP MOVSB:
      EAX:   0xccccaaaa
      EBP:   0x00002222
      ECX:   0xFEE1DEAD
      EDX:   0xfeed4444
      ESI:   0xdecaffff
      EDI:   0xdeaddeed
      EBP:   0xe0000000
      ESP:   0xe0000000
      Now, consider the MIPS assembly code you wrote in (a). How many total instructions will be executed
      by your code to accomplish the same fuction as the single REP MOVSB in x86 accomplishes for the
      given register state?
       The count (value in ECX) is 0xfee1dead = 4276215469. Therefore, the loop body is executed 4276215469
       times. As there are 6 instructions in the loop body, total instructions executed = 6 * 4276215469 + 1 =
       25657292814 + 1 (beq instruction outside of the loop) = 25657292815.
                                                    13/25
(d) Assume the contents of the x86 register file are as follows before the execution of the REP MOVSB:
    EAX:   0xccccaaaa
    EBP:   0x00002222
    ECX:   0x00000000
    EDX:   0xfeed4444
    ESI:   0xdecaffff
    EDI:   0xdeaddeed
    EBP:   0xe0000000
    ESP:   0xe0000000
    Now, answer the same question in (c) for the above register values.
     The count (value in ECX) is 0x00000000 = 0. Therefore, the loop body is executed 0 times. Total
     instructions executed = 1 (beq instruction outside of the loop).
                                                  14/25
5    Data Flow Programs
Draw the data flow graph for the fib(n) function from Question 4.1. You may use the following data flow
nodes in your graph:
 • + (addition)
 • > (left operand is greater than right operand)
 • Copy (copy the value on the input to both outputs)
 • BR (branch, with the semantics discussed in class, label the True and False outputs)
   You can use constant inputs (e.g., 1) that feed into the nodes. Clearly label all the nodes, program
inputs, and program outputs. Try to the use fewest number of data flow nodes possible.
                                                    15/25
16/25
6    Microarchitecture vs. ISA
a) Briefly explain the difference between the microarchitecture level and the ISA level in the transformation
   hierarchy. What information does the compiler need to know about the microarchitecture of the machine
   in order to compile a given program correctly?
    The ISA level is the interface a machine exposes to the software. The microarchitecture is the actual
    underlying implementation of the machine. Therefore, the microarchitecture and changes to the microar-
    chitecture are transparent to the compiler/programmer (except in terms of performance), while changes to
    the ISA affect the compiler/programmer. The compiler does not need to know about the microarchitecture
    of the machine in order to compile the program correctly
b) Classify the following attributes of a machine as either a property of its microarchitecture or ISA:
    Microarchitecture?       ISA?     Attribute
                               X      The machine does not have a subtract instruction
              X                       The ALU of the machine does not have a subtract unit
                               X      The machine does not have condition codes
                               X      A 5-bit immediate can be specified in an ADD instruction
              X                       It takes n cycles to execute an ADD instruction
                               X      There are 8 general purpose registers
              X                       A 2-to-1 mux feeds one of the inputs to ALU
              X                       The register file has one input port and two output ports
                                                   17/25
7   Performance Metrics
• If a given program runs on a processor with a higher frequency, does it imply that the processor always
  executes more instructions per second (compared to a processor with a lower frequency)? (Use less than
  10 words.)
    No, the lower frequency processor might have much higher IPC (instructions per cycle).
    More detail: A processor with a lower frequency might be able to execute multiple instructions per cycle
    while a processor with a higher frequency might only execute one instruction per cycle.
• If a processor executes more of a given program’s instructions per second, does it imply that the processor
  always finishes the program faster (compared to a processor that executes fewer instructions per second)?
  (Use less than 10 words.)
    No, because the former processor may execute many more instructions.
    More detail: The total number of instructions required to execute the full program could be different on
    different processors.
                                                  18/25
8    Performance Evaluation
Your job is to evaluate the potential performance of two processors, each implementing a different ISA. The
evaluation is based on its performance on a particular benchmark. On the processor implementing ISA A,
the best compiled code for this benchmark performs at the rate of 10 IPC. That processor has a 500 MHz
clock. On the processor implementing ISA B, the best compiled code for this benchmark performs at the
rate of 2 IPC. That processor has a 600 MHz clock.
 • What is the performance in Millions of Instructions per Second (MIPS) of the processor implementing
   ISA A?
    ISA A: 10 instructions
                  cycle
                                            cycle
                           ∗ 500, 000, 000 second = 5000 MIPS
                                                    19/25
9    Single-Cycle Processor Datapath
In this problem, you will modify the single-cycle datapath we built up in Lecture 11 to support the JAL
instruction. The datapath that we will start with is provided below. Your job is to implement the necessary
data and control signals to support the JAL instruction, which we define to have the following semantics:
                                 JAL :   R31 ← PC + 4
                                          PC ← PC31...28 || Immediate || 02
Add to the datapath on the next page the necessary data and control signals to implement the JAL instruction.
Draw and label all components and wires very clearly (give control signals meaningful names; if selecting a
subset of bits from many, specify exactly which bits are selected; and so on).
                                                   20/25
10     REP MOVSB
Let’s say you are the lead architect of the next flagship processor at Advanced Number Devices (AND). You
have decided that you want to use the LC-3b ISA for your next product, but your customers want a smaller
semantic gap and marketing is on your case about it. So, you have decided to implement your favorite x86
instruction, REP MOVSB, in LC-3b.
    Specifically, you want to implement the following definition for REP MOVSB (in LC-3b parlance): REP-
MOVSB SR1, SR2, DR which is encoded in LC-3b machine code as:
                    15   14 13    12   11    10   9   8    7 6   5   4   3   2    1 0
                          1010              DR            SR1    0   0   0       SR2
   REPMOVSB uses three registers: SR1 (count), SR2 (source), and DR (destination). It moves a byte
from memory at address SR2 to memory at address DR, and then increments both pointers by one. This is
repeated SR1 times. Thus, the instruction copies SR1 bytes from address SR2 to address DR. Assume that
the value in SR1 is greater than or equal to zero.
  1. Complete the state diagram shown below, using the notation of the LC-3b state diagram. Describe
     inside each bubble what happens in each state and assign each state an appropriate state number. Add
     additional states not present in the original LC-3b design as you see fit.
                                                  21/25
        HW 3 Solutions: Microprogramming Wrap-up and Pipelining
                                                           10, 46
                                    SR1 <− SR1 − 1                     To 18
                                        [REP]              REP = 0
                                           REP = 1
                                                           50
51
                                                           40
                          R=0   MDR <− M[MAR[15:1]’0]
                                             R=1
                                                           42
MAR <− DR
43
DR <− DR + 1
                                                          44
                         R=0    M[MAR[15:1]’0]<−MDR
                                           R=1
                                       To 46
1/8
                                     22/25
       2. Add to the LC-3b datapath any additional structures and any additional control signals needed to
          implement REPMOVSB. Clearly label your additional control signals with descriptive names. Describe
          what value each control signal would take to control the datapath in a particular way.
Modifications to the data path
                                                             REP       COND[2]
                                                                                                          Bus[15]
                                                                       J[5]
REGFILE ......
                         SR2         SR2                               6
                         OUT         OUT
                                                                                                            REP
                           16              16                      6
                                                  Microsequencer Modifications
                          SR2
                         MUX
                                +1   −1                IR[11:9]                              IR[11:9]
            INCDEC         INC/DEC                       111
                                                                               DR            IR[8:6]           SR1
                     2       MUX                       IR[8:6]
                                                       IR[2:0]                               IR[2:0]
                                 ALU                                       2                               2
                                    16                           DRMUX                                 SR1MUX
                          Regfile Modifications        DRMUX Modifications                   SR1MUX Modifications
  • DRMUX/2:
    IR[11:9] ;destination IR[11:9]
    R7 ;destination R7
    IR[8:6] ;destination IR[8:6]
    IR[2:0] ;destination IR[2:0]
  • SR1MUX/2:
    IR[11:9] ;source IR[11:9]
    IR[8:6] ;source IR[8:6]
    IR[2:0] ;source IR[2:0]
  • COND/3:
                                                     23/25
    COND0 : Unconditional
    COND1 : Memory Ready
    COND2 : Branch
3. Describe any changes you need to make to the LC-3b microsequencer. Add any additional logic and
   control signals you need. Clearly describe the purpose and function of each signal and the values it
   would take to control the microsequencer in a particular way.
   Additional control signals
      • INCDEC/2: PASSSR2, +1, -1
      • DRMUX/2:
          –   IR[11:9] ;destination IR[11:9]
          –   R7 ;destination R7
          –   IR[8:6] ;destination IR[8:6]
          –   IR[2:0] ;destination IR[2:0]
      • SR1MUX/2:
          – IR[11:9] ;source IR[11:9]
          – IR[8:6] ;source IR[8:6]
          – IR[2:0] ;source IR[2:0]
      • COND/3:
          –   COND0:    Unconditional
          –   COND1:    Memory Ready
          –   COND2:    Branch
          –   COND3:    Addressing Mode
          –   COND4:    Repeat
                                               24/25
MIPS Instruction Summary
    Opcode                       Example Assembly     Semantics
    add                          add $1, $2, $3       $1 = $2 + $3
    sub                          sub $1, $2, $3       $1 = $2 - $3
    add immediate                addi $1, $2, 100     $1 = $2 + 100
    add unsigned                 addu $1, $2, $3      $1 = $2 + $3
    subtract unsigned            subu $1, $2, $3      $1 = $2 - $3
    add immediate unsigned       addiu $1, $2, 100    $1 = $2 + 100
    multiply                     mult $2, $3          hi, lo = $2 * $3
    multiply unsigned            multu $2, $3         hi, lo = $2 * $3
    divide                       div $2, $3           lo = $2/$3, hi = $2 mod $3
    divide unsigned              divu $2, $3          lo = $2/$3, hi = $2 mod $3
    move from hi                 mfhi $1              $1 = hi
    move from low                mflo $1              $1 = lo
    and                          and $1, $2, $3       $1 = $2 & $3
    or                           or $1, $2, $3        $1 = $2 | $3
    and immediate                andi $1, $2, 100     $1 = $2 & 100
    or immediate                 ori $1, $2, 100      $1 = $2 | 100
    shift left logical           sll $1, $2, 10       $1 = $2 « 10
    shift right logical          srl $1, $2, 10       $1 = $2 » 10
    load word                    lw $1, 100($2)       $1 = memory[$2 + 100]
    store word                   sw $1, 100($2)       memory[$2 + 100] = $1
    load upper immediate         lui $1, 100          $1 = 100 « 16
    branch on equal              beq $1, $2, label    if ($1 == $2) goto label
    branch on not equal          bne $1, $2, label    if ($1 != $2) goto label
    set on less than             slt $1, $2, $3       if ($2 < $3) $1 = 1 else $1 = 0
    set on less than immediate   slti $1, $2, 100     if ($2 < 100) $1 = 1 else $1 = 0
    set on less than unsigned    sltu $1, $2, $3      if ($2 < $3) $1 = 1 else $1 = 0
    set on less than immediate   sltui $1, $2, 100    if ($2 < 100) $1 = 1 else $1 = 0
    jump                         j label              goto label
    jump register                jr $31               goto $31
    jump and link                jal label            $31 = PC + 4; goto label
25/25