Data Science
Data Science
ORGANIZATION
AND
ARCHITECTURE
Debdeep Mukhopadhyay,
CSE, IIT Kharagpur
Datapath Elements and
Their Designs
Why Datapaths?
   The speed of these elements often dominates the
    overall system performance so optimization
    techniques are important.
    However, as we will see, the task is non-trivial since
    there are multiple equivalent logic and circuit
    topologies to choose from, each with adv./disadv. in
    terms of speed, power and area.
   Datapath elements include shifters, adders,
    multipliers, etc.
Bit-slicing method of constructing
ALU
   Bit slicing is a technique for constructing a
    processor from modules of smaller bit width.
   Each of these components processes one
    bit field or "slice" of an operand.
   The grouped processing components would
    then have the capability to process the chosen
    full word-length of a particular software
    design.
Bit slicing
0                 0                    Y<-A                   No shift
0                 1                    Y<-shlA                Shift left
1                 0                    Y<-shrA                Shift right
1                 1                    Y<-0                   Zero
                                                              outputs
      0
              A[1]
     A[0]                         Y[1]
                     MUX
     A[2]
     A[0]
     A[1]                         Y[0]
                     MUX
          0
Verilog Code
module shifter(Con,A,Y);
    input [1:0] Con;
    input[2:0] A;
    output[2:0] Y;
    reg [2:0] Y;
    always @(A or Con)
    begin
        case(Con)
          0: Y=A;
          1: Y=A<<1;
          2: Y=A>>1;
          default: Y=3’b0;
        endcase
     end
endmodule
Combinational logic shifters with
shiftin and shiftout
       Sel          Operation          Function
     0             Y<=A            No shift
     1           Y<-A rol 1      Rotate once
     2           Y<-A rol 2      Rotate twice
     3           Y<- A rol 3    Rotate Thrice
     4           Y<-A rol 4    Rotate four times
     5           Y<-A rol 5    Rotate five times
Verilog Coding
     function [2:0] rotate_left;
     input [5:0] A;
     input [2:0] NumberShifts;
     reg [5:0] Shifting;
     integer N;
     begin
       Shifting = A;
       for(N=1;N<=NumberShifts;N=N+1)
        begin
          Shifting={Shifting[4:0],Shifting[5]};
        end
       rotate_left=Shifting;
     end
    endfunction
Verilog
     always @(Rotate or A)
    begin
      case(Rotate)
        0: Y=A;
        1: Y=rotate_left(A,1);
        2: Y=rotate_left(A,2);
        3: Y=rotate_left(A,3);
        4: Y=rotate_left(A,4);
        5: Y=rotate_left(A,5);
        default: Y=6’bx;
       endcase
     end
Another Way
.
                           data 1
                  n bits
                                    output
                           data 2            n bits
n bits
      S = A⊕ B          Cout                 S = A⊕ B ⊕C
                                                                            Cout               C
  Cout = AgB                               Cout = MAJ ( A, B, C )
                                   S                                                   S
  A     B      Co   S                          A      B      C      Co S
  0     0      0    0                          0      0      0      0   0
  0     1      0    1                          0      0      1      0   1
  1     0      0    1                          0      1      0      0   1
  1     1      1    0                          0      1      1      1   0
                                               1      0      0      0   1
                                               1      0      1      1   0
                                               1      1      0      1   0
                                               1      1      1      1   1
Carry-Ripple Adder
   Simplest design: cascade full adders
       Critical path goes from Cin to Cout
       Design full adder to have fast carry delay
        A4 B4 A3 B 3 A2 B2 A 1 B1
    Cout                                             Cin
                 C3          C2         C1
           S4         S3          S2         S1
Full adder
   Computes one-bit sum, carry:
       si = ai XOR bi XOR ci
       ci+1 = aibi + aici + bici
   Half adder computes two-bit sum.
   Ripple-carry adder: n-bit adder built from full
    adders.
   Delay of ripple-carry adder goes through all
    carry bits.
Verilog for full adder
module fulladd(a,b,carryin,sum,carryout);
  input a, b, carryin; /* add these bits*/
  output sum, carryout; /* results */
     fulladd a0(a[0],b[0],carryin,sum[0],carry[1]);
     fulladd a1(a[1],b[1],carry[1],sum[1],carry[2]);
…
    fulladd a7(a[7],b[7],carry[7],sum[7],carryout]);
endmodule
Generate and Propagate
G[i ] = A[i ].B[i ]                      G[i ] = A[i ].B[i ]
P[i ] = A[i ] ⊕ B[i ]                    P[i ] = A[i ] + B[i ]
C[i ] = G[i ] + P[i ].C[i −1]            C[i ] = G[i ] + P[i ].C[i −1]
S [i ] = P[i ] ⊕ C[i −1]                 S [i ] = A[i ] ⊕ B[i ] ⊕ C[i − 1]
      endmodule
Verilog for carry-lookahead sum unit
module sum(a,b,carryin,result);
  input a, b, carryin; /* add these bits*/
  output result; /* sum */
            ∑ ( j − i)2
           j =i +1
                                − ( j −i )
                                             + (k − i)2− ( k −1−i )
           = 2 − 2− ( k −1−i )
                          p
           [Using,       ∑l2
                         l =1
                                     −l
                                             = 2 − ( p + 2)2− p ]
Carry completion sensing adder
A=011101101101101                A=011101101101101
B=100111000010101                B=100111000010101
------------------------------   ------------------------------
C=000000000000000                C=000101000000101
N=000000000000000                N=000000010000010
------------------------------   ------------------------------
C=000101000000101                C=001111000001101
N=000000010000010                N=000000110000010
Carry completion sensing adder
A=011101101101101                A=011101101101101
B=100111000010101                B=100111000010101
------------------------------   ------------------------------
C=001111000001101                C=011111000011101
N=000000110000010                N=000000110000010
------------------------------   ------------------------------
C=011111000011101                C=111111000111101
N=000000110000010                N=000000110000010
Carry completion sensing adder
A=011101101101101
B=100111000010101
------------------------------
C=111111000111101
N=000000110000010
    -----------------------------
    -
C=111111001111101
N=000000110000010
Carry completion sensing adder
   (A[i],B[i])=(0,0)=>(Ci,Ni)=(0,1)
   (A[i],B[i])=(1,1)=>(Ci,Ni)=(1,0)
   (A[i],B[i])=(0,1)=>(Ci,Ni)=(Ci-1,Ni-1)
   (A[i],B[i])=(1,0)=>(Ci,Ni)=(Ci-1,Ni-1)
   Stop, when for all i, Ci V Ni = 1
Justification
   Ci and Ni together is a coding for the carry.
   When Ci=1, carry can be computed. Make
    Ni=0
   When Ci=0 is the final carry, then indicate by
    Ni=1
   The carry can be surely stated when both Ai
    and Bi are 1’s or 0’s.
Carry-skip adder
   Looks for cases in which carry out of a set of
    bits is identical to carry in.
   Typically organized into b-bit stages.
   Can bypass carry through all stages in a group
    when all propagates are true: Pi Pi+1 … Pi+b-1.
       Carry out of group when carry out of last bit in
        group or carry is bypassed.
Carry-skip structure
      ci
      Pi
      Pi+1         AND
               …
      Pi+b-1
                            OR
                   Ci+b-1
    Carry-skip structure
       fulladd_p a0(a[0],b[0],carryin,sum[0],carry[1],p[0]);
       fulladd_p a1(a[1],b[1],carry[1],sum[1],carry[2],p[1]);
       fulladd_p a2(a[2],b[2],carry[2],sum[2],carry[3],p[2]);
       fulladd_p a3(a[3],b[3],carry[3],sum[3],carry[4],p[3]);
       assign cs4 = carry[4] | (p[0] & p[1] & p[2] & p[3] & carryin);
       fulladd_p a4(a[4],b[4],cs4, sum[4],carry[5],p[4]);
…
     assign carryout = carry[8] | (p[4] & p[5] & p[6] & p[7] & cs4);
endmodule
Delay analysis
   Assume that skip delay = 1 bit carry delay.
   Delay of k-bit adder with block size b:
       T = (b-1) + 0.5 + (k/b –2) + (b-1)
          block 0 OR gate skips   last block
   For equal sized blocks, optimal block size is
    sqrt(k/2).
Delay of Carry-Skip Adder
     tp
                                         ripple adder
bypass adder
                                       N    
               t d = 2( k − 1) t RCA +  − 2 t SKIP
                                        2k  
            4..8
                                               N
Carry-select adder
   Computes two results in parallel, each for
    different carry input assumptions.
   Uses actual carry in to select correct result.
   Reduces delay to multiplexer.
Carry-select structure
Carry-save adder
   Useful in multiplication.
   Input: 3 n-bit operands.
   Output: n-bit partial sum, n-bit carry.
       Use carry propagate adder for final sum.
   Operations:
       s = (x + y + z) mod 2.
       c = [(x + y + z) –2] / 2.
Carry Network is the Essence of a Fast Adder
                 gi pi       Carry is:                             xi             yi
                 0    0      annihilated or killed                                                      gi = xi yi
                 0    1      propagated
                 1    0      generated                                                                 pi = xi ⊕ yi
                 1    1      (impossible)
Carry network
                                                                                                              Ripple; Skip;
            ck
                          c k−1
                                             ...                         ci            ...            c0      Lookahead;
                                   c k−2                                                      c1
                                                       c i+1                                                  Parallel-prefix
si
                                               ...
ck               ck−1              ck−2               c2                        c1         c0
Carry network
            ck
                         c k−1
                                            ...                           ci            ...             c0
                                  c k−2                                                        c1
                                                        c i+1
si
The carry recurrence can be unrolled to obtain each carry signal directly
from inputs, rather than through propagation         Note:
                                                                        Addition symbol
     ci = gi–1 ∨ ci–1 pi–1                                             vs logical OR
        = gi–1 ∨ (gi–2 ∨ ci–2 pi–2) pi–1
        = gi–1 ∨ gi–2 pi–1 ∨ ci–2 pi–2 pi–1
        = gi–1 ∨ gi–2 pi–1 ∨ gi–3 pi–2 pi–1 ∨ ci–3 pi–3 pi–2 pi–1
        = gi–1 ∨ gi–2 pi–1 ∨ gi–3 pi–2 pi–1 ∨ gi–4 pi–3 pi–2 pi–1 ∨ ci–4 pi–4 pi–3 pi–2 pi–1
        =...
Full Carry Lookahead
              x3 y3              x2 y2                             x1 y1     x0 y0
cin
...
s3 s2 s1 s0
                indirectly                                                               g2
  Full carry lookahead is quite practical
  for a 4-bit adder                                  c2                                  p1
  c1   =   g0 ∨ c0 p0                                                                    g1
  c2   =   g1 ∨ g0 p1 ∨ c0 p0 p1                                                         p0
  c3   =   g2 ∨ g1 p2 ∨ g0 p1 p2 ∨ c0 p0 p1 p2       c1
                                                                                         g0
  c4   =   g3 ∨ g2 p3 ∨ g1 p2 p3 ∨ g0 p1 p2 p3                               c0
                                 ∨ c0 p0 p1 p2 p3          Four-bit carry network with
                                                          full lookahead.
    Carry Lookahead Beyond 4 Bits
Consider a 32-bit adder
                                                      No circuit sharing:
c1 = g0 ∨ c0 p0
                                                      Repeated computations
c2 = g1 ∨ g0 p1 ∨ c0 p0 p1
c3 = g2 ∨ g1 p2 ∨ g0 p1 p2 ∨ c0 p0 p1 p2
   .
   .
   .                                                                   32-input AND
c31 = g30 ∨ g29 p30 ∨ g28 p29 p30 ∨ g27 p28 p29 p30 ∨ . . . ∨ c0 p0 p1 p2 p3 ... p29 p30
                                           ...
                                                             High fan-ins necessitate
                               32-input OR                   tree-structured circuits
   Solution to the Fan-in Problem
High-radix addition (i.e., radix 2h)
      Increases the latency for generating g and p signals and sum digits,
      but simplifies the carry network (optimal radix?)
Multilevel lookahead
Either way, the carries c4, c8, and c12 are determined first
     g [i,i+3] = gi+3 ∨ gi+2 pi+3 ∨ gi+1 pi+2 pi+3 ∨ gi pi+1 pi+2 pi+3
     p [i,i+3] = pi pi+1 pi+2 pi+3
                       ci+3          ci+2       ci+1
g[i,i+3] p[i,i+3]
                                                                                pi+3
  c4
                       A 4-bit
                       lookahead                                                gi+3
                       carryp3generator          Block Signal Generation
                            g3                    Intermediate Carries
c3 ci+3
                            p2                                                  pi+2
        A 4-bit
        carry               g2                                                  gi+2
        network
                            p1            ci+2                                  pi+1
   c2
                            g1                                                  gi+1
                            p0                                                  pi
   c1                                     ci+1
                            g0                                                  gi
                                                                           ci
                  c0
     Combining Block g and p Signals
                                                        j0                              i0
                                           j1                i1
                 j2                   i2
j3
                                                                                             Block generate and
                          i3
                                                                                             propagate signals
                                                                                             can be combined in
             c j 2 +1                 c j 1 +1        cj                                     the same way as bit
                                                           0 +1
                                                                                             g and p signals to
                                                                                             form g and p signals
     g p                       g p                g p                    g p
                                                                                             for wider blocks
                                                                                             ci 0
                        4-bit lookahead carry generator
     g p
                                Fig. 6.3    Combining of g and p signals of four
                                (contiguous or overlapping) blocks of arbitrary widths
                                into the g and p signals for the overall block [i0, j3].
 Apr. 2012                                  Computer Arithmetic, Addition/Subtraction                       Slide 62
A Two-Level Carry-Lookahead Adder
                                                                             c12         c8          c4                  c0
 g [0,63]
 p [0,63]           Fig. 6.4 Building a 64-bit carry-lookahead adder from 16
                    4-bit adders and 5 lookahead carry generators.
                                                                             p′
  j1              i1
                                          ¢
                                              g = g" + g'p"
       g    p                                 p = p'p"
                                        (g, p)
                  Block B                                     g    p
 Combining of g and p signals of two (contiguous or overlapping) blocks B'
 and B" of arbitrary widths into the g and p signals for block B.
Formulating the Prefix Computation Problem
The problem of carry determination can be formulated as:
Given    (g0, p0)(g1, p1) . . . (gk–2, pk–2)   (gk–1, pk–1)
             c1                    c2            . . .              ck–1                ck
Carry-in can be viewed as an extra (−1) position: (g–1, p–1) = (cin, 0)
     ¢                                ¢                            (b) A 4-bit
                                                                   Carry
                                                                   lookahead
     ¢                 ¢                                           network
                ¢                   ¢                   ¢
                                                                                                 g[0,1] p[0,1]
 Reason for         2
 latency being
 2 log2k – 2        3
 Brent-Kung
                    5
 parallel prefix
 graph for
 16 inputs.         6