Finite Wordlength Effects in Discrete-Time
Systems
                   V. Rajbabu
              rajbabu@ee.iitb.ac.in
                    DSP Lab
           Department of Electrical Engineering
          Indian Institute of Technology Bombay
                    24 Jan 2024
                                                  1 / 35
                           Outline
Number Representations
Coefficient Quantization
Handling Quantization Effects
Today’s Lab
                                     2 / 35
               Finite Wordlength Effects
Compared to an analog implementation, discrete
implementation losses precision due to quantization
  • A/D and D/A conversions
  • Finite word-length arithmetic
Sources of error
  • Quantization in A/D conversion - signal-to-noise ratio
    (SNR) increases by 6 dB for each additional bit
  • Rounding of multiple products
These errors have to be considered while designing DT
systems
                                                             3 / 35
Number Representations
                Number Representation
Based on finite word-length numeric representations, digital
signal processors (DSPs) can be
   • Fixed-point processors
  • Floating-point processors
                                                               5 / 35
                Number Representation
Based on finite word-length numeric representations, digital
signal processors (DSPs) can be
   • Fixed-point processors
  • Floating-point processors
Floating-point representation
  • General purpose processors use this representation
             x = 2E x̂M eg., in 64-bit, E = 11, 1 sign bit
                        52 bits for decimal part
    E is the exponent, and x̂M ∈ (−1, +1) is the mantissa
                                                               5 / 35
                Number Representation
Based on finite word-length numeric representations, digital
signal processors (DSPs) can be
   • Fixed-point processors
  • Floating-point processors
Floating-point representation
  • General purpose processors use this representation
             x = 2E x̂M eg., in 64-bit, E = 11, 1 sign bit
                        52 bits for decimal part
    E is the exponent, and x̂M ∈ (−1, +1) is the mantissa
Floating-point representation
  • supports wide range of values (with small number of bits)
  • has complex hardware
  • is simpler to design
                                                                5 / 35
             Binary Number Systems
Basic number systems
  • Signed magnitude
 • One’s complement
 • Two’s complement
Two’s complement
 • Most commonly used in binary system
 • Unique representation for zero
 • Simple mathematical operations
 • Subtraction performed using addition
                                          6 / 35
              Fractional Binary Numbers
  • Designer keeps track of radix (decimal) point
Normalized Binary
Normalized M-bit binary numbers are written as:
                     x(B) = b0 .b1 b2 · · · bM−1
where b0 is the sign bit
Decimal representation
                                     M−1
                                     X
                    x(10) = −b0 +           bi 2−i
                                      i=1
                                                     7 / 35
              Fractional Binary Numbers
  • Designer keeps track of radix (decimal) point
Normalized Binary
Normalized M-bit binary numbers are written as:
                     x(B) = b0 .b1 b2 · · · bM−1
where b0 is the sign bit
Decimal representation
                                     M−1
                                     X
                    x(10) = −b0 +           bi 2−i
                                      i=1
Unscaled 
         real number - infinite precision
                    ∞
                        
                          bi 2−i
                    P
x(10) = Xm −b0 +
                    i=1
                                                     7 / 35
                                   Number Circle
An easy way to visualize two’s complement representation
                                            0.00 ≡ 0⋅ 2−2 =    0
                   1.11 ≡ −1⋅ 2−2 = −0.25                           0.01 ≡ 1⋅ 2−2 = 0.25
         1.10 ≡ −2⋅ 2−2 = −0.5                                                0.10 ≡ 2⋅ 2−2 = 0.5
                   1.01 ≡ −3⋅ 2−2 = −0.75                           0.11 ≡ 3⋅ 2−2 = 0.75
                                            1.00 ≡ −4⋅ 2−2 =   −1
                                 Figure: 3-bit Q2 numbers
                                                                                                    8 / 35
                           Q-Format
Q-format is a formal mechanism to keep track of radix (fixed)
point
Q-Format: Q##
refers to a binary number with ## bits to the right of the radix
point
  • Total word length depends on the system
  • In DSPs, Q15 is a common format
  • A 16-bit number in Q15 has 1 sign bit and 15 fractional bits
                             s.b0 b1 · · · b14
Alternate form: Q(I.F )
  • I indicates number of bits to the left of radix point (for sign
    and integer part of a number)
  • F bits to the right of the radix point
                                                                      9 / 35
                  Q-Format - Example
Convert the following numbers to their signed integer value in
Q15
Q-Format
                     0.5
                   −0.5
                   −1.0
                     1.0
                                                                 10 / 35
                  Q-Format - Example
Convert the following numbers to their signed integer value in
Q15
Q-Format
                    0.5 =   16384
                   −0.5 = −16384
                   −1.0 = −32768
                    1.0 = out of range
                        ≈ 32767 = 1 − 2−15
                                                                 10 / 35
                 Q-Format Conversion
Q12 Number
          15         11                            0
          S
               IWL          WL − 1 − IWL
−2IWL ≤ Range < 2IWL − 2−12
Let x be a fractional number that needs to be represented as a
B-bit (WL) signed integer, Qf format as xq
  • For positive x, xq = round(x · 2f )
  • For negative x, xq = −round(|x| · 2f )
                                                                 11 / 35
                    Q-Format: Addition
To obtain: C = A + B, with Qc , Qa , Qb
  • Require Qa and Qb to be equal
  • Let Ma and Mb be size of registers for A and B
  • Intermediate values
       • Intermediate result size = max(Ma , Mb ) + 1
       • Intermediate QI = Qa = Qb
  • Final values
       • Top Mc bits are used, lowest fractional bits are discarded
                     Qc = Qa − (Mc − max(Ma , Mb ) − 1)
                         = Qb − (Mc − max(Ma , Mb ) − 1)
Adding N numbers of length M
    Final word length : ???
                                                                      12 / 35
                    Q-Format: Addition
To obtain: C = A + B, with Qc , Qa , Qb
  • Require Qa and Qb to be equal
  • Let Ma and Mb be size of registers for A and B
  • Intermediate values
       • Intermediate result size = max(Ma , Mb ) + 1
       • Intermediate QI = Qa = Qb
  • Final values
       • Top Mc bits are used, lowest fractional bits are discarded
                     Qc = Qa − (Mc − max(Ma , Mb ) − 1)
                         = Qb − (Mc − max(Ma , Mb ) − 1)
Adding N numbers of length M
    Final word length : M + ⌈log2 N⌉
                                                                      12 / 35
               Q-Format: Multiplication
To obtain: C = A × B, with Qc , Qa , Qb
  • Let Ma and Mb be size of registers for A and B
  • Ma and Mb or Qa and Qb need not be equal
  • Intermediate values
      • Intermediate result size = Ma + Mb
      • Intermediate QI = Qa + Qb
  • Final values
      • Top Mc bits are used and lowest fractional bits are
        discarded
      • Qc = (Qa + Qb ) − (Ma + Mb − Mc )
                                                              13 / 35
                     Fixed-point Arithmetic
                           Table: Fixed-point Arithmetic
Floating-point                       Fixed-point                     IWL of result
                           IX > IY                   IX < IY
      X := Y        X := (Y ≫ (IX − IY ))    X := (Y ≪ (IY − IX ))          IX
      X +Y           X + (Y ≫ (IX − IY ))    (X ≫ (IY − IX )) + Y    max(IX , IY ) + 1
      X ∗Y                  X ∗Y                     X ∗Y                IX + IY
a
    IX , IY - Integer word length (IWL) of X and Y
b
    Overflow needs to be avoided for valid results
                                                                                 14 / 35
Coefficient Quantization
                       Quantization
  • Quantization - represents numerical values with finite
    number of bits
  • Significant in fixed-point representation
Quantization in signal processing
  • data - round-off errors
  • coefficients/parameters - changes system transfer function
                                                                 16 / 35
Implementation of DT Filter
                              17 / 35
                   Quantization Errors
Possible errors in a quantized (fixed-point) system
  • Input quantization
  • Coefficient quantization
  • Product quantization (round-off error, underflow)
  • Overflow
                                                        18 / 35
                      Quantization Example
Suppose you want to implement a filter using fractional
arithmetic (Q1.3)
    h[n] = [ 13   1
                  3
                       1
                       3]
    x[n] = [ 15   2
                  5
                       3
                       5
                            4
                            5]
    y [n] = [0.0667     0.2      0.4   0.6   0.4667   0.2667] (ideal)
                                                                        19 / 35
                      Quantization Example
Suppose you want to implement a filter using fractional
arithmetic (Q1.3)
    h[n] = [ 13   1
                  3
                       1
                       3]
    x[n] = [ 15   2
                  5
                       3
                       5
                            4
                            5]
    y [n] = [0.0667     0.2      0.4   0.6   0.4667   0.2667] (ideal)
Q-format representation: Q1.3 refers to 4 bits with the MSB
representing sign and three bits representing fractional part
                                                                        19 / 35
                Quantization Example
Consider h = 0.3333 represented in Q(1.3) or Q3 (Qf )
  • Word length is 4 bits
  • hq = round(h × 2f ) = round(2.66) = 3
  • Q{h}bin = 0 011
  • Equivalent decimal value 0.375
                                                        20 / 35
                 Quantization Example
Consider h = 0.3333 represented in Q(1.3) or Q3 (Qf )
  • Word length is 4 bits
  • hq = round(h × 2f ) = round(2.66) = 3
  • Q{h}bin = 0 011
  • Equivalent decimal value 0.375
Typically output of product or sum will have register with more
bits (say in this case 8 bits) to avoid overflow
                                                                  20 / 35
                 Quantization Example
Consider h = 0.3333 represented in Q(1.3) or Q3 (Qf )
  • Word length is 4 bits
  • hq = round(h × 2f ) = round(2.66) = 3
  • Q{h}bin = 0 011
  • Equivalent decimal value 0.375
Typically output of product or sum will have register with more
bits (say in this case 8 bits) to avoid overflow
   • Say x = 0.2, we have Q{x}bin = 0 010
  • Product Q{hbin } × Q{xbin } = 00 000110
                                                                  20 / 35
                 Quantization Example
Consider h = 0.3333 represented in Q(1.3) or Q3 (Qf )
  • Word length is 4 bits
  • hq = round(h × 2f ) = round(2.66) = 3
  • Q{h}bin = 0 011
  • Equivalent decimal value 0.375
Typically output of product or sum will have register with more
bits (say in this case 8 bits) to avoid overflow
   • Say x = 0.2, we have Q{x}bin = 0 010
  • Product Q{hbin } × Q{xbin } = 00 000110
  • Use nearest rounded value hbin × xbin = 0 001
  • Comparing to actual value of 0.2 × 0.3333 = 0.0667 we
    have 0.125
                                                                  20 / 35
                Quantization Example
Quantized h[n] - Q(1.3)
                h[0]     h[1]     h[2]
   h[n]       0.3333   0.3333   0.3333
  Q{h[n]}      0.375   0.375     0.375
 Q{h[n]}bin   0.011    0.011     0.011
                                         21 / 35
                Quantization Example
Quantized h[n] - Q(1.3)
                h[0]      h[1]       h[2]
   h[n]       0.3333    0.3333     0.3333
  Q{h[n]}      0.375    0.375       0.375
 Q{h[n]}bin   0.011     0.011       0.011
Quantized x[n] - Q(1.3)
               x[0]     x[1]      x[2]    x[3]
   x[n]        0.2      0.4       0.6     0.8
  Q{x[n]}      0.25    0.375     0.625   0.75
 Q{x[n]}bin   0.010    0.011     0.101   0.110
                                                 21 / 35
                    Quantization Example
Quantized h[n] - Q(1.3)
                    h[0]       h[1]       h[2]
   h[n]           0.3333     0.3333     0.3333
  Q{h[n]}          0.375     0.375       0.375
 Q{h[n]}bin       0.011      0.011       0.011
Quantized x[n] - Q(1.3)
                   x[0]     x[1]       x[2]      x[3]
   x[n]            0.2      0.4        0.6       0.8
  Q{x[n]}          0.25    0.375      0.625     0.75
 Q{x[n]}bin       0.010    0.011      0.101     0.110
Quantized y[n] - Q(1.3)
                      y[0]     y[1]      y[2]      y[3]     y[4]     y[5]
     y[n]           0.0667     0.2       0.4       0.6    0.4667   0.2667
   Q{y[n]}           0.125     0.25     0.375     0.625     0.5     0.25
 Q{y [n]}actual      0.125     0.25      0.5      0.625     0.5     0.25
                                                                        21 / 35
           Coefficient Quantization Effects
In IIR filters the response can be sensitive to coefficient values
   • Quantization of filter coefficients causes the roots of the
     numerator and denominator polynomials of the z-transform
     to move
   • Changing the position of these roots in z-plane causes the
     frequency response to change and may even cause the
     filter to go unstable as poles are moved onto or outside the
     unit circle
                                                                     22 / 35
           Coefficient Quantization Severity
Severity of the effect is effected by
  • Tightly clustered roots
  • Roots close to the unit circle
  • Many roots (a long filter)
  • Filter structure
  • Roots close to the real axis (for DF implementation)
                                                           23 / 35
          Coefficient Quantization Severity
To reduce severity of these effects
  • Use smaller filter sections
  • Select filters with less tightly clustered roots
  • Use more bits of precision
  • Try slight variations on the filter to find which provides the
    best response under the quantization constraints
  • Scale the coefficients (by choosing the Q#) to
    advantageously trade-off significant bits and overflow
    probability
                                                                     24 / 35
    Quantization Effects - Filter Sturcuture
• Direct form (I and II) structures - filter and signal
  quantization errors can accumulate
                                                          25 / 35
     Quantization Effects - Filter Sturcuture
 • Direct form (I and II) structures - filter and signal
   quantization errors can accumulate
Cascaded Second-order Sections
 • SOSs require a much smaller range of coefficient values
 • Filter coefficient errors in one SOS do not affect other
   SOSs
 • Scaling may be applied separately to each section to
   reduce signal quantization effects
 • Sets of poles and zeros may be matched to limit the
   dynamic range at the output of each section
 • SOSs may be ordered to minimize quantization effects
 • Cascades of SOSs require more operations than a DF-II
   filter
                                                              25 / 35
      Quantization Effects - Filter Sturcuture
Folded FIR Filter Structure
  • Order of adds matter - folded FIR filters add from the
    outside in
                                                             26 / 35
      Quantization Effects - Filter Sturcuture
Folded FIR Filter Structure
  • Order of adds matter - folded FIR filters add from the
    outside in
  • FIR filters almost always have their smallest values at the
    ends
                                                                  26 / 35
      Quantization Effects - Filter Sturcuture
Folded FIR Filter Structure
  • Order of adds matter - folded FIR filters add from the
    outside in
  • FIR filters almost always have their smallest values at the
    ends
  • For long FIR filters, folded structures improve numerical
    accuracy over Direct Form
                                                                  26 / 35
Handling Quantization Effects
                    Quantization Effects
Finite word length effects
  • Overflow errors
         Can be avoided by appropriate scaling
  • Round-off errors
         Difficult to avoid - requires appropriate fixed-point arithmetic
                                                                            28 / 35
                      Round-off Noise
Product of fixed-point numbers
  • Product output requires more bits than inputs
  • Truncation or rounding of result can lead to errors
       • Extended precision registers help in reducing this error
Sum of fixed-point numbers
  • Output sum requires one-bit more than inputs
  • Truncation or rounding of result can lead to errors
       • Not as severe in product
                                                                    29 / 35
                          Scaling
Scaling
  • Prevents overflow
  • Provides a trade-off between SNR and overflow
Scaling in filter design/implementation
  • Normalize inputs, coefficients to ±1
  • Based on magnitude of frequency response
                                                    30 / 35
                Approaches to Scaling
Absolute scaling
  • Scale assuming worst-case inputs/data
  • Guarantees no overflow
  • Leads to less accurate results (more quantization error)
Dynamic scaling
  • Monitor range of variables and scale if required
  • Increases computation
                                                               31 / 35
           Floating-point to Fixed-point
• Implement and verify floating-point algorithm
    • Estimate minimum/maximum ( range) of variables
• Convert floating-point variables to fixed-point
• Decide on scaling, based on architecture (word length)
    • Range of variables can help in fixing integer word length
      (IWL)
• Replace floating-point arithmetic with fixed-point arithmetic
    • Consider available accumulator and register word lengths
                                                                  32 / 35
     FFT Computation: Finite Word-length
• Similar effects (quantization, round-off and overflow) also
  affect FFT computation
• Depends on FFT length used
• Use appropriate scaling at each stage of FFT (before a
  butterfly computation)
                                                                33 / 35
Today’s Lab
                     Lab Exercises
• Evaluating y = ax + b
• Evaluating y [n] = h[n] ∗ x[n]
                                     35 / 35
                     Lab Exercises
• Evaluating y = ax + b
• Evaluating y [n] = h[n] ∗ x[n]
• FIR quantization effects
       filterDesigner
                                     35 / 35