CHAPTER 13
Analysis of Finite Word Length Effects
13.1 INTRODUCTION
In discussing filter realizations (Chapter 10), we have assumed that
the signals, filter coefficients, and results of arithmetic operations such
as addition and multiplication were represented with infinite precision. In practice,
they can be represented only to a finite precision. Precision is defined as the distance
(or difference) between two consecutive numbers that can be obtained with a given
number of bits. All digital processors have finite word length. Once we design a
digital filter, it must be implemented on a finite word length machine. The
reductions in filter performance caused by moving from infinite precision to finite
precision are called finite word length effects. Consider, for example, the
filter

$$y(n) = -a_1 y(n-1) + b_0 x(n) + b_1 x(n-1)$$
In implementing this filter, we must deal with the following problems:
1. The input signal values x(n) must be represented with a finite number of bits. This
introduces quantization error. The process of representing the data with a finite
number of bits is known as quantization.
2. The filter coefficients ($a_1$, $b_0$, and $b_1$) cannot all be represented exactly. They are
quantized for representation with a finite number of bits, which may introduce errors
in the filter frequency response. Its poles and zeros are not in the desired locations.
As a result, the singularities in the z-plane are limited to lie on a grid of possible
locations.
3. If the coefficients are quantized to B bits and the signals are quantized to B bits,
then the product terms $-a_1 y(n-1)$, $b_0 x(n)$, and $b_1 x(n-1)$ will each be 2B
bits long. All the product terms must be quantized. Quantization of the products
introduces noise into the filter, which has the effect of reducing the signal-to-noise
ratio (SNR) of the filter.
4. Another source of error occurs as a result of the summing operation. The sum
of several B-bit numbers will not necessarily fit within B bits. When the sum is too
large to fit, it results in overflow, which can cause large errors and reduce
the SNR. Overflow can also yield high-amplitude oscillations in infinite
impulse response (IIR) filters, and this in turn [...]
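
The effects in items 2 and 3 can be made concrete with a short numerical sketch. It is only an illustration under stated assumptions: the helper `quantize`, the word length B = 4, and the coefficient value -0.7 are hypothetical and do not come from the text.

```python
from fractions import Fraction

def quantize(value, bits):
    """Round value to the nearest multiple of 2**-bits (hypothetical helper)."""
    step = Fraction(1, 2 ** bits)
    return round(Fraction(value) / step) * step

B = 4                                     # assumed number of fractional bits
ideal_a1 = Fraction(-7, 10)               # hypothetical ideal coefficient
a1 = quantize(ideal_a1, B)                # coefficient actually stored
coefficient_error = a1 - ideal_a1         # deterministic coefficient error

# The exact product of two B-bit fractions generally needs 2B fractional bits.
x = Fraction(11, 16)                      # a B-bit signal sample
product = a1 * x
fits_in_2B = (product * 2 ** (2 * B)).denominator == 1
fits_in_B = (product * 2 ** B).denominator == 1
```

With these values the stored coefficient is -11/16 instead of -0.7, and the product -121/256 fits in 2B = 8 fractional bits but not in B = 4, which is why each product has to be quantized again.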
We cannot combine all quantization effects together into a single error analysis for the
following reasons:
1. The filter coefficients are quantized once during the design process and remain
constant in the implementation. Hence, the effect of coefficient quantization is to change
$H(z)$ and $H(e^{j\omega})$ from their ideal forms in a deterministic manner, and
the system is still linear. If the resulting design does not meet the specifications,
we can optimize it, redesign it, and restructure it to satisfy the specifications.
2. Signal quantization occurs in the operation of the filter and can be treated as a
random process. Each quantization operation can be modelled as producing white
noise that is uncorrelated with the other quantization noise sources in the filter. This
simplifies the noise analysis.
Digital filters must be designed carefully in order to minimize the
finite word length effects. These effects depend on the method used to represent
numbers.
13.2 REPRESENTATION OF NUMBERS
In a digital processor, or a computer, numbers are represented as combinations of a
finite number of binary digits or bits that take on the values of 0 and 1. There are
two common forms that are used to represent numbers in a digital processor: (a)
fixed-point and (b) floating-point representations. In the fixed-point representation,
the binary point is assumed to be in a fixed position, whereas in the floating-point
representation, the position of the binary point is variable. Floating-point
representations have the advantage of representing very large and very small numbers (large
dynamic range); hence, issues of overflow and scaling rarely come into play. However,
the spacing between adjacent floating-point values is not constant, that is, the
quantization step size is not uniform; instead, it is proportional to the magnitude of the
number.
13.2.1 Fixed-point Representation of Numbers
In a fixed-point representation, a number is represented as a string of bits with a binary
point. The bits to the left of the binary point represent the integer part of the number,
and the bits to the right of the binary point represent the fractional part of the number.
If the binary point is marked explicitly, then the binary number 1011.011 has the
decimal value of

$$X = (b_{-3}2^{3} + b_{-2}2^{2} + b_{-1}2^{1} + b_{0}2^{0}) + (b_{1}2^{-1} + b_{2}2^{-2} + b_{3}2^{-3})$$
$$= (8 + 0 + 2 + 1) + \left(0 + \tfrac{1}{4} + \tfrac{1}{8}\right) = 11.375$$

In general, any number X can be expressed as

$$X = (b_{-(n-1)} \cdots b_{-1} b_{0} \,.\, b_{1} b_{2} \cdots b_{m})_r = \sum_{i=-(n-1)}^{m} b_i r^{-i} \qquad (13.1)$$
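
The positional expansion above can be checked with a few lines of code. This is a minimal sketch; the function name `fixed_point_value` and the use of a text string with an explicit `.` for the radix point are assumptions made for illustration.

```python
def fixed_point_value(digits, radix=2):
    """Evaluate a digit string such as '1011.011' in the given radix."""
    integer_part, _, fraction_part = digits.partition(".")
    value = 0.0
    # Digits to the left of the radix point carry weights r**0, r**1, ...
    for power, digit in enumerate(reversed(integer_part)):
        value += int(digit, radix) * radix ** power
    # Digits to the right carry weights r**-1, r**-2, ...
    for power, digit in enumerate(fraction_part, start=1):
        value += int(digit, radix) * radix ** (-power)
    return value

example = fixed_point_value("1011.011")   # the binary number used in the text
```

For the binary string `1011.011` the function returns 11.375, in agreement with the expansion above.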
where $b_i$ represents the digit, $r$ is the radix or base, $n$ is the number of integer
digits, and $m$ is the number of fractional digits. If $r = 10$, then Eq. (13.1) becomes
the decimal representation of X, and the radix point between $b_0$ and $b_1$ is the decimal point.
Similarly, if $r = 2$, then Eq. (13.1) becomes the binary representation of X, and the
radix point between $b_0$ and $b_1$ is the binary point. The binary point does not exist
physically in the processor.
When a B-bit integer format ($n = B$, $m = 0$) is used to represent unsigned integers,
then the range is from 0 to $2^B - 1$, that is,

$$X = (b_{1-B} \cdots b_{-1} b_{0})_2 = \sum_{i=1-B}^{0} b_i 2^{-i} = b_{1-B}2^{B-1} + b_{2-B}2^{B-2} + \cdots + b_{-1}2^{1} + b_{0}2^{0}$$

When $b_i = 0$ for all $i$, then $X_{\min} = 0$, and when $b_i = 1$ for all $i$, then $X_{\max} = 2^B - 1$. For
example, when a 4-bit integer format ($n = 4$ and $m = 0$) is used to represent unsigned
integers, then the range is from 0 to $2^4 - 1 = 15$.
When a B-bit fraction format ($n = 0$ and $m = B$) is used to represent unsigned
fractions, then the range is from 0 to $1 - 2^{-B}$, that is,

$$X = (0.b_1 b_2 \cdots b_B)_2 = \sum_{i=1}^{B} b_i 2^{-i} = b_1 2^{-1} + b_2 2^{-2} + \cdots + b_B 2^{-B}$$
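
The two ranges quoted above (0 to 2^B - 1 for unsigned integers, 0 to 1 - 2^-B for unsigned fractions) can be confirmed by enumerating every bit pattern. The sketch below assumes B = 4 and lists the bits most significant first; the helper names are illustrative only.

```python
from itertools import product

def integer_value(bits):
    """Value of an unsigned integer whose bits are given MSB first."""
    B = len(bits)
    return sum(b * 2 ** (B - 1 - k) for k, b in enumerate(bits))

def fraction_value(bits):
    """Value of an unsigned fraction with bits b_1, b_2, ..., b_B."""
    return sum(b * 2.0 ** (-i) for i, b in enumerate(bits, start=1))

B = 4                                           # assumed word length
patterns = list(product((0, 1), repeat=B))      # all 2**B bit patterns
integer_range = (min(map(integer_value, patterns)),
                 max(map(integer_value, patterns)))
fraction_range = (min(map(fraction_value, patterns)),
                  max(map(fraction_value, patterns)))
```

For B = 4 the enumeration gives the integer range (0, 15) and the fraction range (0, 0.9375), that is, 0 to 1 - 2^-4.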
For unsigned fractions,
1 loll
=> 001010
‘The negat!
omplement form.
Q-24)-\x|=2- 76-8 :
   
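Two's Complement Form
A negative number can be obtained by adding 1 to the least significant bit of its
one's complement. In two's complement form, a fractional number X is represented as

$$X_{2C} = \begin{cases} 0.b_1 b_2 \cdots b_B & \text{for } X \ge 0 \text{ (positive number)} \\ 1.\bar{b}_1 \bar{b}_2 \cdots \bar{b}_B + 0.00 \cdots 01 & \text{for } X < 0 \text{ (negative number)} \end{cases}$$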
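where $\bar{b}_i = 1 - b_i$ is the one's complement of $b_i$. Note that for $X < 0$, we have

$$X_{2C} = [1.\bar{b}_1 \bar{b}_2 \cdots \bar{b}_B] + [0.00 \cdots 01]$$
$$= \left[1 \cdot 2^{0} + \sum_{i=1}^{B} (1 - b_i) 2^{-i}\right] + \left[0 \cdot 2^{0} + 0 \cdot 2^{-1} + 0 \cdot 2^{-2} + \cdots + 1 \cdot 2^{-B}\right]$$
$$= \left[2 - 2^{-B} - |X|\right] + 2^{-B}$$
$$X_{2C} = 2 - |X|$$

where $|X|$ is the magnitude of X. Therefore, in two's complement form, the
negative number is obtained by subtracting the magnitude from 2. We can
write Eq. (13.6) as

$$X_{2C} = \begin{cases} X & \text{for } X \ge 0 \text{ (positive number)} \\ 2 - |X| & \text{for } X < 0 \text{ (negative number)} \end{cases}$$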
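
The rule "use X for X >= 0 and 2 - |X| for X < 0" translates directly into code. The following is a sketch under stated assumptions: the function name `twos_complement_fraction` is hypothetical, the word has one sign bit plus B fractional bits, and the bit pattern is returned as text with a `.` marking the binary point.

```python
from fractions import Fraction

def twos_complement_fraction(x, B):
    """Return the (B+1)-bit two's complement pattern 's.b1...bB' of x."""
    x = Fraction(x)
    if not -1 <= x <= 1 - Fraction(1, 2 ** B):
        raise ValueError("x is outside the representable range")
    code = x if x >= 0 else 2 - abs(x)      # X for X >= 0, 2 - |X| for X < 0
    scaled = code * 2 ** B                  # integer if x fits in B fractional bits
    if scaled.denominator != 1:
        raise ValueError("x needs more than B fractional bits")
    bits = format(int(scaled), "0{}b".format(B + 1))
    return bits[0] + "." + bits[1:]

positive = twos_complement_fraction(Fraction(5, 8), 4)
negative = twos_complement_fraction(Fraction(-5, 8), 4)
```

With B = 4, the fraction 5/8 maps to `0.1010` and -5/8 maps to `1.0110`, which is 2 - 5/8 = 11/8 written in binary.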
 
Example 13.2 Express the fractions $\tfrac{5}{8}$ and $-\tfrac{5}{8}$ in
two's complement form. Assume B = 4.
Solution:
In two's complement form, the positive fraction $\tfrac{5}{8}$ is represented by 0.1010.
For the negative fraction $-\tfrac{5}{8}$, the two's
complement is obtained by subtracting $|X| = \tfrac{5}{8} = 0.1010$ from $2 = 10.0000$. That is,

$$2 = 10.0000$$
$$|X| = \tfrac{5}{8} = 0.1010$$
$$2 - |X| = 1.0110$$

The negative fraction $-\tfrac{5}{8}$ is represented by 1.0110.

[...] Table 13.1 shows that when two's complement is used, there is only one
representation of zero, and $-1$ is also represented. Therefore, signed numbers
are represented in two's complement form in most of the digital signal
processors.
Floating-point representation allows numbers to be represented with a large dynamic
range. Therefore, floating-point arithmetic can reduce the problem of overflow that
occurs in fixed-point representation. In floating-point representation, a number X can
be expressed as

$$X = M \times 2^{E} \qquad (13.9)$$

where M is the mantissa and E is the exponent. Both mantissa and exponent
have sign bits and can be represented in any fixed-point format. E is an integer
and M is a fraction, such that $\tfrac{1}{2} \le |M| < 1$. When M is in this range,
the floating-point representation is said to be normalized. For example,
[...] are represented by 0.101101 [...], respectively.
In floating-point format, both mantissa and exponent are stored in
registers.
When two floating-point numbers are added, their exponents must be equal. If the
exponents are different, then one of them must be adjusted before addition can
take place. Multiplication is accomplished by multiplying mantissas, adding exponents,
and adjusting the product.
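
Python's standard `math.frexp` happens to use the same normalization, returning a mantissa with 1/2 <= |M| < 1 and an integer exponent, so it gives a quick way to experiment with $X = M \times 2^{E}$. The sketch below is illustrative only: the helper names are assumptions, and the value 0.17578125 (binary 0.00101101) is chosen here as an example.

```python
import math

def normalize(x):
    """Split x into (M, E) with x = M * 2**E and 1/2 <= |M| < 1 for x != 0."""
    return math.frexp(x)

def fp_multiply(a, b):
    """Multiply two (M, E) pairs: multiply mantissas, add exponents, renormalize."""
    (ma, ea), (mb, eb) = a, b
    m, shift = math.frexp(ma * mb)     # bring the product mantissa back into range
    return m, ea + eb + shift

mantissa, exponent = normalize(0.17578125)          # binary 0.00101101
product = fp_multiply(normalize(0.75), normalize(6.0))
product_value = math.ldexp(*product)                # back to an ordinary number
```

Here 0.00101101 normalizes to the mantissa 0.703125 (binary 0.101101) with exponent -2, and the product 0.75 x 6.0 comes back as 4.5 after the mantissas are multiplied and the exponents added.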
The Institute of Electrical and Electronics Engineers (IEEE) introduced a standard for
representing floating-point numbers in a 32-bit format, called the IEEE-754 standard.
According to this standard, the 32-bit single-precision, floating-point number uses one sign bit, 8
bits for the exponent, and 23 bits for the fractional mantissa. The 32-bit format is shown
in Fig. 13.2, where S is the sign bit, and E and M represent the exponent and mantissa
fields, respectively. The IEEE-754 standard for the 32-bit single-precision, floating-point
number is represented by

[Fig. 13.2: 32-bit format with fields S (sign, 1 bit), E (exponent, 8 bits), and M (mantissa, 23 bits)]

$$X = (-1)^{S} \times M \times 2^{E-127} \qquad (13.10)$$
where the mantissa M is coded as $0 \le M < 1$ and the exponent is coded to be
biased as $E - 127$. To get both positive and negative exponents, the bias is
provided by an integer; usually the bias is chosen as $2^7 - 1 = 127$ when the exponent
E is 8 bits. Without the bias, an 8-bit integer number varies from 0 to 255, but
with a bias of 127, the exponent varies from $-127$ to 128 (the end values E = 0 and
E = 255 are reserved for the special cases listed below). In order to increase
the range of the mantissa, one integer bit is added to M, so that it is
represented as 1.M. Now it is assumed to be normalized, but this bit is not counted in
the total word length. The standard also specifies the following properties:
1. When $0 < E < 255$, then $X = (-1)^{S} \times 1.M \times 2^{E-127}$.
2. When $E = 0$ and $M \ne 0$, then $X = (-1)^{S} \times 0.M \times 2^{-126}$.
3. When $E = 255$ and $M \ne 0$, then X is not a number and is denoted as NaN.
4. When $E = 255$ and $M = 0$, then $X = (-1)^{S} \times \infty$.
5. When $E = 0$ and $M = 0$, then $X = (-1)^{S} \times 0$.
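
The five cases can be turned into a small decoder and compared against the machine's own interpretation of the same 32 bits. This is a sketch, not a reference implementation: `decode_single` and `reference` are hypothetical names, and the comparison relies on Python's standard `struct` module packing the word as a big-endian unsigned integer and unpacking it as a single-precision float.

```python
import math
import struct

def decode_single(word):
    """Decode a 32-bit pattern using the five IEEE-754 cases listed above."""
    S = (word >> 31) & 0x1
    E = (word >> 23) & 0xFF
    M = (word & 0x7FFFFF) / 2 ** 23        # fractional mantissa, 0 <= M < 1
    sign = -1.0 if S else 1.0
    if 0 < E < 255:                        # case 1: normalized, 1.M
        return sign * (1 + M) * 2.0 ** (E - 127)
    if E == 0:                             # cases 2 and 5: 0.M (includes signed zero)
        return sign * M * 2.0 ** -126
    if M != 0:                             # case 3: not a number
        return math.nan
    return sign * math.inf                 # case 4: signed infinity

def reference(word):
    """Let the platform interpret the same 32 bits as a float."""
    return struct.unpack(">f", struct.pack(">I", word))[0]
```

For instance, the pattern `0x3F800000` (S = 0, E = 127, M = 0) decodes to 1.0, and `0x00000001` decodes to the smallest denormalized value $2^{-149}$.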
        
       
    
 
 
 
  
where 1.M denotes the normalized mantissa with one integer bit and 23 fractional
bits, and 0.M denotes the fractional mantissa with 23 bits. The IEEE-754 format has a
dynamic range from $1.18 \times 10^{-38}$ to $3.4 \times 10^{38}$.

[Fig. 13.3 Quantization process model]
    
[...] nearest number that [...] can be stored [...]. If x is quantized, then Q[x] [...]
as shown in Fig. 13.3. [...]
13.3.1 Truncation
 
13.3.2 Rounding
Binary rounding is the process in which it is needed to decide mathematically
whether to add binary 1 to the preceding B bits. If the (B + 1)th bit is binary 0,
then binary 1 will not be added to the preceding B bits. If the (B + 1)th bit is
binary 1, then binary 1 will be added to the preceding B bits.
For example, if B = 3, then we can have $2^3 = 8$ different values corresponding
to 000, 001, 010, 011, 100, 101, 110, and 111. Suppose B = 3 bits represent a
number that stands for the voltage of a signal. Then, the $2^3 = 8$ different voltage levels
are 0V, 1V, 2V, 3V, 4V, 5V, 6V, and 7V, and the step size is 1V. An input
sample of 0V would be converted to 000; an input sample of 1V would be converted to
001; an input of 2V would be converted to 010, and so on. Suppose the input
sample is x = 1.75V, which is represented as x = 1.75 = 001.11; then L = 5. Would
this continuous amplitude sample be converted to 001 or 010? The answer depends
upon the type of quantization (truncation or rounding) used to convert this
continuous amplitude sample into a discrete amplitude sample. If the type of quantization
is truncation, then all values from 1.0V up to but not including 2.0V are converted
to 001.

$$Q[x] = Q[1.75] = Q[001.11] \Rightarrow 001$$

If the type of quantization is rounding, then all values from 0.5V up to but not including
1.5V are converted to 001. Values from 1.5V up to but not including 2.5V are converted
to 010.

$$Q[x] = Q[1.75] = Q[001.11] \Rightarrow 010$$
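
The voltage example can be reproduced with a small quantizer. This is a minimal sketch that assumes non-negative inputs and a step size of 1V, as in the example; the names `quantize` and `level_code` are illustrative only.

```python
import math

def quantize(x, step=1.0, mode="truncate"):
    """Quantize a non-negative sample x to a multiple of step."""
    if mode == "truncate":
        return math.floor(x / step) * step          # discard the extra bits
    if mode == "round":
        return math.floor(x / step + 0.5) * step    # move to the nearest level
    raise ValueError("mode must be 'truncate' or 'round'")

def level_code(level, B=3):
    """B-bit binary code of an integer-valued quantization level."""
    return format(int(level), "0{}b".format(B))

truncated = level_code(quantize(1.75, mode="truncate"))
rounded = level_code(quantize(1.75, mode="round"))
```

For x = 1.75V the truncating quantizer yields the code `001` and the rounding quantizer yields `010`, matching the two cases described above.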
[Figure: panel (a) Truncation]
[...] Consider a number x [...] in signed magnitude form. Quantize it by
(a) truncation and (b) rounding, assuming that we have a [...]-bit register. [...]
[...] 0, 0.25, 0.5, [...], $-0.5$, and $-0.75$. Their binary
representation is [...] 0.10, 0.11, 1.00, 1.01, [...]

[...] The multiplication in decimal form gives $0.875 \times 0.75 = 0.65625$. In binary
form, $0.111 \times 0.110 = 0.101010$. When we keep only the most significant
B = 3 bits of the product by quantization, we get

$$Q[0.101010] = 0.101 = 0.625$$

The error is $0.65625 - 0.625 = 0.03125$.

If we add two fixed-point fractions, then overflow can occur. For example, the sum
of the two 4-bit fractions 0.1101 and 0.1000 is 1.0101. That is, the sum cannot be
contained in a 4-bit register.
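
The product-quantization error of 0.03125 and the overflow in the 4-bit addition can both be verified with exact rational arithmetic. The sketch below is only an illustration: `truncate` is a hypothetical helper, and the operands 7/8 and 3/4 are assumed to be the factors whose product is 0.65625.

```python
import math
from fractions import Fraction

def truncate(x, B):
    """Keep B fractional bits of a non-negative fraction, discarding the rest."""
    return Fraction(math.floor(x * 2 ** B), 2 ** B)

a = Fraction(7, 8)                # 0.111 in binary
b = Fraction(3, 4)                # 0.110 in binary
product = a * b                   # 21/32, i.e. 0.101010 in binary
quantized = truncate(product, 3)  # keep the 3 most significant fractional bits
error = product - quantized

total = Fraction(13, 16) + Fraction(8, 16)   # 0.1101 + 0.1000
overflow = total >= 1                         # a 4-bit fraction must stay below 1
```

The exact product is 21/32 = 0.65625, the truncated product is 5/8 = 0.625, and the difference is 1/32 = 0.03125. The sum 21/16 exceeds the largest 4-bit fraction, so it overflows.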
  
Example 13.7 This example shows that overflow can occur when adding two fractional numbers.
Solution:
Signed magnitude form: Consider a 4-bit signed magnitude number in which we will add two
numbers $x_1$ and $x_2$. If the sum of these two numbers is within the dynamic range, then
the result will be correct and there is no overflow.
1. Let $x_1 = \tfrac{1}{4} = 0.010$ and $x_2 = \tfrac{3}{8} = 0.011$.
The correct sum is $x_1 + x_2 = \tfrac{5}{8}$, which is within
the dynamic range. Simple binary addition gives
$x_1 + x_2 = 0.010 + 0.011 = 0.101 = \tfrac{5}{8}$, which is
correct because there is no overflow.
2. Let $x_1 = \tfrac{7}{8} = 0.111$ and $x_2 = \tfrac{3}{8} = 0.011$.
The correct sum is $x_1 + x_2 = \tfrac{5}{4}$,
which is greater than the dynamic range. Simple
binary addition gives $x_1 + x_2 = 0.111 + 0.011 =
1.010 = -\tfrac{1}{4}$, which is incorrect because there
was an overflow into the sign bit.
Two's complement form: Consider a 4-bit two's
complement number in which we will add
two numbers $x_1$ and $x_2$. The dynamic range of a 4-bit two's
complement number is $[-1, \tfrac{7}{8}]$.
If the sum is within this range, then the result will
be correct. Otherwise, there is an overflow.
1. Let $x_1 = -\tfrac{1}{4} = 1.110$ and $x_2 = -\tfrac{5}{8} = 1.011$.
The correct sum is $x_1 + x_2 = -\tfrac{7}{8}$, which is within
the dynamic range. [...]
2. [...] Simple binary addition gives $1.100 = -\tfrac{1}{2}$, which is incorrect because there
was an overflow into the sign bit.

Example 13.8 Prove that the quantizer (truncation or rounding) is a non-linear system.
Solution: [...]

Example 13.9 Compare fixed-point and floating-point representations.
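
The overflow behaviour of Example 13.7 and the non-linearity claimed in Example 13.8 can be checked numerically. The following is a sketch under stated assumptions: words are 4 bits (one sign bit plus three fractional bits), the carry out of the word is simply discarded, and the helper names and test values are chosen here for illustration.

```python
import math

def add_bits(a, b):
    """Add two equal-width bit patterns and discard the carry out of the word."""
    width = len(a)
    total = (int(a, 2) + int(b, 2)) % (1 << width)
    return format(total, "0{}b".format(width))

def signed_magnitude(bits):
    """Value of a sign bit followed by fractional magnitude bits."""
    magnitude = int(bits[1:], 2) / 2 ** (len(bits) - 1)
    return -magnitude if bits[0] == "1" else magnitude

def twos_complement(bits):
    """Value of a two's complement fraction with one sign bit."""
    n = int(bits, 2)
    if bits[0] == "1":
        n -= 1 << len(bits)
    return n / 2 ** (len(bits) - 1)

def truncate(x, B=3):
    """Truncating quantizer with B fractional bits."""
    return math.floor(x * 2 ** B) / 2 ** B

no_overflow = signed_magnitude(add_bits("0010", "0011"))    # 1/4 + 3/8
with_overflow = signed_magnitude(add_bits("0111", "0011"))  # 7/8 + 3/8
tc_sum = twos_complement(add_bits("1110", "1011"))          # -1/4 + (-5/8)

# Non-linearity: quantizing a sum differs from summing the quantized parts.
x1 = x2 = 0.0625
quantized_sum = truncate(x1 + x2)
sum_of_quantized = truncate(x1) + truncate(x2)
```

The first addition gives 5/8 as expected, the second wraps into the sign bit and reads as -1/4 instead of 5/4, and the two's complement sum -7/8 stays in range. For the quantizer, Q[x1 + x2] = 0.125 while Q[x1] + Q[x2] = 0, so superposition does not hold.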
[...] the mantissa is given by
[...]
Since $\tfrac{1}{2} \le M < 1$, we can write
[...]
(13.29)
 
    
       
    
  
  
  
    
  
 
where the partial derivative $\partial p_i / \partial a_k$ represents the incremental change in the pole $p_i$
due to a change in the coefficient $a_k$. The total error $\Delta p_i$ is the sum of all the incremental
errors in each of the coefficients $a_k$.
To find $\partial p_i / \partial a_k$, use the chain rule
Bait : [ oa] wn (13.30)
       
BPs
oo, [25
 
 
”%
The numerator of Eq. (13.30) is
 
  
    
: e ‘ 5 1
ie) 52 Sa’ | ‘=P (43.31)
alana” BU lem
The denominator of Eq. (13.30) is
Bo
 
Pi
 
 
 
Lf
Oa,
 
oat
Now substituting Eq. (13.33) in Eq. (13.29), we get the total perturbation errop
Bol pp-1-k
4x = at Am
“OTL (ez
jst
a4i
This equation gives a measure of the sensitivity of the ith pole as a function of the
coefficient quantization errors $\Delta a_k$. The term $(p_i - p_j)$ in the denominator is a vector
between the two poles. Hence, if the poles are closely clustered, then $|p_i - p_j|$ is small,
thereby resulting in a large perturbation, $\Delta p_i$. This, in turn, would lead to a large
error in the frequency response. Narrowband filters are highly sensitive to coefficient
quantization because the poles are tightly clustered in these filters.
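
The effect of pole clustering can be observed numerically on a second-order denominator. This is a sketch under stated assumptions, not the derivation in the text: the denominator is taken as $z^2 + a_1 z + a_2$, only $a_1$ is perturbed, and the pole locations and perturbation size are hypothetical values chosen for illustration.

```python
import cmath

def poles(a1, a2):
    """Roots of z**2 + a1*z + a2 (assumed form of the denominator)."""
    disc = cmath.sqrt(a1 * a1 - 4 * a2)
    return (-a1 + disc) / 2, (-a1 - disc) / 2

def max_pole_shift(p1, p2, delta):
    """Largest pole movement when coefficient a1 is perturbed by delta."""
    a1, a2 = -(p1 + p2), p1 * p2            # coefficients that place the poles
    original = poles(a1, a2)
    shifted = poles(a1 + delta, a2)         # same root ordering for small delta
    return max(abs(s - o) for s, o in zip(shifted, original))

delta = 1e-5                                        # hypothetical coefficient error
clustered = max_pole_shift(0.92, 0.90, delta)       # poles 0.02 apart
separated = max_pole_shift(0.92, 0.50, delta)       # poles 0.42 apart
```

With these values the clustered pair moves by roughly 4.7e-4, while the well separated pair moves by roughly 2.2e-5, about twenty times less for the same coefficient error. This is the behaviour predicted by the factor $(p_i - p_j)$ in the denominator.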
The perturbation $\Delta p_i$ is minimum if the length $|p_i - p_j|$ is maximum. We can maximize
the length $|p_i - p_j|$ by implementing the filter as a cascade or parallel combination of
first- and/or second-order sections. The real poles can be implemented as first-order
sections. For each of these first-order sections, there is only one pole, and hence,
there is no clustering of poles. The complex conjugate pole pairs can be implemented as second-order
sections. Since the complex conjugate poles are usually far apart, the perturbation errors are small.
A formula analogous to Eq. (13.34) can be obtained for the sensitivity of the zeros
to errors in the parameters $b_k$. For a cascade form realization, the zeros can be factored
into first- and second-order factors. For a parallel form realization, we use partial fraction
expansion. Therefore, each first-order section has no zero and each second-order section
has one zero. Therefore, there is no clustering of zeros for each individual section in either
cascade or parallel form. Thus, the coefficient quantization has less effect in cascade and
parallel realizations when compared with the direct form realizations.
13.6.1 Coefficient Sensitivity Analysis of a Second-order Direct Form II Structure
Consider a second-order filter with system function

$$H(z) = \frac{1}{1 - 2r\cos\theta\, z^{-1} + r^{2} z^{-2}} = \frac{z^{2}}{z^{2} - 2r\cos\theta\, z + r^{2}}$$

The poles are $p_1 = re^{j\theta}$ and $p_2 = re^{-j\theta}$. It has two coefficients [...]
The structure is shown in Fig. 13.7. With infinite precision [...]