
Finite Word Length

Uploaded by

SHIVANI SHARMA
CHAPTER 13
Analysis of Finite Word Length Effects

13.1 INTRODUCTION

In discussing filter realizations (Chapter 10), we have assumed that the signals, filter coefficients, and results of arithmetic operations such as addition and multiplication were represented with infinite precision. In practice, they can be represented only to a finite precision. Precision is defined as the spacing (or difference) between two consecutive numbers that can be obtained with a given number of bits. All digital processors have finite word length. Once we design a digital filter, it must be implemented on a finite word length machine. The reductions in filter performance caused by moving from infinite precision to finite precision are called finite word length effects. Consider, for example, the first-order filter

$$y(n) = -a_1 y(n-1) + b_0 x(n) + b_1 x(n-1)$$

In implementing this filter, we must deal with the following problems:

1. The input signal values $x(n)$ must be represented with a finite number of bits, which introduces quantization error. The process of representing the data with a finite number of bits is known as quantization.
2. The filter coefficients ($a_1$, $b_0$, and $b_1$) cannot all be represented exactly. They must be quantized for representation with a finite number of bits, which may introduce errors in the filter frequency response. Its poles and zeros are no longer in the desired locations. As a result, the singularities in the z-plane are limited to lie on a grid of allowable locations.
3. If the coefficients are quantized to $B$ bits and the signals are quantized to $B$ bits, then the product terms $-a_1 y(n-1)$, $b_0 x(n)$, and $b_1 x(n-1)$ will each be $2B$ bits long. All the product terms must be quantized. Quantization of the products introduces noise into the filter, which has the effect of reducing the signal-to-noise ratio (SNR) of the filter.
4. Another source of error occurs as a result of the summing operation. The sum of several $B$-bit numbers will not necessarily fit within $B$ bits. When the sum is too large to fit, it results in overflow, which can cause large errors and degrade the SNR.
Overflow can also yield high-amplitude oscillations in infinite impulse response (IIR) filters.

We cannot combine all quantization effects together into a single error analysis for the following reasons:

1. The filter coefficients are quantized once, during the design process, and the quantized coefficients remain constant in the implementation. Hence, the effect of coefficient quantization is to perturb $H(z)$ and $H(e^{j\omega})$ from their ideal forms in a deterministic manner, and the system is still linear. If the resulting design does not meet the specifications, we can optimize it, redesign it, and restructure it to satisfy the specifications.
2. Signal quantization occurs in the operation of the filter and can be treated as a random process. Each quantization operation can be modelled as producing white noise that is uncorrelated with the other quantization noise sources in the filter. This simplifies the noise analysis.

Digital filters must be designed carefully in order to minimize the finite word length effects. These effects depend on the method used to represent numbers.

13.2 REPRESENTATION OF NUMBERS

In a digital processor, or a computer, numbers are represented as combinations of a finite number of binary digits, or bits, that take on the values 0 and 1. There are two common forms used to represent numbers in a digital processor: (a) fixed-point and (b) floating-point representations. In the fixed-point representation, the binary point is assumed to be in a fixed position, whereas in the floating-point representation, the position of the binary point is variable. Floating-point representations have the advantage of representing very large and very small numbers (large dynamic range); hence, issues of overflow and scaling rarely come into play.
However, the spacing between adjacent floating-point values is not constant; that is, the quantization step size is not uniform. Instead, it is proportional to the magnitude of the number.

13.2.1 Fixed-point Representation of Numbers

In a fixed-point representation, a number is represented as a string of bits with a binary point. The bits to the left of the binary point represent the integer part of the number, and the bits to the right of the binary point represent the fractional part of the number. If $\diamond$ denotes the location of the binary point, then the binary number $1011\diamond011$ has a decimal value of

$$(b_3 2^3 + b_2 2^2 + b_1 2^1 + b_0 2^0) + (b_{-1} 2^{-1} + b_{-2} 2^{-2} + b_{-3} 2^{-3}) = (8 + 0 + 2 + 1) + (0 + 0.25 + 0.125) = 11.375$$

In general, any number $X$ can be expressed as

$$X = (b_{n-1} \cdots b_1 b_0 \diamond b_{-1} b_{-2} \cdots b_{-m})_r = \sum_{i=-m}^{n-1} b_i r^i \tag{13.1}$$

where $b_i$ represents the digit, $r$ is the radix or base, $n$ is the number of integer digits, and $m$ is the number of fractional digits. If $r = 10$, then Eq. (13.1) becomes the decimal representation of $X$, and the radix point $\diamond$ between $b_0$ and $b_{-1}$ is the decimal point. Similarly, if $r = 2$, then Eq. (13.1) becomes the binary representation of $X$, and the radix point $\diamond$ between $b_0$ and $b_{-1}$ is the binary point. The binary point does not exist physically in the processor.

When a $B$-bit integer format ($n = B$, $m = 0$) is used to represent unsigned integers, then the range is from 0 to $2^B - 1$, that is,

$$X = (b_{B-1} b_{B-2} \cdots b_1 b_0 \diamond)_2 = \sum_{i=0}^{B-1} b_i 2^i = b_{B-1} 2^{B-1} + b_{B-2} 2^{B-2} + \cdots + b_1 2^1 + b_0 2^0$$

When $b_i = 0$ for all $i$, then $X_{\min} = 0$, and when $b_i = 1$ for all $i$, then $X_{\max} = 2^B - 1$. For example, when a 4-bit integer format ($n = 4$ and $m = 0$) is used to represent unsigned integers, then the range is from 0 to $2^4 - 1 = 15$.

When a $B$-bit fraction format ($n = 0$, $m = B$) is used to represent unsigned fractions, then the range is from 0 to $1 - 2^{-B}$, that is,

$$X = (0 \diamond b_1 b_2 \cdots b_B)_2 = \sum_{i=1}^{B} b_i 2^{-i} = b_1 2^{-1} + b_2 2^{-2} + \cdots + b_B 2^{-B}$$
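The positional expansion of Eq. (13.1) and the unsigned ranges above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `fixed_point_value` is hypothetical, and an ordinary `.` stands in for the binary point.

```python
def fixed_point_value(digits: str, radix: int = 2) -> float:
    """Evaluate a digit string with at most one radix point, e.g. '1011.011'."""
    integer_part, _, fraction_part = digits.partition(".")
    value = 0.0
    # Integer digits carry the weights r^0, r^1, ... counted from the radix point.
    for position, digit in enumerate(reversed(integer_part)):
        value += int(digit, radix) * radix ** position
    # Fractional digits carry the weights r^-1, r^-2, ...
    for position, digit in enumerate(fraction_part, start=1):
        value += int(digit, radix) * radix ** -position
    return value


if __name__ == "__main__":
    print(fixed_point_value("1011.011"))  # the worked example from the text
    print(fixed_point_value("1111"))      # largest 4-bit unsigned integer
    print(fixed_point_value("0.1111"))    # largest 4-bit unsigned fraction
```

The last two calls reproduce the upper limits $2^B - 1$ and $1 - 2^{-B}$ for $B = 4$.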
In one's complement form, a negative number is represented as $(2 - 2^{-B}) - |X|$.

Two's Complement Form

In two's complement form, fractional $X$ is represented as

$$X_{2C} = \begin{cases} 0 \diamond b_1 b_2 \cdots b_B & \text{for } X \ge 0 \text{ (positive number)} \\ (1 \diamond \bar{b}_1 \bar{b}_2 \cdots \bar{b}_B) + (0 \diamond 00 \cdots 01) & \text{for } X < 0 \text{ (negative number)} \end{cases} \tag{13.6}$$

where $\bar{b}_i = 1 - b_i$ is the one's complement of $b_i$. Note that for $X < 0$, we have

$$X_{2C} = [1 \diamond \bar{b}_1 \bar{b}_2 \cdots \bar{b}_B] + [0 \diamond 00 \cdots 01]$$
$$= [1 \cdot 2^0 + \bar{b}_1 2^{-1} + \cdots + \bar{b}_B 2^{-B}] + [0 \cdot 2^0 + 0 \cdot 2^{-1} + \cdots + 1 \cdot 2^{-B}]$$
$$= [2 - 2^{-B} - |X|] + 2^{-B} = 2 - |X|$$

where $|X|$ is the magnitude of $X$. Therefore, in two's complement form, the negative number is obtained by subtracting the magnitude from 2. We write Eq. (13.6) as

$$X_{2C} = \begin{cases} |X| & \text{for } X \ge 0 \text{ (positive number)} \\ 2 - |X| & \text{for } X < 0 \text{ (negative number)} \end{cases}$$

Table 13.1 shows that when two's complement is used, there is only one representation for zero, and $-1$ is also represented. Therefore, signed numbers are represented in two's complement form in most digital signal processors.

Example 13.2 Express the fractions $\frac{5}{8}$ and $-\frac{5}{8}$ in two's complement form. Assume $B = 4$.

Solution: In two's complement form, $\frac{5}{8}$ is represented as $0 \diamond 1010$. The two's complement of $-\frac{5}{8}$ is obtained by subtracting $|X| = \frac{5}{8} = 0 \diamond 1010$ from $2 = 10 \diamond 0000$. That is,

$$2 = 10 \diamond 0000$$
$$|X| = \tfrac{5}{8} = 0 \diamond 1010$$
$$2 - |X| = 1 \diamond 0110$$

The negative fraction $-\frac{5}{8}$ is represented by $1 \diamond 0110$.

Floating-point representation allows numbers to be represented with a large dynamic range. Therefore, floating-point arithmetic can reduce the problem of overflow that occurs in fixed-point representation. In floating-point representation, a number $X$ can be expressed as

$$X = M \times 2^E \tag{13.9}$$

where $M$ is the mantissa and $E$ is the exponent. Both mantissa and exponent have sign bits and can be represented in any fixed-point format. $E$ is an integer and $M$ is a fraction, such that $\frac{1}{2} \le M < 1$. When $M$ is in this range, the floating-point representation is said to be normalized.

In floating-point format, both mantissa and exponent are stored in registers. When two floating-point numbers are added, their exponents must be equal, so one of them must be adjusted before addition can take place. Multiplication is accomplished by multiplying mantissas, adding exponents, and adjusting the product.

The Institute of Electrical and Electronics Engineers (IEEE) introduced a standard for representing floating-point numbers in a 32-bit format, called the IEEE-754 standard. According to this standard, the 32-bit single-precision floating-point number uses one sign bit, 8 bits for the exponent, and 23 bits for the fractional mantissa. The 32-bit format is shown in Fig. 13.2, where $S$ is the sign bit, and $E$ and $M$ represent the exponent and mantissa fields, respectively.

Fig. 13.2 IEEE-754 32-bit format: sign S (1 bit), exponent E (8 bits), mantissa M (23 bits)

The IEEE-754 standard for the 32-bit single-precision floating-point number is represented by

$$X = (-1)^S \times M \times 2^{E-127} \tag{13.10}$$

where the mantissa $M$ is coded as $0 \le M < 1$ and the exponent is coded to be biased as $E - 127$. To get both positive and negative exponents, the bias is provided by an integer; usually the bias is chosen as $2^7 - 1 = 127$ when the exponent $E$ is 8 bits. Without the bias, an 8-bit integer number varies from 0 to 255, but with a bias of 127, the exponent varies from $-127$ to 128, and the extreme codes $E = 0$ and $E = 255$ are reserved for the special cases listed below. In order to increase the range of the mantissa, one integer bit is added to $M$, so that the mantissa is represented as $1.M$. Now it is assumed to be normalized, but this bit is not stored, and so it does not add to the total word length. The standard also specifies the following properties:

1. When $0 < E < 255$, then $X = (-1)^S \times 1.M \times 2^{E-127}$.
2. When $E = 0$ and $M \ne 0$, then $X = (-1)^S \times 0.M \times 2^{-126}$.
3. When $E = 255$ and $M \ne 0$, then $X$ is not a number and is denoted as NaN.
4. When $E = 0$ and $M = 0$, then $X = (-1)^S \times 0$.
5. When $E = 255$ and $M = 0$, then $X = (-1)^S \times \infty$.

Here $1.M$ denotes the normalized mantissa with one integer bit and 23 fractional bits. The IEEE-754 single-precision format has a dynamic range from $1.18 \times 10^{-38}$ to $3.4 \times 10^{38}$.

If $x$ is the number to be quantized, then the quantized number is denoted by $Q[x]$, as shown in Fig. 13.3.

Fig. 13.3 Quantization process model

13.3.1 Truncation

In truncation, the bits beyond the $B$ retained bits are simply discarded.

13.3.2 Rounding

Binary rounding is the process in which the $(B+1)$th bit is needed to mathematically decide whether to add binary 1 to the preceding $B$ bits. If the $(B+1)$th bit is binary 0, then binary 1 will not be added to the preceding $B$ bits. If the $(B+1)$th bit is binary 1, then binary 1 will be added to the preceding $B$ bits.

For example, if $B = 3$, then we can have $2^3 = 8$ different values corresponding to 000, 001, 010, 011, 100, 101, 110, and 111. Suppose the $B = 3$ bits represent the voltage of a signal. Then, the $2^3 = 8$ different voltage levels are 0 V, 1 V, 2 V, 3 V, 4 V, 5 V, 6 V, and 7 V, and the step size is 1 V. An input sample of 0 V would be converted to 000; an input sample of 1 V would be converted to 001; an input of 2 V would be converted to 010, and so on. Suppose the input sample is $x = 1.75$ V, which is represented as $x = 1.75 = 001.11$, so that the unquantized word length is $L = 5$ bits. Would this continuous amplitude sample be converted to 001 or 010? The answer depends upon the type of quantization (truncation or rounding) used to convert this continuous amplitude sample into a discrete amplitude sample. If the type of quantization is truncation, then all values from 1.0 V up to but not including 2.0 V are converted to 001:

$$Q[x] = Q[1.75] = Q[001.11] \Rightarrow 001$$

If the type of quantization is rounding, then all values from 0.5 V up to but not including 1.5 V are converted to 001. Values from 1.5 V up to but not including 2.5 V are converted to 010.
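The two quantization rules in the voltage example can be sketched in Python as follows. The function names are hypothetical, the sketch handles only non-negative samples measured in units of the step size, and saturating at the largest code is an added assumption that the text does not discuss. Adding half a step before discarding the fractional bits is equivalent to adding binary 1 at the $(B+1)$th bit position and then truncating.

```python
import math


def quantize_level(x: float, mode: str) -> int:
    """Map a non-negative sample (in units of the step size) to an integer level."""
    if mode == "truncate":
        return math.floor(x)        # discard the fractional bits
    if mode == "round":
        return math.floor(x + 0.5)  # add half a step, then discard the fractional bits
    raise ValueError(f"unknown mode: {mode}")


def to_code(x: float, mode: str, bits: int = 3) -> str:
    """Return the B-bit code word for x; saturation at the top code is an assumption."""
    level = min(quantize_level(x, mode), 2 ** bits - 1)
    return format(level, f"0{bits}b")


if __name__ == "__main__":
    print(to_code(1.75, "truncate"))  # 001
    print(to_code(1.75, "round"))     # 010
```

With this rounding rule a sample exactly halfway between two levels, such as 1.5 V, goes to the upper level, which matches the intervals stated in the text.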
With rounding, the sample $x = 1.75$ V is therefore converted to 010:

$$Q[x] = Q[1.75] = Q[001.11] \Rightarrow 010$$

With 3 bits (including the sign bit) in sign-magnitude form, the representable numbers are 0, 0.25, 0.5, 0.75, $-0$, $-0.25$, $-0.5$, and $-0.75$. Their binary sign-magnitude representations are $0 \diamond 00$, $0 \diamond 01$, $0 \diamond 10$, $0 \diamond 11$, $1 \diamond 00$, $1 \diamond 01$, $1 \diamond 10$, and $1 \diamond 11$.

The product of two fixed-point fractions must also be quantized. For example, a product whose exact value is 0.65625 in decimal form is $0 \diamond 101010$ in binary form. When this product is truncated to three fractional bits, we get

$$Q[0 \diamond 101010] = 0 \diamond 101 = 0.625$$

The error is $0.65625 - 0.625 = 0.03125$.

If we add two fixed-point fractions, then overflow can occur. For example, the sum of the 4-bit fractions $0 \diamond 1101$ and $0 \diamond 1000$ is $1 \diamond 0101$. That is, the sum cannot be contained in the register.

Example 13.7 This example shows that overflow can occur when adding two fractional numbers.

Solution:

Sign-magnitude form

Consider a 4-bit sign-magnitude number system in which we will add two numbers $x_1$ and $x_2$. If the sum of these two numbers is within the dynamic range $\left(-\frac{7}{8}, \frac{7}{8}\right)$, then the result is correct and there is no overflow.

1. Let $x_1 = \frac{2}{8} = 0 \diamond 010$ and $x_2 = \frac{3}{8} = 0 \diamond 011$. The correct sum is $x_1 + x_2 = \frac{5}{8}$, which is within the dynamic range. Simple binary addition gives $x_1 + x_2 = 0 \diamond 010 + 0 \diamond 011 = 0 \diamond 101 = \frac{5}{8}$, which is correct because there is no overflow.
2. Let $x_1 = \frac{7}{8} = 0 \diamond 111$ and $x_2 = \frac{3}{8} = 0 \diamond 011$. The correct sum is $x_1 + x_2 = \frac{10}{8}$, which is greater than the dynamic range. Simple binary addition gives $x_1 + x_2 = 0 \diamond 111 + 0 \diamond 011 = 1 \diamond 010 = -\frac{2}{8}$, which is incorrect because there was an overflow into the sign bit.

Two's complement form

Consider a 4-bit two's complement number system in which we will add two numbers $x_1$ and $x_2$. The dynamic range of a 4-bit two's complement number is $\left(-1, \frac{7}{8}\right)$. If the sum of these two numbers is within this range, then the result will be correct. Otherwise, there is an overflow.

Example 13.8 Prove that the quantizer (truncation or rounding) is a non-linear system.

Example 13.9 Compare fixed-point and floating-point representations.

For a filter with $N$ poles whose denominator polynomial is $D(z) = 1 + \sum_{k=1}^{N} a_k z^{-k} = \prod_{j=1}^{N} (1 - p_j z^{-1})$, the total perturbation of the pole $p_i$ caused by the coefficient errors $\Delta a_k$ is

$$\Delta p_i = \sum_{k=1}^{N} \frac{\partial p_i}{\partial a_k} \Delta a_k \tag{13.29}$$

The partial derivative $\partial p_i / \partial a_k$ represents the incremental change in the pole $p_i$ due to a change in the coefficient $a_k$, and the total error $\Delta p_i$ is the sum of all the incremental errors in each of the coefficients $a_k$. To find $\partial p_i / \partial a_k$, use the chain rule

$$\frac{\partial p_i}{\partial a_k} = \frac{\left[\partial D(z) / \partial a_k\right]_{z = p_i}}{\left[\partial D(z) / \partial p_i\right]_{z = p_i}} \tag{13.30}$$

The numerator of Eq. (13.30) is

$$\left[\frac{\partial D(z)}{\partial a_k}\right]_{z = p_i} = \left[z^{-k}\right]_{z = p_i} = p_i^{-k} \tag{13.31}$$

The denominator of Eq. (13.30) is

$$\left[\frac{\partial D(z)}{\partial p_i}\right]_{z = p_i} = \left[-z^{-1} \prod_{j=1, j \ne i}^{N} (1 - p_j z^{-1})\right]_{z = p_i} = -p_i^{-N} \prod_{j=1, j \ne i}^{N} (p_i - p_j) \tag{13.32}$$

Therefore,

$$\frac{\partial p_i}{\partial a_k} = \frac{-p_i^{N-k}}{\prod_{j=1, j \ne i}^{N} (p_i - p_j)} \tag{13.33}$$

Now substituting Eq. (13.33) in Eq. (13.29), we get the total perturbation error

$$\Delta p_i = -\sum_{k=1}^{N} \frac{p_i^{N-k}}{\prod_{j=1, j \ne i}^{N} (p_i - p_j)} \Delta a_k \tag{13.34}$$

This equation gives a measure of the sensitivity of the $i$th pole as a function of the coefficient quantization errors $\Delta a_k$. The term $(p_i - p_j)$ in the denominator is a vector between the two poles. Hence, if the poles are closely clustered, then $|p_i - p_j|$ is small, thereby resulting in a large perturbation $\Delta p_i$. This, in turn, would lead to a large error in the frequency response. Narrowband filters are highly sensitive to coefficient quantization because the poles are tightly clustered in these filters.

The perturbation $\Delta p_i$ is minimum if the length $|p_i - p_j|$ is maximum. We can maximize the length $|p_i - p_j|$ by implementing the filter as a cascade or parallel combination of first- and/or second-order sections. The real poles can be implemented as first-order sections. For each of these first-order sections, there is only one pole, and hence no clustering. The complex conjugate poles can be implemented as second-order sections. Since the complex conjugate poles are usually far apart, the perturbation is small.

A formula analogous to Eq. (13.34) can be obtained for the sensitivity of the zeros to errors in the parameters $b_k$. For a cascade form realization, the zeros can be grouped into first- and second-order factors. For a parallel form realization, we use partial fraction expansion. Therefore, each first-order section has no zero and each second-order section has one zero. Therefore, there is no clustering of zeros for each individual section implemented in cascade or parallel form. Thus, the coefficient quantization has less effect in cascade and parallel realizations when compared with the direct form realizations.

13.6.1 Coefficient Sensitivity Analysis of a Second-order Direct Form II Structure

Consider a second-order filter with system function

$$H(z) = \frac{1}{1 - 2r\cos\theta\, z^{-1} + r^2 z^{-2}} = \frac{z^2}{z^2 - 2r\cos\theta\, z + r^2}$$

Its poles are $p_1 = re^{j\theta}$ and $p_2 = re^{-j\theta}$. It has two coefficients, $2r\cos\theta$ and $r^2$. The structure is shown in Fig. 13.7.
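The clustering argument can be checked numerically for a second-order section. The sketch below is illustrative only: it assumes the denominator is written as $1 + a_1 z^{-1} + a_2 z^{-2}$ with $a_1 = -2r\cos\theta$ and $a_2 = r^2$ (a sign convention chosen here), uses hypothetical function names, and rounds the coefficients to 8 fractional bits as an arbitrary choice. For $N = 2$ the first-order prediction reduces to $\Delta p_i \approx -(p_i \Delta a_1 + \Delta a_2)/(p_i - p_j)$.

```python
import cmath
import math


def upper_pole(a1: float, a2: float) -> complex:
    """Upper-half-plane root of z^2 + a1*z + a2 = 0 (complex-pole case)."""
    return (-a1 + cmath.sqrt(a1 * a1 - 4.0 * a2)) / 2.0


def quantize(value: float, frac_bits: int) -> float:
    """Round to the nearest multiple of 2^-frac_bits."""
    step = 2.0 ** -frac_bits
    return math.floor(value / step + 0.5) * step


def predicted_shift(p_i: complex, p_j: complex, da1: float, da2: float) -> complex:
    """First-order pole shift for N = 2: -(p_i*da1 + da2) / (p_i - p_j)."""
    return -(p_i * da1 + da2) / (p_i - p_j)


def pole_shift_report(r: float, theta: float, frac_bits: int):
    """Return (pole spacing, actual |shift|, predicted |shift|) after quantization."""
    a1, a2 = -2.0 * r * math.cos(theta), r * r
    q1, q2 = quantize(a1, frac_bits), quantize(a2, frac_bits)
    p_i = upper_pole(a1, a2)
    p_j = p_i.conjugate()
    actual = upper_pole(q1, q2) - p_i
    predicted = predicted_shift(p_i, p_j, q1 - a1, q2 - a2)
    return abs(p_i - p_j), abs(actual), abs(predicted)


if __name__ == "__main__":
    # theta = 0.1 puts the conjugate poles close together; theta = pi/2 keeps them apart.
    for theta in (0.1, math.pi / 2):
        spacing, actual, predicted = pole_shift_report(0.95, theta, frac_bits=8)
        print(f"theta={theta:.3f} spacing={spacing:.4f} "
              f"actual={actual:.2e} predicted={predicted:.2e}")
```

Comparing the two printed rows shows how the same coefficient word length affects a closely spaced pole pair versus a widely spaced one. The prediction is a first-order approximation, so it agrees with the actual shift only when the coefficient errors are small.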
