Chapter 5 – Floating Point Numbers
· Floating point representation is used to represent real numbers (i.e. numbers with fractions)
· Floating point representation supports a huge range with a reasonable storage size
· Example: a 32-bit storage size can represent a number
o As large as 10^45
o As small as 10^-45
· Floating point representation suffers from the following main disadvantages:
o Potential loss of precision due to limited number of significant digits
o Relatively large storage requirements
o Slow calculations
· This chapter covers:
o Review of exponential notation
o Floating point representation in computers
o Floating point calculations
o IEEE 754 floating point standard
o Packed Decimal Format (BCD)
o Overflow and Underflow Conditions
Review of Exponential Notation
· Exponential notation or scientific notation is a conventional method for representing floating
point numbers
· Exponential notation format consists of 6 components
o Mantissa
o Sign of the mantissa
o Exponent
o Sign of the exponent
o Base of the exponent
o Location of the fraction point
Example: -50.5 x 10^-20
· Fraction point position is flexible and can be adjusted without changing the number magnitude
· Moving the fraction point requires an adjustment to the exponent
o For every move to the right, the exponent must be decremented
o For every move to the left, the exponent must be incremented
Examples
The number -50.5 x 10^-20 can be represented as
-505. x 10^-21
-.505 x 10^-18
-.000505 x 10^-15
· However, shifting should not be arbitrary, since it may affect the precision of the number
· If we limit the mantissa to 5 digits, the last representation results in the loss of 1 digit of precision (i.e. an error)
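As a quick check, the following Python sketch uses the standard `decimal` module to confirm that the four forms above denote the same value, and shows what a 5-digit mantissa does to the last form. Dropping the extra digits (rather than rounding them) is an assumption made only for illustration.

```python
from decimal import Decimal, ROUND_DOWN

# The four spellings of the same number from the example above.
forms = ["-50.5E-20", "-505E-21", "-0.505E-18", "-0.000505E-15"]
values = [Decimal(f) for f in forms]
print(all(v == values[0] for v in values))  # True

# Keep only 5 mantissa digits after the point (assumed: extra digits are dropped).
mantissa = Decimal("0.000505")
kept = mantissa.quantize(Decimal("0.00001"), rounding=ROUND_DOWN)
print(kept)  # 0.00050
```

The final 5 of the mantissa no longer fits, which is the one-digit precision loss described above.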
Floating Point Representation in Computers
· Computers use a representation method very similar to the exponential notation
· Binary is used instead of decimal
· Storage sizes of 32, 64, and 128 bits are typically used
· See Figure 5.4 on page 132 for a typical 32-bit floating point format
o Leftmost bit is the mantissa sign
o Followed by an 8-bit exponent
o Followed by a 23-bit mantissa
o The fraction point is implied to be at the beginning of the mantissa
o Exponent is stored in excess-128 notation
o Base of the exponent is implied as base 2
· Computers use many different proprietary and standard floating point representation methods
· The following standard is in common use and will be studied later in the chapter
o IEEE 754 Floating Point representation standard
Floating Point Representation Review
· Assume the following decimal floating point format: SMMMMMMM
(S = mantissa sign, M = mantissa, decimal point is implied at the end)
· The above format provides a range of: ±9999999
· Let’s introduce 2 exponent digits (EE) and an exponent sign in place of 3 mantissa digits: SSEEMMMM (the second S is the exponent sign)
· The above format provides a range of: ±0001 x 10^-99 to ±9999 x 10^99
· With this representation we have traded off 3 digits of precision (2 for the exponent and 1 for its sign) to increase the range
· There exists a trade-off between precision and range
o The more digits assigned for the mantissa, the higher the precision and lower the range
o The more digits assigned for the exponent the higher the range and lower the precision
· Floating point formats have the following attributes
o A number is assigned a storage space (i.e. fixed number of bits)
o The storage space is divided into 4 parts
§ Mantissa sign
§ Mantissa
§ Exponent sign
§ Exponent
o The following remaining parts are implied and hence do not need to be stored
§ Exponent base
§ Fraction point position
· There are a number of trade-offs that need to be considered when designing a floating point
format:
o Storage size
§ Increase precision and range
§ But also increase storage requirements
o Base of the exponent
§ Binary base provides low range capability but requires simple calculations
§ Higher base (e.g. hex) provides high range but results in more complex calculations
o Location of binary point
§ Usually positioned at the beginning of number to provide maximum precision
o Number of bits to use for the exponent
§ The higher the number, the higher the range and the lower the precision, and vice versa
o Number of bits to use for the mantissa
§ The higher the number, the higher the precision and the lower the range, and vice versa
o Method to handle the sign for the exponent
§ Sign free representation is required
§ 2’s complement can be used but excess-N is more common
o Method to handle the sign for the mantissa
§ Sign free representation is not required
§ Sign-and-magnitude is typically used
§ 2’s complement can also be used but is less common
Example Floating Point Format
o SEEMMMMM format
o Excess-50
o Base 10 exponent
o Base 10 mantissa
o Implied decimal point at the beginning of the number
· This format provides a range as small as: ±.00001 x 10^-50 and as large as: ±.99999 x 10^+49
Excess-N
· One important consideration in floating point is
o How to handle the sign of the exponent
o 2’s complement is an obvious solution
o However, the Excess-N method is more commonly used
· N is a predefined value separating the positive range from the negative one
o Value ≥ N is positive
o Value < N is negative
· See Figure 5.1 on page 125 for Excess-50 representation
· Excess-N provides the following important advantages over 2’s complement
o Simpler in calculation
o More flexible as N can be adjusted to adjust the range of positives and negatives
§ The smaller the N the larger the positive range and the smaller the negative
range
§ The larger the N the smaller the positive range and the larger the negative range
Conversion from Excess–N to Sign-and-Magnitude
· Subtract N from the stored exponent
Examples
1. Convert 30 represented in Excess-50 to sign-and-magnitude representation
= 30 – 50 = -20
2. Convert 60 represented in Excess-50 to sign-and-magnitude representation
= 60 – 50 = 10
Conversion from Sign-and-Magnitude Notation to Excess–N
· Add N to the exponent
Examples
1. Convert –10 represented in sign-and-magnitude to Excess-50
= 50 + (–10) = 40
2. Convert 0 represented in sign-and-magnitude to Excess-50
= 50 + 0 = 50
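The two conversions are a single subtraction and a single addition. A minimal Python sketch, where the function names `from_excess` and `to_excess` are illustrative and N = 50 is taken from the examples above:

```python
EXCESS = 50  # N for the Excess-50 examples above

def from_excess(stored: int, n: int = EXCESS) -> int:
    """Return the signed exponent for a stored Excess-N value."""
    return stored - n

def to_excess(exponent: int, n: int = EXCESS) -> int:
    """Return the stored Excess-N value for a signed exponent."""
    return exponent + n

print(from_excess(30))  # -20
print(from_excess(60))  # 10
print(to_excess(-10))   # 40
print(to_excess(0))     # 50
```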
Normalization and Formatting of Floating Point Numbers
· Normalization is the process of eliminating leading zeros from the mantissa
· The objective of normalization is to maximize precision given the number of digits limitation
· Normalization can only be performed if the exponent has enough range
Examples
1. Normalize .0003 x 10^20
= .3 x 10^17
2. Normalize .0003 x 10^-20
= .3 x 10^-23
3. Normalize .0003 x 10^-98 assuming 2 exponent digits
Cannot be normalized, as there is not enough range in the exponent (the result would need 10^-101)
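The three cases above can be reproduced with a short Python sketch. It assumes the mantissa is passed as the string of digits after the implied decimal point, and that a signed two-digit exponent bottoms out at -99; the function name `normalize` and both conventions are illustrative choices, not part of any standard.

```python
def normalize(mantissa: str, exponent: int, min_exp: int = -99):
    """Strip leading zeros from a fractional mantissa given as a digit string.

    "0003" means .0003. Each removed zero moves the point one place to the
    right, so the exponent is decremented by one per zero.
    """
    stripped = mantissa.lstrip("0")
    if not stripped:
        return mantissa, exponent  # zero has no normalized form
    new_exp = exponent - (len(mantissa) - len(stripped))
    if new_exp < min_exp:
        raise ValueError("not enough exponent range to normalize")
    return stripped, new_exp

print(normalize("0003", 20))   # ('3', 17)
print(normalize("0003", -20))  # ('3', -23)
try:
    normalize("0003", -98)
except ValueError as exc:
    print(exc)                 # not enough exponent range to normalize
```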
Converting from real number to Floating Point format
· The following steps provide the method to convert an integer or real number to floating point
format:
1. Convert the number to exponential notation format
2. Place the fraction point to its proper position
3. Normalize the number
4. Convert exponent from sign-and-magnitude to Excess-N
5. Store the number in the floating point format
Examples
Given the following floating point format
o SEEMMMMM format
o Use 0 for positive and 5 for negative
o Excess-50
o Base 10 exponent
o Base 10 mantissa
o Implied decimal point at the beginning of the mantissa
· Convert 246.8035 into the above floating point format
1. Convert to exponential notation format = 246.8035 x 10^0
2. Set decimal point to proper position = .2468035 x 10^3
3. Normalize: already normalized
4. Convert exponent to Excess-N = 50 + 3 = 53
5. Store in floating point format = 05324680
· Convert – .00000075 into the above floating point format
1. Convert to exponential notation format = -.00000075 x 10^0
2. Set decimal point to proper position: already in proper position
3. Normalize = -.75 x 10^-6
4. Convert exponent to Excess-N = 50 + (-6) = 44
5. Store in floating point format = 54475000
· Convert 1255 x 10-3 into the above floating point format
1. Convert to exponential notation format = 1255. x 10^-3
2. Set decimal point to proper position = .1255 x 10^1
3. Normalize: already normalized
4. Convert to Excess-N = 50 + 1 = 51
5. Store in floating point format = 05112550
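The five steps can be sketched in Python with the standard `decimal` module, which keeps the arithmetic in base 10. The function name `encode` is illustrative. Rounding the mantissa half-up to 5 digits and rejecting zero are assumptions of this sketch, since the steps above do not specify either.

```python
from decimal import Decimal, ROUND_HALF_UP

def encode(value: str) -> str:
    """Encode a decimal string in the SEEMMMMM format (Excess-50, 0/5 sign)."""
    x = Decimal(value)
    if x == 0:
        raise ValueError("zero is not covered by this sketch")
    sign = "5" if x < 0 else "0"
    x = abs(x)
    # Steps 1-3: move the point so that the mantissa has the form .dxxxx, d != 0.
    exponent = x.adjusted() + 1
    mantissa = x.scaleb(-exponent).quantize(Decimal("0.00001"),
                                            rounding=ROUND_HALF_UP)
    if mantissa >= 1:            # rounding carried over, e.g. .999996
        mantissa, exponent = Decimal("0.10000"), exponent + 1
    # Step 4: convert the exponent to Excess-50.
    stored = exponent + 50
    if not 0 <= stored <= 99:
        raise OverflowError("exponent does not fit in two digits")
    # Step 5: assemble sign digit, exponent digits and mantissa digits.
    return f"{sign}{stored:02d}{str(mantissa)[2:]}"

print(encode("246.8035"))     # 05324680
print(encode("-0.00000075"))  # 54475000
print(encode("1255E-3"))      # 05112550
```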
Converting from Floating Point format to real number
· The following steps provide the method to convert from floating point format to real number
format
1. Convert the mantissa sign digit to (+ or -)
2. Convert from Excess-N to sign-and-magnitude
3. Convert to exponential notation format
4. Convert to real number format
Examples
· Assume the SEEMMMMM floating point format
· Convert 05324657 to real number
1. Convert the sign digit = +
2. Convert from Excess-N to sign-and-magnitude = 53 – 50 = 3
3. Convert to exponential notation format = .24657 x 10^3
4. Convert to real number format = 246.57
· Convert 54810000 to real number
1. Convert the sign digit = -
2. Convert from Excess-N to sign-and-magnitude = 48 – 50 = -2
3. Convert to exponential notation format = -.10000 x 10^-2
4. Convert to real number format = - .001
· Convert 05112550 to real number
1. Convert the sign digit = +
2. Convert from Excess-N to sign-and-magnitude = 51 – 50 = 1
3. Convert to exponential notation format = .12550 x 10^1
4. Convert to real number format = 1.255
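The reverse direction is shorter. A minimal Python sketch of the four steps, assuming the same SEEMMMMM layout (sign digit 0 or 5, Excess-50); the function name `decode` is illustrative:

```python
from decimal import Decimal

def decode(word: str) -> Decimal:
    """Decode an 8-digit SEEMMMMM word into a decimal value."""
    sign = -1 if word[0] == "5" else 1       # step 1: sign digit
    exponent = int(word[1:3]) - 50           # step 2: Excess-50 to signed
    mantissa = Decimal("0." + word[3:])      # step 3: .MMMMM x 10^exponent
    return sign * mantissa.scaleb(exponent)  # step 4: apply the exponent

# Decimal keeps trailing zeros, so compare values rather than printed text.
print(decode("05324657") == Decimal("246.57"))  # True
print(decode("54810000") == Decimal("-0.001"))  # True
print(decode("05112550") == Decimal("1.255"))   # True
```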
Floating Point Calculations
· Floating point arithmetic is more complex and costly than integer arithmetic
· The exponent and the mantissa must be computed separately
Addition and Subtraction
· Addition/subtraction is done using the following method
1. Align the exponents (the number with the smaller exponent is shifted until its exponent matches the larger exponent)
2. Add/subtract the mantissas
3. Place decimal point in proper position if necessary
4. Store number in the floating point format
Examples
· Assume the SEEMMMMM floating point format
1. Add 05199520 + 04967850
1. Align the second number exponent = 0510067850
2. Add the mantissas = .99520 + .0067850 = 1.0019850
3. Adjust decimal point = .10019850 and Exponent = 51 + 1 = 52
4. Store in floating point format = 05210020
2. Subtract 05199520 - 04967850
1. Align the second number exponent = 0510067850
2. Subtract the mantissas = .99520 – .0067850 = .9883250
3. Adjust decimal point already in proper position
4. Store in floating point format = 05198833
Multiplication
· Alignment is not necessary when performing multiplication
· Multiplication is done using the following method
1. Multiply the two mantissas
2. Add the two exponents and subtract N
3. Normalize if necessary
4. Store number in the floating point format
Examples
· Assume the SEEMMMMM floating point format
o Excess-50
o Base 10 exponent
o Base 10 mantissa
o Implied decimal point at the beginning of the number
· Multiply 05220000 x 04712500
1. Multiply the 2 mantissas = .20000 x .12500 = .02500
2. Compute exponent = 52 + 47 – 50 = 49
3. Normalize result = .25000 and adjust exponent = 49 – 1 = 48
4. Store in floating point format = 04825000
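A minimal Python sketch of the multiplication steps, working directly on the stored digits with integer arithmetic. The function name `multiply` is illustrative, and half-up rounding of the product to 5 digits is an assumption of the sketch.

```python
N = 50  # Excess-50

def multiply(a: str, b: str) -> str:
    """Multiply two SEEMMMMM words (sign digit 0 or 5, Excess-50)."""
    sign = "0" if a[0] == b[0] else "5"
    # Step 1: multiply the mantissas; .MMMMM x .MMMMM gives a 10-digit fraction.
    product = int(a[3:]) * int(b[3:])
    if product == 0:
        raise ValueError("zero is not covered by this sketch")
    # Step 2: add the stored exponents and subtract N once.
    exponent = int(a[1:3]) + int(b[1:3]) - N
    # Step 3: normalize while the leading fraction digit is zero.
    while product < 10**9:
        product *= 10
        exponent -= 1
    mantissa = (product + 50_000) // 100_000  # round to 5 digits
    if mantissa == 100_000:                   # rounding carried over
        mantissa, exponent = 10_000, exponent + 1
    if not 0 <= exponent <= 99:
        raise OverflowError("exponent does not fit in two digits")
    # Step 4: store.
    return f"{sign}{exponent:02d}{mantissa:05d}"

print(multiply("05220000", "04712500"))  # 04825000
```

Because both stored exponents already include the offset of 50, adding them counts the offset twice, which is why N is subtracted once.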
Division
· Alignment is not necessary when performing division
· Division is done by
1. Divide the two mantissas
2. Subtract the second number's exponent from the first number's exponent and add N
3. Place decimal point in proper position
4. Store number in the floating point format
Examples
· Divide 05275000 ÷ 05025000
1. Divide the 2 mantissas = .75000 ÷ .25000 = 3.00000
2. Compute exponent = 52 – 50 + 50 = 52
3. Place decimal point in proper position = .30000 and adjust exponent = 52 + 1 = 53
4. Store in floating point format = 05330000
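A minimal Python sketch of the division steps, again using integer arithmetic on the stored digits. The function name `divide` is illustrative. It assumes both operands are normalized (leading mantissa digit non-zero) and rounds the quotient half-up to 5 digits.

```python
N = 50  # Excess-50

def divide(a: str, b: str) -> str:
    """Divide SEEMMMMM word a by word b (sign digit 0 or 5, Excess-50)."""
    if a[3] == "0" or b[3] == "0":
        raise ValueError("operands must be normalized and non-zero")
    sign = "0" if a[0] == b[0] else "5"
    ma, mb = int(a[3:]), int(b[3:])
    # Step 2: subtract the stored exponents and add N back once.
    exponent = int(a[1:3]) - int(b[1:3]) + N
    # Steps 1 and 3: divide the mantissas. If the quotient is 1 or more,
    # the point moves one place left and the exponent is incremented.
    scale = 10**5
    if ma >= mb:
        scale = 10**4
        exponent += 1
    mantissa = (2 * ma * scale + mb) // (2 * mb)  # quotient rounded half-up
    if not 0 <= exponent <= 99:
        raise OverflowError("exponent does not fit in two digits")
    # Step 4: store.
    return f"{sign}{exponent:02d}{mantissa:05d}"

print(divide("05275000", "05025000"))  # 05330000
```

Subtracting one stored exponent from the other cancels the offset of 50, which is why N is added back.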
IEEE 754 Floating Point Standard
· IEEE has developed a standard for both 32-bit and 64-bit floating point representation
· The standard was targeted at personal computers (IBM-type PC and Apple Macintosh)
· Apple Macintosh also provides its own 80-bit format
· IEEE 754 defines a 32-bit format called single-precision floating point format
o Leftmost bit is the mantissa sign (0 for positive and 1 for negative)
o Followed by an 8-bit exponent
o Followed by a 24-bit mantissa (23 stored bits plus an implied leading bit that is always 1 for normalized numbers)
o Exponent is represented using Excess-127, which gives an exponent range of 2^-126 to 2^+127
Stored exponents 0 (2^-127) and 255 (2^+128) are reserved for special use
o Implied exponent base is 2
o Fraction point position is to the right of the leading mantissa bit
o Special numbers (e.g. 0, ∞, very small non-normalized numbers, etc.) are supported
o Supported precision is approximately 7 significant decimal digits
o Allows for an approximate range of 10^-45 to 10^+38
· IEEE 754 defines a 64-bit format called double-precision floating point format
o It works similarly to the single-precision format
o 11 bits for the exponent and 52 bits for the mantissa
o Supported precision is approximately 15 significant decimal digits
o Allows for an approximate range of 10^-308 to 10^+308
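The quoted limits can be checked in Python. The single-precision figures below are computed from the field widths described above. The double-precision figures are read from `sys.float_info`, which describes the interpreter's own `float` type, assumed here to be an IEEE 754 double, as it is on common platforms.

```python
import sys

# Single precision, from 23 stored fraction bits and exponents -126 to +127.
smallest_single = 2.0 ** -149                   # smallest non-normalized value
largest_single = (2 - 2.0 ** -23) * 2.0 ** 127  # largest normalized value
print(f"{smallest_single:.3e}")  # 1.401e-45
print(f"{largest_single:.3e}")   # 3.403e+38

# Double precision, as reported by the running interpreter.
print(sys.float_info.dig)  # 15 on IEEE 754 platforms
print(sys.float_info.max)  # 1.7976931348623157e+308 on IEEE 754 platforms
```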
Convert Decimal Real Number to IEEE 754 Floating Point Format
· The following steps provide the method to convert a decimal real number to IEEE 754 Floating
Point format:
o Convert the decimal number to binary
o Adjust binary point to proper position
o Normalize the number
o Convert exponent from sign-and-magnitude to Excess-127
o Convert exponent to binary
o Store the number in the floating point format
1. Convert 36.5 (base 10) to single-precision IEEE 754 floating point format
1. Convert to binary = 100100.1
2. Adjust binary point to proper position = 1.001001 x 2^5
3. Normalize: already normalized
4. Convert exponent to Excess-127 = 127 + 5 = 132
5. Convert exponent to binary = 10000100
6. Store in floating point format = 0 10000100 00100100000000000000000
2. Convert –0.25 to single-precision IEEE 754 floating point format
1. Convert to binary = -.01
2. Adjust binary point to proper position = -0.1 x 2^-1
3. Normalize = -1.0 x 2^-2
4. Convert exponent to Excess-127 = 127 - 2 = 125
5. Convert exponent to binary = 01111101
6. Store in floating point format = 1 01111101 00000000000000000000000
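The steps above can be sketched in Python and cross-checked against the bit pattern the machine produces through the standard `struct` module. The function names are illustrative. The manual encoder is a sketch only: it handles normalized, non-zero values that fit exactly in 23 fraction bits, and it ignores rounding, zero, infinities and NaN.

```python
import math
import struct

def encode_single(x: float) -> str:
    """Build the sign, exponent and fraction fields by hand."""
    sign = "1" if x < 0 else "0"
    m, e = math.frexp(abs(x))   # abs(x) == m * 2**e with 0.5 <= m < 1
    exponent = e - 1            # rewrite as 1.f x 2^(e-1)
    fraction = m * 2 - 1        # bits after the implied leading 1
    stored = exponent + 127     # Excess-127
    frac_bits = int(fraction * 2**23)
    return f"{sign} {stored:08b} {frac_bits:023b}"

def machine_single(x: float) -> str:
    """Ask the platform for the same 32 bits, for comparison."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{raw:032b}"
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"

print(encode_single(36.5))   # 0 10000100 00100100000000000000000
print(encode_single(-0.25))  # 1 01111101 00000000000000000000000
print(encode_single(36.5) == machine_single(36.5))  # True
```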
Convert from IEEE 754 Floating Point Format to real number
· The following steps provide the method to convert IEEE 754 Floating Point to decimal real
number:
o Convert exponent from binary to decimal
o Convert from Excess-127 to sign-and-magnitude
o Convert to exponent notation
o Remove exponent (if possible)
o Convert from binary to decimal real number
1. Convert 1 01111101 00000000000000000000000 to decimal real number
1. Convert exponent to decimal = 125
2. Convert Excess-127 to sign-and-magnitude = 125 – 127 = -2
3. Convert to exponent notation = -1.0 x 2^-2
4. Remove exponent = -0.01 (binary)
5. Convert to decimal real number = - 0.25
2. Convert 0 10000001 11001100000000000000000 to decimal real number
1. Convert exponent to decimal = 129
2. Convert Excess-127 to sign-and-magnitude = 129 – 127 = 2
3. Convert to exponent notation = 1.110011 x 2^2
4. Remove exponent = 111.0011
5. Convert to decimal real number = 7.1875
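A minimal Python sketch of the decoding steps, taking the three fields as a space-separated bit string in the layout used in the examples above. The function name `decode_single` is illustrative, and the reserved stored exponents 0 and 255 are rejected rather than interpreted.

```python
def decode_single(bits: str) -> float:
    """Decode 'S EEEEEEEE F...F' (1 + 8 + 23 bits) into a Python float."""
    sign, exp, frac = bits.split()
    stored = int(exp, 2)                  # exponent from binary to decimal
    if stored in (0, 255):
        raise ValueError("reserved exponent: not handled in this sketch")
    exponent = stored - 127               # Excess-127 to signed exponent
    mantissa = 1 + int(frac, 2) / 2**23   # restore the implied leading 1
    value = mantissa * 2.0 ** exponent
    return -value if sign == "1" else value

print(decode_single("1 01111101 00000000000000000000000"))  # -0.25
print(decode_single("0 10000001 11001100000000000000000"))  # 7.1875
```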
Packed Decimal Format (BCD)
· Floating point numbers may lose accuracy when converted to another base (e.g. decimal to binary and vice versa)
· Many applications, especially business applications that deal with money, require full accuracy of the numbers
· BCD satisfies the full accuracy objective
· BCD in floating point is very similar to the BCD used to represent integer numbers
· Many business-oriented high-level languages (e.g. COBOL) support the packed decimal format
· Figure 5.8, page 138 shows the 128-bit packed decimal format used in IBM 370/390 and VAX computers
· The format allows for 31 decimal digits (1 digit per 4 bits)
· The least significant 4 bits are used for the sign (1100 for +, 1101 for -)
· The location of the decimal point is not stored and must be maintained by the application program
Examples
· Convert -150.54 (base 10) to IBM 370/390 BCD Floating Point format
o Convert sign to BCD format = 1101
o Convert digit by digit to BCD format = 0001 0101 0000 0101 0100
o Pad with zeros to fill the entire storage space = 104 leading zeros are needed (128 - 4 sign bits - 20 digit bits)
o Convert to BCD format
= 0000 … 0000 (104 zero bits) 0001 0101 0000 0101 0100 1101
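The packing steps can be sketched in Python as follows. The function name `to_packed_decimal` is illustrative. As in the format described above, the position of the decimal point is simply dropped and would have to be tracked by the caller.

```python
def to_packed_decimal(value: str, total_bits: int = 128) -> str:
    """Pack a decimal string into 4-bit BCD digits followed by a sign nibble."""
    sign = "1101" if value.startswith("-") else "1100"
    # One 4-bit group per decimal digit; the decimal point is not stored.
    nibbles = [f"{int(c):04b}" for c in value if c.isdigit()]
    capacity = total_bits // 4 - 1        # 31 digits plus 1 sign nibble
    if len(nibbles) > capacity:
        raise OverflowError("too many digits for the storage size")
    padding = ["0000"] * (capacity - len(nibbles))
    return " ".join(padding + nibbles + [sign])

packed = to_packed_decimal("-150.54")
print(packed[-29:])                  # 0001 0101 0000 0101 0100 1101
print(len(packed.replace(" ", "")))  # 128
```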
Overflow and Underflow Conditions
· An overflow occurs when the number's magnitude is too large to be stored
· An underflow occurs when the number's magnitude is too small (too close to zero) to be stored
· See Figure 5.2 on page 115 for an illustration of overflow and underflow in floating point representation
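Both conditions can be observed in Python by forcing a value through 32-bit single precision with the standard `struct` module. The helper name `to_single` is illustrative. The behavior shown is that of CPython's `struct` module, which raises `OverflowError` for a value too large for the `f` format and quietly returns zero for one that is too small.

```python
import struct

def to_single(x: float) -> float:
    """Round-trip a Python float through 32-bit single precision."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Overflow: 10^39 is beyond the single-precision limit of about 3.4 x 10^38.
try:
    to_single(1e39)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)        # True

# Underflow: 10^-46 is below the smallest single-precision value (about 1.4 x 10^-45).
print(to_single(1e-46))  # 0.0
```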