Itec1000 Lecture Note 5

Chapter 5 discusses floating point representation for real numbers, highlighting its advantages such as a wide range and reasonable storage size, as well as disadvantages like potential precision loss and slow calculations. It covers topics including exponential notation, IEEE 754 standard, and methods for converting between real numbers and floating point format. The chapter also details arithmetic operations with floating point numbers, emphasizing the complexity compared to integer arithmetic.

Uploaded by

陈斯
Copyright © All Rights Reserved

Chapter 5 – Floating Point Numbers

·         Floating point representation is used to represent real numbers (i.e. numbers with fractions)
·         Floating point representation supports huge range with reasonable storage size
·         Example: a 32-bit storage size can represent a number
o        As large as about 10^38
o        As small as about 10^-45
 
·         Floating point representation suffers from the following main disadvantages:
o        Potential loss of precision due to limited number of significant digits
o        Relatively large storage requirements
o        Slow calculations
 
·         This chapter covers:
o        Review of exponential notation
o        Floating point representation in computers
o        Floating point calculations
o        IEEE 754 floating point standard
o        Packed Decimal Format (BCD)
o        Overflow and Underflow Conditions

Review of Exponential Notation

·         Exponential notation or scientific notation is a conventional method for representing floating point numbers
·         The exponential notation format consists of 6 components
o        Mantissa
o        Sign of the mantissa
o        Exponent
o        Sign of the exponent
o        Base of the exponent
o        Location of the fraction point
Example: -50.5 x 10^-20
 
·         Fraction point position is flexible and can be adjusted without changing the number magnitude
·         Changing the position of the fraction point requires a matching adjustment to the exponent
o        For every move to the right, the exponent must be decremented
o        For every move to the left, the exponent must be incremented
 
Examples
The number -50.5 x 10^-20 can be represented as
-505. x 10^-21
-.505 x 10^-18
-.000505 x 10^-15
 
·         However, shifting should not be arbitrary, since it may affect the precision of the number
·         If we limit the mantissa to 5 digits, the last representation (-.000505) becomes -.00050 and loses 1 significant digit (i.e. it introduces an error)
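
The equivalence of the shifted forms, and the error introduced by a 5-digit mantissa, can be checked with a minimal sketch. It assumes Python's standard `decimal` module; the variable names are illustrative only.

```python
from decimal import Decimal

# The four spellings of the same number from the examples above.
forms = ["-50.5E-20", "-505E-21", "-.505E-18", "-.000505E-15"]
values = [Decimal(f) for f in forms]

# Moving the fraction point and adjusting the exponent keeps the value.
print(all(v == values[0] for v in values))

# Keeping only 5 mantissa digits of -.000505 drops the final 5,
# so the stored value is no longer the original number.
truncated = Decimal("-.00050E-15")
print(truncated == values[0])
```

The first comparison holds and the second does not, which is the precision loss the last bullet describes.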

Floating Point Representation in Computers

·         Computers use a representation method very similar to exponential notation
·         Binary is used instead of decimal
·         Storage sizes of 32, 64, and 128 bits are typically used
·         See Figure 5.4 on page 132 for a typical 32-bit floating point format
o        Leftmost bit is the mantissa sign
o        Followed by an 8-bit exponent
o        Followed by a 23-bit mantissa
o        The fraction point is implied to be at the beginning of the mantissa
o        Exponent is stored in excess-128 notation
o        Base of the exponent is implied as base 2
·         Computers use many different proprietary and standard floating point representation methods
·         The following standard is in common use and will be studied later in the chapter
o        IEEE 754 floating point representation standard
 
Floating Point Representation Review
 
·         Assume the following decimal floating point format: SMMMMMMM
(S = mantissa sign, M = mantissa digit, decimal point is implied at the end)
·         The above format provides a range of: ±9999999
·         Let’s introduce 2 exponent digits (EE) and an exponent sign in place of 3 mantissa digits: SSEEMMMM (the second S is the exponent sign)
·         The above format provides a range of: ±0001 x 10^-99 to ±9999 x 10^99
·         With this representation we have traded off 3 digits of precision to increase the range
·         There is a trade-off between precision and range for a fixed storage size
o        The more digits assigned to the mantissa, the higher the precision and the lower the range
o        The more digits assigned to the exponent, the higher the range and the lower the precision
 
·         Floating point formats have the following attributes
o        A number is assigned a storage space (i.e. fixed number of bits)
o        The storage space is divided into 4 parts
§         Mantissa sign
§         Mantissa
§         Exponent sign
§         Exponent
o        The following remaining parts are implied and hence do not need to be stored
§         Exponent base
§         Fraction point position
 
·         There are a number of trade-offs that need to be considered when designing a floating point format:
o        Storage size
§         A larger size increases precision and range
§         But it also increases storage requirements
o        Base of the exponent
§         A binary base provides a smaller range but keeps calculations simple
§         A higher base (e.g. hex) provides a larger range but results in more complex calculations
o        Location of binary point
§         Usually positioned at the beginning of the number to provide maximum precision
o        Number of bits to use for the exponent
§         The higher the number, the higher the range and the lower the precision, and vice versa
o        Number of bits to use for the mantissa
§         The higher the number, the higher the precision and the lower the range, and vice versa
o        Method to handle the sign for the exponent
§         Sign-free representation is required
§         2’s complement can be used but excess-N is more common
o        Method to handle the sign for the mantissa
§         Sign-free representation is not required
§         Sign-and-magnitude is typically used
§         2’s complement can also be used but is less common
 
Example Floating Point Format
o        SEEMMMMM format
o        Excess-50
o        Base 10 exponent
o        Base 10 mantissa
o        Implied decimal point at the beginning of the number
·         This format provides a range as small as: ± .00001 x 10^-50 and as large as: ± .99999 x 10^+49
 
Excess-N
·         One important consideration in floating point is
o        How to handle the sign of the exponent
o        2’s complement is an obvious solution
o        However, the Excess-N method is more commonly used
·         N is a predefined offset: the stored value equals the actual exponent plus N
o        Stored value ≥ N represents a non-negative exponent
o        Stored value < N represents a negative exponent
·         See Figure 5.1 on page 125 for Excess-50 representation
 
·         Excess-N provides the following important advantages over 2’s complement
o        Simpler calculations
o        More flexible, as N can be adjusted to change the range of positives and negatives
§         The smaller the N, the larger the positive range and the smaller the negative range
§         The larger the N, the smaller the positive range and the larger the negative range
 
Conversion from Excess–N to Sign-and-Magnitude
·         Subtract N from the stored exponent
 
Examples
1. Convert 30 represented in Excess-50 to sign-and-magnitude representation
= 30 – 50 = -20
 
2. Convert 60 represented in Excess-50 to sign-and-magnitude representation
= 60 – 50 = 10
 
Conversion from Sign-and-Magnitude Notation to Excess–N
·         Add N to the exponent
 
Examples
1. Convert –10 represented in sign-and-magnitude to Excess-50
= 50 + (–10) = 40
 
2. Convert 0 represented in sign-and-magnitude to Excess-50
= 50 + 0  = 50
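
Both conversions are a single addition or subtraction, as the following minimal sketch suggests. The function names `from_excess` and `to_excess` are hypothetical, and N defaults to 50 to match the examples above.

```python
def from_excess(stored: int, n: int = 50) -> int:
    """Excess-N stored value -> signed (sign-and-magnitude) exponent."""
    return stored - n


def to_excess(exponent: int, n: int = 50) -> int:
    """Signed exponent -> Excess-N stored value."""
    return exponent + n


print(from_excess(30))   # -20
print(from_excess(60))   # 10
print(to_excess(-10))    # 40
print(to_excess(0))      # 50
```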
 
Normalization and Formatting of Floating Point Numbers
·         Normalization is the process of eliminating leading zeros from the mantissa
·         The objective of normalization is to maximize precision given the limited number of digits
·         Normalization can only be performed if the exponent has enough range
 
Examples
1. Normalize .0003 x 10^20
= .3 x 10^17

2. Normalize .0003 x 10^-20
= .3 x 10^-23

3. Normalize .0003 x 10^-98 assuming 2 exponent digits
Cannot be normalized: the result, .3 x 10^-101, needs an exponent below -99, which is outside the range of 2 exponent digits
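
The three normalization cases can be reproduced with a minimal sketch. It assumes the mantissa is passed as the string of digits after the fraction point and that the smallest usable exponent is -99; the function name `normalize` and the `None` return for the failing case are illustrative choices, not part of the notes.

```python
def normalize(digits: str, exponent: int, min_exp: int = -99):
    """Drop leading zeros from a mantissa given as digits after the point.

    Returns (digits, exponent), or None when the adjusted exponent would
    fall below min_exp and the number therefore cannot be normalized.
    """
    stripped = digits.lstrip("0")
    if not stripped:                      # mantissa is zero: nothing to shift
        return digits, exponent
    shift = len(digits) - len(stripped)   # number of leading zeros removed
    new_exp = exponent - shift            # each move right decrements the exponent
    if new_exp < min_exp:
        return None
    return stripped, new_exp


print(normalize("0003", 20))    # ('3', 17)
print(normalize("0003", -20))   # ('3', -23)
print(normalize("0003", -98))   # None
```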
 
Converting from real number to Floating Point format
·         The following steps provide the method to convert an integer or real number to floating point
format:
1. Convert the number to exponential notation format
2. Place the fraction point to its proper position
3. Normalize the number
4. Convert exponent from sign-and-magnitude to Excess-N
5. Store the number in the floating point format
 
Examples
Given the following floating point format
o        SEEMMMMM format
o        Use 0 for positive and 5 for negative
o        Excess-50
o        Base 10 exponent
o        Base 10 mantissa
o        Implied decimal point at the beginning of the mantissa
 
·         Convert 246.8035 into the above floating point format
1. Convert to exponential notation format = 246.8035 x 10^0
2. Set decimal point to proper position = .2468035 x 10^3
3. Normalize: already normalized
4. Convert exponent to Excess-N = 50 + 3 = 53
5. Store in floating point format = 05324680 (mantissa cut to 5 digits)

·         Convert -.00000075 into the above floating point format
1. Convert to exponential notation format = -.00000075 x 10^0
2. Set decimal point to proper position: already in proper position
3. Normalize = -.75 x 10^-6
4. Convert exponent to Excess-N = 50 + (-6) = 44
5. Store in floating point format = 54475000

·         Convert 1255 x 10^-3 into the above floating point format
1. Convert to exponential notation format = 1255. x 10^-3
2. Set decimal point to proper position = .1255 x 10^1
3. Normalize: already normalized
4. Convert to Excess-N = 50 + 1 = 51
5. Store in floating point format = 05112550
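
The five conversion steps can be sketched as below. This is an illustration under stated assumptions: the function name `encode` is hypothetical, extra mantissa digits are truncated (which agrees with the first example), and zero is rejected because the notes do not define how it is stored.

```python
from decimal import Decimal


def encode(value: str, excess: int = 50, width: int = 5) -> str:
    """Encode a decimal string into the SEEMMMMM format (0 = +, 5 = -)."""
    sign, digits, exp = Decimal(value).as_tuple()
    if not any(digits):
        raise ValueError("zero is not covered by this sketch")
    mantissa = "".join(map(str, digits))
    # Steps 1-3: read the value as .mantissa x 10^exponent (already normalized,
    # because Decimal keeps no leading zeros in its digit tuple).
    exponent = exp + len(mantissa)
    # Step 4: convert the exponent to Excess-N.
    stored = exponent + excess
    if not 0 <= stored <= 99:
        raise OverflowError("exponent does not fit in two digits")
    # Step 5: keep `width` mantissa digits (truncating) and assemble the word.
    mantissa = mantissa[:width].ljust(width, "0")
    return f"{'5' if sign else '0'}{stored:02d}{mantissa}"


print(encode("246.8035"))      # 05324680
print(encode("-0.00000075"))   # 54475000
print(encode("1255E-3"))       # 05112550
```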
 
Converting from Floating Point format to real number
·         The following steps provide the method to convert from floating point format to real number
format
1. Convert the mantissa sign digit to (+ or -)
2. Convert from Excess-N to sign-and-magnitude
3. Convert to exponential notation format
4. Convert to real number format
 
Examples
·         Assume the SEEMMMMM floating point format
 
·         Convert 05324657 to real number
1. Convert the sign digit = +
2. Convert from Excess-N to sign-and-magnitude = 53 – 50 = 3
3. Convert to exponential notation format = .24657 x 10^3
4. Convert to real number format = 246.57

·         Convert 54810000 to real number
1. Convert the sign digit = -
2. Convert from Excess-N to sign-and-magnitude = 48 – 50 = -2
3. Convert to exponential notation format = -.10000 x 10^-2
4. Convert to real number format = -.001

·         Convert 05112550 to real number
1. Convert the sign digit = +
2. Convert from Excess-N to sign-and-magnitude = 51 – 50 = 1
3. Convert to exponential notation format = .12550 x 10^1
4. Convert to real number format = 1.255
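
The reverse conversion can be sketched in the same way. The function name `decode` is hypothetical; `Decimal` is used so that the decimal digits are reproduced exactly rather than through binary floating point.

```python
from decimal import Decimal


def decode(word: str, excess: int = 50) -> Decimal:
    """Decode an SEEMMMMM word (0 = +, 5 = -) into a decimal value."""
    sign = -1 if word[0] == "5" else 1           # step 1: sign digit
    exponent = int(word[1:3]) - excess           # step 2: Excess-N to signed
    mantissa = Decimal("0." + word[3:])          # step 3: .MMMMM x 10^exponent
    return sign * mantissa * Decimal(10) ** exponent   # step 4: real number


for word in ("05324657", "54810000", "05112550"):
    print(word, "->", decode(word))
```

The printed values may carry trailing zeros from the 5-digit mantissa, but they compare equal to 246.57, -0.001, and 1.255.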

Floating Point Calculations

·         Floating point arithmetic is more complex and costly than integer arithmetic
·         The exponent and the mantissa have to be computed separately
 
Addition and Subtraction
·         Addition/subtraction is done using the following method
1. Align the exponents (the number with the smaller exponent is shifted right until its exponent matches the larger exponent)
2. Add/subtract the mantissas
3. Place decimal point in proper position if necessary
4. Store number in the floating point format
 
Examples
·         Assume the SEEMMMMM floating point format
 
1. Add 05199520 + 04967850
1. Align the second number's exponent = 0 51 0067850 (mantissa shifted right 2 places, exponent raised from 49 to 51)
2. Add the mantissas = .99520 + .0067850 = 1.0019850
3. Adjust decimal point = .10019850 and exponent = 51 + 1 = 52
4. Store in floating point format = 05210020 (mantissa rounded to 5 digits)
 
2. Subtract 05199520 - 04967850
1. Align the second number's exponent = 0 51 0067850 (mantissa shifted right 2 places, exponent raised from 49 to 51)
2. Subtract the mantissas = .99520 – .0067850 = .9884150
3. Adjust decimal point: already in proper position
4. Store in floating point format = 05198842 (mantissa rounded to 5 digits)
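
The align, add, adjust, and store steps can be sketched with integer arithmetic. The names `split` and `add` are hypothetical. The sketch assumes Excess-50 exponents, keeps every aligned digit until the end, rounds half up to 5 mantissa digits, and does not handle a zero result because the notes do not say how zero is stored.

```python
def split(word: str):
    """Split an SEEMMMMM word into (sign, stored exponent, integer mantissa)."""
    return (-1 if word[0] == "5" else 1), int(word[1:3]), int(word[3:])


def add(a: str, b: str, subtract: bool = False) -> str:
    sa, ea, ma = split(a)
    sb, eb, mb = split(b)
    if subtract:
        sb = -sb

    # Step 1: align both mantissas to the larger exponent. The extra
    # fractional digits are kept so nothing is lost before rounding.
    high, low = max(ea, eb), min(ea, eb)
    frac_digits = 5 + (high - low)

    # Step 2: add (or subtract) the aligned mantissas.
    total = sa * ma * 10 ** (ea - low) + sb * mb * 10 ** (eb - low)
    if total == 0:
        raise ValueError("a zero result is not handled in this sketch")
    sign, mag = ("5" if total < 0 else "0"), abs(total)

    # Step 3: move the point so the value reads .ddddd x 10^exponent.
    n = len(str(mag))
    exponent = high + (n - frac_digits)

    # Step 4: round the mantissa to 5 digits (round half up) and store.
    if n > 5:
        mag = (mag + 5 * 10 ** (n - 6)) // 10 ** (n - 5)
    else:
        mag *= 10 ** (5 - n)
    if mag == 100000:                  # rounding carried into a sixth digit
        mag, exponent = 10000, exponent + 1
    if not 0 <= exponent <= 99:
        raise OverflowError("exponent does not fit in two Excess-50 digits")
    return f"{sign}{exponent:02d}{mag:05d}"


print(add("05199520", "04967850"))                  # 05210020
print(add("05199520", "04967850", subtract=True))
```

Scaling the larger-exponent mantissa up, instead of shifting the smaller one down, is only a convenience for exact integer arithmetic; the aligned values are the same as in the worked examples.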
 
Multiplication
·         Alignment is not necessary when performing multiplication
·         Multiplication is done using the following method
1.        Multiply the two mantissas
2.        Add the two exponents and subtract N (otherwise the excess would be counted twice)
3.        Normalize if necessary
4.        Store number in the floating point format
 
Examples
·         Assume the SEEMMMMM floating point format
o        Excess-50
o        Base 10 exponent
o        Base 10 mantissa
o        Implied decimal point at the beginning of the number
 
·         Multiply 05220000 x 04712500
1. Multiply the 2 mantissas = .20000 x .12500 = .02500
2. Compute exponent = 52 + 47 – 50 = 49
3. Normalize result = .25000 and adjust exponent = 49 – 1 = 48
4. Store in floating point format = 04825000
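
The multiplication steps can be sketched as follows. The names `split`, `finish`, and `multiply` are hypothetical. The sketch assumes Excess-50 exponents and non-zero operands, and rounds the final mantissa half up to 5 digits.

```python
def split(word: str):
    """Split an SEEMMMMM word into (sign, stored exponent, integer mantissa)."""
    return (-1 if word[0] == "5" else 1), int(word[1:3]), int(word[3:])


def finish(sign: int, exponent: int, mag: int, frac_digits: int) -> str:
    """Normalize mag / 10^frac_digits, round to 5 digits, and build the word."""
    n = len(str(mag))
    exponent += n - frac_digits          # leading zeros lower the exponent
    if n > 5:
        mag = (mag + 5 * 10 ** (n - 6)) // 10 ** (n - 5)
    else:
        mag *= 10 ** (5 - n)
    if mag == 100000:                    # rounding carried into a sixth digit
        mag, exponent = 10000, exponent + 1
    if not 0 <= exponent <= 99:
        raise OverflowError("exponent does not fit in two Excess-50 digits")
    return f"{'5' if sign < 0 else '0'}{exponent:02d}{mag:05d}"


def multiply(a: str, b: str, excess: int = 50) -> str:
    sa, ea, ma = split(a)
    sb, eb, mb = split(b)
    product = ma * mb                    # step 1: 10 fractional digits
    if product == 0:
        raise ValueError("a zero result is not handled in this sketch")
    exponent = ea + eb - excess          # step 2: remove the doubled excess
    return finish(sa * sb, exponent, product, 10)   # steps 3 and 4


print(multiply("05220000", "04712500"))   # 04825000
```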
 
Division
·         Alignment is not necessary when performing division
·         Division is done by
1.        Divide the two mantissas
2.        Subtract the second number's exponent from the first number's exponent, then add N (the subtraction cancels the excess)
3.        Place decimal point in proper position
4.        Store number in the floating point format
 
Examples
·         Divide 05275000 ÷ 05025000
1. Divide the 2 mantissas = .75000 ÷ .25000 = 3.00000
2. Compute exponent = 52 – 50 + 50 = 52
3. Place decimal point in proper position = .30000 and adjust exponent = 52 + 1 = 53
4. Store in floating point format = 05330000
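
Division can be sketched along the same lines. The names `split` and `divide` and the choice of 6 guard digits are assumptions made for illustration. The quotient is truncated to those guard digits before the final rounding to 5 digits, so this is a sketch and not an exact rounding algorithm.

```python
def split(word: str):
    """Split an SEEMMMMM word into (sign, stored exponent, integer mantissa)."""
    return (-1 if word[0] == "5" else 1), int(word[1:3]), int(word[3:])


def divide(a: str, b: str, excess: int = 50, guard: int = 6) -> str:
    sa, ea, ma = split(a)
    sb, eb, mb = split(b)

    # Step 1: divide the mantissas, keeping `guard` fractional digits.
    quotient = ma * 10 ** guard // mb    # raises ZeroDivisionError if mb == 0
    if quotient == 0:
        raise ValueError("a zero result is not handled in this sketch")

    # Step 2: subtract the exponents and add the excess back.
    exponent = ea - eb + excess

    # Step 3: place the point so the value reads .ddddd x 10^exponent.
    n = len(str(quotient))
    exponent += n - guard

    # Step 4: round to 5 mantissa digits (round half up) and store.
    if n > 5:
        quotient = (quotient + 5 * 10 ** (n - 6)) // 10 ** (n - 5)
    else:
        quotient *= 10 ** (5 - n)
    if quotient == 100000:               # rounding carried into a sixth digit
        quotient, exponent = 10000, exponent + 1
    if not 0 <= exponent <= 99:
        raise OverflowError("exponent does not fit in two Excess-50 digits")
    return f"{'5' if sa * sb < 0 else '0'}{exponent:02d}{quotient:05d}"


print(divide("05275000", "05025000"))   # 05330000
```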
IEEE 754 Floating Point Standard

·         IEEE has developed a standard for both 32-bit and 64-bit floating point representation
·         The standard was targeted at personal computers (IBM-type PC and Apple Macintosh)
·         Apple Macintosh also provides its own 80-bit format
·         IEEE 754 defines a 32-bit format called single-precision floating point format
o        Leftmost bit is the mantissa sign (0 for positive and 1 for negative)
o        Followed by an 8-bit exponent
o        Followed by a 24-bit mantissa (23 stored bits + 1 implied leading bit, which is always assumed to be 1 for normalized numbers)
o        Exponent is represented using Excess-127, which gives an exponent range of 2^-126 to 2^+127
o        Stored exponents 0 (2^-127) and 255 (2^+128) are reserved for special use
o        Implied exponent base is 2
o        Fraction point position is to the right of the leading (implied) mantissa bit
o        Special numbers (e.g. 0, ∞, very small non-normalized numbers, etc.) are supported
o        Supported precision is approximately 7 significant decimal digits
o        Allows for an approximate range of 10^-45 to 10^+38
 
·         IEEE 754 defines a 64-bit format called double-precision floating point format
o        It works similarly to the single-precision format
o        1 sign bit, 11 bits for the exponent and 52 bits for the mantissa
o        Supported precision is approximately 15 significant decimal digits
o        Allows for an approximate range of 10^-308 to 10^+308
 
Convert Decimal Real Number to IEEE 754 Floating Point Format
·         The following steps provide the method to convert a decimal real number to IEEE 754 Floating
Point format:
o        Convert the decimal number to binary
o        Adjust binary point to proper position
o        Normalize the number
o        Convert exponent from sign-and-magnitude to Excess-127
o        Convert exponent to binary
o        Store the number in the floating point format
 
1. Convert 36.5 (decimal) to single-precision IEEE 754 floating point format
1. Convert to binary = 100100.1
2. Adjust binary point to proper position = 1.001001 x 2^5
3. Normalize: already normalized
4. Convert exponent to Excess-127 = 127 + 5 = 132
5. Convert exponent to binary = 10000100
6. Store in floating point format = 0 10000100 00100100000000000000000
 
2. Convert -0.25 to single-precision IEEE 754 floating point format
1. Convert to binary = -.01
2. Adjust binary point to proper position = -0.1 x 2^-1
3. Normalize = -1.0 x 2^-2
4. Convert exponent to Excess-127 = 127 - 2 = 125
5. Convert exponent to binary = 01111101
6. Store in floating point format = 1 01111101 00000000000000000000000
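
The hand conversions can be cross-checked against the machine's own single-precision encoding. The sketch below uses Python's standard `struct` module to obtain the 32-bit pattern; the function name `float_to_fields` is hypothetical.

```python
import struct


def float_to_fields(x: float):
    """Return the (sign, exponent, fraction) bit strings of x as a 32-bit float."""
    # Pack as a big-endian IEEE 754 single, then reread the same 4 bytes
    # as an unsigned 32-bit integer to get at the raw bits.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return s[0], s[1:9], s[9:]          # 1 sign bit, 8 exponent bits, 23 fraction bits


for value in (36.5, -0.25):
    print(value, "->", " ".join(float_to_fields(value)))
```

The printed fields should match the sign, Excess-127 exponent, and stored mantissa bits derived step by step in the two examples.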
 
Convert from IEEE 754 Floating Point Format to real number
·         The following steps provide the method to convert IEEE 754 Floating Point to decimal real
number:
o        Convert exponent from binary to decimal
o        Convert from Excess-127 to sign-and-magnitude
o        Convert to exponent notation
o        Remove exponent (if possible)
o        Convert from binary to decimal real number
 
1. Convert 1 01111101 00000000000000000000000 to decimal real number
1. Convert exponent to decimal = 125
2. Convert Excess-127 to sign-and-magnitude = 125 – 127 = -2
3. Convert to exponent notation = -1.0 x 2^-2
4. Remove exponent = -0.01 (binary)
5. Convert to decimal real number = -0.25
 
2. Convert 0 10000001 11001100000000000000000 to decimal real number
1. Convert exponent to decimal = 129
2. Convert Excess-127 to sign-and-magnitude = 129 – 127 = 2
3. Convert to exponent notation = 1.110011 x 2^2
4. Remove exponent = 111.0011 (binary)
5. Convert to decimal real number = 7.1875
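
The decoding steps can be sketched directly from the three bit fields. The function name `fields_to_float` is hypothetical, and the sketch covers normalized numbers only: the reserved exponents 0 and 255 are rejected instead of being interpreted.

```python
def fields_to_float(sign: str, exponent: str, fraction: str) -> float:
    """Decode normalized single-precision fields given as bit strings."""
    stored = int(exponent, 2)
    if stored in (0, 255):
        raise ValueError("reserved exponents are not covered by this sketch")
    e = stored - 127                               # Excess-127 to signed exponent
    mantissa = 1 + int(fraction, 2) / 2 ** 23      # restore the implied leading 1
    return (-1) ** int(sign) * mantissa * 2.0 ** e


print(fields_to_float("1", "01111101", "0" * 23))              # -0.25
print(fields_to_float("0", "10000001", "110011" + "0" * 17))   # 7.1875
```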

Packed Decimal Format (BCD)

·         Floating point numbers may lose accuracy when converted to another base (e.g. decimal to binary and vice versa)
·         Many applications, especially business applications that deal with money, require full accuracy of the numbers
·         BCD satisfies the full accuracy objective
·         BCD in floating point is very similar to the BCD used to represent integer numbers
·         Many business-oriented high-level languages (e.g. COBOL) support the packed decimal format
·         Figure 5.8, page 138 shows the 128-bit packed decimal format used in IBM 370/390 and VAX computers
·         The format allows for 31 decimal digits (1 digit per 4 bits)
·         The least significant 4 bits are used for the sign (1100 for +, 1101 for -)
·         The location of the decimal point is not stored and must be maintained by the application program
 
Examples
·         Convert -150.54 (decimal) to IBM 370/390 BCD Floating Point format
o        Convert sign to BCD format = 1101
o        Convert digit by digit to BCD format = 0001 0101 0000 0101 0100
o        Pad with zeros to fill the entire storage space = 128 - 20 - 4 = 104 leading zero bits are needed
o        Convert to BCD format
= (… 104 zero bits …) 0001 0101 0000 0101 0100 1101
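
The packing steps can be sketched as string operations on 4-bit groups. The function name `pack_decimal` is hypothetical. As in the notes, the decimal point is not stored, so the caller passes only the digits (15054 for 150.54) and has to remember that two of them are fractional.

```python
def pack_decimal(digits: str, negative: bool, total_bits: int = 128) -> str:
    """Pack decimal digits into a packed-decimal bit string with a trailing sign."""
    nibbles = [f"{int(d):04b}" for d in digits]         # one digit per 4 bits
    nibbles.append("1101" if negative else "1100")      # sign in the last 4 bits
    bits = "".join(nibbles)
    if len(bits) > total_bits:
        raise OverflowError("too many digits for the storage size")
    return bits.rjust(total_bits, "0")                  # pad with leading zeros


word = pack_decimal("15054", negative=True)
print(len(word))                                  # 128
print(word[-24:])                                 # 000101010000010101001101
print(len(word) - len(word.lstrip("0")) >= 104)   # True
```

The last line checks for at least 104 leading zeros because the first digit group, 0001, itself begins with zeros.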

Overflow and Underflow Conditions

·         An overflow occurs when the magnitude of a number is too large to be stored
·         An underflow occurs when the magnitude of a number is too small (too close to zero) to be stored
·         See Figure 5.2 on page 115 for an illustration of overflow and underflow in floating point representation
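
For the decimal SEEMMMMM format used earlier in the chapter, both conditions reduce to a range check on the stored exponent. The sketch below assumes Excess-50 with two exponent digits and a normalized value of the form .ddddd x 10^exponent; the function name `classify` is hypothetical.

```python
def classify(exponent: int, excess: int = 50) -> str:
    """Report whether a normalized .ddddd x 10^exponent value can be stored."""
    stored = exponent + excess
    if stored > 99:
        return "overflow"      # magnitude too large for two exponent digits
    if stored < 0:
        return "underflow"     # magnitude too close to zero
    return "ok"


print(classify(49))    # ok
print(classify(50))    # overflow
print(classify(-50))   # ok
print(classify(-51))   # underflow
```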
 
