Floating Point Representation
DCS111 Computer Architecture

Outline
Fractional numbers
Floating point scientific notation
Floating point in binary
IEEE Floating Point Standard
Behaviour of Floating Point Numbers

Fractional Numbers
… not whole numbers

Recap: fractions
Decimal $5.67_{10}$ is
  $5 \times 10^{0}$ plus
  $6 \times 10^{-1}$ plus
  $7 \times 10^{-2}$
Binary $11.011_{2}$ is
  $1 \times 2^{1}$ plus
  $1 \times 2^{0}$ plus
  $0 \times 2^{-1}$ plus
  $1 \times 2^{-2}$ plus
  $1 \times 2^{-3}$
Quiz: what is $11.011_{2}$ in decimal?
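The digit-by-digit expansion above can be sketched in Python. The function name `positional_value` is made up for this illustration; it simply adds up digit times base-to-the-power for each position.

```python
def positional_value(digits: str, base: int) -> float:
    """Evaluate a digit string such as '11.011' in the given base."""
    whole, _, frac = digits.partition(".")
    value = 0.0
    # Whole-part digits carry weights base^0, base^1, ... from the right.
    for power, digit in enumerate(reversed(whole)):
        value += int(digit, base) * base ** power
    # Fractional digits carry weights base^-1, base^-2, ... from the left.
    for power, digit in enumerate(frac, start=1):
        value += int(digit, base) * base ** -power
    return value


print(positional_value("11.011", 2))   # the binary example
print(positional_value("5.67", 10))    # the decimal example
```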

Recap: fractions
Quiz: what is a third as a decimal: N.NNNNN?
A third is 0.33333…
Not all numbers can be represented exactly (with limited digits)
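The same limit applies in binary. A small sketch, assuming Python's built-in `float` (a 64-bit binary floating point number) and the standard `fractions` module to look at the exact value that is actually stored:

```python
from fractions import Fraction

third = 1 / 3                      # nearest binary double to one third
stored = Fraction(third)           # exact rational value actually stored

print(stored == Fraction(1, 3))    # the stored value is only close to 1/3
print(abs(stored - Fraction(1, 3)) < Fraction(1, 10**15))

# 0.1 is a recurring fraction in binary, so it is not stored exactly either.
print(Fraction(0.1) == Fraction(1, 10))

# 0.5 is a power of two and is stored exactly.
print(Fraction(0.5) == Fraction(1, 2))
```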

Problem
How to hold fractions in computers?

Solution 1 – Fixed Point
Divide bits between whole and fractional parts
0 0 1 1 1 1 0 1
integer bits | fractional bits
Point always in the same place
Quiz: what is this in decimal?

Solution 1 – Fixed Point
Divide bits between whole and fractional parts
integer bits | fractional bits
Quiz:
• What is maximum number? (range)
• What is difference between successive numbers? (accuracy)

Evaluation of Fixed Point
Range versus Accuracy
  High accuracy means low range
  High range means low accuracy
Has uses
Really just scaled integers
  Software library for fixed point numbers
  No need for special hardware
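"Really just scaled integers" can be sketched in a few lines of Python. The slides do not say where the point sits, so the 4 integer bits / 4 fractional bits split below is an assumption, as are the function names.

```python
FRACTION_BITS = 4          # ASSUMED split: 4 integer bits, 4 fractional bits
TOTAL_BITS = 8             # unsigned 8-bit word


def fixed_to_float(raw: int) -> float:
    """Interpret an unsigned 8-bit pattern as a fixed point value."""
    return raw / (1 << FRACTION_BITS)


def float_to_fixed(value: float) -> int:
    """Scale by 2^FRACTION_BITS and round to the nearest pattern."""
    raw = round(value * (1 << FRACTION_BITS))
    if not 0 <= raw < (1 << TOTAL_BITS):
        raise OverflowError("value outside fixed point range")
    return raw


print(fixed_to_float(0b00111101))   # the bit pattern shown on the slide
print(fixed_to_float(0b11111111))   # largest value (range)
print(fixed_to_float(0b00000001))   # gap between successive values (accuracy)
```

Moving the point to the left (more fractional bits) shrinks the gap but also shrinks the largest value, which is the range versus accuracy trade-off.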

Scientific (Exponent) Notation
$3.21 \times 10^{5}$ and $6.54 \times 10^{-5}$
Quiz: Write these numbers as decimal, without exponents
321,000 and 0.0000654
Same accuracy
Different magnitude

Scientific (Exponent) Notation
$3.21 \times 10^{5}$ and $6.54 \times 10^{-5}$
Mantissa: 3.21 and 6.54
Exponent: 5 and -5
Mantissa is a fraction
Exponent is an integer
Both mantissa and exponent can be negative

Normalisation
$0.002 \times 10^{0}$
$0.2 \times 10^{-2}$
$2.0 \times 10^{-3}$
$20 \times 10^{-4}$
all the same value
Normalised number has 1 digit before the point

Advantage of Scientific Notation
Large range
Constant proportional accuracy (… with exceptions)
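As a quick check of the normalisation example, a sketch using Python float literals and the standard `e` format code, which prints one digit before the point:

```python
values = [0.002e0, 0.2e-2, 2.0e-3, 20e-4]

# All four literals denote the same number.
print(all(v == values[0] for v in values))

# The 'e' format prints the normalised form: one digit before the point.
print([f"{v:.1e}" for v in values])
```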

Floating Point in Binary

Binary Floating Point
$1.01 \times 2^{2}$
$1.1 \times 2^{-2}$
Exponent: positive or negative
Mantissa: positive or negative
Quiz:
• Effect of negative mantissa?
• Effect of negative exponent?
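A sketch of binary scientific notation in Python, using the standard `math.frexp`. Note that `frexp` returns a mantissa between 0.5 and 1, so the helper below (a made-up name, `normalise_binary`) shifts it to the "one digit before the point" form.

```python
import math


def normalise_binary(x: float) -> tuple[float, int]:
    """Return (mantissa, exponent) with 1 <= |mantissa| < 2
    and x == mantissa * 2**exponent."""
    if x == 0:
        raise ValueError("zero has no normalised form")
    m, e = math.frexp(x)       # frexp gives 0.5 <= |m| < 1
    return m * 2, e - 1


print(normalise_binary(5.0))     # 1.01 (binary) x 2^2
print(normalise_binary(0.375))   # 1.1 (binary) x 2^-2
print(normalise_binary(-5.0))    # negative mantissa, same exponent
```

The mantissa is printed in decimal here: 1.25 is 1.01 in binary and 1.5 is 1.1 in binary.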

Normalised Binary FP
In normalised binary scientific notation
$1.mmmm \ldots mmm \times 2^{E}$
unless the number is 0
1.mmm…mmm is the mantissa
E is the exponent
First digit always 1

Representation (32 bits)
Sign bit S
Exponent E
Mantissa M
sign | exponent | fraction (mantissa)

Representation (32 bits)
Sign bit S – 1 bit
Exponent E – 8 bits
Mantissa M – 23 bits
sign | exponent | fraction (mantissa)
$(-1)^{S} \times 1.M \times 2^{E}$
First digit always 1, so not included

Negative exponents - how?
Aim: ALU (Arithmetic Logic Unit) can reuse integer machinery
Eg, comparison with zero: x > 0
  Easy because of sign bit
  Floating point numbers can be easily classified as negative or positive
BUT comparison of two floating point numbers x<y not so straightforward...
  choose exponent representation to help

Exponent in 2's Comp ??
Consider: 1/2 < 1
half: $0.1_{2} = 1.0 \times 2^{-1}$ (normalised)    0 11111111 000 …
one: $1.0_{2} = 1.0 \times 2^{0}$ (normalised)    0 00000000 000 …
Bad Design

Representation of Exponents
We want:
  FP number order to follow (unsigned) bit order
  11111111 to represent the highest positive exponent
Use biased representation
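Why two's complement is a bad fit can be sketched by comparing the exponent fields of a half and one as unsigned integers. The helper names are made up, and the bias of 127 is taken as an example value (it is the one IEEE single precision uses).

```python
def twos_complement_8(e: int) -> int:
    """8-bit two's complement pattern of e, read as an unsigned integer."""
    return e & 0xFF


def excess_127(e: int) -> int:
    """8-bit biased pattern of e, read as an unsigned integer."""
    return e + 127


half_exp, one_exp = -1, 0          # 1/2 = 1.0 x 2^-1, 1 = 1.0 x 2^0

# Two's complement: the smaller number gets the larger bit pattern.
print(f"{twos_complement_8(half_exp):08b}", f"{twos_complement_8(one_exp):08b}")

# Biased: bit patterns are ordered the same way as the exponents.
print(f"{excess_127(half_exp):08b}", f"{excess_127(one_exp):08b}")
```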

Bias by N (Excess N)
Representation of negative numbers used in floating point numbers
Numbers in ‘correct’ order
excess-N-rep(X) = unsigned-rep(X + N)
Excess 7
excess-7-rep(-3) = unsigned-rep(-3 + 7) = 0100
excess-7-rep(-7) = 0000
excess-7-rep(4) = unsigned-rep(4 + 7) = 1011

Bias by N (Excess N)
Excess 7
Bits   Value    Bits   Value
0000   -7       1000   1
0001   -6       1001   2
0010   -5       1010   3
0011   -4       1011   4
0100   -3       1100   5
0101   -2       1101   6
0110   -1       1110   7
0111   0        1111   8
E.g. –2 is represented as unsigned(7-2) = unsigned(5) = 0101
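The rule excess-N-rep(X) = unsigned-rep(X + N) translates directly into Python. A minimal sketch; the function names `excess_rep` and `excess_value` are made up for illustration.

```python
def excess_rep(x: int, n: int, bits: int) -> str:
    """Excess-N pattern: the unsigned representation of x + n."""
    stored = x + n
    if not 0 <= stored < (1 << bits):
        raise ValueError("value not representable")
    return format(stored, f"0{bits}b")


def excess_value(pattern: str, n: int) -> int:
    """Inverse: read the pattern as unsigned, then subtract the bias."""
    return int(pattern, 2) - n


# Reproduce a few rows of the excess 7 table (4 bits).
for x in (-7, -3, -2, 0, 4, 8):
    print(x, excess_rep(x, 7, 4))
```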

IEEE Standard
What is IEEE?

IEEE 754-1985
Standard important for
  exchange of data
  portability of code
Representation for FP numbers in
  32-bit (single precision)
  64-bit (double precision)

IEEE 32-bit FP
Sign bit S – 1 bit
Exponent E – 8 bits
  Bias is 127
  Exponents –126 (00000001) to +127 (11111110)
  Exponents 00000000 and 11111111 special
Mantissa M – 23 bits
S | E | M
sign | exponent | fraction (mantissa)
$(-1)^{S} \times (1.M) \times 2^{E-127}$
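The three fields can be pulled out of a real 32-bit float with Python's standard `struct` module. This is a sketch: the function names are made up, and `value_from_fields` only covers normalised numbers (not the special exponents 00000000 and 11111111).

```python
import struct


def decode_float32(x: float) -> tuple[int, int, int]:
    """Return the (S, E, M) fields of x stored as IEEE single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # stored (biased) exponent
    mantissa = bits & 0x7FFFFF          # 23 fraction bits, leading 1 not stored
    return sign, exponent, mantissa


def value_from_fields(sign: int, exponent: int, mantissa: int) -> float:
    """Apply (-1)^S x (1.M) x 2^(E-127); normalised numbers only."""
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)


s, e, m = decode_float32(-6.5)          # -6.5 = -1.101 (binary) x 2^2
print(s, f"{e:08b}", f"{m:023b}")
print(value_from_fields(s, e, m))
```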

Example 1 – Convert to FP
Represent $0.3125_{10} = 5/16$
$5/16 = 1/4 + 1/16 = 0.0101_{2} = 1.01 \times 2^{-2}$
S = 0
E = -2 + bias = -2 + 127 = $125_{10}$ = 01111101
M = 010....000
0 01111101 010000 ... 000

Example 2 – Convert from FP
What number is represented by:
0 01111101 010000 ... 000
S = 0
E = 0111 1101 = $125_{10}$
Real exponent = E-bias = 125-127 = -2
M = 1/4
$(-1)^{S} \times (1+M) \times 2^{E-\text{bias}} = (1 + 1/4) \times (1/4) = 5/16$
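The steps of Example 1 can be followed in Python and checked against the machine's own encoding. A sketch only: `encode_float32_by_hand` is a made-up helper that assumes a non-zero normalised value whose fraction fits exactly in 23 bits, and `encode_float32_struct` uses the standard `struct` module as the reference.

```python
import math
import struct


def encode_float32_by_hand(x: float) -> str:
    """Encode a normalised value that fits exactly in 23 fraction bits."""
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))                # abs(x) = m * 2**e, 0.5 <= m < 1
    mantissa, exponent = m * 2, e - 1        # rewrite as 1.M * 2**exponent
    stored_exponent = exponent + 127         # add the bias
    fraction = int((mantissa - 1) * 2**23)   # the 23 bits after the point
    return f"{sign} {stored_exponent:08b} {fraction:023b}"


def encode_float32_struct(x: float) -> str:
    """Reference encoding produced by the struct module."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"


print(encode_float32_by_hand(0.3125))
print(encode_float32_struct(0.3125))
```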

Quiz
What are
  0 10000001 111000 ... 000
  1 01111001 011000 ... 000
Convert to 32-bit FP using IEEE
  4.125
  -7.625

IEEE FP Extras
Zero
  Both E and M = zero
  Can be positive or negative
+/- Infinity (exponent all 1's)
De-normalised numbers
  E=0
  close to zero, exponent is -126
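The special bit patterns listed under the IEEE FP extras can be inspected with the standard `struct` module. A sketch; `float32_from_bits` is a made-up helper name.

```python
import math
import struct


def float32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


plus_zero = float32_from_bits(0x00000000)    # S=0, E=0, M=0
minus_zero = float32_from_bits(0x80000000)   # S=1, E=0, M=0
plus_inf = float32_from_bits(0x7F800000)     # E all 1's, M=0
minus_inf = float32_from_bits(0xFF800000)
tiny = float32_from_bits(0x00000001)         # de-normalised: E=0, M != 0

print(plus_zero == minus_zero, math.copysign(1.0, minus_zero))
print(plus_inf, minus_inf)
print(tiny == 2.0 ** -149)    # 0.00...01 (binary) x 2^-126
```

For de-normalised numbers the hidden first digit is 0 rather than 1, which is why the smallest one is $2^{-23} \times 2^{-126} = 2^{-149}$.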

Behaviour of Floating Point Numbers

Overflow and Underflow
Overflow
  Results too large (positive or negative) to be represented
Underflow
  Result too close to zero (positive or negative) to be represented
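Both effects are easy to provoke. Note the assumption: Python's `float` is a 64-bit double, not the 32-bit format on the slides, so the thresholds differ but the behaviour is the same.

```python
import sys

largest = sys.float_info.max          # largest finite double
overflowed = largest * 2              # too large: result becomes infinity

smallest = 5e-324                     # smallest positive (de-normalised) double
underflowed = smallest / 2            # too close to zero: result becomes 0.0

print(overflowed, underflowed)
```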

Range – 32 bit FP
[Number line: negative | zero | positive, marking smallest, largest negative, smallest positive (>0) and largest]
Quiz: find the largest and smallest FP in IEEE 32-bit

Range – 32 bit FP
Largest/smallest: $\pm (2 - 2^{-23}) \times 2^{127} \approx 10^{38}$
Near zero (normalised numbers): $\pm 1.0 \times 2^{-126}$
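These limits can be checked by building the extreme bit patterns directly. A sketch using the standard `struct` module; `float32_from_bits` is a made-up helper name.

```python
import struct


def float32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


largest = float32_from_bits(0x7F7FFFFF)              # E = 11111110, M all 1's
smallest_normalised = float32_from_bits(0x00800000)  # E = 00000001, M = 0

print(largest == (2 - 2.0 ** -23) * 2.0 ** 127)
print(smallest_normalised == 2.0 ** -126)
print(largest, smallest_normalised)
```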

How do they behave?
If x, y are positive is:
  x + y > x ?
If x and y are different can:
  x – y = 0 ?
Do these rules hold:
  (x + y) + z = x + (y + z) ?
  (x * y) * z = x * (y * z) ?
  x * (y + z) = x*y + x*z ?
Different evaluation orders have different rounding errors

Summary
FP scientific notation
Normalised representation in binary
Bias to represent -ve to +ve range in exponent
Notice how a 32-bit binary number can represent many different entities in memory
Underflow as well as overflow
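Two of the questions under "How do they behave?" can be tried directly. A sketch using Python's 64-bit `float`; the particular values are chosen for illustration and only cover the addition cases.

```python
# Is x + y > x for positive x, y? Not always.
x, y = 1e17, 1.0
print(x + y > x)          # False: y is smaller than the gap between doubles near x

# Is addition associative? Not always.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False
print((a + b) + c, a + (b + c))
```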