Floating Point Numbers
Review of Numbers
• Computers are made to deal with numbers
• What can we represent in N bits?
• Unsigned integers:
0 to 2^N - 1
• Signed Integers (Two’s Complement)
-2^(N-1) to 2^(N-1) - 1
• Signed Integers (sign-magnitude or one's complement)
-(2^(N-1) - 1) to 2^(N-1) - 1
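The unsigned and two's-complement ranges can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not part of the slides.

```python
def unsigned_range(n_bits):
    """Smallest and largest unsigned integer that fit in n_bits."""
    return 0, 2**n_bits - 1

def twos_complement_range(n_bits):
    """Smallest and largest two's-complement integer that fit in n_bits."""
    return -(2**(n_bits - 1)), 2**(n_bits - 1) - 1

for n in (8, 16, 32):
    print(n, unsigned_range(n), twos_complement_range(n))
```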
Other Numbers
• What about other numbers?
• Very large numbers? (seconds/century)
3,155,760,000_10 (3.15576_10 x 10^9)
• Very small numbers? (atomic diameter)
0.00000001_10 (1.0_10 x 10^-8)
• Rationals (repeating pattern)
• 2/3 (0.666666666. . .)
• Irrationals
2^(1/2) (1.414213562373...)
• Transcendentals
• e (2.718...), π (3.141...)
• All represented in scientific notation
Fractional Binary Numbers
Weight:  2^i  2^(i-1)  •••  4    2    1    .  1/2   1/4   1/8   •••  2^-j
Bit:     b_i  b_(i-1)  •••  b_2  b_1  b_0  .  b_-1  b_-2  b_-3  •••  b_-j
• Representation
• Bits to right of “binary point” represent fractional powers of 2
• Represents rational number: sum over k = -j to i of b_k x 2^k
Fractional Binary Numbers: Examples
Value Representation
5 3/4  = 23/4     101.11_2   = 4 + 1 + 1/2 + 1/4
2 7/8  = 23/8     010.111_2  = 2 + 1/2 + 1/4 + 1/8
1 7/16 = 23/16    001.0111_2 = 1 + 1/4 + 1/8 + 1/16
Observations
Divide by 2 by shifting right (unsigned)
Multiply by 2 by shifting left
Numbers of form 0.111111…_2 are just below 1.0
1/2 + 1/4 + 1/8 + … + 1/2^i + … ➙ 1.0
Use notation 1.0 – ε
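The example values can be reproduced exactly with Python's fractions module. A minimal sketch; the helper name binary_to_fraction is an assumption made for illustration.

```python
from fractions import Fraction

def binary_to_fraction(bits):
    """Evaluate a binary string such as '101.11' as an exact Fraction."""
    whole, _, frac = bits.partition(".")
    value = Fraction(int(whole, 2) if whole else 0)
    # Each bit to the right of the binary point weighs 1/2, 1/4, 1/8, ...
    for position, bit in enumerate(frac, start=1):
        if bit == "1":
            value += Fraction(1, 2**position)
    return value

print(binary_to_fraction("101.11"))    # 23/4
print(binary_to_fraction("010.111"))   # 23/8
print(binary_to_fraction("001.0111"))  # 23/16
```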
Representable Numbers
• Limitation #1
• Can only exactly represent numbers of the form x/2^k
• Other rational numbers have repeating bit representations
• Value    Representation
• 1/3      0.0101010101[01]…_2
• 1/5      0.001100110011[0011]…_2
• 1/10     0.0001100110011[0011]…_2
• Limitation #2
• Just one setting of binary point within the w bits
• Limited range of numbers (very small values? very large?)
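The repeating patterns of Limitation #1 fall out of the usual "multiply by 2, take the integer part" procedure. The sketch below uses exact fractions so that no rounding hides the repetition; the function name is illustrative.

```python
from fractions import Fraction

def binary_expansion(value, n_bits):
    """Return the first n_bits fractional bits of value (0 <= value < 1)."""
    bits = []
    for _ in range(n_bits):
        value *= 2
        if value >= 1:
            bits.append("1")
            value -= 1
        else:
            bits.append("0")
    return "0." + "".join(bits)

print(binary_expansion(Fraction(1, 3), 12))   # 0.010101010101
print(binary_expansion(Fraction(1, 5), 12))   # 0.001100110011
print(binary_expansion(Fraction(1, 10), 13))  # 0.0001100110011
print(binary_expansion(Fraction(5, 8), 4))    # 0.1010 (terminates: 5/2^3)
```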
Objective
• To understand the fundamentals of floating-point representation
• To know the IEEE-754 Floating Point Standard
Patriot Missile
• Gulf War I
• Failed to intercept incoming Iraqi Scud missile (Feb 25, 1991)
• 28 American soldiers killed
GAO Report: GAO/IMTEC-92-26 Patriot Missile Software Problem
http://www.fas.org/spp/starwars/gao/im92026.htm
Patriot Design
• Intended to operate only for a few hours
• Defend Europe from Soviet aircraft and missiles
• Four 24-bit registers (1970s design!)
• Kept time with integer counter: incremented every 1/10 second
• Calculate speed of incoming missile to predict future positions:
velocity = (loc1 – loc0) / ((count1 – count0) * 0.1)
• But, cannot represent 0.1 exactly!
Floating Imprecision
• 24-bits:
0.1 ≈ 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9
+ 1/2^12 + 1/2^13 + 1/2^16 + 1/2^17
+ 1/2^20 + 1/2^21
= 209715 / 2097152
Error is 0.2/2097152 = 1/10485760
One hour = 3600 seconds
3600 * 1/10485760 * 10 = 0.0034 s
20 hours = 0.0687 s
Miss target! (137 meters)
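The arithmetic on this slide can be replayed with exact fractions. This is a sketch of the slide's calculation only, not of the Patriot software; the names are illustrative.

```python
from fractions import Fraction

# Bit positions of the 1s in the truncated expansion of 0.1 listed above.
positions = [4, 5, 8, 9, 12, 13, 16, 17, 20, 21]
approx_tenth = sum(Fraction(1, 2**p) for p in positions)
error_per_tick = Fraction(1, 10) - approx_tenth

def clock_drift_seconds(hours):
    """Accumulated timing error after the given number of hours."""
    ticks = hours * 3600 * 10      # the counter ticks 10 times per second
    return float(ticks * error_per_tick)

print(approx_tenth)                        # 209715/2097152
print(error_per_tick)                      # 1/10485760
print(round(clock_drift_seconds(1), 4))    # 0.0034
print(round(clock_drift_seconds(20), 4))   # 0.0687
```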
Two weeks before the incident, Army officials received Israeli data
indicating some loss in accuracy after the system had been running
for 8 consecutive hours. Consequently, Army officials modified the
software to improve the system's accuracy. However, the modified
software did not reach Dhahran until February 26, 1991--the day
after the Scud incident.
GAO Report
http://fas.org/spp/starwars/gao/im92026.htm
Floating Point Representation
• Numerical Form: (–1)^s x M x 2^E
Example: 15213_10 = (–1)^0 x 1.1101101101101_2 x 2^13
• Sign bit s determines whether number is negative or positive
• Significand M normally a fractional value in range [1.0,2.0).
• Exponent E weights value by power of two
• Encoding
• MSB s is sign bit s
• exp field encodes E (but is not equal to E)
• frac field encodes M (but is not equal to M)
s exp frac
Exponential Notation
• The following are equivalent representations of 1,234
123,400.0 x 10^-2
12,340.0 x 10^-1
1,234.0 x 10^0
123.4 x 10^1
12.34 x 10^2
1.234 x 10^3
0.1234 x 10^4
• The representations differ in that the decimal place (the “point”) “floats” to the left or right (with the appropriate adjustment in the exponent).
Parts of a Floating Point Number
-0.9876 x 10^-3
Sign of mantissa: –
Location of decimal point: .
Mantissa: 9876
Base: 10
Sign of exponent: –
Exponent: 3
IEEE 754 Standard
• Most common standard for representing floating point numbers
• Single precision: 32 bits, consisting of...
• Sign bit (1 bit)
• Exponent (8 bits)
• Mantissa (23 bits)
• Double precision: 64 bits, consisting of…
• Sign bit (1 bit)
• Exponent (11 bits)
• Mantissa (52 bits)
Prof. William Kahan
Single Precision Format
32 bits: Sign of mantissa (1 bit) | Exponent (8 bits) | Mantissa (23 bits)
Normalization
• The mantissa is normalized
• Has an implied decimal place on left
• Has an implied “1” on left of the decimal place
• E.g.,
• Mantissa 10100000000000000000000
• Represents… 1.101_2 = 1.625_10
• Normalized form: no leading 0s
(exactly one digit to left of decimal point)
• Normalized: 1.0 x 10^-9
• Not normalized: 0.1 x 10^-8, 10.0 x 10^-10
Excess Notation
• To include both positive and negative exponents, “excess” notation is used
• Single precision: excess 127
• Double precision: excess 1023
• The value of the exponent stored is larger than the
actual exponent
• E.g., excess 127,
• Exponent 10000111
• Represents… 135 – 127 = 8
Example
• Single precision
0 10000010 11000000000000000000000
Sign: 0 = positive mantissa
Exponent: 10000010_2 = 130, and 130 – 127 = 3
Mantissa: 1.11_2 (with the implied leading 1)
+1.11_2 x 2^3 = 1110.0_2 = 14.0_10
Hexadecimal
• It is convenient and common to represent
the original floating point number in
hexadecimal
• The preceding example…
0 10000010 11000000000000000000000
= 0100 0001 0110 0000 0000 0000 0000 0000_2
= 4 1 6 0 0 0 0 0_16
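Python's standard struct module can produce the same fields and hexadecimal form. A minimal sketch; the two helper names are made up for this example.

```python
import struct

def float_to_hex32(value):
    """Hex string of the IEEE 754 single-precision encoding of value."""
    return struct.pack(">f", value).hex().upper()

def float_to_fields32(value):
    """Return the (sign, exponent, mantissa) bit strings of value."""
    (word,) = struct.unpack(">I", struct.pack(">f", value))
    bits = format(word, "032b")
    return bits[0], bits[1:9], bits[9:]

print(float_to_fields32(14.0))  # ('0', '10000010', '11000000000000000000000')
print(float_to_hex32(14.0))     # 41600000
```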
Converting from Floating Point
• E.g., What decimal value is represented by
the following 32-bit floating point number?
C17B0000_16
• Step 1
• Express in binary and find S, E, and M
C17B0000_16 =
1 10000010 11110110000000000000000_2
S E M
(S: 1 = negative, 0 = positive)
• Step 2
• Find “real” exponent, n
• n = E – 127
= 10000010_2 – 127
= 130 – 127
= 3
• Step 3
• Put S, M, and n together to form binary result
• (Don’t forget the implied “1.” on the left of the
mantissa.)
-1.1111011_2 x 2^n =
-1.1111011_2 x 2^3 =
-1111.1011_2
• Step 4
• Express result in decimal
-1111.1011_2
Integer part: 1111_2 = 15
Fraction part: 2^-1 = 0.5
2^-3 = 0.125
2^-4 = 0.0625
Sum = 0.6875
Answer: -15.6875
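Steps 1 to 4 can be sketched in Python and cross-checked against the struct module. The function name is illustrative, and the sketch assumes a normalized value (it does not handle zero, denormalized values, infinity or NaN).

```python
import struct

def decode_hex32(hex_word):
    """Decode a normalized single-precision value by following Steps 1-4."""
    word = int(hex_word, 16)
    sign = word >> 31                  # Step 1: split into S, E, M
    stored_exp = (word >> 23) & 0xFF
    frac = word & 0x7FFFFF
    n = stored_exp - 127               # Step 2: remove the excess-127 bias
    significand = 1 + frac / 2**23     # Step 3: restore the implied "1."
    return (-1) ** sign * significand * 2.0**n   # Step 4: decimal value

print(decode_hex32("C17B0000"))                            # -15.6875
print(struct.unpack(">f", bytes.fromhex("C17B0000"))[0])   # -15.6875
```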
Converting from Floating Point
• E.g., What decimal value is represented by
the following 32-bit floating point number?
42808000_16
Converting to Floating Point
• E.g., Express 36.5625_10 as a 32-bit floating point number (in hexadecimal)
• Step 1
• Express original value in binary
36.5625_10 =
100100.1001_2
• Step 2
• Normalize
100100.1001_2 =
1.001001001_2 x 2^5
• Step 3
• Determine S, E, and M
+1.001001001_2 x 2^5
S = 0 (because the value is positive)
n = 5
E = n + 127
= 5 + 127
= 132
= 10000100_2
M = 001001001_2 (the bits after the implied “1.”)
• Step 4
• Put S, E, and M together to form 32-bit binary result
0 10000100 00100100100000000000000_2
S E M
• Step 5
• Express in hexadecimal
0 10000100 00100100100000000000000_2 =
0100 0010 0001 0010 0100 0000 0000 0000_2 =
4 2 1 2 4 0 0 0_16
Answer: 42124000_16
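The five steps can be sketched in Python as below. The function name is illustrative, and the sketch assumes a nonzero value that fits exactly in 24 significant bits: it performs no rounding and does not handle zero, denormalized values, infinity or NaN.

```python
import math
import struct

def encode_hex32(value):
    """Encode an exactly representable, normalized value (Steps 1-5)."""
    sign = 1 if value < 0 else 0
    # frexp returns abs(value) = mantissa * 2**exponent with 0.5 <= mantissa < 1
    mantissa, exponent = math.frexp(abs(value))
    n = exponent - 1                          # renormalize to 1.xxx * 2**n
    frac = int((mantissa * 2 - 1) * 2**23)    # drop the implied 1, keep 23 bits
    stored_exp = n + 127                      # excess-127
    word = (sign << 31) | (stored_exp << 23) | frac
    return format(word, "08X")

print(encode_hex32(36.5625))                      # 42124000
print(struct.pack(">f", 36.5625).hex().upper())   # 42124000
```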
Converting to Floating Point
• E.g., Express 6.5_10 as a 32-bit floating point number (in hexadecimal)
Converting to Floating Point
• E.g., Express 0.1_10 as a 32-bit floating point number (in hexadecimal)
Zero, Infinity, and NaN
• Zero
– Exponent field E = 0 and fraction F = 0
– +0 and –0 are possible according to sign bit S
• Infinity
– Infinity is a special value represented with maximum E and F = 0
• For single precision with 8-bit exponent: maximum E = 255
• For double precision with 11-bit exponent: maximum E = 2047
– Infinity can result from overflow or division by zero
– +∞ and –∞ are possible according to sign bit S
• NaN (Not a Number)
– NaN is a special value represented with maximum E and F ≠ 0
– Result from exceptional situations, such as 0/0 or sqrt(negative)
– Operation on a NaN results in NaN: Op(X, NaN) = NaN
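The encodings of these special values can be inspected from Python. A minimal sketch with an illustrative helper name; note that Python itself raises ZeroDivisionError for 1.0/0.0 rather than returning infinity, so the sketch uses math.inf directly.

```python
import math
import struct

def bits32(value):
    """Single-precision bit pattern of value as 'S EEEEEEEE FFF...F'."""
    (word,) = struct.unpack(">I", struct.pack(">f", value))
    b = format(word, "032b")
    return f"{b[0]} {b[1:9]} {b[9:]}"

print(bits32(0.0))        # 0 00000000 00000000000000000000000
print(bits32(-0.0))       # 1 00000000 00000000000000000000000
print(bits32(math.inf))   # 0 11111111 00000000000000000000000
print(bits32(-math.inf))  # 1 11111111 00000000000000000000000

nan = math.inf - math.inf                       # an exceptional operation
print(math.isnan(nan), math.isnan(nan + 1.0))   # True True
print(nan == nan)                               # False: NaN is unequal to itself
```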
Simple 6-bit Floating Point Example
S (1 bit) | Exponent (3 bits) | Fraction (2 bits)
• 6-bit floating point representation
– Sign bit is the most significant bit
– Next 3 bits are the exponent with a bias of 3
– Last 2 bits are the fraction
• Same general form as IEEE
– Normalized, denormalized
– Representation of 0, infinity and NaN
• Value of normalized numbers: (–1)^S × (1.F)_2 × 2^(E – 3)
• Value of denormalized numbers: (–1)^S × (0.F)_2 × 2^(–2)
Values Related to Exponent
Exp.  exp  E    2^E
0     000  -2   1/4   Denormalized
1     001  -2   1/4   Normalized
2     010  -1   1/2   Normalized
3     011  0    1     Normalized
4     100  1    2     Normalized
5     101  2    4     Normalized
6     110  3    8     Normalized
7     111  n/a        Inf or NaN
Dynamic Range of Values
s exp frac E value
0 000 00 -2 0
0 000 01 -2 1/4*1/4=1/16 smallest denormalized
0 000 10 -2 2/4*1/4=2/16
0 000 11 -2 3/4*1/4=3/16 largest denormalized
0 001 00 -2 4/4*1/4=4/16=1/4=0.25 smallest normalized
0 001 01 -2 5/4*1/4=5/16
0 001 10 -2 6/4*1/4=6/16
0 001 11 -2 7/4*1/4=7/16
0 010 00 -1 4/4*2/4=8/16=1/2=0.5
0 010 01 -1 5/4*2/4=10/16
0 010 10 -1 6/4*2/4=12/16=0.75
0 010 11 -1 7/4*2/4=14/16
Dynamic Range of Values
s exp frac E value
0 011 00 0 4/4*4/4=16/16=1
0 011 01 0 5/4*4/4=20/16=1.25
0 011 10 0 6/4*4/4=24/16=1.5
0 011 11 0 7/4*4/4=28/16=1.75
0 100 00 1 4/4*8/4=32/16=2
0 100 01 1 5/4*8/4=40/16=2.5
0 100 10 1 6/4*8/4=48/16=3
0 100 11 1 7/4*8/4=56/16=3.5
0 101 00 2 4/4*16/4=64/16=4
0 101 01 2 5/4*16/4=80/16=5
0 101 10 2 6/4*16/4=96/16=6
0 101 11 2 7/4*16/4=112/16=7
Dynamic Range of Values
s exp frac E value
0 110 00 3 4/4*32/4=128/16=8
0 110 01 3 5/4*32/4=160/16=10
0 110 10 3 6/4*32/4=192/16=12
0 110 11 3 7/4*32/4=224/16=14 largest normalized
0 111 00 n/a inf
0 111 01 n/a NaN
0 111 10 n/a NaN
0 111 11 n/a NaN
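The three tables above can be regenerated with a short decoder for the 6-bit format. This is a sketch under the slide's definitions (3 exponent bits, bias 3, 2 fraction bits); the function name decode6 is illustrative.

```python
from fractions import Fraction

BIAS = 3  # 3-bit exponent field, bias 2**(3-1) - 1

def decode6(s, exp, frac):
    """Value of a 6-bit pattern given as integer fields s, exp, frac."""
    sign = -1 if s else 1
    if exp == 0b111:                   # all ones: infinity or NaN
        return sign * float("inf") if frac == 0 else float("nan")
    if exp == 0:                       # denormalized: 0.F x 2^(1 - bias)
        return sign * Fraction(frac, 4) * Fraction(2) ** (1 - BIAS)
    # normalized: 1.F x 2^(exp - bias)
    return sign * (1 + Fraction(frac, 4)) * Fraction(2) ** (exp - BIAS)

for exp in range(8):
    print(format(exp, "03b"), [str(decode6(0, exp, f)) for f in range(4)])
```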
Floating Point Addition Example
• Consider adding: (1.111)_2 × 2^–1 + (1.011)_2 × 2^–3
– For simplicity, we assume 4 bits of precision (or 3 bits of fraction)
• Cannot add significands … Why?
– Because exponents are not equal
• How to make exponents equal?
– Shift the significand of the lesser exponent right until its exponent matches the larger number
• (1.011)_2 × 2^–3 = (0.1011)_2 × 2^–2 = (0.01011)_2 × 2^–1
– Difference between the two exponents = –1 – (–3) = 2
– So, shift right by 2 bits
• Now, add the significands:
  1.111
+ 0.01011
---------
 10.00111 (carry)
Addition Example
• So, (1.111)_2 × 2^–1 + (1.011)_2 × 2^–3 = (10.00111)_2 × 2^–1
• However, result (10.00111)_2 × 2^–1 is NOT normalized
• Normalize result: (10.00111)_2 × 2^–1 = (1.000111)_2 × 2^0
– In this example, we have a carry
– So, shift right by 1 bit and increment the exponent
• Round the significand to fit in appropriate number of bits
– We assumed 4 bits of precision or 3 bits of fraction
• Round to nearest: (1.000111)_2 ≈ (1.001)_2
  1.000 111
+     1
-----------
  1.001
– Renormalize if rounding generates a carry
• Detect overflow / underflow
– If exponent becomes too large (overflow) or too small (underflow)
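The align, add, normalize and round steps can be sketched with exact fractions. This is a simplified illustration, not a full IEEE 754 adder: it assumes two positive operands, rounds ties upward instead of to even, and does not model overflow or underflow. The function name fp_add is made up for this example.

```python
from fractions import Fraction

def fp_add(sig_a, exp_a, sig_b, exp_b, frac_bits=3):
    """Add two positive values sig * 2**exp with 1 <= sig < 2.

    Returns (significand, exponent) rounded to frac_bits fraction bits.
    """
    # Align: express both operands with the larger exponent.
    exp = max(exp_a, exp_b)
    total = (sig_a * Fraction(2) ** (exp_a - exp)
             + sig_b * Fraction(2) ** (exp_b - exp))
    # Normalize: a carry out means shift right and increment the exponent.
    while total >= 2:
        total /= 2
        exp += 1
    # Round to nearest at frac_bits fraction bits (ties go up in this sketch).
    scale = 2 ** frac_bits
    rounded = Fraction(int(total * scale + Fraction(1, 2)), scale)
    if rounded >= 2:                   # renormalize if rounding carried
        rounded /= 2
        exp += 1
    return rounded, exp

a = Fraction(0b1111, 8)    # (1.111)_2
b = Fraction(0b1011, 8)    # (1.011)_2
print(fp_add(a, -1, b, -3))   # (Fraction(9, 8), 0), i.e. (1.001)_2 x 2^0
```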
Summary: IEEE Floating Point
Single Precision (32 bits)
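Sign (1 bit, bit 31) | Exponent (8 bits, bits 30-23) | Fraction (23 bits, bits 22-0)
Exponent values: 0      zeroes (and denormalized values)
                 1-254  stored value = exp + 127
                 255    infinities, NaN
Value = (1 – 2*Sign) × (1 + Fraction) × 2^(Exponent – 127), for Exponent in 1-254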
Denormalized Values
• Condition
• exp = 000…0
• Value
• Exponent value E = –Bias + 1
• Significand value M = 0.xxx…x_2
• xxx…x: bits of frac
• Cases
• exp = 000…0, frac = 000…0
• Represents value 0
• Note that there are distinct values +0 and –0
• exp = 000…0, frac ≠ 000…0
• Numbers very close to 0.0
Special Values
• Condition
• exp = 111…1
• Cases
• exp = 111…1, frac = 000…0
• Represents value ∞ (infinity)
• Operation that overflows
• Both positive and negative
• E.g., 1.0/0.0 = –1.0/–0.0 = +∞, 1.0/–0.0 = –∞
• exp = 111…1, frac ≠ 000…0
• Not-a-Number (NaN)
• Represents case when no numeric value can be determined
• E.g., sqrt(–1), ∞ – ∞, ∞ × 0
Interesting Numbers
Description                 exp     frac    Numeric Value
Zero                        00…00   00…00   0.0
Smallest Pos. Denorm.       00…00   00…01   2^–{23,52} x 2^–{126,1022}
    Single ≈ 1.4 x 10^–45
    Double ≈ 4.9 x 10^–324
Largest Denormalized        00…00   11…11   (1.0 – ε) x 2^–{126,1022}
    Single ≈ 1.18 x 10^–38
    Double ≈ 2.2 x 10^–308
Smallest Pos. Normalized    00…01   00…00   1.0 x 2^–{126,1022}
    Just larger than largest denormalized
One                         01…11   00…00   1.0
Largest Normalized          11…10   11…11   (2.0 – ε) x 2^{127,1023}
    Single ≈ 3.4 x 10^38
    Double ≈ 1.8 x 10^308
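The single-precision rows can be checked by reinterpreting bit patterns with the struct module, and the double-precision limits are exposed through sys.float_info. A minimal sketch; from_bits32 is an illustrative helper name.

```python
import struct
import sys

def from_bits32(word):
    """Interpret a 32-bit integer as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

print(from_bits32(0x00000001))   # smallest positive denormalized, about 1.4e-45
print(from_bits32(0x007FFFFF))   # largest denormalized, about 1.18e-38
print(from_bits32(0x00800000))   # smallest positive normalized, 2**-126
print(from_bits32(0x3F800000))   # 1.0
print(from_bits32(0x7F7FFFFF))   # largest normalized, about 3.4e38

# Python floats are doubles on mainstream platforms, so sys.float_info
# reports the double-precision limits.
print(sys.float_info.min)        # smallest positive normalized, about 2.2e-308
print(sys.float_info.max)        # largest normalized, about 1.8e308
```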
Visualization: Floating Point Encodings
[Figure: number line of encodings, left to right: NaN, –∞, –Normalized, –Denorm, –0, +0, +Denorm, +Normalized, +∞, NaN]
Tiny Floating Point Example
s (1 bit) | exp (4 bits) | frac (3 bits)
• 8-bit Floating Point Representation
• the sign bit is in the most significant bit
• the next four bits are the exp, with a bias of 7
• the last three bits are the frac
• Same general form as IEEE Format
• normalized, denormalized
• representation of 0, NaN, infinity
v = (–1)^s x M x 2^E
Dynamic Range (s=0 only)
norm: E = exp – Bias
denorm: E = 1 – Bias
s exp  frac  E    Value
Denormalized numbers
0 0000 000   -6   0
0 0000 001   -6   1/8*1/64 = 1/512    closest to zero
0 0000 010   -6   2/8*1/64 = 2/512    (–1)^0 x (0 + 1/4) x 2^–6
…
0 0000 110   -6   6/8*1/64 = 6/512
0 0000 111   -6   7/8*1/64 = 7/512    largest denorm
Normalized numbers
0 0001 000   -6   8/8*1/64 = 8/512    smallest norm
0 0001 001   -6   9/8*1/64 = 9/512    (–1)^0 x (1 + 1/8) x 2^–6
…
0 0110 110   -1   14/8*1/2 = 14/16
0 0110 111   -1   15/8*1/2 = 15/16    closest to 1 below
0 0111 000   0    8/8*1 = 1
0 0111 001   0    9/8*1 = 9/8         closest to 1 above
0 0111 010   0    10/8*1 = 10/8
…
0 1110 110   7    14/8*128 = 224
0 1110 111   7    15/8*128 = 240      largest norm
0 1111 000   n/a  inf
Distribution of Values
• 6-bit IEEE-like format
• s: 1 bit, exp: 3 bits, frac: 2 bits
• e = 3 exponent bits
• f = 2 fraction bits
• Bias is 2^(3-1) - 1 = 3
• Notice how the distribution gets denser toward zero.
[Figure: number line from -15 to 15 marking the denormalized, normalized, and infinity encodings, with a group near zero labeled “8 values”]
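The density toward zero can be seen by listing the gaps between neighboring values. The sketch below enumerates the finite non-negative values of a small IEEE-like format; the function name and its parameters are illustrative.

```python
def positive_values(exp_bits=3, frac_bits=2):
    """All finite non-negative values of a small IEEE-like format, ascending."""
    bias = 2 ** (exp_bits - 1) - 1
    scale = 2 ** frac_bits
    values = []
    for exp in range(2 ** exp_bits - 1):      # skip the all-ones exponent
        for frac in range(scale):
            if exp == 0:                      # denormalized
                values.append(frac / scale * 2.0 ** (1 - bias))
            else:                             # normalized
                values.append((1 + frac / scale) * 2.0 ** (exp - bias))
    return values

vals = positive_values()
gaps = [b - a for a, b in zip(vals, vals[1:])]
print(vals[:5], gaps[:4])     # spacing of 1/16 near zero
print(vals[-4:], gaps[-3:])   # spacing of 2 near the top of the range
```

Calling positive_values(4, 3) gives the 8-bit format from the earlier slides, whose smallest positive value is 1/512 and whose largest is 240.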
Floats are not Reals
Ints:
2^31 − 1 = 2,147,483,647
e.g. 40000 * 40000 --> 1600000000
600000 * 600000 --> ?
Floats:
e.g. Is (x + y) + z = x + (y + z)?
(1e20 + -1e20) + 3.14 --> 3.14
1e20 + (-1e20 + 3.14) --> ??
Need to understand details of underlying implementations
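Both questions can be explored from Python. Python integers never overflow, so the sketch below uses ctypes.c_int32 to imitate the wraparound of a 32-bit two's-complement int; that substitution, and the name mul_int32, are assumptions made for illustration (in C, signed overflow is formally undefined behavior).

```python
import ctypes

def mul_int32(a, b):
    """Product of a and b truncated to 32-bit two's complement."""
    return ctypes.c_int32(a * b).value

print(mul_int32(40000, 40000))     # 1600000000
print(mul_int32(600000, 600000))   # wraps: 360000000000 needs more than 32 bits

print((1e20 + -1e20) + 3.14)       # 3.14
print(1e20 + (-1e20 + 3.14))       # 0.0: 3.14 is lost when added to -1e20
```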
IEEE 754
IEEE 754 Binary16 (F16) Format
Component Bits
Sign bit 1
Exponent 5
Fraction 10
Total 16 bits (2 bytes)
Overview of IEEE 754 Binary 32
Field                Bits  Description
Sign                 1     0 = positive, 1 = negative
Exponent             8     Encodes exponent with bias
Fraction (Mantissa)  23    Precision bits (fractional part)
IEEE 754
IEEE 754 Binary128 (F128) Format
Field                Bits  Description
Sign                 1     0 = positive, 1 = negative
Exponent             15    Encodes exponent using a bias of 16383
Fraction (Mantissa)  112   Fractional part of the significand
IEEE 754 Binary64
Field                Bits  Description
Sign                 1     0 = positive, 1 = negative
Exponent             11    Encodes exponent with bias
Fraction (Mantissa)  52    Precision bits (fractional part)
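The field widths in these tables determine the total size, the bias and the largest normalized exponent of each format. A minimal sketch, with the FORMATS table and the describe function introduced only for illustration:

```python
FORMATS = {            # name: (exponent bits, fraction bits)
    "binary16": (5, 10),
    "binary32": (8, 23),
    "binary64": (11, 52),
    "binary128": (15, 112),
}

def describe(name):
    """Return (total bits, bias, largest normalized exponent) for a format."""
    exp_bits, frac_bits = FORMATS[name]
    total_bits = 1 + exp_bits + frac_bits           # plus one sign bit
    bias = 2 ** (exp_bits - 1) - 1
    max_exponent = (2 ** exp_bits - 2) - bias       # all-ones field is reserved
    return total_bits, bias, max_exponent

for name in FORMATS:
    print(name, describe(name))
```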