Great Ideas in Computer Architecture (a.k.a. Machine Structures)
UC Berkeley
Teaching Professor Dan Garcia
Professor Bora Nikolić
Floating Point
Garcia, Nikolić
cs61c.org
Quote of the day
“95% of the folks out there are completely clueless about floating-point.”
– James Gosling, 1998-02-28
ú Sun Fellow
ú Java Inventor
Review of Numbers
§ Computers made to process numbers
§ What can we represent in N bits?
ú 2^N things, and no more! They could be…
ú Unsigned integers:
0 to 2^N - 1
(for N=32, 2^N - 1 = 4,294,967,295)
ú Signed Integers (Two’s Complement)
-2^(N-1) to 2^(N-1) - 1
(for N=32, 2^(N-1) - 1 = 2,147,483,647)
What about other numbers?
§ Very large numbers (sec/millennium)
ú 31,556,926,000ten (3.1556926 x 10^10)
§ Very small numbers? (Bohr radius)
ú 0.0000000000529177ten m (5.29177 x 10^-11)
§ #s with both integer & fractional parts?
ú 1.5
§ First consider #3.
ú …our solution will also help with #1 and #2.
Representation of Fractions
§ “Binary Point” like decimal point signifies
boundary betw. integer and fractional parts:
§ Example 6-bit representation:
xx.yyyy
(place values: 2^1 2^0 . 2^-1 2^-2 2^-3 2^-4)
§ 10.1010two = 1x2^1 + 1x2^-1 + 1x2^-3 = 2.625ten
§ If we assume “fixed binary point”, range of 6-bit
representations with this format:
ú 0 to 3.9375 (almost 4)
Fractional Powers of 2
(see Mark Lu’s “Binary Float Displayer”)
i    2^-i
0    1.0                 1
1    0.5                 1/2
2    0.25                1/4
3    0.125               1/8
4    0.0625              1/16
5    0.03125             1/32
6    0.015625
7    0.0078125
8    0.00390625
9    0.001953125
10   0.0009765625
11   0.00048828125
12   0.000244140625
13   0.0001220703125
14   0.00006103515625
15   0.000030517578125
Representation of Fractions with Fixed Pt.
§ What about addition and multiplication?
ú Addition is straightforward:
    01.100    1.5ten
  + 00.100    0.5ten
    10.000    2.0ten
ú Multiplication a bit more complex:
       01.100    1.5ten
     x 00.100    0.5ten
        00000
       00000
      01100
     00000
    00000
   0000110000
ú Where’s the answer, 0.11?
Need to remember where the point is…
Representation of Fractions
§ So far, in our examples we used a “fixed” binary point. What we
really want is to “float” the binary point. Why?
ú A floating binary point makes the most effective use of our limited bits (and thus
gives more accuracy in our number representation).
ú E.g., put 0.1640625 into binary, representing it in 5 bits while choosing where
to put the binary point:
ú … 000000.001010100000…
ú Store these bits and keep track of the binary point, 2 places to the left
of the MSB.
ú Any other choice of 5 bits would lose accuracy!
§ With floating point representation, each numeral carries an
exponent field recording the whereabouts of its binary point.
§ The binary point can be outside the stored bits, so very large and
very small numbers can be represented.
Scientific Notation (in Decimal)
6.02ten x 10^23
(mantissa: 6.02; decimal point; radix/base: 10; exponent: 23)
§ Normalized form: no leading 0s
(exactly one digit to left of decimal point)
§ Alternatives to representing 1/1,000,000,000
ú Normalized: 1.0 x 10^-9
ú Not normalized: 0.1 x 10^-8, 10.0 x 10^-10
Scientific Notation (in Binary)
1.01two x 2^-1
(mantissa: 1.01; “binary point”; radix/base: 2; exponent: -1)
§ Computer arithmetic that supports it is
called floating point, because it represents
numbers where the binary point is not
fixed, as it is for integers
ú Declare such a variable in C as float
Floating Point Representation (1/2)
§ Normal format: +1.xxx…xtwo x 2^(yyy…ytwo)
§ Multiple of Word Size (32 bits):
31 | 30-23 | 22-0
S | Exponent | Significand
1 bit | 8 bits | 23 bits
§ S represents Sign
§ Exponent represents y’s
§ Significand represents x’s
§ Represent numbers as small as
1.2 x 10^-38 to as large as 3.4 x 10^38
Floating Point Representation (2/2)
§ What if result too large?
ú (> 3.4x10^38 , < -3.4x10^38)
ú Overflow! → Exponent larger than can be represented in 8-bit
Exponent field
§ What if result too small?
ú (>0 and < 1.2x10^-38 , <0 and > -1.2x10^-38)
ú Underflow! → Negative exponent larger than can be represented in
8-bit Exponent field
(number line: overflow below -2x10^38; underflow between -2x10^-38 and 2x10^-38, excluding 0; overflow above 2x10^38)
§ What would help reduce chances of overflow and/or
underflow?
IEEE 754 Floating Point Standard (1/3)
§ Single Precision (DP similar):
31 | 30-23 | 22-0
S | Exponent | Significand
1 bit | 8 bits | 23 bits
§ Sign bit: 1 means negative, 0 means positive
§ Significand:
ú To pack more bits, leading 1 implicit for normalized numbers
ú 1 + 23 bits single, 1 + 52 bits double
ú always true: 0 < Significand < 1 (for normalized numbers)
§ Note: 0 has no leading 1, so reserve exponent value 0
just for number 0
IEEE 754 Floating Point Standard (2/3)
§ IEEE 754 uses “biased exponent” representation.
ú Designers wanted FP numbers to be usable even with no FP
hardware; e.g., sort records with FP numbers using integer
compares
ú Wanted bigger (integer) exponent field to represent bigger
numbers.
ú 2’s complement poses a problem (because negative numbers
look bigger)
ú We’re going to see that the numbers are ordered EXACTLY as
in sign-magnitude
I.e., counting on a binary odometer from 00…00 up to 11…11 goes from 0
up to +MAX, then jumps to -0 and counts down to -MAX
IEEE 754 Floating Point Standard (3/3)
§ Called Biased Notation, where the bias is the number subtracted
from the stored field to get the real value
ú IEEE 754 uses a bias of 127 for single prec.
ú Subtract 127 from the Exponent field to get the exponent value
§ Summary (single precision, or fp32):
31 | 30-23 | 22-0
S | Exponent | Significand
1 bit | 8 bits | 23 bits
§ (-1)^S x (1 + Significand) x 2^(Exponent-127)
§ Double precision identical, except exponent bias of
1023 (half, quad similar)…
“Father” of the Floating Point Standard
§ Prof. William Kahan: 1989 ACM Turing
Award winner, for the IEEE Standard 754
for Binary Floating-Point Arithmetic.
www.cs.berkeley.edu/~wkahan/ieee754status/754story.html
Representation for ± ∞
§ In FP, divide by 0 should produce ± ∞, not
overflow.
§ Why?
ú OK to do further computations with ∞
ú E.g., X/0 > Y may be a valid comparison
ú Ask math majors
§ IEEE 754 represents ± ∞
ú Most positive exponent reserved for ∞
ú Significands all zeroes
Representation for 0
§ Represent 0?
ú exponent all zeroes
ú significand all zeroes
ú What about sign? Both cases valid.
+0: 0 00000000 00000000000000000000000
-0: 1 00000000 00000000000000000000000
Special Numbers
§ What have we defined so far? (Single Precision)
Exponent | Significand | Object
0 | 0 | 0
0 | nonzero | ???
1-254 | anything | +/- fl. pt. #
255 | 0 | +/- ∞
255 | nonzero | ???
§ Professor Kahan had clever ideas;
“Waste not, want not”
ú Wanted to use Exp=0,255 & Sig!=0
Representation for Not a Number
§ What do I get if I calculate sqrt(-4.0) or 0/0?
ú If ∞ is not an error, these shouldn’t be either
ú Called Not a Number (NaN)
ú Exponent = 255, Significand nonzero
§ Why is this useful?
ú Hope NaNs help with debugging?
ú They contaminate: op(NaN, X) = NaN
ú Can use the significand to identify which one!
Representation for Denorms (1/2)
§ Problem: There’s a gap among representable FP
numbers around 0
ú Smallest representable pos num:
a = 1.0…0two x 2^-126 = 2^-126
ú Second smallest representable pos num:
b = 1.0…01two x 2^-126
= (1 + 0.0…01two) x 2^-126
= (1 + 2^-23) x 2^-126
= 2^-126 + 2^-149
(Normalization and the implicit 1 are to blame!)
ú a - 0 = 2^-126
ú b - a = 2^-149
ú Gaps!
Representation for Denorms (2/2)
§ Solution:
ú We still haven’t used Exponent = 0, Significand
nonzero
ú Denormalized number: no (implied) leading 1, implicit
exponent = -126.
ú Smallest representable pos num:
a = 2^-149
ú Second smallest representable pos num:
b = 2^-148
Special Numbers Summary
§ Reserve exponents, significands:
Exponent | Significand | Object
0 | 0 | 0
0 | nonzero | Denorm
1-254 | anything | +/- fl. pt. #
255 | 0 | +/- ∞
255 | nonzero | NaN
Example
§ What is the decimal equivalent of:
1 10000001 11100000000000000000000
S Exponent Significand
(-1)^S x (1 + Significand) x 2^(Exponent-127)
= (-1)^1 x (1 + .111two) x 2^(129-127)
= -1 x 1.111two x 2^2
= -111.1two
= -7.5ten
Example: Representing 1/3
§ 1/3
= 0.33333…ten
= 0.25 + 0.0625 + 0.015625 + 0.00390625 + …
= 1/4 + 1/16 + 1/64 + 1/256 + …
= 2^-2 + 2^-4 + 2^-6 + 2^-8 + …
= 0.0101010101…two x 2^0
= 1.0101010101…two x 2^-2
ú Sign: 0
ú Exponent = -2 + 127 = 125 = 01111101two
ú Significand = 0101010101…
0 01111101 01010101010101010101010
Understanding the Significand (1/2)
§ Method 1 (Fractions):
ú In decimal: 0.340ten ⇒ 340ten/1000ten
⇒ 34ten/100ten
ú In binary: 0.110two ⇒ 110two/1000two = 6ten/8ten
⇒ 11two/100two = 3ten/4ten
ú Advantage: less purely numerical, more
thought oriented; this method usually helps
people understand the meaning of the
significand better
Understanding the Significand (2/2)
§ Method 2 (Place Values):
ú Convert from scientific notation
ú In decimal:
1.6732 = (1x10^0) + (6x10^-1) + (7x10^-2) + (3x10^-3) + (2x10^-4)
ú In binary:
1.1001 = (1x2^0) + (1x2^-1) + (0x2^-2) + (0x2^-3) + (1x2^-4)
ú Interpretation of value in each position extends
beyond the decimal/binary point
ú Advantage: good for quickly calculating significand
value; use this method for translating FP numbers
Floating Point Fallacy
§ FP add associative?
ú x = -1.5 x 10^38, y = 1.5 x 10^38, and z = 1.0
ú x + (y + z) = -1.5x10^38 + (1.5x10^38 + 1.0)
= -1.5x10^38 + (1.5x10^38) = 0.0
ú (x + y) + z = (-1.5x10^38 + 1.5x10^38) + 1.0
= (0.0) + 1.0 = 1.0
§ Therefore, Floating Point add is not associative!
ú Why? FP result approximates real result!
ú This example: 1.5 x 10^38 is so much larger than 1.0
that 1.5 x 10^38 + 1.0 in floating point representation is
still 1.5 x 10^38
Precision and Accuracy (don’t confuse these two terms!)
§ Precision is a count of the number of bits used to
represent a value.
§ Accuracy is the difference between the actual value of
a # and its computer representation.
§ High precision permits high accuracy but doesn’t
guarantee it.
ú It is possible to have high precision but low accuracy.
§ Example: float pi = 3.14;
ú pi will be represented using all 24 bits of the significand
(highly precise), but is only an approximation (not accurate).
Rounding
§ When we perform math on real numbers, we
have to worry about rounding to fit the result
in the significand field.
§ The FP hardware carries two extra bits of
precision, and then rounds to get the proper
value
§ Rounding also occurs when converting:
double to a single precision value, or
floating point number to an integer
IEEE FP Rounding Modes
(Examples in decimal; but, of course, IEEE 754 works in binary)
§ Round towards +∞
ú ALWAYS round “up”: 2.001 → 3, -2.001 → -2
§ Round towards -∞
ú ALWAYS round “down”: 1.999 → 1, -1.999 → -2
§ Truncate
ú Just drop the last bits (round towards 0)
§ Unbiased (default mode). Midway? Round to even
ú Normal rounding, almost: 2.4 → 2, 2.6 → 3, 2.5 → 2, 3.5 → 4
ú Round like you learned in grade school (nearest int)
ú Except if the value is right on the borderline, in which case we round to
the nearest EVEN number
ú Ensures fairness in calculation
ú This way, half the time we round up on a tie, the other half we round
down. Tends to balance out inaccuracies
Now you know why you see these errors…
Saturday Morning Breakfast Cereal (SMBC):
www.smbc-comics.com/comic/2013-06-05
FP Addition
§ More difficult than with integers
§ Can’t just add significands
§ How do we do it?
ú De-normalize to match exponents
ú Add significands to get the resulting significand
ú Keep the same exponent
ú Normalize (possibly changing the exponent)
§ Note: If signs differ, just perform a
subtract instead.
Casting floats to ints and vice versa
(int) floating_point_expression
Coerces and converts it to an integer (C
truncates, i.e., rounds toward zero)
i = (int) (3.14159 * f);
(float) integer_expression
converts integer to nearest floating point
f = f + (float) i;
int → float → int
if (i == (int)((float) i)) {
  printf("true");
}
§ Will not always print “true”
§ Most large values of integers don’t have
exact floating point representations!
§ What about double?
float → int → float
if (f == (float)((int) f)) {
  printf("true");
}
§ Will not always print “true”
§ Small floating point numbers (<1) don’t
have integer representations
§ For other numbers, rounding errors
Double Precision Fl. Pt. Representation
§ binary64: Next Multiple of Word Size (64 bits)
31 | 30-20 | 19-0
S | Exponent | Significand
1 bit | 11 bits | 20 bits
Significand (cont’d)
32 bits
§ Double Precision (vs. Single Precision)
ú C variable declared as double
ú Represent numbers almost as small as
2.0 x 10^-308 to almost as large as 2.0 x 10^308
ú But primary advantage is greater accuracy
due to larger significand
Other Floating Point Representations
§ Quad-Precision? Yep! (128 bits) “binary128”
ú Unbelievable range, precision (accuracy)
ú 15 exponent bits, 112 significand bits
§ Oct-Precision? Yep! “binary256”
ú 19 exponent bits, 236 significand bits
§ Half-Precision? Yep! “binary16” or “fp16”
ú 1 sign / 5 exponent / 10 significand bits
§ Half-Precision? Yep! “bfloat16”
ú Competing with fp16
ú Same range as fp32!
ú Used for faster ML
en.wikipedia.org/wiki/Floating_point
Floating Point Soup
Format | S | Exponent | Significand
FP32 (bits 31 / 30-23 / 22-0) | 1 bit | 8 bits | 23 bits
FP16 (bits 15 / 14-10 / 9-0) | 1 bit | 5 bits | 10 bits
BFLOAT16 (bits 15 / 14-7 / 6-0) | 1 bit | 8 bits | 7 bits
TF32 (bits 18 / 17-10 / 9-0) | 1 bit | 8 bits | 10 bits
Who Uses What in Domain Accelerators?
Accelerator int4 int8 int16 fp16 bf16 fp32 tf32
Google TPU v1 x
Google TPU v2 x
Google TPU v3 x
Nvidia Volta TensorCore x x x
Nvidia Ampere TensorCore x x x x x x x
Nvidia DLA x x x
Intel AMX x x
Amazon AWS Inferentia x x x
Qualcomm Hexagon x
Huawei Da Vinci x x
MediaTek APU 3.0 x x x
Samsung NPU x
Tesla NPU x
Unum
en.wikipedia.org/wiki/Unum_(number_format)
§ Everything so far has had a
fixed set of bits for
Exponent and Significand
ú What if they were variable?
ú Add a “u-bit” to tell whether a
number is exact or a range
ú “Promises to be to floating
point what floating point is
to fixed point”
§ Claims to save power!
(Dr. John Gustafson)
Conclusion
§ Floating Point lets us:
ú Represent numbers containing both integer and fractional parts; makes
efficient use of available bits.
ú Store approximate values for very large and very small #s.
§ IEEE 754 Floating Point Standard is the most widely accepted
attempt to standardize interpretation of such numbers
ú Every computer since ~1997 follows these conventions
§ Summary (single precision, or fp32):
31 | 30-23 | 22-0
S | Exponent | Significand
1 bit | 8 bits | 23 bits
(Can store NaN, ±∞)
(-1)^S x (1 + Significand) x 2^(Exponent-127)
§ Exponent tells Significand how much (2^i) to count by (…, 1/4, 1/2, 1, 2, …)
www.h-schmidt.net/FloatConverter/IEEE754.html