Computer arithmetic
Victor Eijkhout
Fall 2023
1 Eijkhout – Computer arithmetic – Fall 2023
Justification
This short session will explain the basics of floating point arithmetic,
mostly focusing on round-off and its influence on computations.
2 Eijkhout – Computer arithmetic – Fall 2023
Numbers in scientific computing
Integers: . . . , −2, −1, 0, 1, 2, . . .
Rational numbers: 1/3, 22/7: not often encountered
√
Real numbers 0, 1, −1.5, 2/3,
2, log 10, . . .
√ √
Complex numbers 1 + 2i , 3 − 5i , . . .
Computers use a finite number of bits to represent numbers,
so only a finite number of numbers can be represented, and no
irrational numbers (even some rational numbers).
3 Eijkhout – Computer arithmetic – Fall 2023
First we dig into bits
4 Eijkhout – Computer arithmetic – Fall 2023
Bit operations
boolean bitwise (C) bitwise (F) bitwise (Py)
and && & iand &
or || | ior |
not ! ˜
xor ˆ ieor
Bit shift operations in C:
left shift<<
right shift >>
Fortran: mvbits
5 Eijkhout – Computer arithmetic – Fall 2023
Arithmetic with bit ops
Left-shift is multiplication by 2:
i_times_2 = i<<1;
Extract bits:
i_mod_8 = i & 7
(How does that last one work?)
6 Eijkhout – Computer arithmetic – Fall 2023
Exercise 1: Bit operations
Use bit operations to test whether a number is odd or even.
Can you think of more than one way?
7 Eijkhout – Computer arithmetic – Fall 2023
Integers
8 Eijkhout – Computer arithmetic – Fall 2023
Integers
Scientific computation mostly uses real numbers. Integers are mostly
used for array indexing.
We look at
1. integers as supported by the hardware;
2. integers as they exist in programming languages;
3. (and not software-defined integers)
9 Eijkhout – Computer arithmetic – Fall 2023
In C/C++ and Fortran
C:
A short int is at least 16 bits;
An integer is at least 16 bits, but often 32 bits;
A long integer is at least 32 bits, but often 64;
A long long integer is at least 64 bits.
Fortran uses kinds, not necessarily equal to number of bytes:
integer(2) :: i2
integer(4) :: i4
integer(8) :: i8
Specify the number of decimal digits with selected_int_kind(n).
10 Eijkhout – Computer arithmetic – Fall 2023
Exercise 2: Powers of two
Print 2n for n = 0, . . . , 31. There are at least two ways of generating
these powers.
Also print the bit pattern. What is unexpected?
11 Eijkhout – Computer arithmetic – Fall 2023
Negative integers
Problem:
How do we represent them?
How do we do efficient arithmetic on them?
Define
rep : Z → 2n
‘representation of the number N ∈ Z as bitstring of length n.’
int : 2n → Z
‘interpretation of the bitstring of length n as number N ∈ Z ’
12 Eijkhout – Computer arithmetic – Fall 2023
Negative integers
Use of sign bit: typically first bit
s i 1 . . . in
Simplest solution:
(
n≥0 rep(n) = 0, i1 , . . . i31
n < 0 rep(−n) = 1, i1 , . . . i31
13 Eijkhout – Computer arithmetic – Fall 2023
Sign bit
Interpretation
bitstring 00 · · · 0 . . . 01 · · · 1 10 · · · 0 . . . 11 · · · 1
as unsigned int 0 . . . 231 − 1 231 . . . 232 − 1
as naive signed 0 . . . 231 − 1 −0 . . . −231 + 1
14 Eijkhout – Computer arithmetic – Fall 2023
Shifting
Interpret unsigned number n as n − B
bitstring 00 · · · 0 . . . 01 · · · 1 10 · · · 0 . . . 11 · · · 1
as unsigned int 0 . . . 231 − 1 231 . . . 232 − 1
31
as shifted int −2 ... −1 0 . . . 231 − 1
15 Eijkhout – Computer arithmetic – Fall 2023
2’s Complement
Let m be a signed integer, then the 2’s complement ‘bit pattern’ rep(m)
is a non-negative integer defined as follows:
If 0 ≤ m ≤ 231 − 1, the normal bit pattern for m is used, that is
0 ≤ m ≤ 231 − 1 ⇒ rep(m) = m.
For −231 ≤ n ≤ −1, n is represented by the bit pattern for
232 − |n|:
−231 ≤ n ≤ −1 ⇒ rep(m) = 232 − |n|.
16 Eijkhout – Computer arithmetic – Fall 2023
2’s complement visualized
bitstring 00 · · · 0 . . . 01 · · · 1 10 · · · 0 . . . 11 · · · 1
as unsigned int 0 . . . 231 − 1 231 . . . 232 − 1
31
as 2’s comp. integer 0 ··· 2 −1 −231 · · · −1
17 Eijkhout – Computer arithmetic – Fall 2023
Integer arithmetic
Problem: processor is very good at artithmetic on (unsigned) bit
strings.
How does that translate to arithmetic on integers?
?
int (rep(x ) ∗ rep(y )) = x ∗ y
18 Eijkhout – Computer arithmetic – Fall 2023
Addition in 2’s complement
Add m + n, where m, n are representable:
0 ≤ |m|, |n| < 231 .
The easy case is 0 < m, n, as long as there is no overflow.
19 Eijkhout – Computer arithmetic – Fall 2023
Addition in 2’s complement (cont’d)
Case m > 0, n < 0, and m + n > 0. Then rep(m) = m and
rep(n) = 232 − |n|, so the unsigned addition becomes
rep(m) + rep(n) = m + (232 − |n|) = 232 + m − |n|.
Since m − |n| > 0, this result is > 232 .
However, this is basically m + n with the overflow bit set.
20 Eijkhout – Computer arithmetic – Fall 2023
Subtraction in 2’s complement
Subtraction m − n:
Case: m < n. Observe that −n has the bit pattern of 232 − n.
Also, m + (232 − n) = 232 − (n − m) where 0 < n − m < 231 − 1,
so 232 − (n − m) is the 2’s complement bit pattern of m − n.
Case: m > n. The bit pattern for −n is 232 − n, so m + (−n) as
unsigned is m + 232 − n = 232 + (m − n). Here m − n > 0. The
232 is an overflow bit; ignore.
21 Eijkhout – Computer arithmetic – Fall 2023
Overflow
There is a limited number of bits, so numbers that are too large in
absolute value can not be represented.
Overflow.
This is not a fatal error: your program continues with the wrong result.
22 Eijkhout – Computer arithmetic – Fall 2023
Exercise 3: Integer overflow
Investigate what happens when you perform an integer calculation that
leads to overflow. What does your compiler say if you try to write down
a nonrepresentible number explicitly, for instance in a declaration or
assignment statement?
Language lawyer remark: signed integer overflow is Undefined
Behavior in C/C++.
23 Eijkhout – Computer arithmetic – Fall 2023
Floating point numbers
24 Eijkhout – Computer arithmetic – Fall 2023
Floating point math is hard!
And the consequences if you get it wrong can be considerable.
25 Eijkhout – Computer arithmetic – Fall 2023
Floating point numbers
Analogous to scientific notation x = 6.022 · 1023 :
−1 −i e
x = ±Σti =0 di β β
sign bit
β is the base of the number system
0 ≤ di ≤ β − 1 the digits of the mantissa:
one digit before the radix point, so mantissa < β
e ∈ [L, U ] exponent, stored with bias: unsigned int where
fl(L) = 0
26 Eijkhout – Computer arithmetic – Fall 2023
Examples of floating point systems
β t L U
IEEE single (32 bit) 2 23 -126 127
IEEE double (64 bit) 2 53 -1022 1023
Old Cray 64bit 2 48 -16383 16384
IBM mainframe 32 bit 16 6 -64 63
packed decimal 10 50 -999 999
BCD is tricky: 3 decimal digits in 10 bits
(we will often use β = 10 in the examples, because it’s easier to read
for humans, but all practical computers use β = 2)
Internal processing in 80 bit
27 Eijkhout – Computer arithmetic – Fall 2023
Limitations
Overflow: more than β(1 − β−t +1 )βU or less than −β(1 − β−t +1 )βU
Underflow: positive numbers less than βL
Gradual underflow: β−t +1 · βL
Overflow leads to Inf.
28 Eijkhout – Computer arithmetic – Fall 2023
Exercise 4: Floating point overflow
p
For real numbers x , y , the quantity g = (x 2 + y 2 )/2 satisfies
g ≤ max{|x |, |y |}
so it is representable if x and y are. What can go wrong if you
compute g using the above formula? Can you think of a better way?
29 Eijkhout – Computer arithmetic – Fall 2023
Other exceptions
Overflow: Inf
Inf−Inf→NaN√
also 0/0 or −1
This does not stop your program in general
sometimes possible
30 Eijkhout – Computer arithmetic – Fall 2023
The normalization problem
Do we allow
1.100 · 100 , 0.110 · 101 , 0.011 · 102 ?
This makes testing for equality hard.
Solution: normalized numbers have one nonzero before the radix
point.
31 Eijkhout – Computer arithmetic – Fall 2023
Normalized floating point numbers
Require first digit in the mantissa to be nonzero.
Equivalent: mantissa part 1 ≤ xm < β
Unique representation for each number,
also: in binary this makes the first digit 1, so we don’t need to store
that.
(do you see a problem?)
With normalized numbers, underflow threshold is 1 · βL ;
‘gradual underflow’ possible, but usually not efficient.
32 Eijkhout – Computer arithmetic – Fall 2023
IEEE 754, 32-bit pattern
sign exponent mantissa
p e = e1 · · · e8 s = s1 . . . s23
31 30 · · · 23 22 · · · 0
± 2e−127 s1 · 2−1 + · · · + s23 · 2−23
(except e = 0, 255)
33 Eijkhout – Computer arithmetic – Fall 2023
IEEE 754, 32-bit, all cases
(e1 · · · e8 ) numerical value range
(0 · · · 0) = 0 ±0.s1 · · · s23 × 2−126 s = 0 · · · 01 ⇒ 2−23 · 2−126 = 2−149 ≈ 10−45
s = 1 · · · 11 ⇒ (1 − 2−23 ) · 2−126
(0 · · · 01) = 1 ±1.s1 · · · s23 × 2−126 s = 0 · · · 01 ⇒ 1 · 2−126 ≈ 10−37
(0 · · · 010) = 2 ±1.s1 · · · s23 × 2−125
···
(01111111) = 127 ±1.s1 · · · s23 × 20 s = 0 · · · 00 ⇒ 1 · 20 = 1
s = 0 · · · 01 ⇒ 1 + 2−23 · 20 = 1 + ε
s = 1 · · · 11 ⇒ (2 − 2−23 ) · 20 = 2 − ε
(10000000) = 128 ±1.s1 · · · s23 × 21 s = 0 · · · 00 ⇒ 1 · 21 = 2
··· et cetera
(11111110) = 254 ±1.s1 · · · s23 × 2127
(11111111) = 255 s1 · · · s23 = 0 ⇒ ±∞
s1 · · · s23 ̸= 0 ⇒ NaN
34 Eijkhout – Computer arithmetic – Fall 2023
Exercise 5: Float vs Int
Note that the exponent doesn’t come at the end. This has an
interesting consequence.
What is the interpretation of
0···0111
as int? What as float?
What is the largest integer that is representible as float?
35 Eijkhout – Computer arithmetic – Fall 2023
Other precisions
There is a 64-bit format, with 53 bits mantissa.
IEEE envisioned a sliding scale of precisions: see Intel 80-bit
registers
Half precision, and recent invention bfloat16
36 Eijkhout – Computer arithmetic – Fall 2023
Floating point math
37 Eijkhout – Computer arithmetic – Fall 2023
Representation error
Error between number x and representation x̃:
absolute x − x̃ or |x − x̃ |
relative x −x̃
x
or x −x̃
x
Equivalent: x̃ = x ± ε ⇔ |x − x̃ | ≤ ε ⇔ x̃ ∈ [x − ε, x + ε].
Also: x̃ = x (1 + ε) often shorthand for x̃ −
x
x
≤ε
38 Eijkhout – Computer arithmetic – Fall 2023
Example
Decimal, t = 3 digit mantissa: let x = 1.256, x̃round = 1.26,
x̃truncate = 1.25
Error in the 4th digit.
Different story for decimal vs binary.
How would this story change with a non-zero exponent,
for instance 1.256 · 1012 ?
39 Eijkhout – Computer arithmetic – Fall 2023
Exercise 6: Round-off
The number e ≈ 2.72, the base for the natural logarithm, has various
definitions. One of them is
e = lim (1 + 1/n)n . (1)
n→∞
Write a single precision program that tries to compute e in this
manner. (Do not use the pow function: code the power explicitly.)
Evaluate the expression for an upper bound n = 10k for some k . (How
far do you let k range?) Explain the output for large n. Comment on
the behavior of the error.
40 Eijkhout – Computer arithmetic – Fall 2023
Machine precision
Any real number can be represented to a certain precision:
x̃ = x (1 + ε) where
truncation: ε = β−t +1
rounding: ε = 21 β−t +1
This is called machine precision: maximum relative error.
32-bit single precision: mp ≈ 10−7
64-bit double precision: mp ≈ 10−16
Maximum attainable accuracy.
Another definition of machine precision: smallest number ε such that
1 + ε > 1.
41 Eijkhout – Computer arithmetic – Fall 2023
Exercise 7: Machine epsilon
Write a small program that computes the machine epsilon for both
single and double precision. Does it make any difference if you set the
compiler optimization levels low or high?
(For C++ programmers: can you write a templated program that works
for single and double precision?)
42 Eijkhout – Computer arithmetic – Fall 2023
Addition
1. align exponents
2. add mantissas
3. adjust exponent to normalize
Example: 1.00 + 2.00 × 10−2 = 1.00 + .02 = 1.02. This is exact, but
what happens with 1.00 + 2.55 × 10−2 ?
Example: 5.00 × 101 + 5.04 = (5.00 + 0.504) × 101 → 5.50 × 101
Any error comes from limiting the mantissa: if x is the true sum and x̃
the computed sum, then x̃ = x (1 + ε) with |ε| < 10−2
43 Eijkhout – Computer arithmetic – Fall 2023
The ‘correctly rounded arithmetic’ model
Assumption (enforced by IEEE 754):
The numerical result of an operation is the rounding of the
exactly computed result.
fl(x1 ⊙ x2 ) = (x1 ⊙ x2 )(1 + ε)
where ⊙ = +, −, ∗, /
Note: this holds only for a single operation!
44 Eijkhout – Computer arithmetic – Fall 2023
Guard digits
Correctly rounding is not trivial, especially for subtraction.
Example: t = 2, β = 10: 1.0 − 9.5 × 10−1 , exact result
0.05 = 5.0 × 10−2 .
Simple approach:
1.0 − 9.5 × 10−1 = 1.0 − 0.9 = 0.1 = 1.0 × 10−1
Using ‘guard digit’:
1.0 − 9.5 × 10−1 = 1.0 − 0.95 = 0.05 = 5.0 × 10−2 , exact.
In general 3 extra bits needed.
45 Eijkhout – Computer arithmetic – Fall 2023
Fused Mul-Add instructions
(also ‘fused multiply-accumulate’)
c ← a∗b+c
Addition plus multiplication, but not independent
Processors can have dedicated hardware for FMA (also IEEE
754-2008)
Internally evaluated in higher precision: 80-bit.
Very useful for certain linear algebra (which?) Not for other
operations (examples?)
46 Eijkhout – Computer arithmetic – Fall 2023
Associativity
Computate 4 + 6 + 7 in one significant digit.
Evaluation left-to-right gives:
(4 · 100 + 6 · 100 ) + 7 · 100 ⇒ 10 · 100 + 7 · 100 addition
⇒ 1 · 101 + 7 · 100 rounding
⇒ 1.0 · 101 + 0.7 · 101 using guard digit
⇒ 1.7 · 101
⇒ 2 · 101 rounding
On the other hand, evaluation right-to-left gives:
4 · 100 + (6 · 100 + 7 · 100 ) ⇒ 4 · 100 + 13 · 100 addition
⇒ 4 · 100 + 1 · 101 rounding
⇒ 0.4 · 101 + 1.0 · 101 using guard digit
⇒ 1.4 · 101
⇒ 1 · 101 rounding
47 Eijkhout – Computer arithmetic – Fall 2023
Error propagation under addition
Let s = x1 + x2 , and x = s̃ = x̃1 + x̃2 with x̃i = xi (1 + εi )
x̃ = s̃(1 + ε3 )
= x1 (1 + ε1 )(1 + ε3 ) + x2 (1 + ε2 )(1 + ε3 )
= x1 + x2 + x1 (ε1 + ε3 ) + x2 (ε2 + ε3 )
⇒ x̃ = s(1 + 2ε)
⇒ errors are added
Assumptions: all εi approximately equal size and small;
xi > 0
48 Eijkhout – Computer arithmetic – Fall 2023
Multiplication
1. add exponents
2. multiply mantissas
3. adjust exponent
Example:
.123 × .567 × 101 = .069741 × 101 → .69741 × 100 → .697 × 100 .
What happens with relative errors?
49 Eijkhout – Computer arithmetic – Fall 2023
Examples
50 Eijkhout – Computer arithmetic – Fall 2023
Subtraction
Correct rounding only applies to a single operation.
Example: 1.24 − 1.23 = 0.01 → 1. × 10−2 :
result is exact, but only one significant digit.
What if 1.24 = fl(1.244) and 1.23 = fl(1.225)? Correct
result 1.9 × 10−2 ; almost 100% error.
Cancellation leads to loss of precision
subsequent operations with this result are inaccurate
this can not be fixed with guard digits and such
⇒ avoid subtracting numbers that are likely close.
51 Eijkhout – Computer arithmetic – Fall 2023
ABC-formula
√
2
Example: ax 2 + bx + c = 0 → x = −b± 2ab −4ac
2
suppose b > 0 and b ≫ 4ac√ then the ‘+’ solution will be inaccurate
−b− b2 −4ac
Better: compute x− = 2a
and use x+ · x− = −c /a.
52 Eijkhout – Computer arithmetic – Fall 2023
Example
Equation
f (x ) = εx 2 − (1 + ε2 )x + ε,
Roots for ε small:
x+ ≈ ε−1 , x− ≈ ε.
Textbook solution for small ε
x− ≈ 0, f (x− ) ≈ ε
Accurately:
f (x− ) ≈ ε3
53 Eijkhout – Computer arithmetic – Fall 2023
Numerical test
textbook accurate
ε x− f (x− ) x− f (x− )
10−3 1.000 · 10 −03
−2.876 · 10−14 1.000 · 10−03
−2.168 · 10−19
10−4 1.000 · 10 −04
5.264 · 10−14 1.000 · 10−04
0.000
10−5 1.000 · 10 −05
−8.274 · 10−13 1.000 · 10−05 −1.694 · 10−21
10−6 1.000 · 10−06 −3.339 · 10−11 1.000 · 10−06 −2.118 · 10−22
10−7 9.992 · 10−08 7.993 · 10−11 1.000 · 10−07 1.323 · 10−23
10−8 1.110 · 10−08 −1.102 · 10−09 1.000 · 10−08 0.000
10−9 0.000 1.000 · 10−09 1.000 · 10−09 −2.068 · 10−25
10−10 0.000 1.000 · 10−10 1.000 · 10−10 0.000
(2)
54 Eijkhout – Computer arithmetic – Fall 2023
Serious example
1
Evaluate Σ10000
n=1 n2 = 1.644834
in 6 digits: machine precision is 10−6 in single precision
First term is 1, so partial sums are ≥ 1, so 1/n2 < 10−6 gets ignored, ⇒ last
7000 terms (or more) are ignored, ⇒ sum is 1.644725: 4 correct digits
Solution: sum in reverse order; exact result in single precision
Why? Consider ratio of two terms:
n2 n2 1 2
= = ≈ 1+
(n − 1)2 n2 − 2n + 1 1 − 2/n + 1/n2 n
with aligned exponents:
n − 1: .00 · · · 0 10 · · · 00
n: .00 · · · 0 10 · · · 01 0 · · · 0
k = log(n/2) positions
The last digit in the smaller number is not lost if n < 2/ε
55 Eijkhout – Computer arithmetic – Fall 2023
Another serious example
Previous example was due to finite representation; this example is more due
to algorithm itself.
R1 n
Consider yn = 0 xx−5 dx = n1 − 5yn−1 (monotonically decreasing)
y0 = ln 6 − ln 5.
In 3 decimal digits:
computation correct result
y0 = ln 6 − ln 5 = .182|322 × 101 . . . 1.82
y1 = .900 × 10−1 .884
y2 = .500 × 10−1 .0580
y3 = .830 × 10−1 going up? .0431
y4 = −.165 negative? .0343
Reason? Define error as ỹn = yn + εn , then
ỹn = 1/n − 5ỹn−1 = 1/n + 5nn−1 + 5εn−1 = yn + 5εn−1
so εn ≥ 5εn−1 : exponential growth.
56 Eijkhout – Computer arithmetic – Fall 2023
Stability of linear system solving
Problem: solve Ax = b, where b inexact.
A(x + ∆x ) = b + ∆b.
Since Ax = b, we get A∆x = ∆b. From this,
Ax =b ∥A∥∥x ∥ ≥ ∥b∥
⇒
∆x = A−1 ∆b ∥∆x ∥ ≤ ∥A−1 ∥∥∆b∥
∥∆x ∥ ∥∆b∥
⇒ ≤ ∥A∥∥A−1 ∥
∥x ∥ ∥b∥
‘Condition number’. Attainable accuracy depends on matrix properties
57 Eijkhout – Computer arithmetic – Fall 2023
Consequences of roundoff
Multiplication and addition are not associative:
problems for parallel computations.
compute a + b + c + d
sequential parallel
((a + b) + c ) + d (a+b)+(c+d)
Operations with “same” outcomes are not equally stable:
matrix inversion is unstable, elimination is stable
58 Eijkhout – Computer arithmetic – Fall 2023
Exercise 8: Fixed-point iteration
Consider the iteration
(
2xn if 2xn < 1
xn+1 = f (xn ) =
2xn − 1 if 2xn ≥ 1
Does this function have a fixed point, x0 ≡ f (x0 ), or is there a cycle
x1 = f (x0 ), x0 ≡ x2 = f (x1 ) et cetera?
Now code this function and see what happens with various starting
points x0 . Can you explain this?
59 Eijkhout – Computer arithmetic – Fall 2023
More
60 Eijkhout – Computer arithmetic – Fall 2023
Complex numbers
Two real numbers: real and imaginary part.
Storage:
Store real/imaginary adjacent: easy to pass address of one
number
Store array of real, then array of imaginary. Better for stride 1
access if only real parts are needed. Other considerations.
61 Eijkhout – Computer arithmetic – Fall 2023
Other arithmetic systems
Some compilers support higher precisions.
Arbitrary precision: GMPlib
Interval arithmetic
Half precision bfloat16
62 Eijkhout – Computer arithmetic – Fall 2023