Indian Institute of Technology Roorkee
MAB-103: Numerical Methods
Unit 1: Error Analysis (Session 2025-26)
Real numbers, like π(3.1415 . . .), can have endless decimal places. However, computers can only
store a limited number of digits. This means that at some point, depending on the computer’s
memory, the remaining digits must be ignored. As a result, the real number stored on the computer
is an approximation of the true value. While the discarded digits may be tiny, this approximation
can sometimes lead to surprisingly large errors in calculations.
This chapter dives into the world of errors caused by using approximate numbers in calculations.
Here’s a breakdown of what we’ll cover:
• Representing Numbers on Computers: We’ll explore how computers store real numbers,
which can have infinite digits, with a limited number of digits.
• Approximation Techniques for Memory: We’ll discuss two methods for approximating num-
bers to fit within a computer’s memory constraints.
• Understanding Errors and Significant Digits: We’ll introduce the concept of errors and how
significant digits help us understand the precision of our approximations.
• Error Propagation and Numerical Stability: This section examines how errors from ap-
proximations can grow throughout calculations. We’ll also explore the concept of condition
numbers and how they relate to the stability of calculating functions.
1 Floating-Point Form
Computers use a special format called floating-point representation to store real numbers. This
section will explain how it works.
Definition 1. Let x be a non-zero real number. An n-digit floating-point number in base β is
expressed as follows:
$$\mathrm{float}(x) = (-1)^s \times (.d_1 d_2 \cdots d_n)_\beta \times \beta^e \tag{1}$$
where
$$(.d_1 d_2 \cdots d_n)_\beta = \frac{d_1}{\beta} + \frac{d_2}{\beta^2} + \cdots + \frac{d_n}{\beta^n} \tag{2}$$
is a $\beta$-fraction known as the mantissa or significand, $s = 1$ or $0$ denotes the sign, and $e$ is an
integer referred to as the exponent. The number $\beta$ is also termed the radix, and the point preceding
$d_1$ in (1) is called the radix point.
• When β = 2, the floating-point representation in (1) is known as the binary floating-point
representation. When β = 10, it is referred to as the decimal floating-point representation.
• Note that the floating-point representation (1) contains only a finite number of digits,
whereas a real number can have an infinite sequence of digits, such as $1/3 = 0.3333\cdots$.
Consequently, the representation (1) is merely an approximation of a real number.
Definition 2. A floating-point number is considered normalized if either $d_1 \neq 0$ or $d_1 = d_2 = d_3 = \cdots = d_n = 0$.
Next, we will discuss the following two examples based on decimal floating-point representation.
• The real number $x = 5.167$ can be expressed as
$$5.167 = (-1)^0 \times (.5167) \times 10^1,$$
in which case we have
$$s = 0, \quad \beta = 10, \quad e = 1, \quad d_1 = 5, \ d_2 = 1, \ d_3 = 6 \text{ and } d_4 = 7.$$
It is important to note that this representation constitutes the normalized floating-point
format.
• The real number $x = -0.00023$ can be expressed as
$$-0.00023 = (-1)^1 \times (.000023) \times 10^1,$$
which is not in normalized form. The normalized representation is as follows:
$$-0.00023 = (-1)^1 \times (.23) \times 10^{-3},$$
in which case we have
$$s = 1, \quad \beta = 10, \quad e = -3, \quad d_1 = 2, \text{ and } d_2 = 3.$$
Definition 3. (Overflow and underflow) The exponent e is constrained to a specific range
$$m < e < M. \tag{3}$$
During the calculation, if a computed number has an exponent e > M , we say there is a memory
overflow. Conversely, if e < m, we say there is a memory underflow.
• In the event of overflow, the computer typically either signals an error or produces the
special value $\infty$; an invalid operation such as $0/0$ instead yields the symbol NaN, indicating
that the value obtained from the calculation is 'not a number'. Underflow, on the other hand,
is less critical because the computer will usually treat the number as zero.
• The floating-point representation (1) of a number imposes two limitations: the number of
digits n in the mantissa and the range of e. The number n is referred to as the precision or
length of the floating-point representation.
• The IEEE (Institute of Electrical and Electronics Engineers) standard for floating-point
arithmetic, known as IEEE 754, stands as the predominant framework for floating-point
computation. It is adopted by numerous hardware components such as CPUs and FPUs,
including Intel processors, and is also supported by a wide array of software implementations.
Many programming languages either allow or mandate that arithmetic operations utilize
IEEE 754 formats and operations. The IEEE 754 floating-point representation for a binary
number x is defined as follows:
$$\mathrm{float}(x) = (-1)^s \times (1.a_1 a_2 \cdots a_n)_2 \times 2^e, \tag{4}$$
where $a_1, \cdots, a_n$ are binary digits (either 1 or 0). The IEEE 754 standard exclusively employs
binary operations.
• The IEEE single precision floating-point format utilizes 4 bytes (32 bits) for storing a number.
Within these 32 bits, 24 bits are allocated to the mantissa (where each binary digit
requires 1 bit of storage), 1 bit is designated for the sign ($s$), and the remaining 8 bits are
reserved for the exponent. The standard format is as follows:
$$|\ \text{(sign)}\ b_1\ |\ \text{(exponent)}\ b_2 \cdots b_9\ |\ \text{(mantissa)}\ b_{10} b_{11} \cdots b_{32}\ |$$
Note that only 23 bits are used for the mantissa. This omission occurs because the digit
1 before the binary point in (4) is not stored in memory and is inserted during calculations
instead. The exponent $e$ is constrained within the range $-126 \le e \le 127$. (A Python sketch
that unpacks these bit fields appears after this list.)
• The IEEE double precision floating-point representation of a number offers a precision
equivalent to 53 binary digits, and the exponent $e$ is constrained within the range
$-1022 \le e \le 1023$.
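To make the single precision layout concrete, here is a minimal Python sketch that unpacks the three bit fields of a 32-bit float using the standard struct module. The helper name float32_fields and the sample value are our own illustrative choices.

```python
import struct

def float32_fields(x):
    """Split an IEEE 754 single precision number into its
    (sign, biased exponent, stored mantissa) bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # the 32-bit pattern as an unsigned int
    sign = bits >> 31                 # b1
    exponent = (bits >> 23) & 0xFF    # b2 ... b9, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # b10 ... b32: the 23 stored bits; the leading 1 is implicit
    return sign, exponent, mantissa

s, e, m = float32_fields(-5.167)
print(s, e - 127, format(m, "023b"))  # sign, unbiased exponent e, stored mantissa bits
```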
1.1 Chopping and Rounding a Number
Any real number $x$ can be precisely represented as follows:
$$x = (-1)^s \times (.d_1 d_2 \cdots d_n d_{n+1} \cdots)_\beta \times \beta^e,$$
where $d_1 \neq 0$ or $d_1 = d_2 = d_3 = \cdots = 0$, $s = 0$ or $1$, and $e$ satisfies (3). The floating-point
form (1) provides an approximate representation of $x$, which we denote by $\mathrm{float}(x)$. There
are two methods to derive $\mathrm{float}(x)$ from $x$, as defined below.
• The chopped machine approximation of $x$ is defined as follows:
$$\mathrm{float}(x) = (-1)^s \times (.d_1 d_2 \cdots d_n)_\beta \times \beta^e \tag{5}$$
• The rounded machine approximation of $x$ is defined as follows:
– If $0 \le d_{n+1} < \beta/2$, then
$$\mathrm{float}(x) = (-1)^s \times (.d_1 d_2 \cdots d_n)_\beta \times \beta^e$$
– If $\beta/2 < d_{n+1} < \beta$, then
$$\mathrm{float}(x) = (-1)^s \times (.d_1 d_2 \cdots (d_n + 1))_\beta \times \beta^e$$
– If $d_{n+1} = \beta/2$ with one of the $d_{n+1+i} \neq 0$, where $i = 1, 2, \cdots$, then
$$\mathrm{float}(x) = (-1)^s \times (.d_1 d_2 \cdots (d_n + 1))_\beta \times \beta^e$$
– If $d_{n+1} = \beta/2$ with all $d_{n+1+i} = 0$, where $i = 1, 2, \cdots$, then
$$\mathrm{float}(x) = \begin{cases} (-1)^s \times (.d_1 d_2 \cdots (d_n + 1))_\beta \times \beta^e & \text{if } d_n \text{ is an odd number}, \\ (-1)^s \times (.d_1 d_2 \cdots d_n)_\beta \times \beta^e & \text{if } d_n \text{ is an even number}. \end{cases}$$
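Both approximations can be imitated for $\beta = 10$ with Python's standard decimal module: ROUND_DOWN corresponds to chopping, while ROUND_HALF_EVEN matches the tie-breaking rule of the last case above. A minimal sketch; the helper float_n and the sample value $2/3$ are our own.

```python
import math
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

def float_n(x, n, mode):
    """n-digit decimal (beta = 10) machine approximation of x."""
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1          # exponent putting the mantissa in [0.1, 1)
    mant = Decimal(repr(x)) / (Decimal(10) ** e)    # normalized mantissa .d1 d2 ...
    mant = mant.quantize(Decimal(1).scaleb(-n), rounding=mode)  # keep n digits
    return float(mant * Decimal(10) ** e)

print(float_n(2 / 3, 4, ROUND_DOWN))       # chopped: 0.6666
print(float_n(2 / 3, 4, ROUND_HALF_EVEN))  # rounded: 0.6667
```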
1.2 Errors
The approximate representation of a real number inevitably differs from the actual value, and this
difference is referred to as an error. The error in a computed quantity is defined as follows:
Error = True value − Approximate value
• The absolute error is the absolute value of the aforementioned error.
• The relative error quantifies the error in proportion to the magnitude of the true value, as
expressed by
$$\text{Relative Error} = \frac{\text{Error}}{\text{True value}}$$
• Percentage error is the relative error multiplied by 100.
• Truncation error occurs when a Taylor series is approximated by a finite number of terms.
Example 1. Find the truncation error of a second-degree polynomial approximation to
$$f(x) = \frac{1}{\sqrt{1+x}}, \qquad x \in [0, 1],$$
using the Taylor series expansion about $x = 0$.
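A sketch of the solution: with $f(x) = (1+x)^{-1/2}$, the first three derivatives are
$$f'(x) = -\tfrac{1}{2}(1+x)^{-3/2}, \quad f''(x) = \tfrac{3}{4}(1+x)^{-5/2}, \quad f'''(x) = -\tfrac{15}{8}(1+x)^{-7/2},$$
so the second-degree Taylor polynomial about $x = 0$ is
$$P_2(x) = 1 - \frac{x}{2} + \frac{3x^2}{8},$$
and the truncation error is the Lagrange remainder
$$R_2(x) = \frac{f'''(\xi)}{3!}\,x^3 = -\frac{5}{16}(1+\xi)^{-7/2}\,x^3, \qquad \xi \in (0, x),$$
whose magnitude is at most $\frac{5}{16}$ on $[0, 1]$.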
1.3 Significant Digits
Significant digits are often used instead of relative error.
Definition 4. We say $x_A$ approximates $x$ to $r$ significant $\beta$-digits if
$$|x - x_A| \le \frac{1}{2}\,\beta^{s-r+1}$$
with $s$ the largest integer such that $\beta^s \le |x|$.
• For $x = 1/3$, the approximate number $x_A = 0.333$ has three significant digits, given that
$$|x - x_A| \approx 0.00033 < 0.0005 = 0.5 \times 10^{-3}$$
with $10^{-1} < 0.333\cdots = x$. Thus, we have $s = -1$ and $r = 3$.
• For $x = 0.02138$, the approximate number $x_A = 0.02144$ has two significant digits, since
$$|x - x_A| = 0.00006 < 0.0005 = 0.5 \times 10^{-3}$$
with $10^{-2} < 0.02138 = x$. Thus, we have $s = -2$ and $r = 2$.
• Simply put, the number of leading non-zero digits in xA that match the corresponding digits
in the true value x is referred to as the number of significant digits in xA .
• Consider two real numbers
$$x = 7.6545428, \qquad y = 7.6544201.$$
The numbers
$$x_A = 7.6545421, \qquad y_A = 7.6544200$$
are approximations to $x$ and $y$, accurate to six and seven significant digits, respectively. In
eight-digit floating-point arithmetic,
$$z_A = x_A - y_A = 0.12210000 \times 10^{-3}$$
is the exact difference between $x_A$ and $y_A$, and
$$z = x - y = 0.12270000 \times 10^{-3}$$
is the exact difference between $x$ and $y$. Therefore,
$$z - z_A = 0.6 \times 10^{-6} < 0.5 \times 10^{-5},$$
and hence $z_A$ has only two significant digits with respect to $z$, since $10^{-4} < z = 0.0001227$.
Thus, we began with two approximate numbers $x_A$ and $y_A$ which are accurate to six and
seven significant digits with respect to $x$ and $y$, respectively. However, their difference $z_A$
has only two significant digits relative to $z$. This indicates a loss of significant digits during
the subtraction process.
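The loss of significance above can be reproduced in Python. The helper below counts significant digits per Definition 4; it is our own sketch (it assumes $x \neq x_A$ and caps $r$ at 20 to guarantee termination).

```python
import math

def significant_digits(x, xa, beta=10):
    """Largest r with |x - xa| <= 0.5 * beta**(s - r + 1),
    where s is the largest integer with beta**s <= |x| (Definition 4)."""
    s = math.floor(math.log(abs(x), beta))
    r = 0
    while r < 20 and abs(x - xa) <= 0.5 * beta ** (s - r):  # does r + 1 digits still hold?
        r += 1
    return r

x, y = 7.6545428, 7.6544201
xa, ya = 7.6545421, 7.6544200

print(significant_digits(x, xa))           # 6
print(significant_digits(y, ya))           # 7
print(significant_digits(x - y, xa - ya))  # 2: digits lost in the subtraction
```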
1.4 Propagation of Error
Once an error is introduced, it influences subsequent results as it propagates through subsequent
calculations. Initially, we will examine how results are affected when using approximate numbers
instead of exact ones, and then we will address function evaluation.
Let xA and yA represent the approximate numbers used in the calculation, while x and y
denote their corresponding true values. Next, we will explore how errors propagate through the
four fundamental arithmetic operations.
1.4.1 Propagated error in addition and subtraction
Let $x = x_A + \epsilon$ and $y = y_A + \eta$ be positive numbers. The relative error $E_r(x_A \pm y_A)$ is given by
$$E_r(x_A \pm y_A) = \frac{(x \pm y) - (x_A \pm y_A)}{x \pm y} = \frac{\epsilon \pm \eta}{x \pm y}$$
This demonstrates that relative errors propagate gradually with addition, but can amplify significantly
with subtraction, especially when $x \approx y$, as illustrated in the previous examples.
1.4.2 Propagated error in multiplication
The relative error $E_r(x_A \times y_A)$ is given by
$$E_r(x_A \times y_A) = \frac{(x \times y) - (x_A \times y_A)}{x \times y} = E_r(x_A) + E_r(y_A) - E_r(x_A)\,E_r(y_A).$$
This indicates that relative errors propagate gradually with multiplication.
1.4.3 Propagated error in division
The relative error $E_r(x_A / y_A)$ is given by
$$E_r(x_A / y_A) = \frac{(x/y) - (x_A/y_A)}{x/y} = \frac{1}{1 - E_r(y_A)}\,\bigl(E_r(x_A) - E_r(y_A)\bigr)$$
This indicates that relative errors propagate gradually with division, unless $E_r(y_A) \approx 1$. However,
this scenario is highly improbable, as we typically expect the error to be very small, close to zero,
in which case the right-hand side is approximately equal to $E_r(x_A) - E_r(y_A)$.
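Both identities can be checked numerically. A short sketch; the values chosen for $x$, $x_A$, $y$ and $y_A$ are our own illustrations:

```python
def rel_err(true, approx):
    """Relative error (true - approx) / true."""
    return (true - approx) / true

x, xa = 5.167, 5.166   # illustrative true/approximate pairs
y, ya = 3.712, 3.713

ex, ey = rel_err(x, xa), rel_err(y, ya)

# multiplication: E_r(xa * ya) = E_r(xa) + E_r(ya) - E_r(xa) E_r(ya)
print(rel_err(x * y, xa * ya), ex + ey - ex * ey)

# division: E_r(xa / ya) = (E_r(xa) - E_r(ya)) / (1 - E_r(ya))
print(rel_err(x / y, xa / ya), (ex - ey) / (1 - ey))
```

In both lines the two printed numbers agree to machine precision, since the identities are exact.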
1.4.4 Total calculation error
When performing floating-point arithmetic on a computer, computing $x_A\,\omega\,y_A$ (where $\omega$ represents
one of the basic arithmetic operations $+$, $-$, $\times$, and $/$) introduces an additional rounding or
truncation error. The computed value of $x_A\,\omega\,y_A$ will encompass the propagated error along with
a rounding or truncation error. To be precise, let $\hat{\omega}$ denote the complete operation as executed on
the computer, encompassing any rounding or truncation. Therefore, the total error is expressed
as:
$$x\,\omega\,y - x_A\,\hat{\omega}\,y_A = (x\,\omega\,y - x_A\,\omega\,y_A) + (x_A\,\omega\,y_A - x_A\,\hat{\omega}\,y_A)$$
The first term on the right represents the propagated error, while the second term denotes the
error arising from rounding or truncating the computed result $x_A\,\omega\,y_A$.
1.5 Propagated error in function evaluation
Consider evaluating $f$ at the approximate value $x_A$ instead of at $x$. How accurately does
$f(x_A)$ approximate $f(x)$? By applying the mean value theorem, we find that
$$f(x) - f(x_A) = f'(\xi)(x - x_A),$$
where $\xi$ is an unknown point between $x$ and $x_A$. The relative error of $f(x_A)$ with respect to $f(x)$
is expressed as follows:
$$E_r(f(x_A)) = \frac{f'(\xi)}{f(x)}\, x\, E_r(x_A).$$
Given that $x_A$ and $x$ are assumed to be very close to each other and $\xi$ lies between $x$ and $x_A$, we
can approximate as follows:
$$f(x) - f(x_A) \approx f'(x_A)(x - x_A) \approx f'(x)(x - x_A).$$
Definition 5. The condition number of a function $f$ at the point $x = c$ is defined as
$$\left|\, c\,\frac{f'(c)}{f(c)} \,\right|.$$
Definition 6. Consider a function f (x) that requires n steps for evaluation. The overall evaluation
process is deemed unstable if at least one step is ill-conditioned. Conversely, if all steps are well-
conditioned, the process is considered stable.
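As a small numerical illustration of Definition 5 (the two functions below are our own examples):

```python
import math

def condition_number(f, df, c):
    """Condition number |c * f'(c) / f(c)| of f at x = c (Definition 5)."""
    return abs(c * df(c) / f(c))

# f(x) = sqrt(x) is well conditioned: the condition number is 1/2 for every c > 0
print(condition_number(math.sqrt, lambda c: 0.5 / math.sqrt(c), 100.0))  # 0.5

# f(x) = 1 - x is ill conditioned near x = 1: c / (1 - c) blows up
print(condition_number(lambda c: 1.0 - c, lambda c: -1.0, 0.999))        # ~999
```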
1.6 Error Propagation in Multivariable Functions
Problem Statement
Let $f = f(x_1, x_2, \ldots, x_n)$ be a function of several variables. Each variable $x_i$ has an associated
small uncertainty or error $\Delta x_i$. We aim to estimate the resulting uncertainty $\Delta f$ in the function
$f$.
First-Order (Linear) Approximation
Using the first-order Taylor expansion, the propagated error in $f$ is approximated by:
$$\Delta f \approx \left|\frac{\partial f}{\partial x_1}\right| \Delta x_1 + \left|\frac{\partial f}{\partial x_2}\right| \Delta x_2 + \cdots + \left|\frac{\partial f}{\partial x_n}\right| \Delta x_n$$
This gives an upper bound on the total error, assuming worst-case additive behavior.
Example: First-Order Error Propagation
Let:
$$f(x, y) = x \cdot y$$
Given:
$$x = 5.0 \pm 0.1, \qquad y = 3.0 \pm 0.2$$
Step 1: Compute partial derivatives
$$\frac{\partial f}{\partial x} = y, \qquad \frac{\partial f}{\partial y} = x$$
Step 2: Evaluate at the given values
$$\frac{\partial f}{\partial x} = 3.0, \qquad \frac{\partial f}{\partial y} = 5.0$$
Step 3: Apply the first-order error propagation formula
$$\Delta f \approx \left|\frac{\partial f}{\partial x}\right| \Delta x + \left|\frac{\partial f}{\partial y}\right| \Delta y$$
$$\Delta f \approx (3.0)(0.1) + (5.0)(0.2) = 0.3 + 1.0 = 1.3$$
Final Answer
$$f = xy = (5.0)(3.0) = 15.0$$
$$f = 15.0 \pm 1.3$$
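The same computation in Python, written as a generic first-order propagation helper (the names propagate, grads, etc. are our own):

```python
def propagate(grads, values, deltas):
    """First-order worst-case error: sum over i of |df/dx_i| * delta_x_i."""
    return sum(abs(g(*values)) * d for g, d in zip(grads, deltas))

f = lambda x, y: x * y
grads = (lambda x, y: y,   # df/dx = y
         lambda x, y: x)   # df/dy = x

values, deltas = (5.0, 3.0), (0.1, 0.2)
print(f(*values), propagate(grads, values, deltas))  # 15.0 1.3
```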