
[COMPUTATIONAL FLUID DYNAMICS]

Unit - 1

NUMBER SYSTEM AND ERRORS


This book provides an introduction to numerical analysis and is intended to be
used by students of mathematics, engineering, and science. Numerical analysis
can be defined as the development and implementation of techniques to find
numerical solutions to mathematical problems. It involves engineering and
physics in converting a physical phenomenon into a mathematical model; it
involves mathematics in developing techniques for the approximate solution of
the mathematical equations describing the model; and finally it involves
computer science for the implementation of these techniques in an optimal
fashion for the particular computer available. With the accessibility of
computers, it is now possible to get rapid and accurate solutions to many
complex problems that create difficulties for the mathematician, engineer, and
scientist. In elementary calculus we learned how to differentiate and integrate
to get exact answers to a remarkably diverse range of realistic problems that
could not be solved by purely algebraic methods. Unfortunately, from a practical
point of view, the techniques of elementary (or even advanced) calculus alone
are not adequate for solving problems such as polynomial equations of degree
greater than four, simple transcendental equations, or integrals that have no
closed-form antiderivative. For such problems it is often impossible to get exact
solutions. Even when an analytical solution can be found, it may be more
theoretical than practical. Fortunately, one rarely needs exact answers. Indeed,
real-world problems themselves are usually inexact, because they are generally
posed in terms...

UNIT - I 1

REPRESENTATION OF INTEGERS
Integers are whole numbers, or fixed-point numbers with the radix
point fixed after the least-significant bit. They stand in contrast to real
numbers, or floating-point numbers, where the position of the radix point varies.
It is important to take note that integers and floating-point numbers are treated
differently in computers. They have different representation and are processed
differently (e.g., floating-point numbers are processed in a so-called floating-
point processor). Floating-point numbers will be discussed later.
Computers use a fixed number of bits to represent an integer. The commonly-
used bit-lengths for integers are 8-bit, 16-bit, 32-bit or 64-bit. Besides bit-
lengths, there are two representation schemes for integers:
1. Unsigned Integers: can represent zero and positive integers.
2. Signed Integers: can represent zero, positive and negative integers.
Three representation schemes have been proposed for signed
integers:
a. Sign-Magnitude representation
b. 1's Complement representation
c. 2's Complement representation
You, as the programmer, need to decide on the bit-length and representation
scheme for your integers, depending on your application's requirements.
Suppose that you need a counter for counting a small quantity from 0 up to 200:
you might choose the 8-bit unsigned integer scheme, as there are no negative
numbers involved.
3.1 n-bit Unsigned Integers
Unsigned integers can represent zero and positive integers, but not negative
integers. The value of an unsigned integer is interpreted as "the magnitude of
its underlying binary pattern".
Example 1: Suppose that n=8 and the binary pattern is 0100 0001B, the value
of this unsigned integer is 1×2^0 + 1×2^6 = 65D.
Example 2: Suppose that n=16 and the binary pattern is 0001 0000 0000
1000B, the value of this unsigned integer is 1×2^3 + 1×2^12 = 4104D.
Example 3: Suppose that n=16 and the binary pattern is 0000 0000 0000
0000B, the value of this unsigned integer is 0.
An n-bit pattern can represent 2^n distinct integers. An n-bit unsigned integer
can represent integers from 0 to (2^n)-1, as tabulated below:


n Minimum Maximum

8 0 (2^8)-1 (=255)

16 0 (2^16)-1 (=65,535)

32 0 (2^32)-1 (=4,294,967,295) (9+ digits)

64 0 (2^64)-1 (=18,446,744,073,709,551,615) (19+ digits)
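The table above can be checked with a short Java sketch (the helper method `unsignedValue` is my own illustration, not part of any library):

```java
public class UnsignedDemo {
    // Value of an n-bit unsigned pattern: the plain binary weight of its bits.
    public static long unsignedValue(long bits, int n) {
        long mask = (n == 64) ? -1L : (1L << n) - 1;
        return bits & mask;   // for n < 64 this is already the magnitude
    }

    public static void main(String[] args) {
        // Example 1: 0100 0001B = 2^0 + 2^6 = 65
        System.out.println(unsignedValue(0b0100_0001, 8));
        // Example 2: 0001 0000 0000 1000B = 2^3 + 2^12 = 4104
        System.out.println(unsignedValue(0b0001_0000_0000_1000, 16));
        // Java's signed int reinterpreted as 32-bit unsigned: all 1's = 2^32 - 1
        System.out.println(Integer.toUnsignedLong(-1));
    }
}
```

Note that Java has no unsigned integer types; `Integer.toUnsignedLong` reinterprets a signed 32-bit pattern under the unsigned scheme described here.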

3.2 Signed Integers

Signed integers can represent zero, positive integers, as well as negative
integers. Three representation schemes are available for signed integers:
1. Sign-Magnitude representation
2. 1's Complement representation
3. 2's Complement representation
In all the above three schemes, the most-significant bit (msb) is called the sign
bit. The sign bit is used to represent the sign of the integer - with 0 for positive
integers and 1 for negative integers. The magnitude of the integer, however, is
interpreted differently in different schemes.
3.3 n-bit Signed Integers in Sign-Magnitude Representation

In sign-magnitude representation:
 The most-significant bit (msb) is the sign bit, with a value of 0
representing a positive integer and 1 representing a negative integer.
 The remaining n-1 bits represent the magnitude (absolute value) of
the integer. The absolute value of the integer is interpreted as "the
magnitude of the (n-1)-bit binary pattern".
Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
Sign bit is 0 ⇒ positive
Absolute value is 100 0001B = 65D
Hence, the integer is +65D
Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
Sign bit is 1 ⇒ negative
Absolute value is 000 0001B = 1D
Hence, the integer is -1D
Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
Sign bit is 0 ⇒ positive

Absolute value is 000 0000B = 0D
Hence, the integer is +0D
Example 4: Suppose that n=8 and the binary representation is 1 000 0000B.
Sign bit is 1 ⇒ negative
Absolute value is 000 0000B = 0D
Hence, the integer is -0D

The drawbacks of sign-magnitude representation are:


1. There are two representations (0000 0000B and 1000 0000B) for the
number zero, which could lead to inefficiency and confusion.
2. Positive and negative integers need to be processed separately.
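A minimal Java sketch of sign-magnitude decoding, assuming n ≤ 31 (the `decode` helper is hypothetical, written just for this section):

```java
public class SignMagnitude {
    // Decode an n-bit sign-magnitude pattern: the msb is the sign,
    // the remaining n-1 bits are the magnitude.
    public static int decode(int bits, int n) {
        int sign = (bits >>> (n - 1)) & 1;
        int magnitude = bits & ((1 << (n - 1)) - 1);
        return (sign == 1) ? -magnitude : magnitude;
    }

    public static void main(String[] args) {
        System.out.println(decode(0b0100_0001, 8)); // Example 1: +65
        System.out.println(decode(0b1000_0001, 8)); // Example 2: -1
        // Drawback 1: both 0000 0000B and 1000 0000B decode to zero
        System.out.println(decode(0b0000_0000, 8));
        System.out.println(decode(0b1000_0000, 8));
    }
}
```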
3.4 n-bit Signed Integers in 1's Complement Representation

In 1's complement representation:
 Again, the most significant bit (msb) is the sign bit, with a value of 0
representing positive integers and 1 representing negative integers.
 The remaining n-1 bits represent the magnitude of the integer, as
follows:
o for positive integers, the absolute value of the integer is
equal to "the magnitude of the (n-1)-bit binary pattern".
o for negative integers, the absolute value of the integer is
equal to "the magnitude of the complement (inverse) of the
(n-1)-bit binary pattern" (hence called 1's complement).
Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
Sign bit is 0 ⇒ positive
Absolute value is 100 0001B = 65D
Hence, the integer is +65D
Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 000 0001B, i.e., 111 1110B = 126D
Hence, the integer is -126D
Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
Sign bit is 0 ⇒ positive
Absolute value is 000 0000B = 0D
Hence, the integer is +0D
Example 4: Suppose that n=8 and the binary representation is 1 111 1111B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 111 1111B, i.e., 000 0000B = 0D
Hence, the integer is -0D

Again, the drawbacks are:
1. There are two representations (0000 0000B and 1111 1111B) for
zero.
2. Positive and negative integers need to be processed separately.
3.5 n-bit Signed Integers in 2's Complement Representation

In 2's complement representation:
 Again, the most significant bit (msb) is the sign bit, with a value of 0
representing positive integers and 1 representing negative integers.
 The remaining n-1 bits represent the magnitude of the integer, as
follows:
o for positive integers, the absolute value of the integer is
equal to "the magnitude of the (n-1)-bit binary pattern".
o for negative integers, the absolute value of the integer is
equal to "the magnitude of the complement of the (n-1)-bit
binary pattern plus one" (hence called 2's complement).
Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
Sign bit is 0 ⇒ positive
Absolute value is 100 0001B = 65D
Hence, the integer is +65D
Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 000 0001B plus 1, i.e., 111 1110B + 1B = 127D
Hence, the integer is -127D
Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
Sign bit is 0 ⇒ positive
Absolute value is 000 0000B = 0D
Hence, the integer is +0D
Example 4: Suppose that n=8 and the binary representation is 1 111 1111B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 111 1111B plus 1, i.e., 000 0000B + 1B = 1D
Hence, the integer is -1D


3.6 Computers use 2's Complement Representation for Signed Integers

We have discussed three representations for signed integers: sign-magnitude,
1's complement and 2's complement. Computers use 2's complement in
representing signed integers. This is because:
1. There is only one representation for the number zero in 2's
complement, instead of two representations in sign-magnitude and
1's complement.


2. Positive and negative integers can be treated together in addition
and subtraction. Subtraction can be carried out using the "addition
logic".
Example 1: Addition of Two Positive Integers: Suppose that n=8, 65D +
5D = 70D
65D → 0100 0001B
5D → 0000 0101B (+)
0100 0110B → 70D (OK)
Example 2: Subtraction is treated as Addition of a Positive and a
Negative Integer: Suppose that n=8, 65D - 5D = 65D + (-5D) = 60D
65D → 0100 0001B
-5D → 1111 1011B (+)
0011 1100B → 60D (discard carry - OK)
Example 3: Addition of Two Negative Integers: Suppose that n=8, -65D -
5D = (-65D) + (-5D) = -70D
-65D → 1011 1111B
-5D → 1111 1011B (+)
1011 1010B → -70D (discard carry - OK)

Because of the fixed precision (i.e., fixed number of bits), an n-bit 2's
complement signed integer has a certain range. For example, for n=8, the range
of 2's complement signed integers is -128 to +127. During addition (and
subtraction), it is important to check whether the result exceeds this range, in
other words, whether overflow or underflow has occurred.
Example 4: Overflow: Suppose that n=8, 127D + 2D = 129D (overflow -
beyond the range)
127D → 0111 1111B
2D → 0000 0010B (+)
1000 0001B → -127D (wrong)

Example 5: Underflow: Suppose that n=8, -125D - 5D = -130D (underflow -
below the range)
-125D → 1000 0011B
-5D → 1111 1011B (+)
0111 1110B → +126D (wrong)
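Java's byte type happens to be an 8-bit 2's complement integer, so the five examples above can be reproduced directly; casting the wider int result back to byte discards the carry, just as the worked additions do:

```java
public class TwosComplementArithmetic {
    public static void main(String[] args) {
        // Examples 1-3: results stay inside [-128, +127], so the cast is harmless
        System.out.println((byte) (65 + 5));     // Example 1: 70
        System.out.println((byte) (65 - 5));     // Example 2: 60
        System.out.println((byte) (-65 - 5));    // Example 3: -70

        // Examples 4-5: the true result falls outside [-128, +127]
        System.out.println((byte) (127 + 2));    // Example 4: -127 (wrong)
        System.out.println((byte) (-125 - 5));   // Example 5: 126 (wrong)

        // One possible overflow check: do the math in a wider type and compare
        int wide = 127 + 2;
        System.out.println(wide != (byte) wide); // true => overflow occurred
    }
}
```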

The 2's complement scheme can be pictured by bending the number line into a
circle: the values from -128 to +127 follow one another contiguously, and the
carry out of the most-significant bit is simply ignored.

3.7 Range of n-bit 2's Complement Signed Integers


An n-bit 2's complement signed integer can represent integers from -2^(n-1)
to +2^(n-1)-1, as tabulated below. Take note that the scheme represents all the
integers within the range, without any gap - there are no missing integers
within the supported range.
n minimum maximum

8 -(2^7) (=-128) +(2^7)-1 (=+127)

16 -(2^15) (=-32,768) +(2^15)-1 (=+32,767)

32 -(2^31) (=-2,147,483,648) +(2^31)-1 (=+2,147,483,647)(9+ digits)

64 -(2^63) (=-9,223,372,036,854,775,808) +(2^63)-1 (=+9,223,372,036,854,775,807) (18+ digits)


3.8 Decoding 2's Complement Numbers

1. Check the sign bit (denoted as S).
2. If S=0, the number is positive, and its absolute value is the binary
value of the remaining n-1 bits.
3. If S=1, the number is negative. You could "invert the n-1 bits and plus
1" to get the absolute value of the negative number.
Alternatively, you could scan the remaining n-1 bits from the right
(least-significant bit), look for the first occurrence of 1, and flip all the
bits to the left of that first occurrence of 1. The flipped pattern gives
the absolute value. For example:
n = 8, bit pattern = 1 100 0100B
S = 1 → negative
Scanning from the right and flipping all the bits to the left of the first
occurrence of 1 ⇒ 011 1100B = 60D
Hence, the value is -60D
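Step 3 ("invert the n-1 bits and plus 1") can be sketched in Java as follows (the `decode` helper is my own, limited to n ≤ 31 for simplicity):

```java
public class TwosComplementDecode {
    // Decode an n-bit 2's complement pattern (n <= 31 here):
    // if the sign bit is 1, invert the remaining n-1 bits and add 1
    // to obtain the magnitude of the (negative) value.
    public static int decode(int bits, int n) {
        int sign = (bits >>> (n - 1)) & 1;
        if (sign == 0) return bits;              // positive: value is the pattern itself
        int mask = (1 << (n - 1)) - 1;           // selects the n-1 magnitude bits
        int magnitude = ((~bits) & mask) + 1;    // invert, then plus 1
        return -magnitude;
    }

    public static void main(String[] args) {
        System.out.println(decode(0b1100_0100, 8)); // worked example above: -60
        System.out.println(decode(0b1111_1111, 8)); // -1
        System.out.println(decode(0b1000_0000, 8)); // -128, the most negative value
        System.out.println(decode(0b0111_1111, 8)); // +127
    }
}
```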
3.9 Big Endian vs. Little Endian

Modern computers store one byte of data in each memory address or location,
i.e., byte-addressable memory. A 32-bit integer is, therefore, stored in 4
memory addresses.
The term "Endian" refers to the order of storing bytes in computer memory. In
the "Big Endian" scheme, the most significant byte is stored first, in the lowest
memory address (big end first), while "Little Endian" stores the least significant
byte in the lowest memory address.
For example, the 32-bit integer 12345678H (305419896D) is stored as 12H 34H
56H 78H in big endian, and 78H 56H 34H 12H in little endian. The 16-bit sequence
00H 01H is interpreted as 0001H in big endian, and 0100H in little endian.
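A small Java sketch using the standard ByteBuffer API shows both byte orders for 12345678H:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        int value = 0x12345678;

        byte[] big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN)
                               .putInt(value).array();
        byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN)
                                  .putInt(value).array();

        // Big endian: most significant byte at the lowest address => 12 34 56 78
        System.out.printf("%02X %02X %02X %02X%n", big[0], big[1], big[2], big[3]);
        // Little endian: least significant byte first => 78 56 34 12
        System.out.printf("%02X %02X %02X %02X%n",
                          little[0], little[1], little[2], little[3]);
    }
}
```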
3.10 Exercise (Integer Representation)
1. What are the ranges of 8-bit, 16-bit, 32-bit and 64-bit integer, in
"unsigned" and "signed" representation?
2. Give the value of 88, 0, 1, 127, and 255 in 8-bit unsigned
representation.
3. Give the value of +88, -88 , -1, 0, +1, -128, and +127 in 8-bit 2's
complement signed representation.
4. Give the value of +88, -88 , -1, 0, +1, -127, and +127 in 8-bit sign-
magnitude representation.


5. Give the value of +88, -88 , -1, 0, +1, -127 and +127 in 8-bit 1's
complement representation.
6. [TODO] more.
Answers
1. The range of unsigned n-bit integers is [0, 2^n - 1]. The range of n-
bit 2's complement signed integer is [-2^(n-1), +2^(n-1)-1];
2. 88 (0101 1000), 0 (0000 0000), 1 (0000 0001), 127 (0111 1111), 255
(1111 1111).
3. +88 (0101 1000), -88 (1010 1000), -1 (1111 1111), 0 (0000 0000), +1
(0000 0001), -128 (1000 0000), +127 (0111 1111).
4. +88 (0101 1000), -88 (1101 1000), -1 (1000 0001), 0 (0000 0000 or
1000 0000), +1 (0000 0001), -127 (1111 1111), +127 (0111 1111).
5. +88 (0101 1000), -88 (1010 0111), -1 (1111 1110), 0 (0000 0000 or
1111 1111), +1 (0000 0001), -127 (1000 0000), +127 (0111 1111).
FLOATING POINT ARITHMETIC
A floating-point number (or real number) can represent a very large value
(e.g., 1.23×10^88) or a very small value (e.g., 1.23×10^-88). It can also
represent a very large negative number (e.g., -1.23×10^88) and a very small
negative number (e.g., -1.23×10^-88), as well as zero.

A floating-point number is typically expressed in scientific notation, with
a fraction (F) and an exponent (E) of a certain radix (r), in the form F×r^E.
Decimal numbers use a radix of 10 (F×10^E), while binary numbers use a radix
of 2 (F×2^E).
The representation of a floating-point number is not unique. For example, the
number 55.66 can be represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3,
and so on. The fractional part can be normalized: in the normalized form, there
is only a single non-zero digit before the radix point. For example, the decimal
number 123.4567 is normalized as 1.234567×10^2, and the binary
number 1010.1011B as 1.0101011B×2^3.
It is important to note that floating-point numbers suffer from loss of
precision when represented with a fixed number of bits (e.g., 32-bit or 64-bit).
This is because there are infinitely many real numbers (even within a small
range, say 0.0 to 0.1), whereas an n-bit binary pattern can represent only a
finite 2^n distinct numbers. Hence, not all real numbers can be represented.
The nearest approximation is used instead, resulting in loss of accuracy.
It is also important to note that floating-point arithmetic is much less
efficient than integer arithmetic. It can be sped up with a dedicated
floating-point co-processor. Hence, use integers if your application does not
require floating-point numbers.
In computers, floating-point numbers are represented in scientific notation
of fraction (F) and exponent (E) with a radix of 2, in the form of F×2^E.
Both E and F can be positive as well as negative. Modern computers adopt IEEE
754 standard for representing floating-point numbers. There are two
representation schemes: 32-bit single-precision and 64-bit double-precision.
4.1 IEEE-754 32-bit Single-Precision Floating-Point Numbers

In 32-bit single-precision floating-point representation:
 The most significant bit is the sign bit (S), with 0 for positive numbers
and 1 for negative numbers.
 The following 8 bits represent the exponent (E).
 The remaining 23 bits represent the fraction (F).

Normalized Form
Let's illustrate with an example. Suppose that the 32-bit pattern is 1 1000
0001 011 0000 0000 0000 0000 0000, with:
 S=1
 E = 1000 0001
 F = 011 0000 0000 0000 0000 0000
In the normalized form, the actual fraction is normalized with an implicit leading
1 in the form of 1.F. In this example, the actual fraction is 1.011 0000 0000 0000
0000 0000 = 1 + 1×2^-2 + 1×2^-3 = 1.375D.
The sign bit represents the sign of the number, with S=0 for positive and S=1 for
negative number. In this example with S=1, this is a negative number, i.e., -
1.375D.
In normalized form, the actual exponent is E-127 (so-called excess-127 or bias-
127). This is because we need to represent both positive and negative
exponents. With an 8-bit E ranging from 0 to 255, the excess-127 scheme could
provide actual exponents from -127 to 128. In this example,
E-127 = 129-127 = 2D.
Hence, the number represented is -1.375×2^2=-5.5D.
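The decoding above can be double-checked with the JDK's Float.intBitsToFloat; the bit pattern 1 1000 0001 011 0000 0000 0000 0000 0000 is 0xC0B00000:

```java
public class SinglePrecisionDemo {
    public static void main(String[] args) {
        // 1 10000001 01100000000000000000000B == 0xC0B00000
        // S=1, E=129 => actual exponent 129-127=2, fraction 1.011B = 1.375
        float f = Float.intBitsToFloat(0xC0B00000);
        System.out.println(f); // -5.5

        // Re-assemble the same pattern from its three fields:
        int s = 1, e = 129, frac = 0b011_0000_0000_0000_0000_0000;
        int bits = (s << 31) | (e << 23) | frac;
        System.out.println(bits == 0xC0B00000); // true
    }
}
```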
De-Normalized Form
The normalized form has a serious problem: with an implicit leading 1 for the
fraction, it cannot represent the number zero! Convince yourself of this!
The de-normalized form was devised to represent zero and other small numbers.
For E=0, the numbers are in the de-normalized form. An implicit leading 0
(instead of 1) is used for the fraction, and the actual exponent is always -126.
Hence, the number zero can be represented with E=0 and F=0 (because
0.0×2^-126=0).
We can also represent very small positive and negative numbers in de-
normalized form with E=0. For example, if S=1, E=0, and F=011 0000 0000 0000
0000 0000, the actual fraction is 0.011B = 1×2^-2 + 1×2^-3 = 0.375D. Since S=1,
it is a negative number. With E=0, the actual exponent is -126. Hence the number
is -0.375×2^-126 ≈ -4.4×10^-39, an extremely small negative number
(close to zero).
Summary
In summary, the value (N) is calculated as follows:
 For 1 ≤ E ≤ 254, N = (-1)^S × 1.F × 2^(E-127). These numbers are in the
so-called normalized form. The sign bit represents the sign of the
number. The fractional part is normalized with an implicit leading
1 (1.F). The exponent is biased (or in excess) by 127, so as to represent
both positive and negative exponents. The range of actual exponents
is -126 to +127.
 For E = 0, N = (-1)^S × 0.F × 2^(-126). These numbers are in the so-
called denormalized form. The exponent of 2^-126 evaluates to a very
small number. The denormalized form is needed to represent zero
(with F=0 and E=0). It can also represent very small positive and
negative numbers close to zero.
 For E = 255, it represents special values, such as ±INF (positive and
negative infinity) and NaN (not a number). This is discussed further
below.
Example 1: Suppose that IEEE-754 32-bit floating-point representation pattern
is 0 10000000 110 0000 0000 0000 0000 0000.
Sign bit S = 0 ⇒ positive number
E = 1000 0000B = 128D (in normalized form)
Fraction is 1.11B (with an implicit leading 1) = 1 + 1×2^-1 + 1×2^-2 = 1.75D
The number is +1.75 × 2^(128-127) = +3.5D

Example 2: Suppose that the IEEE-754 32-bit floating-point representation
pattern is 1 01111110 100 0000 0000 0000 0000 0000.
Sign bit S = 1 ⇒ negative number
E = 0111 1110B = 126D (in normalized form)
Fraction is 1.1B (with an implicit leading 1) = 1 + 2^-1 = 1.5D
The number is -1.5 × 2^(126-127) = -0.75D

Example 3: Suppose that the IEEE-754 32-bit floating-point representation
pattern is 1 01111110 000 0000 0000 0000 0000 0001.
Sign bit S = 1 ⇒ negative number
E = 0111 1110B = 126D (in normalized form)
Fraction is 1.000 0000 0000 0000 0000 0001B (with an implicit leading 1) = 1 +
2^-23
The number is -(1 + 2^-23) × 2^(126-127) = -0.500000059604644775390625
(may not be exact in decimal!)

Example 4 (De-Normalized Form): Suppose that the IEEE-754 32-bit floating-
point representation pattern is 1 00000000 000 0000 0000 0000 0000 0001.
Sign bit S = 1 ⇒ negative number
E = 0 (in de-normalized form)
Fraction is 0.000 0000 0000 0000 0000 0001B (with an implicit leading 0) =
1×2^-23
The number is -2^-23 × 2^(-126) = -2^(-149) ≈ -1.4×10^-45
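Example 4's pattern can likewise be verified with Float.intBitsToFloat (0x80000001 is the pattern above, and the JDK constant Float.MIN_VALUE is exactly 2^-149):

```java
public class DenormalDemo {
    public static void main(String[] args) {
        // Example 4: 1 00000000 000...0001B == 0x80000001
        System.out.println(Float.intBitsToFloat(0x80000001)); // -1.4E-45

        // Its positive counterpart is the smallest positive float, 2^-149:
        System.out.println(Float.intBitsToFloat(0x00000001) == Float.MIN_VALUE); // true

        // E=0 and F=0 gives zero, as the de-normalized form promises:
        System.out.println(Float.intBitsToFloat(0x00000000)); // 0.0
    }
}
```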

4.2 Exercises (Floating-point Numbers)


1. Compute the largest and smallest positive numbers that can be
represented in the 32-bit normalized form.
2. Compute the largest and smallest negative numbers can be
represented in the 32-bit normalized form.
3. Repeat (1) for the 32-bit denormalized form.
4. Repeat (2) for the 32-bit denormalized form.


Hints:
1. Largest positive number: S=0, E=1111 1110 (254), F=111 1111 1111
1111 1111 1111.
Smallest positive number: S=0, E=0000 0001 (1), F=000 0000 0000
0000 0000 0000.
2. Same as above, but S=1.
3. Largest positive number: S=0, E=0, F=111 1111 1111 1111 1111
1111.
Smallest positive number: S=0, E=0, F=000 0000 0000 0000 0000
0001.
4. Same as above, but S=1.
Notes For Java Users
You can use the JDK methods Float.intBitsToFloat(int
bits) or Double.longBitsToDouble(long bits) to create a single-precision 32-
bit float or a double-precision 64-bit double with a specific bit pattern, and
print their values. For example:
System.out.println(Float.intBitsToFloat(0x7fffff));
System.out.println(Double.longBitsToDouble(0x1fffffffffffffL));

4.3 IEEE-754 64-bit Double-Precision Floating-Point Numbers

The representation scheme for 64-bit double-precision is similar to the 32-bit
single-precision:
 The most significant bit is the sign bit (S), with 0 for positive numbers
and 1 for negative numbers.
 The following 11 bits represent the exponent (E).
 The remaining 52 bits represent the fraction (F).

The value (N) is calculated as follows:
 Normalized form: For 1 ≤ E ≤ 2046, N = (-1)^S × 1.F × 2^(E-1023).
 Denormalized form: For E = 0, N = (-1)^S × 0.F × 2^(-1022).
 For E = 2047, N represents special values, such as ±INF (infinity)
and NaN (not a number).


4.4 More on Floating-Point Representation

There are three parts in the floating-point representation:
 The sign bit (S) is self-explanatory (0 for positive numbers and 1 for
negative numbers).
 For the exponent (E), a so-called bias (or excess) is applied so as to
represent both positive and negative exponents. The bias is set at half
of the range. For single precision with an 8-bit exponent, the bias is
127 (or excess-127). For double precision with an 11-bit exponent, the
bias is 1023 (or excess-1023).
 The fraction (F) (also called the mantissa or significand) is composed
of an implicit leading bit (before the radix point) and the fractional bits
(after the radix point). The leading bit for normalized numbers is 1,
while the leading bit for denormalized numbers is 0.
Normalized Floating-Point Numbers
In normalized form, the radix point is placed after the first non-zero digit,
e.g., 9.8765D×10^-23, 1.001011B×2^11. For a binary number, the leading bit is
always 1, and need not be represented explicitly - this saves 1 bit of storage.
In IEEE 754's normalized form:
 For single-precision, 1 ≤ E ≤ 254 with an excess of 127. Hence, the
actual exponent ranges from -126 to +127. Negative exponents are
used to represent small numbers (< 1.0), while positive exponents are
used to represent large numbers (> 1.0).
N = (-1)^S × 1.F × 2^(E-127)
 For double-precision, 1 ≤ E ≤ 2046 with an excess of 1023. The actual
exponent ranges from -1022 to +1023, and
N = (-1)^S × 1.F × 2^(E-1023)
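The excess-1023 arithmetic can be confirmed with Double.longBitsToDouble on hand-built bit patterns:

```java
public class DoublePrecisionDemo {
    public static void main(String[] args) {
        // 0 01111111111 0...0B: S=0, E=1023 => actual exponent 0, fraction 1.0
        System.out.println(Double.longBitsToDouble(0x3FF0000000000000L)); // 1.0

        // 1 10000000000 0...0B: S=1, E=1024 => actual exponent 1 => -1.0 * 2^1
        System.out.println(Double.longBitsToDouble(0xC000000000000000L)); // -2.0

        // Extract the biased exponent field of an arbitrary double:
        long bits = Double.doubleToLongBits(1.0);
        long e = (bits >>> 52) & 0x7FF;   // 11-bit exponent field
        System.out.println(e - 1023);     // actual exponent of 1.0 is 0
    }
}
```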
Take note that an n-bit pattern has a finite number of combinations (2^n), and
so can represent only finitely many distinct numbers. It is not possible to
represent the infinitely many numbers on the real axis (even a small range, say
0.0 to 1.0, contains infinitely many numbers). That is, not all floating-point
values can be accurately represented. Instead, the closest approximation is
used, which leads to loss of accuracy.
The minimum and maximum normalized floating-point numbers are:

Precision  Normalized N(min)                      Normalized N(max)
Single     0080 0000H                             7F7F FFFFH
           0 00000001 00000000000000000000000B    0 11111110 11111111111111111111111B
           E = 1, F = 0                           E = 254, F = all 1's
           N(min) = 1.0B × 2^-126                 N(max) = 1.1...1B × 2^127 = (2 - 2^-23) × 2^127
           (≈1.17549435 × 10^-38)                 (≈3.4028235 × 10^38)
Double     0010 0000 0000 0000H                   7FEF FFFF FFFF FFFFH
           N(min) = 1.0B × 2^-1022                N(max) = 1.1...1B × 2^1023 = (2 - 2^-52) × 2^1023
           (≈2.2250738585072014 × 10^-308)        (≈1.7976931348623157 × 10^308)

Denormalized Floating-Point Numbers

If E = 0 but the fraction is non-zero, then the value is in denormalized form,
and a leading bit of 0 is assumed, as follows:
 For single-precision, E = 0,
N = (-1)^S × 0.F × 2^(-126)
 For double-precision, E = 0,
N = (-1)^S × 0.F × 2^(-1022)
The denormalized form can represent very small numbers close to zero, and
zero itself, which cannot be represented in normalized form.
The minimum positive denormalized floating-point numbers are:

Precision  Denormalized D(min)
Single     0000 0001H
           0 00000000 00000000000000000000001B
           E = 0, F = 00000000000000000000001B
           D(min) = 0.0...1B × 2^-126 = 1 × 2^-23 × 2^-126 = 2^-149
           (≈1.4 × 10^-45)
Double     0000 0000 0000 0001H
           D(min) = 0.0...1B × 2^-1022 = 1 × 2^-52 × 2^-1022 = 2^-1074
           (≈4.9 × 10^-324)

Special Values
Zero: Zero cannot be represented in the normalized form, and must be
represented in denormalized form with E=0 and F=0. There are two
representations for zero: +0 with S=0 and -0 with S=1.
Infinity: The value of +infinity (e.g., 1/0) and -infinity (e.g., -1/0) are represented
with an exponent of all 1's (E = 255 for single-precision and E = 2047 for double-
precision), F=0, and S=0 (for +INF) and S=1 (for -INF).
Not a Number (NaN): NaN denotes a value that cannot be represented as a real
number (e.g., 0/0). NaN is represented with an exponent of all 1's (E = 255 for
single-precision and E = 2047 for double-precision) and any non-zero fraction.
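These special bit patterns can be produced and inspected with the JDK bit-conversion methods:

```java
public class SpecialValuesDemo {
    public static void main(String[] args) {
        // E = 255 (all 1's), F = 0: the two infinities
        System.out.println(Float.intBitsToFloat(0x7F800000)); // Infinity
        System.out.println(Float.intBitsToFloat(0xFF800000)); // -Infinity

        // E = 255, F != 0: NaN
        System.out.println(Float.isNaN(Float.intBitsToFloat(0x7FC00000))); // true

        // The two signed zeros compare equal, but their bit patterns differ
        System.out.println(0.0f == -0.0f);                             // true
        System.out.println(Float.floatToIntBits(-0.0f) == 0x80000000); // true
    }
}
```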
LOSS OF SIGNIFICANCE
The term loss of significance refers to an undesirable effect in calculations
using floating-point arithmetic.
Computers and calculators employ floating-point arithmetic to express and
work with fractional numbers. Numbers in this arithmetic consist of a fixed
number of digits, with a decimal point that “floats” among them. In calculations,
only that fixed number of digits is maintained, so that the results of calculations
are only approximate — which is often quite adequate.
Loss of significance occurs in the subtraction of two nearly equal numbers,
which produces a result much smaller than either of the original numbers. The
effect can be an unacceptable reduction in the number of accurate (significant)
digits in the result.
An algorithm for calculating solutions of a mathematical problem is
called stable if a small change to its input does not result in a large change to its
result.
A floating-point representation of a fractional number can be thought of as
a small change in the number. So, loss of significance will always result if an
unstable algorithm is used with floating-point arithmetic.
One of the main goals in the mathematical field of numerical analysis is to
find stable algorithms for mathematical problems.
Here are a couple of examples.

Floating-point subtraction

Consider the fractional number
.1234567891234567890 .
A floating-point representation of this number on a machine that keeps 10
floating-point digits would be
.1234567891 ,
which seems pretty close - the difference is very small in comparison with
either of the two numbers.
Now perform the calculation
.1234567891234567890 − .1234567890 .
The real answer, accurate to 10 digits, is
.0000000001234567890
But on the 10-digit floating-point machine, the calculation yields
.1234567891 − .1234567890 = .0000000001 .
Whereas the original numbers are accurate in all of the first (most significant)
10 digits, their floating-point difference is only accurate in its first digit. This
amounts to loss of information.
Another way of looking at it is that, when the input to the calculation is
changed by a small amount, the relative change in the result is drastic. Try
changing the input to the calculation very slightly, say, from .1234567891 to
.1234567892. Once again, in 10-digit floating-point arithmetic,
.1234567892 − .1234567890 = .0000000002 ,
which is twice the size of the previous result of .0000000001. An algorithm that
shows this kind of behavior is called unstable.
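The 10-digit machine can be imitated in Java with BigDecimal and a MathContext of 10 significant digits; this is only a simulation of the decimal example, not how binary hardware floats work:

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class CancellationDemo {
    public static void main(String[] args) {
        MathContext tenDigits = new MathContext(10); // a "10-digit machine"

        // Round the long input to 10 significant digits, as the text does:
        BigDecimal a = new BigDecimal("0.1234567891234567890").round(tenDigits);
        BigDecimal b = new BigDecimal("0.1234567890");
        System.out.println(a);             // 0.1234567891
        System.out.println(a.subtract(b)); // 1E-10: only 1 accurate digit remains

        // The exact difference, for comparison:
        BigDecimal exact = new BigDecimal("0.1234567891234567890")
                               .subtract(new BigDecimal("0.1234567890"));
        System.out.println(exact);         // 1.234567890E-10
    }
}
```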
It is possible to do all rational arithmetic keeping all significant digits, but
doing so is often prohibitively slow compared with floating-point arithmetic.
Furthermore, it usually only postpones the problem: what if the input to the
algorithm is empirical data that is itself accurate to only 10 digits? The same
effect will occur.
The best solution is to avoid such circumstances altogether. Where they are
not completely avoidable, one must be on the watch for them.
The root of the problem always lies in floating-point subtraction. A familiar
practical algorithm will serve as an example.


Instability of the quadratic formula

For example, consider the venerable quadratic formula for solving a
quadratic equation
a x^2 + b x + c = 0 .
The quadratic formula gives the two solutions as
x = ( −b ± ( b^2 − 4 a c )^(1/2) ) / ( 2 a ) .
The case a = 1, b = 200, c = −0.000015 will illustrate the problem:
x^2 + 200 x − 0.000015 = 0 .
Then
( b^2 − 4 a c )^(1/2) = ( 200^2 + 4 × 1 × 0.000015 )^(1/2) = 200.00000015…
In real arithmetic, the roots are
( −200 − 200.00000015 ) / 2 = −200.000000075 ,
( −200 + 200.00000015 ) / 2 = 0.000000075 .
In 10-digit floating-point arithmetic,
( −200 − 200.0000001 ) / 2 = −200.00000005 ,
( −200 + 200.0000001 ) / 2 = 0.00000005 .
Although the bigger solution is accurate to ten digits, the first nonzero digit
of the smaller solution is wrong.
The quadratic formula does not constitute a stable algorithm to calculate
the two roots of a quadratic equation. The instability is due to the subtraction
of two numbers of about the same size.

A better algorithm

A better algorithm for solving quadratic equations is based on these observations:

 one solution is always accurate when the other is not


 given one solution of the quadratic, the other is easy to find.

If
x1 = ( −b + ( b² − 4ac )^(1/2) ) / ( 2a )
and
x2 = ( −b − ( b² − 4ac )^(1/2) ) / ( 2a ) ,
then the identity holds:
x1 x2 = c / a ,
which expresses the value of one root in terms of the other.
The algorithm is: use the quadratic formula to find the larger of the two
solutions (the one where two like values aren’t subtracted), then use this identity to calculate the other root. Since no subtraction is involved in this second calculation, no loss of precision occurs.
Applying this algorithm to the example problem, and using 10-digit floating-
point arithmetic, the larger of the two solutions, as before, is x1 =
−200.00000005. The other solution is then
x2 = c / (−200.00000005) = 0.000000075 ,
which is accurate.
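In ordinary double precision the same cancellation appears, just further out in the digits. The sketch below implements the two-step algorithm described above; the function name and the sign test on b are my own choices.

```python
import math

def quadratic_roots_stable(a, b, c):
    """Return the roots of a*x**2 + b*x + c = 0, avoiding the
    subtraction of two like-sized numbers."""
    disc = math.sqrt(b * b - 4 * a * c)
    # Pick the root computed WITHOUT cancellation: -b and the chosen
    # signed discriminant have the same sign, so they add.
    if b >= 0:
        x1 = (-b - disc) / (2 * a)
    else:
        x1 = (-b + disc) / (2 * a)
    # Recover the other root from the identity x1 * x2 = c / a.
    x2 = c / (a * x1)
    return x1, x2

x1, x2 = quadratic_roots_stable(1.0, 200.0, -0.000015)
print(x1, x2)  # close to -200.000000075 and 7.5e-08
```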

Discussion

The example of the quadratic formula is quite typical of numerical problems that arise in computer calculations.
The problem is always due to subtraction of two like quantities.
Multiplication, division, and addition of like quantities are not to blame.
Subtraction of like quantities amounts to loss of information.
A problem whose solutions undergo only limited change upon a small change of input is called a well-posed problem.
If a problem is not well-posed, there is little hope of any numerical algorithm
behaving any better at calculating its solutions. On the other hand, if a problem
is well-posed, and an algorithm for calculating its solutions is unstable, one could
say the algorithm is bad. It should always be possible to arrange for a stable algorithm for a well-posed problem—but this is where the art lies.

Other examples

Cubic and Quartic formulas


The traditional formulas for solving cubic and quartic equations are
also unstable for the same reasons as for the quadratic.
Gaussian elimination
This venerable method for solving systems of linear equations, as
taught in high school algebra courses, is quite unstable numerically. It is
unstable even for systems which themselves are quite well-posed.
The instability is a result of subtractions of terms of similar size in the
algorithm. The field of study called numerical linear algebra is largely
concerned with finding better algorithms for solving linear problems.
Eigenanalysis
The decomposition of a square matrix into eigenvalues and eigenvectors is worse than the previous examples. In general, the problem itself is badly ill-conditioned. (There are classes of matrices for which it is well behaved.) There is nothing that a numerical algorithm can do to help this.

Every time I have seen this messy fact come up, someone was doing an eigenanalysis where it was quite uncalled-for. For example, a simple iteration will give the largest eigenvalue; often that is all that is needed.
Numerical differentiation
The calculus process of differentiation involves a subtraction of numbers that are by nature almost the same. This causes a lot of trouble in numerical calculations.
The usual approach, whether differentiating or stepping a differential equation, is to try to get an adequate result with a step size small enough for accuracy but not so small that the loss of significance in the differences begins to dominate. More often, practitioners rely on dumb luck.
Note that the inverse process, integration, does not suffer from
this…often re-writing the problem in terms of integrals provides an
attractive alternative algorithm.
This is a nice example of the dictum: an instability is a stability in
reverse.
ERROR PROPAGATION
Propagation of Error (or Propagation of Uncertainty) is defined as the effect of the uncertainties of individual variables on a function of those variables. It is a calculus-derived statistical calculation designed to combine uncertainties from multiple variables to provide an accurate measurement of uncertainty.

Introduction

Every measurement has an air of uncertainty about it, and not all uncertainties
are equal. Therefore, the ability to properly combine uncertainties from
different measurements is crucial. Uncertainty in measurement comes about in
a variety of ways: instrument variability, different observers, sample differences,
time of day, etc. Typically, error is given by the standard deviation (σx) of a
measurement.

Anytime a calculation requires more than one measured variable, propagation of error is necessary to properly determine the uncertainty. For example, let's say we are using a UV-Vis Spectrophotometer to determine the molar absorptivity of a molecule via Beer's Law: A = εlc. Since at least two of the variables have an uncertainty based on the equipment used, a propagation of error formula must be applied to obtain a more exact uncertainty of the molar absorptivity. This example will be continued below, after the derivation.


Derivation of Exact Formula

Suppose a certain experiment requires multiple instruments to carry out. These instruments each have different variability in their measurements. The results of each instrument are given as: a, b, c, d, … (For simplification purposes, only the variables a, b, and c will be used throughout this derivation.) The desired end result is x, so that x is dependent on a, b, and c. It can be written that x is a function of these variables:

x = f(a, b, c)    (1)

Because each measurement has an uncertainty about its mean, it can be written that the uncertainty dxi of the ith measurement of x depends on the uncertainty of the ith measurements of a, b, and c:

dxi = f(dai, dbi, dci)    (2)

The total deviation of x is then derived from the partial derivative of x with respect to each of the variables:

dx = (∂x/∂a)b,c da + (∂x/∂b)a,c db + (∂x/∂c)a,b dc    (3)

A relationship between the standard deviations of x and a, b, c, etc. is formed in two steps:

i. by squaring Equation 3, and
ii. by taking the total sum from i = 1 to i = N, where N is the total number of measurements.

In the first step, two distinct kinds of terms appear on the right hand side of the equation: square terms and cross terms.

Square Terms
(∂x/∂a)²(da)², (∂x/∂b)²(db)², (∂x/∂c)²(dc)²    (4)

Cross Terms
(∂x/∂a)(∂x/∂b)da db, (∂x/∂a)(∂x/∂c)da dc, (∂x/∂b)(∂x/∂c)db dc    (5)

Square terms, due to the nature of squaring, are always positive, and therefore never cancel each other out. By contrast, cross terms may cancel each other out, due to the possibility that each term may be positive or negative. If da, db, and dc represent random and independent uncertainties, about half of the cross terms will be negative and half positive (this is primarily due to the fact that the variables represent uncertainty about a mean). In effect, the sum of the cross terms should approach zero, especially as N increases. However, if the variables are correlated rather than independent, the cross terms may not cancel out.

Assuming the cross terms do cancel out, then the second step - summing from i = 1 to i = N - would be:

Σ(dxi)² = (∂x/∂a)² Σ(dai)² + (∂x/∂b)² Σ(dbi)²    (6)

Dividing both sides by N − 1:

Σ(dxi)²/(N − 1) = (∂x/∂a)² Σ(dai)²/(N − 1) + (∂x/∂b)² Σ(dbi)²/(N − 1)    (7)

The previous step created a situation where Equation 7 could mimic the standard deviation equation. This is desired, because it creates a statistical relationship between the variable x and the other variables a, b, c, etc., as follows:

The standard deviation equation can be rewritten as the variance (σx²) of x:

Σ(dxi)²/(N − 1) = Σ(xi − x̄)²/(N − 1) = σx²    (8)

Rewriting Equation 7 using the statistical relationship created yields the Exact Formula for Propagation of Error:

σx² = (∂x/∂a)² σa² + (∂x/∂b)² σb² + (∂x/∂c)² σc²    (9)

Thus, the end result is achieved. Equation 9 shows a direct statistical relationship between multiple variables and their standard deviations. In the next section, derivations for common calculations are given, with an example of how the derivation was obtained.
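Equation 9 can also be applied mechanically by estimating the partial derivatives numerically. The helper below is a sketch of that idea; the function name, the central-difference step h, and the check against the multiplication/division rule are my own choices.

```python
import math

def propagate_error(f, values, sigmas, h=1e-6):
    """Approximate sigma_x for x = f(a, b, c, ...) using Equation (9):
    sigma_x^2 = sum_i (df/dv_i)^2 * sigma_i^2, with central differences
    standing in for the exact partial derivatives."""
    var = 0.0
    for i, (v, s) in enumerate(zip(values, sigmas)):
        up = list(values); up[i] = v + h
        dn = list(values); dn[i] = v - h
        dfdv = (f(*up) - f(*dn)) / (2 * h)   # numerical partial derivative
        var += dfdv ** 2 * s ** 2
    return math.sqrt(var)

# Check against the analytic multiplication/division rule for x = a*b/c.
f = lambda a, b, c: a * b / c
a, b, c = 2.0, 3.0, 4.0
sa, sb, sc = 0.1, 0.2, 0.3
sigma_x = propagate_error(f, [a, b, c], [sa, sb, sc])
analytic = (a * b / c) * math.sqrt((sa / a) ** 2 + (sb / b) ** 2 + (sc / c) ** 2)
print(sigma_x, analytic)
```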

Arithmetic Calculations of Error Propagation

In the following calculations, a, b, and c are measured variables from an experiment and σa, σb, and σc are the standard deviations of those variables.


Addition or Subtraction

If x = a + b − c, then

σx = √( σa² + σb² + σc² )    (10)

Multiplication or Division

If x = (a × b)/c, then

σx/x = √( (σa/a)² + (σb/b)² + (σc/c)² )    (11)

Exponential

If x = a^y, then

σx/x = y(σa/a)    (12)

Logarithmic

If x = log(a), then

σx = 0.434(σa/a)    (13)

Anti-logarithmic

If x = antilog(a), then

σx/x = 2.303(σa)    (14)
Note

Addition, subtraction, and logarithmic equations lead to an absolute standard deviation, while multiplication, division, exponential, and anti-logarithmic equations lead to relative standard deviations.

Derivation of Arithmetic Example

The Exact Formula for Propagation of Error in Equation 9 can be used to derive the arithmetic examples noted above. Starting with a simple equation:

x = (a × b)/c    (15)


where x is the desired result with a given standard deviation, and a, b, and c are experimental variables, each with a different standard deviation. Taking the partial derivative with respect to each experimental variable, a, b, and c:

(∂x/∂a) = b/c    (16)
(∂x/∂b) = a/c    (17)

and

(∂x/∂c) = −ab/c²    (18)

Plugging these partial derivatives into Equation 9 gives:

σx² = (b/c)² σa² + (a/c)² σb² + (−ab/c²)² σc²    (19)

Dividing Equation 19 by Equation 15 squared yields:

σx²/x² = (b/c)² σa²/(ab/c)² + (a/c)² σb²/(ab/c)² + (−ab/c²)² σc²/(ab/c)²    (20)

Canceling terms and taking the square root of both sides yields Equation 11:

σx/x = √( (σa/a)² + (σb/b)² + (σc/c)² )
Example 1

Continuing the example from the introduction (where we are calculating the molar absorptivity of a molecule), suppose we have a concentration of 13.7 (±0.3) moles/L, a path length of 1.0 (±0.1) cm, and an absorbance of 0.172807 (±0.000008). The equation for molar absorptivity is dictated by Beer's Law:

ε = A/(lc)
Solution

Since Beer's Law deals with multiplication/division, we'll use Equation 11:

σε/ε = √( (0.000008/0.172807)² + (0.1/1.0)² + (0.3/13.7)² ) = 0.10237

As stated in the note above, Equation 11 yields a relative standard deviation, or a percentage of the ε variable. Using Beer's Law, ε = 0.012614 L mol⁻¹ cm⁻¹. Therefore, the σε for this example would be 10.237% of ε, which is 0.001291.


Accounting for significant figures, the final answer would be:

ε = 0.013 ± 0.001 L mol⁻¹ cm⁻¹
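The arithmetic in Example 1 can be checked directly; the numbers below are just the values given above.

```python
import math

A, sA = 0.172807, 0.000008   # absorbance and its uncertainty
l, sl = 1.0, 0.1             # path length (cm) and its uncertainty
c, sc = 13.7, 0.3            # concentration (mol/L) and its uncertainty

eps = A / (l * c)            # Beer's Law: A = eps * l * c
rel = math.sqrt((sA / A) ** 2 + (sl / l) ** 2 + (sc / c) ** 2)  # Eq. (11)
s_eps = rel * eps            # absolute uncertainty of eps
print(eps, rel, s_eps)
```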

If you are given an equation that relates two different variables and given the
relative uncertainties of one of the variables, it is possible to determine the
relative uncertainty of the other variable by using calculus. In problems, the
uncertainty is usually given as a percent. Let's say we measure the radius of a
very small object. The problem might state that there is a 5% uncertainty when
measuring this radius.

To actually use this percentage to calculate unknown uncertainties of other variables, we must first define what uncertainty is. Uncertainty, in calculus, is defined as:

(dx/x) = (Δx/x) = uncertainty
Example 2

Let's look at the example of the radius of an object again. If we know the
uncertainty of the radius to be 5%, the uncertainty is defined as

(dx/x) = (Δx/x) = 5% = 0.05.

Now we are ready to use calculus to obtain an unknown uncertainty of another variable. Let's say we measure the radius of an artery and find that the uncertainty is 5%. What is the uncertainty of the measurement of the volume of blood passing through the artery? Let's say the equation relating radius and volume is:

V(r) = cr²

where c is a constant, r is the radius and V(r) is the volume.

Solution

The first step to finding the uncertainty of the volume is to understand our given
information. Since we are given the radius has a 5% uncertainty, we know that
(∆r/r) = 0.05. We are looking for (∆V/V).

Now that we have done this, the next step is to take the derivative of this
equation to obtain:


dV/dr ≈ ΔV/Δr = 2cr

We can now multiply both sides of the equation by Δr to obtain:

ΔV = 2cr(Δr)

Since we are looking for (∆V/V), we divide both sides by V to get:

ΔV/V = 2cr(Δr)/V

We are given the equation of the volume to be V = cr², so we can plug this back into our previous equation for V to get:

ΔV/V = 2cr(Δr)/(cr²)

Now we can cancel variables that are in both the numerator and denominator
to get:

ΔV/V = 2(Δr/r)

We have now narrowed down the equation so that Δr/r is left. We know the value of uncertainty for Δr/r to be 5%, or 0.05. Plugging this value in for Δr/r we get:

ΔV/V = 2(0.05) = 0.1 = 10%

The uncertainty of the volume is 10%. This method can be used in chemistry as
well, not just the biological example shown above.
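The same result falls out of the linearized relation ΔV = (dV/dr)Δr; a quick numeric check, where the constant c and radius r are arbitrary values of my own:

```python
c, r = 3.0, 1.5            # arbitrary constant and radius for the check
dr = 0.05 * r              # 5% uncertainty in r

V = c * r ** 2
dV = 2 * c * r * dr        # linearized: dV = (dV/dr) * dr
rel_V = dV / V             # relative uncertainty of the volume
print(rel_V)               # 0.1, i.e. 10%
```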

CONDITION AND INSTABILITY


Condition of a Problem
Every problem that we try to solve is based on an expression of some form or
another. To have confidence in our solution we would first need to know that
the expression is continuous in its inputs, so that we would not get completely
different results from slight changes in the input. However, this is not enough.
We also need to know that the expression is well-conditioned. If it is well-
conditioned, then small changes in the input to the expression will lead to small
changes in the results. If small changes in the input lead to large changes in the
output, then we call the problem ill-conditioned. The exact cutoff between well- and ill-conditioned depends on the context of the problem and the uses of the results.
Example:
Suppose we want to evaluate the expression y = x/(1 − x).
With x = 0.93
we get y = 13.28...,
but with x = 0.94
we get y = 15.66....
So we would probably say that this expression is ill-conditioned when evaluated
for x near 0.93.
On the other hand,
if we use x = −0.93 and x = −0.94,
we get values of −0.4818... and −0.4845..., and we would say it is well-conditioned for x near −0.93.
For many types of problems we can compute a condition number that indicates
the magnification of the changes. The condition number is defined by
Relative error in the output ≈ Condition number × Relative error in the input.
For example, consider evaluating a function f(x) at a point x = x0. The input is x0 and the output is f(x0). If we perturb the input to x = x0 + h, then the output is f(x0 + h), and by applying the Mean Value Theorem we get f(x0 + h) − f(x0) ≈ f′(x0) h, which leads to the condition number

Cf(x) = | x f′(x) / f(x) | .

Applying this to f(x) = x/(1 − x) we get Cf(x) = 1/|1 − x|. Then Cf(0.93) = 14.28... and Cf(−0.93) = 0.5181.... This is consistent with what we saw in the example above.
With the condition number as defined above, we still have no sharp cutoff
between well- and ill-conditioned. We do know that if the condition number is
less than 1 then it is well-conditioned and if the condition number is arbitrarily
large then it is ill-conditioned. We usually study condition in limiting cases where
these extremes are observed. For example, f(x) = x/(1 − x) is clearly ill-
conditioned for x near 1 and well-conditioned for x < 0 and x > 2.
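The condition number Cf(x) = |x f′(x)/f(x)| for this example can be tabulated directly; the small sketch below hard-codes the derivative of f(x) = x/(1 − x).

```python
def cond(x):
    """Condition number of f(x) = x / (1 - x), which works out to 1 / |1 - x|."""
    f = x / (1 - x)
    fp = 1 / (1 - x) ** 2      # derivative of f
    return abs(x * fp / f)

print(cond(0.93))    # about 14.28 -> ill-conditioned near x = 0.93
print(cond(-0.93))   # about 0.518 -> well-conditioned near x = -0.93
```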
Stability of an Algorithm
When we study an algorithm our interest is the same as for an expression: we
want small changes in the input to only produce small changes in the output. An
algorithm or numerical process is called stable if this is true and it is called
unstable if large changes in the output are produced. Analyzing an algorithm for
stability is more complicated than determining the condition of an expression, even if the algorithm simply evaluates the expression. This is because an
algorithm consists of many basic calculations and each one must be analyzed
and, due to roundoff error, we must consider the possibility of small errors being
introduced in every computed value.
For example evaluating y = x/(1 − x) would be broken into 2 steps: t = 1 − x and
y = x/t. Also, we would consider both x and t to have small errors.
An algorithm is stable if every step is well-conditioned. It is unstable if any step
is ill-conditioned.
Example: Consider evaluating f(x) = √(1 + x) − 1 for x near 0.
Cf(x) = (√(1 + x) + 1) / (2√(1 + x)), so Cf(0) = 1,
so it is not ill-conditioned.
To compute this we would have the 3 steps: (1) t1 = 1 + x, (2) t2 = √t1, and (3) f(x) = t2 − 1.
Steps (1) and (2) are well-conditioned (condition numbers 0 and 1/2),
while Step (3) is ill-conditioned near t2 = 1.
Thus we would say that this algorithm is unstable.
This last example is interesting because the problem is well-conditioned but the
obvious algorithm used to evaluate it is unstable. What this means is that there
is a different algorithm for evaluating the original expression which would be stable. In this case we can re-formulate our function as f(x) = x/(√(1 + x) + 1) by multiplying the original function by (√(1 + x) + 1)/(√(1 + x) + 1). This new version is equivalent to the original function, but by evaluating it in the obvious way, we have a stable algorithm.
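A quick double-precision comparison of the two formulations; the evaluation point x = 1e-12 is an arbitrarily small value of my own choosing.

```python
import math

def f_naive(x):
    return math.sqrt(1 + x) - 1          # subtracts two like values near x = 0

def f_stable(x):
    return x / (math.sqrt(1 + x) + 1)    # algebraically identical, no cancellation

x = 1e-12
print(f_naive(x), f_stable(x))           # true value is about 5e-13
```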
COMPUTATIONAL METHODS FOR ERROR ESTIMATION
Many engineering problems are too time consuming to solve or may not be able
to be solved analytically. In these situations, numerical methods are usually
employed. Numerical methods are techniques designed to solve a problem
using numerical approximations. An example of an application of numerical
methods is trying to determine the velocity of a falling object. If you know the
exact function that determines the position of your object, then you could
potentially differentiate the function to obtain an expression for the velocity.
More often, you will use a machine to record readings of times and positions
that you can then use to numerically solve for velocity:

v(t) ≈ [ f(t + h) − f(t) ] / h

where f is your function, t is the time of the reading, and h is the distance to the next time step.

Because your answer is an approximation of the analytical solution, there is an inherent error between the approximated answer and the exact solution. Errors
can result prior to computation in the form of measurement errors or
assumptions in modeling. The focus of this blog post will be on understanding
two types of errors that can occur during computation: roundoff errors and
truncation errors.
Roundoff Error
Roundoff errors occur because computers have a limited ability to represent
numbers. For example, π has infinite digits, but due to precision limitations, only
16 digits may be stored in MATLAB. While this roundoff error may seem
insignificant, if your process involves multiple iterations that are dependent on
one another, these small errors may accumulate over time and result in a
significant deviation from the expected value. Furthermore, if a manipulation
involves adding a large and small number, the effect of the smaller number may
be lost if rounding is utilized. Thus, it is advised to sum numbers of similar
magnitudes first so that smaller numbers are not “lost” in the calculation.
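The "sum numbers of similar magnitudes first" advice is easy to see in double precision; the magnitudes below are chosen by me to make the effect visible.

```python
big = 1e16
smalls = [1.0] * 10

# Adding each 1.0 to the large number: every single addition rounds away,
# because 1.0 is below half the spacing between doubles near 1e16.
lost = big
for s in smalls:
    lost += s

# Summing the small numbers first preserves their combined contribution.
kept = sum(smalls) + big

print(lost, kept)
```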

One interesting example that we covered in my Engineering Computation class, which can be used to illustrate this point, involves the quadratic formula. The quadratic formula is represented as follows:

x = ( −b ± √(b² − 4ac) ) / ( 2a )
Using a = 0.2, b = −47.91, c = 6, and if we carry out rounding to two decimal places at every intermediate step:

The error between our approximations and true values can be found as follows:


As can be seen, the smaller root has a larger error associated with it because
deviations will be more apparent with smaller numbers than larger numbers.

If you have the insight to see that your computation will involve operations with
numbers of differing magnitudes, the equations can sometimes be cleverly
manipulated to reduce roundoff error. In our example, if the quadratic formula
equation is rationalized, the resulting absolute error is much smaller because
fewer operations are required and numbers of similar magnitudes are being
multiplied and added together:
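The effect can be sketched with Python's decimal module. The coefficients are the ones from the text; as a stand-in for the text's "round to two decimal places at every step", the sketch below limits every operation to 6 significant digits via the decimal context, which is my own simplification of the same idea.

```python
from decimal import Decimal, getcontext

# Limited-precision arithmetic: every operation keeps 6 significant digits.
getcontext().prec = 6

a, b, c = Decimal("0.2"), Decimal("-47.91"), Decimal("6")
disc = (b * b - 4 * a * c).sqrt()

standard = (-b - disc) / (2 * a)        # smaller root: subtracts like values
rationalized = (2 * c) / (-b + disc)    # same root, rationalized, no cancellation

print(standard, rationalized)           # true smaller root is about 0.12530
```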

Truncation Error
Truncation errors are introduced when exact mathematical formulas are
represented by approximations. An effective way to understand truncation
error is through a Taylor Series approximation. Let’s say that we want to
approximate some function, f(x) at the point xi+1, which is some distance, h, away
from the basepoint xi, whose true value is shown in black in Figure 1. The Taylor
series approximation starts with a single zero order term and as additional terms
are added to the series, the approximation begins to approach the true value.
However, an infinite number of terms would be needed to reach this true value.


Figure 1: Graphical representation of a Taylor Series approximation (Chapra, 2017)
The Taylor Series can be written as follows:

f(xi+1) = f(xi) + f′(xi)h + (f″(xi)/2!)h² + … + (f⁽ⁿ⁾(xi)/n!)hⁿ + Rn
where Rn is a remainder term used to account for all of the terms that were not
included in the series and is therefore a representation of the truncation error.
The remainder term is generally expressed as Rn=O(hn+1) which shows that
truncation error is proportional to the step size, h, raised to the n+1 where n is
the number of terms included in the expansion. It is clear that as the step size
decreases, so does the truncation error.
The Tradeoff in Errors
The total error of an approximation is the summation of roundoff error and
truncation error. As seen from the previous sections, truncation error decreases
as step size decreases. However, when step size decreases, this usually results
in the necessity for more precise computations which consequently results in an
increase in roundoff error. Therefore, the errors are in direct conflict with one
another: as we decrease one, the other increases.


However, the optimal step size to minimize error can be determined. Using an
iterative method of trying different step sizes and recording the error between
the approximation and the true value, the following graph shown in Figure 2 will
result. The minimum of the curve corresponds to the minimum error achievable
and corresponds to the optimal step size. Any error to the right of this point
(larger step sizes) is primarily due to truncation error and the increase in error
to the left of this point corresponds to where roundoff error begins to dominate.
While this graph is specific to a certain function and type of approximation, the
general rule and shape will still hold for other cases.
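The tradeoff can be reproduced with a forward-difference derivative; the function sin(x), the evaluation point x = 1, and the step sizes are arbitrary choices of mine.

```python
import math

def fd_error(h):
    """Error of the forward-difference approximation of d/dx sin(x) at x = 1."""
    approx = (math.sin(1 + h) - math.sin(1)) / h
    return abs(approx - math.cos(1))     # compare against the exact derivative

for h in [1e-1, 1e-4, 1e-8, 1e-12]:
    print(f"h = {h:.0e}   error = {fd_error(h):.2e}")
```

The error shrinks as h decreases from 1e-1 toward 1e-8 (truncation error dominates), then grows again by h = 1e-12, where roundoff error takes over.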

Figure 2: Plot of Error vs. Step Size (Chapra, 2017)


Hopefully this blog post was helpful to increase awareness of the types of errors
that you may come across when using numerical methods! Internalize these
golden rules to help avoid loss of significance:

 Avoid subtracting two nearly equal numbers


 If your equation has large and small numbers, work with smaller
numbers first

 Consider rearranging your equation so that numbers of a similar magnitude are being used in an operation

CONVERGENCE OF SEQUENCES

SOLUTION OF A SYSTEM OF SIMULTANEOUS LINEAR ALGEBRAIC EQUATIONS
In mathematics, linear equations are equations in which the highest power of the variable is one. The graph of a linear equation is always a straight line.
Thus, simultaneous linear equations are the system of two linear equations in
two or three variables that are solved together to find a common solution. The
simultaneous linear equation is represented as follows:

a1x + b1y = c1
a2x + b2y = c2
Here, a1 and a2 are the coefficients of x, and b1 and b2 are the coefficients of y,
and c1 and c2 are constant. The solution for the system of linear equations is the
ordered pair (x, y), which satisfies the given equation.
In this article, we are going to learn different methods of solving simultaneous
linear equations with steps and many solved examples in detail.

Methods to Solve Simultaneous Linear Equations


Generally, three different methods are used to solve the simultaneous linear
equations. They are:

 Substitution Method
 Elimination Method

 Graphical Method

Now, let us discuss all these three methods in detail with examples.

Substitution Method
Follow the below steps to solve the system of linear equations or simultaneous
linear equations using the substitution method:

1. Rearrange one of the given equations to express x in terms of y.


2. Now, the expression for x can be substituted in the other equation to find
the value of y.
3. Finally, substitute the value of y in any of the equations to find the value
of x.

Example 1:
Solve the following system of linear equations using the substitution method:


3x – 4y = 0
9x – 8y = 12
Solution:
Given:
3x – 4y = 0 …(1)
9x – 8y = 12 …(2)
Step 1: Rearrange equation (1) to express x in terms of y:
⇒3x = 4y
⇒x = 4y/3 …(3)
Now substitute x = 4y/3 in (2), we get
⇒9(4y/3) – 8y = 12
⇒(36y/3) – 8y = 12
⇒12y – 8y = 12
⇒4y = 12
⇒y = 12/4 = 3
Hence, the value of y is 3
Now, substitute y= 3 in (3) to get the value of x
⇒ x = [4(3)]/3 = 12/3 = 4
Therefore, x = 4
Thus, the solution for the simultaneous linear equations (x, y) is (4, 3).
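The substitution steps above can be mirrored in code for the general pair a1x + b1y = c1 and a2x + b2y = c2; the function name is mine, and the sketch assumes a1 ≠ 0 and a unique solution.

```python
def solve_by_substitution(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by substitution
    (assumes a1 != 0 and that the system has a unique solution)."""
    # Step 1: express x in terms of y from the first equation:
    #   x = (c1 - b1*y) / a1
    # Step 2: substitute into the second equation and solve for y.
    y = (c2 - a2 * c1 / a1) / (b2 - a2 * b1 / a1)
    # Step 3: back-substitute to find x.
    x = (c1 - b1 * y) / a1
    return x, y

# Example 1: 3x - 4y = 0 and 9x - 8y = 12
print(solve_by_substitution(3, -4, 0, 9, -8, 12))  # -> (4.0, 3.0)
```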

Elimination Method
To find the solutions (x, y) for the system of linear equations using
the elimination method, go through the below steps:

1. Multiply the given equations by a constant, so as to make the coefficients


of the variables in the equations equal.
2. Add or subtract the equations to eliminate the variable having the same
coefficients.
3. Now, solve the equation for one variable.
4. Substitute the variable value in any of the equations to find the value of
the other variable.

5. Finally, the ordered pair (x, y) is the solution of the simultaneous equation.

Example 2:
Solve the system of equations using the elimination method:
2x+3y= 11
x+2y = 7
Solution:
Given:
2x+3y= 11 …(1)
x+2y = 7 …(2)
Now, multiply the equation (2) by 2, we get
2x + 4y = 14 …(3)
Now, subtract equation (3) from equation (1):
(2x + 3y) − (2x + 4y) = 11 − 14
⇒ −y = −3
⇒ y = 3
Now, substitute y = 3 in equation (2),
x + 2(3) = 7
x+6=7
x=1
Hence, x = 1 and y = 3
Therefore, the solution for the system of equations 2x + 3y = 11 and x + 2y = 7 is x = 1 and y = 3.
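Example 2's elimination steps translate directly to code. This sketch (the function name is mine) scales the second equation so the x coefficients match, subtracts, and uses exact rational arithmetic so no roundoff intrudes; it assumes a2 ≠ 0 and a unique solution.

```python
from fractions import Fraction

def solve_by_elimination(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by elimination."""
    a1, b1, c1 = map(Fraction, (a1, b1, c1))
    a2, b2, c2 = map(Fraction, (a2, b2, c2))
    m = a1 / a2                      # multiplier that matches the x coefficients
    b2, c2 = b2 * m, c2 * m          # scale equation 2
    y = (c1 - c2) / (b1 - b2)        # subtract to eliminate x, solve for y
    x = (c1 - b1 * y) / a1           # back-substitute
    return x, y

# Example 2: 2x + 3y = 11 and x + 2y = 7 gives x = 1, y = 3
print(solve_by_elimination(2, 3, 11, 1, 2, 7))
```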

Graphical Method
To solve the simultaneous linear equations graphically, follow the below steps:

1. First, find the coordinates of points satisfying each of the two equations by substituting the values of x as 1, 2, 3, 4, etc.
2. Now, plot the points and we get two straight lines.
3. Observe the common point of intersection of two straight lines.
4. The common point of intersection of two straight lines is the solution of
the simultaneous linear equation.



Example 3:
Solve the simultaneous linear equations graphically: x+y = 4 and x -y = 0.
Solution:
Given:
x+y = 4 …(1)
x -y = 0 …(2)
Let us take equation (1), x+y = 4
When x = 1,
⇒ y = 4-1 = 3
When x = 2,
⇒y= 4-2 = 2
When x = 3,
⇒ y = 4-3 = 1

x 1 2 3

y 3 2 1
Now, take the second equation x-y = 0
When x = 1,
⇒ -y = 0-1
⇒y=1
When x = 2,
⇒ -y = 0-2
⇒y=2

When x = 3,
⇒ -y = 0-3
⇒y=3

x 1 2 3

y 1 2 3
Now, plot these points in the XY coordinate plane.

From the graph, it is observed that the point of intersection of two straight lines
is (2, 2), which is the solution for the given simultaneous linear equation.

ITERATIVE SCHEMES OF MATRIX INVERSION


The solution of linear algebraic systems of the form Ax = b, where A is an n × n matrix (all matrices in this paper are of the same dimension, unless it is clearly stated) and b is a given vector, is central to many numerical simulations and is often the most time-consuming part of a computation. Direct methods, which work essentially by finding the inverse of the coefficient matrix A, are very
robust, and they tend to require a predictable amount of resources in terms of time and storage. Their only problem, in fact, lies in the massive need of time and memory in calculations, which normally puts them out of interest, especially for the cases when A is sparse.
At this time, iterative methods can be taken into account. The relaxation and
Krylov subspace methods are of such type, which are reliable in solving large
scale sparse systems. On the other hand, we have another type of iterative
methods, which are called Schulz-type iterations in the literature (see, e.g.,).
These techniques are based on finding robust approximate inverses of a given
matrix.
The oldest technique of this type is the Schulz method, defined by

Vk+1 = Vk(2I − AVk),   k = 0, 1, 2, …,

where I is the identity matrix. In the general case, it is known to converge provided ρ(I − AV0) < 1, where V0 is the initial approximation and ρ(·) denotes the spectral radius. Such schemes are also useful for sensitivity analysis when accurate approximate inverses are needed for both square and rectangular matrices. Notice that, for rectangular matrices, one may obtain their generalized inverse using such iterative methods.
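A pure-Python sketch of the Schulz iteration follows. The initialization V0 = Aᵀ/(‖A‖₁‖A‖∞) is a standard choice that guarantees ρ(I − AV0) < 1 for nonsingular A; the helper names, iteration count, and test matrix are my own.

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def schulz_inverse(A, iters=30):
    """Approximate A**-1 with the Schulz iteration V_{k+1} = V_k (2I - A V_k),
    starting from V_0 = A^T / (norm1(A) * norminf(A))."""
    n = len(A)
    norm1 = max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))
    norminf = max(sum(abs(a) for a in row) for row in A)
    V = [[A[j][i] / (norm1 * norminf) for j in range(n)] for i in range(n)]
    twoI = [[2.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iters):
        AV = matmul(A, V)
        T = [[twoI[i][j] - AV[i][j] for j in range(n)] for i in range(n)]
        V = matmul(V, T)                 # V <- V (2I - A V)
    return V

A = [[4.0, 1.0], [1.0, 3.0]]
V = schulz_inverse(A)
print(V)  # close to the true inverse [[3/11, -1/11], [-1/11, 4/11]]
```

The convergence is quadratic once under way, which is why a few dozen iterations reach machine precision here.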
These solvers are also the method of choice in certain areas such as circuits,
power system networks, and chemical plant modeling. A practical application of
Schulz-type methods, which has recently attracted the attention of some numerical analysts, is in preconditioning.
any approximate inverse preconditioners up to the desired accuracy and then
solve the preconditioned linear system as rapidly as possible.
Hence, we are interested in finding a new iterative method belonging to the
class of Schulz-type methods for finding approximate inverses in this work.
Remark 1. We use the matrix 2-norm in all subsequent derivations and discussions unless other forms are clearly stated.
The rest of this paper is organized in what follows. In the next section, we briefly
review some of the existing Schulz-type methods and provide a new
mathematical proof for one of the unproved iterative methods. Another
contribution of this paper is presented in Section 3. That section is also devoted
to the analysis of convergence. Section 4 covers the numerical simulations and
some notes in order to have the best feedback in practical implementations.

DIRECT METHODS FOR MATRIX INVERSION


We start with a system of equations to be solved. The number of variables must equal the number of equations, both equal to 3 here:

x − y + 3z = 2
2x + y + 2z = 2
−2x − 2y + z = 3

On Prof. McFarland's tests, you would be asked to solve the above problem

"by the matrix inverse method" in three distinct steps, as follows:

[1]
Write the given system (above) as a single matrix equation AX = B:

[  1  -1   3 ] [ x ]   [ 2 ]
[  2   1   2 ] [ y ] = [ 2 ]
[ -2  -2   1 ] [ z ]   [ 3 ]

Capital letter variables represent the matrices (not numbers) which sit directly above them. Hence, the above equation is a matrix equation. Notice how the matrix entries correspond to numbers in the original system of 3 equations.

[2]
Solve the matrix equation obtained in step [1] above; i.e., find X.

In the MATRIX INVERSE METHOD (unlike Gauss/Jordan), we solve for the matrix variable X by left-multiplying both sides of the above matrix equation (AX = B) by A⁻¹. Typically, A⁻¹ is calculated as a separate exercise; otherwise, we must pause here to calculate A⁻¹.

The left side, A⁻¹AX, is easy to simplify because an identity matrix I₃ appears, leaving X = A⁻¹B. Note: A⁻¹ is on the LEFT side of both products.

Step [2] is now completed by computing X = A⁻¹B.

[3]
Using [2] above, write the solution to the original system:

    original system          solution
     x -  y + 3z = 2         x = -3
    2x +  y + 2z = 2         y = 14/5
   -2x - 2y +  z = 3         z = 13/5
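The three steps above can be checked numerically; a minimal sketch in NumPy
(the array values are copied from the system above):

```python
import numpy as np

# Step [1]: write the system as a single matrix equation AX = B
A = np.array([[ 1.0, -1.0, 3.0],
              [ 2.0,  1.0, 2.0],
              [-2.0, -2.0, 1.0]])
B = np.array([2.0, 2.0, 3.0])

# Step [2]: left-multiply both sides by the inverse, X = A^(-1) B
A_inv = np.linalg.inv(A)
X = A_inv @ B

# Step [3]: read off the solution x, y, z
print(X)  # approximately [-3, 14/5, 13/5] = [-3, 2.8, 2.6]
```

In practice one would call `np.linalg.solve(A, B)` rather than forming the
inverse explicitly, but the inverse form mirrors the three steps of the method.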

DIRECT METHODS FOR BANDED MATRICES


In mathematics, particularly matrix theory, a band matrix or banded matrix is
a sparse matrix whose non-zero entries are confined to a diagonal band,
comprising the main diagonal and zero or more diagonals on either side.

Bandwidth
Formally, consider an n-by-n matrix A = (a_ij). If all matrix elements are zero
outside a diagonally bordered band whose range is determined by constants k1
and k2,

    a_ij = 0   whenever   j < i - k1  or  j > i + k2,   where k1, k2 >= 0,

then the quantities k1 and k2 are called the lower bandwidth and upper
bandwidth, respectively.[1] The bandwidth of the matrix is the maximum
of k1 and k2; in other words, it is the smallest number k such that
a_ij = 0 whenever |i - j| > k.
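The definition translates directly into code; a small sketch that measures the
lower and upper bandwidth of a dense array (the function name `bandwidths` is
my own, not from the text):

```python
import numpy as np

def bandwidths(A):
    """Return (k1, k2), the lower and upper bandwidth of a square matrix A."""
    rows, cols = np.nonzero(A)
    d = rows - cols                  # positive below the diagonal, negative above
    k1 = int(d.max(initial=0))       # lower bandwidth: largest i - j among nonzeros
    k2 = int((-d).max(initial=0))    # upper bandwidth: largest j - i among nonzeros
    return k1, k2

# A tridiagonal matrix has k1 = k2 = 1
T = np.diag([4.0] * 5) + np.diag([-1.0] * 4, 1) + np.diag([-1.0] * 4, -1)
print(bandwidths(T))  # (1, 1)
```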

Examples

 A band matrix with k1 = k2 = 0 is a diagonal matrix


 A band matrix with k1 = k2 = 1 is a tridiagonal matrix
 For k1 = k2 = 2 one has a pentadiagonal matrix and so on.
 Triangular matrices
o For k1 = 0, k2 = n−1, one obtains the definition of an
upper triangular matrix
o similarly, for k1 = n−1, k2 = 0 one obtains a lower triangular matrix.
 Upper and lower Hessenberg matrices
 Toeplitz matrices when bandwidth is limited.
 Block diagonal matrices
 Shift matrices and shear matrices
 Matrices in Jordan normal form
 A skyline matrix, also called "variable band matrix" – a generalization of
band matrix
 The inverses of Lehmer matrices are constant tridiagonal matrices, and
are thus band matrices.


Applications
In numerical analysis, matrices from finite element or finite difference problems
are often banded. Such matrices can be viewed as descriptions of the coupling
between the problem variables; the banded property corresponds to the fact
that variables are not coupled over arbitrarily large distances. Such matrices can
be further divided – for instance, banded matrices exist where every element in
the band is nonzero. These often arise when discretising one-dimensional
problems.
Problems in higher dimensions also lead to banded matrices, in which case the
band itself also tends to be sparse. For instance, a partial differential equation
on a square domain (using central differences) will yield a matrix with a
bandwidth equal to the square root of the matrix dimension, but inside the band
only 5 diagonals are nonzero. Unfortunately, applying Gaussian elimination (or
equivalently an LU decomposition) to such a matrix results in the band being
filled in by many non-zero elements.
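This fill-in is easy to observe with an unpivoted Doolittle LU factorization of
the five-point-stencil matrix described above; a sketch under illustrative
assumptions (the grid size and helper names are mine, not from the text):

```python
import numpy as np

def five_point_laplacian(m):
    """Matrix of the 5-point stencil on an m-by-m grid (n = m*m, bandwidth m)."""
    n = m * m
    A = 4.0 * np.eye(n)
    for i in range(n):
        if (i + 1) % m != 0:          # right neighbour (same grid row)
            A[i, i + 1] = A[i + 1, i] = -1.0
        if i + m < n:                 # neighbour in the next grid row
            A[i, i + m] = A[i + m, i] = -1.0
    return A

def lu_nopivot(A):
    """Doolittle LU factorization without pivoting; returns (L, U)."""
    U = A.astype(float).copy()
    n = len(U)
    L = np.eye(n)
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return L, U

A = five_point_laplacian(4)   # bandwidth 4, but diagonals at offsets 2, 3 are zero
L, U = lu_nopivot(A)
# Elimination fills in the originally-zero diagonals inside the band of U:
print(np.count_nonzero(np.diag(A, 2)), np.abs(np.diag(U, 2)).max() > 0)
```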

Band storage
Band matrices are usually stored by storing the diagonals in the band; the rest is
implicitly zero.
For example, a tridiagonal matrix has bandwidth 1, so a 6-by-6 tridiagonal
matrix can be stored as a 6-by-3 matrix: one column for the subdiagonal, one
for the main diagonal, and one for the superdiagonal.
A further saving is possible when the matrix is symmetric. For example,
a symmetric 6-by-6 matrix with an upper bandwidth of 2 can be stored as a
6-by-3 matrix holding only the main diagonal and the two superdiagonals.
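A sketch of this storage scheme for the general banded case (the column layout
here is one common convention; libraries such as LAPACK define their own):

```python
import numpy as np

def to_band(A, k1, k2):
    """Store the diagonals of a banded n-by-n matrix A in an n-by-(k1+k2+1) array,
    so that band[i, offset + k1] == A[i, i + offset]."""
    n = len(A)
    band = np.zeros((n, k1 + k2 + 1))
    for offset in range(-k1, k2 + 1):      # offset < 0: subdiagonals
        d = np.diag(A, offset)
        col = offset + k1                  # column holding this diagonal
        if offset >= 0:
            band[:n - offset, col] = d
        else:
            band[-offset:, col] = d
    return band

# 6-by-6 tridiagonal matrix (bandwidth 1) -> 6-by-3 band storage
A = np.diag([2.0] * 6) + np.diag([-1.0] * 5, 1) + np.diag([-1.0] * 5, -1)
B = to_band(A, 1, 1)
print(B.shape)  # (6, 3)
```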

Band form of sparse matrices


From a computational point of view, working with band matrices is always
preferable to working with similarly dimensioned square matrices. A band
matrix can be likened in complexity to a rectangular matrix whose row
dimension is equal to the bandwidth of the band matrix. Thus the work involved
in performing operations such as multiplication falls significantly, often
yielding huge savings in calculation time and complexity.
As sparse matrices lend themselves to more efficient computation than dense
matrices, and to more efficient utilization of computer storage, there has
been much research focused on finding ways to minimise the bandwidth (or
directly minimise the fill-in) by applying permutations to the matrix, or other
such equivalence or similarity transformations.
The Cuthill–McKee algorithm can be used to reduce the bandwidth of a
sparse symmetric matrix. There are, however, matrices for which the reverse
Cuthill–McKee algorithm performs better. There are many other methods in
use.
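A minimal sketch of the (unreversed) Cuthill–McKee ordering on a graph given as
an adjacency dict — a hand-rolled illustration, not a production implementation
(SciPy ships `scipy.sparse.csgraph.reverse_cuthill_mckee` for real use):

```python
from collections import deque

def cuthill_mckee(adj):
    """Breadth-first ordering that visits low-degree vertices first."""
    visited, order = set(), []
    # handle every connected component, seeding each from a minimum-degree vertex
    for start in sorted(adj, key=lambda v: len(adj[v])):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    return order  # reverse this list to obtain the reverse Cuthill-McKee ordering

def bandwidth(adj, order):
    """Bandwidth of the adjacency structure under a given vertex ordering."""
    pos = {v: i for i, v in enumerate(order)}
    return max(abs(pos[v] - pos[w]) for v in adj for w in adj[v])

# A path graph whose vertex labels are scrambled: 0-3-1-4-2
adj = {0: {3}, 3: {0, 1}, 1: {3, 4}, 4: {1, 2}, 2: {4}}
print(bandwidth(adj, [0, 1, 2, 3, 4]))     # 3 with the original labels
print(bandwidth(adj, cuthill_mckee(adj)))  # 1 after reordering
```

Reversing the ordering does not change the bandwidth; the reversal tends to
reduce fill-in during elimination, which is why the reverse variant is
usually preferred.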
The problem of finding a representation of a matrix with minimal bandwidth by
means of permutations of rows and columns is NP-hard.
