
Floating Point Representation of Numbers

FP is useful for representing numbers over a wide range: from very small to very large. It is widely used in the scientific world. Consider the following FP representation of a number:

  Sign bit   Exponent E   Significand F (also called mantissa)
    +/-      x x x x      y y y y y y y y y y y y
In decimal, it means (+/-) 1.yyyyyyyyyyyy x 10^xxxx

In binary, it means (+/-) 1.yyyyyyyyyyyy x 2^xxxx
(The leading 1 is implied.)
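For example, the decimal value -0.75 is -1.1 x 2^-1 in this binary form: the sign is negative, the exponent is -1, and only the .1 part of the significand needs to be stored, since the leading 1 is implied.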

IEEE 754 single-precision (32 bits)


  s         xxxxxxxx        yyyyyyyyyyyyyyyyyyyyyyy
  sign      exponent        significand
  (1 bit)   (8 bits)        (23 bits)

Largest  = 1.111...1 x 2^+127   (roughly 2 x 10^+38)
Smallest = 1.000...0 x 2^-126   (roughly 1 x 10^-38)

These can be positive or negative, depending on s.
(But there are exceptions too.)

IEEE 754 double precision (64 bits)

  s         xxxxxxxxxxx     yyy ... y
  sign      exponent        significand
  (1 bit)   (11 bits)       (52 bits)

Largest  = 1.111...1 x 2^+1023
Smallest = 1.000...0 x 2^-1022

Overflow and underflow in FP


An overflow occurs when a number is too large in magnitude to fit in the given format. An underflow occurs when a number is too small in magnitude (too close to zero) to fit in the given format.
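As a small C sketch (not from the original slides), the following shows single-precision overflow producing infinity and underflow losing all significance:

#include <stdio.h>
#include <float.h>

int main(void) {
    float big  = FLT_MAX;        /* largest finite single-precision value      */
    float tiny = FLT_MIN;        /* smallest normalized single-precision value */

    float over  = big * 2.0f;    /* overflow: too large for the format, becomes +inf */
    float under = tiny / 1e20f;  /* underflow: too close to zero, drops to 0         */

    printf("overflow : %g\n", over);    /* prints inf */
    printf("underflow: %g\n", under);   /* prints 0   */
    return 0;
}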

How do we represent zero?


IEEE standards committee solved this by
making zero a special case: if every bit is zero
(the sign bit being irrelevant), then the
number is considered zero.

Then how do we represent 1.0?



It should have been 1.0 x 2^0, but then its bit pattern would be all zeros, the same as zero! The way out of this is that the interpretation of the exponent bits is not straightforward. The exponent of a single-precision float is "shift-127" encoded (a biased representation), meaning that the actual exponent is (xxxxxxxx minus 127). So, thankfully, we can get an exponent of zero by storing 127.

Exponent = 11111111 (i.e. 255) means 255 - 127 = +128
Exponent = 01111111 (i.e. 127) means 127 - 127 = 0
Exponent = 00000001 (i.e. 1)   means 1 - 127   = -126
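As an illustration (not part of the original slides), a small C sketch that extracts the three fields of a single-precision float and applies the shift-127 rule; the helper name decode is made up for the example:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Decode the sign, biased exponent, and mantissa fields of a single-precision float. */
static void decode(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);              /* reinterpret the 32 bits of the float */

    unsigned sign     = (bits >> 31) & 0x1;      /* 1 bit                      */
    unsigned exponent = (bits >> 23) & 0xFF;     /* 8 bits, shift-127 encoded  */
    unsigned mantissa = bits & 0x7FFFFF;         /* 23 bits, leading 1 implied */

    printf("%g: sign=%u  stored exponent=%u  actual exponent=%d  mantissa=0x%06X\n",
           f, sign, exponent, (int)exponent - 127, mantissa);
}

int main(void) {
    decode(1.0f);    /* stored exponent 127 -> actual exponent 0        */
    decode(0.5f);    /* stored exponent 126 -> actual exponent -1       */
    decode(-2.0f);   /* sign 1, stored exponent 128 -> actual exponent +1 */
    return 0;
}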

More on Biased Representation


The consequence of shift-127: the exponent field 00000000 (reserved for zero) can no longer be used to represent the smallest number. We forego something at the lower end of the spectrum of representable exponents (which would have been 2^-127). That said, it seems wise to give up the smallest exponent rather than give up the ability to represent 1 or zero!

More special cases


Zero is not the only "special case" float. There are also representations for positive and negative infinity, and for a not-a-number (NaN) value, used for results that do not make sense (for example, non-real numbers, or the result of an operation like infinity times zero). How do these work? A number is infinite if every bit of the exponent is 1 and the mantissa is all zeros (yes, we lose another exponent value), and it is NaN if every bit of the exponent is 1 and at least one mantissa bit is 1. The sign bit still distinguishes +inf from -inf and +NaN from -NaN. Here are a few sample floating point representations:
Exponent    Mantissa    Object
0           0           Zero
0           Nonzero     Denormalized number*
1-254       Anything    +/- FP number
255         0           +/- infinity
255         Nonzero     NaN (like 0/0 or 0 x infinity)

* Any non-zero number whose magnitude is smaller than the smallest normal number is a denormalized number. The production of a denormal is sometimes called gradual underflow, because it allows a calculation to lose precision slowly when the result is small.
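A minimal C sketch (again an illustration, not from the slides) that classifies a float using exactly the exponent/mantissa rules of the table above; the helper name classify is invented for the example:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Classify a single-precision float according to the table above. */
static const char *classify(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    unsigned exponent = (bits >> 23) & 0xFF;   /* 8-bit exponent field  */
    unsigned mantissa = bits & 0x7FFFFF;       /* 23-bit mantissa field */

    if (exponent == 0)   return mantissa ? "denormalized number" : "zero";
    if (exponent == 255) return mantissa ? "NaN" : "+/- infinity";
    return "+/- FP number";                    /* exponent in 1..254 */
}

int main(void) {
    float zero = 0.0f;
    printf("%s\n", classify(0.0f));          /* zero                */
    printf("%s\n", classify(1.0e-40f));      /* denormalized number */
    printf("%s\n", classify(3.14f));         /* +/- FP number       */
    printf("%s\n", classify(1.0f / zero));   /* +/- infinity        */
    printf("%s\n", classify(0.0f / zero));   /* NaN                 */
    return 0;
}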

Floating point operations in MIPS


There are 32 separate single precision FP registers in MIPS: $f0, $f1, $f2, ..., $f31. They can also be used as 16 double precision registers: $f0, $f2, $f4, ..., $f30 ($f0 means the pair $f0,$f1; $f2 means $f2,$f3; and so on). These reside in coprocessor 1 (CP1), in the same package as the CPU.

Operations supported:

  add.s  $f2, $f4, $f6     # f2 = f4 + f6 (single precision)
  add.d  $f2, $f4, $f6     # f2 = f4 + f6 (double precision)

  (Subtract, multiply, and divide have the same format: sub.s/sub.d, mul.s/mul.d, div.s/div.d.)

  lwc1   $f1, 100($s2)     # f1 = M[s2 + 100]   (32-bit load)
  mtc1   $t0, $f0          # f0 = t0 (move to coprocessor 1)
  mfc1   $t1, $f1          # t1 = f1 (move from coprocessor 1)

Sample program
Evaluation of a polynomial a.x^2 + b.x + c

  # $f0 --- x
  # $f2 --- sum of terms
  .....

  # Evaluate the quadratic
  l.s    $f2, a            # sum = a            (l.s is a pseudoinstruction)
  mul.s  $f2, $f2, $f0     # sum = ax
  l.s    $f4, b            # get b
  add.s  $f2, $f2, $f4     # sum = ax + b
  mul.s  $f2, $f2, $f0     # sum = (ax + b)x = ax^2 + bx
  l.s    $f4, c            # get c
  add.s  $f2, $f2, $f4     # sum = ax^2 + bx + c
  ......

         .data
  a:     .float 1.0
  b:     .float 1.0
  c:     .float 1.0
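Note that the code evaluates the polynomial in Horner form, (ax + b)x + c, so it needs only two multiplications and two additions and never computes x^2 explicitly.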

Floating Point Addition

Example using decimal


A = 9.999 x 10^1, B = 1.610 x 10^-1, A + B = ?

Step 1. Align the smaller exponent with the larger one.
        B = 0.0161 x 10^1 = 0.016 x 10^1 (round off)
Step 2. Add significands.
        9.999 + 0.016 = 10.015, so A + B = 10.015 x 10^1
Step 3. Normalize.
        A + B = 1.0015 x 10^2
Step 4. Round off.
        A + B = 1.002 x 10^2

Now, try to add 0.5 and 0.4375 in binary.
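As a sketch of these four steps (not from the original slides), the procedure can be coded in C on 4-digit decimal significands; the struct dfp representation and helper names below are invented for the illustration, and non-negative operands are assumed:

#include <stdio.h>

/* A decimal FP value with a 4-digit significand: (sig/1000) x 10^exp,
   e.g. sig = 9999, exp = 1 represents 9.999 x 10^1. */
struct dfp { long sig; int exp; };

/* Divide by 10^n, rounding to the nearest integer (round off). */
static long shift_right(long sig, int n) {
    long p = 1;
    for (int i = 0; i < n; i++) p *= 10;
    return (sig + p / 2) / p;
}

/* Add two (non-negative) decimal FP numbers following the four steps above. */
static struct dfp dfp_add(struct dfp a, struct dfp b) {
    /* Step 1: align the smaller exponent with the larger one. */
    if (a.exp < b.exp) { struct dfp t = a; a = b; b = t; }
    b.sig = shift_right(b.sig, a.exp - b.exp);    /* 1610 -> 16, i.e. 0.016 x 10^1 */

    /* Step 2: add significands. */
    struct dfp r = { a.sig + b.sig, a.exp };      /* 9999 + 16 = 10015 */

    /* Step 3: normalize so one digit precedes the point
       (the right shift here also performs Step 4's round off). */
    while (r.sig >= 10000) { r.sig = shift_right(r.sig, 1); r.exp += 1; }
    while (r.sig != 0 && r.sig < 1000) { r.sig *= 10; r.exp -= 1; }

    return r;
}

int main(void) {
    struct dfp a = { 9999, 1 };    /* 9.999 x 10^1  */
    struct dfp b = { 1610, -1 };   /* 1.610 x 10^-1 */
    struct dfp s = dfp_add(a, b);
    printf("%ld.%03ld x 10^%d\n", s.sig / 1000, s.sig % 1000, s.exp);  /* 1.002 x 10^2 */
    return 0;
}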

Floating Point Multiplication

Example using decimal


A = 1.110 x 10^10, B = 9.200 x 10^-5, A x B = ?

Step 1. Exponent of A x B = 10 + (-5) = 5
Step 2. Multiply significands.
        1.110 x 9.200 = 10.212000
Step 3. Normalize the product.
        10.212 x 10^5 = 1.0212 x 10^6
Step 4. Round off.
        A x B = 1.021 x 10^6
Step 5. Decide the sign of A x B (+ x + = +).

So, A x B = +1.021 x 10^6
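A matching C sketch of the five multiplication steps (again invented for illustration, assuming positive operands), using the same hypothetical 4-digit decimal representation as the addition sketch:

#include <stdio.h>

/* (sig/1000) x 10^exp, so sig = 1110, exp = 10 means 1.110 x 10^10. */
struct dfp { long sig; int exp; };

/* Divide by 10^n, rounding to the nearest integer (round off). */
static long shift_right(long sig, int n) {
    long p = 1;
    for (int i = 0; i < n; i++) p *= 10;
    return (sig + p / 2) / p;
}

/* Multiply two (positive) decimal FP numbers following the five steps above. */
static struct dfp dfp_mul(struct dfp a, struct dfp b) {
    struct dfp r;

    /* Step 1: add the exponents. */
    r.exp = a.exp + b.exp;                     /* 10 + (-5) = 5 */

    /* Step 2: multiply significands; the product is in units of 1e-6. */
    long prod = a.sig * b.sig;                 /* 1110 * 9200 = 10212000, i.e. 10.212000 */

    /* Steps 3 and 4: normalize to one digit before the point, rounding back to 4 digits. */
    r.sig = shift_right(prod, 3);              /* back to units of 1e-3: 10212 */
    while (r.sig >= 10000) {                   /* 10.212 x 10^5 -> 1.021 x 10^6 */
        r.sig = shift_right(r.sig, 1);
        r.exp += 1;
    }

    /* Step 5: sign of the product (+ x + = +; sign handling omitted for brevity). */
    return r;
}

int main(void) {
    struct dfp a = { 1110, 10 };   /* 1.110 x 10^10 */
    struct dfp b = { 9200, -5 };   /* 9.200 x 10^-5 */
    struct dfp p = dfp_mul(a, b);
    printf("%ld.%03ld x 10^%d\n", p.sig / 1000, p.sig % 1000, p.exp);  /* 1.021 x 10^6 */
    return 0;
}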
