L2 DataTypeFloat

The document discusses floating point numbers, their representation in computers, and the IEEE-754 standard for floating point representation. It highlights the limitations of representing certain numbers, the structure of floating point numbers, and provides examples of converting between decimal and floating point formats. Additionally, it covers special cases such as zero, infinity, and NaN (Not a Number).

Uploaded by

riteshthedeveloper

Floating Point Numbers

Review of Numbers
• Computers are made to deal with numbers
• What can we represent in N bits?
• Unsigned integers:
0 to 2^N - 1
• Signed Integers (Two’s Complement)
-2^(N-1) to 2^(N-1) - 1

• Signed Integers (sign-magnitude or one's complement)
-(2^(N-1) - 1) to 2^(N-1) - 1
Other Numbers
• What about other numbers?
• Very large numbers? (seconds/century)
3,155,760,000_10 (3.15576_10 x 10^9)
• Very small numbers? (atomic diameter)
0.00000001_10 (1.0_10 x 10^-8)
• Rationals (repeating pattern)
• 2/3 (0.666666666. . .)
• Irrationals
2^(1/2) (1.414213562373. . .)
• Transcendentals
• e (2.718...), π (3.141...)
• All represented in scientific notation
Fractional Binary Numbers
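b_i b_(i-1) ••• b_2 b_1 b_0 . b_-1 b_-2 b_-3 ••• b_-j

• Bit weights left of the binary point: 2^i, 2^(i-1), ..., 4, 2, 1
• Bit weights right of the binary point: 1/2, 1/4, 1/8, ..., 2^-j
• Representation
• Bits to right of “binary point” represent fractional powers of 2
• Represents rational number: sum over k from -j to i of b_k x 2^k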
Fractional Binary Numbers: Examples

• Value Representation
5 3/4 = 23/4    101.11_2 = 4 + 1 + 1/2 + 1/4
2 7/8 = 23/8    010.111_2 = 2 + 1/2 + 1/4 + 1/8
1 7/16 = 23/16  001.0111_2 = 1 + 1/4 + 1/8 + 1/16

• Observations
• Divide by 2 by shifting right (unsigned)
• Multiply by 2 by shifting left
• Numbers of form 0.111111…_2 are just below 1.0
• 1/2 + 1/4 + 1/8 + … + 1/2^i + … ➙ 1.0
• Use notation 1.0 – ε
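The examples above can be checked mechanically. The following is a minimal sketch, not part of the original slides; the helper name binary_to_fraction is hypothetical and it uses exact rational arithmetic so no rounding is involved.

```python
from fractions import Fraction

def binary_to_fraction(bits: str) -> Fraction:
    """Evaluate an unsigned binary string such as '101.11' exactly."""
    whole, _, frac = bits.partition(".")
    value = Fraction(int(whole, 2)) if whole else Fraction(0)
    # Each bit to the right of the binary point weighs 1/2, 1/4, 1/8, ...
    for position, bit in enumerate(frac, start=1):
        if bit == "1":
            value += Fraction(1, 2 ** position)
    return value

print(binary_to_fraction("101.11"))    # 23/4
print(binary_to_fraction("010.111"))   # 23/8
print(binary_to_fraction("001.0111"))  # 23/16
```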
Representable Numbers

• Limitation #1
• Can only exactly represent numbers of the form x/2^k
• Other rational numbers have repeating bit representations

• Value Representation
• 1/3   0.0101010101[01]…_2
• 1/5   0.001100110011[0011]…_2
• 1/10  0.0001100110011[0011]…_2

• Limitation #2
• Just one setting of binary point within the w bits
• Limited range of numbers (very small values? very large?)
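Limitation #1 can be illustrated with a short sketch (an illustration added here, not from the slides). The helper is_exact_in_binary is a hypothetical name; the second half assumes Python floats are IEEE 754 doubles, which holds on common platforms.

```python
from fractions import Fraction

def is_exact_in_binary(x: Fraction) -> bool:
    """True when the reduced denominator is a power of two (the form x/2^k)."""
    d = x.denominator
    return d & (d - 1) == 0

for value in (Fraction(23, 4), Fraction(1, 3), Fraction(1, 5), Fraction(1, 10)):
    print(value, is_exact_in_binary(value))

# The literal 0.1 is stored as the nearest value of the form x/2^k, not as 1/10.
stored = Fraction(0.1)
print(stored == Fraction(1, 10))    # False
print(is_exact_in_binary(stored))   # True
```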
Objective

• To understand the fundamentals of floating-point representation
• To know the IEEE-754 Floating Point Standard
Patriot Missile
• Gulf War I
• Failed to intercept incoming Iraqi Scud missile (Feb 25, 1991)
• 28 American soldiers killed

GAO Report: GAO/IMTEC-92-26, Patriot Missile Software Problem
http://www.fas.org/spp/starwars/gao/im92026.htm
Patriot Design
• Intended to operate only for a few hours
• Defend Europe from Soviet aircraft and missiles
• Four 24-bit registers (1970s design!)
• Kept time with integer counter: incremented every 1/10 second
• Calculate speed of incoming missile to predict future positions:
velocity = (loc1 – loc0) / ((count1 – count0) * 0.1)
• But, cannot represent 0.1 exactly!
Floating Imprecision
• 24 bits:
0.1 ≈ 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + 1/2^16 + 1/2^17 + 1/2^20 + 1/2^21
= 209715 / 2097152
• Error per count is 0.2/2097152 = 1/10485760
• One hour = 3600 seconds = 36,000 counts
3600 * 10 * 1/10485760 = 0.0034 s
• 20 hours: 0.0687 s
• Miss target! (137 meters)
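The arithmetic on this slide can be reproduced exactly. This is a sketch added for illustration; the list of exponents is copied from the slide, and the variable names are assumptions rather than anything from the Patriot software.

```python
from fractions import Fraction

# Powers of two kept by the 24-bit register, as listed on the slide.
exponents = [4, 5, 8, 9, 12, 13, 16, 17, 20, 21]
approx = sum(Fraction(1, 2 ** k) for k in exponents)
error_per_tick = Fraction(1, 10) - approx

ticks_per_hour = 3600 * 10            # the counter increments every 1/10 second
drift_1h = float(error_per_tick * ticks_per_hour)
drift_20h = 20 * drift_1h

print(approx)                 # 209715/2097152
print(error_per_tick)         # 1/10485760
print(round(drift_1h, 4))     # 0.0034
print(round(drift_20h, 4))    # 0.0687
```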
Two weeks before the incident, Army officials received Israeli data
indicating some loss in accuracy after the system had been running
for 8 consecutive hours. Consequently, Army officials modified the
software to improve the system's accuracy. However, the modified
software did not reach Dhahran until February 26, 1991--the day
after the Scud incident.
GAO Report

http://fas.org/spp/starwars/gao/im92026.htm
Floating Point Representation
• Numerical Form: (–1)^s x M x 2^E
• Example: 15213_10 = (–1)^0 x 1.1101101101101_2 x 2^13
• Sign bit s determines whether number is negative or positive
• Significand M normally a fractional value in range [1.0,2.0).
• Exponent E weights value by power of two

• Encoding
• MSB s is sign bit s
• exp field encodes E (but is not equal to E)
• frac field encodes M (but is not equal to M)

s exp frac
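As a rough illustration of the numerical form (–1)^s x M x 2^E, the sketch below recovers s, M, and E from a Python float. The function name decompose is hypothetical, and the sketch assumes a nonzero finite input; math.frexp returns a significand in [0.5, 1), so it is rescaled to [1.0, 2.0).

```python
import math

def decompose(x: float):
    """Return (s, M, E) with x == (-1)**s * M * 2**E and 1.0 <= M < 2.0."""
    s = 1 if math.copysign(1.0, x) < 0 else 0
    m, e = math.frexp(abs(x))      # abs(x) == m * 2**e with 0.5 <= m < 1
    return s, m * 2, e - 1

s, M, E = decompose(15213.0)
print(s, M, E)                     # 0 1.8570556640625 13
```

Here M = 1.8570556640625 is the decimal value of 1.1101101101101_2.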
Exponential Notation
• The following are equivalent representations of 1,234

123,400.0 x 10^-2
12,340.0 x 10^-1
1,234.0 x 10^0
123.4 x 10^1
12.34 x 10^2
1.234 x 10^3
0.1234 x 10^4

• The representations differ in that the decimal place (the “point”) “floats” to the left or right, with the appropriate adjustment in the exponent.
Parts of a Floating Point Number

-0.9876 x 10^-3

• Sign of mantissa: -
• Location of decimal point: between 0 and 9876
• Mantissa: 9876
• Base: 10
• Sign of exponent: -
• Exponent: 3
IEEE 754 Standard
• Most common standard for representing floating
point numbers
• Single precision: 32 bits, consisting of...
• Sign bit (1 bit)
• Exponent (8 bits)
• Mantissa (23 bits)
• Double precision: 64 bits, consisting of…
• Sign bit (1 bit)
• Exponent (11 bits)
• Mantissa (52 bits)

Prof. William Kahan

Single Precision Format (32 bits)

Sign of mantissa (1 bit) | Exponent (8 bits) | Mantissa (23 bits)

Normalization
• The mantissa is normalized
• Has an implied binary point on left
• Has an implied “1” on left of the binary point
• E.g.,
• Mantissa → 10100000000000000000000
• Represents… 1.101_2 = 1.625_10
• Normalized form: no leading 0s (exactly one nonzero digit to left of the point)
• Normalized: 1.0 x 10^-9
• Not normalized: 0.1 x 10^-8, 10.0 x 10^-10
Excess Notation
• To include +ve and –ve exponents, “excess”
notation is used
• Single precision: excess 127
• Double precision: excess 1023
• The value of the exponent stored is larger than the
actual exponent
• E.g., excess 127,
• Exponent → 10000111
• Represents… 135 – 127 = 8
Example
• Single precision
0 10000010 11000000000000000000000

• Sign: 0 = positive mantissa
• Exponent: 10000010_2 = 130, and 130 – 127 = 3
• Mantissa: 1.11_2 (with the implied leading 1)
• Value: +1.11_2 x 2^3 = 1110.0_2 = 14.0_10
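The decoding in this example can be sketched in a few lines. This is an illustrative helper (decode_single is a hypothetical name) that handles normalized values only; it removes the excess-127 bias and restores the implied leading 1.

```python
def decode_single(bits: str) -> float:
    """Decode a normalized single-precision pattern written as 's exp frac'."""
    s, exp, frac = bits.split()
    sign = -1.0 if s == "1" else 1.0
    exponent = int(exp, 2) - 127                    # remove the excess-127 bias
    mantissa = 1 + int(frac, 2) / 2 ** len(frac)    # restore the implied leading 1
    return sign * mantissa * 2.0 ** exponent

print(decode_single("0 10000010 11000000000000000000000"))  # 14.0
```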

Hexadecimal
• It is convenient and common to represent
the original floating point number in
hexadecimal
• The preceding example…

0 10000010 11000000000000000000000 =
0100 0001 0110 0000 0000 0000 0000 0000_2 =
41600000_16
Converting from Floating Point

• E.g., What decimal value is represented by the following 32-bit floating point number?

C17B0000_16
• Step 1
• Express in binary and find S, E, and M

C17B0000_16 =

1 10000010 11110110000000000000000_2
S E M

• S = 1, so the value is negative (1 = negative, 0 = positive)
• Step 2
• Find “real” exponent, n
• n = E – 127
= 10000010_2 – 127
= 130 – 127
= 3
• Step 3
• Put S, M, and n together to form binary result
• (Don’t forget the implied “1.” on the left of the mantissa.)

-1.1111011_2 x 2^n =
-1.1111011_2 x 2^3 =
-1111.1011_2
• Step 4
• Express result in decimal
-1111.1011_2
Integer part: 1111_2 = 15
Fraction part: 2^-1 + 2^-3 + 2^-4 = 0.5 + 0.125 + 0.0625 = 0.6875

Answer: -15.6875
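A hand conversion like the one above can be cross-checked with Python's standard struct module, which reinterprets raw bytes as an IEEE 754 single. This is a sketch for checking answers, not part of the slides; hex_to_float is a hypothetical helper name.

```python
import struct

def hex_to_float(hex_word: str) -> float:
    """Interpret 8 hex digits as a big-endian IEEE 754 single-precision value."""
    return struct.unpack(">f", bytes.fromhex(hex_word))[0]

print(hex_to_float("C17B0000"))   # -15.6875
print(hex_to_float("41600000"))   # 14.0
```

The same helper can be used to check the exercise that follows.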
Converting from Floating Point

• E.g., What decimal value is represented by the following 32-bit floating point number?
42808000_16
Converting to Floating Point

• E.g., Express 36.5625_10 as a 32-bit floating point number (in hexadecimal)
• Step 1
• Express original value in binary

36.5625_10 =

100100.1001_2
• Step 2
• Normalize

100100.1001_2 =

1.001001001_2 x 2^5
• Step 3
• Determine S, E, and M

+1.001001001_2 x 2^5

S = 0 (because the value is positive)
M = 001001001 (the bits after the binary point)
n = 5
E = n + 127
= 5 + 127
= 132
= 10000100_2

• Step 4
• Put S, E, and M together to form 32-bit binary result
0 10000100 00100100100000000000000_2
S E M
• Step 5
• Express in hexadecimal

0 10000100 00100100100000000000000_2 =

0100 0010 0001 0010 0100 0000 0000 0000_2 =

42124000_16

Answer: 42124000_16
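The five steps above can be sketched as code. This is a minimal illustration under stated assumptions: encode_single is a hypothetical name, the input must be nonzero and within the normalized single-precision range, and the fraction is rounded to 23 bits. The standard struct module is used only as a cross-check.

```python
import math
import struct

def encode_single(x: float) -> str:
    """Encode a nonzero value in the normalized range as 8 hex digits."""
    s = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))             # abs(x) == m * 2**e with 0.5 <= m < 1
    n = e - 1                             # exponent of the normalized 1.xxx form
    E = n + 127                           # excess-127 encoding
    frac = round((m * 2 - 1) * 2 ** 23)   # 23 fraction bits, implied 1 dropped
    # Addition (not OR) lets a rounding carry propagate into the exponent field.
    word = (s << 31) + (E << 23) + frac
    return f"{word:08X}"

print(encode_single(36.5625))                       # 42124000
print(struct.pack(">f", 36.5625).hex().upper())     # 42124000
```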
Converting to Floating Point

• E.g., Express 6.5_10 as a 32-bit floating point number (in hexadecimal)
Converting to Floating Point

• E.g., Express 0.1_10 as a 32-bit floating point number (in hexadecimal)
Zero, Infinity, and NaN
• Zero
– Exponent field E = 0 and fraction F = 0
– +0 and –0 are possible according to sign bit S
• Infinity
– Infinity is a special value represented with maximum E and F = 0
• For single precision with 8-bit exponent: maximum E = 255
• For double precision with 11-bit exponent: maximum E = 2047
– Infinity can result from overflow or division by zero
– +∞ and –∞ are possible according to sign bit S
• NaN (Not a Number)
– NaN is a special value represented with maximum E and F ≠ 0
– Result from exceptional situations, such as 0/0 or sqrt(negative)
– An arithmetic operation on a NaN generally results in NaN: Op(X, NaN) = NaN
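The encodings of zero, infinity, and NaN can be inspected directly. The sketch below is an added illustration; bits_of is a hypothetical helper. Note that Python raises ZeroDivisionError for 1.0/0.0 instead of returning infinity, so math.inf is used to obtain the special values.

```python
import math
import struct

def bits_of(x: float) -> str:
    """Single-precision bit pattern formatted as 's exp frac'."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    b = f"{word:032b}"
    return f"{b[0]} {b[1:9]} {b[9:]}"

print(bits_of(0.0))          # 0 00000000 00000000000000000000000
print(bits_of(-0.0))         # 1 00000000 00000000000000000000000
print(bits_of(math.inf))     # 0 11111111 00000000000000000000000
print(bits_of(-math.inf))    # 1 11111111 00000000000000000000000

nan = math.inf - math.inf    # an exceptional operation yields NaN
print(math.isnan(nan), math.isnan(nan + 1.0), nan == nan)   # True True False
```

The last line shows NaN propagating through an addition, and that a NaN does not compare equal even to itself.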
Simple 6-bit Floating Point Example
S (1 bit) | Exponent (3 bits) | Fraction (2 bits)

• 6-bit floating point representation

– Sign bit is the most significant bit
– Next 3 bits are the exponent with a bias of 3
– Last 2 bits are the fraction
• Same general form as IEEE
– Normalized, denormalized
– Representation of 0, infinity and NaN
• Value of normalized numbers: (–1)^S × (1.F)_2 × 2^(E – 3)
• Value of denormalized numbers: (–1)^S × (0.F)_2 × 2^(–2)
Values Related to Exponent
Exp.  exp  E    2^E  Type
0     000  -2   1/4  Denormalized
1     001  -2   1/4  Normalized
2     010  -1   1/2  Normalized
3     011  0    1    Normalized
4     100  1    2    Normalized
5     101  2    4    Normalized
6     110  3    8    Normalized
7     111  n/a  n/a  Inf or NaN
Dynamic Range of Values
s exp frac E value
0 000 00 -2 0
0 000 01 -2 1/4*1/4=1/16 smallest denormalized
0 000 10 -2 2/4*1/4=2/16
0 000 11 -2 3/4*1/4=3/16 largest denormalized
0 001 00 -2 4/4*1/4=4/16=1/4=0.25 smallest normalized
0 001 01 -2 5/4*1/4=5/16
0 001 10 -2 6/4*1/4=6/16
0 001 11 -2 7/4*1/4=7/16
0 010 00 -1 4/4*2/4=8/16=1/2=0.5
0 010 01 -1 5/4*2/4=10/16
0 010 10 -1 6/4*2/4=12/16=0.75
0 010 11 -1 7/4*2/4=14/16
Dynamic Range of Values
s exp frac E value
0 011 00 0 4/4*4/4=16/16=1
0 011 01 0 5/4*4/4=20/16=1.25
0 011 10 0 6/4*4/4=24/16=1.5
0 011 11 0 7/4*4/4=28/16=1.75
0 100 00 1 4/4*8/4=32/16=2
0 100 01 1 5/4*8/4=40/16=2.5
0 100 10 1 6/4*8/4=48/16=3
0 100 11 1 7/4*8/4=56/16=3.5
0 101 00 2 4/4*16/4=64/16=4
0 101 01 2 5/4*16/4=80/16=5
0 101 10 2 6/4*16/4=96/16=6
0 101 11 2 7/4*16/4=112/16=7
Dynamic Range of Values
s exp frac E value
0 110 00 3 4/4*32/4=128/16=8
0 110 01 3 5/4*32/4=160/16=10
0 110 10 3 6/4*32/4=192/16=12
0 110 11 3 7/4*32/4=224/16=14 largest normalized
0 111 00 +∞ (infinity)
0 111 01 NaN
0 111 10 NaN
0 111 11 NaN
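The three tables above can be regenerated from the two value formulas. This is a sketch of the 6-bit teaching format only (decode6 is a hypothetical name), with 1 sign bit, 3 exponent bits with a bias of 3, and 2 fraction bits.

```python
def decode6(bits: int) -> float:
    """Decode a 6-bit pattern: 1 sign bit, 3 exponent bits (bias 3), 2 fraction bits."""
    s = (bits >> 5) & 1
    exp = (bits >> 2) & 0b111
    frac = bits & 0b11
    sign = -1.0 if s else 1.0
    if exp == 0b111:                        # all ones: infinity or NaN
        return sign * float("inf") if frac == 0 else float("nan")
    if exp == 0:                            # denormalized: (0.F)_2 x 2^-2
        return sign * (frac / 4) * 2.0 ** -2
    return sign * (1 + frac / 4) * 2.0 ** (exp - 3)   # normalized: (1.F)_2 x 2^(exp-3)

for pattern in range(0b011100):             # all non-negative finite patterns
    print(f"{pattern:06b}", decode6(pattern))
```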
Floating Point Addition Example
• Consider adding: (1.111)_2 × 2^–1 + (1.011)_2 × 2^–3
– For simplicity, we assume 4 bits of precision (or 3 bits of fraction)
• Cannot add significands … Why?
– Because exponents are not equal
• How to make exponents equal?
– Shift the significand of the lesser exponent right until its exponent matches the larger number
• (1.011)_2 × 2^–3 = (0.1011)_2 × 2^–2 = (0.01011)_2 × 2^–1
– Difference between the two exponents = –1 – (–3) = 2
– So, shift right by 2 bits
• Now, add the significands:

   1.111
+  0.01011
  --------
  10.00111   (carry out of the leading bit)
Addition Example
• So, (1.111)_2 × 2^–1 + (1.011)_2 × 2^–3 = (10.00111)_2 × 2^–1
• However, result (10.00111)_2 × 2^–1 is NOT normalized
• Normalize result: (10.00111)_2 × 2^–1 = (1.000111)_2 × 2^0
– In this example, we have a carry
– So, shift right by 1 bit and increment the exponent
• Round the significand to fit in appropriate number of bits
– We assumed 4 bits of precision or 3 bits of fraction
• Round to nearest: (1.000111)_2 ≈ (1.001)_2

  1.000 111
+     1
  -----
  1.001

– Renormalize if rounding generates a carry
• Detect overflow / underflow
– If exponent becomes too large (overflow) or too small (underflow)
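The add, normalize, and round sequence can be sketched with exact rationals. This is a simplified model, not how hardware works: it adds the exact values instead of shifting bit strings with guard bits, and the names normalize and fp_add are hypothetical. It assumes positive operands and ignores overflow and underflow.

```python
from fractions import Fraction

def normalize(x: Fraction):
    """Return (M, E) with x == M * 2**E and 1 <= M < 2, for x > 0."""
    e = 0
    while x >= 2:
        x, e = x / 2, e + 1
    while x < 1:
        x, e = x * 2, e - 1
    return x, e

def fp_add(a: Fraction, b: Fraction, frac_bits: int = 3):
    """Add exactly, normalize, then round the significand to frac_bits."""
    m, e = normalize(a + b)
    scale = 2 ** frac_bits
    m = Fraction(round(m * scale), scale)   # round to nearest (ties to even)
    if m >= 2:                              # rounding produced a carry
        m, e = m / 2, e + 1
    return m, e

a = Fraction(0b1111, 8) * Fraction(1, 2)    # (1.111)_2 x 2^-1
b = Fraction(0b1011, 8) * Fraction(1, 8)    # (1.011)_2 x 2^-3
print(fp_add(a, b))                         # (Fraction(9, 8), 0), i.e. (1.001)_2 x 2^0
```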
Summary: IEEE Floating Point
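Single Precision (32 bits)

Sign (1 bit, bit 31) | Exponent (8 bits, bits 30-23) | Fraction (23 bits, bits 22-0)

Exponent values:
0: zeroes (and denormalized values)
1-254: stored value = actual exponent + 127
255: infinities, NaN

Value (normalized) = (1 – 2*Sign) × (1 + Fraction) × 2^(Exponent – 127)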

Denormalized Values
• Condition
• exp = 000…0
• Value
• Exponent value E = –Bias + 1
• Significand value M = 0.xxx…x_2
• xxx…x: bits of frac
• Cases
• exp = 000…0, frac = 000…0
• Represents value 0
• Note that there are distinct values +0 and –0
• exp = 000…0, frac ≠ 000…0
• Numbers very close to 0.0
Special Values
• Condition
• exp = 111…1
• Cases
• exp = 111…1, frac = 000…0
• Represents value ∞ (infinity)
• Operation that overflows
• Both positive and negative
• E.g., 1.0/0.0 = –1.0/–0.0 = +∞, 1.0/–0.0 = –∞
• exp = 111…1, frac ≠ 000…0
• Not-a-Number (NaN)
• Represents case when no numeric value can be determined
• E.g., sqrt(–1), ∞ – ∞, ∞ × 0
Interesting Numbers
Description                exp    frac   Numeric Value
• Zero                     00…00  00…00  0.0
• Smallest Pos. Denorm.    00…00  00…01  2^–{23,52} × 2^–{126,1022}
  • Single ≈ 1.4 × 10^–45
  • Double ≈ 4.9 × 10^–324
• Largest Denormalized     00…00  11…11  (1.0 – ε) × 2^–{126,1022}
  • Single ≈ 1.18 × 10^–38
  • Double ≈ 2.2 × 10^–308
• Smallest Pos. Normalized 00…01  00…00  1.0 × 2^–{126,1022}
  • Just larger than largest denormalized
• One                      01…11  00…00  1.0
• Largest Normalized       11…10  11…11  (2.0 – ε) × 2^{127,1023}
  • Single ≈ 3.4 × 10^38
  • Double ≈ 1.8 × 10^308
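The single-precision rows of this table can be reproduced by assembling the exp and frac fields directly. This is a sketch for exploration; from_fields is a hypothetical helper and it covers positive values only.

```python
import struct

def from_fields(exp: int, frac: int) -> float:
    """Build a positive single-precision value from its exp and frac fields."""
    word = (exp << 23) | frac
    return struct.unpack(">f", struct.pack(">I", word))[0]

ALL_ONES = 2 ** 23 - 1                # frac = 11…11

print(from_fields(0, 1))              # smallest positive denormalized, 2**-149
print(from_fields(0, ALL_ONES))       # largest denormalized
print(from_fields(1, 0))              # smallest positive normalized, 2**-126
print(from_fields(127, 0))            # one
print(from_fields(254, ALL_ONES))     # largest normalized
```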
Visualization: Floating Point Encodings

NaN | −∞ | −Normalized | −Denorm | −0 | +0 | +Denorm | +Normalized | +∞ | NaN
Tiny Floating Point Example

s (1 bit) | exp (4 bits) | frac (3 bits)

• 8-bit Floating Point Representation

• the sign bit is in the most significant bit
• the next four bits are the exp, with a bias of 7
• the last three bits are the frac

• Same general form as IEEE Format

• normalized, denormalized
• representation of 0, NaN, infinity
Dynamic Range (s=0 only)

v = (–1)^s × M × 2^E
norm: E = exp – Bias
denorm: E = 1 – Bias

s exp frac E Value

Denormalized numbers
0 0000 000 -6 0
0 0000 001 -6 1/8*1/64 = 1/512 (closest to zero)
0 0000 010 -6 2/8*1/64 = 2/512, i.e. (–1)^0 × (0 + 1/4) × 2^–6
…
0 0000 110 -6 6/8*1/64 = 6/512
0 0000 111 -6 7/8*1/64 = 7/512 (largest denorm)

Normalized numbers
0 0001 000 -6 8/8*1/64 = 8/512 (smallest norm)
0 0001 001 -6 9/8*1/64 = 9/512, i.e. (–1)^0 × (1 + 1/8) × 2^–6
…
0 0110 110 -1 14/8*1/2 = 14/16
0 0110 111 -1 15/8*1/2 = 15/16 (closest to 1 below)
0 0111 000 0 8/8*1 = 1
0 0111 001 0 9/8*1 = 9/8 (closest to 1 above)
0 0111 010 0 10/8*1 = 10/8
…
0 1110 110 7 14/8*128 = 224
0 1110 111 7 15/8*128 = 240 (largest norm)
0 1111 000 n/a inf
Distribution of Values
• 6-bit IEEE-like format
• e = 3 exponent bits
• f = 2 fraction bits
• Layout: s (1 bit) | exp (3 bits) | frac (2 bits)
• Bias is 2^(3-1) - 1 = 3
• Notice how the distribution gets denser toward zero.
[Figure: representable values on a number line from -15 to 15; legend: Denormalized, Normalized, Infinity]
Floats are not Reals

Ints:
Largest 32-bit signed value: 2^31 − 1 = 2,147,483,647
e.g. 40000 * 40000 --> 1600000000
600000 * 600000 --> ?
Floats:
e.g. Is (x + y) + z = x + (y + z)?
(1e20 + -1e20) + 3.14 --> 3.14
1e20 + (-1e20 + 3.14) --> ??
Need to understand details of underlying implementations
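Both questions on this slide can be explored with a short sketch. Python integers do not overflow, so wrap_int32 (a hypothetical helper) simulates 32-bit two's-complement wraparound; this is an assumption about typical hardware behavior, since signed overflow in C is formally undefined. The float part assumes IEEE 754 doubles.

```python
def wrap_int32(value: int) -> int:
    """Reduce an integer to the 32-bit two's-complement range."""
    value &= 0xFFFFFFFF
    return value - 2 ** 32 if value >= 2 ** 31 else value

print(wrap_int32(40000 * 40000))      # 1600000000, fits below 2**31 - 1
print(wrap_int32(600000 * 600000))    # does not fit, so the result wraps

x, y, z = 1e20, -1e20, 3.14
print((x + y) + z)                    # 3.14
print(x + (y + z))                    # 3.14 is absorbed when added to -1e20 first
```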
IEEE 754

IEEE 754 Binary16 (F16) Format

Component Bits
Sign bit 1
Exponent 5
Fraction 10
Total 16 bits (2 bytes)

Overview of IEEE 754 Binary32

Field                Bits  Description
Sign                 1     0 = positive, 1 = negative
Exponent             8     Encodes exponent with bias
Fraction (Mantissa)  23    Precision bits (fractional part)
IEEE 754

IEEE 754 Binary128 (F128) Format

Field                Bits  Description
Sign                 1     0 = positive, 1 = negative
Exponent             15    Encodes exponent using a bias of 16383
Fraction (Mantissa)  112   Fractional part of the significand

IEEE 754 Binary64

Field                Bits  Description
Sign                 1     0 = positive, 1 = negative
Exponent             11    Encodes exponent with bias
Fraction (Mantissa)  52    Precision bits (fractional part)
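The biases quoted in these slides (3, 7, 127, 1023, 16383) all follow the pattern 2^(k-1) - 1 for a k-bit exponent field. The sketch below tabulates the four formats; the dictionary FORMATS and the function bias are hypothetical names, and the field widths are taken from the tables in this section.

```python
# (exponent bits, fraction bits) for each format listed in this section
FORMATS = {
    "binary16": (5, 10),
    "binary32": (8, 23),
    "binary64": (11, 52),
    "binary128": (15, 112),
}

def bias(exponent_bits: int) -> int:
    """Excess (bias) for a k-bit exponent field: 2^(k-1) - 1."""
    return 2 ** (exponent_bits - 1) - 1

for name, (exp_bits, frac_bits) in FORMATS.items():
    total = 1 + exp_bits + frac_bits          # sign + exponent + fraction
    print(name, total, bias(exp_bits))
```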
