Floating Point Arithmetic: Numbers
•     Topics
                            Numbers in general
                            IEEE Floating Point Standard
                            Rounding
                            Floating Point Operations
                            Mathematical properties
                            Puzzles test basic understanding
                            Bizarre FP factoids
                                     Numbers
• Many types
    Integers
        » decimal in a binary world is an interesting subclass
            • consider rounding issues for the financial community
    Fixed point
       » still integers but with an assumed decimal point
       » advantage – still get to use the integer circuits you know
    Weird numbers
       » interval – represented by a pair of values
       » log – multiply and divide become easy
            • computing the log or retrieving it from a table is messy though
       » mixed radix – yy:dd:hh:mm:ss
       » many forms of redundant systems – we’ve seen some examples
    Floating point
       » inherently has 3 components: sign, exponent, significand
       » lots of representation choices can and have been made
• Only two in common use to date: ints and floats
     you know ints – hence time to look at floats
                          Floating Point
• Macho scientific applications require it
    require lots of range
    compromise: exact is less important than close enough
       » enter numeric stability problems
• In the beginning
    each manufacturer had a different representation
       » code had no portability
    chaos in macho numeric stability land
• Now we have standards
    life is better but there are still some portability issues
        » primarily caused by details “left to the discretion of the
           implementer”
        » fortunately these differences are not commonly encountered
• Positional notation (figure)
     bi bi–1 … b2 b1 b0 . b–1 b–2 b–3 … b–j
     with bit weights 2^i … 4 2 1 . 1/2 1/4 1/8 … 2^–j
• Representation
     Bits to right of “binary point” represent fractional powers of 2
     Represents the rational number:  Σ (k = –j to i) bk · 2^k
                Fractional Binary Numbers
• Value              Representation
  5 3/4              101.11₂
  2 7/8              10.111₂
  63/64              0.111111₂
• Observations
    Divide by 2 by shifting right
    Multiply by 2 by shifting left
    Numbers of form 0.111111…₂ are just below 1.0
      » 1/2 + 1/4 + 1/8 + … + 1/2^i + … → 1.0
      » We’ll use the notation 1.0 – ε
                 Representable Numbers
• Limitation
    Can only exactly represent numbers of the form x/2^k
   Other numbers have repeating bit representations
• Value              Representation
  1/3                0.0101010101[01]…₂
  1/5                0.001100110011[0011]…₂
  1/10               0.0001100110011[0011]…₂
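A quick C illustration of the 1/10 case (a minimal sketch, assuming IEEE-754 doubles): because 0.1 has a repeating binary expansion, every addition rounds, and ten of them do not sum to exactly 1.0.

    #include <stdio.h>

    int main(void) {
        double sum = 0.0;
        for (int i = 0; i < 10; i++)
            sum += 0.1;                      /* each add rounds the repeating fraction */
        printf("sum        = %.17g\n", sum); /* 0.99999999999999989 on IEEE doubles */
        printf("sum == 1.0 ? %d\n", sum == 1.0);
        return 0;
    }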
              Early Differences: Format
• IBM 360 and 370
     format: S | exponent E (7 bits) | fraction/mantissa F (24 bits)
     (figure also contrasts a second format layout and marks its hidden bit)
             Floating Point Representation
• Numerical Form
    (–1)^s × M × 2^E
        » Sign bit s determines whether the number is negative or positive
        » Significand M is normally a fractional value in some range, e.g.
          [1.0, 2.0) or [0.0, 1.0)
          • built from the mantissa (frac) field plus a default normalized representation
        » Exponent E weights the value by a power of two
• Encoding
         s           exp                                   frac
                      IEEE Floating Point
• IEEE Standard 754-1985 (also IEC 559)
    Established in 1985 as uniform standard for floating point arithmetic
      » Before that, many idiosyncratic formats
    Supported by all major CPUs
• Driven by numerical concerns
    Nice standards for rounding, overflow, underflow
     Hard to make it go fast
      » Numerical analysts predominated over hardware types in
        defining standard
           • you want the wrong answer – we can do that real fast!!
       » Standards are inherently a “design by committee”
           • including a substantial “our company wants to do it our way” factor
       » If you ever get a chance to be on a standards committee
           • call in sick in advance
                Floating Point Precisions
• Encoding
        s            exp                                   frac
• IEEE 754 32-bit form (single) – similar to DEC, very much Intel
     S | exponent E (8 bits) | fraction/mantissa F (23 bits)
• 64-bit form (double)
     S | exponent E (11 bits) | fraction/mantissa F (52 bits)
• “Normalized”
     Value = (–1)^S × 1.F × 2^(E–1023)   (double; the single-precision bias is 127)
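As a concrete check of the field layout, here is a minimal C sketch (assuming a 32-bit IEEE-754 float, which mainstream machines use) that pulls out the s, exp, and frac fields with bit operations:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    static void dump_fields(float x) {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);            /* reinterpret the bits, no conversion */
        unsigned s    = bits >> 31;
        unsigned e    = (bits >> 23) & 0xFF;       /* 8-bit biased exponent */
        unsigned frac = bits & 0x7FFFFF;           /* 23-bit fraction       */
        printf("%g: s=%u exp=%u (E=%d) frac=0x%06X\n", x, s, e, (int)e - 127, frac);
    }

    int main(void) {
        dump_fields(1.0f);      /* exp=127 (E=0),  frac=0        */
        dump_fields(-0.75f);    /* s=1, exp=126 (E=-1), frac=0x400000 */
        return 0;
    }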
         From a Number Line Perspective
           represented as: +/- 1.0 x {2^0, 2^1, 2^2, 2^3} – i.e. the E field
   (number line:  -8  -4  -2  -1  0  1  2  4  8)
                                      same intra-gap spacing
                                      different inter-gap spacing
               IEEE Standard Differences
• 4 rounding modes
     round to nearest is the default
        » others selectable
             • via library call, system call, or embedded assembly instruction
        » rounding a halfway result always picks the even-number side
        » why?
• Defines special values
     NaN – not a number
        » 2 subtypes
             • quiet ⇒ qNaN
             • signalling ⇒ sNaN
     +∞ and –∞
     Denormals
        » for values < |1.0 x 2^Emin| ⇒ gradual underflow
• Sophisticated facilities for handling exceptions
     sophisticated ⇒ designer headache
                        Benefits (cont’d)
• Infinities
    converts overflow into a representable value
       » once again the idea is to avoid exceptions
       » also takes advantage of the close but not exact nature of floating
          point calculation styles
     e.g. 1/0 ⇒ +∞
• A more interesting example
     identity: arccos(x) = 2·arctan(sqrt((1–x)/(1+x)))
        » arctan(x) asymptotically approaches π/2 as x approaches ∞
        » natural to define arctan(∞) = π/2
            • in which case arccos(–1) = 2·arctan(∞) = π
            Special Value Rules
        Operation                        Result
          n / ±∞                          ±0
         ±∞ × ±∞                          ±∞
        nonzero / 0                       ±∞
         +∞ + +∞                          +∞ (similarly for –∞ + –∞)
          ±0 / ±0                         NaN
         +∞ – +∞                          NaN (similarly for –∞ – –∞)
          ±∞ / ±∞                         NaN
         ±∞ × ±0                          NaN
   NaN any-op anything                    NaN (similarly for [any op NaN])
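These rules are easy to confirm from C; a minimal sketch (C99 <math.h>, IEEE arithmetic assumed):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double inf = INFINITY, zero = 0.0, one = 1.0;
        printf("1/inf    = %g\n", one / inf);      /* 0   */
        printf("inf*inf  = %g\n", inf * inf);      /* inf */
        printf("1/0      = %g\n", one / zero);     /* inf */
        printf("inf+inf  = %g\n", inf + inf);      /* inf */
        printf("0/0      = %g\n", zero / zero);    /* nan */
        printf("inf-inf  = %g\n", inf - inf);      /* nan */
        printf("inf/inf  = %g\n", inf / inf);      /* nan */
        printf("inf*0    = %g\n", inf * zero);     /* nan */
        printf("nan==nan = %d\n", (zero/zero) == (zero/zero));  /* 0: NaN never compares equal */
        return 0;
    }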
• Encoding
                    Floating Point Precisions
           s              exp                                          frac
     MSB is sign bit
     exp field encodes E
     frac field encodes M
• IEEE 754 Sizes
     Single precision: 8 exp bits, 23 frac bits
       » 32 bits total
     Double precision: 11 exp bits, 52 frac bits
       » 64 bits total
     Extended precision: 15 exp bits, 63 frac bits
       » Only found in Intel-compatible machines
           • legacy from the 8087 FP coprocessor
       » Stored in 80 bits
           • 1 bit wasted in a world where bytes are the currency of the realm
 s    exp        frac
 1    all 1’s    all 0’s     Negative Infinity
 1    all 0’s    all 0’s     Negative Zero
 0    all 0’s    all 0’s     Positive Zero
            “Normalized” Numeric Values
• Condition
   exp ≠ 000…0 and exp ≠ 111…1
• Exponent coded as biased value
  E = Exp – Bias
    » Exp : unsigned value denoted by exp
     » Bias : Bias value
       • Single precision: 127 (Exp: 1…254, E: -126…127)
       • Double precision: 1023 (Exp: 1…2046, E: -1022…1023)
        • in general: Bias = 2^(e–1) – 1, where e is the number of exponent bits
                          Denormal Values
•   Condition
       exp = 000…0
•   Value
      Exponent value E = 1 – Bias
        » E = 0 – Bias would give a poor denormal-to-normal transition
        » since denormals don’t have an implied leading 1.xxxx…
      Significand value M = 0.xxx…x₂
        » xxx…x: bits of frac
•   Cases
      exp = 000…0, frac = 000…0
        » Represents value 0 (denormal role #1)
        » Note that we have distinct values +0 and –0
      exp = 000…0, frac ≠ 000…0
        » 2^n – 1 numbers very close to 0.0 (n = number of frac bits)
        » Lose precision as they get smaller
        » “Gradual underflow” (denormal role #2) – see the sketch below
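A minimal C sketch of that gradual underflow (assuming IEEE single precision; FLT_MIN and FLT_EPSILON come from <float.h>):

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        float x = FLT_MIN;                    /* smallest positive normalized, ~1.18e-38 */
        for (int i = 0; i < 5; i++) {
            printf("x = %g\n", x);
            x /= 2.0f;                        /* now denormal: precision degrades gradually */
        }
        printf("smallest denormal ~ %g\n", FLT_MIN * FLT_EPSILON);  /* 2^-149 ~ 1.4e-45 */
        return 0;
    }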
                             Special Values
• Condition
     exp = 111…1
• Cases
     exp = 111…1, frac = 000…0
         » Represents value ∞ (infinity)
         » Operation that overflows
         » Both positive and negative
       » E.g., 1.0/0.0 = −1.0/−0.0 = +∞, 1.0/−0.0 = −∞
     exp = 111…1, frac ≠ 000…0
       » Not-a-Number (NaN)
       » Represents case when no numeric value can be determined
             • E.g., sqrt(–1), ∞ − ∞
                                  NaN Issues
• Esoterica which we’ll subsequently ignore
• qNaN’s
       F = .1u…u (where u can be 1 or 0)
       propagate freely through calculations
          » all operations which generate NaN’s are supposed to generate
             qNaN’s
           » EXCEPT that an sNaN in can generate an sNaN out
                • 754 spec leaves this “can” issue vague
• sNaN’s
       F = .0u…u (where at least one u must be a 1)
          » hence representation options
          » can encode different exceptions based on encoding
       typical use is to mark uninitialized variables
          » trap prior to use is common model
   (figure: the number line with NaN regions beyond ±∞ at either end and −0, +0 at the center)
   Tiny Floating Point Example
• 8-bit Floating Point Representation
    the sign bit is in the most significant bit.
    the next four bits are the exponent, with a bias of 7.
    the last three bits are the frac
• Same General Form as IEEE Format
    normalized, denormalized
    representation of 0, NaN, infinity
                  bit:  7 |  6 5 4 3  |  2 1 0
                        s |    exp    |  frac
                 s  exp   frac     E        Value                    (dynamic range)
              0  0000  000       -6       0
              0  0000  001       -6       1/8 × 1/64 = 1/512         closest to zero
 Denormalized 0  0000  010       -6       2/8 × 1/64 = 2/512
   numbers    …
              0  0000  110       -6       6/8 × 1/64 = 6/512
              0  0000  111       -6       7/8 × 1/64 = 7/512         largest denorm
              0  0001  000       -6       8/8 × 1/64 = 8/512         smallest norm
              0  0001  001       -6       9/8 × 1/64 = 9/512
              …
              0  0110  110       -1       14/8 × 1/2 = 14/16
              0  0110  111       -1       15/8 × 1/2 = 15/16         closest to 1 below
  Normalized  0  0111  000        0       8/8 × 1    = 1
   numbers    0  0111  001        0       9/8 × 1    = 9/8           closest to 1 above
              0  0111  010        0       10/8 × 1   = 10/8
              …
              0  1110  110        7       14/8 × 128 = 224
              0  1110  111        7       15/8 × 128 = 240           largest norm
              0  1111  000       n/a      inf
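A small C sketch that evaluates this 8-bit format directly (the helper name tiny_value is made up for illustration; 1 sign bit, 4 exponent bits with bias 7, 3 fraction bits; link with -lm for pow):

    #include <stdio.h>
    #include <math.h>

    static double tiny_value(unsigned char b) {
        int s = (b >> 7) & 1;
        int e = (b >> 3) & 0xF;          /* 4-bit exponent field, bias 7 */
        int f =  b       & 0x7;          /* 3-bit fraction field         */
        double sign = s ? -1.0 : 1.0;
        if (e == 0)                      /* denormalized: E = 1 - 7, no hidden 1 */
            return sign * (f / 8.0) * pow(2, -6);
        if (e == 15)                     /* all ones: infinity or NaN */
            return f == 0 ? sign * INFINITY : NAN;
        return sign * (1.0 + f / 8.0) * pow(2, e - 7);   /* normalized */
    }

    int main(void) {
        printf("%g\n", tiny_value(0x01));   /* 1/512: smallest denorm */
        printf("%g\n", tiny_value(0x38));   /* 1.0                    */
        printf("%g\n", tiny_value(0x77));   /* 240: largest norm      */
        return 0;
    }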
                      Distribution of Values
  • 6-bit IEEE-like format
          e = 3 exponent bits
          f = 2 fraction bits
          Bias is 3
   (figure: representable values plotted from –15 to 15; legend: Denormalized, Normalized, Infinity)
                   Distribution of Values
                      (close-up view)
• 6-bit IEEE-like format
        e = 3 exponent bits
        f = 2 fraction bits
        Bias is 3
   (figure: close-up of the same plot from –1 to 1; legend: Denormalized, Normalized, Infinity)
                     Interesting Numbers
• Description                exp      frac       Numeric Value
• Zero                       00…00    00…00      0.0
• Smallest Pos. Denorm.      00…00    00…01      2^–{23,52} × 2^–{126,1022}
     Single ≈ 1.4 × 10^–45
     Double ≈ 4.9 × 10^–324
• Largest Denormalized       00…00    11…11      (1.0 – ε) × 2^–{126,1022}
     Single ≈ 1.18 × 10^–38
     Double ≈ 2.2 × 10^–308
• Smallest Pos. Normalized   00…01    00…00      1.0 × 2^–{126,1022}
     Just larger than largest denormalized
• One                        01…11    00…00      1.0
• Largest Normalized         11…10    11…11      (2.0 – ε) × 2^{127,1023}
    Single ≈ 3.4 × 10^38
    Double ≈ 1.8 × 10^308
          Special Encoding Properties
• FP Zero Same as Integer Zero
    All bits = 0
      » note anomaly with negative zero
      » fix is to ignore the sign bit on a zero compare
• Can (Almost) Use Unsigned Integer Comparison
    Must first compare sign bits
    Must consider -0 = 0
    NaNs problematic
      » Will be greater than any other values
      » What should comparison yield?
    Otherwise OK
      » Denorm vs. normalized
      » Normalized vs. infinity
      » due to monotonicity property of floats
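A minimal C sketch of that “almost unsigned compare” property (assuming 32-bit IEEE floats): for positive, finite values the bit patterns order the same way the values do.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    static uint32_t bits_of(float x) {
        uint32_t b;
        memcpy(&b, &x, sizeof b);        /* grab the raw encoding */
        return b;
    }

    int main(void) {
        float a = 1.5f, b = 2.25f;
        printf("float compare: %d\n", a < b);                    /* 1 */
        printf("bit   compare: %d\n", bits_of(a) < bits_of(b));  /* 1: same answer */
        /* caveats from the slide: the sign bit, -0 vs +0, and NaNs all break this */
        return 0;
    }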
                                Rounding
• IEEE 754 provides 4 modes
   application/user choice
      » mode bits indicate choice in some register
           • from C - library support is needed
     » control of numeric stability is the goal here
   modes
     » unbiased: round towards nearest
           • if in middle then round towards the even representation
      » truncation: round towards 0 (+0 for pos, -0 for neg)
       » round up: towards +infinity
      » round down: towards – infinity
   underflow
       » rounding from a non-zero value to 0 ⇒ underflow
      » denormals minimize this case
   overflow
      » rounding from non-infinite to infinite value
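From C, the “library support” mentioned above is <fenv.h> (C99). A minimal sketch, assuming a platform that provides FE_UPWARD and honors the mode at run time (strictly, #pragma STDC FENV_ACCESS ON is also required):

    #include <stdio.h>
    #include <fenv.h>

    int main(void) {
        volatile double big = 1e30, one = 1.0;

        fesetround(FE_TONEAREST);               /* the default mode */
        printf("nearest: %.20g\n", big + one);

        fesetround(FE_UPWARD);                  /* round toward +infinity */
        printf("upward : %.20g\n", big + one);  /* one ulp larger: the lost 1.0 forces a round up */

        fesetround(FE_TONEAREST);               /* restore the default */
        return 0;
    }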
                    Floating Point Operations
• Conceptual View
      First compute exact result
      Make it fit into desired precision
        » Possibly overflow if exponent too large
        » Possibly round to fit into frac
• Rounding Modes (illustrated with $ rounding)
                                  $1.40    $1.60    $1.50    $2.50    –$1.50
     Zero                          $1       $1       $1       $2       –$1
     Round down (–∞)               $1       $1       $1       $2       –$2
     Round up (+∞)                 $2       $2       $2       $3       –$1
     Nearest Even (default)        $1       $2       $2       $2       –$2
  Note:
  1. Round down: rounded result is close to but no greater than true result.
  2. Round up: rounded result is close to but no less than true result.
     Rounding Binary Numbers
• Binary Fractional Numbers
   “Even” when least significant bit is 0
    Half way when bits to right of rounding position = 100…₂
• Examples
   Round to nearest 1/4 (2 bits right of binary point)
   Value     Binary        Rounded    Action               Rounded Value
   2 3/32    10.00011₂     10.00₂     (< 1/2: down)        2
   2 3/16    10.00110₂     10.01₂     (> 1/2: up)          2 1/4
   2 7/8     10.11100₂     11.00₂     (1/2: up to even)    3
   2 5/8     10.10100₂     10.10₂     (1/2: down to even)  2 1/2
            Rounding in the Worst Case
• Basic algorithm for add
     subtract exponents to see which one is bigger: d = Ex – Ey
     operands are usually swapped if necessary so the bigger one is in a fixed register
     alignment step
         » shift the smaller significand d positions to the right
         » copy the larger exponent into the exponent field of the smaller
     add or subtract significands
         » add if signs are equal – subtract if they aren’t
     normalize result
         » details next slide
     round according to the specified mode
         » more details soon
         » note this might generate an overflow ⇒ shift right and increment the
           exponent
     exceptions
         » exponent overflow ⇒ ∞
         » exponent underflow ⇒ denormal
         » inexact ⇒ rounding was done
         » special value 0 may also result ⇒ need to avoid wrap around
                   Normalization Cases
• Result already normalized
    no action needed
• On an add
    you may end up with 2 leading bits before the “.”
    hence significand shift right one & increment exponent
• On a subtract
     the significand may have n leading zeros
     hence shift the significand left by n and decrement the exponent by n
     note: the common circuit is an L0D ::= leading-0 detector
    Alignment and Normalization Issues
• During alignment
    smaller exponent arg gets significand right shifted
        » ⇒ need for extra precision in the FPU
            • the question is again how much extra do you need?
            • Intel maintains 80 bits inside their FPU’s – an 8087 legacy
• During normalization
    a left shift of the significand may occur
• During the rounding step
    extra internal precision bits (guard bits) get dropped
               For Effective Subtraction
• There are 2 subcases
     if the difference in the two exponents is larger than 1
         » alignment produces a mantissa with more than 1 leading 0
         » hence result is either normalized or has one leading 0
             • in this case a left shift will be required in normalization
             • ⇒ an extra bit is needed for the fraction plus you still need the rounding bit
             • this extra bit is called the guard bit
        » also during subtraction a borrow may happen at position f+2
            • this borrow is determined by the sticky bit
     the difference of the two exponents is 0 or 1
        » in this case the result may have many more than 1 leading 0
        » but at most one nonzero bit was shifted during normalization
            • hence only one additional bit is needed for the subtraction result
            • but the borrow to the extra bit may still happen
   internal layout: s | exp | frac | G R S   (guard, round, sticky bits)
                     Round to Nearest
• Add 1 to low order fraction bit L
    if G=1 and R and S are NOT both zero
• However a halfway result gets rounded to even
     i.e. if G=1 and R and S are both zero
• Hence
     let rnd be the value added to L
     then
        » rnd = G(R+S) + G·L·(R+S)′ = G(L+R+S)
• OR
    always add 1 to position G which effectively rounds to nearest
       » but doesn’t account for the halfway result case
    and then
       » zero the L bit if G(R+S)’=1
• Finished
    you know how to round
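A minimal C sketch of the rnd = G(L+R+S) rule, applied to the 1/4-rounding examples from the earlier table (values are passed in 1/32nds so three bits are dropped; the function name is made up for illustration):

    #include <stdio.h>

    static int round_32nds_to_quarters(int v) {      /* v is in units of 1/32 */
        int L = (v >> 3) & 1;                        /* last bit we keep      */
        int G = (v >> 2) & 1;                        /* guard: first dropped  */
        int R = (v >> 1) & 1;                        /* round: second dropped */
        int S =  v       & 1;                        /* sticky: rest OR'd     */
        int rnd = G & (L | R | S);                   /* halfway goes to even  */
        return (v >> 3) + rnd;                       /* result in 1/4 units   */
    }

    int main(void) {
        printf("2 3/32 -> %d/4\n", round_32nds_to_quarters(67));  /*  8/4 = 2     */
        printf("2 3/16 -> %d/4\n", round_32nds_to_quarters(70));  /*  9/4 = 2 1/4 */
        printf("2 7/8  -> %d/4\n", round_32nds_to_quarters(92));  /* 12/4 = 3     */
        printf("2 5/8  -> %d/4\n", round_32nds_to_quarters(84));  /* 10/4 = 2 1/2 */
        return 0;
    }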
              Floating Point Operations
• Multiply/Divide
    similar to what you learned in school
        » add/sub the exponents
        » multiply/divide the mantissas
        » normalize and round the result
    tricks exist of course
• Add/Sub
     find the larger exponent
     make the smaller exponent equal the larger and shift its mantissa accordingly
     add the mantissas (the smaller one may have gone to 0)
• Problems?
    consider all the types: numbers, NaN’s, 0’s, infinities, and denormals
    what sort of exceptions exist
                       FP Multiplication
• Operands
    (–1)^s1 · S1 · 2^E1        *        (–1)^s2 · S2 · 2^E2
• Exact Result
    (–1)^s · S · 2^E
     Sign s: s1 XOR s2
     Significand S: S1 * S2
     Exponent E:    E1 + E2
• Fixing: post-operation normalization
     If S ≥ 2, shift S right, increment E
    If E out of range, overflow
    Round S to fit frac precision
• Implementation
    Biggest chore is multiplying significands
      » same as unsigned integer multiply which you know all about
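The same recipe can be sketched in C with frexp/ldexp, which expose a double’s significand and exponent (this only illustrates the algebra; real hardware works on the bit fields directly):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double a = 3.5, b = -0.625;
        int ea, eb;
        double sa = frexp(a, &ea);                  /* a = sa * 2^ea, 0.5 <= |sa| < 1 */
        double sb = frexp(b, &eb);
        double product = ldexp(sa * sb, ea + eb);   /* multiply sigs, add exps, renormalize */
        printf("%g * %g = %g (direct: %g)\n", a, b, product, a * b);
        return 0;
    }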
                            FP Addition
• Operands
    (–1)^s1 · M1 · 2^E1
    (–1)^s2 · M2 · 2^E2
     Assume E1 > E2 ⇒ sort
    (figure: shift the smaller significand right by E1–E2 to align, then do a signed add)
• Exact Result
    (–1)^s · S · 2^E
     Sign s, significand S:
        » Result of signed align & add
     Exponent E:        E1
• Fixing
     If S ≥ 2, shift S right, increment E
     if S < 1, shift S left k positions, decrement E by k
     Overflow if E out of range
     Round S to fit frac precision
              Math Properties of FP Mult
• Compare to Commutative Ring
   Closed under multiplication?                   YES
     » But may generate infinity or NaN
   Multiplication Commutative?                    YES
   Multiplication is Associative?                 NO
     » Possibility of overflow, inexactness of rounding
   1 is multiplicative identity?                  YES
   Multiplication distributes over addition?      NO
     » Possibility of overflow, inexactness of rounding
• Monotonicity
   a ≥ b & c ≥ 0 ⇒ a *c ≥ b *c?                        ALMOST
     » Except for infinities & NaNs
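A minimal C sketch of why associativity fails (one rounding case, one overflow case):

    #include <stdio.h>

    int main(void) {
        /* rounding: the small term vanishes when the big ones are added first */
        double a = 1e20, b = -1e20, c = 1.0;
        printf("(a+b)+c = %g\n", (a + b) + c);   /* 1 */
        printf("a+(b+c) = %g\n", a + (b + c));   /* 0 */

        /* overflow: an intermediate product blows up to infinity */
        double x = 1e200;
        printf("(x*x)*1e-200 = %g\n", (x * x) * 1e-200);  /* inf   */
        printf("x*(x*1e-200) = %g\n", x * (x * 1e-200));  /* 1e200 */
        return 0;
    }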
                       The 5 Exceptions
• Overflow
     raised when an infinite result is produced from finite operands
• Underflow
     raised when a non-zero result becomes 0 after rounding
• Divide by 0
• Inexact
     set when the result overflowed or had to be rounded
    DOH! the common case so why is an exception??
       » some nitwit wanted it so it’s there
       » there is a way to mask it as an exception of course
• Invalid
    result of bizarre stuff which generates a NaN result
       » sqrt(-1)
       » 0/0
       » inf - inf
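From C the five flags can be read back through <fenv.h> (C99); a minimal sketch (without #pragma STDC FENV_ACCESS ON a compiler may technically reorder these, but typical builds show the flags):

    #include <stdio.h>
    #include <fenv.h>

    int main(void) {
        volatile double zero = 0.0, big = 1e308;

        feclearexcept(FE_ALL_EXCEPT);
        volatile double r1 = 1.0 / zero;       /* divide by zero          */
        volatile double r2 = big * big;        /* overflow (and inexact)  */
        volatile double r3 = zero / zero;      /* invalid -> NaN          */
        (void)r1; (void)r2; (void)r3;

        printf("divbyzero: %d\n", !!fetestexcept(FE_DIVBYZERO));
        printf("overflow : %d\n", !!fetestexcept(FE_OVERFLOW));
        printf("invalid  : %d\n", !!fetestexcept(FE_INVALID));
        printf("inexact  : %d\n", !!fetestexcept(FE_INEXACT));
        return 0;
    }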
                       Floating Point in C
• C Guarantees Two Levels
   float         single precision
   double        double precision
• Conversions
    Casting between int, float, and double changes numeric values
    Double or float to int
      » Truncates fractional part
      » Like rounding toward zero
      » Not defined when out of range
            • Generally saturates to TMin or TMax
    int to double
      » Exact conversion, as long as int has ≤ 53 bit word size
    int to float
      » Will round according to rounding mode
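A short C sketch of the conversion rules above (the 2^24 + 1 value shows where float’s 24-bit significand gives out):

    #include <stdio.h>

    int main(void) {
        double d = 3.999;
        printf("(int)3.999       = %d\n", (int)d);           /* truncates to 3 */

        int big = (1 << 24) + 1;                             /* 16777217 */
        printf("(int)(float)big  = %d\n", (int)(float)big);  /* 16777216: only 24 significand bits */
        printf("(int)(double)big = %d\n", (int)(double)big); /* 16777217: exact */
        return 0;
    }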
int x = …;            Floating Point Puzzles
float f = …;          (assume neither d nor f is NaN)
double d = …;
• x == (int)(float) x        NO – 24-bit significand
• x == (int)(double) x       YES – 53-bit significand
• f == (float)(double) f     YES – increases precision
• d == (float) d             NO – loses precision
• f == -(-f)                 YES – sign bit inverts twice
• 2/3 == 2/3.0               NO – integer division: 2/3 = 0
• d < 0.0  ⇒  (d*2) < 0.0    YES – monotonicity
• d > f    ⇒  -f > -d        YES – monotonicity
• d * d >= 0.0               YES – monotonicity
• (d+f)-d == f               NO – rounding can lose f (see the sketch below)
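The last answer is easy to confirm with a counterexample (a minimal C sketch):

    #include <stdio.h>

    int main(void) {
        double d = 1e20;
        float  f = 1.0f;
        printf("(d+f)-d == f ? %d\n", (d + f) - d == f);   /* 0: the 1.0 is lost in d+f */
        return 0;
    }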
         Gulf War: Patriot misses Scud
• By 687 meters
   Why?
     » clock ticks at 1/10 second – can’t be represented exactly in
       binary
     » clock was running for 100 hours
           • hence clock was off by .3433 sec.
      » Scud missile travels at 2000 m/sec
           • 687 meter error = .3433 second
   Result
      » SCUD hits Army barracks
      » 28 soldiers die
• Accuracy counts
   floating point has many sources of inaccuracy - BEWARE
• Real problem
   software updated but not fully deployed
                                Ariane 5
    Exploded 37 seconds after
     liftoff
    Cargo worth $500 million
• Why
    Computed horizontal
     velocity as floating point
     number
    Converted to 16-bit integer
    Worked OK for Ariane 4
    Overflowed for Ariane 5
      » Used same software
                             Ariane 5
• From Wikipedia
Ariane 5's first test flight on 4 June 1996 failed, with the
  rocket self-destructing 37 seconds after launch
  because of a malfunction in the control software,
  which was arguably one of the most expensive
  computer bugs in history. A data conversion from 64-bit
  floating point to 16-bit signed integer value had
  caused a processor trap (operand error). The floating
  point number had a value too large to be represented
  by a 16-bit signed integer. Efficiency considerations
  had led to the disabling of the software handler (in
  Ada code) for this trap, although other conversions of
  comparable variables in the code remained protected.
                  Remaining Problems?
• Of course
• NaN’s
      2 types defined Q and S
      standard doesn’t say exactly how they are represented
      standard is also vague about SNaN’s results
       sNaN’s cause exceptions, qNaN’s don’t
         » hence program behavior is different and there’s a porting
           problem
• Exceptions
    standard says what they are
    but forgets to say WHEN you will see them
        » Weitek chips ⇒ only see the exception when you start the next
          op ⇒ GRR!!
                          More Problems
• IA32 specific
    FP registers are 80 bits
       » similar to IEEE but with e=15 bits, f=63-bits, + sign
       » 79 bits but stored in 80 bit field
           • 10 bytes
           • BUT modern x86 use 12 bytes to improve memory performance
           • e.g. 4 byte alignment model
    the problem
        » FP reg to memory ⇒ conversion to IEEE 64- or 32-bit formats
            • loss of precision and a potential rounding step
    C problem
       » no control over where values are – register or memory
       » hence hidden conversion
            • the -ffloat-store gcc flag is supposed to keep floats out of registers
           • after all the x86 isn’t a RISC machine
           • unfortunately there are exceptions so this doesn’t work
       » stuck with specifying long doubles if you need to avoid the
         hidden conversions
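A quick way to see the extended format from C (a sketch; the exact sizes are implementation dependent, but on IA32/x86-64 gcc the long double is the 80-bit x87 format padded to 12 or 16 bytes):

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        printf("float       : %2zu bytes, %d mantissa bits\n", sizeof(float),  FLT_MANT_DIG);
        printf("double      : %2zu bytes, %d mantissa bits\n", sizeof(double), DBL_MANT_DIG);
        printf("long double : %2zu bytes, %d mantissa bits\n", sizeof(long double), LDBL_MANT_DIG);
        return 0;   /* on x87 builds the last line reports 64 mantissa bits */
    }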
            Summary
• IEEE Floating Point Has Clear Mathematical
  Properties
     Represents numbers of the form M × 2^E
    Can reason about operations independent of implementation
      » As if computed with perfect precision and then rounded
    Not the same as real arithmetic
      » Violates associativity/distributivity
      » Makes life difficult for compilers & serious numerical
        applications programmers