Floating Point Arithmetic

Numbers represented in a form:

X_FP = (-1)^s × b^(exp - bias) × m

X_FP - final numeric value
b - base (positive integer)
exp - stored exponent (non-negative integer)
bias - a constant
m - mantissa (always positive for IEEE 754)
s - sign bit (0 = positive, 1 = negative)
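The formula above can be tried out on a real value. The following is a minimal sketch, assuming the IEEE 754 binary64 layout (1 sign bit, 11 exponent bits, 52 stored mantissa bits, bias 1023) and a normalized input; the helper names decompose and recompose are made up for this illustration.

```python
import struct

def decompose(value):
    """Split a binary64 value into sign bit, stored exponent and stored mantissa bits."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    s = bits >> 63                      # 1 sign bit
    exp = (bits >> 52) & 0x7FF          # 11 exponent bits (still biased)
    fraction = bits & ((1 << 52) - 1)   # 52 stored mantissa bits
    return s, exp, fraction

def recompose(s, exp, fraction, bias=1023):
    """Evaluate (-1)^s * 2^(exp - bias) * m for a normalized value."""
    m = 1 + fraction / 2**52            # hidden leading 1 plus stored bits
    return (-1) ** s * 2.0 ** (exp - bias) * m

s, exp, fraction = decompose(-6.5)
print(s, exp, hex(fraction))            # sign, stored exponent, mantissa bits
print(recompose(s, exp, fraction))      # rebuilds the original value
```

Zero, denormalized values, infinities and NaNs use reserved exponent patterns and are not handled by recompose.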

Many FP formats have been (and still are) used

Computer/system	Width (bits)	Base	Exponent (bits)	Mantissa (bits)
IEEE 754 half	16	2	5	10+1

IEEE 754 single	32	2	8	23+1
IEEE 754 double	64	2	11	52+1

IEEE 754 double extended	80	2	15	64
IEEE 754 quadruple	128	2	15	112+1
IEEE 754 octuple	256	2	19	236+1

IBM 7xx series	36	2	8	27
IBM 360 single	32	16	7	24
IBM 360 double	64	16	7	56
HP 3000 single	32	2	9	22
HP 3000 double	64	2	9	54
CDC 6000, 6600	60	2	11	48+1
Cray-1	64	2	15	48

Strela	43	2	7	35

Apple II	40	2	8	31+1
ZX Spectrum	40	2	8	31+1
Atari (FP routines)	48	10	7	40
Turbo Pascal real	48	2	8	39

IEEE 754

William “Velvel” Morton Kahan, University of California
Designed to avoid many common errors in production code
- which is a problem, as many developers just don’t care
1977, first draft
1980, the Intel 8087 chip
Later Intel i80287, Intel i80387, Intel i80487, Motorola M68881, Motorola M68882
Now supported by most CPUs in the world

Basic characteristic

Binary exponent (i.e. some decimal values can’t be represented properly)
Focused on developers (fewer errors) rather than on performance
Distinguish between +0 and -0 (very nice!)
Supports +∞: yes
Supports -∞: yes
Supports NaN: yes
Rules for working with NaNs and infinity
Normalized values
Denormalized values (less precision)
Not suitable for exact decimal arithmetic (banks etc.)
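The special values from the list above can be inspected directly. This is a small sketch, assuming CPython, whose float type is an IEEE 754 binary64 on all common platforms; the variable names are illustrative only.

```python
import math

pos_zero, neg_zero = 0.0, -0.0
inf, nan = math.inf, math.nan

# +0 and -0 compare equal, but the sign is still stored
zeros_compare_equal = pos_zero == neg_zero
zero_signs = (math.copysign(1.0, pos_zero), math.copysign(1.0, neg_zero))

# NaN is not equal to anything, including itself
nan_equals_itself = nan == nan

# Rules for infinity: some operations stay defined, some produce NaN
one_over_inf = 1.0 / inf
inf_minus_inf_is_nan = math.isnan(inf - inf)

print(zeros_compare_equal, zero_signs)
print(nan_equals_itself, one_over_inf, inf_minus_inf_is_nan)
```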

IEEE 754 formats

Official name	Basic (must be supported)	a.k.a.	Sign	Exponent	Mantissa	Sum (bits)	Decimal digits
binary16	×	half precision	1b	5b	10b	16b	approx. 3.3
binary32	✓	single precision/float	1b	8b	23b	32b	approx. 7.2
binary64	✓	double precision	1b	11b	52b	64b	approx. 15.9
binary128	✓	quadruple precision	1b	15b	112b	128b	approx. 34.0
binary256	×	octuple precision	1b	19b	236b	256b	approx. 71.3
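The last column of the table can be reproduced from the mantissa width. The sketch below assumes the usual estimate: decimal digits = (stored mantissa bits + 1 hidden bit) × log10(2). The names FORMATS and decimal_digits are made up for this example.

```python
import math

# Stored mantissa bits per format, taken from the table above
FORMATS = {
    "binary16": 10,
    "binary32": 23,
    "binary64": 52,
    "binary128": 112,
    "binary256": 236,
}

def decimal_digits(stored_mantissa_bits):
    """Approximate decimal precision, counting the hidden leading bit."""
    return (stored_mantissa_bits + 1) * math.log10(2)

for name, bits in FORMATS.items():
    print(f"{name}: approx. {decimal_digits(bits):.2f} decimal digits")
```

The values in the table appear to be these results cut to one decimal place.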

Problems with IEEE 754

Common misconceptions (none of them holds in general)
- FP can represent all “real numbers”
- FP (double, for example) is “precise”
- All algebraic rules are supported
- Conversion from int to float never loses data
- Conversion from long to double never loses data
Ariane 5 Explosion
- a Very Costly Coding Error: https://www.youtube.com/watch?v=5tJPXYA0Nec
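Several of the claims above can be checked directly. This is a minimal sketch; Python has no 32-bit float type, so the hypothetical helper to_float32 emulates one by round-tripping through the struct module, and Python integers stand in for int and long.

```python
import struct

def to_float32(value):
    """Round a number to the nearest binary32 value (emulates a C float)."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]

# int -> float: 2**24 + 1 does not fit into a 24-bit mantissa
int_survives_float = int(to_float32(2**24 + 1)) == 2**24 + 1

# long -> double: 2**53 + 1 does not fit into a 53-bit mantissa
long_survives_double = int(float(2**53 + 1)) == 2**53 + 1

# Algebraic rules: addition is not associative
associative = (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)

# "Precise": simple decimal fractions are only approximated
sum_is_exact = 0.1 + 0.2 == 0.3

print(int_survives_float, long_survives_double, associative, sum_is_exact)
```

All four checks evaluate to False.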

An example

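step = 0.1
x = 0.0
# 0.1 has no exact binary representation, so x never equals exactly 1.0
# and this loop never terminates
while x != 1.0:
    x += step
    print(x)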

package main

import "fmt"

func main() {
        step := 0.1
        x := 0.0
        // x never becomes exactly 1.0, so this loop does not terminate
        for x != 1.0 {
                x += step
                fmt.Printf("%f\n", x)
        }
}
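Both loops above depend on an exact equality test between floating point values. Two common ways to avoid that are sketched below; this is one possible fix, not the only one, and the tolerance passed to math.isclose is an arbitrary choice for this example.

```python
import math

steps = 10
step = 1.0 / steps

# Variant 1: drive the loop with an integer counter and derive x from it
values = [i * step for i in range(1, steps + 1)]

# Variant 2: keep accumulating, but compare with a tolerance
# (the extra x < 1.0 guard stops the loop even if the tolerance is missed)
x = 0.0
iterations = 0
while not math.isclose(x, 1.0, rel_tol=1e-9) and x < 1.0:
    x += step
    iterations += 1

print(values[-1], x, iterations)
```

Variant 1 also avoids accumulating a rounding error in every iteration.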

Beyond IEEE 754

Usually smaller formats
- for teaching
- for GPUs (16-bit FP)
- for AI/NN (16-bit FP)

Microfloat

8-bit values
1 bit for sign
4 bits for exponent
3 bits for mantissa
BIAS=7
Precision: about 1 decimal digit (4 significant bits)
Smallest non-zero value: 1/512
Used in schools
Possible to represent NaN, +∞, -∞
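The layout above is small enough to decode by hand. The sketch below assumes the format follows IEEE 754 conventions (hidden leading 1, exponent of all zeros for denormalized values, exponent of all ones for infinity and NaN); the function name decode_microfloat is made up for this illustration.

```python
BIAS = 7

def decode_microfloat(byte):
    """Decode an 8-bit value laid out as 1 sign, 4 exponent, 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    frac = byte & 0x07
    if exp == 0x0F:                     # all ones: infinity or NaN
        return sign * float("inf") if frac == 0 else float("nan")
    if exp == 0:                        # denormalized: no hidden 1
        return sign * (frac / 8) * 2.0 ** (1 - BIAS)
    return sign * (1 + frac / 8) * 2.0 ** (exp - BIAS)

print(decode_microfloat(0x01))          # smallest non-zero value
print(decode_microfloat(0x38))          # exponent equal to the bias
print(decode_microfloat(0x77))          # largest finite value
print(decode_microfloat(0xF8))          # negative infinity
```

Under these assumptions the smallest non-zero value is 1/512, which matches the figure quoted above.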

Mini-FP

6-bit values
1 bit for sign
3 bits for exponent
2 bits for mantissa
Only 64 values
Possible to draw all of them on numeric axis

Bfloat16 - Brain Floating Point

For neural networks (NN)
Now supported by many libraries and some Intel chips
Basic characteristic
- 16-bit values
- 1 bit for sign
- 8 bits for exponent
- 7 bits for mantissa
- BIAS=127
- Precision: 2-3 decimal digits
- Maximum value: 3.38953139 × 10^38
- Minimum value: -3.38953139 × 10^38
- Minimum positive value (non zero): 9.2 × 10^−41
- Minimum normalized value: 1.18 × 10^-38
- Supports +∞: yes
- Supports -∞: yes
- Supports NaN: yes
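As commonly defined, bfloat16 is the upper half of an IEEE 754 binary32 value (1 sign bit, 8 exponent bits, 7 mantissa bits), which is consistent with BIAS=127 and the range quoted above. The sketch below converts by plain truncation; real implementations usually round to nearest instead, and both function names are made up for this example.

```python
import struct

def float_to_bfloat16_bits(value):
    """Keep only the upper 16 bits of the binary32 pattern (truncation, no rounding)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    return bits32 >> 16

def bfloat16_bits_to_float(bits16):
    """Expand a bfloat16 bit pattern back to a float by padding with zero bits."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value

bits = float_to_bfloat16_bits(3.14159)
print(hex(bits), bfloat16_bits_to_float(bits))   # precision drops to 2-3 digits
print(bfloat16_bits_to_float(0x7F7F))            # maximum finite value
print(bfloat16_bits_to_float(0x0080))            # minimum normalized value
print(bfloat16_bits_to_float(0x0001))            # minimum positive value
```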

FP problems?

Equality                 Is it true?
x == (int)(float) x      yes/no
x == (int)(double) x     yes/no
f == (float)(double) f   yes/no
d == (float) d           yes/no
f == -(-f)               yes/no
2/3 == 2/3.0             yes/no
d < 0.0 ⇒ ((d*2) < 0.0)  yes/no
d > f ⇒ -f > -d          yes/no
d * d >= 0.0             yes/no
(d+f)-d == f             yes/no
(x+y)-y == x             yes/no
(x-y)+y == x             yes/no
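A few rows of the table can be tried out with concrete values. This sketch assumes that x is an int, f a float (binary32) and d a double (binary64), as the casts suggest; the helper to_float32 emulates the C float type, 2 // 3 stands in for C integer division, and the chosen values are only examples.

```python
import math
import struct

def to_float32(value):
    """Round a number to the nearest binary32 value (emulates a C float)."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]

d = 1e20
f = to_float32(1.0)

checks = {
    "d == (float) d, for d = 0.1": to_float32(0.1) == 0.1,
    "f == -(-f)": f == -(-f),
    "2/3 == 2/3.0": 2 // 3 == 2 / 3.0,
    "(d+f)-d == f, for d = 1e20": (d + f) - d == f,
    "d * d >= 0.0, for d = NaN": math.nan * math.nan >= 0.0,
}

for expression, result in checks.items():
    print(f"{expression:32} {result}")
```

A single counterexample is enough to answer "no"; a "yes" needs an argument that covers all values, including NaN and the infinities.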