Floating Point Arithmetic

$X_{FP} = (-1)^{s} \times b^{\mathrm{exp} - \mathrm{bias}} \times m$
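
The formula can be tried out on a concrete value. The following is a minimal sketch, assuming the Python float is an IEEE 754 binary64 number (b = 2, bias = 1023, 52 stored mantissa bits plus a hidden leading 1). The helper names `decompose` and `recompose` are hypothetical and exist only for this illustration; infinities and NaN are not handled.

```python
import struct


def decompose(x: float):
    """Split a binary64 float into sign s, biased exponent exp and mantissa m."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63                      # 1 sign bit
    exp = (bits >> 52) & 0x7FF          # 11 exponent bits (biased)
    frac = bits & ((1 << 52) - 1)       # 52 stored mantissa bits
    if exp == 0:
        # Subnormal numbers: no hidden 1, exponent fixed at the smallest normal one.
        return s, 1, frac / 2**52
    # Normal numbers: the hidden leading 1 is added back.
    return s, exp, 1 + frac / 2**52


def recompose(s: int, exp: int, m: float, b: float = 2.0, bias: int = 1023) -> float:
    """Evaluate (-1)^s * b^(exp - bias) * m."""
    return (-1) ** s * b ** (exp - bias) * m


s, exp, m = decompose(-6.5)
print(s, exp, m)                # sign, biased exponent, mantissa
print(recompose(s, exp, m))     # rebuilds the original value
```

Storing the exponent with a bias keeps the exponent field unsigned, which is why `exp - bias` appears in the formula rather than a signed exponent.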

Many floating-point formats have been used, and several are still in use

Computer/system           Width (b)  Base  Exponent (b)  Mantissa (b)
IEEE 754 half                    16     2             5  10+1

IEEE 754 single                  32     2             8  23+1
IEEE 754 double                  64     2            11  52+1

IEEE 754 double extended         80     2            15  64
IEEE 754 quadruple              128     2            15  112+1
IEEE 754 octuple                256     2            19  236+1

IBM 7xx series                   36     2             8  27
IBM 360 single                   32    16             7  24
IBM 360 double                   64    16             7  56
HP 3000 single                   32     2             9  22
HP 3000 double                   64     2             9  54
CDC 6000, 6600                   60     2            11  48+1
Cray-1                           64     2            15  48

Strela                           43     2             7  35

Apple II                         40     2             8  31+1
ZX Spectrum                      40     2             8  31+1
Atari (FP routines)              48    10             7  40
Turbo Pascal real                48     2             8  39

IEEE 754

Basic characteristics

IEEE 754 formats

Official name  Basic (must be supported)  a.k.a.                  Sign  Exponent  Mantissa  Sum (bits)  Decimal digits
binary16       no                         half precision          1b    5b        10b       16b         approx. 3.3
binary32       yes                        single precision/float  1b    8b        23b       32b         approx. 7.2
binary64       yes                        double precision        1b    11b       52b       64b         approx. 15.9
binary128      yes                        quadruple precision     1b    15b       112b      128b        approx. 34.0
binary256      no                         octuple precision       1b    19b       236b      256b        approx. 71.3
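
The decimal-digits column can be approximated from the mantissa width. This is a minimal sketch, assuming each format carries one hidden bit in addition to the stored mantissa bits, so that the precision is roughly (stored bits + 1) × log10(2) decimal digits. The name `decimal_digits` and the `STORED_BITS` mapping are hypothetical, introduced only for this illustration.

```python
import math

# Stored mantissa bits per format, taken from the table of IEEE 754 formats.
STORED_BITS = {
    "binary16": 10,
    "binary32": 23,
    "binary64": 52,
    "binary128": 112,
    "binary256": 236,
}


def decimal_digits(stored_bits: int, hidden_bits: int = 1) -> float:
    """Approximate number of decimal digits a binary mantissa can represent."""
    return (stored_bits + hidden_bits) * math.log10(2)


for name, bits in STORED_BITS.items():
    print(f"{name:10s} {decimal_digits(bits):.2f}")
```

The results land close to the values in the table; small differences in the last printed digit depend on whether the value is rounded or truncated.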

Problems with IEEE 754

An example

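# Python: this loop never terminates, because repeatedly adding 0.1
# never produces exactly 1.0 in binary floating point.
step = 0.1
x = 0.0
while x != 1.0:
    x += step
    print(x)

// Go: the same loop with the same non-terminating behaviour.
package main

import "fmt"

func main() {
        step := 0.1
        x := 0.0
        for x != 1.0 {
                x += step
                fmt.Printf("%f\n", x)
        }
}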
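
Two common ways to make such a loop terminate are sketched below. This is an illustration under stated assumptions rather than a prescribed fix: the function names `steps_by_counter` and `steps_by_tolerance` are hypothetical, and the tolerance is simply the default relative tolerance of `math.isclose`.

```python
import math


def steps_by_counter(n: int = 10, step: float = 0.1) -> list:
    """Drive the loop with an integer counter and derive x from it."""
    return [i * step for i in range(1, n + 1)]


def steps_by_tolerance(step: float = 0.1, stop: float = 1.0) -> list:
    """Accumulate x, but compare with a tolerance instead of exact equality."""
    x = 0.0
    values = []
    # The x < stop guard prevents an endless loop if x ever overshoots stop.
    while x < stop and not math.isclose(x, stop):
        x += step
        values.append(x)
    return values


print(steps_by_counter())
print(steps_by_tolerance())
```

The counter version also avoids accumulating rounding error, because every value is computed from a single multiplication instead of a chain of additions.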

Beyond IEEE 754

Microfloat

Mini-FP

Bfloat16 - Brain Floating Point
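
Bfloat16 keeps the sign bit and the 8 exponent bits of binary32 and shortens the mantissa to 7 stored bits, so its bit pattern is the upper half of a binary32 value. The sketch below illustrates this under a simplifying assumption: it converts by plain truncation, whereas real implementations usually round to nearest. The helper names are hypothetical.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the 32-bit pattern of x rounded to binary32."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bfloat16_bits(x: float) -> int:
    """Keep the upper 16 bits of the binary32 pattern (truncation, no rounding)."""
    return float32_bits(x) >> 16


def from_bfloat16_bits(b: int) -> float:
    """Expand a 16-bit bfloat16 pattern back to a float by zero-filling the low bits."""
    return struct.unpack(">f", struct.pack(">I", (b & 0xFFFF) << 16))[0]


bits = to_bfloat16_bits(3.14159265)
print(hex(bits), from_bfloat16_bits(bits))
```

Because the exponent field is unchanged, bfloat16 covers the same range as binary32 and gives up only precision.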

FP problems?

Equality                 Is it true?
x == (int)(float) x      yes/no
x == (int)(double) x     yes/no
f == (float)(double) f   yes/no
d == (float) d           yes/no
f == -(-f)               yes/no
2/3 == 2/3.0             yes/no
d < 0.0 ⇒ ((d*2) < 0.0)  yes/no
d > f ⇒ -f > -d          yes/no
d * d >= 0.0             yes/no
(d+f)-d == f             yes/no
(x+y)-y == x             yes/no
(x-y)+y == x             yes/no
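
Some of these questions can be probed experimentally. The sketch below makes several assumptions that the table does not state: `x` is taken to be a 32-bit integer, `f` a binary32 value and `d` a binary64 value, as in C. Python has no native binary32 type, so the hypothetical helper `f32` rounds through `struct`. Each entry tests one specific input, so a `True` result does not prove the equality holds in general.

```python
import math
import struct


def f32(x: float) -> float:
    """Round a binary64 Python float to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def quiz_probes() -> dict:
    results = {}
    x = 2**24 + 1  # needs 25 significant bits
    # x == (int)(float) x: binary32 has only 24 significant bits.
    results["x == (int)(float) x"] = x == int(f32(float(x)))
    # x == (int)(double) x: 53 significant bits cover every 32-bit integer.
    results["x == (int)(double) x"] = x == int(float(x))
    # d == (float) d: 0.1 rounds differently in binary32 and binary64.
    d = 0.1
    results["d == (float) d"] = d == f32(d)
    # 2/3 == 2/3.0: the left side is integer division in C (// in Python).
    results["2/3 == 2/3.0"] = 2 // 3 == 2 / 3.0
    # d * d >= 0.0: every comparison involving NaN is false.
    results["d * d >= 0.0"] = math.nan * math.nan >= 0.0
    # (x+y)-y == x: the small operand is absorbed by the large one.
    big, small = 1e16, 0.1
    results["(x+y)-y == x"] = (small + big) - big == small
    return results


for claim, holds in quiz_probes().items():
    print(f"{claim:24s} {holds}")
```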