Floating Point Number#
Floats represents a
\[ (-1)^s \cdot 1.m \cdot 2^{e-127} \]
Precision | Width | Exp. | Bias |
---|---|---|---|
Half | 16 bit | 5 bit | 15 |
Single | 32 bit | 8 bit | 127 |
Double | 64 bit | 11bit | 1023 |
Example -7 = 1.11 * 2\^2
Special Numbers#
Floats can also represent signed zeros (\(\pm 0\)), infinity \(\pm \infty\), and Not-A-Number (NaN).
Num | Sign | Exp. | Mant. |
---|---|---|---|
Normal | [0,1] | [-127,126] | \([0,2^{23}]\) |
\(\pm 0\) | [0,1] | -128 | 0 |
\(\pm \infty\) | [0,1] | 127 | 0 |
Subnormal | [0,1] | -128 | != 0 |
QNaN | [0,1] | 127 | !=0 & MSB=1 |
SNaN | [0,1] | 127 | !=0 & MSB=0 |
Exceptions#
5 exceptions are supported:
- Invalid operation: the result of the operation is a NaN
- Division by zero
- Overflow: the result of the operation is ±∞ or ±MAX depending on the rounding mode
- Underflow: the result of the operation is a denormalized number
- Inexact result: caused by rounding
Subnormal Numbers#
If the result of a calculation is smaller than the smallest normal number there are two option:
- hard underflow: directly assign zero
- gradual underflow: subnormal number
To prevent unwanted behavior by jumping directly to zero, subnormal numbers fill the gap between zero and the smallest normal number.
Not-A-Number (NaN)#
References#
- IEEE: IEEE-754-2019 Standard, 2019
- David Goldberg: What Every Computer Scientist Should Know About Floating-Point Arithmetic, 1991