Floating Point Number
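A float represents a number in the following form, shown here for single precision, where \(s\) is the sign bit, \(m\) the mantissa bits, and \(e\) the stored (biased) exponent: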
\[ (-1)^s \cdot 1.m \cdot 2^{e-127} \]
| Precision | Width | Exponent | Bias |
|---|---|---|---|
| Half | 16 bits | 5 bits | 15 |
| Single | 32 bits | 8 bits | 127 |
| Double | 64 bits | 11 bits | 1023 |
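Example: \(-7 = -1.11_2 \cdot 2^2\), so \(s = 1\), the mantissa bits are \(11\) followed by zeros, and in single precision the stored exponent is \(e = 2 + 127 = 129\).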
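The example can be checked by pulling the three fields out of the bit pattern. The following is a minimal sketch using Python's standard `struct` module; the helper names `decode_float32` and `normal_value` are made up for illustration, and `normal_value` only handles normal numbers (stored exponent 1 to 254).

```python
import struct

def decode_float32(x):
    """Return (sign, stored exponent, mantissa) of x rounded to single precision."""
    # Pack as big-endian IEEE 754 binary32, then reinterpret the 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, still biased by 127
    mantissa = bits & 0x7FFFFF       # 23 bits, without the implicit leading 1
    return sign, exponent, mantissa

def normal_value(sign, exponent, mantissa):
    """Evaluate (-1)^s * 1.m * 2^(e-127) for a normal single-precision number."""
    return (-1.0) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)

s, e, m = decode_float32(-7.0)
print(s, e, f"{m:023b}")      # prints: 1 129 11000000000000000000000
print(normal_value(s, e, m))  # prints: -7.0
```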
Special Numbers
In addition to normal numbers, floats can represent signed zeros (\(\pm 0\)), infinities (\(\pm \infty\)), subnormal numbers, and Not-A-Number (NaN) values.
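The encodings for single precision are listed below, with the exponent given as the stored (biased) 8-bit field \(e\):

| Number | Sign | Stored exponent | Mantissa |
|---|---|---|---|
| Normal | 0 or 1 | [1, 254] | \([0, 2^{23}-1]\) |
| \(\pm 0\) | 0 or 1 | 0 | 0 |
| \(\pm \infty\) | 0 or 1 | 255 | 0 |
| Subnormal | 0 or 1 | 0 | != 0 |
| QNaN | 0 or 1 | 255 | != 0 & MSB = 1 |
| SNaN | 0 or 1 | 255 | != 0 & MSB = 0 |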
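The encoding rules can be turned into a small classifier. This is a sketch, not a reference implementation: it works on the raw 32-bit pattern of a single-precision value, reads the exponent as the stored 8-bit field (0 to 255, so all zeros marks zero/subnormal and all ones marks infinity/NaN), and the function name `classify_float32` is hypothetical.

```python
def classify_float32(bits):
    """Classify a raw 32-bit single-precision pattern by its exponent and mantissa fields."""
    exponent = (bits >> 23) & 0xFF   # stored (biased) exponent
    mantissa = bits & 0x7FFFFF       # 23 mantissa bits
    if exponent == 0:
        # All-zero exponent: signed zero or subnormal.
        return "zero" if mantissa == 0 else "subnormal"
    if exponent == 0xFF:
        # All-one exponent: infinity or NaN.
        if mantissa == 0:
            return "infinity"
        # The mantissa MSB (bit 22) separates quiet from signaling NaNs.
        return "qnan" if mantissa & 0x400000 else "snan"
    return "normal"

for pattern in (0x3F800000, 0x80000000, 0x7F800000, 0x00000001, 0x7FC00000, 0x7F800001):
    print(f"{pattern:#010x}", classify_float32(pattern))
```

Raw bit patterns are used as input on purpose: converting a signaling NaN through a host float may quiet it on some platforms, which would hide the QNaN/SNaN distinction.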
Exceptions
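IEEE 754 defines five exceptions:
- Invalid operation: the operation has no defined result (for example \(0/0\) or \(\infty - \infty\)); the default result is a NaN
- Division by zero: a finite nonzero number is divided by zero; the default result is ±∞
- Overflow: the result is too large in magnitude to represent; the result is ±∞ or ±MAX depending on the rounding mode
- Underflow: the result is nonzero but smaller in magnitude than the smallest normal number; the default result is a subnormal number or zero
- Inexact result: the result had to be rounded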
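The default results of these exceptions can be observed from Python, with two caveats that are properties of Python rather than of the standard: Python floats are double precision, and the language does not expose the IEEE status flags, so the sketch below only shows the values produced. Python also traps float division by zero and raises `ZeroDivisionError` instead of returning ±∞.

```python
import math
import sys

invalid = math.inf - math.inf          # invalid operation: default result is NaN
overflow = sys.float_info.max * 2.0    # overflow: +inf under round-to-nearest
underflow = sys.float_info.min / 3.0   # underflow: inexact subnormal result
inexact = 0.1 + 0.2                    # inexact: the sum is rounded

try:
    1.0 / 0.0
    div_by_zero = "returned a value"
except ZeroDivisionError:
    # Python raises here rather than delivering the IEEE default result.
    div_by_zero = "raised ZeroDivisionError"

print(invalid, overflow, underflow, inexact == 0.3, div_by_zero)
```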
Subnormal Numbers
If the magnitude of a nonzero result is smaller than the smallest normal number, there are two options:
- hard underflow (flush to zero): the result is set to zero
- gradual underflow: the result is rounded to the nearest subnormal number
Gradual underflow avoids the abrupt jump to zero: subnormal numbers fill the gap between zero and the smallest normal number, at the cost of reduced precision.
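The difference between the two options shows up when subtracting two nearby small numbers. The sketch below assumes double-precision Python floats on a platform that does not flush subnormals to zero; `flush_to_zero` is a hypothetical helper that imitates hard underflow in software.

```python
import math
import sys

smallest_normal = sys.float_info.min   # 2**-1022 for double precision

def flush_to_zero(x):
    """Imitate hard underflow: anything below the smallest normal number becomes zero."""
    return 0.0 if abs(x) < smallest_normal else x

a = 1.5 * smallest_normal
b = smallest_normal
diff = a - b                           # exactly 2**-1023, a subnormal number

print(a != b, diff)                    # gradual underflow: a != b and a - b != 0
print(flush_to_zero(diff))             # hard underflow: a != b but a - b == 0
```

With gradual underflow, `a != b` guarantees `a - b != 0`, so a guard such as `if a != b: 1 / (a - b)` cannot divide by zero; with hard underflow that guarantee is lost.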
Not-A-Number (NaN)