Floating Point Number#

Floats represents a

\[ (-1)^s \cdot 1.m \cdot 2^{e-127} \]

Precision	Width	Exp.	Bias
Half	16 bit	5 bit	15
Single	32 bit	8 bit	127
Double	64 bit	11bit	1023

Example -7 = 1.11 * 2\^2

Special Numbers#

Floats can also represent signed zeros (\(\pm 0\)), infinity \(\pm \infty\), and Not-A-Number (NaN).

Num	Sign	Exp.	Mant.
Normal	[0,1]	[-127,126]	\([0,2^{23}]\)
\(\pm 0\)	[0,1]	-128	0
\(\pm \infty\)	[0,1]	127	0
Subnormal	[0,1]	-128	!= 0
QNaN	[0,1]	127	!=0 & MSB=1
SNaN	[0,1]	127	!=0 & MSB=0

5 exceptions are supported:

Invalid operation: the result of the operation is a NaN
Division by zero
Overflow: the result of the operation is ±∞ or ±MAX depending on the rounding mode
Underflow: the result of the operation is a denormalized number
Inexact result: caused by rounding

If the result of a calculation is smaller than the smallest normal number there are two option:

To prevent unwanted behavior by jumping directly to zero, subnormal numbers fill the gap between zero and the smallest normal number.