Representation of Floating Point in Bits
Understand Precision
a) 1.12547e-06 has higher precision.
Precision = number of significant figures.
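- 1.12547e-06 → 6 significant figures (1, 1, 2, 5, 4, 7)
- 1.12000e-06 → 3 informative figures (1, 1, 2), assuming the trailing zeros are padding from the print format rather than measured digits. (Under the strict significant-figure convention, trailing zeros written after the decimal point would count as significant.)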
Higher precision means the value is measured/known to a finer level of detail. 1.12547 tells you more about the actual value than 1.12000 does.
Note: Precision ≠ Accuracy. A precise number isn't necessarily close to the true value — it just has more digits of specificity.
Number Representation Overview
A number can be stored in two broad forms:
- Fixed Point: the radix (decimal or binary) point sits at a fixed position
  - Integers: 1., 2., 3. → point after the LSB
  - Fractions: 0.12, 0.22, 0.34 → point before the MSB
- Floating Point: the point can sit at any position (1.52, 0.23, 15.4...)
All forms of numbers have 2 variants: Signed and Unsigned
- Signed: the first bit stores the sign (0 = positive, 1 = negative)
- Unsigned: cannot represent negative numbers
Representation Using Different Formats (4-bit example)
Consider 4 bits. For now we look only at integer encodings, not floating point.
| Bits (A B C D) | Unsigned | Signed (sign-magnitude) | 1's Complement | 2's Complement |
|---|---|---|---|---|
| 0 0 0 0 | 0 | 0 | 0 | 0 |
| 0 0 0 1 | 1 | 1 | 1 | 1 |
| 0 0 1 0 | 2 | 2 | 2 | 2 |
| 0 0 1 1 | 3 | 3 | 3 | 3 |
| 0 1 0 0 | 4 | 4 | 4 | 4 |
| 0 1 0 1 | 5 | 5 | 5 | 5 |
| 0 1 1 0 | 6 | 6 | 6 | 6 |
| 0 1 1 1 | 7 | 7 | 7 | 7 |
| 1 0 0 0 | 8 | -0 | -7 | -8 |
| 1 0 0 1 | 9 | -1 | -6 | -7 |
| 1 0 1 0 | 10 | -2 | -5 | -6 |
| 1 0 1 1 | 11 | -3 | -4 | -5 |
| 1 1 0 0 | 12 | -4 | -3 | -4 |
| 1 1 0 1 | 13 | -5 | -2 | -3 |
| 1 1 1 0 | 14 | -6 | -1 | -2 |
| 1 1 1 1 | 15 | -7 | -0 | -1 |
Signed vs Unsigned — How They Work
Unsigned: no negatives. Convert the binary pattern to an integer directly.
Signed (sign-magnitude)
- The first bit (MSB) represents the sign: MSB 0 = positive, MSB 1 = negative
- Only the 3 remaining bits represent the value
- Convert the bits (excluding the sign bit), then apply ± based on the MSB
Range for n bits (Signed): $-(2^{n-1} - 1) \text{ to } +(2^{n-1} - 1)$, i.e. $2^n - 1$ distinct values, because zero has two encodings (+0 and -0).
For n = 4: $2^{4-1} - 1 = 7 \quad \Rightarrow \quad \text{range: } -7 \text{ to } +7$
1's Complement
- To negate: flip all bits
- Example: 0001 = +1 → 1110 = -1
Range: $-(2^{n-1} - 1) \text{ to } +(2^{n-1} - 1)$
2's Complement
- To negate: flip all bits, then add 1 and ignore any carry out
- Example: 0001 = +1 → 1110 + 1 = 1111 = -1
Range: $-2^{n-1} \text{ to } +(2^{n-1} - 1)$
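The four columns of the 4-bit table can be checked with a short sketch. The function name `interpret` and the dictionary keys are illustrative choices, not part of any library.

```python
def interpret(bits: str) -> dict:
    """Interpret a bit string under four integer encodings."""
    n = len(bits)
    unsigned = int(bits, 2)
    negative = bits[0] == "1"
    magnitude = int(bits[1:], 2)  # value of the bits after the sign bit

    # Sign-magnitude: sign bit plus magnitude.
    # Python ints have no negative zero, so "1000" shows up as 0.
    sign_mag = -magnitude if negative else magnitude
    # 1's complement: a negative pattern is the bitwise inverse of its magnitude.
    ones = -((2**n - 1) - unsigned) if negative else unsigned
    # 2's complement: the MSB carries weight -2^(n-1).
    twos = unsigned - 2**n if negative else unsigned

    return {
        "unsigned": unsigned,
        "sign_magnitude": sign_mag,
        "ones_complement": ones,
        "twos_complement": twos,
    }


for pattern in ("0111", "1000", "1010", "1111"):
    print(pattern, interpret(pattern))
```

For example, `interpret("1010")` yields 10, -2, -5 and -6, matching the table row for 1 0 1 0.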
Converting 1.2 (Decimal) to Binary
Integer part: 1 → 1 in binary
Fractional part (multiply by 2, record integer):
| Step | Result | Bit |
|---|---|---|
| 0.2 × 2 | 0.4 | 0 |
| 0.4 × 2 | 0.8 | 0 |
| 0.8 × 2 | 1.6 | 1 |
| ... | ... | ... |
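$0.2 \rightarrow (0.001\ldots)_2$, therefore $1.2 \rightarrow (1.001\ldots)_2$. The next steps give 0.6 × 2 = 1.2 (bit 1) and then 0.2 × 2 again, so the pattern 0011 repeats forever and 1.2 has no exact finite binary representation.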
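The multiply-by-2 procedure can be sketched in a few lines. The helper name `fraction_to_binary` is hypothetical, and `fractions.Fraction` is used so the arithmetic is exact rather than itself subject to floating-point rounding.

```python
from fractions import Fraction


def fraction_to_binary(frac: Fraction, max_bits: int = 12) -> str:
    """Return up to max_bits binary digits of a fraction in [0, 1)."""
    bits = []
    for _ in range(max_bits):
        frac *= 2
        bit = int(frac)       # the integer part is the next binary digit
        bits.append(str(bit))
        frac -= bit           # keep only the fractional part
        if frac == 0:         # expansion terminated
            break
    return "".join(bits)


print("0." + fraction_to_binary(Fraction("0.2")))    # repeating pattern
print("0." + fraction_to_binary(Fraction("0.625")))  # terminates
```

The first call shows the repeating 0011 pattern of 0.2, while 0.625 terminates after three bits.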
Problems With Naive Binary Float Representation
- Can't be efficiently stored in hardware
- Not standard across electronics
- Requires too many bits for large floats
Solution: Use a standardized Floating Point representation.
Same binary, but add one more step — convert to standard (normalized) form.
Bit layout (e.g. 10-bit):
| S | exponent | mantissa |
Normalization
The number (1.001)₂ can be written in many ways. We normalize to pick one standard form.
① Explicit Normalization
Place the binary point just before the most significant 1
| Original | Normalized Form |
|---|---|
| (1.001)₂ | 0.1001 × 2¹ |
| (0.001)₂ | 0.1 × 2⁻² |
② Implicit Normalization ✅ (commonly used)
Place the binary point just after the most significant 1
| Original | Normalized Form |
|---|---|
| (1.001)₂ | 1.001 × 2⁰ |
| (0.001)₂ | 1 × 2⁻³ |
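
Both normalization styles can be reproduced with Python's standard `math.frexp`, which returns a mantissa in [0.5, 1) and an exponent, i.e. the explicit (0.1xxx) form. The wrapper `normalize` is an illustrative name; the sketch assumes a positive input.

```python
import math


def normalize(x: float):
    """Return (explicit, implicit) normalized forms of a positive float."""
    m, e = math.frexp(x)        # x == m * 2**e with 0.5 <= m < 1
    explicit = (m, e)           # 0.1xxx * 2^e
    implicit = (m * 2, e - 1)   # 1.xxx * 2^(e-1)
    return explicit, implicit


print(normalize(1.125))  # (1.001)_2
print(normalize(0.125))  # (0.001)_2
```

Here 0.5625 is (0.1001)₂ and 0.5 is (0.1)₂, so the results line up with the two tables above.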
Storing the Exponent — Biased Representation
After normalization, sign and mantissa are stored directly.
For the exponent, we don't store it raw — we use biased (excess) representation.
Bias formula: $b = 2^{n-1} - 1$
For a 4-bit exponent field the formula gives $b = 2^{4-1} - 1 = 7$. The example below instead uses a bias of 8 ($2^{n-1}$, excess-8), another common choice.
Example: exponent = -2 with bias 8: $-2 + 8 = 6 \quad \Rightarrow \quad \text{store } 6 = (0110)_2$
Full bit layout example:
| S | exponent | mantissa |
|---|---|---|
| 0 | 0 1 1 0 | 0 0 1 0 0 |
Decoding formula: $(-1)^S \times 2^{(\text{exp} - b)} \times 0.\text{Mantissa}$
Note: Converting from one floating-point format to another can overflow or underflow when the ranges of the two formats do not match.
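As a rough check of the decoding formula, the sketch below decodes the 10-bit example layout. It assumes the bias of 8 used in the example and the $0.\text{Mantissa}$ (explicit normalization) reading; `decode_toy` is a hypothetical helper name.

```python
def decode_toy(sign: int, exp_bits: str, man_bits: str, bias: int = 8) -> float:
    """Decode with (-1)^S * 2^(exp - bias) * 0.Mantissa (bias 8 is an assumption)."""
    exponent = int(exp_bits, 2) - bias                # remove the bias
    mantissa = int(man_bits, 2) / 2 ** len(man_bits)  # value of 0.<man_bits>
    return (-1) ** sign * 2.0 ** exponent * mantissa


# Sign 0, exponent 0110 (6 - 8 = -2), mantissa 00100 (0.125)
print(decode_toy(0, "0110", "00100"))
```

With these assumptions the example pattern decodes to $2^{-2} \times 0.125 = 0.03125$.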
Floating Point Formats — FP32, FP16, BF16, Float8
Format Comparison
| Format | Total Bits | Sign | Exponent | Mantissa | Bias |
|---|---|---|---|---|---|
| FP32 | 32 | 1 | 8 | 23 | 127 |
| FP16 | 16 | 1 | 5 | 10 | 15 |
| BF16 | 16 | 1 | 8 | 7 | 127 |
| Float8 (E4M3) | 8 | 1 | 4 | 3 | 7 |
| Float8 (E5M2) | 8 | 1 | 5 | 2 | 15 |
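Decoding formula (all formats, normal numbers): $(-1)^S \times 2^{(\text{exp} - \text{bias})} \times 1.\text{Mantissa}$, where exp is the stored exponent field.
The significand is $1.\text{Mantissa}$, not $0.\text{Mantissa}$, because these formats use implicit normalization: the leading 1 is not stored.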
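The shared decoding formula can be sketched as one function parameterized by the field widths from the comparison table. `decode_normal` is an illustrative name, and the sketch deliberately ignores the reserved exponent patterns (all 0s and all 1s).

```python
def decode_normal(bits: str, exp_bits: int, man_bits: int) -> float:
    """Decode a normal number from a bit string laid out as S | exponent | mantissa."""
    if len(bits) != 1 + exp_bits + man_bits:
        raise ValueError("bit string length does not match the format")
    sign = int(bits[0])
    stored_exp = int(bits[1:1 + exp_bits], 2)
    fraction = int(bits[1 + exp_bits:], 2) / 2 ** man_bits
    bias = 2 ** (exp_bits - 1) - 1
    # Implicit normalization: the leading 1 is added back here.
    return (-1) ** sign * 2.0 ** (stored_exp - bias) * (1 + fraction)


print(decode_normal("0011110000000000", 5, 10))  # FP16 pattern
print(decode_normal("0100000001001001", 8, 7))   # BF16 pattern
print(decode_normal("11000100", 5, 2))           # E5M2 pattern
```

The three patterns decode to 1.0, 3.140625 and -4.0 respectively.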
FP32 — Single Precision (IEEE 754)
Bit layout:
| S (1) | Exponent (8) | Mantissa (23) |
Bias: $b = 2^{8-1} - 1 = 127$
Max exponent stored: 1111 1110 = 254
(255 = 1111 1111 is reserved for Inf / NaN)
Actual max exponent: $254 - 127 = 127$
Max mantissa (23 bits all 1s): $1.\underbrace{111...1}_{23} = 2 - 2^{-23}$
Max value: $(2 - 2^{-23}) \times 2^{127} \approx 3.4 \times 10^{38}$
Min positive normal: $2^{(1 - 127)} = 2^{-126} \approx 1.18 \times 10^{-38}$
Range: $\approx \pm 3.4 \times 10^{38}$
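To see the three FP32 fields of a concrete value, the standard `struct` module can reinterpret the 32 bits as an integer. The helper name `fp32_fields` is an illustrative choice.

```python
import struct


def fp32_fields(x: float):
    """Return (sign, stored_exponent, mantissa) of x rounded to FP32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # top bit
    stored_exp = (bits >> 23) & 0xFF   # next 8 bits, still biased by 127
    mantissa = bits & 0x7FFFFF         # low 23 bits
    return sign, stored_exp, mantissa


print(fp32_fields(1.0))
print(fp32_fields(-2.5))
```

For 1.0 the stored exponent is 127 (actual exponent 0) and the mantissa is 0. For -2.5 = -1.25 × 2¹ the sign is 1, the stored exponent is 128 and only the second mantissa bit is set.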
FP16 — Half Precision (IEEE 754)
Bit layout:
| S (1) | Exponent (5) | Mantissa (10) |
Bias: $b = 2^{5-1} - 1 = 15$
Max exponent stored: 1 1110 = 30
(31 = 1 1111 is reserved for Inf / NaN)
Actual max exponent: $30 - 15 = 15$
Max mantissa (10 bits all 1s): $1.\underbrace{111...1}_{10} = 2 - 2^{-10}$
Max value: $(2 - 2^{-10}) \times 2^{15} = 1.9990234 \times 32768 \approx 65504$
Min positive normal: $2^{(1 - 15)} = 2^{-14} \approx 6.1 \times 10^{-5}$
Range: $\approx \pm 65504$
⚠️ This is why FP16 overflows during training — values like loss or gradients can easily exceed 65504.
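The overflow limit can be observed with the `e` (half precision) format of Python's `struct` module. In this sketch `to_fp16` is a hypothetical helper that returns `None` whenever the value does not fit as a finite FP16 number.

```python
import math
import struct


def to_fp16(x: float):
    """Round x to FP16 and back; return None if it does not fit as a finite value."""
    try:
        y = struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return None                      # packing failed: value too large
    return None if math.isinf(y) else y  # also treat infinity as "does not fit"


print(to_fp16(65504.0))  # largest finite FP16 value
print(to_fp16(70000.0))  # beyond the FP16 range
print(to_fp16(0.1))      # representable only approximately
```

The last call also shows the precision loss: with a 10-bit mantissa, 0.1 comes back as 0.0999755859375.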
BF16 — Brain Float 16 (Google)
Bit layout:
| S (1) | Exponent (8) | Mantissa (7) |
BF16 = FP32 with the bottom 16 mantissa bits chopped off. Same exponent range as FP32, just less precision.
Bias: $b = 2^{8-1} - 1 = 127$
Max exponent stored: 1111 1110 = 254
Actual max exponent: $254 - 127 = 127$
Max mantissa (7 bits all 1s): $1.\underbrace{111...1}_{7} = 2 - 2^{-7}$
Max value: $(2 - 2^{-7}) \times 2^{127} \approx 3.39 \times 10^{38}$
Min positive normal: $2^{-126} \approx 1.18 \times 10^{-38}$
Range: $\approx \pm 3.39 \times 10^{38}$
✅ BF16 has same range as FP32 — that's why it replaced FP16 for training (no overflow). You lose precision, not range.
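The "FP32 with the bottom 16 bits chopped off" view can be sketched directly. This version truncates, which is a simplifying assumption; real conversions typically round to nearest instead. `to_bf16_truncate` is an illustrative name.

```python
import struct


def to_bf16_truncate(x: float) -> float:
    """Keep only the top 16 bits of the FP32 encoding of x (truncation, no rounding)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    bits &= 0xFFFF0000  # zero the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(to_bf16_truncate(70000.0))  # still finite, unlike FP16
print(to_bf16_truncate(1.0))
```

A value of 70000 stays finite but becomes 69632, which shows the trade-off: the range is kept and the precision is reduced.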
Float8 — Two Variants
Variant 1: E4M3 (used for Forward Pass)
Bit layout:
| S (1) | Exponent (4) | Mantissa (3) |
Bias: $b = 2^{4-1} - 1 = 7$
Max exponent stored: 1110 = 14
(1111 = 15 reserved for NaN)
Actual max exponent: $14 - 7 = 7$
Max mantissa (3 bits all 1s): $1.111 = 1 + \frac{7}{8} = \frac{15}{8}$
Max value: $\frac{15}{8} \times 2^{7} = \frac{15}{8} \times 128 = 240$
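Range: $\pm 240$ under this IEEE-style reservation of exponent 1111.
Note: the widely used OCP FP8 E4M3 variant reserves only the single pattern with exponent 1111 and mantissa 111 for NaN and has no infinity, so its largest value is $1.75 \times 2^{8} = 448$.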
Variant 2: E5M2 (used for Gradient Storage)
Bit layout:
| S (1) | Exponent (5) | Mantissa (2) |
Bias: $b = 2^{5-1} - 1 = 15$
Max exponent stored: 1 1110 = 30
Actual max exponent: $30 - 15 = 15$
Max mantissa (2 bits all 1s): $1.11 = 1 + \frac{3}{4} = \frac{7}{4}$
Max value: $\frac{7}{4} \times 2^{15} = 1.75 \times 32768 = 57344$
Range: $\approx \pm 57344$
FP32 Range Calculation
The range of 32-bit floating-point numbers is determined by the IEEE 754 structure: 1 bit for sign, 8 bits for exponent, and 23 bits for fraction (mantissa).
1. The Formula
The value of a normalized number is calculated as:
$\text{Value} = (-1)^{\text{sign}} \times (1 + \text{fraction}) \times 2^{(\text{exponent} - 127)}$
2. Why the Range starts at $10^{-38}$
Although 8 bits can represent 0 to 255, the standard reserves the boundaries:
- Exponent 255: Reserved for Infinity and NaN (Not a Number).
- Exponent 0: Reserved for Zero and Subnormal numbers.
This leaves the usable "normalized" exponent range from 1 to 254. After applying the 127 bias:
- Min Normalized Exponent: $1 - 127 = -126$
- Max Normalized Exponent: $254 - 127 = 127$
3. The Resulting Limits
- Smallest Positive: $1.0 \times 2^{-126} \approx \mathbf{1.18 \times 10^{-38}}$
- Largest Positive: $(2 - 2^{-23}) \times 2^{127} \approx \mathbf{3.40 \times 10^{38}}$
The "Reservation" Impact: Without reserving the 0 exponent, the minimum would be $2^{-127}$ ($\approx 5.88 \times 10^{-39}$), but we would lose the ability to represent a bit-pattern for True Zero.
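The reserved exponent values can be inspected by building FP32 values from raw bit patterns. The helper `fp32_from_bits` is a hypothetical name; the hex constants are the boundary patterns described above.

```python
import math
import struct


def fp32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit integer as an FP32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


smallest_normal = fp32_from_bits(0x00800000)  # exponent field 1, fraction 0
largest_finite = fp32_from_bits(0x7F7FFFFF)   # exponent field 254, fraction all 1s
zero = fp32_from_bits(0x00000000)             # exponent field 0, fraction 0
infinity = fp32_from_bits(0x7F800000)         # exponent field 255, fraction 0

print(smallest_normal == 2.0 ** -126)
print(largest_finite == (2 - 2.0 ** -23) * 2.0 ** 127)
print(zero, math.isinf(infinity))
```

Both comparisons hold, confirming that exponent fields 1 and 254 mark the ends of the normalized range while 0 and 255 are reserved.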
Full Range Summary
| Format | Max Value | Min Positive Normal | Precision (mantissa bits) |
|---|---|---|---|
| FP32 | ~3.4 × 10³⁸ | ~1.18 × 10⁻³⁸ | 23 bits (~7 decimal digits) |
| FP16 | 65504 | ~6.1 × 10⁻⁵ | 10 bits (~3 decimal digits) |
| BF16 | ~3.39 × 10³⁸ | ~1.18 × 10⁻³⁸ | 7 bits (~2 decimal digits) |
| Float8 E4M3 | 240 (IEEE-style; 448 in the OCP E4M3 variant) | ~0.016 | 3 bits |
| Float8 E5M2 | 57344 | ~6.1 × 10⁻⁵ | 2 bits |
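
The range columns of the summary can be recomputed from the exponent and mantissa widths alone. The sketch assumes the IEEE-style convention used in the derivations above, where the all-ones exponent is fully reserved; `ieee_style_range` is an illustrative name.

```python
def ieee_style_range(exp_bits: int, man_bits: int):
    """Return (max_finite, min_positive_normal), assuming the all-ones
    exponent is reserved and the all-zeros exponent holds zero/subnormals."""
    bias = 2 ** (exp_bits - 1) - 1
    max_stored_exp = 2 ** exp_bits - 2  # largest non-reserved exponent field
    max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** (max_stored_exp - bias)
    min_normal = 2.0 ** (1 - bias)      # exponent field 1, mantissa 0
    return max_finite, min_normal


formats = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7),
           "E4M3": (4, 3), "E5M2": (5, 2)}
for name, (e, m) in formats.items():
    print(name, ieee_style_range(e, m))
```

Under this convention FP16 and E5M2 share the same minimum positive normal, $2^{-14}$, because both use a 5-bit exponent.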
Why This Matters for ML
| Use Case | Recommended Format | Reason |
|---|---|---|
| Full precision training | FP32 | No overflow, high precision |
| Mixed precision training | BF16 + FP32 | BF16 same range as FP32, faster |
| Inference | FP16 or BF16 | Fast, small memory |
| Quantized inference | Float8 E4M3 | Tiny memory, acceptable accuracy |
| Gradient storage | Float8 E5M2 | Wider range needed for gradients |