The bit layout — how a float is actually stored
Every floating-point number is stored as three fields: sign, exponent (scale), and mantissa (precision). The formula is always: (-1)^sign × 2^(exp−bias) × 1.mantissa
FP32
32 bits · bias 127
Range ±3.4×10³⁸ | Precision ~7 decimal digits
FP16
16 bits · bias 15
Range ±65,504 | Precision ~3 decimal digits | Standard inference format
BF16
16 bits · bias 127 — keeps FP32's exponent, sacrifices mantissa
Range ±3.4×10³⁸ (same as FP32) | Precision ~2–3 decimal digits | Preferred for training
FP8 E4M3 / E5M2
8 bits · two variants
E4M3 — weights/activations
H100/H200 only | E4M3 = more precision (weights) | E5M2 = more range (gradients)
INT4
4 bits · not floating point — integers only
−8 to 7 (signed) | 16 discrete values
Weights mapped via float ≈ scale × (int − zero_point) | GPTQ / AWQ reduce quality loss
Format comparison — memory, speed, quality
| Format |
Bits |
Exp / Mant |
Max value |
Precision |
Mem vs FP32 |
Best for |
| FP32 |
32 |
8 / 23 |
3.4×10³⁸ |
~7 digits |
1× |
Training baseline |
| FP16 |
16 |
5 / 10 |
65,504 |
~3 digits |
2× |
Standard inference |
| BF16 |
16 |
8 / 7 |
3.4×10³⁸ |
~2 digits |
2× |
Training / LLM inference |
| FP8 |
8 |
4/3 or 5/2 |
448 |
~1 digit |
4× |
H100 inference/training |
| INT4 |
4 |
— / — |
scale-dep. |
16 values |
8× |
Edge / Mac / Ollama |
Training vs inference — what matters and why
Training — needs range (exponent)
Backpropagation produces gradients that can span many orders of magnitude simultaneously across layers. A gradient of 70,000 in FP16's 5-bit exponent overflows to ∞ and poisons the entire update.
FP16 overflow threshold: 2^(31−15) = 2^16 = 65,504
BF16 overflow threshold: 2^(254−127) = 2^127 = 3.4×10³⁸
BF16 matches FP32 range exactly — that's the entire reason it exists.
Inference — needs precision (mantissa)
Weights are frozen. Activation values accumulate through dozens of layers. Small rounding errors compound. The subtle difference between weight 0.10312 and 0.10318 can affect which token gets selected.
FP32 mantissa: 23 bits → ~7 decimal digits
FP16 mantissa: 10 bits → ~3 decimal digits
INT4: 0 mantissa → 16 discrete values
At INT4, two different FP16 weights that should be distinct may collapse to the same integer bucket.
Four reasons precision matters at inference
Weight fidelity
Weights trained in FP32/BF16 are loaded into a lower-precision format. Mantissa truncation is a lossy compression — information is permanently discarded. Techniques like AWQ protect the ~1% of "salient" weights that carry disproportionate importance.
Activation accumulation
Each transformer layer multiplies weights by activations and adds them up. With 96 layers (GPT-4 class), rounding error accumulates. FP16 with 10 mantissa bits handles this acceptably. INT4 with no mantissa can drift noticeably on complex reasoning.
KV cache
During generation, attention keys and values are cached in memory. This cache can be quantized independently — INT8 KV cache is common — because the softmax normalization provides some error correction before weights are applied.
Embedding lookup
Token embeddings are the first layer — the model's vocabulary representation. Quantizing embeddings to INT4 is particularly risky because there's no earlier layer to absorb errors. Most pipelines keep embeddings at FP16 or higher even when the rest of the model is INT4.
Why this connects to EdgeIQ's RAG flywheel
EdgeIQ Hockey
Vector RAG enhances inference quality without changing model precision
The model runs at INT4 or FP16 on your Mac Mini M4 via Ollama. Your RAG pipeline compensates for the precision loss by injecting high-fidelity, domain-specific context at inference time — so the model doesn't need to have memorized hockey nuance in its weights.
01
Query arrives — "Why did the power play fail in period 2?"
02
Embed query — OpenAI embeddings convert to a float vector (FP32)
03
HNSW search — HarperDB 4.6 vector index returns top-k relevant shift/event chunks
04
Context injection — retrieved chunks become high-precision grounding in the prompt
05
LLM generates — model (INT4/FP16) reasons over injected context, not memorized weights
→
Precision gap is filled by retrieval, not by bigger weights. A 7B model + good RAG often beats a 70B model without it on domain tasks.