Inference at the Edge — FTS MCP Server

The bit layout — how a float is actually stored

Every floating-point number is stored as three fields: sign, exponent (scale), and mantissa (precision). The formula is always: (-1)^sign × 2^(exp−bias) × 1.mantissa

FP32 32 bits · bias 127

1 + 8 + 23

Range ±3.4×10³⁸ | Precision ~7 decimal digits

FP16 16 bits · bias 15

1 + 5 + 10

Range ±65,504 | Precision ~3 decimal digits | Standard inference format

BF16 16 bits · bias 127 — keeps FP32's exponent, sacrifices mantissa

1 + 8 + 7

Range ±3.4×10³⁸ (same as FP32) | Precision ~2–3 decimal digits | Preferred for training

FP8 E4M3 / E5M2 8 bits · two variants

E4M3 — weights/activations

1+4+3

E5M2 — gradients

1+5+2

H100/H200 only | E4M3 = more precision (weights) | E5M2 = more range (gradients)

INT4 4 bits · not floating point — integers only

−8 to 7 (signed) | 16 discrete values

Weights mapped via float ≈ scale × (int − zero_point) | GPTQ / AWQ reduce quality loss

Sign

Exponent (range)

Mantissa (precision)

Integer

Format comparison — memory, speed, quality

Format	Bits	Exp / Mant	Max value	Precision	Mem vs FP32	Best for
FP32	32	8 / 23	3.4×10³⁸	~7 digits	1×	Training baseline
FP16	16	5 / 10	65,504	~3 digits	2×	Standard inference
BF16	16	8 / 7	3.4×10³⁸	~2 digits	2×	Training / LLM inference
FP8	8	4/3 or 5/2	448	~1 digit	4×	H100 inference/training
INT4	4	— / —	scale-dep.	16 values	8×	Edge / Mac / Ollama

Training vs inference — what matters and why

Training — needs range (exponent)

Backpropagation produces gradients that can span many orders of magnitude simultaneously across layers. A gradient of 70,000 in FP16's 5-bit exponent overflows to ∞ and poisons the entire update.

FP16 overflow threshold: 2^(31−15) = 2^16 = 65,504
BF16 overflow threshold: 2^(254−127) = 2^127 = 3.4×10³⁸ BF16 matches FP32 range exactly — that's the entire reason it exists.

Inference — needs precision (mantissa)

Weights are frozen. Activation values accumulate through dozens of layers. Small rounding errors compound. The subtle difference between weight 0.10312 and 0.10318 can affect which token gets selected.

FP32 mantissa: 23 bits → ~7 decimal digits
FP16 mantissa: 10 bits → ~3 decimal digits
INT4: 0 mantissa → 16 discrete values At INT4, two different FP16 weights that should be distinct may collapse to the same integer bucket.

Four reasons precision matters at inference

Weight fidelity

Weights trained in FP32/BF16 are loaded into a lower-precision format. Mantissa truncation is a lossy compression — information is permanently discarded. Techniques like AWQ protect the ~1% of "salient" weights that carry disproportionate importance.

Activation accumulation

Each transformer layer multiplies weights by activations and adds them up. With 96 layers (GPT-4 class), rounding error accumulates. FP16 with 10 mantissa bits handles this acceptably. INT4 with no mantissa can drift noticeably on complex reasoning.

KV cache

During generation, attention keys and values are cached in memory. This cache can be quantized independently — INT8 KV cache is common — because the softmax normalization provides some error correction before weights are applied.

Embedding lookup

Token embeddings are the first layer — the model's vocabulary representation. Quantizing embeddings to INT4 is particularly risky because there's no earlier layer to absorb errors. Most pipelines keep embeddings at FP16 or higher even when the rest of the model is INT4.

Why this connects to EdgeIQ's RAG flywheel

EdgeIQ Hockey

Vector RAG enhances inference quality without changing model precision

The model runs at INT4 or FP16 on your Mac Mini M4 via Ollama. Your RAG pipeline compensates for the precision loss by injecting high-fidelity, domain-specific context at inference time — so the model doesn't need to have memorized hockey nuance in its weights.

01 Query arrives — "Why did the power play fail in period 2?"

02 Embed query — OpenAI embeddings convert to a float vector (FP32)

03 HNSW search — HarperDB 4.6 vector index returns top-k relevant shift/event chunks

04 Context injection — retrieved chunks become high-precision grounding in the prompt

05 LLM generates — model (INT4/FP16) reasons over injected context, not memorized weights

→ Precision gap is filled by retrieval, not by bigger weights. A 7B model + good RAG often beats a 70B model without it on domain tasks.

Numeric Precision at Inference