Inference vs Training — FTS MCP Server

What is inference — using the model

Inference: Forward Pass Only

Inference is the process of running input data through a trained model to produce a prediction. The model's weights are frozen and never change. The only computation is the forward pass through the network (one forward pass per generated token for autoregressive models).

output = model(input)
# No gradients computed
# No weights updated
# Optimized for throughput and latency

Inference is analogous to running a compiled program: the "compilation" (training) has already happened.

Key characteristics: deterministic computation path, memory footprint proportional to model size + KV cache, can be heavily optimized via quantization (INT4/INT8), batching, and speculative decoding. The model is essentially a function — input goes in, output comes out.

What is training — building the model

Training: Forward + Backward Pass

Training is the iterative process of adjusting a model's weights so its predictions match desired outputs. Every training step requires a forward pass (compute prediction), a loss calculation (measure error), and a backward pass (compute gradients via backpropagation).

loss = criterion(model(input), target)
loss.backward()        # compute gradients for every weight
optimizer.step()       # update weights: w -= lr * grad
optimizer.zero_grad()  # reset gradients for the next iteration

Training must keep all intermediate activations for the backward pass, on top of gradients and optimizer state, which is why it typically uses 3-4x more memory than inference at the same precision.

Gradient descent is the optimization algorithm: compute the direction that reduces loss, take a step in that direction, repeat millions of times. Learning rate controls step size — too large and you overshoot, too small and training takes forever.
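To make the learning-rate trade-off concrete, here is a minimal sketch in plain Python that fits a single weight by gradient descent. The function name `train_linear`, the toy data, and the three learning rates are illustrative assumptions, not part of any framework.

```python
def train_linear(xs, ys, lr, steps=200):
    """Fit y ~ w * x by gradient descent on mean squared error (toy example)."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Forward pass: predictions with the current weight.
        preds = [w * x for x in xs]
        # Backward pass: d(MSE)/dw = (2/n) * sum((pred - y) * x).
        grad = (2.0 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
        # Update: step against the gradient, scaled by the learning rate.
        w -= lr * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x, so the ideal weight is 2.0

w_good = train_linear(xs, ys, lr=0.05)     # converges to about 2.0
w_small = train_linear(xs, ys, lr=0.0001)  # still far from 2.0 after 200 steps
w_large = train_linear(xs, ys, lr=0.2)     # overshoots and diverges
```

With the same number of steps, the moderate rate converges, the tiny rate has barely moved, and the large rate diverges because each step overshoots the minimum by more than the previous error.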

The training loop — an iterative process

Training Loop

Training Data → Forward Pass → Compute Loss → Backward Pass → Update Weights → Repeat

Each pass through the loop on one batch is one gradient step, and a full pass over the dataset is one epoch. Large training runs repeat this loop for thousands to millions of steps.
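The relationship between steps and epochs can be sketched as follows. The helper names `steps_per_epoch` and `iterate_batches` are hypothetical, and the ten-item dataset is only for illustration.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of gradient steps needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


def iterate_batches(data, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


data = list(range(10))
batches = list(iterate_batches(data, batch_size=4))  # one epoch of batches
total_steps = 3 * steps_per_epoch(len(data), 4)      # steps for 3 epochs
```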

Inference Pipeline

User Input → Tokenize → Forward Pass → Sample Token → Decode Output

For autoregressive generation, the forward-pass and sample steps repeat for each output token. The KV cache stores the keys and values of earlier tokens so they are not recomputed on every step.
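The effect of caching can be illustrated with a toy generation loop. Nothing here is a real model: `encode_token` stands in for the per-token key/value computation and `next_token` stands in for attention plus greedy sampling, both hypothetical. The sketch only counts how much per-token work is repeated with and without a cache.

```python
def encode_token(tok):
    """Stand-in for computing one token's key/value entry (hypothetical)."""
    return (tok * 31 + 7) % 101


def next_token(cache):
    """Stand-in for attention over cached entries plus greedy sampling."""
    return sum(cache) % 50


def generate(prompt_tokens, max_new_tokens, use_cache=True):
    tokens = list(prompt_tokens)
    cache = []
    encode_calls = 0
    for _ in range(max_new_tokens):
        if use_cache:
            new = tokens[len(cache):]  # only tokens not yet in the cache
        else:
            cache = []                 # throw everything away each step
            new = tokens               # and re-encode the whole sequence
        cache.extend(encode_token(t) for t in new)
        encode_calls += len(new)
        tokens.append(next_token(cache))
    return tokens, encode_calls


cached_out, cached_calls = generate([1, 2, 3], 5, use_cache=True)
plain_out, plain_calls = generate([1, 2, 3], 5, use_cache=False)
```

Both runs produce the same tokens, but the cached run encodes each token once while the uncached run re-encodes the growing sequence on every step.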

Side-by-side comparison — compute, memory, hardware

Dimension	Training	Inference
Compute	Forward + backward pass. ~3x FLOPs vs inference per sample.	Forward pass only. Optimized for throughput.
Memory	Model weights + gradients + optimizer states + activations. ~16 bytes/param (AdamW FP32).	Model weights + KV cache. ~0.5-2 bytes/param (INT4-FP16).
Precision	BF16/FP32 typical. Needs exponent range for gradient stability.	INT4/INT8/FP16 viable. Mantissa precision traded for speed.
Hardware	Multi-GPU clusters (H100/A100). NVIDIA DGX, cloud HPC.	Single GPU, CPU, Mac Mini, edge devices, mobile.
Batch size	Large batches (thousands) for gradient stability. Gradient accumulation common.	Often batch=1 for real-time. Continuous batching for serving.
Latency	Hours to months. Latency per step is irrelevant — total time matters.	Milliseconds to seconds. Time-to-first-token is critical.
Cost	$1M-$100M+ for frontier models. One-time (amortized) cost.	$0.001-$0.10 per query. Ongoing, scales with usage.
Weights	Mutable. Updated every step via gradient descent.	Frozen. Loaded once, read-only.

Why this distinction matters for edge deployment

The entire reason inference can run on a Mac Mini, a phone, or an Akamai edge server is that trained models are compiled knowledge. The expensive, GPU-intensive work of learning already happened. Inference is just reading from that knowledge.

Training is write-heavy

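Every parameter needs its gradient stored, plus two optimizer states (AdamW). A 70B parameter model needs ~1.1 TB just for training state (70B x 16 bytes), before counting activations. That alone is the memory of at least 14 H100 GPUs at 80 GB each, so in practice 16 or more with a fast interconnect such as NVLink.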

Inference is read-heavy

The same 70B model quantized to INT4 needs ~35 GB — fits in a single GPU or an M4 Max with 128 GB unified memory. No gradients, no optimizer states, just weights and a KV cache.
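The two memory figures above follow from simple arithmetic, sketched here under stated assumptions: 16 bytes per parameter for FP32 training with AdamW (4 weights + 4 gradients + 8 for two optimizer moments), weights only for inference, and 80 GB per GPU. The function names are illustrative, and activations and KV cache are deliberately left out.

```python
import math


def training_state_bytes(num_params, bytes_per_param=16):
    """Weights + gradients + two AdamW moments, all in FP32 (assumed)."""
    return num_params * bytes_per_param


def inference_weight_bytes(num_params, bits_per_weight):
    """Memory for the weights alone at a given quantization width."""
    return num_params * bits_per_weight // 8


params = 70_000_000_000                                  # 70B parameters
train_tb = training_state_bytes(params) / 1e12           # terabytes
int4_gb = inference_weight_bytes(params, 4) / 1e9        # gigabytes
min_gpus = math.ceil(training_state_bytes(params) / 80_000_000_000)
```

Here `train_tb` comes out at 1.12 and `int4_gb` at 35.0, matching the ~1.1 TB and ~35 GB figures in the text.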

Key inference performance metrics

tok/s

Tokens per second

Generation throughput. Llama 3.1 8B on M4 Max: ~60 tok/s at INT4. On H100: ~2,000 tok/s at FP16.

TTFT

Time to first token

Latency before generation begins. Includes prompt processing (prefill). Target: <500ms for interactive use.

QPS

Queries per second

Serving throughput with continuous batching. vLLM and TGI optimize this via PagedAttention and dynamic batching.
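As a rough sketch of how these three metrics relate to raw measurements, the helpers below derive them from timestamps in seconds. The function names and the sample timestamps are hypothetical, and real serving stacks may define the throughput window differently.

```python
def generation_metrics(request_time, token_times):
    """Return (TTFT, decode tok/s) for one request.

    token_times holds the arrival time of each generated token.
    """
    ttft = token_times[0] - request_time           # includes prefill
    decode_time = token_times[-1] - token_times[0]
    if decode_time > 0:
        tok_per_s = (len(token_times) - 1) / decode_time
    else:
        tok_per_s = float("nan")                   # single-token response
    return ttft, tok_per_s


def queries_per_second(completed_requests, window_seconds):
    """Serving throughput over a measurement window."""
    return completed_requests / window_seconds


ttft, tok_s = generation_metrics(0.0, [0.25, 0.30, 0.35, 0.40, 0.45])
qps = queries_per_second(120, 60.0)
```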

The precision bridge — from training to deployment

Quantization Makes Edge Inference Possible

Models are trained in BF16/FP32 for numerical stability, then quantized to INT4/INT8 for deployment. This is a post-training compression step that trades precision for speed and memory savings.

Train (BF16 / FP32) → Quantize (GPTQ / AWQ) → Deploy (INT4 / INT8) → Serve (Ollama / vLLM)

Memory at FP32: 70B params x 4 bytes = 280 GB
Memory at INT4: 70B params x 0.5 bytes = 35 GB
8x memory reduction. The quality loss (measured by perplexity increase) is typically 0.1-0.5% with modern quantization methods.
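The basic idea of trading precision for memory can be shown with a deliberately simple scheme: symmetric round-to-nearest INT8 quantization with one scale per tensor. This is not GPTQ or AWQ, which use calibration data and finer-grained scaling to reduce error further. The function names and sample weights are illustrative.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    scale = (max(abs(w) for w in weights) / 127.0) or 1.0  # avoid zero scale
    q = [round(w / scale) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is now stored in one byte instead of four, and the reconstruction error per weight is bounded by half the scale.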