FTS MCP Server

Field Technology Services — AI-Orchestrated Akamai Operations


Inference vs Training

Forward Pass · Backpropagation
Gradient Descent · Edge Deployment
What is inference — using the model
Inference: Forward Pass Only

Inference is the process of running input data through a trained model to produce a prediction. The model's weights are frozen and never change. The only computation is a forward pass through the network (one per generated token for autoregressive models).

import torch

with torch.no_grad():          # no gradients computed
    output = model(input)      # weights are read, never updated
# Optimized for throughput and latency

Inference is analogous to running a compiled program: the "compilation" (training) already happened.

Key characteristics: deterministic computation path, memory footprint proportional to model size + KV cache, can be heavily optimized via quantization (INT4/INT8), batching, and speculative decoding. The model is essentially a function — input goes in, output comes out.
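The "model is a function" view above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical two-layer network with made-up weights; W1, W2, relu, and forward are illustrative names, not part of any real model or library.

```python
# Hypothetical frozen weights for a tiny 2-input, 2-hidden, 1-output network.
# Tuples are immutable, which mirrors "loaded once, read-only".
W1 = ((0.5, -1.0), (1.5, 2.0))   # hidden-layer weights, one row per hidden unit
W2 = (1.0, -0.5)                 # output-layer weights


def relu(x):
    return x if x > 0.0 else 0.0


def forward(inputs):
    """Single forward pass: no gradients are computed, no weights change."""
    hidden = [relu(sum(w * x for w, x in zip(row, inputs))) for row in W1]
    return sum(w * h for w, h in zip(W2, hidden))


prediction = forward((2.0, 1.0))
print(prediction)  # -2.5
```

Calling forward again with the same input gives the same result, because nothing in the call path writes to the weights.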

What is training — building the model
Training: Forward + Backward Pass

Training is the iterative process of adjusting a model's weights so its predictions match desired outputs. Every training step requires a forward pass (compute prediction), a loss calculation (measure error), and a backward pass (compute gradients via backpropagation).

loss = criterion(model(input), target)   # forward pass + loss
loss.backward()                          # compute gradients for every weight
optimizer.step()                         # update weights: w -= lr * grad
optimizer.zero_grad()                    # reset gradients for the next iteration

Training must keep intermediate activations for the backward pass, in addition to gradients and optimizer states. This is why training needs several times more memory than inference for the same model.

Gradient descent is the optimization algorithm: compute the direction that reduces loss, take a step in that direction, repeat millions of times. Learning rate controls step size — too large and you overshoot, too small and training takes forever.
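The effect of the learning rate can be seen in a small worked example. The sketch below assumes a hypothetical one-parameter model y = w * x fitted to toy data with mean squared error; the function name train and the data are illustrative only.

```python
def train(lr, steps=50):
    """Fit y = w * x by gradient descent on mean squared error."""
    xs = [1.0, 2.0, 3.0]
    ys = [3.0, 6.0, 9.0]      # toy data generated with a true weight of 3.0
    w = 0.0                   # initial weight
    for _ in range(steps):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad        # update rule: w -= lr * grad
    return w


print(train(lr=0.1))     # converges close to 3.0
print(train(lr=0.001))   # too small: still far from 3.0 after 50 steps
print(train(lr=0.3))     # too large: overshoots and diverges
```

For this toy problem the update is stable only while lr stays below roughly 0.21, which is why 0.3 blows up while 0.1 converges in a few steps.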


The training loop — an iterative process
Training Loop
Training Data → Forward Pass → Compute Loss → Backward Pass → Update Weights → Repeat
The loop repeats for up to millions of steps. Each trip through the loop is one gradient step; one full pass over the dataset is one epoch.
Inference Pipeline
User Input → Tokenize → Forward Pass → Sample Token → Decode Output
For autoregressive generation, the forward pass, sampling, and decoding repeat for each output token. The KV cache stores the attention keys and values of earlier tokens so they are not recomputed.
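To make the saving from caching concrete, here is a toy sketch of the generation loop. It is not a real transformer: the "key/value projection" is a placeholder (tok * 2), the "sampling" is a deterministic stand-in, and the function name generate is hypothetical. It only counts how many per-token computations happen with and without a cache.

```python
def generate(prompt_tokens, max_new_tokens, use_cache):
    """Toy decoder that counts per-token key/value computations."""
    kv_computations = 0
    cache = []                           # stands in for cached keys/values
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        if use_cache:
            new = tokens[len(cache):]    # only tokens not yet in the cache
        else:
            cache, new = [], tokens      # recompute the whole sequence
        for tok in new:
            cache.append(tok * 2)        # placeholder for a key/value projection
            kv_computations += 1
        next_token = sum(cache) % 10     # placeholder for forward pass + sampling
        tokens.append(next_token)
    return tokens, kv_computations


cached_tokens, cached_cost = generate([1, 2, 3], 4, use_cache=True)
plain_tokens, plain_cost = generate([1, 2, 3], 4, use_cache=False)
print(cached_tokens == plain_tokens)  # True: same output either way
print(cached_cost, plain_cost)        # 6 18
```

Without the cache the cost grows quadratically with sequence length; with it, each token is processed once.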

Side-by-side comparison — compute, memory, hardware
Dimension | Training | Inference
Compute | Forward + backward pass. ~3x FLOPs vs inference per sample. | Forward pass only. Optimized for throughput.
Memory | Model weights + gradients + optimizer states + activations. ~16 bytes/param (AdamW FP32). | Model weights + KV cache. ~0.5-2 bytes/param (INT4-FP16).
Precision | BF16/FP32 required. Needs exponent range for gradient stability. | INT4/INT8/FP16 viable. Mantissa precision traded for speed.
Hardware | Multi-GPU clusters (H100/A100). NVIDIA DGX, cloud HPC. | Single GPU, CPU, Mac Mini, edge devices, mobile.
Batch size | Large batches (thousands) for gradient stability. Gradient accumulation common. | Often batch=1 for real-time. Continuous batching for serving.
Latency | Hours to months. Latency per step is irrelevant; total time matters. | Milliseconds to seconds. Time-to-first-token is critical.
Cost | $1M-$100M+ for frontier models. One-time (amortized) cost. | $0.001-$0.10 per query. Ongoing, scales with usage.
Weights | Mutable. Updated every step via gradient descent. | Frozen. Loaded once, read-only.

Why this distinction matters for edge deployment

The entire reason inference can run on a Mac Mini, a phone, or an Akamai edge server is that trained models are compiled knowledge. The expensive, GPU-intensive work of learning already happened. Inference is just reading from that knowledge.

Training is write-heavy
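Every parameter needs its gradient stored, plus two optimizer states (AdamW). A 70B parameter model needs ~1.1 TB just for training state (70B params x 16 bytes). At 80 GB per H100, that is at least 14 GPUs for the state alone, so in practice roughly 16 H100 GPUs with a fast interconnect such as NVLink.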
Inference is read-heavy
The same 70B model quantized to INT4 needs ~35 GB — fits in a single GPU or an M4 Max with 128 GB unified memory. No gradients, no optimizer states, just weights and a KV cache.
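The figures above follow from simple arithmetic. The sketch below reproduces them, assuming 16 bytes per parameter for FP32 AdamW training state (4 for the weight, 4 for the gradient, 8 for the two optimizer states), 0.5 bytes per parameter at INT4, and 80 GB of memory per GPU. The helper name memory_gb is hypothetical.

```python
import math


def memory_gb(params, bytes_per_param):
    """Memory in decimal gigabytes for a given parameter count."""
    return params * bytes_per_param / 1e9


PARAMS_70B = 70e9
GPU_MEMORY_GB = 80                           # assumption: 80 GB per GPU

training_gb = memory_gb(PARAMS_70B, 16)      # weights + gradients + AdamW states
inference_gb = memory_gb(PARAMS_70B, 0.5)    # INT4 weights only, no KV cache
min_gpus = math.ceil(training_gb / GPU_MEMORY_GB)

print(training_gb)    # 1120.0 GB, about 1.1 TB
print(inference_gb)   # 35.0 GB
print(min_gpus)       # 14, before counting activations
```

Activations and the KV cache come on top of these numbers, so both are lower bounds.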

Key inference performance metrics
tok/s (tokens per second)
Generation throughput. Llama 3.1 8B on M4 Max: ~60 tok/s at INT4. On H100: ~2,000 tok/s at FP16.

TTFT (time to first token)
Latency before generation begins. Includes prompt processing (prefill). Target: <500ms for interactive use.

QPS (queries per second)
Serving throughput with continuous batching. vLLM and TGI optimize this via PagedAttention and dynamic batching.
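TTFT and tok/s for a single request can be derived from timestamps alone. The following is a minimal sketch under the assumption that the caller records the request start time and the arrival time of each generated token, in seconds; the function name summarize is hypothetical and not tied to any serving framework.

```python
def summarize(request_start, token_times):
    """Return (ttft_seconds, tokens_per_second) for one request."""
    ttft = token_times[0] - request_start          # includes prefill
    decode_time = token_times[-1] - token_times[0]
    # Decode throughput: tokens produced after the first, over elapsed time.
    if decode_time > 0:
        tok_per_s = (len(token_times) - 1) / decode_time
    else:
        tok_per_s = 0.0
    return ttft, tok_per_s


ttft, tok_per_s = summarize(0.0, [0.25, 0.5, 0.75, 1.0, 1.25])
print(ttft)        # 0.25 seconds, under the 500 ms interactive target
print(tok_per_s)   # 4.0 tokens per second
```

Measuring throughput from the first token onward keeps prefill time out of the tok/s figure, so the two metrics stay independent.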


The precision bridge — from training to deployment
Quantization Makes Edge Inference Possible

Models are trained in BF16/FP32 for numerical stability, then quantized to INT4/INT8 for deployment. This is a post-training compression step that trades precision for speed and memory savings.

Train (BF16 / FP32) → Quantize (GPTQ / AWQ) → Deploy (INT4 / INT8) → Serve (Ollama / vLLM)
Memory at FP32: 70B params x 4 bytes = 280 GB
Memory at INT4: 70B params x 0.5 bytes = 35 GB
8x memory reduction. The quality loss (measured by perplexity increase) is typically 0.1-0.5% with modern quantization methods.
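The precision-for-memory trade can be illustrated with the simplest possible scheme. The sketch below uses plain symmetric round-to-nearest quantization with one scale per tensor. It is not GPTQ or AWQ, which choose scales and roundings more carefully to limit quality loss. The function names and sample weights are hypothetical.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization to the INT4 range [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7 or 1.0   # fall back to 1.0 if all zero
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map INT4 codes back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(codes)       # [7, -3, 1, 0]
print(max_error)   # about 0.02, at most half a quantization step
```

Each code fits in 4 bits, so two codes pack into one byte. That is where the 0.5 bytes per parameter figure comes from, with a small extra cost for storing the scale.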