Field Technology Services — AI-Orchestrated Akamai Operations
Inference is the process of running input data through a trained model to produce a prediction. The model's weights are frozen — they never change. The only computation is a single forward pass through the network.
Key characteristics: deterministic computation path, memory footprint proportional to model size + KV cache, can be heavily optimized via quantization (INT4/INT8), batching, and speculative decoding. The model is essentially a function — input goes in, output comes out.
Training is the iterative process of adjusting a model's weights so its predictions match desired outputs. Every training step requires a forward pass (compute prediction), a loss calculation (measure error), and a backward pass (compute gradients via backpropagation).
Gradient descent is the optimization algorithm: compute the direction that reduces loss, take a step in that direction, repeat millions of times. Learning rate controls step size — too large and you overshoot, too small and training takes forever.
| Dimension | Training | Inference |
|---|---|---|
| Compute | Forward + backward pass. ~3x FLOPs vs inference per sample. | Forward pass only. Optimized for throughput. |
| Memory | Model weights + gradients + optimizer states + activations. ~16 bytes/param (AdamW FP32). | Model weights + KV cache. ~0.5-2 bytes/param (INT4-FP16). |
| Precision | BF16/FP32 required. Needs exponent range for gradient stability. | INT4/INT8/FP16 viable. Mantissa precision traded for speed. |
| Hardware | Multi-GPU clusters (H100/A100). NVIDIA DGX, cloud HPC. | Single GPU, CPU, Mac Mini, edge devices, mobile. |
| Batch size | Large batches (thousands) for gradient stability. Gradient accumulation common. | Often batch=1 for real-time. Continuous batching for serving. |
| Latency | Hours to months. Latency per step is irrelevant — total time matters. | Milliseconds to seconds. Time-to-first-token is critical. |
| Cost | $1M-$100M+ for frontier models. One-time (amortized) cost. | $0.001-$0.10 per query. Ongoing, scales with usage. |
| Weights | Mutable. Updated every step via gradient descent. | Frozen. Loaded once, read-only. |
The entire reason inference can run on a Mac Mini, a phone, or an Akamai edge server is that trained models are compiled knowledge. The expensive, GPU-intensive work of learning already happened. Inference is just reading from that knowledge.
Generation throughput. Llama 3.1 8B on M4 Max: ~60 tok/s at INT4. On H100: ~2,000 tok/s at FP16.
Latency before generation begins. Includes prompt processing (prefill). Target: <500ms for interactive use.
Serving throughput with continuous batching. vLLM and TGI optimize this via PagedAttention and dynamic batching.
Models are trained in BF16/FP32 for numerical stability, then quantized to INT4/INT8 for deployment. This is a post-training compression step that trades precision for speed and memory savings.