Why run inference at the edge
Latency
Round-trip to a cloud GPU adds 50-200ms. Edge inference eliminates the network hop entirely. For real-time applications — autocomplete, content moderation, image processing — this is the difference between fluid and sluggish.
Data Privacy
Sensitive data never leaves the local network. Medical records, financial data, PII — edge inference processes everything locally. No data in transit, no third-party API logs, no compliance gray areas.
Cost
Cloud GPU inference costs $0.01-$0.10 per query at scale. A Mac Mini M4 running Ollama costs $0.00 per query after the hardware purchase. At high volume, edge inference pays for itself in weeks.
Offline
Edge inference works without internet. Field deployments, aircraft, remote sites — anywhere connectivity is unreliable. The model runs entirely on local hardware with zero external dependencies.
The edge compute spectrum — from cloud to device
Cloud GPU
H100 / A100
Cloud Edge
Akamai L4 GPU
CDN Edge
EdgeWorkers
On-Device
Mac Mini / Phone
Maximum capability
Minimum latency
Each tier trades model capability for proximity. A 405B model requires cloud GPUs. A 7B quantized model runs on a Mac Mini. The question is always: what is the smallest model that meets your quality bar?
Model size vs capability — the tradeoff
Parameter Count Determines Hardware Requirements
| Model |
Params |
FP16 Size |
INT4 Size |
Minimum Hardware |
Use Case |
| Phi-3 Mini |
3.8B |
7.6 GB |
2.3 GB |
Phone, Raspberry Pi 5 |
Simple Q&A, classification |
| Llama 3.1 8B |
8B |
16 GB |
4.7 GB |
Mac Mini M4, 16 GB laptop |
General assistant, RAG, code |
| Llama 3.1 70B |
70B |
140 GB |
40 GB |
M4 Max 128 GB or A100 80 GB |
Complex reasoning, analysis |
| Llama 3.1 405B |
405B |
810 GB |
230 GB |
8x H100 cluster |
Frontier-class tasks |
Quantization for edge — INT4/INT8 enables small hardware
How Quantization Works
Quantization maps high-precision weights (FP16/FP32) to lower-precision integers (INT4/INT8). Each group of weights gets a scale factor and zero-point that allows approximate reconstruction.
float_weight = scale * (int_weight - zero_point)
Groups of 32-128 weights share a single scale factor. Smaller groups = better accuracy, more overhead.
Quantization Methods
Modern methods minimize quality loss by analyzing weight importance during quantization.
GPTQ — Post-training, uses calibration data
AWQ — Protects salient weights (~1%)
GGUF — llama.cpp format, CPU-optimized
ExLlamaV2 — Mixed precision, GPU-only
AWQ typically outperforms GPTQ at the same bit width because it identifies and preserves the most important weight channels.
Akamai Connected Cloud — GPU inference at the edge
Distributed Inference Architecture
Akamai Connected Cloud provides GPU compute instances in 25+ global markets. Combined with Akamai CDN and EdgeWorkers, this enables a three-tier architecture where each layer handles what it does best.
CDN
Cache model artifacts
Serve static assets
→
EdgeWorkers
Pre/post processing
Auth, routing, filter
→
GPU Instance
L4 / A100 / H100
Model inference
→
Response
Stream tokens
Via CDN edge
CDN caches GGUF model files at edge PoPs (one-time download per region)
EdgeWorkers handle auth, rate limiting, prompt sanitization at <1ms
GPU instance runs vLLM/TGI for actual model inference
This architecture means the model weights are already cached near the GPU when a new instance spins up — no cross-region download required.
Hardware comparison — TFLOPS, memory, power, cost
| Hardware |
Type |
Memory |
FP16 TFLOPS |
Relative Speed |
Power |
Best For |
| H100 SXM |
Data center GPU |
80 GB HBM3 |
989 |
1.0x |
700W |
Frontier models, training |
| A100 80GB |
Data center GPU |
80 GB HBM2e |
312 |
0.31x |
400W |
Production inference |
| L4 |
Edge GPU |
24 GB GDDR6 |
121 |
0.12x |
72W |
Edge inference, Akamai CC |
| M4 Max |
Apple Silicon |
128 GB unified |
~54 |
0.05x |
40W |
Local dev, Ollama, 70B INT4 |
| RTX 4090 |
Consumer GPU |
24 GB GDDR6X |
165 |
0.17x |
450W |
Enthusiast, small models |
Real-world edge inference architectures
Hybrid: Edge + Cloud Fallback
Run a small model (8B INT4) locally for common queries. Route complex queries to a cloud 70B model via Akamai CDN. EdgeWorkers classify query complexity in <1ms and route accordingly. 80% of queries never leave the edge.
CDN-Cached Model Weights
Store GGUF model files on Akamai CDN as cacheable objects. When a new GPU instance spins up, it pulls the model from the nearest PoP (~50ms) instead of an origin server (~500ms). Model updates propagate via cache invalidation.
Edge Pre/Post Processing
EdgeWorkers tokenize input, sanitize prompts, enforce rate limits, and apply content filtering before the request reaches the GPU. Response streaming flows back through the CDN with token-level SSE events. The GPU only handles matrix math.
Fully Local: Ollama on Mac Mini
For development and demos: Ollama serves models via a local REST API. Llama 3.1 8B at INT4 runs at ~60 tok/s on M4 Max. No network, no cost, no data leaves the machine. HarperDB connects directly via localhost.