FTS MCP Server

Field Technology Services — AI-Orchestrated Akamai Operations

Inference at the Edge

Why run inference at the edge
Latency

A round trip to a cloud GPU adds 50-200 ms. On-device inference eliminates that network hop entirely, and inference on a nearby edge GPU shortens it. For real-time applications such as autocomplete, content moderation, and image processing, this is the difference between fluid and sluggish.

Data Privacy

Sensitive data never leaves the local network. Medical records, financial data, PII — edge inference processes everything locally. No data in transit, no third-party API logs, no compliance gray areas.

Cost

Cloud GPU inference costs $0.01-$0.10 per query at scale. A Mac Mini M4 running Ollama has no per-query fee after the hardware purchase; the only marginal cost is electricity. At high volume, edge inference can pay for itself in weeks.
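The break-even point behind that claim is simple arithmetic. The sketch below is illustrative only: the function name, the hardware price, and the daily query volume are hypothetical inputs, not figures from this page; only the $0.01 per-query cloud cost comes from the range quoted above.

```python
def break_even_days(hardware_cost_usd, cloud_cost_per_query_usd, queries_per_day):
    """Days until avoided cloud fees equal the one-time hardware cost.

    Electricity and maintenance are ignored, so this is a lower bound.
    """
    daily_cloud_spend = cloud_cost_per_query_usd * queries_per_day
    return hardware_cost_usd / daily_cloud_spend


# Hypothetical inputs: a $600 machine, $0.01 per cloud query, 10,000 queries/day.
days = break_even_days(600, 0.01, 10_000)
print(f"Break-even after about {days:.1f} days")
```

At lower volumes the same formula gives a much longer payback period, which is why the claim is tied to high query volume.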

Offline

Edge inference works without internet. Field deployments, aircraft, remote sites — anywhere connectivity is unreliable. The model runs entirely on local hardware with zero external dependencies.


The edge compute spectrum — from cloud to device
Cloud GPU (H100 / A100) → Cloud Edge (Akamai L4 GPU) → CDN Edge (EdgeWorkers) → On-Device (Mac Mini / Phone)
Maximum capability at the cloud end, minimum latency at the device end.

Each tier trades model capability for proximity. A 405B model requires cloud GPUs. A 7B quantized model runs on a Mac Mini. The question is always: what is the smallest model that meets your quality bar?


Model size vs capability — the tradeoff
Parameter Count Determines Hardware Requirements
Model          | Params | FP16 Size | INT4 Size | Minimum Hardware            | Use Case
---------------|--------|-----------|-----------|-----------------------------|----------------------------
Phi-3 Mini     | 3.8B   | 7.6 GB    | 2.3 GB    | Phone, Raspberry Pi 5       | Simple Q&A, classification
Llama 3.1 8B   | 8B     | 16 GB     | 4.7 GB    | Mac Mini M4, 16 GB laptop   | General assistant, RAG, code
Llama 3.1 70B  | 70B    | 140 GB    | 40 GB     | M4 Max 128 GB or A100 80 GB | Complex reasoning, analysis
Llama 3.1 405B | 405B   | 810 GB    | 230 GB    | 8x H100 cluster             | Frontier-class tasks
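The size columns follow from parameter count times bits per weight. The sketch below reproduces the FP16 column exactly (16 bits per weight). For INT4 it assumes an effective 4.5 bits per weight, to account for per-group scale overhead; that figure is an assumption, so the results only approximate the INT4 column, and real files vary with the format and with layers kept at higher precision.

```python
def weight_size_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


MODELS = {"Phi-3 Mini": 3.8, "Llama 3.1 8B": 8, "Llama 3.1 70B": 70, "Llama 3.1 405B": 405}

for name, params in MODELS.items():
    fp16 = weight_size_gb(params, 16)
    int4 = weight_size_gb(params, 4.5)  # assumed effective bits, including scales
    print(f"{name}: FP16 ~{fp16:.1f} GB, INT4 ~{int4:.1f} GB")
```

Weights are only part of the memory budget: the KV cache and activations need additional headroom on top of these numbers.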

Quantization for edge — INT4/INT8 enables small hardware
How Quantization Works

Quantization maps high-precision weights (FP16/FP32) to lower-precision integers (INT4/INT8). Each group of weights gets a scale factor and zero-point that allows approximate reconstruction.

float_weight = scale * (int_weight - zero_point)
Groups of 32-128 weights share a single scale factor. Smaller groups = better accuracy, more overhead.
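A minimal sketch of group-wise quantization using the formula above, in pure Python for readability. The function names are illustrative, and this is plain min/max (round-to-nearest) quantization, not GPTQ or AWQ; real implementations also pack the integers into bytes and usually constrain the zero-point to the integer range.

```python
def quantize_group(weights, bits=4):
    """Quantize one group of floats to ints sharing one scale and zero-point."""
    qmax = (1 << bits) - 1                    # 15 for INT4, 255 for INT8
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0     # avoid a zero scale for constant groups
    zero_point = round(-w_min / scale)
    ints = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return ints, scale, zero_point


def quantize(weights, group_size=32, bits=4):
    """Split the weights into groups and quantize each group independently."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Approximate reconstruction: float_weight = scale * (int_weight - zero_point)."""
    restored = []
    for ints, scale, zero_point in groups:
        restored.extend(scale * (q - zero_point) for q in ints)
    return restored
```

The reconstruction error of each weight is bounded by its group's scale, so a smaller group (narrower value range, smaller scale) gives better accuracy at the price of storing more scale factors.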
Quantization Methods

Modern methods minimize quality loss by analyzing weight importance during quantization.

GPTQ: post-training, uses calibration data
AWQ: protects salient weights (~1%)
GGUF: llama.cpp format, CPU-optimized
ExLlamaV2: mixed precision, GPU-only

AWQ typically outperforms GPTQ at the same bit width because it identifies and preserves the most important weight channels.

Akamai Connected Cloud — GPU inference at the edge
Distributed Inference Architecture

Akamai Connected Cloud provides GPU compute instances in 25+ global markets. Combined with Akamai CDN and EdgeWorkers, this enables a three-tier architecture where each layer handles what it does best.

CDN: cache model artifacts, serve static assets
→ EdgeWorkers: pre/post processing (auth, routing, filter)
→ GPU Instance (L4 / A100 / H100): model inference
→ Response: stream tokens via CDN edge

CDN caches GGUF model files at edge PoPs (one-time download per region)
EdgeWorkers handle auth, rate limiting, and prompt sanitization in under 1 ms
GPU instance runs vLLM/TGI for actual model inference

This architecture means the model weights are already cached near the GPU when a new instance spins up, so no cross-region download is required.
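The rate-limiting step in the EdgeWorkers tier can be pictured as a token bucket. The sketch below shows only the logic, in Python; it does not use the EdgeWorkers API (EdgeWorkers run JavaScript), and the class name and parameters are hypothetical.

```python
class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at `rate` requests per second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity   # start full
        self.last = 0.0          # timestamp of the previous check, in seconds

    def allow(self, now):
        # Refill in proportion to the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=1, capacity=2)
print([bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0)])
```

Passing the timestamp in explicitly keeps the sketch deterministic; a deployed limiter would read a clock and keep one bucket per client key.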

Hardware comparison — TFLOPS, memory, power, cost
Hardware  | Type            | Memory         | FP16 TFLOPS | Relative Speed (vs H100) | Power | Best For
----------|-----------------|----------------|-------------|--------------------------|-------|----------------------------
H100 SXM  | Data center GPU | 80 GB HBM3     | 989         | 1.0x                     | 700W  | Frontier models, training
A100 80GB | Data center GPU | 80 GB HBM2e    | 312         | 0.32x                    | 400W  | Production inference
L4        | Edge GPU        | 24 GB GDDR6    | 121         | 0.12x                    | 72W   | Edge inference, Akamai CC
M4 Max    | Apple Silicon   | 128 GB unified | ~54         | 0.05x                    | 40W   | Local dev, Ollama, 70B INT4
RTX 4090  | Consumer GPU    | 24 GB GDDR6X   | 165         | 0.17x                    | 450W  | Enthusiast, small models
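The relative speed figures are each device's FP16 TFLOPS divided by the H100's. The sketch below recomputes that ratio from the table and adds TFLOPS per watt, a derived metric that is not in the original table but matters for edge sites with tight power budgets. Peak TFLOPS is only a rough proxy for inference throughput, which is often limited by memory bandwidth.

```python
# (FP16 TFLOPS, power in watts), copied from the table above.
HARDWARE = {
    "H100 SXM": (989, 700),
    "A100 80GB": (312, 400),
    "L4": (121, 72),
    "M4 Max": (54, 40),
    "RTX 4090": (165, 450),
}


def relative_speed(name, baseline="H100 SXM"):
    """Peak FP16 throughput as a fraction of the baseline device."""
    return HARDWARE[name][0] / HARDWARE[baseline][0]


def tflops_per_watt(name):
    """Peak FP16 TFLOPS delivered per watt of rated power."""
    tflops, watts = HARDWARE[name]
    return tflops / watts


for name in HARDWARE:
    print(f"{name}: {relative_speed(name):.2f}x, {tflops_per_watt(name):.2f} TFLOPS/W")
```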

Real-world edge inference architectures
Hybrid: Edge + Cloud Fallback
Run a small model (8B INT4) locally for common queries. Route complex queries to a cloud 70B model via Akamai CDN. EdgeWorkers classify query complexity in under 1 ms and route accordingly. In this design, roughly 80% of queries never leave the edge.
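A sub-millisecond classifier has to be cheap, so a rule-based heuristic is one plausible shape for it. The sketch below is an assumption, not the routing logic this page describes: the marker phrases, the word threshold, and the tier names `edge-8b` and `cloud-70b` are all hypothetical, and a production EdgeWorker would be written in JavaScript.

```python
# Hypothetical phrases that suggest a query needs multi-step reasoning.
HARD_MARKERS = ("analyze", "compare", "prove", "step by step")


def route(prompt, max_simple_words=40):
    """Return the model tier that should handle this prompt."""
    text = prompt.lower()
    too_long = len(text.split()) > max_simple_words
    looks_hard = any(marker in text for marker in HARD_MARKERS)
    return "cloud-70b" if too_long or looks_hard else "edge-8b"


print(route("What is the capital of France?"))
print(route("Compare these two contracts step by step"))
```

The share of traffic that stays at the edge depends entirely on where the thresholds are set, so they should be tuned against logged queries rather than chosen up front.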
CDN-Cached Model Weights
Store GGUF model files on Akamai CDN as cacheable objects. When a new GPU instance spins up, it pulls the model from the nearest PoP (~50 ms away) instead of an origin server (~500 ms away). Model updates propagate via cache invalidation.
Edge Pre/Post Processing
EdgeWorkers tokenize input, sanitize prompts, enforce rate limits, and apply content filtering before the request reaches the GPU. Response streaming flows back through the CDN with token-level SSE events. The GPU only handles matrix math.
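Token-level SSE streaming means each generated token is framed as its own server-sent event. The sketch below shows that framing; the JSON payload shape and the closing `done` event are illustrative choices, not a format defined by this page or by any particular inference server.

```python
import json


def sse_events(tokens):
    """Yield one SSE frame per token, then a final 'done' event.

    Each frame is a set of 'field: value' lines terminated by a blank line,
    which is how the SSE wire format delimits events.
    """
    for index, token in enumerate(tokens):
        payload = json.dumps({"token": token})
        yield f"id: {index}\ndata: {payload}\n\n"
    yield "event: done\ndata: {}\n\n"


for frame in sse_events(["Edge", " inference"]):
    print(frame, end="")
```

A client reading this stream can append each token to the display as it arrives and close the connection when it sees the `done` event.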
Fully Local: Ollama on Mac Mini
For development and demos: Ollama serves models via a local REST API. Llama 3.1 8B at INT4 runs at ~60 tok/s on M4 Max. No network, no cost, no data leaves the machine. HarperDB connects directly via localhost.
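A minimal sketch of calling that local REST API with only the Python standard library. It assumes Ollama is running on its default port (11434) and that a model tagged `llama3.1:8b` has already been pulled; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt, model="llama3.1:8b"):
    """Build a non-streaming generate request for the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )


def generate(prompt, model="llama3.1:8b"):
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as response:
        return json.loads(response.read())["response"]
```

With the server running, `generate("Why run inference at the edge?")` returns the completion as a string. Setting `"stream": True` instead makes the server return one JSON object per line as tokens are produced.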