Inference at the Edge — FTS MCP Server

Why run inference at the edge

Latency

A round trip to a cloud GPU adds 50-200ms. On-device inference eliminates the network hop entirely, and a nearby edge GPU shortens it. For real-time applications such as autocomplete, content moderation, and image processing, this is the difference between fluid and sluggish.

Data Privacy

Sensitive data never leaves the local network. Medical records, financial data, PII — edge inference processes everything locally. No data in transit, no third-party API logs, no compliance gray areas.

Cost

Cloud GPU inference costs $0.01-$0.10 per query at scale. A Mac Mini M4 running Ollama costs close to $0.00 per query after the hardware purchase (electricity aside). At high volume, edge inference pays for itself in weeks.
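The payback claim can be checked with simple arithmetic. The sketch below is illustrative only: the hardware price and daily query volume in the usage note are hypothetical inputs, not figures from this page, and `edge_cost_per_query_usd` is a placeholder for electricity or maintenance if you want to include them.

```python
import math


def break_even_queries(hardware_cost_usd: float,
                       cloud_cost_per_query_usd: float,
                       edge_cost_per_query_usd: float = 0.0) -> int:
    """Number of queries after which the edge hardware has paid for itself."""
    saving_per_query = cloud_cost_per_query_usd - edge_cost_per_query_usd
    if saving_per_query <= 0:
        raise ValueError("edge must be cheaper per query than cloud")
    return math.ceil(hardware_cost_usd / saving_per_query)


def break_even_days(hardware_cost_usd: float,
                    cloud_cost_per_query_usd: float,
                    queries_per_day: float,
                    edge_cost_per_query_usd: float = 0.0) -> float:
    """Days until break-even at a steady daily query volume."""
    queries = break_even_queries(hardware_cost_usd,
                                 cloud_cost_per_query_usd,
                                 edge_cost_per_query_usd)
    return queries / queries_per_day
```

For example, assuming a hypothetical $600 machine, $0.01 per cloud query, and 10,000 queries per day, `break_even_queries(600, 0.01)` gives 60,000 queries and `break_even_days(600, 0.01, 10_000)` gives 6 days. At lower volumes the payback period stretches proportionally.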

Offline

Edge inference works without internet. Field deployments, aircraft, remote sites — anywhere connectivity is unreliable. The model runs entirely on local hardware with zero external dependencies.

The edge compute spectrum — from cloud to device

Cloud GPU: H100 / A100 (maximum capability)
Cloud Edge: Akamai L4 GPU
CDN Edge: EdgeWorkers
On-Device: Mac Mini / Phone (minimum latency)

Each tier trades model capability for proximity. A 405B model requires cloud GPUs. A 7B quantized model runs on a Mac Mini. The question is always: what is the smallest model that meets your quality bar?

Model size vs capability — the tradeoff

Parameter Count Determines Hardware Requirements

Model	Params	FP16 Size	INT4 Size	Minimum Hardware	Use Case
Phi-3 Mini	3.8B	7.6 GB	2.3 GB	Phone, Raspberry Pi 5	Simple Q&A, classification
Llama 3.1 8B	8B	16 GB	4.7 GB	Mac Mini M4, 16 GB laptop	General assistant, RAG, code
Llama 3.1 70B	70B	140 GB	40 GB	M4 Max 128 GB or A100 80 GB	Complex reasoning, analysis
Llama 3.1 405B	405B	810 GB	230 GB	8x H100 cluster	Frontier-class tasks
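The FP16 column in the table follows directly from the parameter count: 16 bits is 2 bytes per weight. A minimal sketch of that arithmetic, counting weights only (no KV cache or runtime overhead) and using decimal gigabytes; the function names are illustrative:

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params_billion * bits_per_weight / 8


def effective_bits(size_gb: float, params_billion: float) -> float:
    """Bits per weight implied by a given file size."""
    return size_gb * 8 / params_billion


def fits(size_gb: float, memory_gb: float, headroom: float = 0.8) -> bool:
    """Rough fit check; `headroom` is an assumed fraction of memory usable for weights."""
    return size_gb <= memory_gb * headroom
```

`weight_size_gb(8, 16)` reproduces the 16 GB figure for Llama 3.1 8B. Running `effective_bits` on the INT4 column gives roughly 4.5 to 4.8 bits per weight rather than exactly 4, which is consistent with per-group scale factors adding some overhead on top of the 4-bit values.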

Quantization for edge — INT4/INT8 enables small hardware

How Quantization Works

Quantization maps high-precision weights (FP16/FP32) to lower-precision integers (INT4/INT8). Each group of weights gets a scale factor and zero-point that allows approximate reconstruction.

float_weight = scale * (int_weight - zero_point)
Groups of 32-128 weights share a single scale factor. Smaller groups = better accuracy, more overhead.
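A minimal sketch of group-wise quantization using the reconstruction formula above. This is plain min/max asymmetric quantization in pure Python for illustration; it is not GPTQ, AWQ, or the exact scheme used by any particular library, and the function names are made up for this example.

```python
def quantize_group(weights, bits=4):
    """Quantize one group of floats; returns (ints, scale, zero_point)."""
    qmax = (1 << bits) - 1                 # 15 for INT4, 255 for INT8
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0        # fall back to 1.0 for a constant group
    zero_point = round(-lo / scale)
    # Clamp so every value fits in the unsigned integer range [0, qmax].
    ints = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return ints, scale, zero_point


def dequantize_group(ints, scale, zero_point):
    """Approximate reconstruction: float_weight = scale * (int_weight - zero_point)."""
    return [scale * (q - zero_point) for q in ints]


def quantize(weights, group_size=32, bits=4):
    """Split a flat weight list into groups, each with its own scale and zero-point."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]
```

Each group stores its integers plus one scale and one zero-point, which is why smaller groups track the weights more closely but cost more metadata per weight.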

Quantization Methods

Modern methods minimize quality loss by analyzing weight importance during quantization.

GPTQ: Post-training, uses calibration data
AWQ: Protects salient weights (~1%)
GGUF: llama.cpp format, CPU-optimized
ExLlamaV2: Mixed precision, GPU-only

AWQ typically outperforms GPTQ at the same bit width because it identifies and preserves the most important weight channels.

Akamai Connected Cloud — GPU inference at the edge

Distributed Inference Architecture

Akamai Connected Cloud provides GPU compute instances in 25+ global markets. Combined with Akamai CDN and EdgeWorkers, this enables a three-tier architecture where each layer handles what it does best.

CDN (cache model artifacts, serve static assets)
→ EdgeWorkers (pre/post processing; auth, routing, filter)
→ GPU Instance (L4 / A100 / H100; model inference)
→ Response (stream tokens via CDN edge)

CDN caches GGUF model files at edge PoPs (one-time download per region)
EdgeWorkers handle auth, rate limiting, prompt sanitization at <1ms
GPU instance runs vLLM/TGI for actual model inference

This architecture means the model weights are already cached near the GPU when a new instance spins up, so no cross-region download is required.

Hardware comparison — TFLOPS, memory, power, cost

Hardware	Type	Memory	FP16 TFLOPS	Relative Speed (vs H100)	Power	Best For
H100 SXM	Data center GPU	80 GB HBM3	989	1.0x	700W	Frontier models, training
A100 80GB	Data center GPU	80 GB HBM2e	312	0.32x	400W	Production inference
L4	Edge GPU	24 GB GDDR6	121	0.12x	72W	Edge inference, Akamai CC
M4 Max	Apple Silicon	128 GB unified	~54	0.05x	40W	Local dev, Ollama, 70B INT4
RTX 4090	Consumer GPU	24 GB GDDR6X	165	0.17x	450W	Enthusiast, small models
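The relative speed column appears to be each device's FP16 TFLOPS divided by the H100's. The sketch below recomputes that ratio and adds TFLOPS per watt from the table's own numbers. These are peak figures, not measured token throughput, so treat the output as a rough comparison only.

```python
# (FP16 TFLOPS, power in watts), copied from the table above.
HARDWARE = {
    "H100 SXM": (989, 700),
    "A100 80GB": (312, 400),
    "L4": (121, 72),
    "M4 Max": (54, 40),      # the table lists this TFLOPS value as approximate
    "RTX 4090": (165, 450),
}


def relative_speed(name: str, baseline: str = "H100 SXM") -> float:
    """Peak FP16 TFLOPS relative to the baseline device."""
    return HARDWARE[name][0] / HARDWARE[baseline][0]


def tflops_per_watt(name: str) -> float:
    """Peak FP16 TFLOPS per watt of rated power."""
    tflops, watts = HARDWARE[name]
    return tflops / watts


if __name__ == "__main__":
    for name in sorted(HARDWARE, key=tflops_per_watt, reverse=True):
        print(f"{name:10s} {relative_speed(name):.2f}x  {tflops_per_watt(name):.2f} TFLOPS/W")
```

On these numbers the L4 has the highest peak TFLOPS per watt, which fits its role in the edge tier.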

Real-world edge inference architectures

Hybrid: Edge + Cloud Fallback

Run a small model (8B INT4) locally for common queries. Route complex queries to a cloud 70B model via Akamai CDN. EdgeWorkers classify query complexity in <1ms and route accordingly. 80% of queries never leave the edge.
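The routing decision can be sketched as a cheap heuristic that runs before any model is called. Everything here is hypothetical: the keyword list, the word-count threshold, and the backend names are placeholders, and the logic is written in Python for illustration rather than as EdgeWorkers code.

```python
# Hypothetical phrases that suggest a query needs the larger model.
COMPLEX_HINTS = ("prove", "analyze", "step by step", "compare", "refactor")


def is_complex(prompt: str, max_simple_words: int = 60) -> bool:
    """Cheap complexity check: long prompts or reasoning keywords count as complex."""
    text = prompt.lower()
    too_long = len(text.split()) > max_simple_words
    has_hint = any(hint in text for hint in COMPLEX_HINTS)
    return too_long or has_hint


def route(prompt: str) -> str:
    """Return the name of the backend that should serve this prompt."""
    return "cloud-70b" if is_complex(prompt) else "edge-8b-int4"
```

With this sketch, `route("What is the capital of France?")` returns `"edge-8b-int4"`, while a prompt containing "analyze" goes to `"cloud-70b"`. A string scan like this is cheap enough for a sub-millisecond budget, though a real deployment would likely tune the rule against logged traffic.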

CDN-Cached Model Weights

Store GGUF model files on Akamai CDN as cacheable objects. When a new GPU instance spins up, it pulls the model from the nearest PoP (~50ms round trip) instead of an origin server (~500ms round trip). Model updates propagate via cache invalidation.

Edge Pre/Post Processing

EdgeWorkers tokenize input, sanitize prompts, enforce rate limits, and apply content filtering before the request reaches the GPU. Response streaming flows back through the CDN with token-level SSE events. The GPU only handles matrix math.
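Token-level SSE streaming means each generated token is wrapped in its own Server-Sent Events frame: a `data:` line followed by a blank line. A minimal sketch of that framing follows. The JSON payload shape and the `[DONE]` end marker are assumptions borrowed from common practice, not something this architecture specifies.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in one SSE frame, then emit an end-of-stream marker."""
    for index, token in enumerate(tokens):
        payload = json.dumps({"index": index, "token": token})
        yield f"data: {payload}\n\n"   # a blank line terminates each SSE event
    yield "data: [DONE]\n\n"           # assumed sentinel; clients must agree on it
```

Because the function is a generator, frames can be written to the response as soon as the GPU produces each token instead of waiting for the full completion.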

Fully Local: Ollama on Mac Mini

For development and demos: Ollama serves models via a local REST API. Llama 3.1 8B at INT4 runs at ~60 tok/s on M4 Max. No network, no cost, no data leaves the machine. HarperDB connects directly via localhost.
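A minimal sketch of calling the local Ollama REST API using only the Python standard library. It assumes Ollama is listening on its default port 11434 and that a model tagged `llama3.1:8b` has already been pulled; adjust both to match your setup.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.1:8b", timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (requires a running Ollama)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize edge inference in one sentence.")` returns the completion as a string. Setting `"stream"` to true instead makes the server return one JSON object per line as tokens are produced.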