FTS MCP Server

Field Technology Services — AI-Orchestrated Akamai Operations

Reference · Training Fundamentals

Fine-Tuning & LoRA

PEFT · LoRA · QLoRA
Adapter Merging · Edge Deployment
What is fine-tuning — specializing a general model

Fine-tuning takes a pre-trained model — one that already understands language — and continues training it on a smaller, domain-specific dataset. The model retains its general knowledge while learning new patterns, terminology, and task-specific behavior from your data.

Analogy: Pre-training is like a medical degree (broad knowledge). Fine-tuning is like a residency (specialized expertise). You don't re-learn biology — you build on it.

100%
Full fine-tuning updates all weights
0.1-1%
LoRA trains only adapter weights
4-bit
QLoRA quantizes base model

Full fine-tuning vs parameter-efficient fine-tuning (PEFT)
Full Fine-Tuning

Every parameter in the model is updated during training. This gives maximum flexibility but requires enormous GPU memory — you need to store the model weights, gradients, and optimizer states for every parameter.

Memory
~280 GB
Params
70B (all)
GPUs
8x A100

Risk: catastrophic forgetting — the model can lose general capabilities while learning specific ones.

PEFT / LoRA

Freeze the original model weights entirely. Add small trainable adapter modules. Only the adapters are updated during training. The original model is preserved perfectly — no catastrophic forgetting.

Memory
~16 GB
Params
~70M
GPUs
1x A100

Adapter file is typically 10-100 MB — can be swapped, merged, or stacked at runtime.


LoRA explained — low-rank adaptation of large language models

The core insight of LoRA: weight updates during fine-tuning have low intrinsic rank. Instead of updating the full weight matrix W (d x d), LoRA decomposes the update into two small matrices B (d x r) and A (r x d), where r is much smaller than d.

W
Original weights
d × d
(frozen)
+
ΔW
Weight update
d × d
(what we want)
=
W
Original weights
d × d
(frozen)
+
B
Down-project
d × r
×
A
Up-project
r × d
h = W0x + ΔWx = W0x + BAx
Where: d = 4096 (hidden dim), r = 8-64 (rank)
Full ΔW: 4096 × 4096 = 16.7M params per layer
LoRA B×A: 4096×16 + 16×4096 = 131K params per layer
At rank 16, LoRA uses 128x fewer parameters per layer while achieving 95-100% of full fine-tuning quality on most tasks.
Rank (r)

The rank controls adapter capacity. Higher rank = more expressive but more parameters. Most tasks work well with r=8 to r=64. Complex domain shifts may need r=128+. Start low and increase if quality is insufficient.

Alpha (α)

A scaling factor applied to the LoRA output: ΔW = (α/r) × BA. Higher alpha amplifies the adapter's effect. Common practice: set α = 2×r. This controls how much the adapter can deviate from base model behavior.

Target Modules

LoRA can target specific weight matrices: Q, K, V projections in attention, the output projection, or feed-forward layers. Targeting all linear layers gives best quality; targeting only Q/V saves memory with modest quality loss.


QLoRA — quantized LoRA for consumer hardware
QLoRA: Fine-tune a 65B model on a single 48GB GPU

QLoRA (Dettmers et al., 2023) combines three innovations to make fine-tuning dramatically more memory-efficient. The base model is loaded in 4-bit precision (NF4), LoRA adapters operate in FP16/BF16, and a novel "double quantization" further compresses the quantization constants.

4-bit NormalFloat (NF4)

Base model weights quantized to 4-bit using a data type optimized for normally-distributed neural network weights. Better than standard INT4 because it allocates quantization bins based on the expected weight distribution.

FP16 LoRA Adapters

Adapter matrices B and A are kept in full FP16 precision. During forward pass: dequantize NF4 weights to FP16, multiply, then add the LoRA adapter output. Gradients flow only through the FP16 adapters.

Double Quantization

Quantization requires scale factors per block of weights. Double quantization quantizes these scale factors too — from FP32 to FP8 — saving an additional ~0.37 bits per parameter, or ~3 GB on a 65B model.

Memory for 70B model:
Full fine-tuning (FP16 + Adam): ~280 GB (model + gradients + optimizer)
LoRA (FP16 base + adapters): ~140 GB (base model still in FP16)
QLoRA (NF4 base + FP16 adapters): ~36 GB (fits on 1 A100 or 2x consumer GPUs)
QLoRA achieves within 1% of full fine-tuning quality on most benchmarks, at 1/8th the memory cost.

Decision framework — when to fine-tune vs when to use RAG
Criterion Fine-Tuning RAG Recommendation
Data changes frequently Bad — must retrain each update Great — update docs instantly Use RAG
Need specific output format Great — model learns your format Limited — depends on prompting Fine-tune
Domain-specific jargon Great — learns terminology Good — if terms are in context Either / both
Need factual accuracy Risk — can hallucinate trained data Great — cites source documents Use RAG
Custom reasoning style Great — learns reasoning patterns Limited — can't change how model thinks Fine-tune
Budget is tight Moderate — $100-$10K per run Low — just embedding + storage costs Start with RAG
Maximum quality Combine both — fine-tune + RAG Both

Practical considerations — dataset, compute, risks
Dataset Size
Minimum: 100-500 high-quality examples for style transfer. Recommended: 1K-10K examples for robust task learning. Diminishing returns beyond 50K unless the domain is highly complex. Quality > quantity — 500 perfect examples beat 50K noisy ones.
Compute Requirements
QLoRA on 7B model: single GPU, 4-8 hours, ~$10 on cloud. QLoRA on 70B model: 1-2 A100s, 12-24 hours, ~$100-500. Full fine-tuning on 70B: 8+ A100s, days, $5K+. Most teams should start with QLoRA on a 7-13B model.
Catastrophic Forgetting
Full fine-tuning can cause the model to "forget" general capabilities while learning specific ones. LoRA mitigates this by keeping original weights frozen. The adapter can always be removed to recover the base model. Regularization and low learning rates also help.
Edge Deployment
Fine-tuned LoRA adapters are tiny — typically 10-100 MB. The base model stays the same. Multiple adapters can be swapped at runtime for different tasks or customers. This is ideal for edge: deploy one base model, swap adapters per request.

Connection to inference — adapters at the edge

LoRA adapters are the key to making fine-tuning practical for edge deployments. The base model (quantized to INT4/FP8) runs on edge hardware. Task-specific LoRA adapters are loaded on-demand — swapped in milliseconds, not minutes.

~50 MB
Typical adapter size
<100ms
Adapter swap time
0%
Base model quality loss
N
Adapters per base model
Edge deployment pattern:
1. Base model (INT4, ~4 GB) loaded once on edge device
2. Request arrives with task identifier (e.g., "legal-review")
3. LoRA adapter for that task loaded from CDN/edge store (~50 MB)
4. Inference runs with merged weights: Weffective = Wbase + BA
5. Adapter can be unloaded/swapped for next request
This pattern enables multi-tenant AI at the edge: one model, many specializations, minimal additional storage.