What is fine-tuning — specializing a general model
Fine-tuning takes a pre-trained model — one that already understands language — and continues training it on a smaller, domain-specific dataset. The model retains its general knowledge while learning new patterns, terminology, and task-specific behavior from your data.
Analogy: Pre-training is like a medical degree (broad knowledge). Fine-tuning is like a residency (specialized expertise). You don't re-learn biology — you build on it.
100%: full fine-tuning updates all weights
0.1-1%: LoRA trains only adapter weights
4-bit: QLoRA quantizes the base model
Full fine-tuning vs parameter-efficient fine-tuning (PEFT)
Full Fine-Tuning
Every parameter in the model is updated during training. This gives maximum flexibility but requires enormous GPU memory — you need to store the model weights, gradients, and optimizer states for every parameter.
Risk: catastrophic forgetting — the model can lose general capabilities while learning specific ones.
PEFT / LoRA
Freeze the original model weights entirely and add small trainable adapter modules. Only the adapters are updated during training, so the base weights are preserved exactly. Removing the adapter restores the original model, which greatly reduces the risk of catastrophic forgetting.
Adapter file is typically 10-100 MB — can be swapped, merged, or stacked at runtime.
LoRA explained — low-rank adaptation of large language models
The core insight of LoRA: weight updates during fine-tuning have low intrinsic rank. Instead of updating the full weight matrix W (d x d), LoRA decomposes the update into two small matrices B (d x r) and A (r x d), where r is much smaller than d.
Diagram: W (original weights, d × d, frozen) + ΔW (weight update, d × d, what we want) = W (original weights, d × d, frozen) + B (d × r) × A (r × d)
h = W₀x + ΔWx = W₀x + BAx
Where: d = 4096 (hidden dim), r = 8-64 (rank)
Full ΔW: 4096 × 4096 = 16.7M params per layer
LoRA B×A: 4096×16 + 16×4096 = 131K params per layer
At rank 16, LoRA uses 128x fewer parameters per layer while achieving 95-100% of full fine-tuning quality on most tasks.
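The parameter arithmetic above can be checked in a few lines of Python. This is a minimal sketch: the function name `lora_param_counts` is illustrative, and it assumes a single square d × d weight matrix as in the example.

```python
def lora_param_counts(d: int, r: int) -> dict:
    """Compare parameter counts for a dense update vs. a rank-r LoRA update
    of one square d x d weight matrix."""
    full = d * d            # dense delta-W, d x d
    lora = d * r + r * d    # B (d x r) plus A (r x d)
    return {"full": full, "lora": lora, "ratio": full / lora}


counts = lora_param_counts(d=4096, r=16)
print(counts)
```

The ratio simplifies to d / (2r), so halving the rank doubles the savings.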
Rank (r)
The rank controls adapter capacity. Higher rank = more expressive but more parameters. Most tasks work well with r=8 to r=64. Complex domain shifts may need r=128+. Start low and increase if quality is insufficient.
Alpha (α)
A scaling factor applied to the LoRA output: ΔW = (α/r) × BA. Higher alpha amplifies the adapter's effect. Common practice: set α = 2×r. This controls how much the adapter can deviate from base model behavior.
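To make the role of rank and alpha concrete, here is a small pure-Python sketch of the LoRA forward pass h = W₀x + (α/r)·BAx. The helper names (`matvec`, `lora_forward`) and the tiny dimensions are assumptions for illustration. Initializing B to zeros, so the adapter starts as a no-op, follows the convention commonly used for LoRA rather than anything stated above.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W0, A, B, x, alpha):
    """Compute h = W0 x + (alpha / r) * B (A x) without forming the d x d product BA."""
    r = len(A)                          # A is r x d, so its row count is the rank
    scale = alpha / r
    base = matvec(W0, x)                # frozen path
    update = matvec(B, matvec(A, x))    # low-rank path: project down to r, then back up to d
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, r = 6, 2                             # toy sizes; real models use d in the thousands
W0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]
x = [random.gauss(0, 1) for _ in range(d)]

B_zero = [[0.0] * r for _ in range(d)]      # assumed init: adapter contributes nothing
B_trained = [[0.1] * r for _ in range(d)]   # stand-in for a trained B

base_out = matvec(W0, x)
h_init = lora_forward(W0, A, B_zero, x, alpha=4)
h_a4 = lora_forward(W0, A, B_trained, x, alpha=4)
h_a8 = lora_forward(W0, A, B_trained, x, alpha=8)
```

With B at zero the output equals the base model's output, and doubling alpha doubles the adapter's contribution while leaving the frozen path untouched.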
Target Modules
LoRA can target specific weight matrices: Q, K, V projections in attention, the output projection, or feed-forward layers. Targeting all linear layers gives best quality; targeting only Q/V saves memory with modest quality loss.
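The trade-off between targeting only Q/V and targeting all linear layers can be estimated by counting adapter parameters. The sketch below assumes a 7B-class decoder with 32 layers, hidden size 4096 and feed-forward size 11008, and uses module names in the style of common open-source checkpoints (`q_proj`, `gate_proj`, and so on). These shapes and names are assumptions, not taken from the text above.

```python
# (d_in, d_out) for each linear layer in one transformer block (assumed shapes)
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 11008),
    "up_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}


def lora_params(target_modules, r, n_layers=32):
    """Total trainable LoRA parameters: each targeted d_out x d_in matrix
    gets A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) parameters."""
    per_block = sum(r * sum(LAYER_SHAPES[name]) for name in target_modules)
    return per_block * n_layers


qv_only = lora_params(["q_proj", "v_proj"], r=16)
all_linear = lora_params(list(LAYER_SHAPES), r=16)
print(f"Q/V only:   {qv_only:,} params")
print(f"All linear: {all_linear:,} params")
```

Under these assumptions, targeting every linear layer trains roughly five times as many parameters as Q/V alone, yet still well under 1% of a 7B model.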
QLoRA — quantized LoRA for consumer hardware
QLoRA: Fine-tune a 65B model on a single 48GB GPU
QLoRA (Dettmers et al., 2023) combines three innovations to make fine-tuning dramatically more memory-efficient. The base model is loaded in 4-bit precision (NF4), LoRA adapters operate in FP16/BF16, and a novel "double quantization" further compresses the quantization constants.
4-bit NormalFloat (NF4)
Base model weights quantized to 4-bit using a data type optimized for normally-distributed neural network weights. Better than standard INT4 because it allocates quantization bins based on the expected weight distribution.
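The idea of placing quantization levels where normally distributed weights are dense can be sketched with the standard library alone. This is a simplified illustration, not the exact NF4 data type: the real NF4 codebook is built differently (for instance, it includes an exact zero). The function names and the sample block are assumptions.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Build 2**bits levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1].
    Simplified stand-in for NF4, which uses a different exact construction."""
    n = 2 ** bits
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]


def quantize_block(weights, levels):
    """Absmax-scale one block of weights and map each value to its nearest level index."""
    scale = max(abs(w) for w in weights) or 1.0   # avoid dividing by zero on an all-zero block
    idx = [min(range(len(levels)), key=lambda k: abs(levels[k] - w / scale))
           for w in weights]
    return idx, scale


def dequantize_block(idx, scale, levels):
    """Recover approximate weights from level indices and the block's scale factor."""
    return [levels[i] * scale for i in idx]


levels = normal_quantile_levels(bits=4)
block = [0.02, -0.5, 0.31, 0.0, -0.07, 0.9, -0.9, 0.12]
idx, scale = quantize_block(block, levels)
restored = dequantize_block(idx, scale, levels)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(idx, scale, round(max_err, 4))
```

Because the levels cluster near zero, small weights (the most common ones under a normal distribution) are reconstructed more precisely than with evenly spaced INT4 bins.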
FP16 LoRA Adapters
Adapter matrices B and A are kept in 16-bit precision (FP16/BF16). During the forward pass: dequantize the NF4 weights to 16-bit, multiply, then add the LoRA adapter output. Gradients are backpropagated through the frozen 4-bit weights, but only the adapter weights are updated.
Double Quantization
Quantization requires scale factors per block of weights. Double quantization quantizes these scale factors too — from FP32 to FP8 — saving an additional ~0.37 bits per parameter, or ~3 GB on a 65B model.
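The ~0.37 bits per parameter figure can be reproduced with a short calculation. The sketch assumes the block sizes described in the QLoRA paper: one scale factor per 64 weights, and second-level quantization of those scale factors to 8 bits in blocks of 256, each with one FP32 constant. The function name is illustrative.

```python
def scale_overhead_bits(block=64, second_block=256, double_quant=True):
    """Bits of scale-factor overhead per model parameter."""
    if not double_quant:
        return 32 / block                              # one FP32 scale per block
    # 8-bit scale per block, plus one FP32 constant per second-level block
    return 8 / block + 32 / (block * second_block)


plain = scale_overhead_bits(double_quant=False)
doubled = scale_overhead_bits(double_quant=True)
saved_bits = plain - doubled
saved_gb = 65e9 * saved_bits / 8 / 1e9                 # 65B parameters, bits -> bytes -> GB

print(f"overhead without double quantization: {plain:.3f} bits/param")
print(f"overhead with double quantization:    {doubled:.3f} bits/param")
print(f"saved: {saved_bits:.3f} bits/param, about {saved_gb:.1f} GB on a 65B model")
```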
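Memory for 70B model (weights and training state only, excluding activations):
Full fine-tuning (FP16 + Adam): ~840 GB (140 GB weights + 140 GB gradients + ~560 GB optimizer state, assuming two FP32 Adam moments per parameter)
LoRA (FP16 base + adapters): ~140 GB (base model still in FP16)
QLoRA (NF4 base + FP16 adapters): ~36 GB (fits on one 80 GB A100 or two 24 GB consumer GPUs)
QLoRA is reported to achieve within 1% of full fine-tuning quality on most benchmarks, at roughly a quarter of the memory of FP16 LoRA and a small fraction of the memory of full fine-tuning.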
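The base-model figures in this comparison follow directly from bits per parameter. The sketch below covers weight storage only; it ignores activations, gradients, optimizer state and quantization constants. The function name and the 40M adapter size are assumptions for illustration.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store n_params weights at the given precision, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 70e9
fp16_base = weight_memory_gb(n_params, 16)   # base model for plain LoRA
nf4_base = weight_memory_gb(n_params, 4)     # base model for QLoRA
adapter = weight_memory_gb(40e6, 16)         # hypothetical 40M-parameter adapter in 16-bit

print(f"FP16 base: {fp16_base:.0f} GB")
print(f"NF4 base:  {nf4_base:.0f} GB")
print(f"Adapter:   {adapter:.2f} GB")
```

The adapter itself is negligible next to the base model, which is why quantizing the frozen weights accounts for nearly all of QLoRA's savings.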
Decision framework — when to fine-tune vs when to use RAG
| Criterion | Fine-Tuning | RAG | Recommendation |
|---|---|---|---|
| Data changes frequently | Bad: must retrain each update | Great: update docs instantly | Use RAG |
| Need specific output format | Great: model learns your format | Limited: depends on prompting | Fine-tune |
| Domain-specific jargon | Great: learns terminology | Good: if terms are in context | Either / both |
| Need factual accuracy | Risk: can hallucinate trained data | Great: cites source documents | Use RAG |
| Custom reasoning style | Great: learns reasoning patterns | Limited: can't change how model thinks | Fine-tune |
| Budget is tight | Moderate: $100-$10K per run | Low: just embedding + storage costs | Start with RAG |
| Maximum quality | Combine both: fine-tune + RAG | Combine both: fine-tune + RAG | Both |
Practical considerations — dataset, compute, risks
Dataset Size
Minimum: 100-500 high-quality examples for style transfer. Recommended: 1K-10K examples for robust task learning. Diminishing returns beyond 50K unless the domain is highly complex. Quality > quantity — 500 perfect examples beat 50K noisy ones.
Compute Requirements
QLoRA on 7B model: single GPU, 4-8 hours, ~$10 on cloud. QLoRA on 70B model: 1-2 A100s, 12-24 hours, ~$100-500. Full fine-tuning on 70B: 8+ A100s, days, $5K+. Most teams should start with QLoRA on a 7-13B model.
Catastrophic Forgetting
Full fine-tuning can cause the model to "forget" general capabilities while learning specific ones. LoRA mitigates this by keeping original weights frozen. The adapter can always be removed to recover the base model. Regularization and low learning rates also help.
Edge Deployment
Fine-tuned LoRA adapters are tiny — typically 10-100 MB. The base model stays the same. Multiple adapters can be swapped at runtime for different tasks or customers. This is ideal for edge: deploy one base model, swap adapters per request.
Connection to inference — adapters at the edge
LoRA adapters are the key to making fine-tuning practical for edge deployments. The base model (quantized to INT4/FP8) runs on edge hardware. Task-specific LoRA adapters are loaded on-demand — swapped in milliseconds, not minutes.
~50 MB: typical adapter size
0%: base model quality loss
N: adapters per base model
Edge deployment pattern:
1. Base model (INT4, ~4 GB) loaded once on edge device
2. Request arrives with task identifier (e.g., "legal-review")
3. LoRA adapter for that task loaded from CDN/edge store (~50 MB)
4. Inference runs with effective weights: W_effective = W_base + (α/r) × BA
5. Adapter can be unloaded/swapped for next request
This pattern enables multi-tenant AI at the edge: one model, many specializations, minimal additional storage.
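The deployment pattern above can be sketched as a small in-memory adapter registry. This is a toy illustration under stated assumptions: `AdapterServer` and its methods are hypothetical names, a plain dict stands in for the CDN or edge store, one small matrix stands in for the base model, and the adapter is applied as a side path rather than merged into the weights.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class AdapterServer:
    """Holds one frozen base weight matrix and any number of swappable LoRA adapters."""

    def __init__(self, base_weights):
        self.W = base_weights        # step 1: base model loaded once
        self.adapters = {}           # task id -> (A, B, scale)

    def load_adapter(self, task_id, A, B, alpha):
        """Step 3: register an adapter; scale is alpha / r, with r the row count of A."""
        self.adapters[task_id] = (A, B, alpha / len(A))

    def unload_adapter(self, task_id):
        """Step 5: drop an adapter to free memory; the base weights are untouched."""
        self.adapters.pop(task_id, None)

    def infer(self, x, task_id=None):
        """Steps 2 and 4: route by task id and compute W x + scale * B (A x)."""
        base = matvec(self.W, x)
        if task_id is None:
            return base
        A, B, scale = self.adapters[task_id]
        update = matvec(B, matvec(A, x))
        return [b + scale * u for b, u in zip(base, update)]


server = AdapterServer([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
server.load_adapter("legal-review", A=[[1, 0, 0]], B=[[1], [0], [0]], alpha=2)
server.load_adapter("support-chat", A=[[0, 1, 0]], B=[[0], [1], [0]], alpha=1)

x = [1, 2, 3]
legal_out = server.infer(x, "legal-review")
chat_out = server.infer(x, "support-chat")
base_out = server.infer(x)
```

Keeping the adapter as a side path means swapping tasks never rewrites the base weights, at the cost of two extra small matrix products per request.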