Fine-Tuning & LoRA — FTS MCP Server

What is fine-tuning — specializing a general model

Fine-tuning takes a pre-trained model — one that already understands language — and continues training it on a smaller, domain-specific dataset. The model retains its general knowledge while learning new patterns, terminology, and task-specific behavior from your data.

Analogy: Pre-training is like a medical degree (broad knowledge). Fine-tuning is like a residency (specialized expertise). You don't re-learn biology — you build on it.

100%

Full fine-tuning updates all weights

0.1-1%

LoRA trains only adapter weights

4-bit

QLoRA quantizes base model

Full fine-tuning vs parameter-efficient fine-tuning (PEFT)

Full Fine-Tuning

Every parameter in the model is updated during training. This gives maximum flexibility but requires enormous GPU memory — you need to store the model weights, gradients, and optimizer states for every parameter.

Memory

~280 GB

Params

70B (all)

GPUs

8x A100

Risk: catastrophic forgetting — the model can lose general capabilities while learning specific ones.

PEFT / LoRA

Freeze the original model weights entirely. Add small trainable adapter modules. Only the adapters are updated during training. The original model is preserved perfectly — no catastrophic forgetting.

Memory

~16 GB

Params

~70M

GPUs

1x A100

Adapter file is typically 10-100 MB — can be swapped, merged, or stacked at runtime.

LoRA explained — low-rank adaptation of large language models

The core insight of LoRA: weight updates during fine-tuning have low intrinsic rank. Instead of updating the full weight matrix W (d x d), LoRA decomposes the update into two small matrices B (d x r) and A (r x d), where r is much smaller than d.

Original weights
d × d
(frozen)

ΔW

Weight update
d × d
(what we want)

Original weights
d × d
(frozen)

Down-project
d × r

Up-project
r × d

h = W₀x + ΔWx = W₀x + BAx
Where: d = 4096 (hidden dim), r = 8-64 (rank)
Full ΔW: 4096 × 4096 = 16.7M params per layer
LoRA B×A: 4096×16 + 16×4096 = 131K params per layer
At rank 16, LoRA uses 128x fewer parameters per layer while achieving 95-100% of full fine-tuning quality on most tasks.

Rank (r)

The rank controls adapter capacity. Higher rank = more expressive but more parameters. Most tasks work well with r=8 to r=64. Complex domain shifts may need r=128+. Start low and increase if quality is insufficient.

Alpha (α)

A scaling factor applied to the LoRA output: ΔW = (α/r) × BA. Higher alpha amplifies the adapter's effect. Common practice: set α = 2×r. This controls how much the adapter can deviate from base model behavior.

Target Modules

LoRA can target specific weight matrices: Q, K, V projections in attention, the output projection, or feed-forward layers. Targeting all linear layers gives best quality; targeting only Q/V saves memory with modest quality loss.

QLoRA — quantized LoRA for consumer hardware

QLoRA: Fine-tune a 65B model on a single 48GB GPU

QLoRA (Dettmers et al., 2023) combines three innovations to make fine-tuning dramatically more memory-efficient. The base model is loaded in 4-bit precision (NF4), LoRA adapters operate in FP16/BF16, and a novel "double quantization" further compresses the quantization constants.

4-bit NormalFloat (NF4)

Base model weights quantized to 4-bit using a data type optimized for normally-distributed neural network weights. Better than standard INT4 because it allocates quantization bins based on the expected weight distribution.

FP16 LoRA Adapters

Adapter matrices B and A are kept in full FP16 precision. During forward pass: dequantize NF4 weights to FP16, multiply, then add the LoRA adapter output. Gradients flow only through the FP16 adapters.

Double Quantization

Quantization requires scale factors per block of weights. Double quantization quantizes these scale factors too — from FP32 to FP8 — saving an additional ~0.37 bits per parameter, or ~3 GB on a 65B model.

Memory for 70B model:
Full fine-tuning (FP16 + Adam): ~280 GB (model + gradients + optimizer)
LoRA (FP16 base + adapters): ~140 GB (base model still in FP16)
QLoRA (NF4 base + FP16 adapters): ~36 GB (fits on 1 A100 or 2x consumer GPUs)
QLoRA achieves within 1% of full fine-tuning quality on most benchmarks, at 1/8th the memory cost.

Decision framework — when to fine-tune vs when to use RAG

Criterion	Fine-Tuning	RAG	Recommendation
Data changes frequently	Bad — must retrain each update	Great — update docs instantly	Use RAG
Need specific output format	Great — model learns your format	Limited — depends on prompting	Fine-tune
Domain-specific jargon	Great — learns terminology	Good — if terms are in context	Either / both
Need factual accuracy	Risk — can hallucinate trained data	Great — cites source documents	Use RAG
Custom reasoning style	Great — learns reasoning patterns	Limited — can't change how model thinks	Fine-tune
Budget is tight	Moderate — $100-$10K per run	Low — just embedding + storage costs	Start with RAG
Maximum quality	Combine both — fine-tune + RAG		Both

Practical considerations — dataset, compute, risks

Dataset Size

Minimum: 100-500 high-quality examples for style transfer. Recommended: 1K-10K examples for robust task learning. Diminishing returns beyond 50K unless the domain is highly complex. Quality > quantity — 500 perfect examples beat 50K noisy ones.

Compute Requirements

QLoRA on 7B model: single GPU, 4-8 hours, ~$10 on cloud. QLoRA on 70B model: 1-2 A100s, 12-24 hours, ~$100-500. Full fine-tuning on 70B: 8+ A100s, days, $5K+. Most teams should start with QLoRA on a 7-13B model.

Catastrophic Forgetting

Full fine-tuning can cause the model to "forget" general capabilities while learning specific ones. LoRA mitigates this by keeping original weights frozen. The adapter can always be removed to recover the base model. Regularization and low learning rates also help.

Edge Deployment

Fine-tuned LoRA adapters are tiny — typically 10-100 MB. The base model stays the same. Multiple adapters can be swapped at runtime for different tasks or customers. This is ideal for edge: deploy one base model, swap adapters per request.

Connection to inference — adapters at the edge

LoRA adapters are the key to making fine-tuning practical for edge deployments. The base model (quantized to INT4/FP8) runs on edge hardware. Task-specific LoRA adapters are loaded on-demand — swapped in milliseconds, not minutes.

~50 MB

Typical adapter size

<100ms

Adapter swap time

Base model quality loss

Adapters per base model

Edge deployment pattern:
1. Base model (INT4, ~4 GB) loaded once on edge device
2. Request arrives with task identifier (e.g., "legal-review")
3. LoRA adapter for that task loaded from CDN/edge store (~50 MB)
4. Inference runs with merged weights: W_effective = W_base + BA
5. Adapter can be unloaded/swapped for next request
This pattern enables multi-tenant AI at the edge: one model, many specializations, minimal additional storage.