The training pipeline — from raw data to deployed model
Training a large language model is a multi-stage process that transforms raw internet text into a system capable of generating coherent, useful responses. Each stage serves a distinct purpose and operates at a different scale.
Data Collection & Preprocessing
Web crawls (Common Crawl), books, code repositories, scientific papers, and curated datasets. Raw data is cleaned: deduplication, language filtering, quality scoring, PII removal, and toxic content filtering.
Common Crawl: ~250B pages crawled
After filtering: ~5-10% survives quality gates
Data quality matters more than quantity. Garbage in, garbage out — at trillion-token scale.
Tokenization
Text is split into subword tokens using algorithms like BPE (Byte-Pair Encoding) or SentencePiece. The model never sees raw characters — it operates on integer token IDs mapped to learned embedding vectors.
"Hello world" → [15496, 995]
Vocab size: 32K–128K tokens
~1 token ≈ 0.75 English words
Tokenization determines the model's vocabulary and directly impacts context length efficiency.
Inside a single training step — the four operations
Every training step processes a batch of sequences through the same four operations. A model like GPT-4 executes millions of these steps during pre-training, each one nudging billions of weights toward better predictions.
1. Forward Pass
Input tokens flow through every layer of the network. Each layer applies its weights via matrix multiplication, adds biases, and passes results through activation functions. The output is a probability distribution over all possible next tokens.
2. Loss Calculation
Compare the model's predicted token probabilities against the actual next tokens. Cross-entropy loss measures how surprised the model is by the correct answer. Lower loss = better predictions.
3. Backpropagation
The loss signal flows backward through every layer. The chain rule of calculus computes how much each weight contributed to the error. This produces a gradient for every one of the billions of parameters — a direction to move each weight to reduce loss.
4. Optimizer Step
The optimizer (typically AdamW) uses the gradients to update weights. It maintains momentum and adaptive learning rates per parameter. The learning rate schedule controls step size — too large causes instability, too small causes slow convergence.
Cross-entropy loss: L = -∑ yi log(pi)
Gradient: ∇W = ∂L / ∂W (computed via chain rule through all layers)
Weight update: Wnew = Wold - η · ∇W (simplified; Adam adds momentum + adaptive rates)
Every weight in the model is updated simultaneously. GPT-3 has 175B weights; each step updates all of them.
Neural network layers — weights, activations, gradients
W weights
↓ gradients flow back
W weights
↓ gradients flow back
Forward: data flows left → right Backward: gradients flow right → left
Real transformers have 32-128 layers, each with multi-head attention + feed-forward sublayers. Same principle, vastly more weights.
Loss functions — how the model measures its own mistakes
Cross-Entropy Loss
The standard loss function for language models. For each position in the sequence, the model outputs a probability distribution over all tokens in the vocabulary. Cross-entropy measures how far that distribution is from the true answer (the actual next token).
L = -log(pcorrect)
If the model assigns probability 0.9 to the correct token: L = -log(0.9) = 0.105 (low loss, good prediction). If it assigns 0.01: L = -log(0.01) = 4.6 (high loss, bad prediction).
Perplexity
Perplexity is the exponential of the average cross-entropy loss. It represents how many tokens the model is "choosing between" at each step. A perplexity of 15 means the model is, on average, as uncertain as if it were picking uniformly from 15 options.
PPL = eL = 2H
GPT-3 perplexity on test data: ~20
GPT-4 class perplexity: ~8-12
Lower perplexity = better model. Human-level perplexity on most English text is approximately 10-20.
Training data and compute — the scale of modern models
300B
GPT-3 tokens
175B params, 2020
2T
Llama 2 tokens
70B params, 2023
15T
Llama 3 tokens
405B params, 2024
$100M+
GPT-4 class compute
~25,000 A100 GPUs
Chinchilla Scaling Laws
DeepMind's Chinchilla paper (2022) showed that most models were undertrained relative to their size. The optimal ratio is roughly 20 tokens per parameter. A 70B model should ideally be trained on 1.4T tokens. Modern models exceed this significantly — Llama 3 used 15T tokens for a 405B model (~37 tokens per parameter).
Chinchilla optimal: tokens ≈ 20 × parameters
Llama 3 (405B): 15T / 405B = 37× — over-trained by Chinchilla standards
Over-training beyond Chinchilla ratios improves inference efficiency: the model is smaller than "optimal" for its data budget, making it cheaper to run.
Pre-training vs post-training — two distinct phases
Pre-training
The model learns language patterns by predicting the next token over trillions of tokens. This is unsupervised — no human labels, just raw text. The model learns grammar, facts, reasoning patterns, and code from the statistical structure of the data.
Objective: next-token prediction (autoregressive)
Data: 10-15T tokens of web, books, code
Compute: months on thousands of GPUs
Result: a "base model" that can complete text but doesn't follow instructions
Post-training
The base model is refined to follow instructions, be helpful, and avoid harmful outputs. This uses much less data (thousands to millions of examples) but requires expensive human annotation and careful reward modeling.
SFT (Supervised Fine-Tuning) — train on human-written instruction/response pairs
RLHF — train a reward model on human preferences, then optimize the LLM against it using PPO
DPO — Direct Preference Optimization skips the reward model, learns preferences directly
Constitutional AI — the model critiques and revises its own outputs using a set of principles
Why you can't just "retrain" for your use case
When customers say "we want to train a model on our data," what they usually need is one of three things — and none of them require pre-training from scratch.
Pre-training from scratch
Cost: $50-100M+ in compute. Timeline: months. Data: trillions of tokens. This is what OpenAI, Google, and Meta do. Unless you're building a foundation model company, this is not your path. The resulting model wouldn't even be as good as existing ones without the research team to match.
Fine-tuning (what most people need)
Take an existing model, train it on your domain-specific data. Cost: $100-$10K. Timeline: hours. Data: thousands of examples. The model learns your style, terminology, and task patterns while retaining its general capabilities. See the Fine-Tuning & LoRA page for details.
RAG (often the best choice)
Don't change the model at all. Instead, retrieve relevant documents at query time and inject them into the prompt. Cost: near zero for the model. The model reasons over your fresh data without any training. Updates are instant — just update the document store.
Prompt engineering (start here)
Before any of the above: write better prompts. System prompts, few-shot examples, chain-of-thought instructions. This is free, instant, and often sufficient. Many "the model doesn't work for our use case" problems are actually prompt engineering problems.