How Models Are Trained — FTS MCP Server

The training pipeline — from raw data to deployed model

Training a large language model is a multi-stage process that transforms raw internet text into a system capable of generating coherent, useful responses. Each stage serves a distinct purpose and operates at a different scale.

Stage 1

Data Collection

→

Stage 2

Preprocessing

→

Stage 3

Tokenization

→

Stage 4

Training Loop

→

Stage 5

Evaluation

→

Stage 6

Deployment

Data Collection & Preprocessing

Web crawls (Common Crawl), books, code repositories, scientific papers, and curated datasets. Raw data is cleaned: deduplication, language filtering, quality scoring, PII removal, and toxic content filtering.

Common Crawl: ~250B pages crawled
After filtering: ~5-10% survives quality gates
Data quality matters more than quantity. Garbage in, garbage out — at trillion-token scale.

Tokenization

Text is split into subword tokens using algorithms like BPE (Byte-Pair Encoding) or SentencePiece. The model never sees raw characters — it operates on integer token IDs mapped to learned embedding vectors.

"Hello world" → [15496, 995]
Vocab size: 32K–128K tokens
~1 token ≈ 0.75 English words
Tokenization determines the model's vocabulary and directly impacts context length efficiency.

Inside a single training step — the four operations

Every training step processes a batch of sequences through the same four operations. A model like GPT-4 executes millions of these steps during pre-training, each one nudging billions of weights toward better predictions.

1. Forward Pass

Input tokens flow through every layer of the network. Each layer applies its weights via matrix multiplication, adds biases, and passes results through activation functions. The output is a probability distribution over all possible next tokens.

2. Loss Calculation

Compare the model's predicted token probabilities against the actual next tokens. Cross-entropy loss measures how surprised the model is by the correct answer. Lower loss = better predictions.

3. Backpropagation

The loss signal flows backward through every layer. The chain rule of calculus computes how much each weight contributed to the error. This produces a gradient for every one of the billions of parameters — a direction to move each weight to reduce loss.

4. Optimizer Step

The optimizer (typically AdamW) uses the gradients to update weights. It maintains momentum and adaptive learning rates per parameter. The learning rate schedule controls step size — too large causes instability, too small causes slow convergence.

Cross-entropy loss: L = -∑ y_i log(p_i)
Gradient: ∇W = ∂L / ∂W (computed via chain rule through all layers)
Weight update: W_new = W_old - η · ∇W (simplified; Adam adds momentum + adaptive rates)
Every weight in the model is updated simultaneously. GPT-3 has 175B weights; each step updates all of them.

Neural network layers — weights, activations, gradients

x₁

x₂

x₃

Input
Embeddings

W weights

↓ gradients flow back

h₁

h₂

h₃

h₄

Hidden
Layer

W weights

↓ gradients flow back

y₁

y₂

Output
Logits

Forward: data flows left → right Backward: gradients flow right → left

Real transformers have 32-128 layers, each with multi-head attention + feed-forward sublayers. Same principle, vastly more weights.

Loss functions — how the model measures its own mistakes

Cross-Entropy Loss

The standard loss function for language models. For each position in the sequence, the model outputs a probability distribution over all tokens in the vocabulary. Cross-entropy measures how far that distribution is from the true answer (the actual next token).

L = -log(p_correct)
If the model assigns probability 0.9 to the correct token: L = -log(0.9) = 0.105 (low loss, good prediction). If it assigns 0.01: L = -log(0.01) = 4.6 (high loss, bad prediction).

Perplexity

Perplexity is the exponential of the average cross-entropy loss. It represents how many tokens the model is "choosing between" at each step. A perplexity of 15 means the model is, on average, as uncertain as if it were picking uniformly from 15 options.

PPL = e^L = 2^H
GPT-3 perplexity on test data: ~20
GPT-4 class perplexity: ~8-12
Lower perplexity = better model. Human-level perplexity on most English text is approximately 10-20.

Training data and compute — the scale of modern models

300B

GPT-3 tokens

175B params, 2020

Llama 2 tokens

70B params, 2023

15T

Llama 3 tokens

405B params, 2024

$100M+

GPT-4 class compute

~25,000 A100 GPUs

Chinchilla Scaling Laws

DeepMind's Chinchilla paper (2022) showed that most models were undertrained relative to their size. The optimal ratio is roughly 20 tokens per parameter. A 70B model should ideally be trained on 1.4T tokens. Modern models exceed this significantly — Llama 3 used 15T tokens for a 405B model (~37 tokens per parameter).

Chinchilla optimal: tokens ≈ 20 × parameters
Llama 3 (405B): 15T / 405B = 37× — over-trained by Chinchilla standards
Over-training beyond Chinchilla ratios improves inference efficiency: the model is smaller than "optimal" for its data budget, making it cheaper to run.

Pre-training vs post-training — two distinct phases

Pre-training

The model learns language patterns by predicting the next token over trillions of tokens. This is unsupervised — no human labels, just raw text. The model learns grammar, facts, reasoning patterns, and code from the statistical structure of the data.

Objective: next-token prediction (autoregressive)

Data: 10-15T tokens of web, books, code

Compute: months on thousands of GPUs

Result: a "base model" that can complete text but doesn't follow instructions

Post-training

The base model is refined to follow instructions, be helpful, and avoid harmful outputs. This uses much less data (thousands to millions of examples) but requires expensive human annotation and careful reward modeling.

SFT (Supervised Fine-Tuning) — train on human-written instruction/response pairs

RLHF — train a reward model on human preferences, then optimize the LLM against it using PPO

DPO — Direct Preference Optimization skips the reward model, learns preferences directly

Constitutional AI — the model critiques and revises its own outputs using a set of principles

Why you can't just "retrain" for your use case

When customers say "we want to train a model on our data," what they usually need is one of three things — and none of them require pre-training from scratch.

Pre-training from scratch

Cost: $50-100M+ in compute. Timeline: months. Data: trillions of tokens. This is what OpenAI, Google, and Meta do. Unless you're building a foundation model company, this is not your path. The resulting model wouldn't even be as good as existing ones without the research team to match.

Fine-tuning (what most people need)

Take an existing model, train it on your domain-specific data. Cost: $100-$10K. Timeline: hours. Data: thousands of examples. The model learns your style, terminology, and task patterns while retaining its general capabilities. See the Fine-Tuning & LoRA page for details.

RAG (often the best choice)

Don't change the model at all. Instead, retrieve relevant documents at query time and inject them into the prompt. Cost: near zero for the model. The model reasons over your fresh data without any training. Updates are instant — just update the document store.

Prompt engineering (start here)

Before any of the above: write better prompts. System prompts, few-shot examples, chain-of-thought instructions. This is free, instant, and often sufficient. Many "the model doesn't work for our use case" problems are actually prompt engineering problems.