What is RAG — retrieval-augmented generation
Ground LLM Outputs in Real Data
RAG is a technique that retrieves relevant documents and injects them into the prompt before the model generates a response. Instead of relying solely on what the model memorized during training, RAG gives it fresh, domain-specific context at inference time.
The result: fewer hallucinations, up-to-date answers, and domain expertise without fine-tuning. The model becomes a reasoning engine that operates over your data, not a static knowledge base.
The problem — LLMs hallucinate when they don't know
Without RAG
Ask an LLM about your company's internal API, last week's incident report, or a specific customer's configuration, and it will often generate confident, plausible-sounding text that is fabricated. The model has no access to your data; it can only pattern-match against its training corpus.
Q: "What is our SLA for the Acme Corp property?"
A: "Your SLA is 99.9% uptime with..." [fabricated]
The model has never seen your SLA documents. It's generating statistically plausible text, not factual answers.
With RAG
The same question triggers a vector search against your document store. The actual SLA document is retrieved and injected into the prompt. Now the model reasons over real data — it cites specific clauses and provides accurate figures.
Q: "What is our SLA for the Acme Corp property?"
[Retrieved: acme-sla-2024.pdf, section 3.2]
A: "Per section 3.2, the SLA guarantees 99.95%..."
The answer is grounded in the actual document. The model can cite its source.
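A minimal sketch of the "inject into the prompt" step, assuming retrieval has already returned a list of (source, text) pairs. The function name `build_prompt`, the instruction wording, and the placeholder chunk text are illustrative assumptions, not part of any particular library.

```python
def build_prompt(question, retrieved):
    """Assemble a grounded prompt from (source_label, chunk_text) pairs."""
    # Number each chunk so the model can cite it as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] ({source}) {text}"
        for i, (source, text) in enumerate(retrieved, start=1)
    )
    return (
        "Answer using only the context below and cite sources by number. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical retrieval result; the chunk text is a placeholder, not real SLA wording.
retrieved = [("acme-sla-2024.pdf, section 3.2", "<text of section 3.2>")]
prompt = build_prompt("What is our SLA for the Acme Corp property?", retrieved)
print(prompt)
```

Keeping the source label next to each chunk is what lets the answer point back to a specific document and section.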
The RAG pipeline — from documents to answers
End-to-End Flow
1. Chunk Documents
Split source documents into overlapping chunks of 256-1024 tokens. Overlap (typically 10-20% of the chunk size) reduces the risk of losing context at chunk boundaries. Each chunk becomes a retrieval unit.
2. Embed Chunks
Run each chunk through an embedding model (e.g., OpenAI text-embedding-3-small) to produce a dense vector of 1536 floats. This vector captures the semantic meaning — not keywords, but concepts.
3. Store in Vector Database
Index the vectors in a database using an HNSW or IVF index for fast approximate nearest neighbor search. HarperDB 4.6+ supports native vector columns with HNSW indexing. Each record stores the vector alongside the original text.
4. Embed the Query
When a user asks a question, embed it with the same model. The query vector now lives in the same geometric space as the document vectors — semantically similar content is nearby.
5. Similarity Search (top-k)
Find the k nearest vectors to the query vector using cosine similarity. Typical k = 3-10. An HNSW index makes this roughly O(log n) instead of O(n), so millions of vectors can be searched in milliseconds.
6. Generate with Context
Inject the retrieved chunks into the LLM's prompt as context. The model generates its response grounded in the actual retrieved documents — not its training data. Include source references for traceability.
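The chunking and top-k steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions: `toy_embed` is a word-count stand-in for a real embedding model (it matches keywords, not concepts), the search is brute force rather than an HNSW index, and the sample chunks are made up.

```python
import math
import re
from collections import Counter

def chunk_tokens(tokens, size=256, overlap=32):
    """Split a token list into chunks of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks

def toy_embed(text):
    """Stand-in for an embedding model: a sparse vector of word counts."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, chunks, k=3):
    """Score every chunk against the query and return the k best (score, chunk) pairs."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, toy_embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Step 1: overlapping chunks (integers stand in for tokens).
print(chunk_tokens(list(range(10)), size=4, overlap=1))

# Steps 4-5: embed the query and retrieve the closest chunk (made-up sample text).
sample_chunks = [
    "The SLA guarantees uptime for the hosted service",
    "Support hours are Monday to Friday",
    "Invoices are sent at the start of each month",
]
print(top_k("What uptime does the SLA guarantee?", sample_chunks, k=1))
```

In a real pipeline, `toy_embed` would be replaced by a call to the embedding model and `top_k` by a query against the vector index; the chunking logic carries over unchanged.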
Vector embeddings — text as geometry
How Embeddings Work
An embedding model converts text into a fixed-length array of floating-point numbers — a vector. Texts with similar meaning produce vectors that are close together in this high-dimensional space. "The cat sat on the mat" and "A feline rested on the rug" have very similar vectors despite sharing few words.
"power play percentage" → [0.023, -0.118, 0.445, ...]
"PP conversion rate" → [0.019, -0.112, 0.451, ...]
"goaltender save pct" → [-0.331, 0.087, 0.102, ...]
The first two vectors are close (similar topic). The third is far away (different topic). The same effect holds for synonyms and paraphrases, and for other languages when the embedding model is multilingual.
Similarity Measurement
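Cosine similarity measures the angle between two vectors, ignoring magnitude. A similarity of 1.0 means identical direction (effectively the same meaning). 0.0 means orthogonal (unrelated). Negative values mean the vectors point in opposing directions; with text embeddings these are uncommon and do not reliably signal opposite meanings.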
cosine_sim(A, B) = (A . B) / (|A| * |B|)
sim("power play", "PP%") = 0.91
sim("power play", "goalie") = 0.34
sim("power play", "stock market") = 0.08
Embedding dimensions: 1536 (OpenAI text-embedding-3-small), 3072 (OpenAI text-embedding-3-large), 384 (all-MiniLM-L6-v2). More dimensions generally mean more nuance, but also more compute and storage.
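The cosine formula above translates directly into code. The sketch below applies it to the three-component prefixes shown in the embedding example; since the real vectors have hundreds of further components that are elided in the text, the printed scores illustrate the mechanics only and will not reproduce the 0.91 / 0.34 / 0.08 figures.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Only the first three components from the text; the rest are not shown there.
power_play = [0.023, -0.118, 0.445]
pp_rate = [0.019, -0.112, 0.451]
save_pct = [-0.331, 0.087, 0.102]

print(cosine_sim(power_play, pp_rate))   # close to 1: nearly the same direction
print(cosine_sim(power_play, save_pct))  # much lower: different direction
```

Because the result depends only on direction, scaling a vector by any positive constant leaves its similarity to every other vector unchanged.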
HNSW index — approximate nearest neighbor at scale
Hierarchical Navigable Small World
Brute-force vector search (compare query to every stored vector) is O(n) — too slow for millions of documents. HNSW builds a multi-layer graph where each layer is progressively sparser. Search starts at the top (few nodes, long-range connections) and descends to the bottom (all nodes, short-range connections).
Layer 2: 3 nodes, coarse navigation
Layer 1: 6 nodes, medium range
Layer 0: 12 nodes, all vectors, precise
Brute force: O(n), compare the query to every vector
HNSW: roughly O(log n), navigate the graph hierarchy
1M vectors: brute force = 1M comparisons. HNSW = ~200 comparisons, depending on index parameters such as M and ef. With suitable settings, Recall@10 typically exceeds 95%.
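The layered descent can be illustrated with a deliberately simplified sketch. It is not a full HNSW implementation: each layer's graph is built by brute force instead of incremental insertion, and the search keeps a single current node rather than the candidate list (ef) that real HNSW uses, so it can stop at a local minimum. The function names, the 2-D random points, and the parameter values are assumptions chosen for illustration.

```python
import math
import random

def build_layers(points, m=6, seed=0):
    """Build one m-nearest-neighbor graph per layer (brute force, for clarity)."""
    rng = random.Random(seed)
    level_scale = 1.0 / math.log(m)
    # Exponentially decaying level assignment: most nodes live only on layer 0.
    levels = [int(-math.log(1.0 - rng.random()) * level_scale) for _ in points]
    layers = []
    for layer in range(max(levels) + 1):
        members = [i for i, level in enumerate(levels) if level >= layer]
        graph = {}
        for i in members:
            others = [j for j in members if j != i]
            others.sort(key=lambda j: math.dist(points[i], points[j]))
            graph[i] = others[:m]
        layers.append(graph)
    return layers

def greedy_search(points, layers, query):
    """Descend from the sparsest layer, greedily moving toward the query."""
    top = len(layers) - 1
    current = next(iter(layers[top]))  # entry point: any node on the top layer
    best = math.dist(points[current], query)
    comparisons = 1
    for layer in range(top, -1, -1):
        improved = True
        while improved:
            improved = False
            # Move to the closest neighbor if it is nearer than the current node.
            for neighbor in layers[layer][current]:
                d = math.dist(points[neighbor], query)
                comparisons += 1
                if d < best:
                    current, best, improved = neighbor, d, True
    return current, best, comparisons

rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(500)]
layers = build_layers(points)
query = (0.5, 0.5)
found, found_dist, comparisons = greedy_search(points, layers, query)
exact_dist = min(math.dist(p, query) for p in points)
print(f"layers={len(layers)} comparisons={comparisons} of {len(points)}")
print(f"greedy distance={found_dist:.4f} exact distance={exact_dist:.4f}")
```

The upper layers cover long distances in a few hops, so most of the distance computations happen near the query on layer 0. Production indexes trade a few more comparisons for higher recall by tracking several candidates at once.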
RAG vs fine-tuning — when to use which
| Dimension | RAG | Fine-Tuning |
| --- | --- | --- |
| Cost | Low. Embed documents once (~$0.02 per 1M tokens). No GPU training required. | High. Requires GPU hours for training. Re-train when data changes. |
| Data freshness | Real-time. New documents are searchable immediately after embedding. | Stale. Knowledge frozen at training time. Updates require re-training. |
| Transparency | High. You can see exactly which documents were retrieved and cited. | Low. Knowledge is baked into weights. No clear attribution. |
| Best for | Domain Q&A, documentation search, knowledge bases, support bots. | Style adaptation, task-specific behavior, classification patterns. |
| Scalability | Scales to millions of documents. HNSW search remains fast. | Limited by training data size. Catastrophic forgetting at scale. |
| Accuracy | Depends on retrieval quality. Good chunking + embeddings = high accuracy. | Can achieve higher accuracy on narrow tasks with quality training data. |
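The cost row can be turned into a quick estimate. The sketch below uses the ~$0.02 per 1M tokens figure quoted in the table as its default; the corpus size and average document length are made-up inputs, and actual pricing should be checked with the embedding provider.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.02  # figure quoted in the table; verify current pricing

def embedding_cost_usd(num_documents, avg_tokens_per_document,
                       price_per_million=PRICE_PER_MILLION_TOKENS_USD):
    """Estimate the one-time cost of embedding a corpus."""
    total_tokens = num_documents * avg_tokens_per_document
    return total_tokens / 1_000_000 * price_per_million

# Hypothetical corpus: 10,000 documents averaging 2,000 tokens each (20M tokens).
print(f"${embedding_cost_usd(10_000, 2_000):.2f}")  # prints $0.40
```

Chunk overlap adds to the token count, so a 10-20% overlap raises this estimate by roughly the same proportion.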
The precision connection — vector search bridges the gap
RAG Bridges Precision and Performance
Vector embeddings are typically stored and searched at FP32 precision for high fidelity. The LLM generating the response can run at INT4 precision for speed and a smaller memory footprint. RAG decouples retrieval accuracy from generation precision.
Retrieval: FP32
Embedding vectors use full 32-bit precision. Ranking depends on small differences between similarity scores: whether a chunk scores 0.91 or 0.89 can decide which document is retrieved. Vector search typically runs on CPU, and the vectors are small compared with model weights, so keeping them in FP32 is cheap.
Generation: INT4
The LLM receives high-quality retrieved context in its prompt. Even at INT4 precision, the model reasons well over explicit context, because it is reading rather than recalling from memory. On domain tasks, a 7B INT4 model with good RAG can outperform a 70B model without RAG.
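To make "INT4" concrete, here is a minimal sketch of symmetric 4-bit quantization: each value is mapped to one of 16 integer levels in [-8, 7] and scaled back on use. Real LLM quantizers typically use per-group scales and calibrated schemes, so this single-scale version is an assumption chosen for simplicity, and the sample values are made up.

```python
def quantize_int4(values):
    """Symmetric quantization to signed 4-bit integers in [-8, 7] with one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs else 1.0
    quantized = [max(-8, min(7, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the 4-bit integers back to approximate floating-point values."""
    return [q * scale for q in quantized]

weights = [0.7, -0.31, 0.12, 0.0]  # made-up FP32 values for illustration
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print(q)            # the 4-bit integer codes
print(restored)     # approximate values after the round trip
print(max(errors))  # rounding error, at most half a quantization step
```

The round trip loses fine detail, which is why the section keeps retrieval scores in FP32 while tolerating coarser weights on the generation side.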