Tokenization — FTS MCP Server

What is tokenization?

Before a language model can process text, it must convert raw characters into tokens — integer IDs that map to entries in a fixed vocabulary. Tokenization is the bridge between human-readable text and the numerical tensors a neural network operates on.

Every prompt you send and every response you receive is first split into tokens. The model never sees raw text — it sees sequences of token IDs, processes them through its transformer layers, and predicts the next token ID in the sequence.

Why subword tokenization?

Character-level

Vocabulary is tiny (~256 entries), but sequences become extremely long. The word "tokenization" becomes 12 separate tokens. Models struggle to learn word-level meaning from individual characters, and attention cost scales quadratically with sequence length.

"hello" → [h] [e] [l] [l] [o]
5 tokens for a 5-letter word — inefficient for long text
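
As a rough illustration of the character-level approach, the sketch below treats each UTF-8 byte as one token, which gives the 256-entry vocabulary mentioned above. The function names are hypothetical and chosen only for this example.

```python
def byte_tokenize(text: str) -> list[int]:
    """Map text to one token ID per UTF-8 byte (a 256-entry vocabulary)."""
    return list(text.encode("utf-8"))


def byte_detokenize(ids: list[int]) -> str:
    """Invert byte_tokenize by decoding the byte values back to text."""
    return bytes(ids).decode("utf-8")


hello_ids = byte_tokenize("hello")
print(hello_ids)                            # [104, 101, 108, 108, 111]
print(len(byte_tokenize("tokenization")))   # 12
print(byte_detokenize(hello_ids))           # hello
```

Nothing is ever out of vocabulary here, but the sequence is as long as the text itself.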

Word-level

Each word is one token, but vocabulary must be enormous to cover all words. Any word not in the vocabulary becomes <UNK>. Misspellings, neologisms, compound words, and morphological variants all fail. Multilingual support is impractical.

"unhappiness" → [<UNK>]
Out-of-vocabulary words are lost entirely
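
The out-of-vocabulary failure can be sketched with a toy word-level tokenizer. The training sentence, the helper names, and the whitespace splitting are assumptions made for illustration; real word-level tokenizers handle punctuation and casing more carefully.

```python
UNK = "<UNK>"


def build_vocab(corpus: str) -> dict[str, int]:
    """Assign one ID per distinct lowercase word; ID 0 is reserved for <UNK>."""
    vocab = {UNK: 0}
    for word in corpus.lower().split():
        vocab.setdefault(word, len(vocab))
    return vocab


def word_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Look up each word; anything unseen collapses to the <UNK> ID."""
    return [vocab.get(word, vocab[UNK]) for word in text.lower().split()]


vocab = build_vocab("the cat is happy and the dog is unhappy")
ids = word_tokenize("the cat feels unhappiness", vocab)
print(ids)  # [1, 2, 0, 0]
```

Both "feels" and "unhappiness" map to the same ID, so the model cannot tell them apart, even though "unhappy" was in the training text.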

Subword tokenization — the sweet spot

Subword methods like Byte Pair Encoding (BPE) split text into pieces that balance vocabulary size against sequence length. Common words stay as single tokens; rare words decompose into recognizable subwords. When the base vocabulary covers every byte, this gives models a compact vocabulary that can represent any input without <UNK> tokens.

Byte Pair Encoding (BPE) — building the vocabulary

BPE starts with individual bytes (or characters) and iteratively merges the most frequent adjacent pair into a new token. After thousands of merges, the vocabulary contains a mix of single characters, common subwords, and frequent whole words.

Training corpus: "low lower lowest lowly"

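Step	Merge	Corpus after merge
init		l o w </w> l o w e r </w> l o w e s t </w> l o w l y </w>
1	(l, o) → lo	lo w </w> lo w e r </w> lo w e s t </w> lo w l y </w>
2	(lo, w) → low	low </w> low e r </w> low e s t </w> low l y </w>
3	(low, e) → lowe	low </w> lowe r </w> lowe s t </w> low l y </w>
4	(low, </w>) → low</w>	low</w> lowe r </w> lowe s t </w> low l y </w>

At step 3, (low, e) occurs twice while every other pair occurs once, so it is merged next. Ties, such as (l, o) versus (o, w) at step 1 and all remaining pairs at step 4, are broken here by first occurrence.

Vocabulary after training: { l, o, w, e, r, s, t, y, </w>, lo, low, lowe, low</w>, ... }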
Each merge adds one entry. GPT-4 runs ~100K merges to build its full vocabulary.
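
The merge loop can be sketched in a few lines. This is a minimal illustration on the same toy corpus, not a production trainer: it counts pairs inside each word only, treats every word as occurring once, and breaks ties by first occurrence (an assumption; real implementations differ in tie-breaking). The function names are hypothetical.

```python
from collections import Counter


def get_pair_counts(words: list[list[str]]) -> Counter:
    """Count adjacent symbol pairs inside each word (never across words)."""
    counts = Counter()
    for symbols in words:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += 1
    return counts


def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every occurrence of `pair` in one word with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(corpus: str, num_merges: int):
    """Return the learned merges and the corpus after applying them."""
    words = [list(word) + ["</w>"] for word in corpus.split()]
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        # max() returns the first pair with the highest count, so ties go
        # to the pair that was seen first (an assumption of this sketch).
        best = max(counts, key=counts.get)
        merges.append(best)
        words = [merge_pair(w, best) for w in words]
    return merges, words


merges, words = train_bpe("low lower lowest lowly", num_merges=2)
print(merges)    # [('l', 'o'), ('lo', 'w')]
print(words[1])  # ['low', 'e', 'r', '</w>']
```

Raising `num_merges` grows the vocabulary by one entry per iteration, which is the knob that sets the final vocabulary size.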

Visualizing tokenization — common vs. rare words

Common sentence — mostly whole-word tokens

Token	ID
The	791
quick	4996
brown	14198
fox	39935
jumps	35308
over	927
the	279
lazy	16053
dog	5765

9 words → 9 tokens. Common English words are single tokens in most vocabularies.

Rare word — decomposed into subword pieces

Token	ID
anti	3276
dis	2251
establish	63117
ment	478
arian	10441
ism	1601

1 word → 6 tokens. The vocabulary has no single token for this word, but the model recognizes each subword piece and can reason about the word's meaning compositionally.

Vocabulary sizes across models

Model	Tokenizer	Vocab Size
GPT-4	cl100k_base	100,256
GPT-4o	o200k_base	200,019
Claude 3.5	Anthropic BPE	~100,000
Llama 3	tiktoken-based BPE	128,256
Llama 2	SentencePiece BPE	32,000
Mistral 7B	SentencePiece BPE	32,000

Larger vocabularies produce shorter token sequences (more whole words become single tokens) but require larger embedding matrices, increasing model size. The optimal vocabulary size balances compression efficiency against memory cost.
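
To make the memory side of that trade-off concrete, the sketch below estimates the size of a token embedding matrix as vocabulary size times hidden size. The hidden size of 4096 and the 2 bytes per parameter (FP16) are assumptions for illustration, not figures from the table, and models with a separate output projection pay this cost twice.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """One embedding vector of length hidden_size per vocabulary entry."""
    return vocab_size * hidden_size


def embedding_megabytes(vocab_size: int, hidden_size: int,
                        bytes_per_param: int = 2) -> float:
    """Memory for the embedding matrix, in MB (10**6 bytes)."""
    return embedding_params(vocab_size, hidden_size) * bytes_per_param / 1e6


HIDDEN_SIZE = 4096  # assumed hidden size, for illustration only

small = embedding_params(32_000, HIDDEN_SIZE)
large = embedding_params(128_256, HIDDEN_SIZE)
print(small)  # 131072000
print(large)  # 525336576
print(round(embedding_megabytes(32_000, HIDDEN_SIZE)))   # 262
print(round(embedding_megabytes(128_256, HIDDEN_SIZE)))  # 1051
```

Under these assumptions, moving from a 32,000-entry to a 128,256-entry vocabulary roughly quadruples the embedding matrix.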

Why tokenization matters for inference

Context Window = Token Limit

When a model advertises a "128K context window," that is 128K tokens, not words or characters. English text averages ~0.75 words per token (or ~4 characters per token). A 128K-token window holds roughly 96K words — about 350 pages of text.
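
The arithmetic behind that estimate, as a small sketch. The 0.75 words-per-token figure comes from the text above; the 275 words-per-page figure is an assumption chosen to match the "about 350 pages" estimate.

```python
WORDS_PER_TOKEN = 0.75  # English average quoted above
WORDS_PER_PAGE = 275    # assumed page density, for illustration only


def tokens_to_words(tokens: int) -> float:
    """Approximate English word count for a given token count."""
    return tokens * WORDS_PER_TOKEN


def tokens_to_pages(tokens: int) -> float:
    """Approximate page count for a given token count."""
    return tokens_to_words(tokens) / WORDS_PER_PAGE


words_in_window = tokens_to_words(128_000)
pages_in_window = tokens_to_pages(128_000)
print(words_in_window)         # 96000.0
print(round(pages_in_window))  # 349
```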

API Cost Is Per-Token

Cloud API pricing is billed per input and output token. The same English sentence costs fewer tokens than its Japanese translation. Verbose prompts cost more. Efficient prompt engineering is really efficient token engineering.
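
A per-request cost estimate follows directly from the token counts. The prices in this sketch are hypothetical placeholders, not any provider's actual rates; substitute the figures from your provider's price list.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_million_input: float,
                 usd_per_million_output: float) -> float:
    """Cost in USD for one request, given per-million-token prices."""
    total = (input_tokens * usd_per_million_input
             + output_tokens * usd_per_million_output)
    return total / 1_000_000


# Hypothetical prices: 2.50 USD per 1M input tokens, 10.00 USD per 1M output.
cost = request_cost(1200, 300, 2.50, 10.00)
print(f"{cost:.4f} USD")  # 0.0060 USD
```

Because output tokens are usually priced higher than input tokens, trimming verbose responses often saves more than trimming the prompt.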

Non-English Text Uses More Tokens

BPE vocabularies are typically trained on English-heavy corpora. Japanese, Chinese, Korean, Arabic, and other non-Latin scripts often require 2–4x more tokens than English for equivalent content. This effectively shrinks the usable context window for non-English tasks.

Code Tokenizes Differently

Source code contains symbols, indentation, and naming conventions that tokenize unpredictably. getUserById might become 3 tokens (get, User, ById), while x is always 1. GPT-4o's larger vocabulary improves code tokenization significantly.

Token count comparison — same meaning, different cost

Approximate GPT-4 token counts for equivalent content

Input	Text	Tokens	Ratio
English	"The cat sat on the mat."	7	1.0x
Spanish	"El gato se sentó en la alfombra."	10	1.4x
Japanese	"猫がマットの上に座った。"	14	2.0x
Python	def greet(name): return f"Hello {name}"	12	1.7x
Arabic	"جلست القطة على الحصيرة."	16	2.3x

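In these examples, the non-English sentences use 1.4x to 2.3x as many tokens as the English sentence for the same meaning, and even a one-line Python function costs more tokens than the English sentence. This has direct implications for cost, latency, and effective context window size.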
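
The ratio column and its effect on usable context can be recomputed from the token counts in the table. This sketch takes those approximate counts as given; the "English-equivalent" window is a rough illustration that assumes the same ratio holds across a whole document.

```python
# Approximate token counts copied from the table above.
TOKEN_COUNTS = {"English": 7, "Spanish": 10, "Japanese": 14,
                "Python": 12, "Arabic": 16}


def ratio_to_baseline(counts: dict[str, int], baseline: str = "English") -> dict[str, float]:
    """Token count of each input relative to the baseline input."""
    base = counts[baseline]
    return {name: n / base for name, n in counts.items()}


def english_equivalent_window(window_tokens: int, count: int, base_count: int) -> int:
    """How much English-sized content fits when each unit costs `count` tokens."""
    return window_tokens * base_count // count


ratios = ratio_to_baseline(TOKEN_COUNTS)
for name, count in TOKEN_COUNTS.items():
    window = english_equivalent_window(128_000, count, TOKEN_COUNTS["English"])
    print(f"{name}: {ratios[name]:.1f}x, 128K window holds ~{window:,} English-equivalent tokens")
```

With these counts, a 128K window behaves like a 64K window for the Japanese example and a 56K window for the Arabic one.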

Special tokens — control signals for the model

Beyond text tokens, every model uses reserved special tokens that control structure and behavior. Ordinary input text never encodes to these IDs; the tokenizer or chat template inserts them to mark boundaries, roles, and padding.

<|bos|> <|eos|> <|pad|> <|system|> <|user|> <|assistant|> <|endofturn|>

Boundary tokens

<|bos|> (beginning of sequence) and <|eos|> (end of sequence) tell the model where input starts and where generation should stop. If the model never emits EOS, generation continues until a maximum-token limit cuts it off.

Role markers

<|system|>, <|user|>, and <|assistant|> structure multi-turn conversations. The model uses these to distinguish instructions from user queries from its own prior responses. These are part of the chat template format.
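
A chat template is essentially string assembly around these markers. The sketch below uses the generic token names listed above in a hypothetical layout; every real model defines its own template, so treat this as an illustration of the idea rather than any model's actual format.

```python
ROLES = {"system", "user", "assistant"}


def render_chat(messages: list[dict[str, str]],
                add_generation_prompt: bool = True) -> str:
    """Wrap each message in role and end-of-turn markers (hypothetical layout)."""
    parts = ["<|bos|>"]
    for message in messages:
        role = message["role"]
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        parts.append(f"<|{role}|>{message['content']}<|endofturn|>")
    if add_generation_prompt:
        # An open assistant marker cues the model to write the next turn.
        parts.append("<|assistant|>")
    return "".join(parts)


prompt = render_chat([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
])
print(prompt)
# <|bos|><|system|>Be brief.<|endofturn|><|user|>Hi<|endofturn|><|assistant|>
```

In a real pipeline the markers are mapped to their reserved IDs directly, so user text that happens to contain the same characters is not treated as a control signal.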

Connection to edge inference

Why tokenization is critical at the edge

Edge-deployed models (quantized to INT4/INT8, running on limited hardware) typically have smaller context windows — often 2K–8K tokens instead of 128K. When your effective context is small, every token counts.

Prompt Compression

Shorter prompts leave more room for generation. Tokenizer-aware prompt engineering — using words that tokenize efficiently — can reduce token count by 15–25% without changing meaning.
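
A quick sketch of what that saving buys on a small context window. The 2,048-token window, the 700-token prompt, and the 20% saving are illustrative values (the saving is picked from the 15–25% range above), and the function names are hypothetical.

```python
def generation_budget(context_window: int, prompt_tokens: int) -> int:
    """Tokens left for the response once the prompt is in the context."""
    return max(context_window - prompt_tokens, 0)


def compressed_length(prompt_tokens: int, savings: float) -> int:
    """Prompt length after removing a fraction `savings` of its tokens."""
    return round(prompt_tokens * (1 - savings))


before = generation_budget(2048, 700)
after = generation_budget(2048, compressed_length(700, 0.20))
print(before)  # 1348
print(after)   # 1488
```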

KV Cache Pressure

Each token in the context requires key-value cache memory during generation. On memory-constrained edge devices, a shorter prompt occupies less of the KV cache, leaving more memory for generated tokens and enabling longer responses.
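
The per-token cost can be estimated with the usual back-of-the-envelope formula: one key and one value vector per layer and per KV head. The model shape below (32 layers, 8 KV heads, head dimension 128, FP16 cache) is a hypothetical configuration chosen for illustration.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: keys plus values, for every layer and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Hypothetical model shape, for illustration only.
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **SHAPE)
saved = kv_cache_bytes(700, **SHAPE) - kv_cache_bytes(500, **SHAPE)
print(per_token)             # 131072 bytes, i.e. 128 KiB per token
print(saved // (1024 ** 2))  # 25 MiB freed by trimming 200 prompt tokens
```

Quantizing the cache to fewer bytes per value shrinks these numbers proportionally.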

Latency Per Token

Time-to-first-token is dominated by prompt processing, which grows roughly in proportion to prompt length. A prompt that tokenizes into 500 tokens instead of 700 is about 29% shorter, so it can reduce time-to-first-token by roughly 30% on edge hardware where every millisecond matters.
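
Under a simple linear model of prompt processing, the saving works out as follows. The prefill throughput of 100 tokens per second is a hypothetical figure for a small edge device; the relative reduction does not depend on it.

```python
def time_to_first_token(prompt_tokens: int, prefill_tokens_per_second: float) -> float:
    """Seconds spent processing the prompt, assuming linear prefill cost."""
    return prompt_tokens / prefill_tokens_per_second


PREFILL_RATE = 100.0  # hypothetical tokens per second

slow = time_to_first_token(700, PREFILL_RATE)
fast = time_to_first_token(500, PREFILL_RATE)
reduction = 1 - fast / slow
print(slow, fast)          # 7.0 5.0
print(f"{reduction:.1%}")  # 28.6%
```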

How Tokenization Works