FTS MCP Server

Field Technology Services — AI-Orchestrated Akamai Operations


How Tokenization Works

BPE · Subword Encoding · Vocabulary
Context Windows · API Pricing
What is tokenization?

Before a language model can process text, it must convert raw characters into tokens — integer IDs that map to entries in a fixed vocabulary. Tokenization is the bridge between human-readable text and the numerical tensors a neural network operates on.

Every prompt you send is split into tokens before the model processes it, and every response is generated token by token and then decoded back into text. The model never sees raw text: it sees sequences of token IDs, processes them through its transformer layers, and predicts the next token ID in the sequence.
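As a rough illustration of the text-to-ID round trip, here is a minimal sketch. The four-entry VOCAB and the greedy longest-match rule are assumptions made up for this example; real tokenizers use learned vocabularies and merge rules.

```python
# Toy vocabulary (hypothetical): token string -> integer ID.
VOCAB = {"Hello": 0, ",": 1, " world": 2, "!": 3}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def encode(text: str) -> list[int]:
    """Convert text to token IDs using greedy longest-match (toy rule)."""
    ids, pos = [], 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        match = max(
            (tok for tok in VOCAB if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token covers position {pos}")
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


def decode(ids: list[int]) -> str:
    """Convert token IDs back to text by concatenating token strings."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


ids = encode("Hello, world!")
print(ids)
print(decode(ids))
```

The model itself only ever works with the integer list; the strings exist on either side of it.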


Why subword tokenization?
Character-level

Vocabulary is tiny (~256 entries), but sequences become extremely long. The word "tokenization" becomes 12 separate tokens. Models struggle to learn word-level meaning from individual characters, and attention cost scales quadratically with sequence length.

"hello" → [h] [e] [l] [l] [o]
5 tokens for a 5-letter word — inefficient for long text
Word-level

Each word is one token, but vocabulary must be enormous to cover all words. Any word not in the vocabulary becomes <UNK>. Misspellings, neologisms, compound words, and morphological variants all fail. Multilingual support is impractical.

"unhappiness" → [<UNK>]
Out-of-vocabulary words are lost entirely
Subword tokenization — the sweet spot

Subword methods like Byte Pair Encoding (BPE) split text into pieces that balance vocabulary size against sequence length. Common words stay as single tokens; rare words decompose into recognizable subwords. This gives models a compact vocabulary that can represent any input without <UNK> tokens.
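To make the trade-off concrete, the two extremes (character-level and word-level) can be sketched in a few lines. WORD_VOCAB is a made-up vocabulary for illustration only, and splitting on whitespace is a simplification.

```python
# Hypothetical word-level vocabulary, used only for this example.
WORD_VOCAB = {"<UNK>", "hello", "world", "happy"}


def char_tokenize(text: str) -> list[str]:
    """Character-level: one token per character."""
    return list(text)


def word_tokenize(text: str) -> list[str]:
    """Word-level: one token per word; unknown words collapse to <UNK>."""
    return [word if word in WORD_VOCAB else "<UNK>" for word in text.split()]


print(char_tokenize("hello"))
print(len(char_tokenize("tokenization")))
print(word_tokenize("hello unhappiness"))
```

The character-level version never fails but produces long sequences. The word-level version is short but discards any word outside its vocabulary.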


Byte Pair Encoding (BPE) — building the vocabulary

BPE starts with individual bytes (or characters) and iteratively merges the most frequent adjacent pair into a new token. After thousands of merges, the vocabulary contains a mix of single characters, common subwords, and frequent whole words.

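Training corpus: "low lower lowest lowly"

Step  Merge               Corpus after the merge
init  (none)              l o w </w>   l o w e r </w>   l o w e s t </w>   l o w l y </w>
1     (l, o) → lo         lo w </w>   lo w e r </w>   lo w e s t </w>   lo w l y </w>
2     (lo, w) → low       low </w>   low e r </w>   low e s t </w>   low l y </w>
3     (low, e) → lowe     low </w>   lowe r </w>   lowe s t </w>   low l y </w>
4     (lowe, r) → lower   low </w>   lower </w>   lowe s t </w>   low l y </w>

Merge 3 picks (low, e) because it occurs twice (in "lower" and "lowest"), while every other pair occurs once. After merge 3 all remaining pairs occur once, so merge 4 is a tie; (lowe, r) is shown here, and real implementations apply a fixed tie-breaking rule.
Vocabulary after training: { l, o, w, e, r, s, t, y, </w>, lo, low, lowe, lower, ... }
Each merge adds one entry. GPT-4's cl100k_base vocabulary is the result of roughly 100K merges.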
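The merge loop described above can be sketched as follows. This is a minimal character-level version under stated assumptions: each word appears once, `</w>` marks the end of a word, and ties go to the pair that was counted first. Production tokenizers work on bytes, weight words by frequency, and use their own tie-breaking rules.

```python
from collections import Counter


def count_pairs(words: list[list[str]]) -> Counter:
    """Count adjacent symbol pairs across all words."""
    counts = Counter()
    for symbols in words:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += 1
    return counts


def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every occurrence of `pair` in one word with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(corpus: str, num_merges: int):
    """Learn `num_merges` merges; return the merge list and the merged corpus."""
    # Each word becomes a list of characters plus an end-of-word marker.
    words = [list(word) + ["</w>"] for word in corpus.split()]
    merges = []
    for _ in range(num_merges):
        counts = count_pairs(words)
        if not counts:
            break
        # Most frequent pair; ties go to the pair counted first (arbitrary choice).
        best = max(counts, key=counts.get)
        merges.append(best)
        words = [merge_pair(symbols, best) for symbols in words]
    return merges, words


merges, words = train_bpe("low lower lowest lowly", 3)
print(merges)
print(words)
```

On this corpus the third merge is (low, e), which occurs twice (in "lower" and "lowest"). Beyond that point every pair occurs once, so the result depends entirely on the tie-breaking rule.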

Visualizing tokenization — common vs. rare words
Common sentence — mostly whole-word tokens
Token → ID
The → 791
quick → 4996
brown → 14198
fox → 39935
jumps → 35308
over → 927
the → 279
lazy → 16053
dog → 5765

9 words → 9 tokens. Common English words are single tokens in most vocabularies.

Rare word — decomposed into subword pieces
Token → ID
anti → 3276
dis → 2251
establish → 63117
ment → 478
arian → 10441
ism → 1601

1 word → 6 tokens. The vocabulary has no single token for this word, but the model recognizes each subword piece and can reason about the word's meaning compositionally.
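How a word falls apart into known pieces can be sketched by replaying a list of learned merges in order. The MERGES list below is hypothetical and chosen only for illustration; it is not taken from any real tokenizer.

```python
# Hypothetical merge list, in the order the merges were learned.
MERGES = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]


def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split a word into characters, then replay each merge in order."""
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


print(apply_merges("lowest", MERGES))
print(apply_merges("slowest", MERGES))
print(apply_merges("fox", MERGES))
```

A word that shares pieces with the merge list is covered by a few subwords, while a word with nothing in common falls back to single characters. Nothing ever becomes `<UNK>`.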


Vocabulary sizes across models
Model             Tokenizer                   Vocab size (tokens)
GPT-4 / GPT-4o    cl100k_base / o200k_base    100,256 / 200,019
Claude 3.5        Anthropic BPE               ~100,000
Llama 3           tiktoken-based BPE          128,256
Llama 2           SentencePiece BPE           32,000
Mistral 7B        SentencePiece BPE           32,000

Larger vocabularies produce shorter token sequences (more whole words become single tokens) but require larger embedding matrices, increasing model size. The optimal vocabulary size balances compression efficiency against memory cost.
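The memory side of this trade-off is simple arithmetic: the input embedding matrix has one row per vocabulary entry and one column per hidden dimension. The sketch below assumes a hidden size of 4096 and 2 bytes per parameter (16-bit weights). Both values are illustrative assumptions, not figures from the table.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of parameters in a vocab_size x d_model embedding matrix."""
    return vocab_size * d_model


def embedding_megabytes(vocab_size: int, d_model: int, bytes_per_param: int = 2) -> float:
    """Memory footprint of the embedding matrix in megabytes (10^6 bytes)."""
    return embedding_params(vocab_size, d_model) * bytes_per_param / 1e6


D_MODEL = 4096  # assumed hidden size, for illustration only

for vocab_size in (32_000, 128_256):
    params = embedding_params(vocab_size, D_MODEL)
    size_mb = embedding_megabytes(vocab_size, D_MODEL)
    print(f"vocab={vocab_size:>7,}  params={params:>12,}  ~{size_mb:,.0f} MB")
```

Under these assumptions, moving from a 32,000-entry to a 128,256-entry vocabulary roughly quadruples the embedding matrix. Models that do not share input and output embeddings pay this cost twice.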


Why tokenization matters for inference
Context Window = Token Limit
When a model advertises a "128K context window," that is 128K tokens, not words or characters. English text averages ~0.75 words per token (or ~4 characters per token). A 128K-token window holds roughly 96K words — about 350 pages of text.
API Cost Is Per-Token
Cloud APIs bill per input token and per output token. An English sentence usually costs fewer tokens than its Japanese translation, and verbose prompts cost more. Efficient prompt engineering is really efficient token engineering.
Non-English Text Uses More Tokens
BPE vocabularies are typically trained on corpora dominated by English text. Japanese, Chinese, Korean, Arabic, and other scripts often require 2–4x more tokens per word than English. This effectively shrinks the usable context window for non-English tasks.
Code Tokenizes Differently
Source code contains symbols, indentation, and naming conventions that tokenize unpredictably. getUserById might become 3 tokens (get, User, ById), while a short identifier such as x is typically 1. Larger vocabularies such as GPT-4o's o200k_base can tokenize code more compactly.
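The context-window arithmetic above can be sketched as a small budgeting helper. The 0.75 words-per-token ratio comes from the text. The 275 words-per-page figure and the function names are assumptions for illustration, and real counts should come from the model's own tokenizer.

```python
WORDS_PER_TOKEN = 0.75   # rule of thumb for English text
WORDS_PER_PAGE = 275     # assumed page density, for illustration only


def words_that_fit(context_tokens: int) -> int:
    """Approximate number of English words a context window can hold."""
    return int(context_tokens * WORDS_PER_TOKEN)


def pages_that_fit(context_tokens: int) -> int:
    """Approximate number of pages, given the assumed words-per-page."""
    return round(words_that_fit(context_tokens) / WORDS_PER_PAGE)


def fits_in_window(prompt_tokens: int, max_output_tokens: int, context_tokens: int) -> bool:
    """The prompt and the generated output share one token budget."""
    return prompt_tokens + max_output_tokens <= context_tokens


print(words_that_fit(128_000))
print(pages_that_fit(128_000))
print(fits_in_window(prompt_tokens=7_500, max_output_tokens=1_000, context_tokens=8_000))
```

The last check matters most on small windows: a prompt that fits on its own can still leave too little room for the response.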

Token count comparison — same meaning, different cost
Approximate GPT-4 token counts for equivalent content
Input      Text                                          Tokens   Ratio
English    "The cat sat on the mat."                     7        1.0x
Spanish    "El gato se sentó en la alfombra."            10       1.4x
Japanese   "猫がマットの上に座った。"                        14       2.0x
Python     def greet(name): return f"Hello {name}"       12       1.7x
Arabic     "جلست القطة على الحصيرة."                     16       2.3x

In this sample, the non-English sentences and the source code all use more tokens than the English baseline. This has direct implications for cost, latency, and effective context window size.
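The ratios in the comparison, and what they mean for a bill, can be reproduced with a few lines. The token counts are the approximate figures from the comparison above. The price per million tokens is a placeholder assumption, not a real rate for any provider.

```python
# Approximate token counts taken from the comparison above.
TOKEN_COUNTS = {"English": 7, "Spanish": 10, "Japanese": 14, "Python": 12, "Arabic": 16}

# Hypothetical price, for illustration only.
PRICE_PER_MILLION_INPUT_TOKENS_USD = 1.00


def ratio_to_baseline(counts: dict[str, int], baseline: str = "English") -> dict[str, float]:
    """Token count of each input relative to the baseline, rounded to one decimal."""
    base = counts[baseline]
    return {name: round(count / base, 1) for name, count in counts.items()}


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of a given number of tokens at a per-million-token price."""
    return tokens * price_per_million / 1_000_000


print(ratio_to_baseline(TOKEN_COUNTS))

# Cost of sending each input one million times at the placeholder price.
for name, count in TOKEN_COUNTS.items():
    total = cost_usd(count * 1_000_000, PRICE_PER_MILLION_INPUT_TOKENS_USD)
    print(f"{name:<9} {total:6.2f} USD")
```

Because billing is linear in tokens, the cost ratio between two inputs equals their token ratio regardless of the actual price.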


Special tokens — control signals for the model

Beyond text tokens, every model uses reserved special tokens that control structure and behavior. Ordinary text is never encoded into these IDs; the tokenizer or chat template inserts them to mark boundaries, roles, and padding.

<|bos|> <|eos|> <|pad|> <|system|> <|user|> <|assistant|> <|endofturn|>
Boundary tokens

<|bos|> (beginning of sequence) and <|eos|> (end of sequence) tell the model where input starts and where generation should stop. Without EOS, the model keeps generating until it reaches the maximum output length.

Role markers

<|system|>, <|user|>, and <|assistant|> structure multi-turn conversations. The model uses these to distinguish system instructions, user queries, and its own prior responses. These are part of the chat template format.

<|bos|> <|system|> You are a helpful assistant. <|user|> What is BPE? <|assistant|>
A typical chat-formatted prompt. The model sees special tokens as distinct IDs, not as text substrings.
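A chat template of this shape can be sketched as a small rendering function. The token names follow the generic ones used above, and the function name and message format are assumptions for illustration. Each real model family defines its own template and special-token strings.

```python
def render_chat(messages: list[dict[str, str]]) -> str:
    """Render messages into a prompt that ends with an open assistant turn."""
    parts = ["<|bos|>"]
    for message in messages:
        parts.append(f"<|{message['role']}|>")  # role marker
        parts.append(message["content"])        # ordinary text
    # The trailing role marker cues the model to write the assistant's reply.
    parts.append("<|assistant|>")
    return " ".join(parts)


prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is BPE?"},
])
print(prompt)
```

In a real tokenizer the markers are mapped straight to their reserved IDs rather than being tokenized as text, which is why user input containing the same characters does not count as a role marker.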

Connection to edge inference
Why tokenization is critical at the edge

Edge-deployed models (quantized to INT4/INT8, running on limited hardware) typically have smaller context windows — often 2K–8K tokens instead of 128K. When your effective context is small, every token counts.

Prompt Compression
Shorter prompts leave more room for generation. Tokenizer-aware prompt engineering — using words that tokenize efficiently — can reduce token count by 15–25% without changing meaning.
KV Cache Pressure
Each token in the context requires key-value (KV) cache memory during generation. On memory-constrained edge devices, fewer input tokens mean the prompt occupies less of the KV cache, leaving more memory for generated tokens and enabling longer responses.
Latency Per Token
Prompt-processing latency, which determines time-to-first-token, grows roughly in proportion to prompt length. A prompt that tokenizes into 500 tokens instead of 700 can reduce time-to-first-token by roughly 29% ((700 - 500) / 700) on edge hardware where every millisecond matters.
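Both effects can be estimated with back-of-the-envelope arithmetic. The sketch below uses the common KV cache sizing formula (keys and values, per layer, per KV head, per token). The model shape of 32 layers, 8 KV heads, head dimension 128, and 16-bit cache values is a hypothetical example, and the latency estimate assumes the roughly linear scaling described above.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int = 32,        # assumed model shape, for illustration only
    n_kv_heads: int = 8,
    head_dim: int = 128,
    bytes_per_value: int = 2,  # 16-bit cache entries
) -> int:
    """Memory for keys and values across all layers for n_tokens of context."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


def relative_reduction(before: int, after: int) -> float:
    """Fractional saving when a quantity scales linearly with token count."""
    return (before - after) / before


per_token = kv_cache_bytes(1)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")

for prompt_tokens in (700, 500):
    mib = kv_cache_bytes(prompt_tokens) / 2**20
    print(f"{prompt_tokens} prompt tokens -> {mib:.1f} MiB of KV cache")

print(f"Estimated saving: {relative_reduction(700, 500):.1%}")
```

Under these assumptions each token costs 128 KiB of cache, so trimming 200 prompt tokens frees about 25 MiB for generated tokens.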