Field Technology Services — AI-Orchestrated Akamai Operations
Before a language model can process text, it must convert raw characters into tokens — integer IDs that map to entries in a fixed vocabulary. Tokenization is the bridge between human-readable text and the numerical tensors a neural network operates on.
Every prompt you send and every response you receive is first split into tokens. The model never sees raw text — it sees sequences of token IDs, processes them through its transformer layers, and predicts the next token ID in the sequence.
With character-level (or byte-level) tokenization, the vocabulary is tiny (~256 entries), but sequences become extremely long: the word "tokenization" becomes 12 separate tokens. Models struggle to learn word-level meaning from individual characters, and attention cost scales quadratically with sequence length.
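As a rough illustration of the sequence-length cost described above, the sketch below splits text into characters and into UTF-8 bytes. The function names are hypothetical and chosen only for this example.

```python
def char_tokenize(text: str) -> list[str]:
    """Split text into one token per character."""
    return list(text)


def byte_tokenize(text: str) -> list[int]:
    """Split text into one token per UTF-8 byte (IDs in the range 0-255)."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    print(len(char_tokenize("tokenization")))  # 12
    print(byte_tokenize("tok"))                # [116, 111, 107]
    # Non-ASCII characters expand to several bytes each.
    print(len(byte_tokenize("猫")))            # 3
```

Byte-level IDs never exceed 255, which is why the vocabulary stays small while the sequence grows.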
With word-level tokenization, each word is one token, but the vocabulary must be enormous to cover all words. Any word not in the vocabulary becomes <UNK>, so misspellings, neologisms, compound words, and morphological variants all lose their identity. Multilingual support is impractical.
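The out-of-vocabulary problem can be seen in a minimal word-level tokenizer. This is a sketch under simple assumptions (lowercasing, whitespace splitting); `build_vocab` and `encode` are illustrative names, not part of any particular library.

```python
UNK = "<UNK>"


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an integer ID to every distinct word seen in the corpus."""
    vocab = {UNK: 0}
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Map each word to its ID; unseen words collapse to the <UNK> ID."""
    return [vocab.get(word, vocab[UNK]) for word in text.lower().split()]


if __name__ == "__main__":
    vocab = build_vocab(["the cat sat on the mat"])
    # "tokenizes" was never seen, so it is replaced by ID 0 (<UNK>).
    print(encode("the cat tokenizes the mat", vocab))  # [1, 2, 0, 1, 5]
```

Once a word has been mapped to the unknown ID, nothing downstream can recover what it was.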
Subword methods like Byte Pair Encoding (BPE) split text into pieces that balance vocabulary size against sequence length. Common words stay as single tokens; rare words decompose into recognizable subwords. This gives models a compact vocabulary that can represent any input without <UNK> tokens.
BPE starts with individual bytes (or characters) and iteratively merges the most frequent adjacent pair into a new token. After thousands of merges, the vocabulary contains a mix of single characters, common subwords, and frequent whole words.
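The merge loop can be sketched on a toy corpus. This is a simplified character-level version with a made-up word-frequency table and an arbitrary tie-breaking rule. Production tokenizers operate on bytes, pre-split text with regular expressions, and train on far larger corpora.

```python
from collections import Counter

Word = tuple[str, ...]


def pair_counts(words: dict[Word, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts: Counter = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(words: dict[Word, int], pair: tuple[str, str]) -> dict[Word, int]:
    """Replace every occurrence of `pair` with a single merged symbol."""
    a, b = pair
    merged: dict[Word, int] = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn an ordered list of merges from word frequencies."""
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break
        # Most frequent pair; ties are broken by the pair itself (arbitrary choice).
        best = max(counts, key=lambda p: (counts[p], p))
        words = merge_pair(words, best)
        merges.append(best)
    return merges


def segment(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply learned merges, in order, to a single word."""
    words = {tuple(word): 1}
    for pair in merges:
        words = merge_pair(words, pair)
    return list(next(iter(words)))


if __name__ == "__main__":
    freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # toy data
    merges = train_bpe(freqs, num_merges=3)
    print(merges)                     # [('s', 't'), ('e', 'st'), ('o', 'w')]
    print(segment("lowest", merges))  # ['l', 'ow', 'est']
```

The word "lowest" never appears in the toy corpus, yet it decomposes into pieces learned from other words, which is the behavior that lets subword vocabularies cover unseen input.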
9 words → 9 tokens. Common English words are single tokens in most vocabularies.
1 word → 6 tokens. The model has never seen this word as a whole, but it recognizes each subword piece and can reason about the word's meaning compositionally.
| Model | Tokenizer | Vocab Size | Tokens |
|---|---|---|---|
| GPT-4 / GPT-4o | cl100k_base / o200k_base | 100,256 / 200,019 | |
| Claude 3.5 | Anthropic BPE | ~100,000 | |
| Llama 3 | tiktoken-based BPE | 128,256 | |
| Llama 2 | SentencePiece BPE | 32,000 | |
| Mistral 7B | SentencePiece BPE | 32,000 | |
Larger vocabularies produce shorter token sequences (more whole words become single tokens) but require larger embedding matrices, increasing model size. The optimal vocabulary size balances compression efficiency against memory cost.
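The memory side of this trade-off is simple arithmetic: the input embedding matrix has one row per vocabulary entry and one column per hidden dimension. The sketch below assumes a hidden size of 4096 and 2 bytes per parameter (16-bit weights) purely for illustration. Neither value comes from the table above.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of parameters in a vocab_size x d_model embedding matrix."""
    return vocab_size * d_model


def embedding_bytes(vocab_size: int, d_model: int, bytes_per_param: int = 2) -> int:
    """Memory footprint of the embedding matrix in bytes."""
    return embedding_params(vocab_size, d_model) * bytes_per_param


if __name__ == "__main__":
    D_MODEL = 4096  # assumed hidden size, for illustration only
    for vocab in (32_000, 128_256):
        params = embedding_params(vocab, D_MODEL)
        mib = embedding_bytes(vocab, D_MODEL) / 2**20
        print(f"vocab={vocab:>7,}  params={params:>11,}  fp16={mib:,.0f} MiB")
```

Under these assumptions, moving from a 32,000-entry to a 128,256-entry vocabulary roughly quadruples the embedding parameters, from about 131 million to about 525 million. Models that keep a separate output projection pay the cost twice.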
getUserById might become 3 tokens (get, User, ById), while x is always 1. GPT-4o's larger vocabulary improves code tokenization significantly.
| Input | Text | Tokens | Ratio |
|---|---|---|---|
| English | "The cat sat on the mat." | 7 | 1.0x |
| Spanish | "El gato se sentó en la alfombra." | 10 | 1.4x |
| Japanese | "猫がマットの上に座った。" | 14 | 2.0x |
| Python | def greet(name): return f"Hello {name}" | 12 | 1.7x |
| Arabic | "جلست القطة على الحصيرة." | 16 | 2.3x |
Non-English languages and source code consistently use more tokens for equivalent semantic content. This has direct implications for cost, latency, and effective context window size.
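The ratios in the table, and their effect on usable context, can be reproduced with a few lines. The token counts below are copied from the table. The 8,000-token window is an assumed figure for illustration, and "effective context" here simply means the window divided by the ratio.

```python
# Token counts taken from the table above.
TOKEN_COUNTS = {"English": 7, "Spanish": 10, "Japanese": 14, "Python": 12, "Arabic": 16}


def ratios_vs_baseline(counts: dict[str, int], baseline: str = "English") -> dict[str, float]:
    """Token count of each input relative to the baseline, rounded to one decimal."""
    base = counts[baseline]
    return {name: round(n / base, 1) for name, n in counts.items()}


def effective_context(window_tokens: int, ratio: float) -> int:
    """Approximate English-equivalent capacity of a context window."""
    return int(window_tokens / ratio)


if __name__ == "__main__":
    ratios = ratios_vs_baseline(TOKEN_COUNTS)
    for name, r in ratios.items():
        print(f"{name:<9}{r:.1f}x  ~{effective_context(8_000, r):,} English-equivalent tokens")
```

Because API pricing and latency typically scale with token count, the same ratio is a reasonable first estimate of the relative cost of a request.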
Beyond text tokens, every model uses reserved special tokens that control structure and behavior. These are never generated from text — they are injected by the tokenizer to mark boundaries, roles, and padding.
<|bos|> (beginning of sequence) and <|eos|> (end of sequence) tell the model where input starts and where generation should stop. Without EOS, the model keeps generating until an external limit, such as a maximum token count, cuts it off.
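The role of EOS in stopping generation can be sketched as a decoding loop. The "model" here is a stand-in callable that returns the next token ID, and `EOS_ID = 2` is a hypothetical value because real IDs differ per tokenizer. A cap on new tokens is included as a common safety net.

```python
from typing import Callable

EOS_ID = 2  # hypothetical end-of-sequence ID; real values depend on the tokenizer


def generate(
    next_token: Callable[[list[int]], int],
    prompt_ids: list[int],
    eos_id: int = EOS_ID,
    max_new_tokens: int = 16,
) -> list[int]:
    """Append predicted tokens until EOS appears or the cap is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(ids)
        if token == eos_id:
            break  # the model signalled that the response is complete
        ids.append(token)
    return ids[len(prompt_ids):]


def scripted_model(script: list[int]) -> Callable[[list[int]], int]:
    """Stand-in 'model' that replays a fixed list of token IDs."""
    it = iter(script)
    return lambda ids: next(it)


if __name__ == "__main__":
    print(generate(scripted_model([10, 11, EOS_ID, 12]), [1, 5]))  # [10, 11]
    # A model that never emits EOS is stopped only by the cap.
    print(len(generate(lambda ids: 7, [1], max_new_tokens=4)))     # 4
```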
<|system|>, <|user|>, and <|assistant|> structure multi-turn conversations. The model uses these to distinguish instructions from user queries from its own prior responses. These are part of the chat template format.
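A chat template is essentially string formatting around these role markers. The layout below is a hypothetical template built from the token names mentioned above. Each model family defines its own exact format, so this is a sketch rather than the template of any specific model.

```python
ROLES = ("system", "user", "assistant")


def render_chat(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Flatten a list of {role, content} messages into one prompt string."""
    parts = ["<|bos|>"]
    for message in messages:
        role = message["role"]
        if role not in ROLES:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"<|{role}|>\n{message['content']}\n")
    if add_generation_prompt:
        # Ending with the assistant marker cues the model to write its reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)


if __name__ == "__main__":
    prompt = render_chat(
        [
            {"role": "system", "content": "Be brief."},
            {"role": "user", "content": "Hi"},
        ]
    )
    print(prompt)
```

In a real tokenizer the role markers map to single reserved IDs rather than being split like ordinary text, which is what prevents user-supplied content from imitating them.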
Edge-deployed models (quantized to INT4/INT8, running on limited hardware) typically have smaller context windows — often 2K–8K tokens instead of 128K. When your effective context is small, every token counts.
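One common way to live within a small window is to keep the system prompt and drop the oldest conversation turns first. The sketch below assumes a whitespace-based `count_tokens` as a stand-in. In practice the count should come from the deployed model's own tokenizer, and the function names are illustrative.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Stand-in token counter (whitespace split); replace with a real tokenizer."""
    return len(text.split())


def fit_to_budget(
    system: str,
    turns: list[str],
    budget: int,
    count: Callable[[str], int] = count_tokens,
) -> list[str]:
    """Keep the system prompt plus as many of the most recent turns as fit."""
    used = count(system)
    kept: list[str] = []
    for turn in reversed(turns):  # newest first
        cost = count(turn)
        if used + cost > budget:
            break  # everything older than this turn is dropped
        kept.append(turn)
        used += cost
    return [system] + kept[::-1]  # restore chronological order


if __name__ == "__main__":
    history = ["a b c", "d e", "f g h i"]
    print(fit_to_budget("be brief", history, budget=8))
    # ['be brief', 'd e', 'f g h i']
```

Dropping whole turns is the simplest policy. Summarizing older turns is a common alternative when earlier context still matters.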