Lesson 26 • Advanced
Tokenization Strategies
How BPE, WordPiece, and SentencePiece convert human text into the numbers that LLMs actually process — and why it matters for performance and cost.
✅ What You'll Learn
- Byte Pair Encoding (BPE) — the standard for GPT models
- WordPiece (BERT) and SentencePiece (T5, Mistral)
- Token efficiency and cost implications
- How tokenization affects model performance
🔤 Why Tokenization Matters
🎯 Real-World Analogy: Tokenization is like breaking a Lego set into individual bricks. Word-level tokenization uses pre-built sections (fast but inflexible). Character-level uses individual studs (flexible but slow). Subword tokenization (BPE) uses a mix — common patterns are single bricks, rare words decompose into smaller pieces. This balance of efficiency and flexibility is why every modern LLM uses subword tokenization.
Tokenization directly affects: cost (you pay per token), context length (more tokens = less text fits), and capability (poor tokenization hurts reasoning on numbers, code, and non-English text).
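To make the cost point concrete, here is a small back-of-the-envelope sketch. The $3-per-million-token price is an illustrative assumption, not a current rate for any provider:

```python
def estimate_cost(num_tokens, price_per_million_tokens):
    """Estimate API cost: token count times per-token price.
    Note: real APIs bill input and output tokens at different rates."""
    return num_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical rate of $3 per million input tokens
doc_tokens = 120_000  # roughly a 90k-word English document (~0.75 words/token)
print(f"${estimate_cost(doc_tokens, 3.0):.2f}")  # → $0.36
```

The same document in a token-inefficient script (e.g. 2-3 tokens per character) could cost several times more and consume far more of the context window.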
Try It: BPE Tokenization
Train a BPE tokenizer from scratch and watch merges happen
```python
from collections import Counter

# Byte Pair Encoding (BPE): the most common tokenization method.
# Used by GPT-2, GPT-3, GPT-4, LLaMA, and most modern LLMs.

def get_pairs(word):
    """Count all adjacent symbol pairs in a tokenized word."""
    pairs = Counter()
    for i in range(len(word) - 1):
        pairs[(word[i], word[i + 1])] += 1
    return pairs

def merge_pair(word, pair):
    """Replace each occurrence of `pair` with its merged symbol."""
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def bpe_train(corpus, num_merges):
    """Train a BPE tokenizer: repeatedly merge the most frequent pair."""
    words = [list(w) for w in corpus.split()]  # start with character-level tokens
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for w in words:
            pair_counts.update(get_pairs(w))
        if not pair_counts:
            break  # nothing left to merge
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        words = [merge_pair(w, best) for w in words]
    return merges, words
```
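To watch merges happen, here is a compact, self-contained variant you can run on a toy corpus (`tiny_bpe` is an illustrative re-implementation, not the lesson's exact code):

```python
from collections import Counter

def tiny_bpe(corpus, num_merges):
    """Minimal BPE trainer: repeatedly merge the most frequent adjacent pair."""
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for w in words:
            counts.update(zip(w, w[1:]))  # adjacent pairs in each word
        if not counts:
            break
        a, b = counts.most_common(1)[0][0]
        merges.append(a + b)
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words

merges, words = tiny_bpe("low low low lower lowest", 3)
print(merges)  # learned merges, e.g. ['lo', 'low', 'lowe']
print(words)   # "low" is now a single token; rarer words stay split
```

Notice how the shared stem "low" quickly becomes one token while the rare suffixes "-er" and "-est" remain split — exactly the frequency-based behavior the table below attributes to BPE.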
Try It: Compare Tokenizers
See how different strategies split the same text
```python
# Comparing Tokenization Strategies
# BPE vs WordPiece vs SentencePiece

text = "unhappiness is overwhelming sometimes"

# Simulated, illustrative output for each tokenizer family
tokenizations = {
    "Word-level": ["unhappiness", "is", "overwhelming", "sometimes"],
    "Char-level": list(text.replace(" ", "_")),  # '_' visualizes spaces
    "BPE (GPT)": ["un", "happiness", " is", " over", "wh", "elming", " sometimes"],
    "WordPiece": ["un", "##happi", "##ness", "is", "over", "##whelming", "some", "##times"],
    "SentencePiece": ["▁un", "happi", "ness", "▁is", "▁over", "whelming", "▁sometimes"],
}

print("=== Tokenization Strategy Comparison ===")
for name, tokens in tokenizations.items():
    print(f"{name:>13}: {len(tokens):2d} tokens  {tokens}")
```
⚠️ Common Mistake: Assuming 1 word = 1 token. In English, 1 token ≈ 0.75 words on average. But for code, numbers, and non-Latin scripts, the ratio is much worse. "fibonacci" might be 3+ tokens. Chinese characters can be 2-3 tokens each. Always check with the actual tokenizer.
💡 Pro Tip: Use OpenAI's tiktoken library or Hugging Face's tokenizers to count tokens before sending API calls. This prevents unexpected costs and context length overflows. For LLaMA/Mistral, use sentencepiece.
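Following that tip, a defensive token counter might look like the sketch below. The 4-characters-per-token fallback is a rough heuristic for English text, not an exact count:

```python
def count_tokens(text, model="gpt-4"):
    """Count tokens with tiktoken when installed; otherwise fall back
    to a rough ~4-characters-per-token heuristic for English."""
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except ImportError:
        return max(1, len(text) // 4)

n = count_tokens("Tokenization directly affects cost and context length.")
print(n)
```

Running this check before every API call lets you reject or truncate inputs that would blow past the context window, instead of discovering the overflow from an API error.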
📋 Quick Reference
| Method | Used By | Key Feature |
|---|---|---|
| BPE | GPT-2/3/4, LLaMA | Frequency-based merging |
| WordPiece | BERT, DistilBERT | Likelihood-based merging |
| SentencePiece | T5, Mistral, mBART | Language-agnostic, no pre-tokenization |
| Byte-level BPE | GPT-2 onward, RoBERTa | Operates on UTF-8 bytes |
| Tiktoken | OpenAI models | Fast Rust-based BPE |
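The families in the table also use different marker conventions, which matters when you reassemble tokens into text. This sketch shows the three common schemes (the token splits are illustrative):

```python
def detok_bpe(tokens):
    """GPT-style BPE: leading spaces are part of the token itself."""
    return "".join(tokens)

def detok_wordpiece(tokens):
    """WordPiece: '##' marks a continuation of the previous token."""
    out = ""
    for t in tokens:
        out += t[2:] if t.startswith("##") else ((" " + t) if out else t)
    return out

def detok_sentencepiece(tokens):
    """SentencePiece: '▁' (U+2581) marks a word boundary."""
    return "".join(tokens).replace("\u2581", " ").strip()

print(detok_bpe(["un", "happiness", " is", " real"]))
print(detok_wordpiece(["un", "##happi", "##ness", "is", "real"]))
print(detok_sentencepiece(["\u2581un", "happi", "ness", "\u2581is", "\u2581real"]))
# all three print: unhappiness is real
```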
🎉 Lesson Complete!
You now understand how LLMs convert text to numbers. Next, learn how to fine-tune these models on your own data with LoRA!