Lesson 26 • Advanced

Tokenization Strategies

By the end of this lesson you'll be able to split text into tokens by hand, run one Byte-Pair Encoding merge, read real token IDs from a live tokenizer, and explain why token count drives both cost and context length.

What You'll Learn in This Lesson

✓ Why models split text into words, subwords, or characters
✓ How Byte-Pair Encoding (BPE) learns merges, step by step
✓ How WordPiece and SentencePiece differ from BPE
✓ What a vocabulary is and how tokens become integer IDs
✓ What special tokens ([CLS], [SEP], <eos>) are for
✓ Why token count decides your cost and context limit

Before you start: It helps to have finished LLM Architecture. You should also be comfortable with basic Python lists, tuples, and dictionaries — every example here is plain Python you can run.

🧱 Real-World Analogy: Breaking Words into Lego Pieces

Imagine you have to build any sentence out of Lego. You have three choices. Keep a giant bin of pre-built words (fast, but you have no brick for a word you've never seen). Keep only single studs, one per character (you can build anything, but even "the" takes three pieces and sequences get painfully long). Or keep a smart middle bin of common chunks: whole bricks for frequent words like "the" and frequent fragments like "ing", plus smaller pieces you snap together for rare words.

That middle bin is subword tokenization, and it is what every modern LLM uses. A model first learns which chunks are worth keeping by scanning huge amounts of text, then reuses those chunks to encode any new sentence — even words it has never seen, which it simply spells out from smaller pieces.

1. Words vs Subwords vs Characters

A token is the smallest chunk of text a model reads. Tokenization is the process of cutting text into those chunks. There are three families of approach, and each trades vocabulary size against sequence length.

Word-level

One token per word. Short sequences, but a huge vocabulary and no token at all for unseen words (the "out-of-vocabulary" problem).

Character-level

One token per character. Tiny vocabulary and never stuck on a new word, but sequences get very long and meaning is spread thin.

Subword (the winner)

Common words stay whole; rare words split into pieces. Moderate vocabulary, moderate length, and no out-of-vocabulary words.
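As a rough illustration of that trade-off, the sketch below tokenizes one sentence at word level and at character level, then compares sequence length and vocabulary size. The sample sentences and the tiny "known words" set are made up for the demo.

```python
# Toy comparison of word-level vs character-level tokenization.
# The sentences and the known-word set are illustrative assumptions.
sentence = "the cat sat on the mat"

word_tokens = sentence.split()      # one token per word
char_tokens = list(sentence)        # one token per character (spaces included)

print("word-level:", len(word_tokens), "tokens,", len(set(word_tokens)), "unique")
print("char-level:", len(char_tokens), "tokens,", len(set(char_tokens)), "unique")

# A word-level vocabulary has nothing for a word it never saw in training.
known_words = set(word_tokens)
new_sentence = "the dog sat"
oov = [w for w in new_sentence.split() if w not in known_words]
print("out-of-vocabulary words:", oov)
```

The word-level sequence is short but every new word needs a new vocabulary entry; the character-level sequence is several times longer for the same sentence.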

The example below builds a tiny tokenizer by hand so you can see exactly what "splitting text" means, then runs the single most important step of subword learning.

Try It: Tokenize by hand + one BPE merge

Split text into tokens, then merge the most frequent pair — pure Python, no installs

Python

# A tiny tokenizer you can fully understand — no libraries needed.
# Step 1: split on whitespace AND keep punctuation as its own token.

def simple_tokenize(text):
    tokens = []
    word = ""
    for ch in text:
        if ch.isspace():                 # space ends the current word
            if word:
                tokens.append(word)
                word = ""
        elif ch in ".,!?;:'\"()":         # punctuation is its own token
            if word:
                tokens.append(word)
  
...
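One way the whole exercise could look end to end is sketched below. It is not the lesson's original listing: the helper `merge_pair`, the sample sentence, and the tie-breaking behaviour (first pair seen wins) are assumptions made for illustration.

```python
from collections import Counter

def simple_tokenize(text):
    """Split on whitespace and emit each punctuation mark as its own token."""
    tokens, word = [], ""
    for ch in text:
        if ch.isspace() or ch in ".,!?;:'\"()":
            if word:                      # flush the word built so far
                tokens.append(word)
                word = ""
            if not ch.isspace():          # punctuation becomes a token
                tokens.append(ch)
        else:
            word += ch
    if word:                              # flush the final word
        tokens.append(word)
    return tokens

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

tokens = simple_tokenize("low, lower, lowest!")
print(tokens)

# One BPE merge: split words into characters, count pairs, merge the top pair.
words = [list(t) for t in tokens if t.isalpha()]
pair_counts = Counter()
for symbols in words:
    pair_counts.update(zip(symbols, symbols[1:]))
best_pair = max(pair_counts, key=pair_counts.get)   # ties: first pair seen wins
words = [merge_pair(symbols, best_pair) for symbols in words]
print(best_pair, words)
```

With this input, ("l", "o") and ("o", "w") both occur three times; `max` returns the first one it encounters, so the merged symbol is "lo".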

2. Byte-Pair Encoding (BPE), Step by Step

BPE is the algorithm behind the GPT and LLaMA tokenizers. The idea is simple enough to do on paper. You just repeat one rule until your vocabulary is the size you want:

1. Start with every word as a list of single characters.
2. Count every adjacent pair of symbols across all words.
3. Merge the most frequent pair into one new symbol everywhere.
4. Record that merge, then go back to step 2.

Each merge adds one entry to the vocabulary. After thousands of merges, frequent words and fragments ("the", "ing", "tion") have become single tokens, while anything rare still decomposes into smaller learned pieces, so the model is never stuck on a new word. You already ran steps 1 through 3 once in the example above; the only thing a real trainer adds is a loop.

Key insight: a trained tokenizer is just an ordered list of merges. Encoding new text means replaying those merges in order; decoding means gluing the pieces back together.
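To make the "ordered list of merges" idea concrete, here is a minimal sketch of a training loop and an encoder that replays the learned merges. The function names (`train_bpe`, `encode`, `merge_pair`), the toy corpus, and the number of merges are all assumptions; real trainers also weight words by frequency and work on bytes.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(corpus_words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    words = [list(w) for w in corpus_words]           # step 1: characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in words:                         # step 2: count pairs
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # step 3: top pair
        merges.append(best)                           # step 4: record it
        words = [merge_pair(s, best) for s in words]
    return merges

def encode(word, merges):
    """Encode a new word by replaying the merges in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

corpus = ["low", "low", "lower", "newest", "newest", "widest"]
merges = train_bpe(corpus, num_merges=4)
print(merges)
print(encode("lowest", merges))            # a word that is not in the corpus
print("".join(encode("lowest", merges)))   # decoding is just concatenation
```

"lowest" never appears in the toy corpus, yet it encodes into two learned pieces because "low" and "est" were each built up from frequent pairs.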

Your Turn: Count the pairs

Fill in the blanks to find the most frequent adjacent pair

Python

# 🎯 YOUR TURN — find the most frequent adjacent pair yourself.

words = [("l", "o", "w"), ("l", "o", "w"), ("n", "e", "w")]

pair_counts = {}
for word in words:
    for i in range(len(word) - 1):
        pair = (word[i], word[i + 1])
        # 👉 add 1 to this pair's count (use .get with a default of 0)
        pair_counts[pair] = ___

# 👉 pick the pair with the HIGHEST count
best_pair = max(pair_counts, key=___)

print("Counts:", pair_counts)
print("Best pair:", best_pair)
# ✅ Expected output
...

3. WordPiece and SentencePiece

BPE is one subword algorithm; two close cousins power most other models. They differ in how they pick merges and how they mark where words begin.

WordPiece (BERT)

Instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data. Continuation pieces get a ## prefix, so "tokenization" becomes token + ##ization.
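The lesson does not spell out how "most increases the likelihood" is computed. A commonly cited approximation scores each pair as count(pair) / (count(first) × count(second)), which favours pairs whose parts rarely occur apart. The sketch below assumes that scoring rule and a made-up three-word corpus, and compares its choice with plain BPE frequency.

```python
from collections import Counter

word_counts = {"the": 4, "ten": 3, "qi": 2}   # toy corpus (assumption)

def to_pieces(word):
    """First character as-is; later characters get the '##' continuation prefix."""
    return [word[0]] + ["##" + ch for ch in word[1:]]

symbol_counts, pair_counts = Counter(), Counter()
for word, n in word_counts.items():
    pieces = to_pieces(word)
    for piece in pieces:
        symbol_counts[piece] += n
    for pair in zip(pieces, pieces[1:]):
        pair_counts[pair] += n

def wordpiece_score(pair):
    """Assumed score: pair frequency normalised by the frequency of its parts."""
    first, second = pair
    return pair_counts[pair] / (symbol_counts[first] * symbol_counts[second])

bpe_choice = max(pair_counts, key=pair_counts.get)
wordpiece_choice = max(pair_counts, key=wordpiece_score)
print("BPE would merge:      ", bpe_choice)
print("WordPiece would merge:", wordpiece_choice)
```

BPE picks a high-count pair built from very common symbols, while the normalised score picks ("q", "##i"): it is rarer, but its two parts never appear without each other.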

SentencePiece (T5, Mistral)

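Works directly on the raw text stream with no whitespace pre-splitting, which makes it language-agnostic (useful for Chinese or Japanese, which do not put spaces between words). Whitespace is treated as an ordinary symbol: the start of a word is marked with ▁ (U+2581, a character that looks like an underscore) instead of ##. SentencePiece is a library rather than a single algorithm; it can train either BPE or unigram models.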
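A small sketch of why the word-start marker makes tokenization reversible: once spaces are rewritten as a visible symbol, joining the pieces recovers the exact input. The helper names and the hand-picked cut points are hypothetical, and prefixing the whole string with a marker is an assumption about typical default behaviour, not a call into the SentencePiece library.

```python
WORD_START = "\u2581"   # the marker character used in place of a space

def mark_spaces(text):
    """Treat whitespace as an ordinary symbol so nothing is lost."""
    return WORD_START + text.replace(" ", WORD_START)

def unmark(pieces):
    """Undo the marking: join pieces, turn markers back into spaces."""
    return "".join(pieces).replace(WORD_START, " ")[1:]

marked = mark_spaces("new york city")
# Arbitrary cut points for illustration; a trained model would choose them.
pieces = [marked[:4], marked[4:9], marked[9:]]
print(pieces)
print(unmark(pieces))
```

Because the marker travels with the pieces, no separate rule about where spaces go is needed when decoding.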

Below, a real Hugging Face tokenizer shows the ## continuation prefix and the special tokens BERT wraps around your text. Read the # Expected output comments to see exactly what each call returns.

🤗 Worked Example: A Real Tokenizer + Special Tokens

This needs pip install transformers, so it is read-only here — but every line is annotated with what it returns. Notice the [CLS] and [SEP] markers that encode() adds automatically.

# Hugging Face tokenizers add SPECIAL TOKENS that frame the input.
# pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece

text = "tokenization is wonderful"
out = tok.tokenize(text)
print(out)
# Expected output: ['token', '##ization', 'is', 'wonderful']
# The '##' prefix means "this attaches to the previous token" (a subword).

# encode() wraps the text in special tokens the model expects:
ids = tok.encode(text)
print(tok.convert_ids_to_tokens(ids))
# Expected output: ['[CLS]', 'token', '##ization', 'is', 'wonderful', '[SEP]']
# [CLS] marks the start (used for classification); [SEP] marks the end.

# GPT-style models use different markers, e.g. <|endoftext|> (an <eos> token)
# to signal "the sequence is finished". Special tokens are part of the vocabulary
# but never appear in normal text — they are reserved control symbols.

4. Vocabulary, Token IDs, and Special Tokens

The output of training is a vocabulary: a fixed list of every token the model knows, each paired with a unique integer token ID. The model never works with text — it works with these IDs. Encoding turns text into a list of IDs; decoding turns IDs back into text.

A handful of IDs are reserved for special tokens that carry structure rather than meaning. They never appear in ordinary text:

[CLS] — start-of-input marker BERT uses to summarise the whole sequence for classification.
[SEP] — separates segments (and marks the end of a BERT input).
<eos> / <|endoftext|> — "end of sequence" for GPT-style models, telling them to stop generating.
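The relationship between a vocabulary, token IDs, and special tokens can be sketched with a hand-written toy vocabulary. The ID layout (special tokens in slots 0 to 3), the [PAD] and [UNK] entries, and the function names are assumptions for illustration; real vocabularies hold tens of thousands of subword entries.

```python
# Toy word-level vocabulary; IDs 0-3 are reserved for special tokens.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "cat": 4, "sat": 5, "the": 6}
id_to_token = {i: tok for tok, i in vocab.items()}
FRAMING = {"[PAD]", "[CLS]", "[SEP]"}   # structure only, dropped when decoding

def encode(tokens):
    """Frame the input BERT-style, then map each token to its integer ID."""
    framed = ["[CLS]"] + tokens + ["[SEP]"]
    return [vocab.get(tok, vocab["[UNK]"]) for tok in framed]

def decode(ids):
    """Map IDs back to tokens, skipping the framing tokens."""
    return [id_to_token[i] for i in ids if id_to_token[i] not in FRAMING]

ids = encode(["the", "cat", "ran"])
print(ids)
print(decode(ids))
```

The unseen word "ran" collapses to the [UNK] ID here because this toy vocabulary is word-level; a subword vocabulary would split it into known pieces instead.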

The example below uses OpenAI's tiktoken — the actual tokenizer GPT-4 uses — so you can see real token IDs and where a leading space lives.

# The real thing: OpenAI's tokenizer (a fast byte-level BPE).
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # used by GPT-4 / GPT-3.5

text = "Tokenization isn't magic."
ids = enc.encode(text)                       # text  -> token IDs (integers)
print("Token IDs:", ids)
# Expected output: Token IDs: [3404, 2065, 4267, 956, 11204, 13]

# Every ID maps back to a piece of text (a "token"):
pieces = [enc.decode([i]) for i in ids]
print("Tokens:", pieces)
# Expected output: Tokens: ['Token', 'ization', ' isn', "'t", ' magic', '.']

print("Token count:", len(ids))
# Expected output: Token count: 6

# Note the leading SPACE inside ' isn' and ' magic'.
# Byte-level BPE attaches the space to the FOLLOWING word — that is normal.

5. Why Token Count Matters (Cost & Context)

Token count is not an academic detail — it is what you pay for and what limits how much the model can read at once. Two hard constraints flow directly from it:

Cost: APIs bill per token (for example, per 1,000 tokens). A wordier prompt is a more expensive call.
Context length: every model has a maximum number of tokens it can hold. Your prompt plus the model's answer must fit inside that window.

A useful rule of thumb for English: 1 token ≈ 4 characters ≈ 0.75 words. Code, long numbers, and non-English scripts use far more tokens per word, so always measure rather than guess. The exercise below turns a token count into a dollar cost and a remaining-context budget.
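The rule of thumb itself fits in a few lines. This is only a rough estimator for English prose under the two ratios quoted above; the function name and sample sentence are made up, and a real tokenizer should be used whenever the number matters.

```python
def estimate_tokens(text):
    """Rough English-only estimates from the two rules of thumb."""
    by_chars = len(text) / 4              # 1 token is about 4 characters
    by_words = len(text.split()) / 0.75   # 1 token is about 0.75 words
    return round(by_chars), round(by_words)

sample = "Tokenization is the first step every prompt goes through."
chars_estimate, words_estimate = estimate_tokens(sample)
print("estimate from characters:", chars_estimate)
print("estimate from words:     ", words_estimate)
```

The two estimates disagree slightly even on plain English, which is a reminder that they are averages, not guarantees.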

Your Turn: Token cost & context budget

Fill in the blanks to turn a token count into a price and remaining context

Python

# 🎯 YOUR TURN — why token count = money and context.
# A model charges per 1,000 tokens. Work out the cost of a prompt.

token_count = 1500          # tokens in your prompt + answer
price_per_1k = 0.01         # dollars per 1,000 tokens

# 👉 cost = how many "thousands" of tokens, times the price per thousand
cost = ___

print(f"This call costs ${cost:.4f}")
# ✅ Expected output:
# This call costs $0.0150

context_limit = 8192        # the model's maximum tokens
# 👉 how many tokens are LEFT for
...

Common Errors and Misconceptions

❌ "One token equals one word"

Counting words to estimate tokens undercounts badly.

✅ Fix: 1 token ≈ 0.75 words in English; measure code and non-English text with the real tokenizer, where "fibonacci" can be 3+ tokens.

❌ "The tokenizer will choke on a word it's never seen" (OOV)

Word-level vocabularies hit out-of-vocabulary (OOV) words and fail. People assume subword tokenizers do too.

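✅ Fix: subword tokenizers rarely or never hit OOV. An unknown word just decomposes into smaller subwords, falling back to single characters or bytes if needed. Byte-level BPE can encode any string; WordPiece emits [UNK] only when a word contains a character missing from its vocabulary.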
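The fallback can be sketched with a greedy longest-match split, the strategy WordPiece-style encoders use at inference time (BPE encoders replay merges instead). The toy vocabulary and function name are assumptions, and the single-character fallback here is unconditional, which glosses over the [UNK] case.

```python
def split_into_subwords(word, vocab):
    """Greedy longest-match-first split; single characters are the fallback."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocab or is one character.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces

vocab = {"token", "ization", "un", "believ", "able"}   # toy subword vocabulary
print(split_into_subwords("tokenization", vocab))
print(split_into_subwords("unbelievable", vocab))
print(split_into_subwords("zyx", vocab))   # nothing matches: spelled out
```

Every input produces some sequence of pieces, and joining the pieces always reproduces the original word.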

❌ "Tokens cost the same in every language"

Many tokenizers are trained mostly on English text, so other scripts fragment heavily.

✅ Fix: expect multilingual text to use 2–3× more tokens per character; a single Chinese, Japanese, or emoji character can become several tokens, raising cost and shrinking effective context.
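One reason a single character can turn into several tokens is that byte-level tokenizers start from UTF-8 bytes, and many characters need more than one byte. The short check below uses only the standard library; the four sample characters are arbitrary picks.

```python
# UTF-8 byte length of a few characters (escapes used to avoid encoding issues).
samples = {
    "a": "a",                  # ASCII letter
    "e-acute": "\u00e9",       # Latin letter with accent
    "CJK": "\u4e2d",           # a common Chinese character
    "emoji": "\U0001F600",     # grinning face
}
byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
for name, n in byte_lengths.items():
    print(f"{name}: {n} byte(s)")
```

Unless the tokenizer learned a merge covering all of a character's bytes, each leftover byte can end up as its own token.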

❌ "A trailing space doesn't change anything"

Byte-level BPE attaches the space to the following word, so "hello" and "hello " (or " hello") can tokenize differently.

✅ Fix: be deliberate about leading/trailing spaces in prompts and stop sequences — an unexpected trailing-space token can break exact string matching.
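The effect can be seen without a real tokenizer. The pattern below is a simplified stand-in for the kind of pre-tokenization regex byte-level BPE tokenizers apply before merging; it is an assumption written for this demo, not the pattern any specific model uses.

```python
import re

# Simplified stand-in (assumption): an optional leading space is glued to the
# word, number, or punctuation run that follows it; leftover whitespace stands alone.
PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pre_tokenize(text):
    """Split text into chunks that would then be BPE-encoded separately."""
    return PATTERN.findall(text)

for text in ["hello", " hello", "hello ", "hello world"]:
    print(repr(text), "->", pre_tokenize(text))
```

"hello", " hello", and "hello " all produce different chunks, so they map to different token sequences even though they look nearly identical on screen.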

📋 Quick Reference

Method	Used By	Picks Merge By	Word Marker
BPE	GPT-2/3/4, LLaMA	Most frequent pair	Leading space
WordPiece	BERT, DistilBERT	Max likelihood	`##` continuation
SentencePiece	T5, Mistral, mBART	BPE or unigram, on raw text	`▁` word start
tiktoken	OpenAI (GPT-3.5/4)	Byte-level BPE	Leading space

Rule of thumb (English): 1 token ≈ 4 characters ≈ 0.75 words. Common special tokens: [CLS], [SEP] (BERT) and <|endoftext|> / <eos> (GPT).

❓ Frequently Asked Questions

Q: What is a token in an LLM?

A: A token is the smallest unit a language model reads — usually a subword. It can be a whole word, a word fragment, a single character, punctuation, or a space. The model never sees letters or words directly; it only sees the integer ID assigned to each token.

Q: Is one token the same as one word?

A: No. In English, 1 token is roughly 0.75 words, or about 4 characters, on average. Common words are often a single token, but rare or long words split into several subword tokens, and code, numbers and non-English text use many more tokens per word.

Q: What is the difference between BPE, WordPiece and SentencePiece?

A: All three are subword tokenizers. BPE (GPT, LLaMA) repeatedly merges the most frequent adjacent pair. WordPiece (BERT) merges the pair that most increases the training-data likelihood and marks continuations with ##. SentencePiece (T5, Mistral) works on the raw text stream with no whitespace pre-splitting, marking word starts with ▁ (U+2581, which looks like an underscore).

Q: What are special tokens like [CLS], [SEP] and <eos>?

A: Special tokens are reserved IDs in the vocabulary that carry structure rather than text. [CLS] marks the start of a BERT input and [SEP] separates segments; <eos> or <|endoftext|> tells a GPT-style model the sequence is finished. They never appear in normal user text.

Q: Why does token count matter for cost and context?

A: APIs bill per token, so more tokens means a more expensive call. Every model also has a fixed context limit measured in tokens — your prompt plus the answer must fit inside it. Counting tokens before you call (with tiktoken or a Hugging Face tokenizer) prevents surprise bills and truncated responses.

🎯 Mini-Challenge: Build a Tiny Vocabulary

Put it all together. Tokenize a sentence, assign each unique token an integer ID, then re-encode the sentence as a list of those IDs — exactly what a real tokenizer does, in miniature. The outline is below; the logic is up to you.

Mini-Challenge: Vocabulary builder

Tokenize, assign IDs, and encode a sentence — outline only

Python

# 🎯 MINI-CHALLENGE: a word-frequency "vocabulary" builder
#
# 1. Take this text:  text = "the cat sat the cat ran the dog"
# 2. Tokenize it by splitting on spaces  (text.split())
# 3. Count how many times each token appears  (a dict, or collections.Counter)
# 4. Build a vocab: assign each UNIQUE token an integer ID (0, 1, 2, ...)
#    Hint: enumerate(sorted(set(tokens)))
# 5. Print the vocab dict and the original text rewritten as a list of token IDs
#
# ✅ Expected output (IDs depend on sort or
...

🎉

Lesson 26 complete — you can read how an LLM reads!

You tokenized text by hand, ran a Byte-Pair Encoding merge, compared BPE with WordPiece and SentencePiece, saw real token IDs and special tokens, and turned token counts into cost and context budgets. This is the layer every prompt passes through before the model ever sees it.

🚀 Up next: Fine-Tuning LLMs — adapt a pre-trained model to your own data with techniques like LoRA.
