Lesson 12 • Intermediate
Transformers & Large Language Models
Master the architecture behind ChatGPT, BERT, and every modern AI breakthrough — self-attention, positional encoding, and autoregressive generation.
✅ What You'll Learn
- Self-attention: how words "look at" other words
- Positional encoding: giving order to sequences
- Multi-head attention: multiple perspectives at once
- How LLMs generate text token-by-token
🤖 The Transformer Revolution
🎯 Real-World Analogy: Imagine reading a mystery novel. When you reach "the detective found the weapon at the bank", your brain instantly checks: was this story about a river or a robbery? You look back at the entire context to understand the word. That's exactly what self-attention does — every word looks at every other word to understand meaning.
Before Transformers (2017), models read text one word at a time (RNNs). Transformers read the entire sequence at once, making them massively faster and better at capturing long-range context. This single paper — "Attention Is All You Need" — launched GPT, BERT, and the entire modern AI era.
🏗️ Transformer Architecture
- Input Embedding → Convert tokens to vectors
- Positional Encoding → Add position information
- Multi-Head Attention → Attend to context from multiple perspectives
- Feed-Forward Network → Process each position independently
- Layer Norm + Residuals → Stabilise deep training
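Putting the pieces together, a single Transformer block can be sketched in NumPy. This is a minimal illustration under simplifying assumptions: Q, K, and V are all the raw input (no learned projections), and layer norm has no trained scale/shift parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position's vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def attention(X):
    # Simplified self-attention: Q = K = V = X (no learned projections)
    scores = X @ X.T / np.sqrt(X.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

def feed_forward(X, W1, W2):
    # Position-wise FFN: the same weights applied to every token independently
    return np.maximum(0, X @ W1) @ W2  # ReLU in between

def transformer_block(X, W1, W2):
    X = layer_norm(X + attention(X))             # attention + residual + norm
    X = layer_norm(X + feed_forward(X, W1, W2))  # FFN + residual + norm
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # 5 tokens, d_model = 8
W1 = rng.normal(size=(8, 32)) * 0.1  # expand
W2 = rng.normal(size=(32, 8)) * 0.1  # project back
out = transformer_block(X, W1, W2)
print(out.shape)  # (5, 8) — same shape in, same shape out
```

Note the residual connections: each sub-layer adds its output back onto its input, which is what lets dozens of these blocks be stacked without gradients vanishing.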
Try It: Self-Attention
See how words attend to each other — the core of Transformers
```python
import numpy as np

# Self-Attention: The core mechanism behind Transformers
# "Which other words should I pay attention to?"

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention"""
    d_k = Q.shape[-1]
    # Step 1: Compute similarity scores
    scores = Q @ K.T / np.sqrt(d_k)
    # Step 2: Convert to probabilities
    weights = softmax(scores)
    # Step 3: Weighted sum of values
    return weights @ V, weights

# Toy example: 3 tokens, 4-dimensional embeddings (Q = K = V = X)
X = np.random.randn(3, 4)
output, weights = self_attention(X, X, X)
print(weights.round(2))  # each row sums to 1
```
Try It: Positional Encoding
Give Transformers a sense of word order using sinusoidal patterns
```python
import numpy as np

# Positional Encoding: Giving Transformers a sense of word ORDER
# Without this, "dog bites man" = "man bites dog" (bad!)

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (from 'Attention Is All You Need')"""
    pos = np.arange(seq_len)[:, np.newaxis]  # (seq_len, 1)
    dim = np.arange(d_model)[np.newaxis, :]  # (1, d_model)
    # Alternate between sin and cos
    angles = pos / (10000 ** (2 * (dim // 2) / d_model))
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dims: sin
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dims: cos
    return encoding

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16) — one unique pattern per position
```
Try It: Multi-Head Attention
Multiple attention heads capture different relationship patterns
```python
import numpy as np

# Multi-Head Attention: Looking at text from MULTIPLE perspectives
# Like reading a sentence for grammar, meaning, and emotion simultaneously

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def single_head_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)
    return weights @ V, weights

def multi_head_attention(X, num_heads, d_model):
    # Simplified: each head gets a slice of the embedding
    # (real Transformers use learned projection matrices per head)
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        X_h = X[:, h * d_head:(h + 1) * d_head]
        out, _ = single_head_attention(X_h, X_h, X_h)
        heads.append(out)
    # Concatenate all heads back to d_model dimensions
    return np.concatenate(heads, axis=-1)

X = np.random.randn(4, 8)  # 4 tokens, d_model = 8
print(multi_head_attention(X, num_heads=2, d_model=8).shape)  # (4, 8)
```
⚠️ Common Misconception: GPT and BERT are both Transformers, but they use different halves of the architecture. GPT stacks decoder blocks and predicts the next word left-to-right; BERT stacks encoder blocks and reads the whole sentence in both directions. GPT generates text; BERT understands and classifies it.
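The decoder/encoder difference shows up concretely in the attention mask. A minimal sketch: GPT-style decoders mask out future tokens (causal mask), while BERT-style encoders let every token attend in both directions.

```python
import numpy as np

seq_len = 4

# BERT-style (encoder): every token may attend to every other token
bidirectional_mask = np.ones((seq_len, seq_len))

# GPT-style (decoder): token i may only attend to tokens 0..i (no peeking ahead)
causal_mask = np.tril(np.ones((seq_len, seq_len)))
print(causal_mask)
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]

# Masked-out positions get scores of -inf before softmax,
# so their attention weights become exactly zero.
scores = np.random.randn(seq_len, seq_len)
masked_scores = np.where(causal_mask == 1, scores, -np.inf)
weights = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0])  # first token can only attend to itself: [1. 0. 0. 0.]
```

This one mask is why GPT can generate text autoregressively: at training time, every position predicts its next token without ever seeing the answer.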
Try It: LLM Text Generation
Simulate how ChatGPT generates text token-by-token with temperature control
```python
import numpy as np

# How Large Language Models (LLMs) Generate Text
# Simplified autoregressive text generation

# Simulate a tiny vocabulary and next-token prediction
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast", "<end>"]
vocab_size = len(vocab)

def fake_logits(context):
    """Simulate model predicting next token probabilities"""
    np.random.seed(hash(tuple(context)) % 2**31)
    logits = np.random.randn(vocab_size)
    # Bias toward sensible continuations
    if context and context[-1] == "the":
        logits[vocab.index("cat")] += 2.0
    return logits

def generate(prompt, max_tokens=5, temperature=1.0):
    tokens = list(prompt)
    for _ in range(max_tokens):
        probs = np.exp(fake_logits(tokens) / temperature)  # low T = safer picks
        probs /= probs.sum()
        next_token = np.random.choice(vocab, p=probs)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return " ".join(tokens)

print(generate(["the"], temperature=0.7))
```
📋 Quick Reference
| Concept | Purpose | Key Idea |
|---|---|---|
| Self-Attention | Capture context | Every token attends to every other |
| Positional Encoding | Encode word order | Sin/cos patterns per position |
| Multi-Head | Multiple perspectives | Parallel attention heads |
| Temperature | Control randomness | Low = safe, High = creative |
| GPT | Generate text | Decoder-only, autoregressive |
| BERT | Understand text | Encoder-only, bidirectional |
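To see the temperature row in action, here's a quick sketch of how dividing the logits by a temperature reshapes the next-token distribution before sampling:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])  # raw model scores for 4 tokens

for T in [0.2, 1.0, 2.0]:
    probs = softmax(logits / T)
    print(f"T={T}: {probs.round(3)}")
# Low T sharpens the distribution (the top token dominates: safe, repetitive);
# high T flattens it (more probability mass on unlikely tokens: creative, riskier).
```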
💡 Pro Tip: You don't need to train your own Transformer! Use pre-trained models from Hugging Face. With `pipeline("text-generation", model="gpt2")` you get a working LLM in two lines of code. Fine-tuning beats training from scratch 99% of the time.
🎉 Lesson Complete!
You now understand the architecture powering modern AI! Next, explore Reinforcement Learning — how AI agents learn from trial and error.