Lesson 12 • Intermediate

Transformers & Large Language Models

By the end of this lesson you'll be able to explain, and compute by hand, the self-attention mechanism that powers ChatGPT, BERT, and nearly every modern large language model.

What You'll Learn in This Lesson

✓What self-attention is, using the Query / Key / Value intuition
✓How to compute a scaled dot-product attention score by hand
✓Why softmax turns raw scores into weights that sum to 1.0
✓What multi-head attention adds, and why several heads help
✓Why positional encoding gives a sequence its word order
✓How a transformer block stacks attention, a feed-forward net, residuals, and layernorm — and how encoders differ from decoders

Before you start: It helps to have finished Lesson 11: Natural Language Processing so you're comfortable with tokens and word embeddings (turning words into number vectors).

🗣️ Real-World Analogy: Focusing Attention in a Conversation

Picture yourself at a noisy dinner table. Someone says "can you pass it?" To understand "it", you don't weigh every word equally — you instinctively focus on the salt someone mentioned a moment ago and mostly ignore the unrelated chatter.

Self-attention does exactly this. For each word, the model asks "which other words should I focus on to understand this one?" It hands out a budget of attention — a lot to the few words that matter, a little to the rest — and the budget always adds up to 100%. That is the whole idea you'll build piece by piece below.

1. Self-Attention: Query, Key, and Value

Every token (word piece) is turned into three vectors. A Query is what the token is looking for. A Key is what each token advertises about itself. A Value is the actual information a token carries.

To work out how much word A should attend to word B, you compare A's Query with B's Key. The comparison is a dot product — multiply the matching numbers and add them up. A bigger dot product means a better match, so more attention.

🔎 Think of an online search:

Your search box text is the Query
Each page's title/tags are its Key
The page content you actually read is the Value
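
As a rough sketch of where the three vectors come from: in a real model each token's embedding is passed through three learned linear layers. The matrices `W_q`, `W_k`, `W_v`, the helper `matvec`, and the embedding below are made-up values for illustration, not numbers from any trained model.

```python
# Sketch: one token embedding -> Query, Key, and Value vectors.
# W_q, W_k, W_v are hypothetical hand-picked matrices; a trained model learns them.

def matvec(matrix, vector):
    # Multiply a matrix (a list of rows) by a vector
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

embedding = [1.0, 2.0]            # a 2-dim embedding for one token

W_q = [[1.0, 0.0], [0.0, 1.0]]    # identity: the query equals the embedding
W_k = [[0.0, 1.0], [1.0, 0.0]]    # swaps the two components
W_v = [[2.0, 0.0], [0.0, 2.0]]    # doubles each component

query = matvec(W_q, embedding)
key = matvec(W_k, embedding)
value = matvec(W_v, embedding)

print("query:", query)   # [1.0, 2.0]
print("key:  ", key)     # [2.0, 1.0]
print("value:", value)   # [2.0, 4.0]
```

The same embedding yields three different vectors because each matrix plays a different role: one for asking, one for advertising, one for carrying content.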

2. Scaled Dot-Product Attention

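The raw dot product tends to grow as the vectors get longer (more dimensions). Large scores push softmax, the next step, toward putting almost all of its weight on one token, which leaves tiny gradients and makes training unstable. The fix is to divide each score by √d_k, the square root of the key vector's dimension d_k. That single division is why it's called scaled dot-product attention.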

Here it is with no libraries — just two small vectors and one score:

Worked Example: One Attention Score

Compute a scaled dot-product score between a query and a key

Python

# Scaled dot-product attention — by hand, no libraries
# We compare one query vector against one key vector.

import math

def dot(a, b):
    # Sum of element-wise products = the raw "match" score
    return sum(x * y for x, y in zip(a, b))

# A query (what a word is looking for) and a key (what a word offers)
query = [1.0, 0.0, 1.0]   # e.g. the word "it"
key   = [1.0, 1.0, 1.0]   # e.g. the word "cat"

raw_score = dot(query, key)        # 1*1 + 0*1 + 1*1 = 2.0
d_k = len(query)                 
...
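
The listing above is cut off after `d_k`. Here is a minimal sketch of how the remaining steps might look, assuming they divide the raw score by √d_k as the text describes. The name `scaled_score` and the print labels are assumptions, not the lesson's original code.

```python
import math

def dot(a, b):
    # Sum of element-wise products = the raw "match" score
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 0.0, 1.0]
key   = [1.0, 1.0, 1.0]

raw_score = dot(query, key)                  # 1*1 + 0*1 + 1*1 = 2.0
d_k = len(query)                             # 3, the vector dimension
scaled_score = raw_score / math.sqrt(d_k)    # 2.0 / 1.732...

print("Raw score:   ", raw_score)                # 2.0
print("Scaled score:", round(scaled_score, 4))   # 1.1547
```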

3. Softmax: Turning Scores Into Attention Weights

In a real model, each token produces one score for every token in the sequence, itself included. Softmax converts that list of scores into weights between 0 and 1 that add up to 1.0, your 100% attention budget. The highest score gets the biggest slice.

This worked example softmaxes three scores by hand:

Worked Example: Softmax Over 3 Scores

Turn three raw scores into attention weights that sum to 1.0

Python

# Softmax turns raw scores into attention weights that sum to 1.0
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical safety
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]      # each weight is between 0 and 1

# Three attention scores: "it" attends to cat / mat / sky
scores = [2.0, 1.0, 0.1]
weights = softmax(scores)

for label, w in zip(["cat", "mat", "sky"], weights):
    print(la
...
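
The listing above is cut off inside the print loop. The following is one possible way to finish it, assuming the loop prints each label with its weight. The output formatting is a guess.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical safety
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]      # each weight is between 0 and 1

scores = [2.0, 1.0, 0.1]
weights = softmax(scores)

for label, w in zip(["cat", "mat", "sky"], weights):
    print(f"{label}: {w:.3f}")            # cat: 0.659, mat: 0.242, sky: 0.099

print("Total:", round(sum(weights), 6))   # 1.0
```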

Key insight: attention output is just a weighted average of Values. Once you have the softmax weights, you multiply each token's Value by its weight and add them up — that sum is the new, context-aware vector for the token.
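
To make the weighted average concrete, here is a small sketch. The weights and Value vectors are arbitrary illustrative numbers, chosen so the arithmetic is easy to check by hand.

```python
# Attention output = weighted sum of Value vectors.
# The weights would normally come from softmax; these are fixed for illustration.
weights = [0.5, 0.3, 0.2]                        # must sum to 1.0
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]    # three 2-dim Value vectors

dim = len(values[0])
output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# First component:  0.5*1 + 0.3*0 + 0.2*2 = 0.9
# Second component: 0.5*0 + 0.3*1 + 0.2*2 = 0.7
print([round(x, 2) for x in output])   # [0.9, 0.7]
```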

🎯 Your Turn: Compute One Score

Fill in the blank so the program prints the dot-product score between the query and key.

Your Turn: Attention Score

Fill in the blank to compute a single attention score

Python

# 🎯 YOUR TURN — fill in the blanks marked with ___
# Compute a single attention SCORE between a query and a key.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [2.0, 0.0, 1.0]
key   = [1.0, 3.0, 1.0]

# 1) Use dot() to get the raw match score
raw = ___          # 👉 call dot(query, key)

# 2) Print the score
print("Score:", raw)

# ✅ Expected output:
# Score: 3.0

4. The Same Thing in PyTorch

In real models you don't loop in Python — you use matrix operations so the whole sequence is processed at once. The maths is identical: scores, softmax, weighted sum of Values. The @ symbol is matrix multiplication.

import torch
import torch.nn.functional as F

# Q, K, V for a 3-token sequence, each token a 4-dim vector
# (in a real model these come from learned linear layers)
torch.manual_seed(0)
Q = torch.randn(3, 4)
K = torch.randn(3, 4)
V = torch.randn(3, 4)

d_k = Q.size(-1)                         # 4
scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (3, 3) score matrix
weights = F.softmax(scores, dim=-1)      # each row sums to 1.0
output = weights @ V                     # context-aware vectors, shape (3, 4)

print("scores shape:", tuple(scores.shape))
print("weights row 0 sums to:", round(weights[0].sum().item(), 4))
print("output shape:", tuple(output.shape))

# Expected output:
# scores shape: (3, 3)
# weights row 0 sums to: 1.0
# output shape: (3, 4)

Note the shapes: a 3-token input produces a 3×3 score matrix (every token scored against every token), and the output keeps the original shape so blocks can stack.

5. Multi-Head Attention: Several Perspectives at Once

A single attention calculation produces just one set of weights per token, so it tends to capture one kind of relationship at a time. Multi-head attention runs several attention computations in parallel. Each "head" gets its own learned projection of the data, so one head might track grammar while another tracks meaning. Their outputs are concatenated and mixed back together by a final learned linear layer.

📖 Analogy:

It's like proofreading a sentence three times — once for spelling, once for grammar, once for tone. Same sentence, different lenses, richer understanding.
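
Here is a simplified sketch of the split, attend, concatenate pattern. It makes two loud simplifications: each head just takes a slice of the vector instead of a learned projection, and the final mixing layer is left out. The function names are illustrative only.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention over lists of vectors
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def multi_head(X, n_heads):
    d_model = len(X[0])
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads
    heads = []
    for h in range(n_heads):
        # Simplification: slice the vector; a real model uses learned projections
        part = [x[h * d_head:(h + 1) * d_head] for x in X]
        heads.append(attention(part, part, part))
    # Concatenate the heads back together, token by token
    return [sum((head[t] for head in heads), []) for t in range(len(X))]

X = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]     # 3 tokens, d_model = 4

out = multi_head(X, n_heads=2)
print(len(out), "tokens x", len(out[0]), "dims")   # 3 tokens x 4 dims
```

Each head works on only d_model / n_heads dimensions, so running several heads costs about the same as one full-width head.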

6. Positional Encoding: Giving Order to a Sequence

Attention treats the input as an unordered set. On its own it can't tell "dog bites man" from "man bites dog". Positional encoding fixes this by adding a unique pattern to each position before attention runs.

The original paper used sine and cosine waves of different frequencies, so each position gets a distinct fingerprint and nearby positions stay similar. Newer models often use learned or rotary positions, but the goal is the same: tell the model where each token sits.
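
The sine/cosine scheme can be sketched in a few lines. This follows the standard formulation, where even dimensions use sine and odd dimensions use cosine, with wavelengths set by the constant 10000. The function name and the tiny sizes are illustrative, and `d_model` is assumed to be even.

```python
import math

def positional_encoding(seq_len, d_model):
    # Returns one d_model-sized vector per position (d_model assumed even)
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))   # even dimension
            row.append(math.cos(angle))   # odd dimension
        pe.append(row)
    return pe

pe = positional_encoding(seq_len=3, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
# Position 0 is always [0.0, 1.0, 0.0, 1.0] because sin(0) = 0 and cos(0) = 1
```

These vectors are added to the token embeddings, so the same word at two different positions enters attention with two slightly different vectors.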

7. The Transformer Block: Putting It Together

A transformer block bolts four pieces together, and real models stack dozens of these blocks:

Multi-head self-attention — mix in context from other tokens.
Feed-forward network (FFN) — a small two-layer net applied to each position independently, to process what attention gathered.
Residual connections — add the block's input back to its output, so information and gradients flow cleanly through deep stacks.
Layer normalisation (LayerNorm) — rescale each vector to keep training stable.

Here's a complete block in PyTorch. Notice the input and output shapes are identical — that's what lets you stack blocks:

import torch
import torch.nn as nn

# One transformer block = attention + feed-forward,
# each wrapped in a residual connection and LayerNorm.
class TransformerBlock(nn.Module):
    def __init__(self, d_model=8, n_heads=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expand
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),  # project back
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.attn(x, x, x)        # self-attention: Q=K=V=x
        x = self.norm1(x + a)            # residual + layernorm
        f = self.ffn(x)
        x = self.norm2(x + f)            # residual + layernorm
        return x

block = TransformerBlock()
tokens = torch.randn(1, 5, 8)            # (batch=1, seq=5, d_model=8)
out = block(tokens)
print("input shape:", tuple(tokens.shape))
print("output shape:", tuple(out.shape))  # unchanged — blocks stack cleanly

# Expected output:
# input shape: (1, 5, 8)
# output shape: (1, 5, 8)
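
To see what one "residual + layernorm" step does to a single vector, here is a hand-rolled sketch. It leaves out the learnable scale and shift that `nn.LayerNorm` also applies, and the input numbers are made up.

```python
import math

def layer_norm(x, eps=1e-5):
    # Rescale one vector to roughly zero mean and unit variance
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

x = [1.0, 2.0, 3.0, 4.0]                # the sub-layer's input
sublayer_out = [0.5, -0.5, 0.5, -0.5]   # pretend output of attention or the FFN

residual = [a + b for a, b in zip(x, sublayer_out)]   # [1.5, 1.5, 3.5, 3.5]
normed = layer_norm(residual)

print([round(v, 3) for v in normed])    # [-1.0, -1.0, 1.0, 1.0]
```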

🎯 Your Turn: Finish the Softmax

Fill in the two blanks so the program prints a total of 1.0, rounded to one decimal place.

Your Turn: Softmax Weights

Complete the softmax so the weights sum to 1.0

Python

# 🎯 YOUR TURN — fill in the blanks marked with ___
# Turn three scores into attention WEIGHTS that sum to 1.0
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = ___        # 👉 sum the exps list with sum(exps)
    return [e / total for e in exps]

scores = [3.0, 1.0, 0.0]
weights = softmax(scores)

# 👉 print the rounded total of all weights (should be 1.0)
print("Total:", round(sum(weights), ___))   # 👉 round to 1 decimal place

# ✅ 
...

🔀 Encoder vs Decoder (BERT vs GPT)

GPT and BERT use the same building blocks but wire attention differently:

Encoder (BERT)

Reads the whole sequence at once and looks both left and right (bidirectional). Great for understanding: classification, search, filling blanks.

Decoder (GPT)

Generates one token at a time and may only look at earlier tokens (masked/causal attention). Great for generating text.

Translation models often use both: an encoder to read the source and a decoder to write the translation.
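
The "may only look at earlier tokens" rule is usually enforced with a causal mask: scores for future positions are set to negative infinity before softmax, so their weights come out as exactly zero. Here is a minimal sketch with a made-up 3×3 score matrix.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Row i holds token i's raw scores against tokens 0, 1, 2 (illustrative values)
scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]

NEG_INF = float("-inf")
# Causal mask: token i may attend to position j only when j <= i
masked = [[s if j <= i else NEG_INF for j, s in enumerate(row)]
          for i, row in enumerate(scores)]

weights = [softmax(row) for row in masked]
for row in weights:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.269, 0.731, 0.0]
# [0.09, 0.245, 0.665]
```

An encoder simply skips the mask, so every row would look like the last one.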

9. Real Models in 2 Lines: Hugging Face

You almost never train a transformer from scratch. Pre-trained models from Hugging Face give you a working LLM instantly — a decoder generates, an encoder understands:

from transformers import pipeline

# A decoder-only model (GPT-2) GENERATES text...
gen = pipeline("text-generation", model="gpt2")
print(gen("The transformer architecture", max_new_tokens=8)[0]["generated_text"])

# ...while an encoder-only model (BERT) UNDERSTANDS text by filling blanks
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("Paris is the [MASK] of France.")[0]["token_str"])

# Expected output (wording varies — models are probabilistic):
# The transformer architecture is a powerful tool for ...
# capital

Fine-tuning a pre-trained model beats training from scratch in almost every real project.

Common Errors (And How to Fix Them)

❌ RuntimeError: mat1 and mat2 shapes cannot be multiplied

The last dimensions of Q and K don't match, or K wasn't transposed, so the score matrix can't be formed.

✅ Fix: make sure Q and K share the last dimension, and transpose K:

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5

❌ Rows of the weight matrix don't sum to 1.0

You applied softmax along the wrong axis, so each column was normalised instead of each row.

✅ Fix: softmax over the last dimension (each row of scores):

weights = F.softmax(scores, dim=-1)   # not dim=0

❌ nan values after softmax

Scores grew huge and exp() overflowed.

✅ Fix: subtract the max before exp, and divide the scores by √d_k:

exps = [math.exp(s - max(scores)) for s in scores]
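
Here is a small sketch of why the max-subtraction trick matters. In plain Python, `math.exp` raises `OverflowError` on huge inputs. Array libraries that follow IEEE floating point return `inf` instead, and `inf / inf` is where the `nan` comes from. The function names are illustrative.

```python
import math

def naive_softmax(scores):
    exps = [math.exp(s) for s in scores]        # overflows for large scores
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]    # largest exponent is now 0
    total = sum(exps)
    return [e / total for e in exps]

big_scores = [1000.0, 999.0, 998.0]

try:
    naive_softmax(big_scores)
except OverflowError as err:
    print("naive softmax failed:", err)

# Shifting every score by the same amount does not change the weights
print([round(w, 3) for w in stable_softmax(big_scores)])   # [0.665, 0.245, 0.09]
```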

❌ AssertionError: embed_dim must be divisible by num_heads

Multi-head attention splits d_model evenly across heads.

✅ Fix: pick heads that divide the model size (e.g. d_model=8, n_heads=2 or 4).

📋 Quick Reference

Concept	Purpose	Key Idea
Query / Key / Value	Compare and retrieve	Query matches Keys to pull in Values
Dot-product score	Measure match	Multiply matching numbers and add
Scaling (÷√d_k)	Keep softmax stable	Keeps large scores from saturating softmax
Softmax	Make weights	Scores → weights summing to 1.0
Multi-head	Multiple lenses	Parallel heads, then concatenate
Positional encoding	Add word order	Unique pattern per position
Transformer block	One layer	Attention + FFN + residual + LayerNorm
Encoder vs Decoder	Understand vs generate	BERT bidirectional, GPT causal

❓ Frequently Asked Questions

Q: What is self-attention in a transformer?

A: Self-attention lets every token in a sequence look at every other token and decide how much each one matters. Each token produces a Query, a Key, and a Value; the dot-product of a Query with each Key gives scores, softmax turns them into weights, and the weighted sum of Values becomes the new, context-aware representation of that token.

Q: What do Query, Key, and Value mean?

A: Think of a search: the Query is what a token is looking for, the Key is what each token advertises about itself, and the Value is the information a token carries. Matching a Query against all Keys decides how strongly to pull in each token's Value.

Q: Why divide the attention scores by the square root of d_k?

A: As vectors get longer, raw dot products grow large, which pushes softmax into tiny gradients and unstable training. Dividing by the square root of the key dimension (d_k) keeps the scores in a sensible range so softmax stays well-behaved. That is the 'scaled' in scaled dot-product attention.
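
A quick way to see the effect is to compare softmax on raw and scaled scores for longer vectors. The 64-dim vectors below are contrived so the numbers are easy to follow.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_k = 64
query = [1.0] * d_k
keys = [[1.0] * d_k, [0.5] * d_k]             # a strong match and a weaker one

raw = [dot(query, k) for k in keys]           # [64.0, 32.0]
scaled = [s / math.sqrt(d_k) for s in raw]    # [8.0, 4.0]

print("unscaled:", softmax(raw))     # second weight is about 1e-14
print("scaled:  ", [round(w, 3) for w in softmax(scaled)])   # [0.982, 0.018]
```

Without scaling, the weaker key is effectively erased. With scaling it keeps roughly 2% of the attention, so some gradient still flows through it.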

Q: What is the point of multi-head attention?

A: A single attention head produces just one set of weights per token, so it tends to capture one kind of relationship at a time. Multi-head attention runs several attention computations in parallel on different learned projections, so one head might track grammar while another tracks meaning. Their outputs are concatenated and combined, giving the model multiple perspectives at once.

Q: Why do transformers need positional encoding?

A: Attention treats the input as a set, so on its own it has no idea of word order — 'dog bites man' would look the same as 'man bites dog'. Positional encoding adds a unique pattern to each position (the original paper used sine and cosine waves) so the model can tell where each token sits in the sequence.

Q: What is the difference between an encoder and a decoder?

A: An encoder reads the whole sequence at once and looks both left and right (bidirectional) to build understanding — that is how BERT works. A decoder generates one token at a time and may only look at earlier tokens (masked/causal attention) — that is how GPT works. Many translation models use both: an encoder to read and a decoder to write.

🎯 Mini Challenge: One-Query Attention Output

Put the whole pipeline together: softmax three scores into weights, then take the weighted sum of three Value vectors to produce one context vector. The starter has only a comment outline — you write the logic.

Mini Challenge: Attention Output

Combine softmax weights with value vectors into one context vector

Python

# 🎯 MINI-CHALLENGE: one-query attention output
# Goal: weight three VALUE vectors by attention weights, then add them up.
#
# 1. scores = [2.0, 1.0, 0.0]  (already given below)
# 2. Write softmax(scores) -> weights (3 numbers that sum to 1.0)
# 3. values = [[1, 0], [0, 1], [1, 1]]  (three 2-dim value vectors)
# 4. output = weighted sum: weights[0]*values[0] + weights[1]*values[1] + ...
# 5. print(output)  — a single 2-number context vector
#
# ✅ Expected (approx): [0.76, 0.33]

import math
scor
...

🎉 Lesson Complete!

You can now explain self-attention with the Query/Key/Value intuition, hand-compute a scaled dot-product score, softmax scores into attention weights, and describe multi-head attention, positional encoding, the transformer block, and how encoders (BERT) differ from decoders (GPT). That's the core of nearly every modern large language model.

🚀 Up next: Reinforcement Learning — how AI agents learn from trial and error.
