    Lesson 12 • Intermediate

    Transformers & Large Language Models

    By the end of this lesson you'll be able to explain, and hand-compute, the self-attention mechanism that powers ChatGPT, BERT, and most modern language models.

    What You'll Learn in This Lesson

    • What self-attention is, using the Query / Key / Value intuition
    • How to compute a scaled dot-product attention score by hand
    • Why softmax turns raw scores into weights that sum to 1.0
    • What multi-head attention adds, and why several heads help
    • Why positional encoding gives a sequence its word order
    • How a transformer block stacks attention, a feed-forward net, residuals, and layernorm — and how encoders differ from decoders

    🗣️ Real-World Analogy: Focusing Attention in a Conversation

    Picture yourself at a noisy dinner table. Someone says "can you pass it?" To understand "it", you don't weigh every word equally — you instinctively focus on the salt someone mentioned a moment ago and mostly ignore the unrelated chatter.

    Self-attention does exactly this. For each word, the model asks "which other words should I focus on to understand this one?" It hands out a budget of attention — a lot to the few words that matter, a little to the rest — and the budget always adds up to 100%. That is the whole idea you'll build piece by piece below.

    1. Self-Attention: Query, Key, and Value

    Every token (word piece) is turned into three vectors. A Query is what the token is looking for. A Key is what each token advertises about itself. A Value is the actual information a token carries.

    To work out how much word A should attend to word B, you compare A's Query with B's Key. The comparison is a dot product — multiply the matching numbers and add them up. A bigger dot product means a better match, so more attention.

    🔎 Think of an online search:

    • Your search box text is the Query
    • Each page's title/tags are its Key
    • The page content you actually read is the Value
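
    To make the three roles concrete, here is a minimal sketch of how one token embedding could be turned into a Query, a Key, and a Value. The embedding and the three projection matrices (W_q, W_k, W_v) are made-up numbers chosen so the arithmetic is easy to follow; in a real model the matrices are learned.

```python
# Hypothetical 2-dim token embedding and 2x2 projection matrices.
# In a real model these matrices are learned; the numbers here are made up.

def matvec(matrix, vector):
    # Multiply a matrix (a list of rows) by a vector
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

embedding = [1.0, 2.0]            # one token's embedding

W_q = [[1.0, 0.0], [0.0, 1.0]]    # query projection (identity, for readability)
W_k = [[0.0, 1.0], [1.0, 0.0]]    # key projection (swaps the two entries)
W_v = [[2.0, 0.0], [0.0, 3.0]]    # value projection (rescales each entry)

query = matvec(W_q, embedding)    # [1.0, 2.0]
key   = matvec(W_k, embedding)    # [2.0, 1.0]
value = matvec(W_v, embedding)    # [2.0, 6.0]

print("query:", query)
print("key:  ", key)
print("value:", value)
```

    The same embedding produces three different vectors because each projection plays a different role: one for searching, one for being found, one for the content that gets passed along.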

    2. Scaled Dot-Product Attention

    The raw dot product tends to grow as vectors get more dimensions, and large scores push the next step (softmax) toward near one-hot weights with tiny gradients. The fix is to divide each score by √d_k, where d_k is the number of dimensions in each key vector. That single division is why it's called scaled dot-product attention.

    Here it is with no libraries — just two small vectors and one score:

    Worked Example: One Attention Score

    Compute a scaled dot-product score between a query and a key

    Python
    # Scaled dot-product attention — by hand, no libraries
    # We compare one query vector against one key vector.
    
    import math
    
    def dot(a, b):
        # Sum of element-wise products = the raw "match" score
        return sum(x * y for x, y in zip(a, b))
    
    # A query (what a word is looking for) and a key (what a word offers)
    query = [1.0, 0.0, 1.0]   # e.g. the word "it"
    key   = [1.0, 1.0, 1.0]   # e.g. the word "cat"
    
    raw_score = dot(query, key)        # 1*1 + 0*1 + 1*1 = 2.0
    d_k = len(query)                 
    ...
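
    If you want to see why the division by √d_k helps, the sketch below measures how spread out the scores are for random vectors of different sizes. It assumes the query and key entries are independent random numbers with mean 0 and variance 1, which is a simplification; score_spread is a helper name invented for this illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score_spread(d_k, trials=1000):
    # Standard deviation of raw scores between random vectors,
    # and the same spread after dividing by sqrt(d_k)
    raw = []
    for _ in range(trials):
        q = [random.gauss(0, 1) for _ in range(d_k)]
        k = [random.gauss(0, 1) for _ in range(d_k)]
        raw.append(dot(q, k))
    mean = sum(raw) / trials
    raw_std = math.sqrt(sum((r - mean) ** 2 for r in raw) / trials)
    return raw_std, raw_std / math.sqrt(d_k)

results = {d_k: score_spread(d_k) for d_k in (4, 64, 256)}
for d_k, (raw_std, scaled_std) in results.items():
    print(f"d_k={d_k:3d}  raw spread={raw_std:6.2f}  scaled spread={scaled_std:4.2f}")
```

    Under these assumptions the raw spread grows roughly like √d_k, while the scaled spread stays close to 1 whatever the vector size.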

    3. Softmax: Turning Scores Into Attention Weights

    A real token produces one score for every token in the sequence, itself included. Softmax converts that list of scores into weights between 0 and 1 that add up to 1.0 (your 100% attention budget). The highest score gets the biggest slice.

    This worked example softmaxes three scores by hand:

    Worked Example: Softmax Over 3 Scores

    Turn three raw scores into attention weights that sum to 1.0

    Python
    # Softmax turns raw scores into attention weights that sum to 1.0
    import math
    
    def softmax(scores):
        m = max(scores)                       # subtract the max for numerical safety
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]      # each weight is between 0 and 1
    
    # Three attention scores: "it" attends to cat / mat / sky
    scores = [2.0, 1.0, 0.1]
    weights = softmax(scores)
    
    for label, w in zip(["cat", "mat", "sky"], weights):
        print(la
    ...
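
    The "subtract the max" line in softmax can look like it should change the answer. The sketch below suggests it does not: softmax depends only on the gaps between scores, so shifting every score by the same amount leaves the weights unchanged. The shifted list is made up for this check.

```python
import math

def softmax(scores):
    m = max(scores)                           # subtract the max for numerical safety
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]
shifted = [s + 100.0 for s in scores]         # same gaps, much larger values

weights = softmax(scores)
weights_shifted = softmax(shifted)

print([round(w, 3) for w in weights])          # [0.659, 0.242, 0.099]
print([round(w, 3) for w in weights_shifted])  # [0.659, 0.242, 0.099]
print(round(sum(weights), 6))                  # 1.0
```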

    🎯 Your Turn: Compute One Score

    Fill in the blank so the program prints the dot-product score between the query and key.

    Your Turn: Attention Score

    Fill in the blank to compute a single attention score

    Python
    # 🎯 YOUR TURN — fill in the blanks marked with ___
    # Compute a single attention SCORE between a query and a key.
    
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    
    query = [2.0, 0.0, 1.0]
    key   = [1.0, 3.0, 1.0]
    
    # 1) Use dot() to get the raw match score
    raw = ___          # 👉 call dot(query, key)
    
    # 2) Print the score
    print("Score:", raw)
    
    # ✅ Expected output:
    # Score: 3.0

    4. The Same Thing in PyTorch

    In real models you don't loop in Python — you use matrix operations so the whole sequence is processed at once. The maths is identical: scores, softmax, weighted sum of Values. The @ symbol is matrix multiplication.

    import torch
    import torch.nn.functional as F
    
    # Q, K, V for a 3-token sequence, each token a 4-dim vector
    # (in a real model these come from learned linear layers)
    torch.manual_seed(0)
    Q = torch.randn(3, 4)
    K = torch.randn(3, 4)
    V = torch.randn(3, 4)
    
    d_k = Q.size(-1)                         # 4
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (3, 3) score matrix
    weights = F.softmax(scores, dim=-1)      # each row sums to 1.0
    output = weights @ V                     # context-aware vectors, shape (3, 4)
    
    print("scores shape:", tuple(scores.shape))
    print("weights row 0 sums to:", round(weights[0].sum().item(), 4))
    print("output shape:", tuple(output.shape))
    
    # Expected output:
    # scores shape: (3, 3)
    # weights row 0 sums to: 1.0
    # output shape: (3, 4)

    Note the shapes: a 3-token input produces a 3×3 score matrix (every token scored against every token), and the output keeps the original shape so blocks can stack.
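
    If PyTorch isn't installed, the same shape bookkeeping can be followed with plain lists. This is a minimal sketch of whole-sequence attention using loops instead of matrix operations; the Q, K, and V numbers are made up, and attention is a helper name chosen for this illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    output = []
    for q in Q:                                   # one row of scores per query
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)                 # this row sums to 1.0
        # weighted sum of the value vectors, one output column at a time
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# Made-up 3-token sequence with 2-dim vectors
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out = attention(Q, K, V)
print(len(out), len(out[0]))   # 3 2
```

    Each output row is a weighted average of the rows of V, so every output value lies between the smallest and largest value in the matching column of V.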

    5. Multi-Head Attention: Several Perspectives at Once

    One attention calculation produces a single set of weights per token, so it tends to capture one kind of relationship at a time. Multi-head attention runs several attention computations in parallel: each "head" gets its own learned projection of the data, so one head might track grammar while another tracks meaning. Their outputs are concatenated and mixed back together by a final learned linear layer.

    📖 Analogy:

    It's like proofreading a sentence three times — once for spelling, once for grammar, once for tone. Same sentence, different lenses, richer understanding.
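
    Here is a rough sketch of the split-and-concatenate mechanics for a single token. It is simplified on purpose: the learned per-head projections and the final mixing layer are left out, and each head just works on its own slice of the vector. The token values and the helper names (attend, split_heads) are made up for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Scaled dot-product attention for a single query
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def split_heads(vector, n_heads):
    # Cut one d_model-sized vector into n_heads equal slices
    assert len(vector) % n_heads == 0, "d_model must be divisible by n_heads"
    size = len(vector) // n_heads
    return [vector[i * size:(i + 1) * size] for i in range(n_heads)]

# Made-up 3-token sequence, d_model = 4, used as Q, K and V alike
tokens = [[1.0, 0.0, 0.0, 1.0],
          [0.0, 1.0, 1.0, 0.0],
          [1.0, 1.0, 0.0, 0.0]]
n_heads = 2

per_token = [split_heads(t, n_heads) for t in tokens]   # per_token[token][head]

output = []
for h in range(n_heads):
    q = per_token[0][h]                                  # head h's slice of token 0
    keys = [per_token[t][h] for t in range(len(tokens))]
    output += attend(q, keys, keys)                      # concatenate head outputs

print(len(output))   # 4: same width as the input vector
```

    Because each head sees different numbers, each one ends up with its own attention weights, and concatenating the slices restores the original width.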

    6. Positional Encoding: Giving Order to a Sequence

    Attention treats the input as an unordered set. On its own it can't tell "dog bites man" from "man bites dog". Positional encoding fixes this by adding a unique pattern to each position before attention runs.

    The original paper used sine and cosine waves of different frequencies, so each position gets a distinct fingerprint and nearby positions stay similar. Newer models often use learned or rotary positions, but the goal is the same: tell the model where each token sits.
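
    The sine and cosine scheme can be sketched in a few lines. Even indices use sine, odd indices use cosine, and the wave gets slower as the index grows. The embedding values and d_model = 4 below are made up for illustration.

```python
import math

def positional_encoding(position, d_model):
    # Even indices use sine, odd indices use cosine.
    # Each sine/cosine pair shares one frequency, which falls as the index grows.
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

d_model = 4
for pos in range(3):
    print(pos, [round(x, 3) for x in positional_encoding(pos, d_model)])

# The pattern is added to the token embedding before attention runs
embedding = [0.5, 0.5, 0.5, 0.5]                  # made-up token embedding
with_position = [e + p for e, p in zip(embedding, positional_encoding(2, d_model))]
print([round(x, 3) for x in with_position])
```

    Position 0 always encodes to alternating 0s and 1s, since sin(0) = 0 and cos(0) = 1.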

    7. The Transformer Block: Putting It Together

    A transformer block bolts four pieces together, and real models stack dozens of these blocks:

    • Multi-head self-attention — mix in context from other tokens.
    • Feed-forward network (FFN) — a small two-layer net applied to each position independently, to process what attention gathered.
    • Residual connections — add the block's input back to its output, so information and gradients flow cleanly through deep stacks.
    • Layer normalisation (LayerNorm) — rescale each vector to keep training stable.

    Here's a complete block in PyTorch. Notice the input and output shapes are identical — that's what lets you stack blocks:

    import torch
    import torch.nn as nn
    
    # One transformer block = attention + feed-forward,
    # each wrapped in a residual connection and LayerNorm.
    class TransformerBlock(nn.Module):
        def __init__(self, d_model=8, n_heads=2):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),  # expand
                nn.ReLU(),
                nn.Linear(4 * d_model, d_model),  # project back
            )
            self.norm2 = nn.LayerNorm(d_model)
    
        def forward(self, x):
            a, _ = self.attn(x, x, x)        # self-attention: Q=K=V=x
            x = self.norm1(x + a)            # residual + layernorm
            f = self.ffn(x)
            x = self.norm2(x + f)            # residual + layernorm
            return x
    
    block = TransformerBlock()
    tokens = torch.randn(1, 5, 8)            # (batch=1, seq=5, d_model=8)
    out = block(tokens)
    print("input shape:", tuple(tokens.shape))
    print("output shape:", tuple(out.shape))  # unchanged — blocks stack cleanly
    
    # Expected output:
    # input shape: (1, 5, 8)
    # output shape: (1, 5, 8)
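
    To see what "residual + layernorm" does to the numbers, here is a hand-rolled sketch for a single token vector. It leaves out the learned scale and shift that nn.LayerNorm also applies, and the input and sublayer output are made-up values.

```python
import math

def layer_norm(vector, eps=1e-5):
    # Rescale one vector to mean 0 and (almost exactly) variance 1
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

x = [1.0, 2.0, 3.0, 4.0]                 # block input for one token
sublayer_out = [0.5, -0.5, 1.0, 0.0]     # made-up attention or FFN output

residual = [a + b for a, b in zip(x, sublayer_out)]   # x + sublayer(x)
normed = layer_norm(residual)

print("residual:", residual)                        # [1.5, 1.5, 4.0, 4.0]
print("normed:  ", [round(v, 3) for v in normed])   # [-1.0, -1.0, 1.0, 1.0]
```

    The residual step keeps the original input in the sum, and the normalisation step brings the result back to a predictable scale before the next sublayer sees it.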

    🎯 Your Turn: Finish the Softmax

    Fill in the two blanks so the weights sum to exactly 1.0 when rounded to one decimal place.

    Your Turn: Softmax Weights

    Complete the softmax so the weights sum to 1.0

    Python
    # 🎯 YOUR TURN — fill in the blanks marked with ___
    # Turn three scores into attention WEIGHTS that sum to 1.0
    import math
    
    def softmax(scores):
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = ___        # 👉 sum the exps list with sum(exps)
        return [e / total for e in exps]
    
    scores = [3.0, 1.0, 0.0]
    weights = softmax(scores)
    
    # 👉 print the rounded total of all weights (should be 1.0)
    print("Total:", round(sum(weights), ___))   # 👉 round to 1 decimal place
    
    # ✅ 
    ...

    🔀 Encoder vs Decoder (BERT vs GPT)

    GPT and BERT use the same building blocks but wire attention differently:

    Encoder (BERT)

    Reads the whole sequence at once and looks both left and right (bidirectional). Great for understanding: classification, search, filling blanks.

    Decoder (GPT)

    Generates one token at a time and may only look at earlier tokens (masked/causal attention). Great for generating text.

    Translation models often use both: an encoder to read the source and a decoder to write the translation.
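
    The "may only look at earlier tokens" rule is usually enforced with a mask. This minimal sketch sets the scores for later positions to negative infinity before softmax, so they get a weight of exactly 0. The score matrix is made up, and causal_weights is a helper name chosen for this illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_weights(score_matrix):
    # Row i may only attend to columns 0..i; later columns are masked out
    weights = []
    for i, row in enumerate(score_matrix):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        weights.append(softmax(masked))
    return weights

# Made-up scores: every token "likes" the last token most
scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]

for row in causal_weights(scores):
    print([round(w, 3) for w in row])

# [1.0, 0.0, 0.0]
# [0.269, 0.731, 0.0]
# [0.09, 0.245, 0.665]
```

    An encoder would skip the mask and use the full matrix, which is the only difference between the two attention patterns in this sketch.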

    9. Real Models in 2 Lines: Hugging Face

    You almost never train a transformer from scratch. Pre-trained models from Hugging Face give you a working LLM instantly — a decoder generates, an encoder understands:

    from transformers import pipeline
    
    # A decoder-only model (GPT-2) GENERATES text...
    gen = pipeline("text-generation", model="gpt2")
    print(gen("The transformer architecture", max_new_tokens=8)[0]["generated_text"])
    
    # ...while an encoder-only model (BERT) UNDERSTANDS text by filling blanks
    fill = pipeline("fill-mask", model="bert-base-uncased")
    print(fill("Paris is the [MASK] of France.")[0]["token_str"])
    
    # Expected output (wording varies — models are probabilistic):
    # The transformer architecture is a powerful tool for ...
    # capital

    Fine-tuning a pre-trained model beats training from scratch in almost every real project.

    Common Errors (And How to Fix Them)

    ❌ RuntimeError: mat1 and mat2 shapes cannot be multiplied

    Your Query and Key have mismatched last dimensions, or K was not transposed, so the score matrix can't be formed.

    ✅ Fix: make sure Q and K share the last dimension, and transpose K:

    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5

    ❌ Weights don't sum to 1.0

    You applied softmax along the wrong axis, so it normalised across the wrong tokens.

    ✅ Fix: softmax over the last dimension (each row of scores):

    weights = F.softmax(scores, dim=-1)   # not dim=0

    ❌ nan values after softmax

    Scores grew huge and exp() overflowed.

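    ✅ Fix: divide the scores by √d_k, and if you wrote your own softmax, subtract the max before exp (F.softmax already does this internally):

    m = max(scores)                            # compute the max once, not per element
    exps = [math.exp(s - m) for s in scores]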
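
    A quick way to see the overflow is to compare a naive softmax with the max-subtracting version on deliberately huge scores. Note one difference from tensor libraries: plain Python's math.exp raises OverflowError, where a hand-written tensor softmax would typically produce inf and then nan. The function names and scores below are made up for this sketch.

```python
import math

def naive_softmax(scores):
    exps = [math.exp(s) for s in scores]       # overflows for large s
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(scores):
    m = max(scores)                            # computed once
    exps = [math.exp(s - m) for s in scores]   # largest term is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

big_scores = [1000.0, 999.0, 998.0]

try:
    naive_softmax(big_scores)
    naive_ok = True
except OverflowError:
    naive_ok = False        # math.exp(1000.0) is too large for a float

weights = stable_softmax(big_scores)
print(naive_ok)                          # False
print([round(w, 3) for w in weights])    # [0.665, 0.245, 0.09]
```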

    ❌ AssertionError: embed_dim must be divisible by num_heads

    Multi-head attention splits d_model evenly across heads.

    ✅ Fix: pick heads that divide the model size (e.g. d_model=8, n_heads=2 or 4).

    📋 Quick Reference

    Concept             | Purpose                | Key Idea
    ------------------- | ---------------------- | --------------------------------------
    Query / Key / Value | Compare and retrieve   | Query matches Keys to pull in Values
    Dot-product score   | Measure match          | Multiply matching numbers and add
    Scaling (÷√d_k)     | Keep softmax stable    | Stops large scores from saturating softmax
    Softmax             | Make weights           | Scores → weights summing to 1.0
    Multi-head          | Multiple lenses        | Parallel heads, then concatenate
    Positional encoding | Add word order         | Unique pattern per position
    Transformer block   | One layer              | Attention + FFN + residual + LayerNorm
    Encoder vs Decoder  | Understand vs generate | BERT bidirectional, GPT causal

    ❓ Frequently Asked Questions

    Q: What is self-attention in a transformer?

    A: Self-attention lets every token in a sequence look at every other token and decide how much each one matters. Each token produces a Query, a Key, and a Value; the dot-product of a Query with each Key gives scores, softmax turns them into weights, and the weighted sum of Values becomes the new, context-aware representation of that token.

    Q: What do Query, Key, and Value mean?

    A: Think of a search: the Query is what a token is looking for, the Key is what each token advertises about itself, and the Value is the information a token carries. Matching a Query against all Keys decides how strongly to pull in each token's Value.

    Q: Why divide the attention scores by the square root of d_k?

    A: As vectors get longer, raw dot products grow large, which pushes softmax into tiny gradients and unstable training. Dividing by the square root of the key dimension (d_k) keeps the scores in a sensible range so softmax stays well-behaved. That is the 'scaled' in scaled dot-product attention.

    Q: What is the point of multi-head attention?

    A: A single attention head produces one set of weights per token, so it tends to capture one kind of relationship at a time. Multi-head attention runs several attention computations in parallel on different learned projections, so one head might track grammar while another tracks meaning. Their outputs are concatenated and combined, giving the model multiple perspectives at once.

    Q: Why do transformers need positional encoding?

    A: Attention treats the input as a set, so on its own it has no idea of word order — 'dog bites man' would look the same as 'man bites dog'. Positional encoding adds a unique pattern to each position (the original paper used sine and cosine waves) so the model can tell where each token sits in the sequence.

    Q: What is the difference between an encoder and a decoder?

    A: An encoder reads the whole sequence at once and looks both left and right (bidirectional) to build understanding — that is how BERT works. A decoder generates one token at a time and may only look at earlier tokens (masked/causal attention) — that is how GPT works. Many translation models use both: an encoder to read and a decoder to write.

    🎯 Mini Challenge: One-Query Attention Output

    Put the whole pipeline together: softmax three scores into weights, then take the weighted sum of three Value vectors to produce one context vector. The starter has only a comment outline — you write the logic.

    Mini Challenge: Attention Output

    Combine softmax weights with value vectors into one context vector

    Python
    # 🎯 MINI-CHALLENGE: one-query attention output
    # Goal: weight three VALUE vectors by attention weights, then add them up.
    #
    # 1. scores = [2.0, 1.0, 0.0]  (already given below)
    # 2. Write softmax(scores) -> weights (3 numbers that sum to 1.0)
    # 3. values = [[1, 0], [0, 1], [1, 1]]  (three 2-dim value vectors)
    # 4. output = weighted sum: weights[0]*values[0] + weights[1]*values[1] + ...
    # 5. print(output)  — a single 2-number context vector
    #
    # ✅ Expected (approx): [0.76, 0.33]
    
    import math
    scor
    ...

    🎉 Lesson Complete!

    You can now explain self-attention with the Query/Key/Value intuition, hand-compute a scaled dot-product score, softmax scores into attention weights, and describe multi-head attention, positional encoding, the transformer block, and how encoders (BERT) differ from decoders (GPT). That's the core of most modern language models.

    🚀 Up next: Reinforcement Learning — how AI agents learn from trial and error.
