Lesson 12 • Intermediate

    Transformers & Large Language Models

    Master the architecture behind ChatGPT, BERT, and every modern AI breakthrough — self-attention, positional encoding, and autoregressive generation.

    ✅ What You'll Learn

    • Self-attention: how words "look at" other words
    • Positional encoding: giving order to sequences
    • Multi-head attention: multiple perspectives at once
    • How LLMs generate text token-by-token

    🤖 The Transformer Revolution

    🎯 Real-World Analogy: Imagine reading a mystery novel. When you reach "the detective found the weapon at the bank", your brain instantly checks: was this story about a river or a robbery? You look back at the entire context to understand the word. That's exactly what self-attention does — every word looks at every other word to understand meaning.

    Before Transformers (2017), models read text one word at a time (RNNs). Transformers read the entire sequence at once, making them massively faster and better at capturing long-range context. This single paper — "Attention Is All You Need" — launched GPT, BERT, and the entire modern AI era.

    🏗️ Transformer Architecture

    • Input Embedding → Convert tokens to vectors
    • Positional Encoding → Add position information
    • Multi-Head Attention → Attend to context from multiple perspectives
    • Feed-Forward Network → Process each position independently
    • Layer Norm + Residuals → Stabilise deep training
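    The five stages above can be sketched end-to-end in a few lines of NumPy. This is a toy illustration only — it uses random feed-forward weights and no learned attention projections — but it shows how attention, the feed-forward network, residual connections, and layer norm fit together in one block:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def transformer_block(x, rng):
    d = x.shape[-1]
    # 1) Self-attention (no learned projections: x serves as Q, K, and V)
    attn = softmax(x @ x.T / np.sqrt(d)) @ x
    x = layer_norm(x + attn)                     # residual + layer norm
    # 2) Position-wise feed-forward (random weights, illustration only)
    W1 = rng.standard_normal((d, 4 * d)) * 0.1
    W2 = rng.standard_normal((4 * d, d)) * 0.1
    ffn = np.maximum(0.0, x @ W1) @ W2           # ReLU MLP
    return layer_norm(x + ffn)                   # residual + layer norm

x = np.random.default_rng(1).standard_normal((5, 8))  # 5 tokens, d_model=8
out = transformer_block(x, np.random.default_rng(2))
print(out.shape)  # (5, 8): shape in == shape out, so blocks can be stacked
```

    Because each block maps a (seq_len, d_model) array back to the same shape, real models simply stack dozens of these blocks on top of each other.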

    Try It: Self-Attention

    See how words attend to each other — the core of Transformers

    Python
    import numpy as np
    
    # Self-Attention: The core mechanism behind Transformers
    # "Which other words should I pay attention to?"
    
    def softmax(x):
        exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
    
    def self_attention(Q, K, V):
        """Scaled dot-product attention"""
        d_k = Q.shape[-1]
        # Step 1: Compute similarity scores
        scores = Q @ K.T / np.sqrt(d_k)
        # Step 2: Convert to probabilities
        weights = softmax(scores)
        # Step 3: Take a weighted sum of the values
        return weights @ V, weights
    
    # Demo: in self-attention, Q, K, and V all come from the same input X
    X = np.random.default_rng(0).standard_normal((3, 4))  # 3 tokens, 4 dims
    output, weights = self_attention(X, X, X)
    print(weights.round(2))  # each row sums to 1

    Try It: Positional Encoding

    Give Transformers a sense of word order using sinusoidal patterns

    Python
    import numpy as np
    
    # Positional Encoding: Giving Transformers a sense of word ORDER
    # Without this, "dog bites man" = "man bites dog" (bad!)
    
    def positional_encoding(seq_len, d_model):
        """Sinusoidal positional encoding (from 'Attention Is All You Need')"""
        pos = np.arange(seq_len)[:, np.newaxis]  # (seq_len, 1)
        dim = np.arange(d_model)[np.newaxis, :]  # (1, d_model)
        
        # Each dimension pair gets its own wavelength
        angles = pos / (10000 ** (2 * (dim // 2) / d_model))
        # Alternate between sin (even dims) and cos (odd dims)
        encoding = np.zeros((seq_len, d_model))
        encoding[:, 0::2] = np.sin(angles[:, 0::2])
        encoding[:, 1::2] = np.cos(angles[:, 1::2])
        return encoding
    
    pe = positional_encoding(seq_len=10, d_model=16)
    print(pe.shape)  # (10, 16): one unique pattern per position

    Try It: Multi-Head Attention

    Multiple attention heads capture different relationship patterns

    Python
    import numpy as np
    
    # Multi-Head Attention: Looking at text from MULTIPLE perspectives
    # Like reading a sentence for grammar, meaning, and emotion simultaneously
    
    def softmax(x):
        exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
    
    def single_head_attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        weights = softmax(scores)
        return weights @ V, weights
    
    def multi_head_attention(X, num_heads, d_model):
        """Simplified: slice the embedding into heads. (Real Transformers
        use learned projection matrices for Q, K, V in each head.)"""
        d_head = d_model // num_heads
        head_outputs = []
        for h in range(num_heads):
            # Each head attends over its own slice of the embedding
            X_h = X[:, h * d_head:(h + 1) * d_head]
            out, _ = single_head_attention(X_h, X_h, X_h)
            head_outputs.append(out)
        # Concatenate the heads back to the full model dimension
        return np.concatenate(head_outputs, axis=-1)
    
    X = np.random.default_rng(0).standard_normal((4, 8))  # 4 tokens, d_model=8
    print(multi_head_attention(X, num_heads=2, d_model=8).shape)  # (4, 8)

    ⚠️ Common Misconception: GPT and BERT use the same Transformer architecture, but differently. GPT uses the decoder (predict next word). BERT uses the encoder (understand meaning). GPT generates text; BERT classifies it.
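    The practical difference comes down to masking. A decoder like GPT applies a causal mask so each token can only attend to earlier positions, while an encoder like BERT lets every token attend in both directions. A minimal NumPy sketch of the two attention patterns (random inputs, purely illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d = 4, 8
X = np.random.default_rng(0).standard_normal((seq_len, d))
scores = X @ X.T / np.sqrt(d)

# Encoder style (BERT): every token can attend to every other token
full_weights = softmax(scores)

# Decoder style (GPT): a causal mask hides all future positions
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
causal_weights = softmax(np.where(mask, -np.inf, scores))
print(causal_weights.round(2))  # upper triangle is all zeros
```

    Setting future positions to -inf before the softmax sends their attention weights to exactly zero, which is what lets GPT train on next-word prediction without "cheating" by peeking ahead.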

    Try It: LLM Text Generation

    Simulate how ChatGPT generates text token-by-token with temperature control

    Python
    import numpy as np
    
    # How Large Language Models (LLMs) Generate Text
    # Simplified autoregressive text generation
    
    # Simulate a tiny vocabulary and next-token prediction
    vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast", "<end>"]
    vocab_size = len(vocab)
    
    def fake_logits(context):
        """Simulate model predicting next token probabilities"""
        np.random.seed(hash(tuple(context)) % 2**31)
        logits = np.random.randn(vocab_size)
        # Bias toward sensible continuations, and toward stopping eventually
        if context[-1] == "the":
            logits[vocab.index("cat")] += 2.0
        if len(context) >= 5:
            logits[vocab.index("<end>")] += 4.0
        return logits
    
    def sample_next(context, temperature=1.0):
        """Temperature rescales logits: low = predictable, high = creative"""
        probs = np.exp(fake_logits(context) / temperature)
        return np.random.choice(vocab, p=probs / probs.sum())
    
    # Autoregressive loop: each new token is appended and fed back as context
    tokens = ["the"]
    while tokens[-1] != "<end>" and len(tokens) < 10:
        tokens.append(sample_next(tokens, temperature=0.8))
    print(" ".join(tokens))

    📋 Quick Reference

    Concept             | Purpose               | Key Idea
    Self-Attention      | Capture context       | Every token attends to every other
    Positional Encoding | Encode word order     | Sin/cos patterns per position
    Multi-Head          | Multiple perspectives | Parallel attention heads
    Temperature         | Control randomness    | Low = safe, High = creative
    GPT                 | Generate text         | Decoder-only, autoregressive
    BERT                | Understand text       | Encoder-only, bidirectional
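    Temperature from the table deserves a quick demo: dividing the logits by a temperature T before the softmax sharpens the distribution when T is low and flattens it when T is high. A tiny sketch with made-up logits:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])  # made-up next-token scores

for T in (0.2, 1.0, 2.0):
    print(f"T={T}: {softmax(logits / T).round(2)}")
```

    At T=0.2 almost all the probability piles onto the top token (safe, repetitive text); at T=2.0 the distribution spreads out, so sampling picks unlikely tokens more often (creative, riskier text).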

    💡 Pro Tip: You don't need to train your own Transformer! Use pre-trained models from Hugging Face. With pipeline("text-generation", model="gpt2") you get a working LLM in 2 lines of code. Fine-tuning beats training from scratch 99% of the time.

    🎉 Lesson Complete!

    You now understand the architecture powering modern AI! Next, explore Reinforcement Learning — how AI agents learn from trial and error.
