    Lesson 25 • Advanced

    How Large Language Models Work

    By the end of this lesson you'll be able to explain, in plain English and with runnable code, how an LLM turns your prompt into text — token by token.

    What You'll Learn in This Lesson

    • You'll be able to explain decoder-only transformers and next-token prediction
    • You'll be able to define tokens and the context window in plain English
    • You'll be able to describe parameters (weights) and what they store
    • You'll be able to tell pretraining apart from fine-tuning
    • You'll be able to control randomness with temperature and softmax
    • You'll be able to explain emergent abilities, scaling laws, and hallucination

    🌍 Real-World Analogy: Autocomplete on Steroids

    You already use a tiny language model every day: phone keyboard autocomplete. Type "I'll be there in five" and it suggests "minutes". An LLM is that exact idea — autocomplete on steroids.

    Imagine someone who has read almost everything ever written and remembers the patterns. When you type "The cat sat on the", they don't understand cats — they just know that "mat" is the most likely next word. Scale that predictor up to billions of internal settings and trillions of words of practice, and you get something that can write essays, code, and answers. That's all an LLM is doing: predicting the next token, over and over, very well.

    1. Next-Token Prediction: the one job an LLM has

    An LLM has exactly one job: given some text, output a probability for every possible next token. A token is a chunk of text (often a word piece). The model assigns, say, 62% to "mat" and 21% to "floor", picks one, appends it, and runs again. Generating a paragraph is just this loop repeated hundreds of times — that's called autoregressive generation.

    The simplest way to pick is greedy decoding: always take the highest-probability token. Run the worked example below — it takes a probability dictionary and grabs the most likely next word.

    Worked Example: Greedy Next-Token Prediction

    The model outputs probabilities; greedy decoding grabs the most likely token

    Python
    # How an LLM picks the next word: NEXT-TOKEN PREDICTION
    # An LLM reads your text and outputs a probability for every possible
    # next token. "Greedy" decoding just grabs the most likely one.
    
    # Pretend the model just read: "The cat sat on the"
    # These are the probabilities it assigned to possible next tokens.
    next_token_probs = {
        "mat":   0.62,   # most likely
        "floor": 0.21,
        "roof":  0.10,
        "couch": 0.05,
        "moon":  0.02,
    }
    
    # Greedy pick = the token with the highest probability.
    ...
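
The greedy step described above can be sketched in a few lines. This is a minimal illustration rather than the lesson's full listing; the helper name `greedy_pick` is hypothetical, and the probabilities are the pretend values from the example.

```python
# Pretend probabilities for the next token after "The cat sat on the".
next_token_probs = {
    "mat":   0.62,
    "floor": 0.21,
    "roof":  0.10,
    "couch": 0.05,
    "moon":  0.02,
}

def greedy_pick(probs):
    """Return the token with the highest probability (greedy decoding)."""
    # max() walks the dictionary keys; key=probs.get ranks each key by its value.
    return max(probs, key=probs.get)

best_token = greedy_pick(next_token_probs)
print("Next token:", best_token)
print("Sentence:   The cat sat on the", best_token)
```

Greedy decoding is deterministic: the same probabilities always give the same token, which is why it can feel repetitive in longer outputs.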

    2. Decoder-Only Transformers & Causal Attention

    GPT, LLaMA, and Mistral all use a decoder-only transformer. The key mechanism is self-attention: at every position, the model looks back at earlier tokens and decides which ones matter for predicting what comes next. "Causal" (or masked) attention means a token can only see tokens before it — never the future. That's what makes left-to-right generation possible.

    Reading "The cat sat on" with causal attention:

    • The → sees only The
    • cat → sees The, cat
    • sat → sees The, cat, sat
    • on → sees everything before it, then predicts the next token

    Stack dozens of these attention-plus-feed-forward layers, and the model can capture grammar, facts, and style. You don't need the matrix maths to use one — but knowing that each token only attends to the past explains why an LLM writes one token at a time.
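
The "only look back" rule can be written as a small table of True/False values, often called a causal mask. The sketch below shows only the visibility pattern; it assumes one word per token for readability and leaves out the attention arithmetic. The function name `causal_mask` is a hypothetical helper.

```python
tokens = ["The", "cat", "sat", "on"]

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_mask(len(tokens))

# Print, for each token, the tokens it is allowed to see.
for i, token in enumerate(tokens):
    visible = [tokens[j] for j in range(len(tokens)) if mask[i][j]]
    print(f"{token:>4} sees: {', '.join(visible)}")
```

Each row of the mask is one position, and the True entries stop at the diagonal: nothing to the right (the future) is visible.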

    3. Tokens & the Context Window

    Models don't read characters or whole words; they read tokens. A rough rule for English is one token per four characters (about ¾ of a word). The context window is the maximum number of tokens, your prompt plus the reply, that the model can hold at once. Go over it and something has to give: chat apps usually drop or summarise the oldest text, so the model effectively "forgets" the start of a long conversation, while a raw API call may simply return an error.

    Worked Example: Tokens & Context Window

    Estimate token counts and see how the context window limits what the model can read

    Python
    # TOKENS & CONTEXT WINDOW
    # LLMs don't read characters or whole words — they read TOKENS (word pieces).
    # A rough rule of thumb in English: ~1 token per 4 characters (~0.75 words).
    
    text = "Large language models predict the next token."
    
    # Real tokenizers (like tiktoken) split smarter, but we'll estimate.
    approx_tokens = max(1, len(text) // 4)
    words = len(text.split())
    
    print("Text:", text)
    print("Characters:", len(text))
    print("Words:", words)
    print("Estimated tokens:", approx_tokens)
    print()
    
    
    ...
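
To see what "falls out of view" means, here is a minimal sketch of a sliding window. It assumes one word equals one token, which real tokenizers do not guarantee, and `fit_to_window` is a hypothetical helper rather than part of any library.

```python
def fit_to_window(tokens, context_window):
    """Keep only the most recent tokens that fit in the context window."""
    if context_window <= 0:
        return []
    return tokens[-context_window:]

# Pretend each word is one token (a simplification).
conversation = "my name is Ada . what is the weather today ? and tomorrow ?".split()

window = fit_to_window(conversation, context_window=8)
dropped = conversation[:len(conversation) - len(window)]

print("Model can still see:", " ".join(window))
print("Fell out of view:   ", " ".join(dropped))
```

The name "Ada" lands in the dropped part, so a model limited to this window could no longer answer "what is my name?".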

    4. Parameters & Weights: where the "knowledge" lives

    A model's parameters (also called weights) are just numbers — billions of them — that the model adjusts during training. They're the dials that turn an input of tokens into an output of probabilities. "GPT-3 has 175 billion parameters" means 175 billion of these numbers.

    There's no database of facts inside; the "knowledge" is encoded across all those weights. More parameters generally means more capacity to capture patterns — but, as you'll see in the scaling section, size alone isn't everything.
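
A little arithmetic makes "175 billion numbers" concrete. The sketch below counts the parameters of one fully connected layer and estimates how much memory the weights alone would need. The helper names are hypothetical, and the 2-bytes-per-parameter figure assumes 16-bit storage, which the lesson does not specify.

```python
def linear_layer_params(n_inputs, n_outputs):
    """One weight per input-output pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

def weights_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

print("Params in a 4 -> 3 layer:", linear_layer_params(4, 3))

gpt3_params = 175_000_000_000
print("GPT-3 weights at 16-bit:", weights_memory_gb(gpt3_params), "GB")
print("GPT-3 weights at 32-bit:", weights_memory_gb(gpt3_params, bytes_per_param=4), "GB")
```

This counts storage for the weights only; running or training a model needs additional memory on top.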

    5. Temperature & Sampling: controlling randomness

    Greedy decoding always picks the top token, which can feel robotic and repetitive. Instead, models usually sample from the probabilities. The raw scores the model produces are called logits; softmax turns them into probabilities that sum to 1.

    Temperature is the knob. You divide the logits by the temperature before softmax. A low temperature (e.g. 0.5) sharpens the distribution — the top token dominates, output is safe and consistent. A high temperature (e.g. 2.0) flattens it — output gets diverse and creative, but also more likely to go off the rails. Run the example to watch the probabilities shift.

    Worked Example: Temperature + Softmax

    Apply temperature to logits, then softmax — see how randomness changes

    Python
    # TEMPERATURE controls randomness. The model outputs raw scores called
    # "logits". Temperature divides the logits, then softmax turns them into
    # probabilities. Low temp = confident/repetitive. High temp = creative/risky.
    
    import math
    
    # Raw scores (logits) for 4 candidate next tokens.
    tokens = ["mat", "floor", "roof", "moon"]
    logits = [3.0, 1.5, 0.8, -1.0]
    
    def softmax(scores):
        m = max(scores)                       # subtract max for numerical safety
        exps = [math.exp(s - m) for s in sco
    ...
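
The sampling step itself can be sketched with the standard library. This is an illustrative sketch that reuses the example's tokens and logits; `probs_at` is a hypothetical helper, and the fixed seed is only there to make runs repeatable.

```python
import math
import random

tokens = ["mat", "floor", "roof", "moon"]
logits = [3.0, 1.5, 0.8, -1.0]

def softmax(scores):
    m = max(scores)                           # subtract max for numerical safety
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def probs_at(logits, temperature):
    """Divide every logit by the temperature, then apply softmax."""
    return softmax([x / temperature for x in logits])

# Lower temperature sharpens the distribution; higher temperature flattens it.
for t in (0.5, 1.0, 2.0):
    rounded = [round(p, 3) for p in probs_at(logits, t)]
    print(f"T={t}:", dict(zip(tokens, rounded)))

# Sampling: draw tokens in proportion to their probabilities.
rng = random.Random(0)                        # fixed seed, for repeatability only
sample = rng.choices(tokens, weights=probs_at(logits, 1.0), k=10)
print("10 samples at T=1.0:", sample)
```

Unlike greedy decoding, sampling can return a lower-ranked token now and then, and raising the temperature makes that happen more often.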

    🎯 Your Turn: Greedy Next Token

    Fill in the blanks to make the model pick its next token

    Python
    # 🎯 YOUR TURN — make the model pick its next token (greedy decoding)
    # Fill in every ___ . Run it and compare against the expected output.
    
    next_token_probs = {
        "pizza":  0.55,
        "salad":  0.30,
        "rocks":  0.15,
    }
    
    # 1) Greedy decoding = the token with the HIGHEST probability.
    #    👉 use max(...) with key=next_token_probs.get
    best_token = ___          # 👉 replace ___
    
    # 2) 👉 print the winning token
    print("Next token:", ___)
    
    # 3) 👉 build the sentence "I want to eat <best_token>"
    pr
    ...

    🎯 Your Turn: Temperature & Softmax

    Complete softmax and the temperature step yourself

    Python
    # 🎯 YOUR TURN — turn raw logits into probabilities, with temperature
    # Fill in every ___ . No numpy — just the math module.
    
    import math
    
    logits = [2.0, 1.0, 0.1]      # raw scores for 3 candidate tokens
    
    def softmax(scores):
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        # 👉 each probability = its exp divided by the total
        return [e / ___ for e in exps]      # 👉 replace ___
    
    def apply_temperature(logits, temperature):
        # 👉 divide every logit
    ...

    6. Pretraining vs Fine-Tuning

    Pretraining is the expensive part: the model learns general language by predicting the next token across trillions of tokens of internet text. This takes months on huge GPU clusters and can cost millions of dollars. The result is a model that knows language but isn't specialised.

    Fine-tuning takes that pretrained model and cheaply adapts it to a specific task or style using a much smaller dataset — sometimes just thousands of examples. Techniques like LoRA make this affordable on a single GPU. The rule of thumb: almost everyone fine-tunes; almost no one pretrains from scratch.
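
Why LoRA is cheap comes down to counting. It freezes the original weight matrix and trains two thin low-rank matrices instead. The sketch below compares the two counts for a single layer; the 4096 x 4096 layer size and rank 8 are illustrative assumptions, not values from this lesson, and the helper names are hypothetical.

```python
def full_finetune_params(d_in, d_out):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable weights for two low-rank matrices: (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)

d_in, d_out, rank = 4096, 4096, 8             # illustrative sizes (assumption)

full = full_finetune_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)

print("Full fine-tuning:", full, "trainable weights")
print("LoRA, rank 8:    ", lora, "trainable weights")
print(f"LoRA trains {lora / full:.2%} of this layer's weights")
```

Far fewer trainable numbers means far less memory for gradients and optimizer state, which is what brings fine-tuning within reach of a single GPU.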

    The worked examples below use the real Hugging Face transformers library. They need transformers and torch installed locally, so treat them as a preview of the real API — the expected output is shown in comments.

    Worked Example: Real LLM with transformers (HF pipeline)

    A text-generation pipeline using a pretrained decoder-only model (gpt2)

    Python
    # REAL LLMs in practice: Hugging Face 'transformers'
    # (This needs the transformers + torch libraries installed locally;
    #  it is shown here to illustrate the real API and its output.)
    
    from transformers import pipeline
    
    # A text-generation pipeline downloads a pretrained decoder-only model.
    generator = pipeline("text-generation", model="gpt2")
    
    result = generator(
        "The future of artificial intelligence is",
        max_new_tokens=20,
        temperature=0.7,     # same temperature idea you coded ab
    ...

    Worked Example: Loading a Pretrained Model

    Inspect a pretrained model's vocabulary and parameter count before fine-tuning

    Python
    # PRETRAINING vs FINE-TUNING
    # Pretraining: learn language from trillions of tokens (months, $millions).
    # Fine-tuning: cheaply adapt that pretrained model to YOUR task.
    
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    # 1) Load a model that was already PRETRAINED on the open internet.
    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    model = AutoModelForCausalLM.from_pretrained("distilgpt2")
    
    print("Loaded a pretrained model:", model.config.model_type)
    print("Vocabulary size
    ...

    7. Emergent Abilities & Scaling Laws

    Scaling laws are the surprising finding that a model's prediction error falls in a smooth, predictable way as you add more parameters, more data, and more compute. The Chinchilla result added an important twist: for a fixed compute budget, you often want a smaller model trained on more data — roughly 20 tokens of training data per parameter.
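
The 20-tokens-per-parameter rule of thumb is easy to turn into numbers. In the sketch below the helper names are hypothetical. The "6 x parameters x tokens" estimate of training compute is a widely used approximation that is not part of this lesson, so treat it as an assumption.

```python
TOKENS_PER_PARAM = 20   # rough Chinchilla rule of thumb quoted above

def compute_optimal_tokens(n_params):
    """Approximate training tokens suggested for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    """Assumed approximation: about 6 floating-point operations per parameter per token."""
    return 6 * n_params * n_tokens

for n_params in (1_000_000_000, 7_000_000_000, 70_000_000_000):
    n_tokens = compute_optimal_tokens(n_params)
    flops = training_flops(n_params, n_tokens)
    print(f"{n_params / 1e9:>4.0f}B params -> {n_tokens / 1e12:.2f}T tokens, ~{flops:.2e} FLOPs")
```

Compute grows with parameters times tokens, so under this rule a model ten times larger needs roughly a hundred times the training compute.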

    Emergent abilities are skills that barely exist in small models but appear, sometimes abruptly, once a model crosses a size threshold: things like following instructions, doing multi-step arithmetic, or in-context learning. Nobody programmed these in; they emerge from scale. How sudden the jump really is remains debated, since part of it can depend on how the skill is measured, but it helps explain why bigger models can feel qualitatively, not just quantitatively, smarter.

    ⚠️ Limitations: Hallucination & What LLMs Can't Do

    Because an LLM optimises for plausible next tokens — not for truth — it will sometimes produce confident, fluent statements that are simply wrong. This is called hallucination. There's no fact-checker inside; if a made-up citation or API method "sounds right", the model may emit it.

    • It has no live knowledge after its training cutoff unless given tools or retrieval.
    • It can't reliably count, do exact maths, or remember beyond the context window.
    • It's sensitive to prompt wording — small changes can flip the answer.

    The practical fix: verify important facts, ground the model with retrieval (RAG) or tools, and lower the temperature when you need consistency.

    Common Errors (And How to Fix Them)

    ❌ Output cut off / "maximum context length exceeded"

    Your prompt plus reply went past the context window, so the model truncated or errored.

    ✅ Fix:

    # Trim old messages or summarise them before sending.
    # Leave room for the reply: reply_budget = context_window - prompt_tokens
    # If reply_budget is small, shorten the prompt.
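
One way to apply that fix is sketched below: drop the oldest messages until what remains fits the prompt budget. It reuses the rough 4-characters-per-token estimate from earlier in the lesson; `estimate_tokens` and `trim_history` are hypothetical helpers, and a real application would count with the model's own tokenizer.

```python
def estimate_tokens(text):
    """Rough estimate: about 1 token per 4 characters of English."""
    return max(1, len(text) // 4)

def trim_history(messages, context_window, reply_budget):
    """Keep the newest messages that fit in context_window - reply_budget tokens."""
    prompt_budget = context_window - reply_budget
    kept = []
    used = 0
    for message in reversed(messages):        # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > prompt_budget:
            break                             # everything older is dropped too
        kept.append(message)
        used += cost
    kept.reverse()                            # restore chronological order
    return kept, used

history = [
    "Hi, my name is Ada and I live in London.",
    "Nice to meet you, Ada! How can I help?",
    "What is a context window, in one sentence?",
]

kept, used = trim_history(history, context_window=24, reply_budget=4)
print("Kept", len(kept), "of", len(history), "messages using about", used, "tokens")
```

Dropping whole messages is the simplest policy; summarising the dropped ones keeps more meaning at the cost of an extra model call.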

    ❌ The model confidently states something false (hallucination)

    It generated a plausible-sounding but wrong fact, citation, or function name.

    ✅ Fix:

    # Don't trust facts blindly. Ground the model:
    #  - lower the temperature (e.g. 0.2) for factual tasks
    #  - give it the source text (retrieval / RAG)
    #  - ask it to say "I don't know" when unsure

    ❌ Output is random / repetitive — wrong temperature

    Too-high temperature = nonsense; temperature of 0 = identical, repetitive text.

    ✅ Fix:

    # Use ~0.2 for facts/code, ~0.7-1.0 for creative writing.
    generator(prompt, temperature=0.7, do_sample=True)  # balanced sampling
    generator(prompt, do_sample=False)                  # deterministic (greedy); transformers rejects temperature=0.0 when sampling

    ❌ Same task, different wording, different answer (prompt sensitivity)

    A tiny rephrase changed the result because the model reacts to exact tokens.

    ✅ Fix:

    # Be explicit and consistent. Spell out the format you want:
    # "Answer in one sentence." / "Return valid JSON only."
    # Provide 1-2 examples (few-shot) to anchor the style.
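
A few-shot prompt is ordinary string building. The sketch below assembles an instruction, two examples, and the new question into one prompt. The "Input:" / "Output:" layout and the function name `build_few_shot_prompt` are illustrative assumptions; no particular model requires this format.

```python
def build_few_shot_prompt(instruction, examples, question):
    """Join an instruction, worked examples, and the new input into one prompt."""
    lines = [instruction, ""]
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
        lines.append("")
    lines.append(f"Input: {question}")
    lines.append("Output:")                   # left open for the model to complete
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    instruction="Classify the sentiment. Answer with one word: positive or negative.",
    examples=[
        ("I loved this film.", "positive"),
        ("The food was cold and bland.", "negative"),
    ],
    question="The staff were friendly and helpful.",
)
print(prompt)
```

Keeping the template in one function means every request uses identical wording, which removes one source of prompt sensitivity.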

    📋 Quick Reference

    Term                     | What It Means                  | In One Line
    -------------------------|--------------------------------|-----------------------------
    Token                    | A chunk of text (word piece)   | ~4 chars of English
    Next-token prediction    | Output a probability per token | Pick, append, repeat
    Decoder-only transformer | Architecture behind GPT/LLaMA  | Causal (past-only) attention
    Context window           | Max tokens held at once        | Prompt + reply combined
    Parameters / weights     | The trained numbers            | Where capacity lives
    Logits                   | Raw pre-softmax scores         | softmax → probabilities
    Temperature              | Randomness knob                | Low = safe, high = creative
    Pretraining              | Learn language from scratch    | Months, $millions
    Fine-tuning              | Adapt a pretrained model       | Cheap, task-specific
    Hallucination            | Confident but false output     | Plausible ≠ true

    ❓ Frequently Asked Questions

    Q: How does a large language model actually work?

    A: Most modern LLMs, including GPT, LLaMA, and Mistral, are decoder-only transformers trained to predict the next token. The model turns your text into tokens, runs them through layers of self-attention and feed-forward weights, and outputs a probability for every possible next token. It picks one, appends it, and repeats, generating text one token at a time.

    Q: What is a token and what is the context window?

    A: A token is a chunk of text — often a word piece — and roughly one token equals about four characters of English. The context window is the maximum number of tokens (prompt plus reply) the model can consider at once. Text beyond that limit is dropped, so the model effectively forgets the earliest content.

    Q: What does temperature do when generating text?

    A: Temperature scales the model's raw scores (logits) before softmax. A low temperature (near 0) makes the model confident and repetitive — it almost always picks the top token. A high temperature flattens the probabilities, making output more random and creative but more error-prone.

    Q: What is the difference between pretraining and fine-tuning?

    A: Pretraining teaches a model general language from trillions of tokens — it costs months of compute and millions of dollars. Fine-tuning cheaply adapts that already-pretrained model to your specific task or style using a small dataset. Almost everyone fine-tunes; almost no one pretrains from scratch.

    Q: Why do LLMs hallucinate or make things up?

    A: An LLM optimises for plausible next tokens, not for truth. It has no built-in fact database, so when the most likely continuation sounds right but is wrong, it produces a confident, fluent falsehood — a hallucination. Verify important facts and prefer retrieval-augmented or grounded approaches for accuracy.

    Q: What are emergent abilities and scaling laws?

    A: Scaling laws describe how loss falls predictably as you add parameters, data, and compute. Emergent abilities are skills (like multi-step reasoning or following instructions) that barely appear in small models but show up once a model crosses a size threshold — they emerge with scale rather than being explicitly programmed.

    🎯 Mini Challenge: Build a Tiny Next-Token Generator

    Time to fade the scaffolding. Starting from the word "the", greedily generate three more tokens and print the finished sentence. The starter below gives you only a comment outline — write the logic yourself.

    Mini Challenge

    Greedily generate a 4-word sentence from a transition table

    Python
    # 🎯 MINI-CHALLENGE: build a tiny next-token generator
    # Brief: starting from "the", generate 3 more tokens by GREEDILY picking
    # the most likely next token at each step, then print the final sentence.
    #
    # 1. Make a dict 'transitions' mapping a word -> {nextWord: probability}
    #    e.g. "the": {"cat": 0.7, "dog": 0.3}, "cat": {"sat": 0.9, "ran": 0.1}, ...
    # 2. Start with current = "the" and an output list ["the"]
    # 3. Loop 3 times: pick the highest-probability next word (max + key=.get),
    #    app
    ...
    🎉

    Lesson complete — you can now explain how an LLM works!

    You learned that an LLM is a decoder-only transformer doing next-token prediction over tokens within a context window, that its knowledge lives in billions of parameters, and that temperature, softmax, pretraining vs fine-tuning, scaling laws, and hallucination shape what it produces.

    🚀 Up next: Tokenization Strategies — see exactly how text becomes the tokens an LLM reads.
