Lesson 25 • Advanced

How Large Language Models Work

By the end of this lesson you'll be able to explain, in plain English and with runnable code, how an LLM turns your prompt into text — token by token.

What You'll Learn in This Lesson

✓You'll be able to explain decoder-only transformers and next-token prediction
✓You'll be able to define tokens and the context window in plain English
✓You'll be able to describe parameters (weights) and what they store
✓You'll be able to tell pretraining apart from fine-tuning
✓You'll be able to control randomness with temperature and softmax
✓You'll be able to explain emergent abilities, scaling laws, and hallucination

Before you start: A little Python helps, since the exercises run real code. If neural networks are new to you, skim Neural Networks first — but it isn't required to follow the ideas here.

🌍 Real-World Analogy: Autocomplete on Steroids

You already use a tiny language model every day: phone keyboard autocomplete. Type "I'll be there in five" and it suggests "minutes". An LLM is that exact idea — autocomplete on steroids.

Imagine someone who has read almost everything ever written and remembers the patterns. When you type "The cat sat on the", they don't understand cats — they just know that "mat" is the most likely next word. Scale that predictor up to billions of internal settings and trillions of words of practice, and you get something that can write essays, code, and answers. That's all an LLM is doing: predicting the next token, over and over, very well.

1Next-Token Prediction — the one job an LLM has

An LLM has exactly one job: given some text, output a probability for every possible next token. A token is a chunk of text (often a word piece). The model assigns, say, 62% to "mat" and 21% to "floor", picks one, appends it, and runs again. Generating a paragraph is just this loop repeated hundreds of times — that's called autoregressive generation.

The simplest way to pick is greedy decoding: always take the highest-probability token. Run the worked example below — it takes a probability dictionary and grabs the most likely next word.

Worked Example: Greedy Next-Token Prediction

The model outputs probabilities; greedy decoding grabs the most likely token

Try it Yourself »

Python

# How an LLM picks the next word: NEXT-TOKEN PREDICTION
# An LLM reads your text and outputs a probability for every possible
# next token. "Greedy" decoding just grabs the most likely one.

# Pretend the model just read: "The cat sat on the"
# These are the probabilities it assigned to possible next tokens.
next_token_probs = {
    "mat":   0.62,   # most likely
    "floor": 0.21,
    "roof":  0.10,
    "couch": 0.05,
    "moon":  0.02,
}

# Greedy pick = the token with the highest probability.
...

2Decoder-Only Transformers & Causal Attention

GPT, LLaMA, and Mistral all use a decoder-only transformer. The key mechanism is self-attention: at every position, the model looks back at earlier tokens and decides which ones matter for predicting what comes next. "Causal" (or masked) attention means a token can only see tokens before it — never the future. That's what makes left-to-right generation possible.

Reading "The cat sat on" with causal attention:

The → sees only The
cat → sees The, cat
sat → sees The, cat, sat
on → sees everything before it, then predicts the next token

Stack dozens of these attention-plus-feed-forward layers, and the model can capture grammar, facts, and style. You don't need the matrix maths to use one — but knowing that each token only attends to the past explains why an LLM writes one token at a time.

3Tokens & the Context Window

Models don't read characters or whole words — they read tokens. A rough rule for English is one token per four characters (about ¾ of a word). The context window is the maximum number of tokens — your prompt plus the reply — the model can hold at once. Go over it and the oldest text falls out of view, so the model effectively "forgets" the start of a long conversation.

Worked Example: Tokens & Context Window

Estimate token counts and see how the context window limits what the model can read

Try it Yourself »

Python

# TOKENS & CONTEXT WINDOW
# LLMs don't read characters or whole words — they read TOKENS (word pieces).
# A rough rule of thumb in English: ~1 token per 4 characters (~0.75 words).

text = "Large language models predict the next token."

# Real tokenizers (like tiktoken) split smarter, but we'll estimate.
approx_tokens = max(1, len(text) // 4)
words = len(text.split())

print("Text:", text)
print("Characters:", len(text))
print("Words:", words)
print("Estimated tokens:", approx_tokens)
print()


...

4Parameters & Weights — where the "knowledge" lives

A model's parameters (also called weights) are just numbers — billions of them — that the model adjusts during training. They're the dials that turn an input of tokens into an output of probabilities. "GPT-3 has 175 billion parameters" means 175 billion of these numbers.

There's no database of facts inside; the "knowledge" is encoded across all those weights. More parameters generally means more capacity to capture patterns — but, as you'll see in the scaling section, size alone isn't everything.

5Temperature & Sampling — controlling randomness

Greedy decoding always picks the top token, which can feel robotic and repetitive. Instead, models usually sample from the probabilities. The raw scores the model produces are called logits; softmax turns them into probabilities that sum to 1.

Temperature is the knob. You divide the logits by the temperature before softmax. A low temperature (e.g. 0.5) sharpens the distribution — the top token dominates, output is safe and consistent. A high temperature (e.g. 2.0) flattens it — output gets diverse and creative, but also more likely to go off the rails. Run the example to watch the probabilities shift.

Worked Example: Temperature + Softmax

Apply temperature to logits, then softmax — see how randomness changes

Try it Yourself »

Python

# TEMPERATURE controls randomness. The model outputs raw scores called
# "logits". Temperature divides the logits, then softmax turns them into
# probabilities. Low temp = confident/repetitive. High temp = creative/risky.

import math

# Raw scores (logits) for 4 candidate next tokens.
tokens = ["mat", "floor", "roof", "moon"]
logits = [3.0, 1.5, 0.8, -1.0]

def softmax(scores):
    m = max(scores)                       # subtract max for numerical safety
    exps = [math.exp(s - m) for s in sco
...

🎯 Your Turn: Greedy Next Token

Fill in the blanks to make the model pick its next token

Try it Yourself »

Python

# 🎯 YOUR TURN — make the model pick its next token (greedy decoding)
# Fill in every ___ . Run it and compare against the expected output.

next_token_probs = {
    "pizza":  0.55,
    "salad":  0.30,
    "rocks":  0.15,
}

# 1) Greedy decoding = the token with the HIGHEST probability.
#    👉 use max(...) with key=next_token_probs.get
best_token = ___          # 👉 replace ___

# 2) 👉 print the winning token
print("Next token:", ___)

# 3) 👉 build the sentence "I want to eat <best_token>"
pr
...

🎯 Your Turn: Temperature & Softmax

Complete softmax and the temperature step yourself

Try it Yourself »

Python

# 🎯 YOUR TURN — turn raw logits into probabilities, with temperature
# Fill in every ___ . No numpy — just the math module.

import math

logits = [2.0, 1.0, 0.1]      # raw scores for 3 candidate tokens

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    # 👉 each probability = its exp divided by the total
    return [e / ___ for e in exps]      # 👉 replace ___

def apply_temperature(logits, temperature):
    # 👉 divide every logit
...

6Pretraining vs Fine-Tuning

Pretraining is the expensive part: the model learns general language by predicting the next token across trillions of tokens of internet text. This takes months on huge GPU clusters and can cost millions of dollars. The result is a model that knows language but isn't specialised.

Fine-tuning takes that pretrained model and cheaply adapts it to a specific task or style using a much smaller dataset — sometimes just thousands of examples. Techniques like LoRA make this affordable on a single GPU. The rule of thumb: almost everyone fine-tunes; almost no one pretrains from scratch.

The worked examples below use the real Hugging Face transformers library. They need transformers and torch installed locally, so treat them as a preview of the real API — the expected output is shown in comments.

Worked Example: Real LLM with transformers (HF pipeline)

A text-generation pipeline using a pretrained decoder-only model (gpt2)

Try it Yourself »

Python

# REAL LLMs in practice: Hugging Face 'transformers'
# (This needs the transformers + torch libraries installed locally;
#  it is shown here to illustrate the real API and its output.)

from transformers import pipeline

# A text-generation pipeline downloads a pretrained decoder-only model.
generator = pipeline("text-generation", model="gpt2")

result = generator(
    "The future of artificial intelligence is",
    max_new_tokens=20,
    temperature=0.7,     # same temperature idea you coded ab
...

Worked Example: Loading a Pretrained Model

Inspect a pretrained model's vocabulary and parameter count before fine-tuning

Try it Yourself »

Python

# PRETRAINING vs FINE-TUNING
# Pretraining: learn language from trillions of tokens (months, $millions).
# Fine-tuning: cheaply adapt that pretrained model to YOUR task.

from transformers import AutoModelForCausalLM, AutoTokenizer

# 1) Load a model that was already PRETRAINED on the open internet.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

print("Loaded a pretrained model:", model.config.model_type)
print("Vocabulary size
...

7Emergent Abilities & Scaling Laws

Scaling laws are the surprising finding that a model's prediction error falls in a smooth, predictable way as you add more parameters, more data, and more compute. The Chinchilla result added an important twist: for a fixed compute budget, you often want a smaller model trained on more data — roughly 20 tokens of training data per parameter.

Emergent abilities are skills that barely exist in small models but suddenly appear once a model crosses a size threshold — things like following instructions, doing multi-step arithmetic, or in-context learning. Nobody programmed these in; they emerge from scale. This is why bigger models can feel qualitatively, not just quantitatively, smarter.

Key insight: bigger isn't automatically better. Mistral-7B beats larger models on many tasks thanks to better data and smarter attention. Data quality and architecture matter alongside raw size.

⚠️ Limitations: Hallucination & What LLMs Can't Do

Because an LLM optimises for plausible next tokens — not for truth — it will sometimes produce confident, fluent statements that are simply wrong. This is called hallucination. There's no fact-checker inside; if a made-up citation or API method "sounds right", the model may emit it.

It has no live knowledge after its training cutoff unless given tools or retrieval.
It can't reliably count, do exact maths, or remember beyond the context window.
It's sensitive to prompt wording — small changes can flip the answer.

The practical fix: verify important facts, ground the model with retrieval (RAG) or tools, and lower the temperature when you need consistency.

!Common Errors (And How to Fix Them)

❌ Output cut off / "maximum context length exceeded"

Your prompt plus reply went past the context window, so the model truncated or errored.

✅ Fix:

# Trim old messages or summarise them before sending.
# Leave room for the reply: reply_budget = context_window - prompt_tokens
# If reply_budget is small, shorten the prompt.

❌ The model confidently states something false (hallucination)

It generated a plausible-sounding but wrong fact, citation, or function name.

✅ Fix:

# Don't trust facts blindly. Ground the model:
#  - lower the temperature (e.g. 0.2) for factual tasks
#  - give it the source text (retrieval / RAG)
#  - ask it to say "I don't know" when unsure

❌ Output is random / repetitive — wrong temperature

Too-high temperature = nonsense; temperature of 0 = identical, repetitive text.

✅ Fix:

# Use ~0.2 for facts/code, ~0.7-1.0 for creative writing.
generator(prompt, temperature=0.7, do_sample=True)  # balanced
generator(prompt, temperature=0.0)                  # deterministic

❌ Same task, different wording, different answer (prompt sensitivity)

A tiny rephrase changed the result because the model reacts to exact tokens.

✅ Fix:

# Be explicit and consistent. Spell out the format you want:
# "Answer in one sentence." / "Return valid JSON only."
# Provide 1-2 examples (few-shot) to anchor the style.

📋 Quick Reference

Term	What It Means	In One Line
Token	A chunk of text (word piece)	~4 chars of English
Next-token prediction	Output a probability per token	Pick, append, repeat
Decoder-only transformer	Architecture behind GPT/LLaMA	Causal (past-only) attention
Context window	Max tokens held at once	Prompt + reply combined
Parameters / weights	The trained numbers	Where capacity lives
Logits	Raw pre-softmax scores	softmax → probabilities
Temperature	Randomness knob	Low = safe, high = creative
Pretraining	Learn language from scratch	Months, $millions
Fine-tuning	Adapt a pretrained model	Cheap, task-specific
Hallucination	Confident but false output	Plausible ≠ true

❓ Frequently Asked Questions

Q: How does a large language model actually work?

A: An LLM is a decoder-only transformer trained to predict the next token. It turns your text into tokens, runs them through layers of self-attention and feed-forward weights, and outputs a probability for every possible next token. It picks one, appends it, and repeats — generating text one token at a time.

Q: What is a token and what is the context window?

A: A token is a chunk of text — often a word piece — and roughly one token equals about four characters of English. The context window is the maximum number of tokens (prompt plus reply) the model can consider at once. Text beyond that limit is dropped, so the model effectively forgets the earliest content.

Q: What does temperature do when generating text?

A: Temperature scales the model's raw scores (logits) before softmax. A low temperature (near 0) makes the model confident and repetitive — it almost always picks the top token. A high temperature flattens the probabilities, making output more random and creative but more error-prone.

Q: What is the difference between pretraining and fine-tuning?

A: Pretraining teaches a model general language from trillions of tokens — it costs months of compute and millions of dollars. Fine-tuning cheaply adapts that already-pretrained model to your specific task or style using a small dataset. Almost everyone fine-tunes; almost no one pretrains from scratch.

Q: Why do LLMs hallucinate or make things up?

A: An LLM optimises for plausible next tokens, not for truth. It has no built-in fact database, so when the most likely continuation sounds right but is wrong, it produces a confident, fluent falsehood — a hallucination. Verify important facts and prefer retrieval-augmented or grounded approaches for accuracy.

Q: What are emergent abilities and scaling laws?

A: Scaling laws describe how loss falls predictably as you add parameters, data, and compute. Emergent abilities are skills (like multi-step reasoning or following instructions) that barely appear in small models but show up once a model crosses a size threshold — they emerge with scale rather than being explicitly programmed.

🎯 Mini Challenge: Build a Tiny Next-Token Generator

Time to fade the scaffolding. Starting from the word "the", greedily generate three more tokens and print the finished sentence. The starter below gives you only a comment outline — write the logic yourself.

Mini Challenge

Greedily generate a 4-word sentence from a transition table

Try it Yourself »

Python

# 🎯 MINI-CHALLENGE: build a tiny next-token generator
# Brief: starting from "the", generate 3 more tokens by GREEDILY picking
# the most likely next token at each step, then print the final sentence.
#
# 1. Make a dict 'transitions' mapping a word -> {nextWord: probability}
#    e.g. "the": {"cat": 0.7, "dog": 0.3}, "cat": {"sat": 0.9, "ran": 0.1}, ...
# 2. Start with current = "the" and an output list ["the"]
# 3. Loop 3 times: pick the highest-probability next word (max + key=.get),
#    app
...

🎉

Lesson complete — you can now explain how an LLM works!

You learned that an LLM is a decoder-only transformer doing next-token prediction over tokens within a context window, that its knowledge lives in billions of parameters, and that temperature, softmax, pretraining vs fine-tuning, scaling laws, and hallucination shape what it produces.

🚀 Up next: Tokenization Strategies — see exactly how text becomes the tokens an LLM reads.

How Large Language Models Work

What You'll Learn in This Lesson

🌍 Real-World Analogy: Autocomplete on Steroids

1Next-Token Prediction — the one job an LLM has

Worked Example: Greedy Next-Token Prediction

2Decoder-Only Transformers & Causal Attention

3Tokens & the Context Window

Worked Example: Tokens & Context Window

4Parameters & Weights — where the "knowledge" lives

5Temperature & Sampling — controlling randomness

Worked Example: Temperature + Softmax

🎯 Your Turn: Greedy Next Token

🎯 Your Turn: Temperature & Softmax

6Pretraining vs Fine-Tuning

Worked Example: Real LLM with transformers (HF pipeline)

Worked Example: Loading a Pretrained Model

7Emergent Abilities & Scaling Laws

⚠️ Limitations: Hallucination & What LLMs Can't Do

!Common Errors (And How to Fix Them)

📋 Quick Reference

❓ Frequently Asked Questions

🎯 Mini Challenge: Build a Tiny Next-Token Generator

Mini Challenge

Lesson complete — you can now explain how an LLM works!

Cookie & Privacy Settings