Lesson 25 • Advanced
Large Language Model Architecture
Understand how GPT, LLaMA, and Mistral work inside — decoder-only transformers, causal masking, tokenisation, and scaling laws.
✅ What You'll Learn
- Decoder-only transformers and causal self-attention
- How autoregressive generation works (token by token)
- Scaling laws: parameters, data, and compute tradeoffs
- Key differences between GPT, LLaMA, and Mistral
🧠 Inside an LLM
🎯 Real-World Analogy: An LLM is like an incredibly well-read autocomplete system. Imagine someone who has read every book, article, and conversation ever written. When you type "The cat sat on the", they predict "mat" — not by understanding cats, but by recognising patterns from trillions of words. The decoder-only transformer is the architecture that makes this pattern recognition efficient at massive scale.
Modern LLMs (GPT-4, LLaMA, Mistral) use decoder-only transformers. Unlike the original encoder-decoder Transformer (2017), these have only the decoder half with causal masking — each token can only see tokens that came before it.
Try It: Decoder-Only Transformer
See how causal masking ensures tokens only attend to the past
```python
import numpy as np

# Decoder-Only Transformer: The Architecture Behind GPT
# Each token can only attend to PREVIOUS tokens (causal masking)
np.random.seed(42)

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv, d_k):
    """Masked self-attention: tokens can't see the future."""
    Q = X @ Wq
    K = X @ Wk
    V = X @ Wv
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: future positions get -inf, so softmax assigns them zero weight
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores)
    return weights @ V
```
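Attention builds the contextual representation; generation then happens one token at a time: the model reads the full context, predicts a distribution over the next token, appends a token (the most likely one, under greedy decoding), and repeats. A minimal sketch with a toy six-word vocabulary, where `toy_next_token_logits` is a hypothetical stand-in for a real trained model:

```python
import numpy as np

# Toy vocabulary; toy_next_token_logits is a stand-in for a trained model.
vocab = ["the", "cat", "sat", "on", "mat", "<eos>"]

def toy_next_token_logits(token_ids):
    """Fake 'model': strongly favours the token after the last one seen."""
    logits = np.full(len(vocab), -5.0)
    logits[(token_ids[-1] + 1) % len(vocab)] = 5.0
    return logits

def generate(prompt_ids, max_new_tokens=4):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(ids)  # full context in, one token out
        next_id = int(np.argmax(logits))     # greedy decoding
        ids.append(next_id)
        if vocab[next_id] == "<eos>":
            break
    return [vocab[i] for i in ids]

print(generate([0, 1]))  # → ['the', 'cat', 'sat', 'on', 'mat', '<eos>']
```

A real LLM does exactly this loop, except the logits come from a forward pass through the decoder stack, and sampling (temperature, top-p) usually replaces pure argmax.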
Try It: LLM Scaling Laws
Explore how model size, data, and compute affect performance
```python
import numpy as np

# LLM Scaling Laws: How Size Affects Performance
# More parameters + more data + more compute = better models
np.random.seed(42)

# Chinchilla scaling law: L(N, D) ≈ A/N^α + B/D^β + E
# N = parameters, D = data tokens
def compute_loss(params_b, data_tokens_b):
    """Approximate loss using the Chinchilla scaling fit"""
    A, alpha = 406.4, 0.34
    B, beta = 410.7, 0.28
    E = 1.69  # irreducible loss
    N = params_b * 1e9
    D = data_tokens_b * 1e9
    return A / (N ** alpha) + B / (D ** beta) + E
```
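This fit explains Chinchilla's headline result: under a fixed compute budget (training cost is commonly estimated as C ≈ 6·N·D FLOPs), loss is minimised by balancing parameters and data, not by maximising parameters. A sketch under those assumptions, reusing the same constants (the specific budget and model sizes are illustrative):

```python
import numpy as np

def compute_loss(params_b, data_tokens_b):
    """Chinchilla fit: L(N, D) ≈ A/N^α + B/D^β + E (same constants as above)."""
    A, alpha = 406.4, 0.34
    B, beta = 410.7, 0.28
    E = 1.69
    N, D = params_b * 1e9, data_tokens_b * 1e9
    return A / N**alpha + B / D**beta + E

# Fixed compute budget C ≈ 6·N·D; roughly the Chinchilla-70B training run.
C = 6 * 70e9 * 1.4e12

for params_b in [10, 30, 70, 175, 500]:
    N = params_b * 1e9
    data_b = C / (6 * N) / 1e9  # tokens the budget allows at this size
    print(f"{params_b:>4}B params, {data_b:>6.0f}B tokens -> "
          f"loss {compute_loss(params_b, data_b):.3f}")
```

Sweeping model size at constant compute, the loss bottoms out near 70B parameters with ~1.4T tokens: a 175B or 500B model trained on this budget would be starved of data and end up worse.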
⚠️ Common Mistake: Thinking bigger models are always better. Mistral-7B outperforms LLaMA-13B on many benchmarks by using better data, sliding window attention, and grouped-query attention. Architecture improvements and data quality matter as much as scale.
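Sliding window attention, one of the Mistral tricks mentioned above, restricts each token to the most recent W keys instead of the full causal prefix, cutting attention cost from O(n²) toward O(n·W) while information still propagates across layers. A minimal mask sketch (the window size here is illustrative, not Mistral's actual 4096):

```python
import numpy as np

def sliding_window_causal_mask(seq_len, window):
    """True where attention is allowed: causal AND within the last `window` positions."""
    i = np.arange(seq_len)[:, None]  # query position
    j = np.arange(seq_len)[None, :]  # key position
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(6, window=3)
print(mask.astype(int))
# Row i has at most 3 ones (positions i-2..i); a full causal mask would grow with i.
```

In practice this mask replaces the plain causal mask inside attention: masked-out scores are set to -inf before the softmax, exactly as in the decoder block above.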
💡 Pro Tip: For most applications, don't train an LLM from scratch — it costs millions. Instead: (1) Use an API (GPT-4, Claude) for prototyping, (2) Fine-tune an open model (LLaMA, Mistral) with LoRA for production, (3) Only pretrain if you have domain-specific data that doesn't exist in public datasets.
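The LoRA approach mentioned in step (2) freezes the pretrained weight W and learns only a low-rank update B·A, scaled by α/r, so a tiny fraction of the parameters is trained. A toy numpy sketch (the dimensions, rank, and scaling are illustrative values, not any real model's):

```python
import numpy as np

np.random.seed(42)
d, r, alpha = 512, 8, 16           # hidden size, LoRA rank, scaling (toy values)

W = np.random.randn(d, d) * 0.02   # frozen pretrained weight
A = np.random.randn(r, d) * 0.01   # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, zero-init

def lora_forward(x):
    """y = x W^T + (alpha/r) * x A^T B^T  -- only A and B receive gradients."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = np.random.randn(2, d)
assert np.allclose(lora_forward(x), x @ W.T)  # B=0: starts identical to frozen model
print("trainable params:", A.size + B.size, "vs full layer:", W.size)
# → trainable params: 8192 vs full layer: 262144
```

Zero-initialising B is the standard trick: fine-tuning starts from exactly the pretrained behaviour, and here the trainable parameter count is ~3% of the full layer's.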
📋 Quick Reference
| Model | Params | Key Innovation | Open? |
|---|---|---|---|
| GPT-3 | 175B | In-context learning | No |
| LLaMA 2 | 7-70B | Open weights, GQA | Yes |
| Mistral 7B | 7B | Sliding window attn | Yes |
| GPT-4 | ~1.8T | MoE, RLHF | No |
| Gemini | Unknown | Multimodal native | Partial |
🎉 Lesson Complete!
You now understand the architecture powering modern AI assistants. Next, learn how these models convert text into tokens!