Lesson 24 • Advanced

Diffusion Models Explained

By the end of this lesson you'll be able to explain — and run code for — how DALL-E and Stable Diffusion turn random noise into images by learning to denoise, step by step.

What You'll Learn in This Lesson

✓ You'll be able to explain the forward (noising) process step by step
✓ You'll be able to describe the reverse (denoising) process and sampling
✓ You'll be able to say what the model predicts during training (the noise)
✓ You'll be able to explain classifier-free guidance and text conditioning
✓ You'll be able to describe latent diffusion and why Stable Diffusion is fast
✓ You'll be able to name what diffusion can generate (images, audio, more)

Before you start: A little Python helps, since the exercises run real code. If neural networks are new to you, skim Neural Networks first — but it isn't required to follow the ideas here.

🌍 Real-World Analogy: Sculpting from Static

Picture an old TV tuned to a dead channel — a screen full of fuzzy static. Now imagine a sculptor who can look at any patch of static and carefully chip away the randomness until a clear picture appears underneath. That is exactly what a diffusion model does: it sculpts an image out of noise.

During training the model watches the opposite happen thousands of times — real photos slowly dissolving into static — so it learns precisely what "noise" looks like at every stage. To generate something new you hand it a fresh screen of static and your text description, and it removes the noise a little at a time, guided by your words, until a finished image emerges.

Diffusion models power DALL-E 2 and 3, Stable Diffusion, Midjourney, and Imagen. They typically produce more diverse, higher-fidelity images than older GANs and are far more stable to train, and the same denoising recipe also generates audio, video, and 3D shapes.

1. The Forward Process: adding noise on purpose

The forward process takes clean data and adds a small amount of Gaussian noise (bell-curve random values) at each of many timesteps. Repeat it enough and the data becomes indistinguishable from pure noise. Crucially this process is fixed, not learned — it follows a preset noise schedule (called beta) that decides how much noise to add at each step.

Why deliberately destroy data? Because it creates perfect training pairs: at every step you know exactly how much noise you added, so you can teach a network to undo it. Run the worked example below to watch a tiny 5-value signal dissolve into noise.

Worked Example: The Forward (Noising) Process

Watch a clean signal dissolve into noise as Gaussian noise is added step by step

Python

# FORWARD PROCESS — turning a clean signal into noise, one step at a time.
# A diffusion model first DESTROYS data so it can later learn to REBUILD it.
# We use a tiny 1-D "signal" (think of it as 5 pixels) instead of an image.

import random
random.seed(0)                       # reproducible noise for the demo

signal = [0.9, 0.7, 0.5, 0.3, 0.1]   # the clean data we start from
steps = 4                            # how many times we add noise
noise_amount = 0.15                  # std-dev of 
...
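
Separately from the worked example above, here is a minimal sketch of how a noise schedule is commonly turned into a one-shot "jump to step t" formula. The linear range (1e-4 to 0.02), the 1,000 steps, and the helper names `linear_betas`, `cumulative_alphas`, and `noise_to_step` are assumptions chosen for illustration; they are not taken from this lesson's code.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced per-step noise variances (an assumed linear schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) up to step t: the signal variance left."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_to_step(x0, t, alpha_bars, rng):
    """Jump straight to step t: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    a = alpha_bars[t]
    noise = [rng.gauss(0, 1) for _ in x0]
    x_t = [math.sqrt(a) * v + math.sqrt(1 - a) * n for v, n in zip(x0, noise)]
    return x_t, noise

betas = linear_betas(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
signal = [0.9, 0.7, 0.5, 0.3, 0.1]

for t in (0, 249, 499, 999):
    x_t, _ = noise_to_step(signal, t, alpha_bars, rng)
    kept = math.sqrt(alpha_bars[t])
    print(f"t={t:4d}  signal kept={kept:.3f}  x_t={[round(v, 2) for v in x_t]}")
```

Because the schedule is fixed, any timestep can be reached in a single calculation instead of looping through every earlier step, which is what makes picking a random timestep during training cheap.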

2. The Reverse Process: denoising one step

The reverse process is the part the model actually learns. Given a noisy sample, the network predicts the noise that was added, and you subtract that prediction to get a cleaner sample. The network is usually a U-Net — an encoder-decoder shape with skip connections that's great at processing images — but for the idea you only need one move: predict the noise, then subtract it.

The example below fakes a decent noise prediction so you can see a single denoise step pull a noisy signal back toward the clean one.

Worked Example: One Denoise Step

Predict the noise, subtract it, and watch the noisy signal move toward clean

Python

# REVERSE PROCESS — one denoising step.
# The model's whole job is to PREDICT the noise that was added.
# Once we have that prediction, we SUBTRACT it to recover cleaner data.

import random
random.seed(1)

clean = [0.9, 0.7, 0.5, 0.3, 0.1]                 # the data we want back
true_noise = [random.gauss(0, 0.2) for _ in clean] # the noise reality added
noisy = [c + n for c, n in zip(clean, true_noise)] # what the model receives

# A REAL model is a neural net (a U-Net). Here we fake a decent 
...
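
"Predict the noise, then subtract it" is the core move, but in practice the subtraction is usually scaled by the noise level. The sketch below assumes the common formulation x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise and inverts it. The value `alpha_bar = 0.6`, the helper name `estimate_clean`, and the faked prediction are all hypothetical stand-ins for a trained network.

```python
import math
import random

def estimate_clean(x_t, predicted_noise, alpha_bar):
    """Invert x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise for x0."""
    scale_signal = math.sqrt(alpha_bar)
    scale_noise = math.sqrt(1.0 - alpha_bar)
    return [(x - scale_noise * e) / scale_signal
            for x, e in zip(x_t, predicted_noise)]

rng = random.Random(1)
clean = [0.9, 0.7, 0.5, 0.3, 0.1]
alpha_bar = 0.6                      # hypothetical: 60% of the signal variance remains
true_noise = [rng.gauss(0, 1) for _ in clean]
noisy = [math.sqrt(alpha_bar) * c + math.sqrt(1 - alpha_bar) * n
         for c, n in zip(clean, true_noise)]

# Stand-in for a trained network: the true noise plus a small error.
predicted = [n + rng.gauss(0, 0.05) for n in true_noise]

recovered = estimate_clean(noisy, predicted, alpha_bar)   # close to clean
exact = estimate_clean(noisy, true_noise, alpha_bar)      # equals clean (perfect prediction)

print("noisy    :", [round(v, 2) for v in noisy])
print("recovered:", [round(v, 2) for v in recovered])
print("exact    :", [round(v, 2) for v in exact])
```

With a perfect noise prediction the clean signal comes back exactly; any error in the prediction shows up as a proportionally scaled error in the recovered signal.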

3. Sampling: generating from pure noise

Sampling is how you create something new. You start from a screen of pure random noise and apply the denoise step over and over, walking backwards through the timesteps until a clean sample appears. More steps generally means higher quality but slower generation — early models used 1,000 steps, while DDIM and modern samplers get great results in 20-50.

The toy loop below starts from random noise and converges toward a target shape. A real model never sees the target; it learned that shape from millions of training images.

Worked Example: The Sampling Loop

Start from pure noise and denoise repeatedly until clean data emerges

Python

# SAMPLING — generate data from scratch by denoising over many steps.
# Start from PURE NOISE, then repeatedly predict-and-subtract toward clean data.

import random
random.seed(7)

target = [0.9, 0.7, 0.5, 0.3, 0.1]      # the "shape" we are trying to recover
x = [random.gauss(0, 1) for _ in target] # step 0 = pure random noise

print("start (pure noise):", [round(v, 2) for v in x])

total_steps = 8
for step in range(total_steps, 0, -1):
    # Toy "model": guess how far each value is from the t
...
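
To make the "fewer steps" point concrete, here is a hedged sketch of a deterministic DDIM-style loop that visits only 10 of 1,000 training timesteps. The linear schedule values, the names `oracle_noise` and `ddim_step`, and the choice of timesteps are assumptions for illustration. As in the toy loop above, the "model" cheats by knowing the target, which a real network never does.

```python
import math
import random

TRAIN_STEPS = 1000
# Assumed linear beta schedule; alpha_bars[t] is the signal variance left at step t.
betas = [1e-4 + i * (0.02 - 1e-4) / (TRAIN_STEPS - 1) for i in range(TRAIN_STEPS)]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

target = [0.9, 0.7, 0.5, 0.3, 0.1]

def oracle_noise(x_t, alpha_bar):
    """Toy stand-in for the network: it cheats by knowing the target."""
    return [(x - math.sqrt(alpha_bar) * c) / math.sqrt(1 - alpha_bar)
            for x, c in zip(x_t, target)]

def ddim_step(x_t, eps, ab_now, ab_next):
    """Estimate the clean sample, then re-noise it to the next (lower) noise level."""
    x0_hat = [(x - math.sqrt(1 - ab_now) * e) / math.sqrt(ab_now)
              for x, e in zip(x_t, eps)]
    return [math.sqrt(ab_next) * c + math.sqrt(1 - ab_next) * e
            for c, e in zip(x0_hat, eps)]

rng = random.Random(7)
x = [rng.gauss(0, 1) for _ in target]          # start from pure noise
timesteps = list(range(999, -1, -100))         # 999, 899, ..., 99: 10 of the 1,000 steps

for i, t in enumerate(timesteps):
    ab_now = alpha_bars[t]
    # After the last visited timestep, move to the fully clean level (alpha_bar = 1).
    ab_next = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
    x = ddim_step(x, oracle_noise(x, ab_now), ab_now, ab_next)
    print(f"t={t:3d} ->", [round(v, 2) for v in x])
```

The loop skips timesteps but still reads its noise levels from the same schedule the (pretend) model was trained with, which is why a short sampling run can work with a model trained on many more steps.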

4. Training: learning to predict the noise

Training is surprisingly simple. Take a clean sample, pick a random timestep, add the matching amount of Gaussian noise, and ask the network to predict that noise. The loss is the mean squared error between the predicted noise and the true noise. Because you generated the noise yourself, you always have a perfect answer to grade against — no human labels needed. This is self-supervised learning.

Run the example to compare an untrained model (predicts nothing) against a trained one (predicts the noise accurately) and see the loss drop.

Worked Example: The Training Objective (MSE on noise)

Compare an untrained vs trained noise prediction using mean squared error

Python

# TRAINING — the loss a diffusion model actually minimises.
# Goal: given a noisy sample, the model should PREDICT the added noise.
# Loss = mean squared error between predicted noise and true noise.

import random
random.seed(3)

clean = [0.5, 0.2, 0.8, 0.4]
true_noise = [random.gauss(0, 1) for _ in clean]
noisy = [c + n for c, n in zip(clean, true_noise)]

def mse(predicted, actual):
    diffs = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sum(diffs) / len(diffs)

bad_guess  = 
...
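
The recipe above (clean sample, random timestep, matching noise, MSE on the noise) can be sketched as a function that builds one training pair. This is an illustration under assumptions, not the lesson's own code: the linear schedule values and the names `make_alpha_bars` and `make_training_pair` are hypothetical, and the "model" is just a list of zeros standing in for an untrained network.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + i * (beta_end - beta_start) / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def make_training_pair(clean, alpha_bars, rng):
    """Pick a random timestep and noise the clean sample to that level."""
    t = rng.randrange(len(alpha_bars))
    a = alpha_bars[t]
    noise = [rng.gauss(0, 1) for _ in clean]
    noisy = [math.sqrt(a) * c + math.sqrt(1 - a) * n for c, n in zip(clean, noise)]
    return noisy, t, noise

def mse(predicted, actual):
    """Mean squared error between two equal-length lists."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

rng = random.Random(3)
alpha_bars = make_alpha_bars()
clean = [0.5, 0.2, 0.8, 0.4]

losses = []
for _ in range(2000):
    noisy, t, noise = make_training_pair(clean, alpha_bars, rng)
    zero_prediction = [0.0] * len(clean)   # untrained stand-in: always predicts "no noise"
    losses.append(mse(zero_prediction, noise))

average_loss = sum(losses) / len(losses)
print("average loss of the zero predictor:", round(average_loss, 3))
```

The average sits near 1.0 because standard Gaussian noise has variance 1. A real network would receive `noisy` and `t` as inputs, and training pushes its loss below that baseline.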

🎯 Your Turn: Add the Noise

Fill in the blanks to run the forward process with random.gauss

Python

# 🎯 YOUR TURN — run the FORWARD process: add Gaussian noise to a signal.
# Fill in every ___ . Run it and compare against the expected output.

import random
random.seed(0)

signal = [1.0, 0.6, 0.2]      # a tiny clean signal
steps = 3

x = list(signal)
for step in range(steps):
    # 1) Add Gaussian noise (mean 0, std 0.1) to EACH value.
    #    👉 use random.gauss(0, 0.1)
    x = [v + ___ for v in x]            # 👉 replace ___

    # 2) 👉 print the step number and the rounded values
    pr
...

🎯 Your Turn: One Denoise Step

Subtract the predicted noise yourself to recover cleaner data

Python

# 🎯 YOUR TURN — run ONE reverse step: subtract the predicted noise.
# Fill in every ___ . The model's prediction is given to you.

noisy           = [1.1, 0.4, 0.9]   # what the model receives
predicted_noise = [0.2, 0.1, 0.3]   # what the model thinks the noise is

# 1) Denoise = noisy value MINUS predicted noise, value by value.
#    👉 subtract p from x for each pair
denoised = [x - ___ for x, p in zip(noisy, predicted_noise)]   # 👉 replace ___

# 2) 👉 print the denoised list
print("denois
...

5. Conditioning & Classifier-Free Guidance

So far the model denoises toward any realistic image. To get the image you want, you condition the denoising on a text prompt. The prompt is encoded (often by a model like CLIP) and fed into the U-Net through cross-attention, so every denoise step is nudged toward "a cosy cabin in a snowy forest" rather than just anything.

Classifier-free guidance (CFG) makes the prompt count more. At each step the model predicts noise twice — once with the prompt and once with no prompt — then exaggerates the difference between them. The guidance scale controls how much: around 7-8 is the usual sweet spot.

How the guidance scale behaves:

Too low (≈1) → guidance is effectively off, so the image follows your prompt only loosely
Sweet spot (≈7-8) → follows the prompt and stays realistic
Too high (≈20) → oversaturated, distorted, "fried" images
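
The "predict twice, then exaggerate the difference" step can be sketched in a few lines. The combination rule below is the standard one (unconditional prediction plus scale times the difference); the function name `apply_guidance` and the two noise vectors are made-up values for illustration, not real model outputs.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]

noise_uncond = [0.10, -0.20, 0.30]   # hypothetical prediction with an empty prompt
noise_cond = [0.15, -0.10, 0.20]     # hypothetical prediction with the text prompt

for scale in (0.0, 1.0, 7.5, 20.0):
    guided = apply_guidance(noise_uncond, noise_cond, scale)
    print(f"scale={scale:4.1f} ->", [round(v, 3) for v in guided])
```

At scale 0 the prompt is dropped entirely, at scale 1 the result is exactly the prompt-conditioned prediction with no extra push, and larger scales extrapolate further along the direction the prompt adds, which is where the oversaturated look at very high values comes from.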

6. Latent Diffusion: how Stable Diffusion stays fast

Denoising every pixel of a 512×512 image, hundreds of times, is slow. Latent diffusion fixes this by doing the whole process in a small latent space. During training, an autoencoder compresses each image down to a tiny representation (e.g. 64×64) and the diffusion model learns to denoise there. At generation time, denoising starts from random noise in that latent space, and a decoder then expands the cleaned latent back to a full image.

This is the trick behind Stable Diffusion: it denoises something roughly 48× smaller, so it runs on a consumer GPU instead of a data centre. The text conditioning and guidance you just learned happen inside that latent space.
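
The "roughly 48×" figure can be checked with a little arithmetic. The sketch assumes a 3-channel RGB image and a 4-channel latent, as in Stable Diffusion v1; the channel counts are not stated in this lesson, so treat them as assumptions.

```python
def num_values(height, width, channels):
    """Total number of values the denoiser has to process."""
    return height * width * channels

pixel_values = num_values(512, 512, 3)    # full-resolution RGB image
latent_values = num_values(64, 64, 4)     # assumed 4-channel latent (Stable Diffusion v1)

print("pixel values :", pixel_values)                   # 786432
print("latent values:", latent_values)                  # 16384
print("ratio        :", pixel_values / latent_values)   # 48.0
```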

The worked example below shows the real Hugging Face diffusers API. It needs diffusers, torch, and a GPU, so treat it as a preview of the real code — the expected output is shown in a comment.

Worked Example: Stable Diffusion with diffusers

Generate an image from a text prompt using the real latent-diffusion API

Python

# REAL diffusion in practice: Stable Diffusion via Hugging Face 'diffusers'.
# (Needs the diffusers + torch libraries and a GPU; shown to illustrate the API.)

from diffusers import StableDiffusionPipeline
import torch

# Load a pretrained LATENT diffusion model. "Latent" means it denoises in a
# small compressed space, not full-resolution pixels — that is why it's fast.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = 
...

🎨 Beyond Images: What Diffusion Can Generate

The predict-the-noise recipe works on anything you can add noise to. Swap the data and the network shape, keep the forward/reverse idea, and you get a generator for a new domain:

Images — DALL-E, Stable Diffusion, Midjourney, Imagen.
Audio & music — AudioLDM and Stable Audio denoise compressed sound representations.
Video — denoise across frames for short clips (Sora-style systems build on this).
3D & molecules — generate shapes and even candidate drug molecules.

Common Errors (And How to Fix Them)

❌ Output is blurry or noisy — too few sampling steps

You set num_inference_steps very low, so denoising didn't finish.

✅ Fix:

# Give the sampler enough steps to fully denoise.
pipe(prompt, num_inference_steps=30)   # 20-50 is a good range
# DDIM/DPM samplers look great at ~25-30; raw DDPM may need more.

❌ Image ignores the prompt OR looks "fried" — wrong guidance scale

Guidance that is too low follows your words only loosely; too high oversaturates and distorts.

✅ Fix:

# Stay near the sweet spot.
pipe(prompt, guidance_scale=7.5)   # 7-8 = balanced
# guidance_scale=1   -> guidance off, prompt followed only loosely
# guidance_scale=20  -> oversaturated, distorted output

❌ Garbage samples — train/sample noise-schedule mismatch

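You sampled with a noise schedule (beta range, schedule type, or prediction target) that differs from the one the model was trained on. Using fewer sampling steps is fine; changing the underlying schedule is not.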

✅ Fix:

# The sampler's noise schedule MUST match the one used in training.
# Swapping samplers is fine if you build the new one from the pipeline's
# existing config, which carries the training schedule over:
from diffusers import DDIMScheduler
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

❌ "CUDA out of memory"

Pixel-space models or large batches may not fit in GPU memory; latent diffusion plus half precision helps.

✅ Fix:

# Load in fp16 (torch_dtype=torch.float16 in from_pretrained) and generate one image at a time.
pipe = pipe.to("cuda")
pipe.enable_attention_slicing()   # trades a little speed for less memory

📋 Quick Reference

Term	What It Means	In One Line
Forward process	Adds noise to data over many steps	Fixed, not learned
Reverse process	Removes noise step by step	What the network learns
Noise schedule (beta)	How much noise per timestep	Linear or cosine
Training target	Predict the added noise	MSE loss, self-supervised
U-Net	Encoder-decoder network	Predicts the noise
Sampling steps	Denoise iterations to generate	20-50 (DDIM) vs 1,000 (DDPM)
Conditioning	Guide output with a prompt	Cross-attention on text
Classifier-free guidance	Strengthen the prompt	Scale ≈ 7-8
Latent diffusion	Diffuse in compressed space	Why SD is fast

Pro tip: Don't confuse the noise schedule with the learning rate. The noise schedule (beta) is fixed before training and decides how much noise the forward process adds at each step. The learning rate is a separate optimiser setting that controls how fast the network's weights update. They're unrelated knobs.
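
The Quick Reference lists "linear or cosine" schedules. As a rough sketch of how the two differ, the code below compares how much signal variance (alpha_bar) each keeps at a few timesteps. The linear range (1e-4 to 0.02), the cosine offset `s=0.008`, and the function names are assumptions based on commonly used defaults, not values from this lesson.

```python
import math

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """alpha_bar for evenly spaced betas (assumed linear schedule)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + i * (beta_end - beta_start) / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def cosine_alpha_bars(num_steps, s=0.008):
    """alpha_bar defined directly by a squared-cosine curve (assumed cosine schedule)."""
    def f(t):
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, num_steps + 1)]

steps = 1000
linear = linear_alpha_bars(steps)
cosine = cosine_alpha_bars(steps)

for t in (99, 249, 499, 749, 999):
    print(f"t={t:3d}  linear alpha_bar={linear[t]:.4f}  cosine alpha_bar={cosine[t]:.4f}")
```

Under these assumed settings the cosine curve keeps noticeably more signal through the middle timesteps, while the linear one destroys it faster. Either way, these numbers are fixed before training and have nothing to do with the optimiser's learning rate.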

❓ Frequently Asked Questions

Q: What is a diffusion model in simple terms?

A: A diffusion model is a generative AI that learns to create data by reversing a noising process. During training it watches clean data slowly turn into random noise, and learns to predict the noise at each step. To generate something new it starts from pure noise and removes the predicted noise step by step until a clean image, audio clip, or other sample emerges.

Q: What is the difference between the forward and reverse process?

A: The forward (noising) process is fixed, not learned — it just adds a little Gaussian noise to the data over many steps until nothing is left but noise. The reverse (denoising) process is what the neural network learns: starting from noise, it predicts and subtracts the noise step by step to rebuild realistic data. Forward destroys; reverse creates.

Q: What does the model actually predict during training?

A: It predicts the noise. You take a clean sample, add a known amount of Gaussian noise at a random timestep, and ask the network to output that noise. The training loss is the mean squared error between the predicted noise and the true noise. Because the noise is known, you have a perfect target — no labels required.

Q: What is classifier-free guidance and the guidance scale?

A: Classifier-free guidance steers generation toward your text prompt without a separate classifier. The model runs twice, once with the prompt and once unconditionally, and pushes the result in the direction the prompt adds. The guidance scale controls how hard it pushes: around 7-8 is the sweet spot, too low follows the prompt only loosely, and too high causes oversaturated, distorted images.

Q: What is latent diffusion and why is Stable Diffusion fast?

A: Latent diffusion runs the whole noising and denoising process inside a small compressed 'latent' space instead of on full-resolution pixels. An autoencoder shrinks the image first, diffusion happens on that tiny representation, then a decoder expands it back to a full image. This is why Stable Diffusion runs on a consumer GPU while pixel-space diffusion is far slower.

Q: What can diffusion models generate besides images?

A: The same predict-the-noise recipe works on any data you can add noise to. Beyond images (DALL-E, Stable Diffusion, Midjourney) it powers audio and music generation (AudioLDM, Stable Audio), video, 3D shapes, and even molecule design. You change the data and the network architecture, but the forward/reverse diffusion idea stays the same.

🎯 Mini Challenge: Noise It, Then Denoise It

Time to fade the scaffolding. Add noise to a clean signal over a few steps, then take one denoise step using a noise prediction you compute yourself, and show that you land back near the original. The starter gives you only a comment outline — write the logic yourself.

Mini Challenge

Noise a signal, then denoise it back toward the original

Python

# 🎯 MINI-CHALLENGE: noise then denoise a signal
# Brief: add noise to a clean signal over a few steps, then take ONE denoise
# step using a "predicted noise" you compute, and print how close you got.
#
# 1. import random; random.seed(0)
# 2. clean = [0.8, 0.5, 0.2]  and an empty list noise_added = []
# 3. Loop 3 times: pick n = random.gauss(0, 0.1), append n to noise_added,
#    and add n to each value of a working copy 'noisy'
# 4. predicted = the SUM of noise added to each position (a perfect
...

🎉

Lesson complete — you can now explain how diffusion models work!

You learned that the forward process adds Gaussian noise to data, that the network is trained to predict that noise with an MSE loss, and that sampling starts from pure noise and denoises step by step. You also saw how text conditioning plus classifier-free guidance steer the output, and how latent diffusion makes Stable Diffusion fast enough for a consumer GPU.

🚀 Up next: LLM Architecture — see how the same "predict the next piece" idea powers large language models.
