Lesson 24 • Advanced
Diffusion Models Explained
By the end of this lesson you'll be able to explain — and run code for — how DALL-E and Stable Diffusion turn random noise into images by learning to denoise, step by step.
What You'll Learn in This Lesson
- ✓ You'll be able to explain the forward (noising) process step by step
- ✓ You'll be able to describe the reverse (denoising) process and sampling
- ✓ You'll be able to say what the model predicts during training (the noise)
- ✓ You'll be able to explain classifier-free guidance and text conditioning
- ✓ You'll be able to describe latent diffusion and why Stable Diffusion is fast
- ✓ You'll be able to name what diffusion can generate (images, audio, more)
🌍 Real-World Analogy: Sculpting from Static
Picture an old TV tuned to a dead channel — a screen full of fuzzy static. Now imagine a sculptor who can look at any patch of static and carefully chip away the randomness until a clear picture appears underneath. That is exactly what a diffusion model does: it sculpts an image out of noise.
During training the model watches the opposite happen thousands of times — real photos slowly dissolving into static — so it learns precisely what "noise" looks like at every stage. To generate something new you hand it a fresh screen of static and your text description, and it removes the noise a little at a time, guided by your words, until a finished image emerges.
Diffusion models power DALL-E, Stable Diffusion, Midjourney, and Imagen. They produce higher-quality images than older GANs with far more stable training, and the same denoising recipe also generates audio, video, and 3D shapes.
1. The Forward Process: adding noise on purpose
The forward process takes clean data and adds a small amount of Gaussian noise (bell-curve random values) at each of many timesteps. Repeat it enough and the data becomes indistinguishable from pure noise. Crucially this process is fixed, not learned — it follows a preset noise schedule (called beta) that decides how much noise to add at each step.
Why deliberately destroy data? Because it creates perfect training pairs: at every step you know exactly how much noise you added, so you can teach a network to undo it. Run the worked example below to watch a tiny 5-value signal dissolve into noise.
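Because the schedule is fixed, you do not have to add noise one step at a time: the noisy sample at any timestep can be computed in a single jump. Here is a minimal sketch of that shortcut, assuming a linear beta schedule from 1e-4 to 0.02 over 1,000 steps (common DDPM-style defaults, not values taken from this lesson). The helper names `linear_betas`, `cumulative_alphas`, and `noise_to_step` are made up for illustration.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced noise variances, one per timestep (assumed values)."""
    span = beta_end - beta_start
    return [beta_start + span * i / (num_steps - 1) for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) up to t: the share of signal left."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_to_step(x0, t, alpha_bars, rng):
    """Jump straight to timestep t: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    ab = alpha_bars[t]
    eps = [rng.gauss(0, 1) for _ in x0]
    x_t = [math.sqrt(ab) * v + math.sqrt(1 - ab) * e for v, e in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
signal = [0.9, 0.7, 0.5, 0.3, 0.1]
alpha_bars = cumulative_alphas(linear_betas(1000))

# Early, middle, and final timestep: the signal share shrinks toward zero.
for t in (0, 250, 999):
    x_t, _ = noise_to_step(signal, t, alpha_bars, rng)
    print(t, round(alpha_bars[t], 4), [round(v, 2) for v in x_t])
```

By the last timestep `alpha_bar` is close to zero, so almost nothing of the original signal remains and the sample is effectively pure Gaussian noise.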
Worked Example: The Forward (Noising) Process
Watch a clean signal dissolve into noise as Gaussian noise is added step by step
# FORWARD PROCESS — turning a clean signal into noise, one step at a time.
# A diffusion model first DESTROYS data so it can later learn to REBUILD it.
# We use a tiny 1-D "signal" (think of it as 5 pixels) instead of an image.
import random
random.seed(0) # reproducible noise for the demo
signal = [0.9, 0.7, 0.5, 0.3, 0.1] # the clean data we start from
steps = 4 # how many times we add noise
noise_amount = 0.15 # std-dev of
...
2. The Reverse Process: denoising one step
The reverse process is the part the model actually learns. Given a noisy sample, the network predicts the noise that was added, and you subtract that prediction to get a cleaner sample. The network is usually a U-Net — an encoder-decoder shape with skip connections that's great at processing images — but for the idea you only need one move: predict the noise, then subtract it.
The example below fakes a decent noise prediction so you can see a single denoise step pull a noisy signal back toward the clean one.
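"Predict, then subtract" is a simplification. In DDPM-style models the subtraction is scaled by the noise schedule, because the noisy sample mixes signal and noise in known proportions. The sketch below inverts that mix to estimate the clean data. It assumes a single cumulative signal fraction `alpha_bar = 0.6` and fakes the network with the true noise plus a small error; `estimate_clean` is a hypothetical helper name.

```python
import math
import random

def estimate_clean(x_t, predicted_noise, alpha_bar):
    """Invert x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps for x0, using predicted eps."""
    scale_signal = math.sqrt(alpha_bar)
    scale_noise = math.sqrt(1.0 - alpha_bar)
    return [(x - scale_noise * e) / scale_signal for x, e in zip(x_t, predicted_noise)]

rng = random.Random(1)
clean = [0.9, 0.7, 0.5, 0.3, 0.1]
alpha_bar = 0.6  # assumed cumulative signal fraction at this timestep
true_noise = [rng.gauss(0, 1) for _ in clean]
noisy = [math.sqrt(alpha_bar) * c + math.sqrt(1 - alpha_bar) * n
         for c, n in zip(clean, true_noise)]

# Stand-in for a trained U-Net: the true noise with a small error on each value.
predicted = [n + rng.gauss(0, 0.05) for n in true_noise]

recovered = estimate_clean(noisy, predicted, alpha_bar)  # close to clean
perfect = estimate_clean(noisy, true_noise, alpha_bar)   # exact, up to rounding
print("noisy:    ", [round(v, 2) for v in noisy])
print("recovered:", [round(v, 2) for v in recovered])
```

The better the noise prediction, the closer the estimate lands to the clean signal; a perfect prediction recovers it exactly.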
Worked Example: One Denoise Step
Predict the noise, subtract it, and watch the noisy signal move toward clean
# REVERSE PROCESS — one denoising step.
# The model's whole job is to PREDICT the noise that was added.
# Once we have that prediction, we SUBTRACT it to recover cleaner data.
import random
random.seed(1)
clean = [0.9, 0.7, 0.5, 0.3, 0.1] # the data we want back
true_noise = [random.gauss(0, 0.2) for _ in clean] # the noise reality added
noisy = [c + n for c, n in zip(clean, true_noise)] # what the model receives
# A REAL model is a neural net (a U-Net). Here we fake a decent
...
3. Sampling: generating from pure noise
Sampling is how you create something new. You start from a screen of pure random noise and apply the denoise step over and over, walking backwards through the timesteps until a clean sample appears. More steps generally means higher quality but slower generation — early models used 1,000 steps, while DDIM and modern samplers get great results in 20-50.
The toy loop below starts from random noise and converges toward a target shape. A real model never sees the target; it learned that shape from millions of training images.
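To make the "fewer steps" idea concrete, here is a sketch of a deterministic DDIM-style loop that visits only 25 of 1,000 training timesteps. It assumes a linear beta schedule (1e-4 to 0.02) and replaces the network with `oracle_noise`, a hypothetical stand-in that peeks at the target, which a real model can never do.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction per timestep for an assumed linear schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def oracle_noise(x_t, alpha_bar, target):
    """Toy stand-in for the network: it peeks at the target, a real model cannot."""
    return [(x - math.sqrt(alpha_bar) * g) / math.sqrt(1 - alpha_bar)
            for x, g in zip(x_t, target)]

def ddim_step(x_t, eps, ab_t, ab_prev):
    """Deterministic DDIM update from noise level ab_t to the cleaner ab_prev."""
    x0_hat = [(x - math.sqrt(1 - ab_t) * e) / math.sqrt(ab_t) for x, e in zip(x_t, eps)]
    return [math.sqrt(ab_prev) * x0 + math.sqrt(1 - ab_prev) * e
            for x0, e in zip(x0_hat, eps)]

train_steps, sample_steps = 1000, 25
alpha_bars = make_alpha_bars(train_steps)
# Visit only every 40th training timestep, from noisiest to cleanest.
timesteps = list(range(train_steps - 1, -1, -(train_steps // sample_steps)))

rng = random.Random(7)
target = [0.9, 0.7, 0.5, 0.3, 0.1]
x = [rng.gauss(0, 1) for _ in target]  # start from pure noise

for i, t in enumerate(timesteps):
    eps = oracle_noise(x, alpha_bars[t], target)
    # After the last visited timestep, step to a fully clean sample (alpha_bar = 1).
    ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
    x = ddim_step(x, eps, alpha_bars[t], ab_prev)

print("sampled:", [round(v, 2) for v in x])
```

With a perfect oracle the loop lands on the target regardless of how many steps it takes. With a real, imperfect network, each step corrects part of the previous step's error, which is why the step count affects quality.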
Worked Example: The Sampling Loop
Start from pure noise and denoise repeatedly until clean data emerges
# SAMPLING — generate data from scratch by denoising over many steps.
# Start from PURE NOISE, then repeatedly predict-and-subtract toward clean data.
import random
random.seed(7)
target = [0.9, 0.7, 0.5, 0.3, 0.1] # the "shape" we are trying to recover
x = [random.gauss(0, 1) for _ in target] # step 0 = pure random noise
print("start (pure noise):", [round(v, 2) for v in x])
total_steps = 8
for step in range(total_steps, 0, -1):
# Toy "model": guess how far each value is from the t
...
4. Training: learning to predict the noise
Training is surprisingly simple. Take a clean sample, pick a random timestep, add the matching amount of Gaussian noise, and ask the network to predict that noise. The loss is the mean squared error between the predicted noise and the true noise. Because you generated the noise yourself, you always have a perfect answer to grade against — no human labels needed. This is self-supervised learning.
Run the example to compare an untrained model (predicts nothing) against a trained one (predicts the noise accurately) and see the loss drop.
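The recipe above can also be run as an actual training loop. The sketch below is heavily simplified and rests on stated assumptions: the "dataset" is a single number, the noise level is fixed at `alpha_bar = 0.5` instead of a random timestep, and the "network" is a two-parameter line `w * x_t + b` trained by plain stochastic gradient descent.

```python
import math
import random

rng = random.Random(3)
clean_value = 0.8     # the whole "dataset" is this one number (toy assumption)
alpha_bar = 0.5       # one fixed noise level instead of random timesteps
w, b = 0.0, 0.0       # untrained model: always predicts zero noise
learning_rate = 0.02

def make_pair():
    """Build one training pair: a noisy input and the noise that produced it."""
    eps = rng.gauss(0, 1)
    x_t = math.sqrt(alpha_bar) * clean_value + math.sqrt(1 - alpha_bar) * eps
    return x_t, eps

def average_loss(w, b, pairs):
    """Mean squared error between predicted and true noise."""
    return sum((w * x + b - e) ** 2 for x, e in pairs) / len(pairs)

held_out = [make_pair() for _ in range(200)]
loss_before = average_loss(w, b, held_out)

for _ in range(3000):
    x_t, eps = make_pair()
    error = (w * x_t + b) - eps             # predicted noise minus true noise
    w -= learning_rate * 2 * error * x_t    # gradient of error**2 with respect to w
    b -= learning_rate * 2 * error          # gradient of error**2 with respect to b

loss_after = average_loss(w, b, held_out)
print(f"loss before: {loss_before:.3f}  after: {loss_after:.6f}")
print(f"learned w={w:.3f}, b={b:.3f}")
```

No labels appear anywhere: every target is noise the script generated itself. In this toy setup the exact answer is a straight line, so the model can drive the loss essentially to zero; a real network on real images can only approximate the noise.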
Worked Example: The Training Objective (MSE on noise)
Compare an untrained vs trained noise prediction using mean squared error
# TRAINING — the loss a diffusion model actually minimises.
# Goal: given a noisy sample, the model should PREDICT the added noise.
# Loss = mean squared error between predicted noise and true noise.
import random
random.seed(3)
clean = [0.5, 0.2, 0.8, 0.4]
true_noise = [random.gauss(0, 1) for _ in clean]
noisy = [c + n for c, n in zip(clean, true_noise)]
def mse(predicted, actual):
diffs = [(p - a) ** 2 for p, a in zip(predicted, actual)]
return sum(diffs) / len(diffs)
bad_guess =
...
🎯 Your Turn: Add the Noise
Fill in the blanks to run the forward process with random.gauss
# 🎯 YOUR TURN — run the FORWARD process: add Gaussian noise to a signal.
# Fill in every ___ . Run it and compare against the expected output.
import random
random.seed(0)
signal = [1.0, 0.6, 0.2] # a tiny clean signal
steps = 3
x = list(signal)
for step in range(steps):
# 1) Add Gaussian noise (mean 0, std 0.1) to EACH value.
# 👉 use random.gauss(0, 0.1)
x = [v + ___ for v in x] # 👉 replace ___
# 2) 👉 print the step number and the rounded values
pr
...
🎯 Your Turn: One Denoise Step
Subtract the predicted noise yourself to recover cleaner data
# 🎯 YOUR TURN — run ONE reverse step: subtract the predicted noise.
# Fill in every ___ . The model's prediction is given to you.
noisy = [1.1, 0.4, 0.9] # what the model receives
predicted_noise = [0.2, 0.1, 0.3] # what the model thinks the noise is
# 1) Denoise = noisy value MINUS predicted noise, value by value.
# 👉 subtract p from x for each pair
denoised = [x - ___ for x, p in zip(noisy, predicted_noise)] # 👉 replace ___
# 2) 👉 print the denoised list
print("denois
...
5. Conditioning & Classifier-Free Guidance
So far the model denoises toward any realistic image. To get the image you want, you condition the denoising on a text prompt. The prompt is encoded (often by a model like CLIP) and fed into the U-Net through cross-attention, so every denoise step is nudged toward "a cosy cabin in a snowy forest" rather than just anything.
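Cross-attention can be sketched in a few lines: each image position scores every prompt token, turns the scores into weights, and takes a weighted mix of the token values. This is an illustration only. The 2-D embeddings are made-up numbers, and the learned projection matrices a real U-Net applies to queries, keys, and values are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_queries, text_keys, text_values):
    """Each image position takes a weighted mix of the text token values."""
    dim = len(text_keys[0])
    outputs, all_weights = [], []
    for q in image_queries:
        # Scaled dot product between this position and every prompt token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in text_keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, text_values))
                 for j in range(len(text_values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical embeddings for three prompt tokens: "cabin", "snowy", "forest".
text_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_values = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.7]]
image_queries = [[2.0, 0.0], [0.0, 2.0]]  # two latent positions

outputs, weights = cross_attention(image_queries, text_keys, text_values)
for position, w in enumerate(weights):
    print("position", position, "attends with weights", [round(v, 2) for v in w])
```

The first position puts more weight on "cabin" than on "snowy" and the second does the opposite, which is how different regions of the image can be steered by different words.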
Classifier-free guidance (CFG) makes the prompt count more. At each step the model predicts noise twice — once with the prompt and once with no prompt — then exaggerates the difference between them. The guidance scale controls how much: around 7-8 is the usual sweet spot.
How the guidance scale behaves:
- Too low (≈1) → no extra push toward the prompt, so it is followed only loosely
- Sweet spot (≈7-8) → follows the prompt and stays realistic
- Too high (≈20) → oversaturated, distorted, "fried" images
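The "exaggerate the difference" step is a one-line formula: start from the unconditional prediction and move along the direction the prompt adds, multiplied by the guidance scale. A minimal sketch with made-up noise predictions follows; `apply_guidance` is a hypothetical helper name, not a library function.

```python
def apply_guidance(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: start at the unconditional prediction and
    move along the direction the prompt adds, scaled by guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Made-up noise predictions for three values of one sample.
eps_uncond = [0.10, -0.20, 0.05]  # prediction with an empty prompt
eps_cond = [0.30, -0.10, 0.00]    # prediction with the text prompt

for scale in (0.0, 1.0, 7.5, 20.0):
    guided = apply_guidance(eps_uncond, eps_cond, scale)
    print(scale, [round(v, 2) for v in guided])
```

At scale 0 the prompt has no effect, and at scale 1 the result is simply the prompt-conditioned prediction. Only values above 1 push beyond it, and very large values push the prediction far outside the range the model saw in training, which is one way to read the "fried" look.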
6. Latent Diffusion: how Stable Diffusion stays fast
Denoising every pixel of a 512×512 image, hundreds of times, is slow. Latent diffusion fixes this by doing the whole process in a small latent space. An autoencoder first compresses the image down to a tiny representation (e.g. 64×64), diffusion runs there, and then a decoder expands the cleaned latent back to a full image.
This is the trick behind Stable Diffusion: it denoises something roughly 48× smaller, so it runs on a consumer GPU instead of a data centre. The text conditioning and guidance you just learned happen inside that latent space.
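The 48× figure comes from counting values rather than pixels per side. The arithmetic below assumes a 3-channel RGB image and a 4-channel latent that is 8× smaller along each side, which is the layout Stable Diffusion v1 uses; the channel counts are not stated in this lesson.

```python
def element_count(height, width, channels):
    """Number of values the denoiser has to process for one sample."""
    return height * width * channels

pixel_values = element_count(512, 512, 3)   # RGB image
latent_values = element_count(64, 64, 4)    # assumed 4-channel latent

ratio = pixel_values / latent_values
print(pixel_values, latent_values, ratio)   # 786432 16384 48.0
```

Every denoising step touches 48 times fewer values, and that saving repeats at each of the 20 to 50 sampling steps.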
The worked example below shows the real Hugging Face diffusers API. It needs diffusers, torch, and a GPU, so treat it as a preview of the real code — the expected output is shown in a comment.
Worked Example: Stable Diffusion with diffusers
Generate an image from a text prompt using the real latent-diffusion API
# REAL diffusion in practice: Stable Diffusion via Hugging Face 'diffusers'.
# (Needs the diffusers + torch libraries and a GPU; shown to illustrate the API.)
from diffusers import StableDiffusionPipeline
import torch
# Load a pretrained LATENT diffusion model. "Latent" means it denoises in a
# small compressed space, not full-resolution pixels — that is why it's fast.
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
)
pipe =
...
🎨 Beyond Images: What Diffusion Can Generate
The predict-the-noise recipe works on anything you can add noise to. Swap the data and the network shape, keep the forward/reverse idea, and you get a generator for a new domain:
- Images: DALL-E, Stable Diffusion, Midjourney, Imagen.
- Audio & music: AudioLDM and Stable Audio denoise compressed sound representations.
- Video: denoise across frames for short clips (Sora-style systems build on this).
- 3D & molecules: generate shapes and even candidate drug molecules.
Common Errors (And How to Fix Them)
❌ Output is blurry or noisy — too few sampling steps
You set num_inference_steps very low, so denoising didn't finish.
✅ Fix:
# Give the sampler enough steps to fully denoise.
pipe(prompt, num_inference_steps=30)  # 20-50 is a good range
# DDIM/DPM samplers look great at ~25-30; raw DDPM may need more.
❌ Image ignores the prompt OR looks "fried" — wrong guidance scale
Guidance too low ignores your words; too high oversaturates and distorts.
✅ Fix:
# Stay near the sweet spot.
pipe(prompt, guidance_scale=7.5)  # 7-8 = balanced
# guidance_scale=1  -> no extra guidance, prompt followed only loosely
# guidance_scale=20 -> oversaturated, distorted output
❌ Garbage samples — train/sample noise-schedule mismatch
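The sampler assumed a different noise schedule (beta range or number of training timesteps) than the model was trained with. Using fewer inference steps is fine; changing the underlying schedule is not.
✅ Fix:
# The sampler must use the noise schedule the model was trained with.
# When you swap samplers, build the new one from the pipeline's existing
# scheduler config so it inherits that schedule:
from diffusers import DDIMScheduler
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)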
❌ "CUDA out of memory"
Pixel-space or large batches won't fit; latent diffusion plus half precision helps.
✅ Fix:
# Use fp16 and generate one image at a time.
pipe = pipe.to("cuda")
pipe.enable_attention_slicing()  # trades a little speed for less memory
📋 Quick Reference
| Term | What It Means | In One Line |
|---|---|---|
| Forward process | Adds noise to data over many steps | Fixed, not learned |
| Reverse process | Removes noise step by step | What the network learns |
| Noise schedule (beta) | How much noise per timestep | Linear or cosine |
| Training target | Predict the added noise | MSE loss, self-supervised |
| U-Net | Encoder-decoder network | Predicts the noise |
| Sampling steps | Denoise iterations to generate | 20-50 (DDIM) vs 1000 |
| Conditioning | Guide output with a prompt | Cross-attention on text |
| Classifier-free guidance | Strengthen the prompt | Scale ≈ 7-8 |
| Latent diffusion | Diffuse in compressed space | Why SD is fast |
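The "Linear or cosine" entry in the table can be made concrete by comparing how much signal each schedule leaves at a given timestep. This is a sketch under assumptions: the linear schedule uses betas from 1e-4 to 0.02, and the cosine schedule uses the squared-cosine curve with a small offset of 0.008 as proposed for improved DDPMs. Both function names are hypothetical.

```python
import math

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Signal fraction left after each step of an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def cosine_alpha_bars(num_steps, offset=0.008):
    """Signal fraction defined directly by a squared-cosine curve."""
    def f(t):
        return math.cos(((t / num_steps + offset) / (1 + offset)) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, num_steps + 1)]

num_steps = 1000
linear = linear_alpha_bars(num_steps)
cosine = cosine_alpha_bars(num_steps)

# Columns: timestep, signal left (linear), signal left (cosine).
for t in (100, 500, 900):
    print(t, round(linear[t - 1], 3), round(cosine[t - 1], 3))
```

With these assumed settings the cosine curve keeps more signal through the middle timesteps, while the linear schedule destroys most of it well before the end.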
❓ Frequently Asked Questions
Q: What is a diffusion model in simple terms?
A: A diffusion model is a generative AI that learns to create data by reversing a noising process. During training it watches clean data slowly turn into random noise, and learns to predict the noise at each step. To generate something new it starts from pure noise and removes the predicted noise step by step until a clean image, audio clip, or other sample emerges.
Q: What is the difference between the forward and reverse process?
A: The forward (noising) process is fixed, not learned — it just adds a little Gaussian noise to the data over many steps until nothing is left but noise. The reverse (denoising) process is what the neural network learns: starting from noise, it predicts and subtracts the noise step by step to rebuild realistic data. Forward destroys; reverse creates.
Q: What does the model actually predict during training?
A: It predicts the noise. You take a clean sample, add a known amount of Gaussian noise at a random timestep, and ask the network to output that noise. The training loss is the mean squared error between the predicted noise and the true noise. Because the noise is known, you have a perfect target — no labels required.
Q: What is classifier-free guidance and the guidance scale?
A: Classifier-free guidance steers generation toward your text prompt without a separate classifier. The model runs twice — once with the prompt and once unconditionally — and pushes the result in the direction the prompt adds. The guidance scale controls how hard it pushes: around 7-8 is the sweet spot, too low ignores the prompt, and too high causes oversaturated, distorted images.
Q: What is latent diffusion and why is Stable Diffusion fast?
A: Latent diffusion runs the whole noising and denoising process inside a small compressed 'latent' space instead of on full-resolution pixels. An autoencoder shrinks the image first, diffusion happens on that tiny representation, then a decoder expands it back to a full image. This is why Stable Diffusion runs on a consumer GPU while pixel-space diffusion is far slower.
Q: What can diffusion models generate besides images?
A: The same predict-the-noise recipe works on any data you can add noise to. Beyond images (DALL-E, Stable Diffusion, Midjourney) it powers audio and music generation (AudioLDM, Stable Audio), video, 3D shapes, and even molecule design. You change the data and the network architecture, but the forward/reverse diffusion idea stays the same.
🎯 Mini Challenge: Noise It, Then Denoise It
Time to fade the scaffolding. Add noise to a clean signal over a few steps, then take one denoise step using a noise prediction you compute yourself, and show that you land back near the original. The starter gives you only a comment outline — write the logic yourself.
Mini Challenge
Noise a signal, then denoise it back toward the original
# 🎯 MINI-CHALLENGE: noise then denoise a signal
# Brief: add noise to a clean signal over a few steps, then take ONE denoise
# step using a "predicted noise" you compute, and print how close you got.
#
# 1. import random; random.seed(0)
# 2. clean = [0.8, 0.5, 0.2] and an empty list noise_added = []
# 3. Loop 3 times: pick n = random.gauss(0, 0.1), append n to noise_added,
# and add n to each value of a working copy 'noisy'
# 4. predicted = the SUM of noise added to each position (a perfect
...
Lesson complete: you can now explain how diffusion models work!
You learned that the forward process adds Gaussian noise to data, that the network is trained to predict that noise with an MSE loss, and that sampling starts from pure noise and denoises step by step. You also saw how text conditioning plus classifier-free guidance steer the output, and how latent diffusion makes Stable Diffusion fast enough for a consumer GPU.
🚀 Up next: LLM Architecture — see how the same "predict the next piece" idea powers large language models.