Lesson 29 • Advanced

Q-Learning & Deep Q-Networks

By the end of this lesson you'll be able to build a Q-table, apply the Bellman update by hand, and write an epsilon-greedy agent that learns the best path through a grid world — all in plain Python.

What You'll Learn in This Lesson

✓What a Q-table is and what each cell Q(s,a) means
✓How to apply the Bellman update step by step
✓What the learning rate α and discount γ actually control
✓How epsilon-greedy balances exploring and exploiting
✓Why you decay epsilon as the agent gets smarter
✓When a Q-table breaks down and you reach for a DQN

Before you start: Make sure you've completed RL Basics so you're comfortable with states, actions, rewards, and the idea of an agent learning from trial and error.

🌍 Real-World Analogy: Learning a Maze by Trial and Reward

Imagine you're dropped into a hedge maze with a chocolate bar hidden at the exit. You have no map. The first time, you wander randomly — left here, right there — until you stumble onto the exit and get the chocolate.

Each time you reach a junction (a state) you remember which turn (an action) eventually paid off. Over many runs you build a mental note for every junction: "from here, going right got me to the chocolate fastest." That mental note is exactly a Q-table — a score for every junction-and-turn pair.

Two instincts make you learn well. You discount faraway rewards (chocolate ten turns away is worth less than chocolate one turn away), and you sometimes explore a turn you've never tried, just in case there's a shortcut. Q-Learning is just this maze instinct written as a formula.

1The Q-table — One Score Per (State, Action)

The heart of Q-Learning is the Q-table. Q(s,a) stores the agent's best guess of the total future reward it will earn if it takes action a in state s and then keeps acting optimally afterwards. "Q" stands for quality — the quality of that move.

You don't need a fancy data structure. A plain Python dictionary keyed by a (state, action) tuple works perfectly for small worlds. Every entry starts at 0.0 because the agent begins knowing nothing.

# A Q-table is just a dictionary
Q = {}
states  = ["A", "B"]
actions = ["left", "right"]

for s in states:
    for a in actions:
        Q[(s, a)] = 0.0     # start every estimate at zero

print(Q[("A", "left")])     # 0.0  — no knowledge yet

# To act, read the row for the current state and pick the best action:
best = max(actions, key=lambda a: Q[("A", a)])
print(best)                 # 'left' (ties broken by first seen)

Key idea: the Q-table is the agent's brain. Training just means filling in good numbers. Once the numbers are right, "act optimally" is as simple as "pick the highest number in your row."

2The Bellman Update — How Learning Actually Happens

Every time the agent takes an action and sees what happens, it nudges one cell of the Q-table using this rule:

Q(s,a) += α · [ r + γ · maxQ(s',a') − Q(s,a) ]

Read it piece by piece — in plain English it just says "move my old guess a little closer to a better guess":

r — the reward you just received for this move.
maxQ(s',a') — the best Q-value available from the next state s'. This is the "what comes after" estimate.
r + γ · maxQ(s',a') — the TD target: a fresher, better estimate of what Q(s,a) should be.
[ target − Q(s,a) ] — the TD error: how wrong your old guess was.
α · — take only a fraction of that correction, so one surprise doesn't overwrite everything.

It's called off-policy temporal-difference learning, but you don't need the jargon to use it. "Temporal difference" just means you learn from the gap between two estimates taken one step apart.

3Learning Rate α and Discount Factor γ

α — Learning rate (0 to 1)

How big a step you take toward the new estimate.

α = 0: never learns (ignores new info).
α small (0.1): slow, smooth, stable.
α = 1: throws away the old value entirely each step — unstable.

γ — Discount factor (0 to 1)

How much you value future reward versus reward right now.

γ near 0: short-sighted — grabs instant rewards.
γ = 0.9–0.99: plans ahead (the usual choice).
γ = 1: values distant reward fully — can fail to converge.

# Same situation, two different gammas:
reward = 0
max_next = 100          # a big payoff is reachable from the next state

short_sighted = reward + 0.1 * max_next   # 10.0  — barely cares about the future
far_sighted   = reward + 0.95 * max_next  # 95.0  — strongly pulled toward the payoff

print(short_sighted)    # 10.0
print(far_sighted)      # 95.0

4Epsilon-Greedy — Explore vs Exploit

There's a tension at the core of learning. If the agent always picks the action it currently thinks is best (pure greedy), it can lock onto a mediocre path and never discover a better one. But if it acts randomly forever, it never cashes in what it learned.

Epsilon-greedy is the simple fix: with probability ε (epsilon) act randomly (explore), otherwise pick the best known action (exploit).

import random

epsilon = 0.1   # explore 10% of the time

def choose(state):
    if random.random() < epsilon:
        return random.choice(actions)              # EXPLORE: try something new
    return max(actions, key=lambda a: Q[(state, a)])  # EXPLOIT: pick the best known

# Early on, you want a HIGH epsilon (lots of exploration).
# As Q gets accurate, DECAY epsilon so the agent exploits what it learned:
epsilon = max(0.05, epsilon * 0.99)   # shrink each episode, but never below 0.05

Decay epsilon over time. Start high (say 1.0) to explore, then multiply it down toward a small floor (say 0.05). A constant high epsilon means the agent keeps making random moves even after it has learned the optimal path.

5Worked Example — One Update, Start to Finish

Here's the whole loop in miniature: a Q-table as a dictionary, a single Bellman update applied by hand, and an epsilon-greedy pick. Read every comment, run it, and watch Q(A,right) move from 0 to 5.

Worked Example: One Bellman Update

A complete, commented walk-through — Q-table, one update, epsilon-greedy

Try it Yourself »

Python

# Q-LEARNING BY HAND — one Bellman update, no libraries
# The Q-table is just a dictionary: Q[(state, action)] = expected future reward.

import random
random.seed(0)

# A tiny world. The agent is in state "A" and can go "left" or "right".
# Start every estimate at 0.0 — the agent knows nothing yet.
Q = {
    ("A", "left"):  0.0,
    ("A", "right"): 0.0,
    ("B", "left"):  0.0,
    ("B", "right"): 0.0,
}

alpha = 0.5    # learning rate: how big a step we take toward the new estimate
gamma = 0.9
...

🎯 Your Turn #1 — Finish the Bellman Update

Fill in the two blanks so the update runs. The next state B already has learned values, so this time max_next won't be zero.

Your Turn #1: Apply the Update

Fill in max_next and the old Q-value subtraction

Try it Yourself »

Python

# 🎯 YOUR TURN — finish the Bellman update
# Formula:  Q(s,a) += alpha * (r + gamma * maxQ(s',a') - Q(s,a))

Q = {
    ("A", "left"):  0.0,
    ("A", "right"): 0.0,
    ("B", "left"):  2.0,   # state B already has some learned value
    ("B", "right"): 6.0,
}

alpha = 0.5
gamma = 0.9

state, action = "A", "right"
reward = 1
next_state = "B"

# 1) Find the best Q-value available from next_state (B):
max_next = ___          # 👉 max over Q[(next_state, a)] for a in ["left", "right"]

# 2) Apply th
...

🎯 Your Turn #2 — Build Epsilon-Greedy

Complete the explore and exploit branches. Run it 1000 times and confirm the agent mostly picks the higher-value action.

Your Turn #2: Epsilon-Greedy Selection

Implement the explore (random) and exploit (argmax) branches

Try it Yourself »

Python

# 🎯 YOUR TURN — implement epsilon-greedy action selection
import random
random.seed(7)

Q = {
    ("S", "up"):    1.0,
    ("S", "down"):  9.0,   # 'down' is clearly the best learned action so far
}
actions = ["up", "down"]
epsilon = 0.2              # explore 20% of the time

def choose(state):
    if random.random() < epsilon:
        # EXPLORE: return a random action from 'actions'
        return ___          # 👉 use random.choice(actions)
    # EXPLOIT: return the action with the highest Q
...

🌍 When Tables Run Out: Deep Q-Networks

Everything above scales to any small world. But a Q-table needs one cell per state-action pair, and that breaks down fast: an Atari screen has more possible pixel states than atoms in the observable universe. You can't store a table that big, and you'd never visit each cell enough to learn it.

A Deep Q-Network (DQN) swaps the table for a neural network. Instead of looking up Q(s,a), the network predicts it from the raw state, so similar states share what they've learned — the GPS that generalises instead of the notebook that memorises. DeepMind's 2013 DQN learned to play Atari games straight from pixels using exactly this idea, plus two stabilisers:

Experience replay: store past (s, a, r, s') transitions and train on random batches, breaking correlation between consecutive steps.
Target network: compute the TD target with a slowly-updated copy of the network, so you're not chasing a moving target.

The learning rule is still the Bellman update you mastered here — only the storage changed from a dict to a net.

6Common Errors (And How to Fix Them)

These are the mistakes that quietly stop an agent from learning. None of them throw a Python error — the code runs, it just never gets good.

❌ Learning rate α too high

With alpha = 1.0 each update overwrites the old value with one noisy sample, so the Q-values jump around and never settle.

✅ Fix: use a small, steady α (0.1 is a safe default); lower it if values oscillate.

❌ Learning rate α too low

alpha = 0.001 barely moves the estimate each step, so training crawls and may not converge in the episodes you run.

✅ Fix: raise α (try 0.1) or run far more episodes. There's a trade-off — too high is unstable, too low is glacial.

❌ Discount γ too high or too low

gamma = 1.0 can stop the Q-values from converging at all; gamma = 0.1 makes the agent ignore the goal and chase tiny immediate rewards.

✅ Fix: stay in the usual 0.9–0.99 range so the agent plans ahead without diverging.

❌ No exploration (epsilon = 0)

A pure-greedy agent commits to the first path it stumbles on and never discovers the better one — it gets stuck in a local optimum.

✅ Fix: keep epsilon > 0 so the agent keeps trying new actions while it learns.

❌ Not decaying epsilon

Leaving epsilon = 0.5 for the whole run means half the agent's moves stay random even after it has learned the optimal path, capping its performance.

✅ Fix: start epsilon high and multiply it down toward a small floor (e.g. epsilon = max(0.05, epsilon * 0.99)).

❌ State space too huge for a table

Trying to build a Q-table for chess or raw pixels means billions of cells — you run out of memory and never visit each state enough to learn it.

✅ Fix: when states can't be enumerated, switch from a table to a function approximator (a neural network) — that's a DQN.

📋 Quick Reference

Symbol / Term	Meaning	Typical value
`Q(s,a)`	Estimated total future reward for action a in state s	a number in the table
`α` (alpha)	Learning rate — step size toward the new estimate	0.1
`γ` (gamma)	Discount factor — weight on future reward	0.9 – 0.99
`ε` (epsilon)	Exploration rate — chance of a random action	1.0 → 0.05 (decayed)
Bellman update	`Q(s,a) += α·[r + γ·maxQ(s',a') − Q(s,a)]`	applied every step
DQN	Neural network replaces the Q-table for huge state spaces	when states are uncountable

❓ Frequently Asked Questions

Q: What is a Q-table?

A: A Q-table is a lookup table with one row per state and one column per action. Each cell Q(s,a) stores the agent's current estimate of the total future reward it expects to collect if it takes action a in state s and then keeps acting optimally. The agent picks actions by reading off the row for its current state and choosing the highest value.

Q: What does the Bellman update actually do?

A: Q(s,a) += alpha * (r + gamma * maxQ(s',a') - Q(s,a)) nudges the old estimate toward a better one. The target r + gamma * maxQ(s',a') is the reward you just got plus the discounted value of the best action from the next state. The bracketed part is the TD error (how wrong you were), and alpha controls how big a step you take to fix it.

Q: What is the difference between the learning rate and the discount factor?

A: The learning rate alpha (0 to 1) controls how fast you overwrite old estimates with new experience. The discount factor gamma (0 to 1) controls how much you care about future reward versus immediate reward: gamma near 0 makes the agent greedy for instant rewards, gamma near 1 makes it plan far ahead.

Q: Why do I need epsilon-greedy exploration?

A: If the agent always picks the action it currently thinks is best (pure greedy), it can get stuck repeating a mediocre path and never discover a better one. Epsilon-greedy picks a random action with probability epsilon and the greedy action the rest of the time, so the agent keeps trying new things while still exploiting what it has learned.

Q: Why should I decay epsilon over time?

A: Early in training the agent knows nothing, so you want lots of exploration (high epsilon). Late in training its Q-table is mostly correct, so random moves just waste steps. Decaying epsilon (for example multiplying it by 0.99 each episode down to a small floor like 0.05) shifts the agent from exploring to exploiting as it learns.

Q: When does a plain Q-table stop working?

A: A Q-table needs one cell for every state-action pair. That is fine for a 4x4 grid (16 states) but impossible for chess or raw game pixels, where states number in the billions. When the state space is too large to enumerate, you replace the table with a function approximator such as a neural network, which gives you a Deep Q-Network (DQN).

🎯 Mini-Challenge: Train an Agent on a Line World

Now put it all together with the support removed. The starter below gives you only a comment outline — no filled-in logic. Build the training loop yourself: a Q-table dict, epsilon-greedy moves, and the Bellman update running for 500 episodes until every state learns to head toward the goal.

Mini-Challenge: Line World

Write the full Q-learning loop from the outline

Try it Yourself »

Python

# 🎯 MINI-CHALLENGE: train an agent on a 1-D line world
#
# The world is 5 squares: 0 1 2 3 4. The agent starts at 0.
# Square 4 is the GOAL (reward +10). Every other step gives reward -1.
# Actions are "left" and "right". Falling off either end keeps you put.
#
# Build it step by step:
# 1. Make a Q-table dict: Q[(s, a)] = 0.0 for s in 0..4 and a in ["left","right"].
# 2. alpha = 0.1, gamma = 0.9, epsilon = 0.2.
# 3. For 500 episodes:
#      - start state = 0
#      - loop up to 20 steps:
#    
...

🎉

Lesson complete — you can teach an agent to learn!

You built a Q-table from a dictionary, applied the Bellman update by hand, tuned α and γ, and wrote an epsilon-greedy agent that explores then exploits. You also saw exactly when a table runs out and a Deep Q-Network takes over.

🚀 Up next: Policy Gradient Methods — instead of learning action values, learn the policy directly. These are the algorithms behind PPO and ChatGPT's RLHF training.

Q-Learning & Deep Q-Networks

What You'll Learn in This Lesson

🌍 Real-World Analogy: Learning a Maze by Trial and Reward

1The Q-table — One Score Per (State, Action)

2The Bellman Update — How Learning Actually Happens

3Learning Rate α and Discount Factor γ

4Epsilon-Greedy — Explore vs Exploit

5Worked Example — One Update, Start to Finish

Worked Example: One Bellman Update

🎯 Your Turn #1 — Finish the Bellman Update

Your Turn #1: Apply the Update

🎯 Your Turn #2 — Build Epsilon-Greedy

Your Turn #2: Epsilon-Greedy Selection

🌍 When Tables Run Out: Deep Q-Networks

6Common Errors (And How to Fix Them)

📋 Quick Reference

❓ Frequently Asked Questions

🎯 Mini-Challenge: Train an Agent on a Line World

Mini-Challenge: Line World

Lesson complete — you can teach an agent to learn!

Cookie & Privacy Settings