Lesson 28 • Advanced

Reinforcement Learning Basics

Master the core mechanics every RL algorithm is built on — the discounted return, value functions V(s) and Q(s,a), the Bellman equation, and the words used to describe how agents learn.

What You'll Learn in This Lesson

✓ Compute the discounted return G_t from a list of rewards
✓ Explain what the discount factor gamma does and why it exists
✓ Tell apart the value function V(s) and the action-value Q(s,a)
✓ Read the Bellman equation as plain English, not algebra
✓ Distinguish episodic tasks from continuing tasks
✓ Describe on-policy vs off-policy learning at a high level

Before you start: This lesson assumes you know the RL vocabulary — agent, environment, state, action, reward — from the Reinforcement Learning overview. That lesson sets the scene; this one zooms into the maths that powers it.

💰 Real-World Analogy: Money Today vs Money Later

Imagine someone offers you £100 today or £100 in a year. You take it today — money now is worth more than money later, because you could spend or invest it. RL agents feel exactly the same about reward.

The discount factor gamma (a number between 0 and 1) is the "interest rate" the agent applies to the future. A reward two steps away is multiplied by gamma², three steps away by gamma³, and so on. With gamma close to 1 the agent is patient and plans far ahead; with gamma close to 0 it is impulsive and grabs whatever reward is nearest. The return is just the total of all those discounted rewards — and maximising it is the agent's entire job.
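To see the patient-versus-impulsive contrast in numbers, here is a small sketch in plain Python. The reward list and the three gamma values are made up for illustration and are not taken from the lesson's own exercises.

```python
# Hypothetical reward list: nothing for four steps, then a big payoff.
rewards = [0, 0, 0, 0, 10]

def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its step index."""
    return sum((gamma ** step) * r for step, r in enumerate(rewards))

# Same rewards, three different levels of patience.
for gamma in (0.1, 0.5, 0.99):
    G = discounted_return(rewards, gamma)
    print(f"gamma = {gamma:<4}  ->  G = {G:.4f}")
```

With gamma = 0.1 the delayed payoff is worth almost nothing to the agent, while with gamma = 0.99 it keeps nearly its full value.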

1. The Return G_t: What RL Actually Maximises

A reward is the single number the environment hands you on one step. But a short-sighted agent that only chased the next reward would walk straight into traps. What you really care about is the return: the sum of all future rewards, with later ones discounted.

G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + γ³·r_{t+3} + …

Read it left to right: take this step's reward in full, then add the next step's reward shrunk by gamma, then the one after shrunk by gamma², and keep going. Because gamma is below 1, far-future rewards barely register, and as long as individual rewards stay bounded the sum remains finite even when the task never ends. The worked example below computes this for a concrete reward list and then does the one update step that turns an observed return into a learned value.

Worked Example: Return and One Value Update

Compute the discounted return G_t, then nudge a value estimate toward it

Python

# The two numbers at the heart of RL: the return and the value
# Plain Python — no libraries needed.

# A reward is the single number the environment hands back each step.
# The RETURN G_t is the TOTAL future reward from a point in time,
# but with future rewards shrunk by a discount factor gamma (0..1).
#
#   G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
#
# gamma < 1 means "a reward now is worth more than the same reward later".

rewards = [1, 0, 2, 5]   # rewards collected over 4 time ste
...

2. Value Functions: V(s) and Q(s,a)

A single return is what happened in one run. A value function is the return you expect on average. There are two of them, and the only difference is how much they pin down:

V(s) — state value

"How good is it to be in state s?" The expected return if you start in s and act well from there.

Q(s,a) — action value

"How good is it to do action a in state s?" The expected return if you take a now, then act well.

Q is the more useful of the two because it scores every action — so the agent can simply pick the action with the highest Q. And the two connect neatly: the value of a state is the value of its best action, written V(s) = max over actions Q(s, a). The example computes both for a small two-action state.
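The relationship V(s) = max over actions Q(s, a) can be sketched in a few lines. The Q-values and action names below are made-up numbers for a single hypothetical state, chosen only to show the mechanics.

```python
# Hypothetical Q-values for one state with two actions (made-up numbers).
Q = {"left": 3.2, "right": 5.7}

# Greedy choice: the action whose Q-value is highest.
best_action = max(Q, key=Q.get)

# If the agent acts well (greedily), V(s) is the Q of that best action.
V = Q[best_action]   # same number as max(Q.values())

print(f"best action = {best_action}, V(s) = {V}")
```

Keeping Q as a dictionary keyed by action is just one convenient layout; a list or array indexed by action number would work the same way.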

Worked Example: V(s) vs Q(s,a)

Score two actions with Q, then derive V(s) as the best of them

Python

# V(s) vs Q(s,a): two flavours of "how good is this?"
#
#   V(s)    = expected return if you START in state s and act well after.
#   Q(s,a)  = expected return if you START in state s, take action a NOW,
#             then act well after.
#
# Q is more detailed: it scores each ACTION, so you can pick the best one.
# Relationship: V(s) = the Q of the best action = max over a of Q(s, a).

gamma = 0.9

def discounted_return(rewards):
    """Total reward in one episode, discounting later steps."""
 
...

3. The Bellman Equation: The Big Idea

The return G_t is an infinite sum, which sounds impossible to compute. The Bellman equation is the trick that makes it easy. It notices that the future, after one step, is just another value — so the whole sum folds into a tiny recursive rule:

V(s) = r + γ·V(s′)

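In plain English: the value of where you are equals the reward you get now, plus the discounted value of wherever you land next (s′). You don't need to look at the entire future; you only need this step's reward and an estimate of the next state's value. This is the simplest form, where each state leads to one known reward and one known next state; when outcomes are random, the right-hand side becomes an average over them. That's the insight that turns RL from "sum over infinity" into a loop you can actually run.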
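One way to convince yourself that the recursive rule and the long sum agree is to compute returns both ways. The sketch below reuses the reward list from the first worked example and assumes gamma = 0.9; the variable names are illustrative.

```python
gamma = 0.9              # assumed discount factor
rewards = [1, 0, 2, 5]   # rewards r_0 .. r_3

# Forward: the explicit discounted sum, one return per starting step t.
forward = [
    sum((gamma ** k) * r for k, r in enumerate(rewards[t:]))
    for t in range(len(rewards))
]

# Backward: G_t = r_t + gamma * G_{t+1}, with G = 0 after the last step.
backward = [0.0] * len(rewards)
G_next = 0.0
for t in reversed(range(len(rewards))):
    backward[t] = rewards[t] + gamma * G_next
    G_next = backward[t]

for t, (f, b) in enumerate(zip(forward, backward)):
    print(f"t={t}: sum = {f:.3f}, recursion = {b:.3f}")
```

Both columns match at every step, and the backward pass needs only one multiplication and one addition per reward.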

Why it matters: Every value-based algorithm — value iteration, Q-learning, Deep Q-Networks — is just the Bellman equation applied over and over until the estimates stop changing.
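As a rough sketch of "applied over and over until the estimates stop changing", the loop below sweeps the Bellman rule across a tiny chain of states. The chain, its rewards, and the stopping threshold are all hypothetical, and there are no action choices here, so this is a simplified illustration rather than full value iteration or Q-learning.

```python
gamma = 0.9

# Hypothetical deterministic chain: state -> (reward, next_state).
# "end" is terminal, so its value stays 0.
chain = {"A": (1.0, "B"), "B": (0.0, "C"), "C": (10.0, "end")}

# Start every estimate at zero.
V = {"A": 0.0, "B": 0.0, "C": 0.0, "end": 0.0}

sweeps = 0
while True:
    biggest_change = 0.0
    for s, (r, s_next) in chain.items():
        new_value = r + gamma * V[s_next]   # Bellman: V(s) = r + gamma * V(s')
        biggest_change = max(biggest_change, abs(new_value - V[s]))
        V[s] = new_value
    sweeps += 1
    if biggest_change < 1e-9:               # estimates stopped changing
        break

print(f"converged after {sweeps} sweeps: {V}")
```

Information about the reward of 10 at the end of the chain travels backwards one state per sweep, which is why the loop needs several passes before the value of A settles.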

4. Episodes vs Continuing Tasks

RL problems come in two shapes, and they change how you think about the return:

🏁 Episodic tasks

Have a natural end — a chess game finishes, a robot reaches the goal, the player dies. The run is an episode; afterwards everything resets. The return is a finite sum over the steps of that one episode.

♾️ Continuing tasks

Never stop. A thermostat, a stock-trading bot, or a self-tuning server all run forever. There's no reset and no final step.

This is the second reason discounting exists: for a continuing task the reward sum would run to infinity, but multiplying each future reward by gamma below 1 makes the total settle on a finite number. A terminal state (the end of an episode) is treated as having value 0 — there's no future left to discount.
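The "settles on a finite number" claim can be checked numerically. The sketch below assumes a hypothetical continuing task that pays the same reward of 1 on every step; for that special case the discounted sum is a geometric series with limit 1 / (1 - gamma).

```python
gamma = 0.9
reward_per_step = 1.0      # hypothetical: the same reward on every step, forever

# Closed form of the geometric series r + gamma*r + gamma^2*r + ...
limit = reward_per_step / (1 - gamma)

G = 0.0
for step in range(200):
    G += (gamma ** step) * reward_per_step
    if step + 1 in (10, 50, 200):
        print(f"after {step + 1:>3} steps: G = {G:.4f}  (limit = {limit:.4f})")
```

The running total creeps up toward the limit and never passes it, no matter how many more steps are added.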

5. On-Policy vs Off-Policy (High Level)

A policy is the agent's strategy — its rule for choosing an action in each state. Algorithms differ in whether the policy they learn about is the same one they act with:

On-policy — you learn about the very policy you're following, exploration and all. Honest but cautious; it can't easily reuse old data. (SARSA is the classic example.)
Off-policy — you act with one policy (often a curious, exploratory one) but learn the value of a different policy (usually the best greedy one). More flexible, and it lets you reuse stored experience — at the cost of being harder to keep stable. (Q-learning is off-policy.)

One-line summary: on-policy learns about "what I'm actually doing"; off-policy learns about "what I should be doing" while doing something more exploratory.
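The difference shows up in the target each method learns toward. The sketch below compares a SARSA-style target with a Q-learning-style target for one transition; the Q-values, reward, and action names are made-up numbers, and the full algorithms (covered later) involve more than this single line.

```python
gamma = 0.9

# Hypothetical Q-values for the NEXT state s' (made-up numbers).
Q_next = {"left": 2.0, "right": 8.0}

reward = 1.0            # reward observed on this step
next_action = "left"    # action the exploring agent ACTUALLY takes in s'

# On-policy (SARSA-style): bootstrap from the action really taken next.
sarsa_target = reward + gamma * Q_next[next_action]

# Off-policy (Q-learning-style): bootstrap from the best action in s',
# whether or not the agent goes on to take it.
q_learning_target = reward + gamma * max(Q_next.values())

print(f"SARSA target      = {sarsa_target:.2f}")
print(f"Q-learning target = {q_learning_target:.2f}")
```

When the exploratory action happens to be the best one, the two targets coincide; they differ only on steps where the agent explores.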

🎯 Your Turn 1: Compute a Discounted Return

Fill in the two blanks so the program adds up the rewards with a discount factor of 0.5. Use the expected output to check yourself.

Your Turn: Discounted Return

Fill in gamma and the per-step discount

Python

# 🎯 YOUR TURN: compute a discounted return from a reward list
# Fill in each ___ then run it.

rewards = [2, 3, 0, 4]      # rewards over 4 steps (given)
gamma   = ___               # 👉 use 0.5 as the discount factor

G = 0.0
for step, r in enumerate(rewards):
    # 👉 add this step's reward, discounted by gamma**step
    G += ___ * r            # 👉 replace ___ with  gamma ** step

print(f"Discounted return G = {G:.3f}")
# ✅ Expected output: Discounted return G = 4.000
#    (2 + 0.5*3 + 0.25*
...

🎯 Your Turn 2: One Value-Update Step

This is the learning rule at the heart of value-based RL. Fill in the blank to move the value estimate part of the way toward the observed return.

Your Turn: Value Update

Apply V(s) <- V(s) + alpha * (target - V(s))

Python

# 🎯 YOUR TURN: do ONE value-update step toward an observed return
# This is the core learning rule of value-based RL.
#
#   V(s) <- V(s) + alpha * (target - V(s))

V_s    = 10.0      # current value estimate for this state (given)
alpha  = 0.2       # learning rate (given)
target = 20.0      # the return we actually observed (given)

# 👉 nudge V_s 20% of the way from its old value toward the target
V_s = V_s + ___ * (target - V_s)      # 👉 replace ___ with  alpha

print(f"Updated V(s) = {V_s:
...

Common Errors (And How to Fix Them)

❌ Off-by-one on the discount power

Discounting the first reward as well — every term ends up too small.

G += (gamma ** (step + 1)) * r   # ❌ first reward shouldn't be discounted

✅ Fix: step starts at 0, so the first reward uses gamma**0 = 1.

G += (gamma ** step) * r          # ✅ step = 0, 1, 2, ...

❌ Runaway return with gamma = 1

Setting gamma = 1 on a continuing (never-ending) task makes the return grow without bound.

gamma = 1.0   # ❌ on a task with no terminal state, G never converges

✅ Fix: use gamma below 1 (0.9 or 0.99) for continuing tasks.

gamma = 0.99  # ✅ future rewards fade, return stays finite

❌ Wrong sign in the value update

Writing (V_s - target) pushes the estimate away from the return — it diverges.

V_s = V_s + alpha * (V_s - target)   # ❌ moves the wrong way

✅ Fix: the error is target minus current estimate.

V_s = V_s + alpha * (target - V_s)   # ✅ moves toward the target
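To see what the sign does over many updates rather than one, the sketch below repeats both versions 20 times. It reuses the starting value, learning rate, and target from Your Turn 2; the choice of 20 repetitions is arbitrary.

```python
alpha = 0.2
target = 20.0

V_right = 10.0   # updated with (target - V)
V_wrong = 10.0   # updated with (V - target), the sign bug

for _ in range(20):
    V_right = V_right + alpha * (target - V_right)   # gap shrinks by 20% each time
    V_wrong = V_wrong + alpha * (V_wrong - target)   # gap grows by 20% each time

print(f"correct sign: V = {V_right:.3f}")   # closes in on the target
print(f"wrong sign:   V = {V_wrong:.3f}")   # runs away from the target
```

With the correct sign the gap to the target is multiplied by 0.8 on every update; with the wrong sign it is multiplied by 1.2, so the estimate moves further away each time.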

📋 Quick Reference

Concept	Symbol / Formula	Meaning
Reward	r	Single number returned each step
Return	G_t = Σ_k γ^k·r_{t+k}	Total discounted future reward
Discount	0 ≤ γ ≤ 1	How much the future matters
State value	V(s)	Expected return from state s
Action value	Q(s, a)	Expected return of action a in s
Bellman	V(s) = r + γ·V(s′)	Value as reward + next value
Update rule	V ← V + α·(target − V)	Nudge estimate toward observed return
Episode	start … terminal	One run that ends and resets

❓ Frequently Asked Questions

Q: What is the return G_t in reinforcement learning?

A: The return is the total reward an agent collects from a moment onward, with each future reward shrunk by the discount factor gamma: G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + .... It is the quantity RL actually tries to maximise — not the single next reward.

Q: What is the difference between V(s) and Q(s,a)?

A: V(s) is the expected return when you start in state s and then act well. Q(s,a) is the expected return when you start in state s, take action a now, and act well afterwards. Q scores each action, so you can pick the best one, and V(s) equals the Q value of that best action.

Q: What does the Bellman equation actually say?

A: It says the value of a state equals the immediate reward plus the discounted value of where you land next: V(s) = r + gamma*V(s'). It turns one giant sum over the whole future into a short recursive rule, which is what makes RL computable.

Q: What is the difference between an episode and a continuing task?

A: An episodic task has a natural end — a game finishes, a robot reaches the goal — and then resets. A continuing task never terminates, like a server that runs forever. Discounting (gamma < 1) keeps the return finite even for continuing tasks.

Q: What does on-policy versus off-policy mean?

A: On-policy methods learn about the same policy they use to choose actions. Off-policy methods can learn about one policy (for example the best greedy one) while exploring with a different policy. Off-policy is more flexible and lets you reuse old experience, but is trickier to make stable.

🎯 Mini-Challenge: Average Return Over Episodes

This time only the outline is given. Write the function and the averaging logic yourself, then check against the expected output in the comments.

Mini-Challenge: Average Return

Write discounted_return() and average it over 3 episodes

Python

# 🎯 MINI-CHALLENGE: average return over several episodes
# 1. You have 3 episodes, each a list of rewards (given below).
# 2. Write a function discounted_return(rewards) using gamma = 0.9
#    that returns r0 + 0.9*r1 + 0.9^2*r2 + ...
# 3. Compute the return for each episode, then print the AVERAGE return.
#
# ✅ Expected (gamma=0.9): Average return = 6.660

gamma = 0.9
episodes = [
    [1, 2, 3],      # return = 1 + 1.8 + 2.43 = 5.23
    [0, 0, 10],     # return = 0 + 0   + 8.1  = 8.10
    [5, 
...

🎉

Lesson complete — you now know the machinery of RL!

You can compute a discounted return, tell V(s) from Q(s,a), read the Bellman equation as plain English, separate episodic from continuing tasks, and explain on-policy vs off-policy. These are the exact pieces every RL algorithm assembles.

🚀 Up next: Q-Learning & DQN — turn the Bellman update into an algorithm that learns to play Atari from raw pixels.
