
    Lesson 28 • Advanced

    Reinforcement Learning Basics

    Master the core mechanics every RL algorithm is built on — the discounted return, value functions V(s) and Q(s,a), the Bellman equation, and the words used to describe how agents learn.

    What You'll Learn in This Lesson

    • Compute the discounted return G_t from a list of rewards
    • Explain what the discount factor gamma does and why it exists
    • Tell apart the value function V(s) and the action-value Q(s,a)
    • Read the Bellman equation as plain English, not algebra
    • Distinguish episodic tasks from continuing tasks
    • Describe on-policy vs off-policy learning at a high level

    💰 Real-World Analogy: Money Today vs Money Later

    Imagine someone offers you £100 today or £100 in a year. You take it today — money now is worth more than money later, because you could spend or invest it. RL agents feel exactly the same about reward.

    The discount factor gamma (a number between 0 and 1) is the "interest rate" the agent applies to the future. A reward two steps away is multiplied by gamma², three steps away by gamma³, and so on. With gamma close to 1 the agent is patient and plans far ahead; with gamma close to 0 it is impulsive and grabs whatever reward is nearest. The return is just the total of all those discounted rewards — and maximising it is the agent's entire job.
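    To make "patient" versus "impulsive" concrete, the sketch below lists the weight gamma**k placed on a reward k steps ahead. The helper name discount_weights and the two gamma values (0.9 and 0.1) are illustrative choices, not part of the lesson.

```python
def discount_weights(gamma, horizon):
    """Weight applied to a reward k steps ahead, for k = 0 .. horizon - 1."""
    return [gamma ** k for k in range(horizon)]

# Illustrative values: a patient agent and an impulsive one.
patient = discount_weights(0.9, 5)     # weights fade slowly
impulsive = discount_weights(0.1, 5)   # weights fade almost immediately

print("gamma = 0.9:", [round(w, 4) for w in patient])
print("gamma = 0.1:", [round(w, 4) for w in impulsive])
```

    With gamma = 0.9 a reward four steps away still counts for roughly two thirds of its face value; with gamma = 0.1 it is practically ignored.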

    1. The Return G_t: What RL Actually Maximises

    A reward is the single number the environment hands you on one step. But a greedy agent that only chased the next reward would walk straight into traps. What you really care about is the return — the sum of all future rewards, with later ones discounted.

    G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + γ³·r_{t+3} + …

    Read it left to right: take this step's reward in full, then add next step's reward shrunk by gamma, then the one after shrunk by gamma², and keep going. Because gamma is below 1, far-future rewards barely register, and as long as each reward is bounded the sum stays finite even if the task never ends. The worked example below computes this for a real reward list and then does the one update step that turns an observed return into a learned value.

    Worked Example: Return and One Value Update

    Compute the discounted return G_t, then nudge a value estimate toward it

    Python
    # The two numbers at the heart of RL: the return and the value
    # Plain Python — no libraries needed.
    
    # A reward is the single number the environment hands back each step.
    # The RETURN G_t is the TOTAL future reward from a point in time,
    # but with future rewards shrunk by a discount factor gamma (0..1).
    #
    #   G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    #
    # gamma < 1 means "a reward now is worth more than the same reward later".
    
    rewards = [1, 0, 2, 5]   # rewards collected over 4 time ste
    ...
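    As a separate minimal sketch of the same two steps (not the lesson's original listing), the code below computes the discounted return for the reward list [1, 0, 2, 5] and then applies one value update. The values gamma = 0.9, alpha = 0.1 and the starting estimate V = 0.0 are assumptions made for illustration.

```python
rewards = [1, 0, 2, 5]   # reward list from the example above
gamma = 0.9              # assumed discount factor
alpha = 0.1              # assumed learning rate

def discounted_return(rewards, gamma):
    """Sum of rewards, with the reward at step k weighted by gamma**k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

G = discounted_return(rewards, gamma)

# One value update: move the estimate a fraction alpha toward the observed return.
V = 0.0                  # assumed starting estimate
V = V + alpha * (G - V)

print(f"G = {G:.3f}")
print(f"V after one update = {V:.4f}")
```

    The update moves V only 10% of the way toward G, which is why many observed returns are needed before the estimate settles.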

    2. Value Functions: V(s) and Q(s,a)

    A single return is what happened in one run. A value function is the return you expect on average. There are two of them, and the only difference is how much they pin down:

    V(s) — state value

    "How good is it to be in state s?" The expected return if you start in s and act well from there.

    Q(s,a) — action value

    "How good is it to do action a in state s?" The expected return if you take a now, then act well.

    Q is the more useful of the two because it scores every action, so the agent can simply pick the action with the highest Q. When "act well" means acting optimally, the two connect neatly: the value of a state is the value of its best action, written V(s) = max over actions Q(s, a). The example computes both for a small two-action state.

    Worked Example: V(s) vs Q(s,a)

    Score two actions with Q, then derive V(s) as the best of them

    Python
    # V(s) vs Q(s,a): two flavours of "how good is this?"
    #
    #   V(s)    = expected return if you START in state s and act well after.
    #   Q(s,a)  = expected return if you START in state s, take action a NOW,
    #             then act well after.
    #
    # Q is more detailed: it scores each ACTION, so you can pick the best one.
    # Relationship: V(s) = the Q of the best action = max over a of Q(s, a).
    
    gamma = 0.9
    
    def discounted_return(rewards):
        """Total reward in one episode, discounting later steps."""
     
    ...
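    A minimal sketch of the relationship, separate from the lesson's own listing: each action in one state leads to a fixed, made-up reward sequence, Q is the discounted return of that sequence, and V is the largest Q. The action names and reward lists are hypothetical, and the environment is assumed to be deterministic so that one run equals the expected return.

```python
gamma = 0.9

def discounted_return(rewards):
    """Total reward in one episode, discounting later steps."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical outcomes of the two actions available in state s.
outcomes = {
    "left":  [1, 1, 1],   # small reward on every step
    "right": [0, 0, 5],   # nothing at first, larger reward later
}

# Q(s, a): the return that follows from taking action a in s.
q_values = {action: discounted_return(r) for action, r in outcomes.items()}

# V(s): the Q of the best action.
best_action = max(q_values, key=q_values.get)
v_state = q_values[best_action]

for action, q in q_values.items():
    print(f"Q(s, {action}) = {q:.2f}")
print(f"V(s) = {v_state:.2f} via action '{best_action}'")
```

    Lowering gamma shrinks the delayed reward of "right" faster than the steady rewards of "left", so the best action can flip.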

    3. The Bellman Equation: The Big Idea

    The return G_t is an infinite sum, which sounds impossible to compute. The Bellman equation is the trick that makes it easy. It notices that the future, after one step, is just another value — so the whole sum folds into a tiny recursive rule:

    V(s) = r + γ·V(s′)

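    In plain English: the value of where you are equals the reward you get now, plus the discounted value of wherever you land next (s′). This is the simplified form for a single, deterministic step; when the reward or next state is random, the right-hand side becomes an average (expectation) over the possible outcomes. Either way, you don't need to look at the entire future: you only need this step's reward and an estimate of the next state's value. That's the insight that turns RL from "sum over infinity" into a loop you can actually run.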
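    The same folding trick can be checked on a plain reward list. This sketch, assuming gamma = 0.9 and a short illustrative reward list, computes every return two ways: by summing the whole future directly, and by the one-step recursion G_t = r_t + gamma * G_{t+1}, working backwards from the end.

```python
gamma = 0.9               # assumed discount factor
rewards = [1, 0, 2, 5]    # illustrative rewards for steps 0..3

def direct_return(rewards, gamma, t):
    """Return from step t, summing every future reward explicitly."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards[t:]))

# Recursive version: one extra slot holds the value after the last step (0).
returns = [0.0] * (len(rewards) + 1)
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]

for t in range(len(rewards)):
    direct = direct_return(rewards, gamma, t)
    assert abs(returns[t] - direct) < 1e-9   # both methods agree
    print(f"t={t}: recursive={returns[t]:.3f}  direct={direct:.3f}")
```

    The backward loop touches each reward once, while the direct sum re-adds the whole tail for every step.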

    4. Episodes vs Continuing Tasks

    RL problems come in two shapes, and they change how you think about the return:

    🏁 Episodic tasks

    Have a natural end — a chess game finishes, a robot reaches the goal, the player dies. The run is an episode; afterwards everything resets. The return is a finite sum over the steps of that one episode.

    ♾️ Continuing tasks

    Never stop: a thermostat, a stock-trading bot, or a server tuning itself runs indefinitely. There's no reset and no final step.

    This is the second reason discounting exists: for a continuing task the reward sum would run to infinity, but multiplying each future reward by gamma below 1 makes the total settle on a finite number. A terminal state (the end of an episode) is treated as having value 0 — there's no future left to discount.
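    A small numerical sketch of that convergence, assuming a hypothetical continuing task that pays a constant reward of 1 on every step. With gamma = 0.99 the partial sums level off near 1 / (1 - gamma) = 100; with gamma = 1 they simply keep growing with the number of steps.

```python
def partial_return(reward, gamma, n_steps):
    """Discounted sum of a constant reward over the first n_steps steps."""
    total, weight = 0.0, 1.0
    for _ in range(n_steps):
        total += weight * reward
        weight *= gamma          # next step's reward counts a little less
    return total

for n in (10, 100, 1000, 10000):
    discounted = partial_return(1.0, 0.99, n)
    undiscounted = partial_return(1.0, 1.0, n)
    print(f"steps={n:>5}  gamma=0.99: {discounted:8.3f}   gamma=1.0: {undiscounted:8.1f}")
```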

    5. On-Policy vs Off-Policy (High Level)

    A policy is the agent's strategy — its rule for choosing an action in each state. Algorithms differ in whether the policy they learn about is the same one they act with:

    • On-policy — you learn about the very policy you're following, exploration and all. Honest but cautious; it can't easily reuse old data. (SARSA is the classic example.)
    • Off-policy — you act with one policy (often a curious, exploratory one) but learn the value of a different policy (usually the best greedy one). More flexible, and it lets you reuse stored experience — at the cost of being harder to keep stable. (Q-learning is off-policy.)
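    The difference shows up in the target each method learns toward. In this minimal sketch, SARSA bootstraps from the next action the agent actually took, while Q-learning bootstraps from the best next action regardless of what was taken. The Q-table, state names, action names and reward are all hypothetical.

```python
gamma = 0.9   # assumed discount factor

# Hypothetical action-value estimates for the next state.
Q = {
    ("s_next", "safe"): 1.0,
    ("s_next", "risky"): 3.0,
}
next_actions = ["safe", "risky"]

def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: uses the action actually taken in the next state."""
    return reward + gamma * Q[(next_state, next_action)]

def q_learning_target(Q, reward, next_state, next_actions, gamma):
    """Off-policy target: uses the greedy (highest-Q) action in the next state."""
    return reward + gamma * max(Q[(next_state, a)] for a in next_actions)

reward = 1.0
action_taken_next = "safe"   # e.g. an exploratory choice, not the greedy one

t_sarsa = sarsa_target(Q, reward, "s_next", action_taken_next, gamma)
t_qlearn = q_learning_target(Q, reward, "s_next", next_actions, gamma)

print(f"SARSA target      = {t_sarsa:.2f}")
print(f"Q-learning target = {t_qlearn:.2f}")
```

    When the action taken next happens to be the greedy one, the two targets coincide; they differ only on exploratory steps.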

    🎯 Your Turn 1: Compute a Discounted Return

    Fill in the two blanks so the program adds up the rewards with a discount factor of 0.5. Use the expected output to check yourself.

    Your Turn: Discounted Return

    Fill in gamma and the per-step discount

    Python
    # 🎯 YOUR TURN: compute a discounted return from a reward list
    # Fill in each ___ then run it.
    
    rewards = [2, 3, 0, 4]      # rewards over 4 steps (given)
    gamma   = ___               # 👉 use 0.5 as the discount factor
    
    G = 0.0
    for step, r in enumerate(rewards):
        # 👉 add this step's reward, discounted by gamma**step
        G += ___ * r            # 👉 replace ___ with  gamma ** step
    
    print(f"Discounted return G = {G:.3f}")
    # ✅ Expected output: Discounted return G = 4.000
    #    (2 + 0.5*3 + 0.25*
    ...

    🎯 Your Turn 2: One Value-Update Step

    This is the learning rule at the heart of value-based RL. Fill in the blank to move the value estimate part of the way toward the observed return.

    Your Turn: Value Update

    Apply V(s) <- V(s) + alpha * (target - V(s))

    Python
    # 🎯 YOUR TURN: do ONE value-update step toward an observed return
    # This is the core learning rule of value-based RL.
    #
    #   V(s) <- V(s) + alpha * (target - V(s))
    
    V_s    = 10.0      # current value estimate for this state (given)
    alpha  = 0.2       # learning rate (given)
    target = 20.0      # the return we actually observed (given)
    
    # 👉 nudge V_s 20% of the way from its old value toward the target
    V_s = V_s + ___ * (target - V_s)      # 👉 replace ___ with  alpha
    
    print(f"Updated V(s) = {V_s:
    ...

    Common Errors (And How to Fix Them)

    ❌ Off-by-one on the discount power

    Discounting the first reward as well — every term ends up too small.

    G += (gamma ** (step + 1)) * r   # ❌ first reward shouldn't be discounted

    ✅ Fix: step starts at 0, so the first reward uses gamma**0 = 1.

    G += (gamma ** step) * r          # ✅ step = 0, 1, 2, ...

    ❌ Runaway return (gamma = 1 on a continuing task)

    Setting gamma = 1 on a continuing (never-ending) task makes the return grow without bound.

    gamma = 1.0   # ❌ on a task with no terminal state, G never converges

    ✅ Fix: use gamma below 1 (0.9 or 0.99) for continuing tasks.

    gamma = 0.99  # ✅ future rewards fade, return stays finite

    ❌ Wrong sign in the value update

    Writing (V_s - target) pushes the estimate away from the return — it diverges.

    V_s = V_s + alpha * (V_s - target)   # ❌ moves the wrong way

    ✅ Fix: the error is target minus current estimate.

    V_s = V_s + alpha * (target - V_s)   # ✅ moves toward the target
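    To see the effect of the sign over many steps, this sketch repeats the update 20 times with each version. It reuses the numbers from the value-update exercise (V_s = 10, alpha = 0.2, target = 20); the helper run_updates and the step count are illustrative.

```python
def run_updates(v, target, alpha, steps, wrong_sign=False):
    """Apply the value update repeatedly and record every estimate."""
    history = [v]
    for _ in range(steps):
        error = (v - target) if wrong_sign else (target - v)
        v = v + alpha * error
        history.append(v)
    return history

good = run_updates(10.0, 20.0, 0.2, 20)
bad = run_updates(10.0, 20.0, 0.2, 20, wrong_sign=True)

print(f"correct sign: start={good[0]:.1f}  end={good[-1]:.3f}")
print(f"wrong sign:   start={bad[0]:.1f}  end={bad[-1]:.3f}")
```

    With the correct sign the gap to the target shrinks by a factor of (1 - alpha) per step; with the wrong sign it grows by (1 + alpha) per step.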

    📋 Quick Reference

    Concept      | Symbol / Formula         | Meaning
    Reward       | r                        | Single number returned each step
    Return       | G_t = Σ_k γ^k·r_{t+k}    | Total discounted future reward
    Discount     | 0 ≤ γ ≤ 1                | How much the future matters
    State value  | V(s)                     | Expected return from state s
    Action value | Q(s, a)                  | Expected return of action a in s
    Bellman      | V(s) = r + γ·V(s′)       | Value as reward + next value
    Update rule  | V ← V + α·(target − V)   | Nudge estimate toward observed return
    Episode      | start … terminal         | One run that ends and resets

    ❓ Frequently Asked Questions

    Q: What is the return G_t in reinforcement learning?

    A: The return is the total reward an agent collects from a moment onward, with each future reward shrunk by the discount factor gamma: G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + .... It is the quantity RL actually tries to maximise — not the single next reward.

    Q: What is the difference between V(s) and Q(s,a)?

    A: V(s) is the expected return when you start in state s and then act well. Q(s,a) is the expected return when you start in state s, take action a now, and act well afterwards. Q scores each action, so you can pick the best one, and V(s) equals the Q value of that best action.

    Q: What does the Bellman equation actually say?

    A: It says the value of a state equals the immediate reward plus the discounted value of where you land next: V(s) = r + gamma*V(s'). It turns one giant sum over the whole future into a short recursive rule, which is what makes RL computable.

    Q: What is the difference between an episode and a continuing task?

    A: An episodic task has a natural end — a game finishes, a robot reaches the goal — and then resets. A continuing task never terminates, like a server that runs forever. Discounting (gamma < 1) keeps the return finite even for continuing tasks.

    Q: What does on-policy versus off-policy mean?

    A: On-policy methods learn about the same policy they use to choose actions. Off-policy methods can learn about one policy (for example the best greedy one) while exploring with a different policy. Off-policy is more flexible and lets you reuse old experience, but is trickier to make stable.

    🎯 Mini-Challenge: Average Return Over Episodes

    Support is faded here — only the outline is given. Write the function and the averaging logic yourself, then check against the expected output in the comments.

    Mini-Challenge: Average Return

    Write discounted_return() and average it over 3 episodes

    Python
    # 🎯 MINI-CHALLENGE: average return over several episodes
    # 1. You have 3 episodes, each a list of rewards (given below).
    # 2. Write a function discounted_return(rewards) using gamma = 0.9
    #    that returns r0 + 0.9*r1 + 0.9^2*r2 + ...
    # 3. Compute the return for each episode, then print the AVERAGE return.
    #
    # ✅ Expected (gamma=0.9): Average return = 6.660
    
    gamma = 0.9
    episodes = [
        [1, 2, 3],      # return = 1 + 1.8 + 2.43 = 5.23
        [0, 0, 10],     # return = 0 + 0   + 8.1  = 8.10
        [5, 
    ...
    🎉

    Lesson complete — you now know the machinery of RL!

    You can compute a discounted return, tell V(s) from Q(s,a), read the Bellman equation as plain English, separate episodic from continuing tasks, and explain on-policy vs off-policy. These are the exact pieces every RL algorithm assembles.

    🚀 Up next: Q-Learning & DQN — turn the Bellman update into an algorithm that learns to play Atari from raw pixels.
