
    Lesson 13 • Intermediate

    Reinforcement Learning

    Teach an agent to make good decisions through trial and error. By the end you'll understand the RL loop, write an epsilon-greedy agent in plain Python, and know where RL powers real systems — from game AI to robots.

    What You'll Learn in This Lesson

    • What an agent, environment, state, action, and reward are
    • How the RL loop turns experience into better decisions
    • The difference between a policy and a value estimate
    • Why exploration vs exploitation is the core RL dilemma
    • How a Markov Decision Process frames any RL problem
    • Where RL is used — games, robotics, and beyond

    🐶 Real-World Analogy: Training a Dog with Treats

    You can't hand a puppy a textbook. Instead you say "sit", wait, and the moment it sits you give a treat. The treat is a reward. The puppy doesn't know why the treat came at first — but after many tries it links the situation ("I heard 'sit'") and the action ("I sat") to the reward.

    That is reinforcement learning exactly. The puppy is the agent. Your living room is the environment. "I just heard the word sit" is the state. Sitting, lying down, or running off are the actions. The treat (or no treat) is the reward. Over time the puppy forms a policy: "when I hear 'sit', sitting pays off."

    Notice there were no labelled examples — nobody showed the puppy ten thousand photos of correct sitting. It learned purely from interaction and feedback. That's what separates RL from supervised learning.

    1. The Five Pieces: Agent, Environment, State, Action, Reward

    Every RL problem is built from the same five pieces. Learn these names once and the rest of the field reads easily.

    • Agent — the learner and decision-maker (the puppy, the robot, the game bot).
    • Environment — the world the agent acts in. It receives actions and returns the next state and a reward.
    • State — a snapshot of the current situation (where you are on the board, the robot's joint angles).
    • Action — one of the choices available right now (move left, pull lever 2, accelerate).
    • Reward — a single number of feedback. Positive means "good", negative means "bad". The agent's whole goal is to maximise the total reward over time.

    Jargon check: a policy is the agent's strategy — a rule mapping each state to an action. The agent's job is to improve its policy until it reliably collects high reward.
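    To make the jargon concrete, a policy can be written as a plain lookup from state to action. This is only a sketch: the state and action names below are made up for the puppy analogy and are not part of any library.

```python
# A policy is a rule mapping each state to an action.
# ASSUMPTION: these state and action names are hypothetical, chosen
# only to mirror the puppy analogy.
policy = {
    "heard_sit": "sit",
    "heard_come": "walk_to_owner",
    "heard_nothing": "wait",
}

def act(state):
    # Look up the action the current policy prescribes for this state.
    return policy[state]

print(act("heard_sit"))  # prints: sit
```

    Learning, in this picture, means changing the entries of the table until the chosen actions collect high reward.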

    2. The RL Loop (Worked Example)

    RL runs in a loop: the agent observes a state, picks an action, the environment returns a new state and a reward, and round it goes. One full run from start to finish is called an episode.

    Read this fully-commented example. It builds a tiny 5-cell corridor by hand and walks an agent to the goal so you can see every part of the loop with your own eyes. Run it and match the output.

    Worked Example: The RL Loop by Hand

    A 5-cell corridor showing state, action, reward, and episode end

    Python
    # The reinforcement learning loop, spelled out step by step.
    # No frameworks — just plain Python so you can SEE every part.
    
    # The "environment": a tiny 1D corridor of 5 cells, 0..4.
    # The agent starts at cell 0. The treasure is at cell 4.
    GOAL = 4
    
    def reward_for(state):
        # +10 when you reach the treasure, -1 for every other step
        # (the -1 nudges the agent toward the SHORTEST path).
        return 10 if state == GOAL else -1
    
    def step(state, action):
        # An action is "right" (+1) or "left"
    ...

    The -1 per step plus +10 at the goal is a simple piece of reward design, often loosely called reward shaping: choosing the numbers so the agent prefers short paths. Get the numbers wrong and the agent learns the wrong thing.
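    The loop described in this section can be sketched end to end in a few lines. This is an independent sketch, not the lesson's own listing: the wall-clamping behaviour in step(), the run_episode helper, and the fixed always_right policy are assumptions made for illustration.

```python
GOAL = 4  # treasure cell in a corridor of cells 0..4

def reward_for(state):
    # +10 on reaching the goal, -1 for every other step.
    return 10 if state == GOAL else -1

def step(state, action):
    # ASSUMPTION: action is +1 (right) or -1 (left), and the walls clamp
    # the position to the range 0..GOAL.
    next_state = min(GOAL, max(0, state + action))
    done = next_state == GOAL
    return next_state, reward_for(next_state), done

def run_episode(policy, start=0, max_steps=50):
    # One episode: observe state, pick action, receive reward, repeat.
    state, total, steps = start, 0, 0
    done = False
    while not done and steps < max_steps:
        action = policy(state)
        state, reward, done = step(state, action)
        total += reward
        steps += 1
    return total, steps

def always_right(state):
    # A fixed, hand-written policy: ignore the state and move right.
    return +1

total, steps = run_episode(always_right)
print("steps:", steps, "total reward:", total)  # steps: 4 total reward: 7
```

    Three steps at -1 followed by +10 at the goal give a total of 7. Any detour adds more -1 steps, which is why the step penalty favours the shortest path.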

    3. Policy vs Value — Two Things the Agent Learns

    There are two quantities an RL agent can learn, and beginners mix them up constantly:

    Policy — "what should I do?"

    A rule that maps a state to an action. Ask it "I'm in state X, what now?" and it answers with an action. The puppy's "hear sit → sit" is a policy.

    Value — "how good is this?"

    A number estimating the total future reward you expect from a state (or a state-action pair). It doesn't tell you what to do directly — it scores options.

    Many algorithms learn values first, then act greedily on them: "estimate how good each action is, then pick the highest." That is exactly what the bandit below does — it keeps a value estimate per arm and exploits the best one.
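    The "learn values, then act greedily" idea fits in a few lines. The numbers and action names below are invented for illustration; the point is only that a value table scores options, and taking the highest score turns it into a decision.

```python
# Value estimates for three actions in one state.
# ASSUMPTION: hypothetical numbers, not learned from any real run.
action_values = {"sit": 0.9, "lie_down": 0.2, "run_off": -0.5}

def greedy_action(values):
    # The value table only scores options; picking the highest-scoring
    # action converts those scores into a policy for this state.
    return max(values, key=values.get)

print(greedy_action(action_values))  # prints: sit
```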

    4. Exploration vs Exploitation (Worked Example)

    Here is the central dilemma of RL. Exploitation means choosing the action you currently think is best. Exploration means trying a different action to find out if something better exists. Lean too far either way and you lose.

    The classic toy problem is the multi-armed bandit: several slot machine arms, each paying out with a hidden probability. The epsilon-greedy rule solves it simply — with probability epsilon (say 0.1) explore a random arm, otherwise exploit the arm with the highest current estimate.

    Worked Example: Epsilon-Greedy Bandit

    Discover the best of 3 slot machines with plain Python

    Python
    import random
    
    # Multi-armed bandit: 3 slot machines ("arms").
    # Each arm pays 1 with a hidden probability. The agent must DISCOVER
    # the best arm using only the rewards it sees.
    random.seed(7)                       # deterministic so the output is fixed
    
    true_probs = [0.2, 0.5, 0.8]         # hidden! the agent never reads this
    n_arms = len(true_probs)
    
    def pull(arm):
        # Returns 1 (win) or 0 (lose) for the chosen arm.
        return 1 if random.random() < true_probs[arm] else 0
    
    # What the agent 
    ...

    Because most pulls exploit the best-known arm, arm 2 (the true best) gets pulled thousands of times — but the occasional 10% exploration is what let the agent find arm 2 in the first place. That balance is the whole game.
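    For reference, here is one compact way to write the whole epsilon-greedy bandit with an incremental-mean update. It is a sketch under stated assumptions rather than the lesson's own listing: the run_bandit function, its parameter names, the 5000-pull budget, and the use of a local random.Random generator are all choices made here.

```python
import random

def run_bandit(true_probs, epsilon=0.1, pulls=5000, seed=0):
    rng = random.Random(seed)      # local generator so runs are repeatable
    n_arms = len(true_probs)
    estimates = [0.0] * n_arms     # current value estimate per arm
    counts = [0] * n_arms          # how often each arm was pulled

    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1 if rng.random() < true_probs[arm] else 0
        counts[arm] += 1
        # Incremental mean: move the estimate toward the new reward by
        # 1/n, which equals the average of all rewards seen for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = run_bandit([0.2, 0.5, 0.8])
best = max(range(len(estimates)), key=lambda a: estimates[a])
print("best arm by estimate:", best)
print("pull counts:", counts)
```

    The incremental form avoids storing every past reward: each arm needs only its running estimate and its pull count.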

    5. The Markov Decision Process (MDP)

    The Markov Decision Process is the formal frame underneath every RL problem. Don't let the name scare you — it's just the five pieces written precisely:

    • States (S) — all the situations the agent can be in.
    • Actions (A) — what the agent can do in each state.
    • Transitions (P) — the probability that action a in state s leads to state s′.
    • Rewards (R) — the number you get for each transition.
    • Discount (γ, gamma) — how much you value future rewards versus immediate ones (0 to 1).

    The "Markov" property is the key simplifying assumption: the next state depends only on the current state and action — not on the entire history of how you got there. That makes the maths tractable. Your corridor example was a small MDP, and so was the bandit (a one-state MDP).
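    The discount γ is easiest to see with numbers. The sketch below computes the standard discounted return, r_0 + γ·r_1 + γ²·r_2 + ..., for the corridor walk. It assumes the reward sequence -1, -1, -1, +10 for walking straight from cell 0 to cell 4; the discounted_return helper is a name introduced here.

```python
def discounted_return(rewards, gamma):
    # G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# ASSUMPTION: rewards for walking right from cell 0 to cell 4.
corridor_rewards = [-1, -1, -1, 10]

print(round(discounted_return(corridor_rewards, gamma=1.0), 2))  # 7.0
print(round(discounted_return(corridor_rewards, gamma=0.9), 2))  # 4.58
```

    With γ = 1 every reward counts fully. With γ = 0.9 the +10 arrives three steps late and is worth only 7.29 by then, so a smaller γ makes the agent care more about reaching the goal quickly.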

    6. Where Reinforcement Learning Is Used

    🎮 Games

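    AlphaGo beat world champion Lee Sedol at Go in 2016; agents have reached expert or superhuman play in Atari, chess, and StarCraft, driven mainly by reward signals (some systems also bootstrap from human games).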

    🤖 Robotics

    Robots learn to walk, grasp, and balance by trial and error in simulation, then transfer the policy to hardware.

    💬 Language models

    RLHF fine-tunes chatbots — humans rank responses and RL nudges the model toward answers people prefer.

    📈 Operations

    Recommendation, ad bidding, traffic-light timing, and data-centre cooling are tuned with RL to maximise a long-term metric.

    🌍 In the Real World: Gymnasium

    You built tiny environments by hand to learn the mechanics. In real projects you reach for Gymnasium (the maintained successor to OpenAI Gym), which provides ready-made environments behind a standard reset() / step() interface — the very same loop you wrote.

    This block needs pip install gymnasium, so run it on your own machine rather than in the sandbox above. Read it to see how the hand-built loop maps onto a real library.

    # In a real project you don't hand-build the environment — you use
    # Gymnasium (the maintained successor to OpenAI Gym). It gives you
    # ready-made worlds with a standard reset()/step() interface.
    #
    #   pip install gymnasium
    import gymnasium as gym
    
    env = gym.make("FrozenLake-v1", is_slippery=False)
    
    state, info = env.reset(seed=0)      # start a new episode
    done = False
    total = 0
    
    while not done:
        action = env.action_space.sample()        # random policy (placeholder)
        state, reward, terminated, truncated, info = env.step(action)
        total += reward
        done = terminated or truncated
    
    print("episode finished, total reward:", total)
    
    # Typical output (random actions rarely reach the goal; the action
    # sampler is not seeded here, so an occasional run prints 1.0):
    # episode finished, total reward: 0.0
    #
    # A trained agent would learn a policy that reaches the goal for reward 1.0.
    # Gymnasium gives you the SAME state/action/reward loop you built by hand
    # above — it just supplies the environment for you.

    🎯 Your Turn #1: Exploit the Best Arm

    The explore branch is written for you. Fill in the exploit line so the agent picks the arm with the highest estimate. Match the expected output in the comment.

    Your Turn: Pick the Highest-Estimated Action

    Fill in the greedy exploit step

    Python
    # 🎯 YOUR TURN — fill in the blanks marked with ___
    import random
    random.seed(1)
    
    # Hidden true probabilities for 3 arms (the agent can't read these).
    true_probs = [0.3, 0.9, 0.6]
    
    def pull(arm):
        return 1 if random.random() < true_probs[arm] else 0
    
    estimates = [0.0, 0.0, 0.0]
    counts    = [0, 0, 0]
    epsilon   = 0.1
    
    for t in range(1500):
        if random.random() < epsilon:
            arm = random.randrange(3)              # explore
        else:
            # 👉 EXPLOIT: choose the arm with the HIGHEST es
    ...

    🎯 Your Turn #2: Make the Epsilon-Greedy Choice

    Now write both halves of the decision: the explore return (a random arm) and the exploit return (the best arm). The hints are right there in the comments.

    Your Turn: Explore or Exploit

    Complete the epsilon-greedy action selection

    Python
    # 🎯 YOUR TURN — fill in the blanks marked with ___
    import random
    random.seed(2)
    
    estimates = [0.1, 0.7, 0.4]     # pretend the agent already learned these
    epsilon   = 0.2                 # explore 20% of the time
    
    def choose_action():
        if random.random() < epsilon:
            # 👉 EXPLORE: return a RANDOM arm number from 0, 1, 2.
            return ___              # 👉 replace ___  (hint: random.randrange(3))
        else:
            # 👉 EXPLOIT: return the arm with the highest estimate.
            return 
    ...

    🏆 Mini-Challenge: A 4-Arm Bandit from Scratch

    The scaffolding comes off here: you get only a comment outline, so write the epsilon-greedy agent yourself. Use the two worked examples above as your reference.

    Mini-Challenge: Epsilon-Greedy, 4 Arms

    Build the whole agent from a comment outline

    Python
    import random
    random.seed(0)
    
    # 🎯 MINI-CHALLENGE: epsilon-greedy on a 4-arm bandit
    # 1. true_probs = [0.1, 0.25, 0.6, 0.45]  (hidden payouts; arm 2 is best)
    # 2. Write pull(arm): return 1 if random.random() < true_probs[arm] else 0
    # 3. Keep estimates = [0, 0, 0, 0] and counts = [0, 0, 0, 0]
    # 4. Loop 3000 times with epsilon = 0.1:
    #       - explore (random arm) with probability epsilon, else exploit best
    #       - update counts and estimates with the incremental mean
    # 5. Print the agent's bes
    ...

    Common Mistakes (And How to Fix Them)

    ❌ No exploration (epsilon = 0)

    A pure-greedy agent locks onto whichever arm happened to win first and never discovers the truly best one.

    ✅ Fix: keep a small exploration rate (e.g. epsilon = 0.1), or decay it from high to low over time so you explore early and exploit later.
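    One common way to decay epsilon is an exponential schedule with a floor. The sketch below is illustrative only: the decayed_epsilon name and the start, end, and decay values are assumptions, not tuned recommendations.

```python
def decayed_epsilon(t, start=1.0, end=0.05, decay=0.995):
    # Exponential decay from `start`, never dropping below the floor `end`.
    return max(end, start * decay ** t)

# Explore almost always at first, then settle at the 5% floor.
for t in (0, 100, 500, 1000):
    print(t, round(decayed_epsilon(t), 3))
```

    In a bandit loop this would replace the fixed epsilon: call decayed_epsilon(t) at each step t before deciding whether to explore.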

    ❌ Bad reward shaping

    If you reward the agent for the wrong thing, it optimises the wrong thing. A vacuum bot rewarded for "dust collected" learns to dump dust and re-vacuum it.

    ✅ Fix: reward the outcome you actually want, not a proxy. Add small penalties (like -1 per step) to discourage stalling.

    ❌ Sparse rewards

    If reward only arrives at a distant goal and is zero everywhere else, the agent wanders randomly and almost never stumbles onto it, so it never learns.

    ✅ Fix: add intermediate rewards (shaping), shrink the environment while learning, or use exploration bonuses to encourage reaching new states.
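    As one possible form of exploration bonus, the sketch below adds a count-based bonus that shrinks each time a state is revisited. The formula beta / sqrt(visits), the reward_with_bonus name, and the state label are assumptions chosen for simplicity; other bonus schemes exist.

```python
import math
from collections import defaultdict

visit_counts = defaultdict(int)   # state -> number of visits so far

def reward_with_bonus(state, env_reward, beta=1.0):
    # ASSUMPTION: a simple count-based scheme, bonus = beta / sqrt(visits).
    # New states earn a large bonus; familiar states earn almost none.
    visit_counts[state] += 1
    bonus = beta / math.sqrt(visit_counts[state])
    return env_reward + bonus

print(reward_with_bonus("cell_3", 0.0))  # first visit: 1.0
print(reward_with_bonus("cell_3", 0.0))  # second visit: about 0.707
```

    Even when the environment's own reward is zero everywhere, the bonus gives the agent a reason to keep reaching states it has not seen much.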

    ❌ Ignoring non-stationarity

    If the environment changes over time (payouts drift, an opponent adapts), an estimate built as a plain average of all past rewards reacts too slowly.

    ✅ Fix: use a fixed step size — estimate += alpha * (reward - estimate) — so recent experience counts more than ancient experience.
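    The difference between the two update rules shows up clearly on a payout that changes partway through. In this sketch the reward sequence (100 zeros, then 20 ones) and alpha = 0.1 are made-up values for illustration.

```python
# ASSUMPTION: a toy reward stream where the payout jumps from 0 to 1.
rewards = [0] * 100 + [1] * 20

avg_estimate, n = 0.0, 0
const_estimate, alpha = 0.0, 0.1

for r in rewards:
    n += 1
    avg_estimate += (r - avg_estimate) / n          # plain average of all rewards
    const_estimate += alpha * (r - const_estimate)  # fixed step size

print(round(avg_estimate, 3))    # 0.167
print(round(const_estimate, 3))  # 0.878
```

    After 20 steps at the new payout, the plain average is still dragged down by the 100 old zeros, while the fixed-step estimate has already moved most of the way to 1.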

    📋 Quick Reference

    Term            | What It Means                              | In the Examples
    Agent           | The learner that chooses actions           | The bandit player
    Environment     | Returns next state + reward                | The corridor / slot machines
    State           | The current situation                      | state = the cell number
    Action          | A choice available now                     | Move right / pull an arm
    Reward          | Numeric feedback to maximise               | +10 goal, -1 step, win/lose
    Policy          | State → action strategy                    | "pick the best-estimated arm"
    Value           | Expected future reward                     | estimates[arm]
    Epsilon-greedy  | Explore ε of the time, else exploit        | random.random() < epsilon
    MDP             | States, actions, transitions, rewards, γ   | The whole formal frame

    Pro tip: ChatGPT and other chat models are fine-tuned with Reinforcement Learning from Human Feedback (RLHF). Humans rank responses, and RL nudges the model toward the answers people prefer — the exact same reward-maximising loop you learned here, just with a language model as the agent.

    ❓ Frequently Asked Questions

    Q: What is reinforcement learning in simple terms?

    A: It is learning by trial and error. An agent takes actions in an environment, receives rewards or penalties, and gradually learns a policy — a strategy for choosing actions — that maximises its total reward over time.

    Q: How is reinforcement learning different from supervised learning?

    A: Supervised learning trains on labelled examples that tell it the correct answer. Reinforcement learning has no labels — the agent only gets a reward signal after acting, and must figure out for itself which actions lead to good outcomes.

    Q: What is the exploration vs exploitation trade-off?

    A: Exploitation means picking the action you currently believe is best; exploration means trying other actions to discover something better. If you never explore you can get stuck on a mediocre choice; if you always explore you waste reward. Epsilon-greedy balances the two.

    Q: What is a Markov Decision Process (MDP)?

    A: An MDP is the mathematical frame for RL. It is defined by states, actions, transition probabilities, a reward for each transition, and a discount factor. The 'Markov' part means the next state depends only on the current state and action, not the full history.

    Q: Where is reinforcement learning actually used?

    A: Game-playing AI (AlphaGo, Atari, chess), robotics and locomotion control, recommendation and ad systems, traffic and energy optimisation, and fine-tuning large language models with human feedback (RLHF).

    🎉 Lesson Complete!

    You now know the language of reinforcement learning: an agent acts in an environment, observing states, choosing actions, and collecting rewards in a loop. You can explain policy vs value, balance exploration vs exploitation with epsilon-greedy, frame a problem as a Markov Decision Process, and name where RL is used.

    🚀 Up next: Model Deployment — take a trained model out of your notebook and put it into production.
