Lesson 28 • Advanced

    Reinforcement Learning Basics

    Train agents to make decisions through trial and error — Markov Decision Processes, value iteration, and the exploration-exploitation tradeoff.

    ✅ What You'll Learn

• Markov Decision Processes (MDPs): states, actions, rewards
• Value iteration and the Bellman equation
• Exploration vs exploitation: ε-greedy and UCB
• The multi-armed bandit problem

    🎮 Learning by Doing

    🎯 Real-World Analogy: Reinforcement learning is like training a puppy. You don't show it examples of "good behaviour" (supervised learning). Instead, you let it try things and give treats (+reward) for sitting and say "no" (-reward) for chewing shoes. Over time, the puppy learns which actions lead to treats. The Bellman equation formalises this: "the value of a state = immediate reward + discounted value of the best next state."
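That last sentence is the whole algorithm in miniature. Here is a minimal numeric sketch of one Bellman backup — the states and reward values are invented purely for illustration:

```python
# One Bellman backup, with made-up numbers for illustration
gamma = 0.9                      # discount factor
reward = 1.0                     # immediate treat for "sit"
next_values = {"sit_again": 5.0, "chew_shoe": -3.0}  # hypothetical next states

# Value = immediate reward + discounted value of the best next state
value = reward + gamma * max(next_values.values())
print(value)  # 1.0 + 0.9 * 5.0 = 5.5
```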

    RL is behind AlphaGo (beating world champions at Go), ChatGPT (RLHF for alignment), autonomous driving, robotics, and game-playing AI. Unlike supervised learning, the agent must discover the right actions through interaction with an environment.

    Try It: Markov Decision Process

    Solve a grid world with value iteration — find the optimal policy

    Python
import numpy as np

# Markov Decision Process: The Foundation of RL
# An agent takes actions in states and receives rewards

np.random.seed(42)

# Simple Grid World: 4 states, 2 actions
# States: [Start, Path, Trap, Goal]
# Actions: [Left, Right]

states = ["Start", "Path", "Trap", "Goal"]
actions = ["Left", "Right"]

# Transition probabilities: P(next_state | state, action)
# Simplified: deterministic transitions
transitions = {
    ("Start", "Right"): "Path",
    ("Start", "Left"):  "Start",
    ("Path",  "Right"): "Goal",
    ("Path",  "Left"):  "Trap",
    ("Trap",  "Right"): "Trap",   # Trap is absorbing
    ("Trap",  "Left"):  "Trap",
    ("Goal",  "Right"): "Goal",   # Goal is absorbing
    ("Goal",  "Left"):  "Goal",
}

# Reward received when entering each state
rewards = {"Start": 0.0, "Path": 0.0, "Trap": -10.0, "Goal": 10.0}

gamma = 0.9  # discount factor

# Value iteration: apply the Bellman update until values converge.
# Trap and Goal are terminal, so their values stay at 0.
V = {s: 0.0 for s in states}
for _ in range(50):
    for s in ["Start", "Path"]:
        V[s] = max(rewards[transitions[(s, a)]]
                   + gamma * V[transitions[(s, a)]]
                   for a in actions)

# Extract the optimal policy: pick the action with the highest backup
policy = {s: max(actions,
                 key=lambda a: rewards[transitions[(s, a)]]
                 + gamma * V[transitions[(s, a)]])
          for s in ["Start", "Path"]}

print("Optimal values:", {s: round(v, 2) for s, v in V.items()})
print("Optimal policy:", policy)  # Right from Start, Right from Path

    Try It: Multi-Armed Bandit

    Balance exploration and exploitation across 5 slot machines

    Python
import numpy as np

# Multi-Armed Bandit: Exploration vs Exploitation
# The fundamental dilemma of reinforcement learning

np.random.seed(42)

class SlotMachine:
    def __init__(self, true_mean):
        self.true_mean = true_mean

    def pull(self):
        # Noisy reward: Gaussian around the machine's true mean
        return np.random.randn() + self.true_mean

def epsilon_greedy(machines, n_pulls, epsilon):
    """Epsilon-greedy: explore randomly with probability epsilon"""
    n = len(machines)
    counts = np.zeros(n)   # pulls per machine
    values = np.zeros(n)   # running mean reward per machine
    total_reward = 0.0

    for _ in range(n_pulls):
        if np.random.rand() < epsilon:
            i = np.random.randint(n)       # explore: random machine
        else:
            i = int(np.argmax(values))     # exploit: best estimate so far
        reward = machines[i].pull()
        counts[i] += 1
        values[i] += (reward - values[i]) / counts[i]  # incremental mean
        total_reward += reward
    return total_reward, values

machines = [SlotMachine(m) for m in [1.0, 1.5, 2.0, 2.5, 3.0]]
total, estimates = epsilon_greedy(machines, n_pulls=1000, epsilon=0.1)
print(f"Total reward: {total:.1f}")
print("Estimated means:", np.round(estimates, 2))
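The interactive example uses ε-greedy; the UCB strategy from the learning goals can be sketched in the same style. Instead of exploring at random, UCB1 adds an optimism bonus that shrinks as an arm is sampled more (the arm means below are invented for illustration):

```python
import numpy as np

# UCB1: always pick the arm with the highest optimistic estimate.
# The bonus sqrt(2·ln t / n_i) is large for rarely-tried arms.
def ucb1(pull, n_arms, n_pulls):
    counts = np.zeros(n_arms)
    values = np.zeros(n_arms)
    # Pull every arm once to initialise the estimates
    for i in range(n_arms):
        values[i] = pull(i)
        counts[i] = 1
    for t in range(n_arms, n_pulls):
        bonus = np.sqrt(2 * np.log(t + 1) / counts)
        i = int(np.argmax(values + bonus))   # optimistic choice
        r = pull(i)
        counts[i] += 1
        values[i] += (r - values[i]) / counts[i]  # incremental mean
    return values, counts

rng = np.random.default_rng(0)
true_means = [1.0, 2.0, 3.0]  # hypothetical arms
values, counts = ucb1(lambda i: rng.normal(true_means[i]), 3, 500)
print("Pulls per arm:", counts)  # the best arm should dominate
```

Unlike ε-greedy, UCB never stops exploring entirely — it just explores less and less where the evidence is already strong.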

    ⚠️ Common Mistake: Setting the discount factor γ too low. With γ=0.1, the agent only cares about immediate rewards and makes short-sighted decisions. With γ=0.99, it plans far ahead but training is slower. Start with γ=0.9 for most problems.
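A quick way to build intuition for γ is the effective planning horizon: a constant reward of 1 per step sums to 1 + γ + γ² + … = 1/(1−γ), so 1/(1−γ) is roughly how many steps ahead the agent "sees":

```python
# Effective horizon for different discount factors:
# a constant reward of 1 per step sums to 1/(1 - gamma)
horizons = {gamma: 1 / (1 - gamma) for gamma in [0.1, 0.9, 0.99]}
for gamma, h in horizons.items():
    print(f"γ={gamma}: effective horizon ≈ {h:.0f} steps")
```

With γ=0.1 the horizon is barely one step; with γ=0.99 it is about a hundred, which is why training takes longer to propagate values back.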

    💡 Pro Tip: For learning RL, start with OpenAI Gymnasium (formerly Gym). Try CartPole first (balance a pole), then LunarLander (land a spacecraft). Use Stable Baselines3 for production-quality implementations of PPO, DQN, and A2C.

    📋 Quick Reference

Concept          Formula                        Meaning
Value Function   V(s)                           Expected total reward from state s
Bellman Eq.      V(s) = R(s) + γ·max_a V(s')    Recursive value decomposition
Discount γ       0 ≤ γ ≤ 1                      How much future rewards matter
Policy π         π(a|s)                         Probability of action a in state s
ε-greedy         P(explore) = ε                 Random exploration rate

    🎉 Lesson Complete!

    You now understand the fundamentals of RL. Next, learn Q-Learning and Deep Q-Networks — the algorithms that taught AI to play Atari!


