
    Lesson 13 • Intermediate

    Reinforcement Learning

    Teach AI agents to make decisions through trial and error — the technology behind game-playing AI, robotics, and autonomous systems.

    ✅ What You'll Learn

    • The exploration vs exploitation dilemma
    • Q-Learning with epsilon-greedy strategy
    • Multi-armed bandit problem
    • Policy gradient methods (REINFORCE)

    🎮 What Is Reinforcement Learning?

    🎯 Real-World Analogy: Think of training a puppy. You don't show it 10,000 labelled examples of "sit" (supervised learning). Instead, the puppy tries random actions, and you give a treat (reward) when it sits. Over time, the puppy learns: "sitting when I hear 'sit' = treat!" That's reinforcement learning.

    Unlike supervised learning (learn from labels) or unsupervised learning (find patterns), RL learns from interaction. An agent takes actions in an environment, receives rewards, and learns a policy — a strategy for choosing actions that maximise long-term reward.
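    "Long-term reward" is usually formalised as the discounted return: future rewards are summed, but each step further into the future is scaled down by a discount factor. A minimal sketch (the reward sequence and the 0.9 discount factor are illustrative choices, not from a specific environment):

    ```python
    # Discounted return: G = r0 + gamma*r1 + gamma^2*r2 + ...
    # gamma < 1 makes near-term rewards count more than distant ones.
    rewards = [1, 0, 0, 10]      # hypothetical reward sequence
    gamma = 0.9                  # illustrative discount factor
    G = sum(gamma**t * r for t, r in enumerate(rewards))
    print(round(G, 3))           # 1 + 0.9**3 * 10 = 8.29
    ```

    A smaller gamma makes the agent short-sighted (it prefers the immediate +1), while gamma close to 1 makes it patient enough to value the +10 four steps away.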

    🧩 Core Components

    • Agent — The learner/decision-maker
    • Environment — The world the agent acts in
    • State — Current situation
    • Action — What the agent can do
    • Reward — Feedback signal (positive or negative)
    • Policy — Strategy mapping states → actions
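    These components fit together in a single interaction loop: the agent reads the state, its policy picks an action, and the environment answers with a reward and the next state. A minimal sketch, using a hypothetical one-state environment where action 1 pays off 70% of the time:

    ```python
    import random

    def environment_step(state, action):
        # Hypothetical dynamics: action 1 is rewarded 70% of the time
        reward = 1 if (action == 1 and random.random() < 0.7) else 0
        return state, reward  # single-state environment for simplicity

    random.seed(0)
    policy = {0: 1}           # policy: maps state -> action
    state, total = 0, 0
    for t in range(100):
        action = policy[state]                      # agent consults its policy
        state, reward = environment_step(state, action)
        total += reward                             # reward is the feedback signal
    print("Total reward:", total)
    ```

    Here the policy is fixed; everything that follows in this lesson is about *learning* the policy (or the values behind it) from those reward signals.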

    Try It: Multi-Armed Bandit

    Explore the exploration vs exploitation trade-off with slot machines

    Python
    import numpy as np
    
    # Multi-Armed Bandit: The Exploration vs Exploitation Dilemma
    # Imagine 4 slot machines — which one pays the most?
    
    np.random.seed(42)
    
    # True reward probabilities (agent doesn't know these!)
    true_probs = [0.2, 0.5, 0.75, 0.3]
    n_arms = len(true_probs)
    
    def pull_arm(arm):
        """Simulate pulling a slot machine lever"""
        return 1 if np.random.random() < true_probs[arm] else 0
    
    # Strategy 1: Epsilon-Greedy
    def epsilon_greedy(n_rounds, epsilon=0.1):
        counts = np.zeros(n_arms)   # times each arm was pulled
        values = np.zeros(n_arms)   # running-average reward per arm
        total_reward = 0
        for _ in range(n_rounds):
            # Explore with probability epsilon, otherwise exploit the best estimate
            if np.random.random() < epsilon:
                arm = np.random.randint(n_arms)
            else:
                arm = int(np.argmax(values))
            reward = pull_arm(arm)
            counts[arm] += 1
            values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
            total_reward += reward
        return total_reward, values
    
    total, estimates = epsilon_greedy(1000)
    print(f"Total reward over 1000 rounds: {total}")
    print("Estimated probabilities:", np.round(estimates, 2))

    Try It: Q-Learning Grid Navigation

    Train an agent to find the shortest path through a 4×4 grid

    Python
    import numpy as np
    
    # Q-Learning: Teaching an agent to navigate a grid world
    # The agent learns which actions lead to rewards
    
    # Grid: 4x4, Agent starts at (0,0), Goal at (3,3)
    grid_size = 4
    n_states = grid_size * grid_size
    n_actions = 4  # up, down, left, right
    actions = ["UP", "DOWN", "LEFT", "RIGHT"]
    
    # Q-table: stores expected reward for each (state, action) pair
    Q = np.zeros((n_states, n_actions))
    
    # Reward: +10 at goal, -1 per step (encourages shortest path)
    def get_reward(state):
        return 10 if state == n_states - 1 else -1  # goal vs step cost
    
    def step(state, action):
        """Apply an action, keeping the agent inside the grid"""
        row, col = divmod(state, grid_size)
        if action == 0:   row = max(row - 1, 0)              # UP
        elif action == 1: row = min(row + 1, grid_size - 1)  # DOWN
        elif action == 2: col = max(col - 1, 0)              # LEFT
        else:             col = min(col + 1, grid_size - 1)  # RIGHT
        return row * grid_size + col
    
    # Q-update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    alpha, gamma, epsilon = 0.1, 0.9, 0.1
    for episode in range(500):
        state = 0
        while state != n_states - 1:
            if np.random.random() < epsilon:
                action = np.random.randint(n_actions)   # explore
            else:
                action = int(np.argmax(Q[state]))       # exploit
            next_state = step(state, action)
            reward = get_reward(next_state)
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state
    
    print("Best first move from (0,0):", actions[int(np.argmax(Q[0]))])

    Try It: Policy Gradient (REINFORCE)

    Learn a policy directly — scales to complex environments

    Python
    import numpy as np
    
    # Policy Gradient: Learning a POLICY directly (not a value table)
    # Used in complex environments where Q-tables are too large
    
    def softmax(x):
        exp_x = np.exp(x - np.max(x))
        return exp_x / np.sum(exp_x)
    
    # Simple environment: choose action 0, 1, or 2
    # Action 2 gives best reward on average
    n_actions = 3
    true_rewards = [1.0, 2.0, 5.0]  # expected rewards
    
    # Policy parameters (will be learned)
    np.random.seed(42)
    theta = np.zeros(n_actions)
    learning_rate = 0.1
    
    # REINFORCE: sample an action, then nudge theta toward actions
    # that earned an above-baseline reward
    baseline = 0.0
    for t in range(200):
        probs = softmax(theta)
        action = np.random.choice(n_actions, p=probs)
        reward = np.random.normal(true_rewards[action], 0.5)  # noisy reward
        baseline += 0.05 * (reward - baseline)  # running-average baseline
        # Gradient of log pi(action | theta) is one-hot(action) - probs
        grad = -probs
        grad[action] += 1.0
        theta += learning_rate * (reward - baseline) * grad
    
    print("Final policy probabilities:", np.round(softmax(theta), 2))
    print("Preferred action:", int(np.argmax(theta)))  # expected to favour action 2

    📋 Quick Reference

    Algorithm       | Approach                       | Best For
    Q-Learning      | Learn value of (state, action) | Small, discrete environments
    SARSA           | On-policy Q-learning           | Safer exploration
    DQN             | Q-learning + neural network    | Atari games, large states
    Policy Gradient | Learn policy directly          | Continuous actions
    PPO             | Stable policy optimisation     | Robotics, ChatGPT RLHF

    💡 Pro Tip: ChatGPT uses Reinforcement Learning from Human Feedback (RLHF). Humans rank model responses, and RL fine-tunes the model to produce responses humans prefer. This is why ChatGPT sounds helpful — it was literally trained to maximise human approval!

    🎉 Lesson Complete!

    You can now build RL agents that learn from interaction! Next, learn how to deploy ML models to production.
