Lesson 13 • Intermediate
Reinforcement Learning
Teach AI agents to make decisions through trial and error — the technology behind game-playing AI, robotics, and autonomous systems.
✅ What You'll Learn
- The exploration vs exploitation dilemma
- Q-Learning with an epsilon-greedy strategy
- The multi-armed bandit problem
- Policy gradient methods (REINFORCE)
🎮 What Is Reinforcement Learning?
🎯 Real-World Analogy: Think of training a puppy. You don't show it 10,000 labelled examples of "sit" (supervised learning). Instead, the puppy tries random actions, and you give a treat (reward) when it sits. Over time, the puppy learns: "sitting when I hear 'sit' = treat!" That's reinforcement learning.
Unlike supervised learning (learn from labels) or unsupervised learning (find patterns), RL learns from interaction. An agent takes actions in an environment, receives rewards, and learns a policy — a strategy for choosing actions that maximise long-term reward.
🧩 Core Components
- Agent — the learner and decision-maker
- Environment — the world the agent acts in
- State — the agent's current situation
- Action — what the agent can do
- Reward — the feedback signal (positive or negative)
- Policy — the strategy mapping states → actions
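The components above can be sketched as a tiny agent–environment loop. This is a minimal, hypothetical example — the `CorridorEnv` environment, its reward scheme, and the random policy are invented for illustration, not part of the lesson's exercises:

```python
import random

# Hypothetical toy environment: the agent walks a 1-D corridor of
# 5 cells and earns a reward for reaching the right end (state 4).
class CorridorEnv:
    def reset(self):
        self.state = 0                          # State: current cell
        return self.state

    def step(self, action):                     # Action: 0 = left, 1 = right
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        reward = 1 if self.state == 4 else 0    # Reward: feedback signal
        done = self.state == 4
        return self.state, reward, done

def random_policy(state):                       # Policy: state -> action
    return random.choice([0, 1])

random.seed(0)
env = CorridorEnv()                             # Environment
state = env.reset()
for t in range(20):                             # Agent-environment loop
    action = random_policy(state)               # Agent picks an action
    state, reward, done = env.step(action)      # Environment responds
    if done:
        print(f"Reached the goal after {t + 1} steps")
        break
else:
    print("Episode ended without reaching the goal")
```

A random policy only stumbles onto the goal by luck — the algorithms below are about replacing `random_policy` with one that improves from the rewards it receives.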
Try It: Multi-Armed Bandit
Explore the exploration vs exploitation trade-off with slot machines
import numpy as np
# Multi-Armed Bandit: The Exploration vs Exploitation Dilemma
# Imagine 4 slot machines — which one pays the most?
np.random.seed(42)
# True reward probabilities (agent doesn't know these!)
true_probs = [0.2, 0.5, 0.75, 0.3]
n_arms = len(true_probs)
def pull_arm(arm):
    """Simulate pulling a slot machine lever"""
    return 1 if np.random.random() < true_probs[arm] else 0
# Strategy 1: Epsilon-Greedy
def epsilon_greedy(n_rounds, epsilon=0.1):
    counts = np.zeros(n_arms)   # pulls per arm
    values = np.zeros(n_arms)   # running mean reward per arm
    for _ in range(n_rounds):
        if np.random.random() < epsilon:
            arm = np.random.randint(n_arms)   # explore: try a random arm
        else:
            arm = int(np.argmax(values))      # exploit: best arm so far
        reward = pull_arm(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return values

estimates = epsilon_greedy(n_rounds=1000)
print("Estimated probabilities:", np.round(estimates, 2))
print("True probabilities:     ", true_probs)

Try It: Policy Gradient (REINFORCE)
Learn a policy directly — scales to complex environments
import numpy as np
# Policy Gradient: Learning a POLICY directly (not a value table)
# Used in complex environments where Q-tables are too large
def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)
# Simple environment: choose action 0, 1, or 2
# Action 2 gives best reward on average
n_actions = 3
true_rewards = [1.0, 2.0, 5.0] # expected rewards
# Policy parameters (will be learned)
np.random.seed(42)
theta = np.zeros(n_actions)
learning_rate = 0.1
print("=== Training with REINFORCE ===")
for episode in range(500):
    probs = softmax(theta)
    action = np.random.choice(n_actions, p=probs)
    # Noisy reward centred on the true mean for the chosen action
    reward = true_rewards[action] + np.random.randn()
    # Gradient of log pi(action): one-hot(action) minus probs
    grad = -probs
    grad[action] += 1.0
    theta += learning_rate * reward * grad

print("Learned preferences (theta):", np.round(theta, 2))
print("Action probabilities:", np.round(softmax(theta), 2))

📋 Quick Reference
| Algorithm | Approach | Best For |
|---|---|---|
| Q-Learning | Learn value of (state, action) | Small, discrete environments |
| SARSA | On-policy Q-learning | Safer exploration |
| DQN | Q-learning + neural network | Atari games, large states |
| Policy Gradient | Learn policy directly | Continuous actions |
| PPO | Stable policy optimisation | Robotics, ChatGPT RLHF |
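As a concrete instance of the table's first row, here is a minimal tabular Q-learning sketch. The 5-state corridor environment, the constants, and the reward scheme are illustrative assumptions, not a standard benchmark:

```python
import numpy as np

# Hypothetical 1-D corridor: states 0..4, actions 0 = left / 1 = right,
# reward 1 only for arriving at state 4 (the episode then ends).
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1

rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))   # Q[s, a]: learned value of (state, action)

def step(s, a):
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1

for episode in range(200):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection (random tie-breaking among maxima)
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s2, r, done = step(s, a)
        # Q-learning update: off-policy, bootstraps on the best next action
        target = r + (0.0 if done else gamma * np.max(Q[s2]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2

print(np.round(Q, 2))   # "move right" should dominate in every state
```

The update rule is the defining feature: the target uses `max(Q[s2])` regardless of which action is actually taken next, which is what makes Q-learning off-policy (SARSA, by contrast, bootstraps on the action the current policy actually picks).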
💡 Pro Tip: ChatGPT uses Reinforcement Learning from Human Feedback (RLHF). Humans rank model responses, and RL fine-tunes the model to produce responses humans prefer. This is why ChatGPT sounds helpful — it was literally trained to maximise human approval!
🎉 Lesson Complete!
You can now build RL agents that learn from interaction! Next, learn how to deploy ML models to production.