Lesson 28 • Advanced
Reinforcement Learning Basics
Train agents to make decisions through trial and error — Markov Decision Processes, value iteration, and the exploration-exploitation tradeoff.
✅ What You'll Learn
- Markov Decision Processes (MDPs): states, actions, rewards
- Value iteration and the Bellman equation
- Exploration vs exploitation: ε-greedy and UCB
- The multi-armed bandit problem
🎮 Learning by Doing
🎯 Real-World Analogy: Reinforcement learning is like training a puppy. You don't show it examples of "good behaviour" (supervised learning). Instead, you let it try things and give treats (+reward) for sitting and say "no" (-reward) for chewing shoes. Over time, the puppy learns which actions lead to treats. The Bellman equation formalises this: "the value of a state = immediate reward + discounted value of the best next state."
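That one-line Bellman idea can be checked with plain arithmetic. A tiny sketch of a single Bellman backup for the puppy (all the numbers here are invented for illustration):

```python
# One-step Bellman backup: value = immediate reward + discounted future value
gamma = 0.9      # discount factor: how much future treats matter
treat = 1.0      # immediate reward for sitting
best_next = 2.0  # assumed value of the best next state (illustrative)

value_of_sitting = treat + gamma * best_next
print(value_of_sitting)  # 2.8
```

Every value-iteration algorithm in this lesson is just this update, applied to every state until the numbers stop changing.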
RL is behind AlphaGo (beating world champions at Go), ChatGPT (RLHF for alignment), autonomous driving, robotics, and game-playing AI. Unlike supervised learning, the agent must discover the right actions through interaction with an environment.
Try It: Markov Decision Process
Solve a grid world with value iteration — find the optimal policy
import numpy as np
# Markov Decision Process: The Foundation of RL
# An agent takes actions in states and receives rewards
np.random.seed(42)
# Simple Grid World: 4 states, 2 actions
# States: [Start, Path, Trap, Goal]
# Actions: [Left, Right]
states = ["Start", "Path", "Trap", "Goal"]
actions = ["Left", "Right"]
# Transition probabilities: P(next_state | state, action)
# Simplified: deterministic transitions
transitions = {
    ("Start", "Right"): "Path",
    ("Start", "Left"): "Start",
    # Remaining entries completed here so the snippet runs;
    # the lesson's full listing was truncated and may differ in detail.
    ("Path", "Right"): "Trap",
    ("Path", "Left"): "Start",
    ("Trap", "Right"): "Goal",
    ("Trap", "Left"): "Path",
}
rewards = {"Start": 0, "Path": 0, "Trap": -5, "Goal": 10}  # reward for entering a state
gamma = 0.9  # discount factor
# Value iteration: apply the Bellman update V(s) = max_a [R(s') + γ·V(s')]
V = {s: 0.0 for s in states}
for _ in range(50):
    for s in states[:-1]:  # Goal is terminal, so its value stays 0
        V[s] = max(rewards[transitions[s, a]] + gamma * V[transitions[s, a]]
                   for a in actions)
# The optimal policy picks the action with the highest backed-up value
policy = {s: max(actions, key=lambda a: rewards[transitions[s, a]]
                 + gamma * V[transitions[s, a]])
          for s in states[:-1]}
print("Values:", V)
print("Optimal policy:", policy)
Try It: Multi-Armed Bandit
Balance exploration and exploitation across 5 slot machines
import numpy as np
# Multi-Armed Bandit: Exploration vs Exploitation
# The fundamental dilemma of reinforcement learning
np.random.seed(42)
class SlotMachine:
    def __init__(self, true_mean):
        self.true_mean = true_mean
    def pull(self):
        return np.random.randn() + self.true_mean  # noisy payout around the true mean

def epsilon_greedy(machines, n_pulls, epsilon):
    """Epsilon-greedy: explore randomly with probability epsilon"""
    n = len(machines)
    counts = np.zeros(n)  # pulls per machine
    values = np.zeros(n)  # estimated mean payout per machine
    # Loop completed here so the snippet runs; the lesson's full listing was truncated.
    for _ in range(n_pulls):
        if np.random.rand() < epsilon:
            arm = np.random.randint(n)    # explore: try a random machine
        else:
            arm = int(np.argmax(values))  # exploit: best estimate so far
        reward = machines[arm].pull()
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update
    return values

machines = [SlotMachine(m) for m in [0.1, 0.5, 0.9, 0.3, 0.7]]
print("Estimated values:", epsilon_greedy(machines, n_pulls=1000, epsilon=0.1))
⚠️ Common Mistake: Setting the discount factor γ too low. With γ=0.1, the agent cares almost exclusively about immediate rewards and makes short-sighted decisions. With γ=0.99, it plans far ahead but values propagate slowly, so training takes longer. Start with γ=0.9 for most problems.
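You can see the effect of γ directly. A quick sketch: how much is a reward of 10, arriving 5 steps from now, worth today under different discount factors?

```python
# Present value of a delayed reward: reward * gamma**steps
delayed_reward = 10.0
steps = 5
for gamma in (0.1, 0.9, 0.99):
    print(f"gamma={gamma}: {delayed_reward * gamma**steps:.4f}")
# gamma=0.1 shrinks the future reward to 0.0001 (effectively invisible),
# gamma=0.9 keeps about 5.9, and gamma=0.99 preserves ~9.51 of it.
```

This is why a low γ produces short-sighted agents: distant rewards vanish from the value function almost entirely.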
💡 Pro Tip: For learning RL, start with OpenAI Gymnasium (formerly Gym). Try CartPole first (balance a pole), then LunarLander (land a spacecraft). Use Stable Baselines3 for production-quality implementations of PPO, DQN, and A2C.
📋 Quick Reference
| Concept | Formula | Meaning |
|---|---|---|
| Value Function | V(s) | Expected total reward from state s |
| Bellman Eq. | V(s) = R + γ·max V(s') | Recursive value decomposition |
| Discount γ | 0 ≤ γ ≤ 1 | How much future rewards matter |
| Policy π | π(a\|s) | Probability of action a in state s |
| ε-greedy | P(explore) = ε | Random exploration rate |
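ε-greedy explores blindly, picking any arm at random. UCB (listed in the learning goals but not shown in the snippets above) explores optimistically instead, preferring arms it is still uncertain about. A minimal UCB1 sketch, reusing the same Gaussian slot-machine setup; the arm means here are chosen purely for illustration:

```python
import numpy as np
np.random.seed(42)

def ucb1(true_means, n_pulls):
    """UCB1: pull the arm maximising estimated value + exploration bonus."""
    n = len(true_means)
    counts = np.zeros(n)  # pulls per arm
    values = np.zeros(n)  # estimated mean payout per arm
    for t in range(1, n_pulls + 1):
        if t <= n:
            arm = t - 1  # play every arm once to initialise the estimates
        else:
            bonus = np.sqrt(2 * np.log(t) / counts)  # shrinks as an arm is pulled more
            arm = int(np.argmax(values + bonus))     # optimism in the face of uncertainty
        reward = np.random.randn() + true_means[arm]
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # running mean
    return counts

counts = ucb1([0.0, 1.0, 2.0], n_pulls=2000)
print("Pulls per arm:", counts)  # the mean-2.0 arm should get the vast majority
```

Unlike ε-greedy, UCB never stops exploring entirely; it just explores less and less as its uncertainty shrinks.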
🎉 Lesson Complete!
You now understand the fundamentals of RL. Next, learn Q-Learning and Deep Q-Networks — the algorithms that taught AI to play Atari!