Lesson 13 • Intermediate

Reinforcement Learning

Teach an agent to make good decisions through trial and error. By the end you'll understand the RL loop, write an epsilon-greedy agent in plain Python, and know where RL powers real systems — from game AI to robots.

What You'll Learn in This Lesson

✓ What an agent, environment, state, action, and reward are
✓ How the RL loop turns experience into better decisions
✓ The difference between a policy and a value estimate
✓ Why exploration vs exploitation is the core RL dilemma
✓ How a Markov Decision Process frames any RL problem
✓ Where RL is used: games, robotics, and beyond

Before you start: You should be comfortable with basic Python — functions, loops, and lists. If you've met supervised learning earlier in this course, you'll appreciate how different RL feels. Every runnable example here uses only Python's built-in random module — no frameworks to install.

🐶 Real-World Analogy: Training a Dog with Treats

You can't hand a puppy a textbook. Instead you say "sit", wait, and the moment it sits you give a treat. The treat is a reward. The puppy doesn't know why the treat came at first — but after many tries it links the situation ("I heard 'sit'") and the action ("I sat") to the reward.

That is reinforcement learning exactly. The puppy is the agent. Your living room is the environment. "I just heard the word sit" is the state. Sitting, lying down, or running off are the actions. The treat (or no treat) is the reward. Over time the puppy forms a policy: "when I hear 'sit', sitting pays off."

Notice there were no labelled examples — nobody showed the puppy ten thousand photos of correct sitting. It learned purely from interaction and feedback. That's what separates RL from supervised learning.

1. The Five Pieces: Agent, Environment, State, Action, Reward

Every RL problem is built from the same five pieces. Learn these names once and the rest of the field reads easily.

• Agent — the learner and decision-maker (the puppy, the robot, the game bot).
• Environment — the world the agent acts in. It receives actions and returns the next state and a reward.
• State — a snapshot of the current situation (where you are on the board, the robot's joint angles).
• Action — one of the choices available right now (move left, pull lever 2, accelerate).
• Reward — a single number of feedback. Positive means "good", negative means "bad". The agent's whole goal is to maximise the total reward over time.

Jargon check: a policy is the agent's strategy — a rule mapping each state to an action. The agent's job is to improve its policy until it reliably collects high reward.

2. The RL Loop (Worked Example)

RL runs in a loop: the agent observes a state, picks an action, the environment returns a new state and a reward, and round it goes. One full run from start to finish is called an episode.

Read this fully-commented example. It builds a tiny 5-cell corridor by hand and walks an agent to the goal so you can see every part of the loop with your own eyes. Run it and match the output.

Worked Example: The RL Loop by Hand

A 5-cell corridor showing state, action, reward, and episode end

Python

# The reinforcement learning loop, spelled out step by step.
# No frameworks — just plain Python so you can SEE every part.

# The "environment": a tiny 1D corridor of 5 cells, 0..4.
# The agent starts at cell 0. The treasure is at cell 4.
GOAL = 4

def reward_for(state):
    # +10 when you reach the treasure, -1 for every other step
    # (the -1 nudges the agent toward the SHORTEST path).
    return 10 if state == GOAL else -1

def step(state, action):
    # An action is "right" (+1) or "left"
...

Choosing -1 per step and +10 at the goal is an example of reward design (often loosely called reward shaping): you pick the numbers so that the agent prefers short paths. Get the numbers wrong and the agent learns the wrong thing.
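
Because the listing above is cut short, here is a minimal, self-contained sketch of the same loop. The wall clamping, the `always_right` policy, and the `trajectory` list are assumptions made for illustration; they are not the lesson's original code.

```python
# A minimal sketch of the RL loop on a 5-cell corridor (cells 0..4).
# Assumptions: the walls clamp movement, and the policy below is a
# hypothetical hand-written rule rather than a learned one.
GOAL = 4

def reward_for(state):
    # +10 on reaching the goal, -1 for every other step.
    return 10 if state == GOAL else -1

def step(state, action):
    # action is +1 ("right") or -1 ("left"); stay inside the corridor.
    next_state = max(0, min(GOAL, state + action))
    reward = reward_for(next_state)
    done = next_state == GOAL
    return next_state, reward, done

def always_right(state):
    # Hypothetical fixed policy: ignore the state and move right.
    return +1

state, total, done = 0, 0, False
trajectory = [state]
while not done:
    action = always_right(state)               # agent picks an action
    state, reward, done = step(state, action)  # environment responds
    total += reward
    trajectory.append(state)

print("visited:", trajectory)
print("total reward:", total)
```

Walking straight to the goal takes four moves: three steps at -1 and one at +10, so this episode's total reward is 7.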

3. Policy vs Value: Two Things the Agent Learns

There are two quantities an RL agent can learn, and beginners mix them up constantly:

Policy — "what should I do?"

A rule that maps a state to an action. Ask it "I'm in state X, what now?" and it answers with an action. The puppy's "hear sit → sit" is a policy.

Value — "how good is this?"

A number estimating the total future reward you expect from a state (or a state-action pair). It doesn't tell you what to do directly — it scores options.

Many algorithms learn values first, then act greedily on them: "estimate how good each action is, then pick the highest." That is exactly what the bandit below does — it keeps a value estimate per arm and exploits the best one.
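
To make the distinction concrete, the sketch below stores action values for the first four corridor cells and derives a greedy policy from them. The numbers in `q_values` are made up for illustration; no algorithm learned them.

```python
# Value: a score per (state, action). Policy: a rule that picks an action.
# The numbers are hypothetical, chosen so that "right" looks better.
q_values = {
    0: {"left": 1.0, "right": 4.0},
    1: {"left": 2.0, "right": 6.0},
    2: {"left": 4.0, "right": 8.0},
    3: {"left": 6.0, "right": 10.0},
}

def greedy_policy(state):
    # Act greedily: return the action with the highest estimated value.
    actions = q_values[state]
    return max(actions, key=actions.get)

# Turning the value table into a policy is one lookup per state.
policy = {state: greedy_policy(state) for state in q_values}
print(policy)
```

The value table holds the scores and the policy is what you get by reading off the best-scoring action in each state. Here every state maps to "right".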

4. Exploration vs Exploitation (Worked Example)

Here is the central dilemma of RL. Exploitation means choosing the action you currently think is best. Exploration means trying a different action to find out if something better exists. Lean too far either way and you lose.

The classic toy problem is the multi-armed bandit: several slot machine arms, each paying out with a hidden probability. The epsilon-greedy rule solves it simply — with probability epsilon (say 0.1) explore a random arm, otherwise exploit the arm with the highest current estimate.

Worked Example: Epsilon-Greedy Bandit

Discover the best of 3 slot machines with plain Python

Python

import random

# Multi-armed bandit: 3 slot machines ("arms").
# Each arm pays 1 with a hidden probability. The agent must DISCOVER
# the best arm using only the rewards it sees.
random.seed(7)                       # deterministic so the output is fixed

true_probs = [0.2, 0.5, 0.8]         # hidden! the agent never reads this
n_arms = len(true_probs)

def pull(arm):
    # Returns 1 (win) or 0 (lose) for the chosen arm.
    return 1 if random.random() < true_probs[arm] else 0

# What the agent 
...

Because most pulls exploit the best-known arm, arm 2 (the true best) gets pulled thousands of times — but the occasional 10% exploration is what let the agent find arm 2 in the first place. That balance is the whole game.
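
The listing above is truncated, so here is a hedged, self-contained sketch of an epsilon-greedy bandit in the same spirit. The step budget `n_steps`, the local generator `rng`, and the variable names are assumptions, so its numbers need not match the lesson's original output.

```python
import random

rng = random.Random(7)                # local generator with a fixed seed
true_probs = [0.2, 0.5, 0.8]          # hidden from the agent
n_arms = len(true_probs)

def pull(arm):
    # Pay 1 with the arm's hidden probability, else 0.
    return 1 if rng.random() < true_probs[arm] else 0

estimates = [0.0] * n_arms            # value estimate per arm
counts = [0] * n_arms                 # number of pulls per arm
epsilon = 0.1
n_steps = 5000                        # assumed budget; any large number works

for _ in range(n_steps):
    if rng.random() < epsilon:
        arm = rng.randrange(n_arms)                            # explore
    else:
        arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
    reward = pull(arm)
    counts[arm] += 1
    # Incremental mean: move the estimate toward the new reward.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

best_arm = max(range(n_arms), key=lambda a: estimates[a])
print("pulls per arm:", counts)
print("estimates:", [round(e, 2) for e in estimates])
print("best arm found:", best_arm)
```

The incremental mean avoids storing every past reward: each estimate is nudged by the gap between the new reward and the old estimate, divided by how many times that arm has been pulled.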

5. The Markov Decision Process (MDP)

The Markov Decision Process is the formal frame underneath every RL problem. Don't let the name scare you — it's just the five pieces written precisely:

• States (S) — all the situations the agent can be in.
• Actions (A) — what the agent can do in each state.
• Transitions (P) — the probability that action a in state s leads to state s′.
• Rewards (R) — the number you get for each transition.
• Discount (γ, gamma) — how much you value future rewards versus immediate ones (0 to 1).

The "Markov" property is the key simplifying assumption: the next state depends only on the current state and action — not on the entire history of how you got there. That makes the maths tractable. Your corridor example was a small MDP, and so was the bandit (a one-state MDP).
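
As a rough illustration, an MDP can be written down as plain data, and the discount γ shows up when you add rewards over time. The three-state layout in `P`, the value `GAMMA = 0.9`, and the helper `discounted_return` are assumptions chosen for this sketch.

```python
# A tiny MDP written as plain data (a deterministic 3-state corridor).
# P[(state, action)] is a list of (probability, next_state, reward).
GAMMA = 0.9   # assumed discount factor

P = {
    (0, "right"): [(1.0, 1, -1)],
    (0, "left"):  [(1.0, 0, -1)],
    (1, "right"): [(1.0, 2, 10)],   # state 2 is the goal
    (1, "left"):  [(1.0, 0, -1)],
}

def discounted_return(rewards, gamma=GAMMA):
    # G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Rewards from the 5-cell corridor when walking straight to the goal.
rewards = [-1, -1, -1, 10]
print(round(discounted_return(rewards), 2))    # 4.58
print(discounted_return(rewards, gamma=1.0))   # 7.0
```

With γ = 1 every reward counts fully, giving 7. With γ = 0.9 the +10 arrives three steps late and is worth only 7.29, so the return drops to 4.58.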

6. Where Reinforcement Learning Is Used

🎮 Games

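AlphaGo defeated top professional Lee Sedol in 2016, and RL agents have reached strong or superhuman play in Atari games, chess, and StarCraft II. Some systems, such as AlphaZero, learned from self-play reward signals alone; others started from human game data.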

🤖 Robotics

Robots learn to walk, grasp, and balance by trial and error in simulation, then transfer the policy to hardware.

💬 Language models

RLHF fine-tunes chatbots — humans rank responses and RL nudges the model toward answers people prefer.

📈 Operations

Recommendation, ad bidding, traffic-light timing, and data-centre cooling are tuned with RL to maximise a long-term metric.

🌍 In the Real World: Gymnasium

You built tiny environments by hand to learn the mechanics. In real projects you reach for Gymnasium (the maintained successor to OpenAI Gym), which provides ready-made environments behind a standard reset() / step() interface — the very same loop you wrote.

This block needs pip install gymnasium, so run it on your own machine. Read it to see how the hand-built loop maps onto a real library.

# In a real project you don't hand-build the environment — you use
# Gymnasium (the maintained successor to OpenAI Gym). It gives you
# ready-made worlds with a standard reset()/step() interface.
#
#   pip install gymnasium
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)

state, info = env.reset(seed=0)      # start a new episode
done = False
total = 0

while not done:
    action = env.action_space.sample()        # random policy (placeholder)
    state, reward, terminated, truncated, info = env.step(action)
    total += reward
    done = terminated or truncated

print("episode finished, total reward:", total)

# Typical output (random actions rarely reach the goal; the sampled
# actions are not seeded here, so individual runs can differ):
# episode finished, total reward: 0.0
#
# A trained agent would learn a policy that reaches the goal for reward 1.0.
# Gymnasium gives you the SAME state/action/reward loop you built by hand
# above — it just supplies the environment for you.
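
To see the mapping without installing anything, the sketch below wraps the corridor in a class with the same reset() / step() shape. `CorridorEnv` and its action encoding are hypothetical; it imitates the interface and does not subclass or import Gymnasium.

```python
class CorridorEnv:
    """Hand-built corridor with a Gymnasium-style reset()/step() shape."""

    def __init__(self, length=5):
        self.goal = length - 1
        self.state = 0

    def reset(self, seed=None):
        # seed is accepted for interface parity; this world is deterministic.
        self.state = 0
        return self.state, {}                      # (observation, info)

    def step(self, action):
        # action: 0 = left, 1 = right (an assumed encoding).
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.goal, self.state + move))
        terminated = self.state == self.goal
        reward = 10 if terminated else -1
        truncated = False                          # no time limit here
        return self.state, reward, terminated, truncated, {}

env = CorridorEnv()
state, info = env.reset()
done, total = False, 0
while not done:
    state, reward, terminated, truncated, info = env.step(1)  # always right
    total += reward
    done = terminated or truncated

print("total reward:", total)
```

The loop body has the same shape as the Gymnasium loop: only the environment object changes, which is the point of a shared interface.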

🎯 Your Turn #1: Exploit the Best Arm

The explore branch is written for you. Fill in the exploit line so the agent picks the arm with the highest estimate. Match the expected output in the comment.

Your Turn: Pick the Highest-Estimated Action

Fill in the greedy exploit step

Python

# 🎯 YOUR TURN — fill in the blanks marked with ___
import random
random.seed(1)

# Hidden true probabilities for 3 arms (the agent can't read these).
true_probs = [0.3, 0.9, 0.6]

def pull(arm):
    return 1 if random.random() < true_probs[arm] else 0

estimates = [0.0, 0.0, 0.0]
counts    = [0, 0, 0]
epsilon   = 0.1

for t in range(1500):
    if random.random() < epsilon:
        arm = random.randrange(3)              # explore
    else:
        # 👉 EXPLOIT: choose the arm with the HIGHEST es
...

🎯 Your Turn #2: Make the Epsilon-Greedy Choice

Now write both halves of the decision: the explore return (a random arm) and the exploit return (the best arm). The hints are right there in the comments.

Your Turn: Explore or Exploit

Complete the epsilon-greedy action selection

Python

# 🎯 YOUR TURN — fill in the blanks marked with ___
import random
random.seed(2)

estimates = [0.1, 0.7, 0.4]     # pretend the agent already learned these
epsilon   = 0.2                 # explore 20% of the time

def choose_action():
    if random.random() < epsilon:
        # 👉 EXPLORE: return a RANDOM arm number from 0, 1, 2.
        return ___              # 👉 replace ___  (hint: random.randrange(3))
    else:
        # 👉 EXPLOIT: return the arm with the highest estimate.
        return 
...

🏆 Mini-Challenge: A 4-Arm Bandit from Scratch

The scaffolding comes off here. You get only a comment outline, so write the epsilon-greedy agent yourself. Use the two worked examples above as your reference.

Mini-Challenge: Epsilon-Greedy, 4 Arms

Build the whole agent from a comment outline

Python

import random
random.seed(0)

# 🎯 MINI-CHALLENGE: epsilon-greedy on a 4-arm bandit
# 1. true_probs = [0.1, 0.25, 0.6, 0.45]  (hidden payouts; arm 2 is best)
# 2. Write pull(arm): return 1 if random.random() < true_probs[arm] else 0
# 3. Keep estimates = [0, 0, 0, 0] and counts = [0, 0, 0, 0]
# 4. Loop 3000 times with epsilon = 0.1:
#       - explore (random arm) with probability epsilon, else exploit best
#       - update counts and estimates with the incremental mean
# 5. Print the agent's bes
...

Common Mistakes (And How to Fix Them)

❌ No exploration (epsilon = 0)

A pure-greedy agent locks onto whichever arm happened to win first and never discovers the truly best one.

✅ Fix: keep a small exploration rate (e.g. epsilon = 0.1), or decay it from high to low over time so you explore early and exploit later.
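
One possible decay schedule, sketched under assumptions: the function name `decayed_epsilon` and the values for `start`, `end`, and `decay` are illustrative choices, not recommended settings.

```python
def decayed_epsilon(t, start=1.0, end=0.05, decay=0.99):
    # Exponential decay from `start`, never dropping below `end`.
    return max(end, start * decay ** t)

# Early steps explore almost always; later steps mostly exploit.
for t in (0, 100, 500, 1000):
    print(t, round(decayed_epsilon(t), 3))
```

Inside the agent loop you would call `decayed_epsilon(t)` at step `t` and use the result in place of a constant `epsilon`. The floor `end` keeps a little exploration alive forever.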

❌ Bad reward shaping

If you reward the agent for the wrong thing, it optimises the wrong thing. A vacuum bot rewarded for "dust collected" can learn to dump dust and re-vacuum it.

✅ Fix: reward the outcome you actually want, not a proxy. Add small penalties (like -1 per step) to discourage stalling.

❌ Sparse rewards

If reward only arrives at a distant goal and is zero everywhere else, the agent wanders randomly and almost never stumbles onto it, so it gets almost no signal to learn from.

✅ Fix: add intermediate rewards (shaping), shrink the environment while learning, or use exploration bonuses to encourage reaching new states.
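
A count-based exploration bonus is one simple way to reward novelty. This is only a sketch: the scale `BETA`, the helper `reward_with_bonus`, and the state label `"cell_3"` are hypothetical.

```python
import math
from collections import defaultdict

visit_counts = defaultdict(int)   # how often each state has been seen
BETA = 1.0                        # assumed bonus scale

def reward_with_bonus(state, env_reward):
    # Rarely visited states earn a larger bonus: BETA / sqrt(visits).
    visit_counts[state] += 1
    return env_reward + BETA / math.sqrt(visit_counts[state])

first = reward_with_bonus("cell_3", 0)    # first visit: full bonus
second = reward_with_bonus("cell_3", 0)   # second visit: smaller bonus
print(first, round(second, 3))            # 1.0 0.707
```

The bonus shrinks as a state becomes familiar, so the agent is drawn toward places it has not tried yet even when the environment's own reward is zero.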

❌ Ignoring non-stationarity

If the environment changes over time (payouts drift, an opponent adapts), an estimate built as a plain average of all past rewards reacts too slowly.

✅ Fix: use a fixed step size — estimate += alpha * (reward - estimate) — so recent experience counts more than ancient experience.
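
The difference is easy to see on a made-up reward stream whose payout jumps partway through. The stream itself and `alpha = 0.1` are assumptions picked to keep the arithmetic easy to check.

```python
# Rewards from one arm whose payout jumps from 0 to 1 after step 100.
rewards = [0] * 100 + [1] * 20

avg_estimate, n = 0.0, 0      # plain average of everything seen
step_estimate = 0.0           # fixed step size
alpha = 0.1                   # assumed step size

for reward in rewards:
    n += 1
    avg_estimate += (reward - avg_estimate) / n          # sample average
    step_estimate += alpha * (reward - step_estimate)    # recency-weighted

print("plain average:  ", round(avg_estimate, 3))   # 0.167
print("fixed step size:", round(step_estimate, 3))  # 0.878
```

After twenty rewards of 1, the plain average is still dragged down by the hundred old zeros, while the fixed-step estimate has already moved most of the way to the new payout.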

📋 Quick Reference

| Term | What It Means | In the Examples |
| --- | --- | --- |
| Agent | The learner that chooses actions | The bandit player |
| Environment | Returns next state + reward | The corridor / slot machines |
| State | The current situation | `state` = the cell number |
| Action | A choice available now | Move right / pull an arm |
| Reward | Numeric feedback to maximise | `+10` goal, `-1` step, win/lose |
| Policy | State → action strategy | "pick the best-estimated arm" |
| Value | Expected future reward | `estimates[arm]` |
| Epsilon-greedy | Explore ε of the time, else exploit | `random.random() < epsilon` |
| MDP | States, actions, transitions, rewards, γ | The whole formal frame |

Pro tip: ChatGPT and other chat models are fine-tuned with Reinforcement Learning from Human Feedback (RLHF). Humans rank responses, and RL nudges the model toward the answers people prefer. It is the same reward-maximising idea you learned here, with a language model as the agent.

❓ Frequently Asked Questions

Q: What is reinforcement learning in simple terms?

A: It is learning by trial and error. An agent takes actions in an environment, receives rewards or penalties, and gradually learns a policy — a strategy for choosing actions — that maximises its total reward over time.

Q: How is reinforcement learning different from supervised learning?

A: Supervised learning trains on labelled examples that tell it the correct answer. Reinforcement learning has no labels — the agent only gets a reward signal after acting, and must figure out for itself which actions lead to good outcomes.

Q: What is the exploration vs exploitation trade-off?

A: Exploitation means picking the action you currently believe is best; exploration means trying other actions to discover something better. If you never explore you can get stuck on a mediocre choice; if you always explore you waste reward. Epsilon-greedy balances the two.

Q: What is a Markov Decision Process (MDP)?

A: An MDP is the mathematical frame for RL. It is defined by states, actions, transition probabilities, a reward for each transition, and a discount factor. The 'Markov' part means the next state depends only on the current state and action, not the full history.

Q: Where is reinforcement learning actually used?

A: Game-playing AI (AlphaGo, Atari, chess), robotics and locomotion control, recommendation and ad systems, traffic and energy optimisation, and fine-tuning large language models with human feedback (RLHF).

🎉 Lesson Complete!

You now know the language of reinforcement learning: an agent acts in an environment, observing states, choosing actions, and collecting rewards in a loop. You can explain policy vs value, balance exploration vs exploitation with epsilon-greedy, frame a problem as a Markov Decision Process, and name where RL is used.

🚀 Up next: Model Deployment — take a trained model out of your notebook and put it into production.
