Lesson 30 • Advanced

Policy Gradient Methods

By the end of this lesson you'll understand how an agent can learn what to do directly — turning action preferences into probabilities, sampling actions, and nudging the policy toward higher reward.

What You'll Learn in This Lesson

✓ How policy-based RL differs from value-based RL like Q-learning
✓ How softmax turns action preferences into a probability distribution
✓ How to sample an action from a policy with Python's random module
✓ The intuition behind the REINFORCE algorithm and its update rule
✓ Why advantages and baselines slash the variance of the gradient
✓ How actor-critic methods (and PPO) build on these ideas

Before you start: Make sure you've worked through Q-Learning & DQN so you know what a state, an action, and a reward are. This lesson contrasts directly with the value-based approach you learned there.

🎯 Real-World Analogy: Learning Habits vs Rating Every Option

Imagine two tennis players. The first, a value-based player, mentally scores every possible shot before each swing — "a cross-court forehand is worth 7, a drop shot is worth 4" — then picks the highest score. It's accurate but slow, and it falls apart when there are infinitely many shot angles to rate.

The second, a policy-based player, just builds instincts: "when the ball comes here, I usually go down the line." They don't rate options — they directly adjust how often they play each shot based on whether it tends to win the point. That direct adjustment of action probabilities is exactly what a policy gradient does.

1. Policy-Based vs Value-Based RL

A policy (written π, "pi") is the agent's rule for choosing actions: given a state, it says how likely each action is. Written formally it's π(a|s) — "the probability of action a in state s."

Value-based methods like Q-learning never store a policy directly. They learn a value for each action, then act greedily on those values. Policy-based methods flip that around: they store the policy itself and adjust it to earn more reward, never bothering to rate actions on an absolute scale.

Value-based (Q-learning)

Learns Q(s, a) — how good each action is
Acts greedily on the values
Struggles with continuous actions
Usually a deterministic policy

Policy-based (policy gradient)

Learns π(a|s) — action probabilities directly
Samples actions from the policy
Handles continuous actions naturally
Naturally stochastic — keeps exploring
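
To make the contrast concrete, here is a minimal sketch of the two ways of choosing an action. The Q-values and preferences are made-up numbers for illustration, and `greedy_action` and `sample_from_policy` are hypothetical helper names, not part of any library.

```python
import math
import random

def greedy_action(q_values):
    # Value-based: always pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax(prefs):
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_from_policy(prefs, rng):
    # Policy-based: draw an action in proportion to its probability.
    probs = softmax(prefs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

q_values = [7.0, 4.0, 5.5]      # illustrative numbers only
prefs = [2.0, 1.0, 0.1]         # illustrative numbers only

rng = random.Random(0)          # private generator so the run is repeatable
greedy_picks = {greedy_action(q_values) for _ in range(100)}
sampled_picks = {sample_from_policy(prefs, rng) for _ in range(100)}

print("greedy picks: ", sorted(greedy_picks))    # a single action, every time
print("sampled picks:", sorted(sampled_picks))   # typically several actions
```

The greedy rule returns the same action on every call, while the sampled policy spreads its choices across actions, which is where the built-in exploration comes from.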

2. A Policy Is Just Preferences Turned Into Probabilities

The simplest policy stores one number per action — a preference. To turn raw preferences into a valid probability distribution (all positive, summing to 1), you run them through the softmax function. Bigger preference means bigger probability, but every action keeps a non-zero chance, so the agent still explores.

Run this worked example and watch the preferences become probabilities:

Worked Example: Softmax over action preferences

Turn three action preferences into a probability distribution

Python

import math

# A policy is just a rule for PICKING actions.
# We store one "preference" number per action, then turn those
# preferences into probabilities with the softmax function.

# Higher preference -> higher probability (but never exactly 0 or 1).
preferences = [2.0, 1.0, 0.1]   # 3 actions: Left, Stay, Right

def softmax(prefs):
    # Subtract the max for numerical stability (avoids huge exp values).
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]   # all > 0
  
...
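
The listing above is cut off on this page, so as a stand-in here is a minimal, self-contained softmax sketch. It assumes the missing lines simply normalise the exponentials, which is the standard definition, and it also checks the property the max-subtraction trick relies on: adding the same constant to every preference leaves the probabilities unchanged.

```python
import math

def softmax(prefs):
    # Subtracting the max does not change the result; it only keeps exp() small.
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

preferences = [2.0, 1.0, 0.1]                 # Left, Stay, Right
probs = softmax(preferences)
shifted = softmax([p + 100.0 for p in preferences])

for name, p in zip(["Left", "Stay", "Right"], probs):
    print(f"{name:5s} {p:.3f}")
print("sum of probabilities:", round(sum(probs), 6))
print("same after shifting: ",
      all(abs(a - b) < 1e-9 for a, b in zip(probs, shifted)))

# Preferences this large would overflow a naive math.exp(1000.0),
# but the max-subtracted version handles them.
big = softmax([1000.0, 999.0])
print("large preferences:", [round(p, 3) for p in big])
```

The largest preference gets the largest probability, yet the smallest one still keeps a non-zero share.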

3. Sampling an Action From the Policy

Once you have probabilities, the agent doesn't always pick the top one. Instead it samples. Python's random.choices does weighted sampling: pass the probabilities as weights and it returns higher-probability actions more often, but not every time. That built-in randomness is how policy gradient agents explore.

This example samples 1,000 actions and shows the frequencies matching the target probabilities:

Worked Example: Sample actions with random

Draw 1,000 actions and confirm they match the policy

Python

import math
import random

random.seed(7)   # makes the random draws repeatable for this lesson

def softmax(prefs):
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs):
    # random.choices does weighted sampling: actions with higher
    # probability are picked more often, but not always.
    actions = list(range(len(probs)))
    return random.choices(actions, weights=probs, k=1)[0]

pre
...
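
Because the listing above is truncated, here is a separate minimal sketch of the same idea: draw 1,000 actions and compare the observed frequencies with the policy's probabilities. The preference values, the private `random.Random` generator, and the use of `collections.Counter` are choices made for this illustration rather than a reconstruction of the missing lines.

```python
import math
import random
from collections import Counter

def softmax(prefs):
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    # Weighted draw: higher-probability actions come up more often.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(7)            # private generator so the run is repeatable
probs = softmax([2.0, 1.0, 0.1])
n_draws = 1000
counts = Counter(sample_action(probs, rng) for _ in range(n_draws))

for action, p in enumerate(probs):
    observed = counts[action] / n_draws
    print(f"action {action}: policy {p:.3f}  observed {observed:.3f}")
```

The observed frequencies will not match the policy exactly, but with 1,000 draws they should land within a few percentage points of it.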

4. The REINFORCE Algorithm and the Policy Gradient Objective

REINFORCE is the simplest policy gradient algorithm, and the rule is intuitive: take an action, see the reward, then make that action more likely if the reward was good and less likely if it was bad. Repeat millions of times and the policy drifts toward whatever earns reward.

The objective being maximised is the expected reward, written J(θ) = E[R], where θ ("theta") are the policy's parameters (here, the preferences). The policy gradient theorem says you can estimate the gradient of this objective from the actions you actually sampled, and then step the parameters in that direction:

∇J(θ) ≈ R · ∇log π(a|s)

In plain English: nudge each parameter in the direction that increases the log-probability of the action you took, scaled by the reward R. Positive reward pushes the action up; negative reward pushes it down. You'll do one such update by hand in the second exercise below.
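
For the softmax policy used in this lesson, the log-probability gradient has a simple closed form. As a sketch, write the preferences as θ_k, the learning rate as α (a symbol introduced here), and assume a single state so the state can be dropped from the notation:

```latex
\pi(a) = \frac{e^{\theta_a}}{\sum_j e^{\theta_j}}
\quad\Longrightarrow\quad
\log \pi(a) = \theta_a - \log \sum_j e^{\theta_j}

\frac{\partial \log \pi(a)}{\partial \theta_k}
  = \mathbb{1}[k = a] - \pi(k)

\theta_k \leftarrow \theta_k + \alpha \, R \, \bigl(\mathbb{1}[k = a] - \pi(k)\bigr)
```

So the chosen action's preference moves by α · R · (1 − π(a)), and every other preference moves by −α · R · π(k).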

Key insight: you never need to know the "true value" of an action. You only need to know whether it did better or worse than usual, and shift its probability accordingly.
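
The update rule depends on ∇log π(a|s). For a softmax over preferences, that gradient is 1 − π(a) for the chosen action and −π(k) for every other action k. The sketch below checks this closed form against a finite-difference estimate and then applies one update; `grad_log_prob` and `numeric_grad_log_prob` are hypothetical helper names, and the reward and learning rate are illustrative values.

```python
import math

def softmax(prefs):
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(prefs, action):
    # Closed form for a softmax policy: indicator(k == action) - pi(k).
    probs = softmax(prefs)
    return [(1.0 if k == action else 0.0) - probs[k] for k in range(len(prefs))]

def numeric_grad_log_prob(prefs, action, eps=1e-6):
    # Central finite differences on log pi(action), one preference at a time.
    grads = []
    for k in range(len(prefs)):
        up = list(prefs)
        up[k] += eps
        down = list(prefs)
        down[k] -= eps
        diff = math.log(softmax(up)[action]) - math.log(softmax(down)[action])
        grads.append(diff / (2 * eps))
    return grads

prefs = [2.0, 1.0, 0.1]
action = 1
analytic = grad_log_prob(prefs, action)
numeric = numeric_grad_log_prob(prefs, action)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print("analytic:", [round(g, 4) for g in analytic])
print("numeric: ", [round(g, 4) for g in numeric])
print("largest difference:", max_gap)

# One REINFORCE step with reward R: prefs += learning_rate * R * gradient
reward, learning_rate = 1.0, 0.5
new_prefs = [p + learning_rate * reward * g for p, g in zip(prefs, analytic)]
print("P(action) before:", round(softmax(prefs)[action], 3))
print("P(action) after: ", round(softmax(new_prefs)[action], 3))
```

With a positive reward the chosen action's probability rises and the others fall; flipping the sign of `reward` reverses the direction.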

🎯 Your Turn: Finish the softmax policy

Fill in the blanks to normalise probabilities and read off the best action

Python

import math

# 🎯 YOUR TURN — fill in the blanks marked with ___

def softmax(prefs):
    biggest = max(prefs)
    # 1) Exponentiate each (pref - biggest) so all values are positive
    exps = [math.exp(p - biggest) for p in prefs]
    # 2) Divide each exp by the total so they sum to 1
    total = ___                 # 👉 add up everything in 'exps' with sum(...)
    return [e / total for e in exps]

# An agent in a maze with 3 moves and these learned preferences:
preferences = [0.5, 3.0, 0.5]  
...

🎯 Your Turn: One REINFORCE update by hand

Nudge the chosen action's preference based on its reward

Python

import math
import random

random.seed(1)

# 🎯 YOUR TURN — fill in the blanks marked with ___
# Do ONE policy-gradient update: nudge the chosen action's preference
# UP when the reward is positive, and DOWN when it is negative.

def softmax(prefs):
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

prefs = [0.0, 0.0]              # 2 actions: Left, Right (start equal)
learning_rate = 0.5

probs = softmax(prefs)

...

📉 Advantages, Baselines, and Actor-Critic

Plain REINFORCE works, but it's noisy. Because it uses the raw reward of a whole episode, the learning signal swings wildly — two nearly identical episodes can give very different totals. That high variance makes training slow and unstable.

The fix is the advantage: instead of using the raw reward, subtract a baseline (a reference reward, often the average) and use the difference:

advantage = reward − baseline

Now an action is judged as better or worse than expected, not on its raw, noisy total. As long as the baseline does not depend on the action taken, this cuts variance without biasing the gradient: actions that beat the baseline go up, those below it go down.
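
A small worked example can make the "same gradient, less variance" claim concrete. The sketch below assumes a two-action softmax policy with deterministic, made-up rewards, and computes the exact mean and variance of the one-sample gradient estimate by enumerating both actions instead of sampling. `gradient_stats` is a hypothetical helper written for this illustration.

```python
def gradient_stats(probs, rewards, baseline, k):
    """Exact mean and variance of the one-sample gradient estimate for preference k."""
    # For a softmax policy, d log pi(a) / d pref[k] = indicator(a == k) - probs[k].
    samples = [
        (rewards[a] - baseline) * ((1.0 if a == k else 0.0) - probs[k])
        for a in range(len(probs))
    ]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

probs = [0.5, 0.5]          # current policy: Left, Right
rewards = [10.0, 11.0]      # illustrative: both positive, Right slightly better
average_reward = sum(p * r for p, r in zip(probs, rewards))   # 10.5

mean_raw, var_raw = gradient_stats(probs, rewards, baseline=0.0, k=1)
mean_adv, var_adv = gradient_stats(probs, rewards, baseline=average_reward, k=1)

print(f"no baseline:   mean {mean_raw:.4f}  variance {var_raw:.4f}")
print(f"with baseline: mean {mean_adv:.4f}  variance {var_adv:.4f}")
```

Both versions have the same expected gradient (0.25), but the variance drops from about 27.56 to 0 in this deliberately simple case. With noisy rewards the variance would not reach zero, though the same reasoning applies.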

Actor-critic methods take this one step further. They run two pieces side by side: the actor is the policy that chooses actions, and the critic is a learned value function that estimates how good each state is. The critic's estimate becomes the baseline for the actor's update. This idea underlies A2C and PPO (the algorithm used to fine-tune ChatGPT with RLHF), and the learned baseline is a large part of why those methods train more stably than plain REINFORCE.
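
As a rough sketch of the actor-critic split, the toy below uses a hypothetical two-state task with one-step episodes: each state has one action that pays +1 and one that pays 0. The actor is a table of softmax preferences per state and the critic is a table of per-state value estimates. This is a tabular illustration of the idea only, not an implementation of A2C or PPO, and the task, learning rates, and step count are all assumptions.

```python
import math
import random

def softmax(prefs):
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_actor_critic(steps=2000, actor_lr=0.1, critic_lr=0.1, seed=0):
    rng = random.Random(seed)
    best_action = [1, 0]                 # hypothetical toy task: best move per state
    prefs = [[0.0, 0.0], [0.0, 0.0]]     # actor: one preference list per state
    values = [0.0, 0.0]                  # critic: estimated reward per state

    for _ in range(steps):
        state = rng.randrange(2)
        probs = softmax(prefs[state])
        action = rng.choices([0, 1], weights=probs, k=1)[0]
        reward = 1.0 if action == best_action[state] else 0.0

        # The critic's estimate plays the role of the baseline.
        advantage = reward - values[state]

        # Actor: move preferences along advantage * grad log pi(action | state).
        for k in range(2):
            grad = (1.0 if k == action else 0.0) - probs[k]
            prefs[state][k] += actor_lr * advantage * grad

        # Critic: move the value estimate toward the observed reward.
        values[state] += critic_lr * (reward - values[state])

    return prefs, values

prefs, values = train_actor_critic()
for state in range(2):
    policy = [round(p, 3) for p in softmax(prefs[state])]
    print(f"state {state}: policy {policy}, value {values[state]:.3f}")
```

Each state gets its own baseline, so the actor is always told whether an action did better or worse than what the critic currently expects in that state.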

5. Common Errors (And How to Fix Them)

These four mistakes commonly derail a first policy gradient implementation:

❌ Training is unstable and never converges (high variance)

Plain REINFORCE uses the full episode return, which swings wildly from run to run. The gradient estimate is so noisy the policy can't settle.

✅ Fix:

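# Subtract a baseline to get an advantage (use it instead of the raw reward)
advantage = reward - baseline
# For a softmax policy, (1 - probs[action]) is the log-prob gradient
# with respect to the chosen action's own preference.
prefs[action] += lr * advantage * (1 - probs[action])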

❌ Forgetting the baseline entirely

If every reward in your task is positive (say 0 to 100), every sampled action gets pushed up. The policy then separates good actions from merely-less-good ones only slowly, because the useful signal is buried under that shared upward push.

✅ Fix:

baseline = running_average_of_rewards   # so advantages are both + and -
advantage = reward - baseline           # below-average actions now go DOWN
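
One way to maintain `running_average_of_rewards` without storing every reward is an incremental mean. The sketch below is a minimal version; `RunningBaseline` is a hypothetical helper name and the reward values are illustrative.

```python
class RunningBaseline:
    """Incremental average of every reward seen so far (hypothetical helper)."""

    def __init__(self):
        self.count = 0
        self.value = 0.0

    def update(self, reward):
        # new_mean = old_mean + (reward - old_mean) / n
        self.count += 1
        self.value += (reward - self.value) / self.count
        return self.value

baseline = RunningBaseline()
rewards = [60.0, 80.0, 100.0, 40.0]     # illustrative, all positive
advantages = []
for reward in rewards:
    # Use the baseline from BEFORE this reward, then fold the reward in.
    advantages.append(reward - baseline.value)
    baseline.update(reward)

print("final baseline:", baseline.value)
print("advantages:    ", advantages)
```

Even though every reward here is positive, the last advantage comes out negative because 40 is below the running average, so that action would be pushed down.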

❌ Learning rate too high — the policy collapses

A large learning rate lets one lucky episode shove almost all the probability onto a single action. The policy then nearly stops exploring and can stay stuck for a very long time.

✅ Fix:

learning_rate = 0.01    # start small; PPO commonly uses 3e-4
# If P(one action) rockets to ~1.0 in a few steps, lower it.
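
To see how much a single update can move the policy, the sketch below applies one REINFORCE step to a two-action softmax policy at two different learning rates. The values 5.0 and 0.01 are chosen only to exaggerate the contrast, and `one_update` is a hypothetical helper.

```python
import math

def softmax(prefs):
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def one_update(prefs, action, reward, learning_rate):
    # One REINFORCE step on softmax preferences (no baseline, for simplicity).
    probs = softmax(prefs)
    return [
        p + learning_rate * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, p in enumerate(prefs)
    ]

start = [0.0, 0.0]                       # both actions equally likely
p_high = softmax(one_update(start, action=1, reward=1.0, learning_rate=5.0))[1]
p_low = softmax(one_update(start, action=1, reward=1.0, learning_rate=0.01))[1]

print(f"learning_rate=5.0:  P(action 1) goes 0.500 -> {p_high:.3f}")
print(f"learning_rate=0.01: P(action 1) goes 0.500 -> {p_low:.3f}")
```

With the large learning rate, a single rewarded sample moves the probability from 0.5 to above 0.99, leaving almost no chance of ever trying the other action again. The small rate barely shifts it.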

❌ Reward scaling is off

Rewards of +1000 produce giant gradients; rewards of 0.0001 produce almost no learning. Either way the update size is wrong.

✅ Fix:

# Normalise advantages to roughly mean 0, std 1 before the update
mean = sum(advs) / len(advs)
std = (sum((a - mean) ** 2 for a in advs) / len(advs)) ** 0.5
advs = [(a - mean) / (std + 1e-8) for a in advs]
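
Wrapped in a function, the same normalisation can be tried on batches with very different reward scales. This is a minimal sketch; `normalise_advantages` is a hypothetical name and the input numbers are illustrative.

```python
def normalise_advantages(advs, eps=1e-8):
    # Shift to mean 0 and scale to (population) standard deviation 1.
    mean = sum(advs) / len(advs)
    std = (sum((a - mean) ** 2 for a in advs) / len(advs)) ** 0.5
    return [(a - mean) / (std + eps) for a in advs]

big = normalise_advantages([1000.0, 1010.0, 990.0])
tiny = normalise_advantages([0.0001, 0.0002, 0.0000])

print("large-scale batch:", [round(a, 3) for a in big])
print("tiny-scale batch: ", [round(a, 3) for a in tiny])
```

Both batches end up on roughly the same scale, so one learning rate can serve both. The small `eps` term avoids a division by zero when every advantage in the batch is identical.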

🎯 Mini-Challenge: Train a Policy With a Baseline

Time to put it all together with the support faded. The starter below is just a comment outline — write the training loop yourself. Use softmax for the policy, random.choices to sample, and an advantage (reward minus a running baseline) for the update.

Mini-Challenge: REINFORCE with a baseline

Train a one-state policy to prefer the rewarding action

Python

import math
import random

random.seed(42)

# 🎯 MINI-CHALLENGE: Train a tiny policy with REINFORCE + a baseline
#
# One state, two actions: 0 = "Left", 1 = "Right".
# "Right" pays reward +1, "Left" pays reward 0.
# Over many episodes the policy should learn to almost always pick Right.
#
# Steps:
# 1. Start with prefs = [0.0, 0.0] and learning_rate = 0.1
# 2. Keep a running 'baseline' = average reward seen so far (start at 0.0)
# 3. For 300 episodes:
#      - softmax(prefs) -> probs
#      - sa
...

📋 Quick Reference

| Term | What It Means | In Code / Maths |
| --- | --- | --- |
| Policy π(a\|s) | Probability of each action in a state | `softmax(prefs)` |
| Softmax | Turns preferences into probabilities | `exp(x) / sum(exp(x))` |
| Sampling | Pick an action by its probability | `random.choices(a, weights=p)` |
| REINFORCE | Simplest policy gradient update | `∇J ≈ R · ∇log π(a\|s)` |
| Advantage | Reward relative to a baseline | `reward − baseline` |
| Actor-Critic | Policy (actor) + value baseline (critic) | A2C, PPO |

❓ Frequently Asked Questions

Q: What is the difference between policy-based and value-based reinforcement learning?

A: Value-based methods like Q-learning learn how good each action is (a value), then act greedily on those values. Policy-based methods skip the value step and directly learn the policy — the probability of each action in each state — adjusting those probabilities to earn more reward. Policy-based methods naturally output stochastic policies and handle continuous action spaces, which value-based methods struggle with.

Q: What is the REINFORCE algorithm in simple terms?

A: REINFORCE runs a full episode, then nudges the policy so actions that led to high reward become more likely and actions that led to low reward become less likely. The size of the nudge is the reward multiplied by the gradient of the log-probability of the chosen action. It is the simplest policy gradient method and the foundation everything else builds on.

Q: Why do policy gradient methods have such high variance?

A: REINFORCE uses the raw return of a whole episode as its learning signal, and returns swing wildly from one episode to the next because of randomness in both the policy and the environment. Two near-identical episodes can produce very different totals, so the gradient estimate is noisy and learning is slow and unstable.

Q: What is a baseline and why does subtracting it help?

A: A baseline is a reference reward you subtract from the actual reward to get the advantage (reward minus baseline). A common baseline is the average reward, or a learned value estimate. Subtracting it dramatically lowers variance without biasing the gradient, because actions are now judged as better-or-worse-than-expected rather than on their raw, noisy totals.

Q: What is actor-critic and how does it relate to policy gradients?

A: Actor-critic combines both worlds: the actor is a policy that chooses actions, and the critic is a value function that estimates how good states are. The critic's estimate serves as the baseline for the actor's policy gradient update, cutting variance. PPO and A2C are actor-critic methods, which is why they learn far more stably than plain REINFORCE.

🎉

Lesson complete — you can think in policies now!

You learned how policy-based RL differs from value-based RL, how softmax turns preferences into probabilities, how to sample an action, the REINFORCE update rule and its objective, and why advantages, baselines, and actor-critic methods make training stable. These ideas scale all the way up to PPO and the RLHF that aligns modern language models.

🚀 Up next: Computer Vision Pipelines — build end-to-end vision systems for real-world applications.
