Lesson 30 • Advanced
Policy Gradient Methods
By the end of this lesson you'll understand how an agent can learn what to do directly — turning action preferences into probabilities, sampling actions, and nudging the policy toward higher reward.
What You'll Learn in This Lesson
- ✓ How policy-based RL differs from value-based RL like Q-learning
- ✓ How softmax turns action preferences into a probability distribution
- ✓ How to sample an action from a policy with Python's random module
- ✓ The intuition behind the REINFORCE algorithm and its update rule
- ✓ Why advantages and baselines slash the variance of the gradient
- ✓ How actor-critic methods (and PPO) build on these ideas
🎯 Real-World Analogy: Learning Habits vs Rating Every Option
Imagine two tennis players. The first, a value-based player, mentally scores every possible shot before each swing — "a cross-court forehand is worth 7, a drop shot is worth 4" — then picks the highest score. It's accurate but slow, and it falls apart when there are infinitely many shot angles to rate.
The second, a policy-based player, just builds instincts: "when the ball comes here, I usually go down the line." They don't rate options — they directly adjust how often they play each shot based on whether it tends to win the point. That direct adjustment of action probabilities is exactly what a policy gradient does.
1. Policy-Based vs Value-Based RL
A policy (written π, "pi") is the agent's rule for choosing actions: given a state, it says how likely each action is. Written formally it's π(a|s) — "the probability of action a in state s."
Value-based methods like Q-learning never store a policy directly. They learn a value for each action, then act greedily on those values. Policy-based methods flip that around: they store the policy itself and adjust it to earn more reward, never bothering to rate actions on an absolute scale.
Value-based (Q-learning)
- Learns Q(s, a) — how good each action is
- Acts greedily on the values
- Struggles with continuous actions
- Usually a deterministic policy
Policy-based (policy gradient)
- Learns π(a|s) — action probabilities directly
- Samples actions from the policy
- Handles continuous actions naturally
- Naturally stochastic — keeps exploring
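The contrast between the two lists can be sketched in a few lines. This is only an illustration: the Q-values and probabilities are made-up numbers for a single state, and `greedy_action` and `sampled_action` are hypothetical helper names, not part of any library.

```python
import random

# Hypothetical numbers for one state with three actions.
q_values = [1.0, 2.5, 2.4]       # value-based: one score per action
policy_probs = [0.1, 0.5, 0.4]   # policy-based: one probability per action

def greedy_action(values):
    # Value-based control: always take the highest-valued action.
    return max(range(len(values)), key=lambda a: values[a])

def sampled_action(probs, rng):
    # Policy-based control: draw an action according to its probability.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)  # local generator so the draws are repeatable
greedy_picks = {greedy_action(q_values) for _ in range(1000)}
sampled_picks = {sampled_action(policy_probs, rng) for _ in range(1000)}

print("greedy picks: ", sorted(greedy_picks))   # the same action every time
print("sampled picks:", sorted(sampled_picks))  # every action shows up
```

The greedy rule never tries the nearly-as-good third action, while the sampled policy keeps visiting all three.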
2. A Policy Is Just Preferences Turned Into Probabilities
The simplest policy stores one number per action — a preference. To turn raw preferences into a valid probability distribution (all positive, summing to 1), you run them through the softmax function. Bigger preference means bigger probability, but every action keeps a non-zero chance, so the agent still explores.
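As a quick check of those claims, here is a minimal sketch of softmax and its properties. It assumes the usual "subtract the max" trick for numerical stability; the large preference values are hypothetical and chosen only to trigger an overflow.

```python
import math

def softmax(prefs):
    # Shift by the max so the largest exponent is exp(0) = 1.
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # roughly [0.659, 0.242, 0.099]

# Adding the same constant to every preference leaves the result unchanged,
# which is why subtracting the max is safe.
shifted = softmax([1002.0, 1001.0, 1000.1])
print([round(p, 3) for p in shifted])

# Without the shift, large preferences overflow a Python float.
try:
    math.exp(1002.0)
    overflowed = False
except OverflowError:
    overflowed = True
print("naive exp overflowed:", overflowed)
```

The probabilities are all positive, sum to 1, and keep the same ordering as the preferences.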
Run this worked example and watch the preferences become probabilities:
Worked Example: Softmax over action preferences
Turn three action preferences into a probability distribution
import math
# A policy is just a rule for PICKING actions.
# We store one "preference" number per action, then turn those
# preferences into probabilities with the softmax function.
# Higher preference -> higher probability (but never exactly 0 or 1).
preferences = [2.0, 1.0, 0.1] # 3 actions: Left, Stay, Right
def softmax(prefs):
    # Subtract the max for numerical stability (avoids huge exp values).
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]  # all > 0
...
3. Sampling an Action From the Policy
Once you have probabilities, the agent doesn't always pick the top one; it samples. Python's random.choices does weighted sampling: pass the probabilities as weights and it returns higher-probability actions more often, but not every time. That built-in randomness is how policy gradient agents explore.
This example samples 1,000 actions and shows the frequencies matching the target probabilities:
Worked Example: Sample actions with random
Draw 1,000 actions and confirm they match the policy
import math
import random
random.seed(7) # makes the random draws repeatable for this lesson
def softmax(prefs):
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs):
    # random.choices does weighted sampling: actions with higher
    # probability are picked more often, but not always.
    actions = list(range(len(probs)))
    return random.choices(actions, weights=probs, k=1)[0]
pre
...
4. The REINFORCE Algorithm and the Policy Gradient Objective
REINFORCE is the simplest policy gradient algorithm, and the rule is intuitive: take an action, see the reward, then make that action more likely if the reward was good and less likely if it was bad. Repeat millions of times and the policy drifts toward whatever earns reward.
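The objective being maximised is the expected reward, written J(θ) = E[ R ], where θ ("theta") are the policy's parameters (your preferences). The famous policy gradient theorem says you can climb this objective with the update:
θ ← θ + α · R · ∇log π(a|s)
Here α is the learning rate, and R · ∇log π(a|s) is a sampled estimate of the gradient ∇J(θ).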
In plain English: nudge each parameter in the direction that increases the log-probability of the action you took, scaled by the reward R. Positive reward pushes the action up; negative reward pushes it down. The worked code below does exactly this for one action, and you'll do an update by hand in the next exercise.
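For a softmax policy over preferences, the update can be sketched as below. This is a minimal illustration, not a full training loop: `reinforce_update` is a hypothetical helper name, and it relies on the standard result that the gradient of log π(a) with respect to preference i is (1 if i is the chosen action, else 0) minus probs[i].

```python
import math

def softmax(prefs):
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(prefs, action, reward, learning_rate):
    # One REINFORCE step: prefs <- prefs + lr * reward * grad log pi(action).
    probs = softmax(prefs)
    new_prefs = []
    for i, pref in enumerate(prefs):
        grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
        new_prefs.append(pref + learning_rate * reward * grad_log_prob)
    return new_prefs

start = [0.0, 0.0, 0.0]
before = softmax(start)[1]

# Positive reward for action 1 makes it more likely...
after_good = softmax(reinforce_update(start, 1, reward=1.0, learning_rate=0.5))[1]
# ...and negative reward makes it less likely.
after_bad = softmax(reinforce_update(start, 1, reward=-1.0, learning_rate=0.5))[1]

print(round(before, 3), round(after_good, 3), round(after_bad, 3))
```

Note that the unchosen actions move too: their preferences shift in the opposite direction, so the probabilities still sum to 1.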
🎯 Your Turn: Finish the softmax policy
Fill in the blanks to normalise probabilities and read off the best action
import math
# 🎯 YOUR TURN — fill in the blanks marked with ___
def softmax(prefs):
    biggest = max(prefs)
    # 1) Exponentiate each (pref - biggest) so all values are positive
    exps = [math.exp(p - biggest) for p in prefs]
    # 2) Divide each exp by the total so they sum to 1
    total = ___  # 👉 add up everything in 'exps' with sum(...)
    return [e / total for e in exps]
# An agent in a maze with 3 moves and these learned preferences:
preferences = [0.5, 3.0, 0.5]
...
🎯 Your Turn: One REINFORCE update by hand
Nudge the chosen action's preference based on its reward
import math
import random
random.seed(1)
# 🎯 YOUR TURN — fill in the blanks marked with ___
# Do ONE policy-gradient update: nudge the chosen action's preference
# UP when the reward is positive, and DOWN when it is negative.
def softmax(prefs):
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]
prefs = [0.0, 0.0] # 2 actions: Left, Right (start equal)
learning_rate = 0.5
probs = softmax(prefs)
...
📉 Advantages, Baselines, and Actor-Critic
Plain REINFORCE works, but it's noisy. Because it uses the raw reward of a whole episode, the learning signal swings wildly — two nearly identical episodes can give very different totals. That high variance makes training slow and unstable.
The fix is the advantage: instead of using the raw reward, subtract a baseline (a reference reward, often the average) and use the difference:
advantage = reward − baseline
Now an action is judged as better or worse than expected, not on its raw, noisy total. As long as the baseline does not depend on the action taken, this reduces variance without biasing the gradient: actions that beat the baseline go up, and those below it go down.
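The "same gradient, less noise" claim can be checked exactly on a tiny case. The sketch below assumes a hypothetical one-state problem with two equally likely actions and made-up rewards of 10 and 11; `gradient_stats` is an illustrative helper that enumerates both possible samples instead of drawing them at random.

```python
# Hypothetical one-state problem: two actions, both currently equally likely.
probs = [0.5, 0.5]
rewards = [10.0, 11.0]  # action 1 is only slightly better

def gradient_stats(baseline):
    # Each sampled action a gives the estimate
    #   (rewards[a] - baseline) * d log pi(a) / d prefs[0]
    # for the gradient with respect to the first preference.
    samples = []
    for a, reward in enumerate(rewards):
        grad_log_prob = (1.0 if a == 0 else 0.0) - probs[0]
        samples.append((reward - baseline) * grad_log_prob)
    mean = sum(p * x for p, x in zip(probs, samples))
    variance = sum(p * (x - mean) ** 2 for p, x in zip(probs, samples))
    return mean, variance

mean_raw, var_raw = gradient_stats(baseline=0.0)
average_reward = sum(p * r for p, r in zip(probs, rewards))
mean_adv, var_adv = gradient_stats(baseline=average_reward)

print("raw reward:", mean_raw, var_raw)   # mean -0.25, variance 27.5625
print("advantage: ", mean_adv, var_adv)   # mean -0.25, variance 0.0
```

Both versions agree on the expected gradient, but the raw-reward estimate swings between +5 and -5.5 while the advantage-based estimate is the same for either action.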
Actor-critic methods take this one step further. They run two pieces side by side: the actor is the policy that chooses actions, and the critic is a learned value function that estimates how good each state is. The critic's estimate becomes the baseline for the actor's update. This design underlies A2C and PPO, the algorithm used to fine-tune ChatGPT with RLHF, and the lower-variance updates are a large part of why these methods train more stably than plain REINFORCE.
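A stripped-down version of the actor-critic idea might look like the sketch below. It makes several simplifying assumptions that are not from the lesson: a hypothetical task with two states and one-step episodes (so the critic only has to learn the expected immediate reward), a lookup table instead of a neural network, and the illustrative function name `train_actor_critic`. Real A2C and PPO implementations add much more on top.

```python
import math
import random

def softmax(prefs):
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_actor_critic(steps=3000, actor_lr=0.1, critic_lr=0.1, seed=0):
    rng = random.Random(seed)
    # Hypothetical task: two states, two actions. The rewarding action
    # has the same index as the state (reward 1, otherwise 0).
    prefs = [[0.0, 0.0], [0.0, 0.0]]  # actor: action preferences per state
    values = [0.0, 0.0]               # critic: value estimate per state
    for _ in range(steps):
        state = rng.randrange(2)
        probs = softmax(prefs[state])
        action = rng.choices([0, 1], weights=probs, k=1)[0]
        reward = 1.0 if action == state else 0.0

        # The critic's estimate is the baseline for the actor's update.
        advantage = reward - values[state]

        # Actor step: policy gradient scaled by the advantage.
        for a in range(2):
            grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
            prefs[state][a] += actor_lr * advantage * grad_log_prob

        # Critic step: move V(state) toward the reward just observed.
        values[state] += critic_lr * advantage
    return prefs, values

prefs, values = train_actor_critic()
for state in range(2):
    probs = softmax(prefs[state])
    print("state", state, "probs", [round(p, 3) for p in probs])
```

After training, each state's policy should put most of its probability on the action that pays off in that state.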
5. Common Errors (And How to Fix Them)
These are the four mistakes that derail almost every first policy gradient implementation:
❌ Training is unstable and never converges (high variance)
Plain REINFORCE uses the full episode return, which swings wildly from run to run. The gradient estimate is so noisy the policy can't settle.
✅ Fix:
# Subtract a baseline to get an advantage, and average over a batch
advantage = reward - baseline  # not the raw reward
prefs[action] += lr * advantage * (1 - probs[action])
❌ Forgetting the baseline entirely
If every reward in your task is positive (say 0 to 100), every sampled action gets pushed up, and good actions pull ahead of merely-less-good ones only because they are pushed slightly harder. That makes learning slow and noisy.
✅ Fix:
baseline = running_average_of_rewards  # so advantages are both + and -
advantage = reward - baseline  # below-average actions now go DOWN
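One way to maintain that running average without storing every reward is an incremental mean. The class below is a hypothetical helper written for this illustration (`RunningBaseline` is not a library class), and the rewards are made-up numbers.

```python
class RunningBaseline:
    """Hypothetical helper: tracks the mean of all rewards seen so far."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def advantage(self, reward):
        # Compare against the mean of earlier rewards, then fold this one in.
        adv = reward - self.mean
        self.count += 1
        self.mean += (reward - self.mean) / self.count
        return adv

baseline = RunningBaseline()
rewards = [60.0, 80.0, 40.0, 100.0]  # all positive, as in the scenario above
advantages = [baseline.advantage(r) for r in rewards]

print(advantages)     # [60.0, 20.0, -30.0, 40.0]
print(baseline.mean)  # 70.0
```

Even though every reward is positive, the third episode now gets a negative advantage because it fell below the average so far. The very first advantage equals the raw reward, since the baseline starts at 0.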
❌ Learning rate too high — the policy collapses
A large learning rate lets one lucky episode shove all the probability onto a single action. The policy stops exploring and gets stuck forever.
✅ Fix:
learning_rate = 0.01  # start small; PPO commonly uses 3e-4
# If P(one action) rockets to ~1.0 in a few steps, lower it.
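To see how quickly a large step can collapse the policy, the sketch below applies a single update after one lucky episode and compares two learning rates. The setup (two actions, reward +1, a deliberately extreme rate of 10.0) and the helper name `prob_after_one_update` are assumptions made for illustration.

```python
import math

def softmax(prefs):
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def prob_after_one_update(learning_rate):
    prefs = [0.0, 0.0]
    probs = softmax(prefs)
    action, reward = 0, 1.0  # one lucky episode: action 0 earned +1
    for i in range(len(prefs)):
        grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
        prefs[i] += learning_rate * reward * grad_log_prob
    return softmax(prefs)[action]

small_step = prob_after_one_update(0.01)
large_step = prob_after_one_update(10.0)

print("lr=0.01 ->", round(small_step, 4))  # barely above 0.5
print("lr=10.0 ->", round(large_step, 4))  # almost all the probability
```

With the small rate the policy stays close to 50/50 and keeps exploring; with the large one a single episode is enough to make the other action almost never get sampled again.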
❌ Reward scaling is off
Rewards of +1000 produce giant gradients; rewards of 0.0001 produce no learning. Either way the update size is wrong.
✅ Fix:
# Normalise advantages to roughly mean 0, std 1 before the update
mean = sum(advs) / len(advs)
std = (sum((a - mean) ** 2 for a in advs) / len(advs)) ** 0.5
advs = [(a - mean) / (std + 1e-8) for a in advs]
🎯 Mini-Challenge: Train a Policy With a Baseline
Time to put it all together with the support faded. The starter below is just a comment outline — write the training loop yourself. Use softmax for the policy, random.choices to sample, and an advantage (reward minus a running baseline) for the update.
Mini-Challenge: REINFORCE with a baseline
Train a one-state policy to prefer the rewarding action
import math
import random
random.seed(42)
# 🎯 MINI-CHALLENGE: Train a tiny policy with REINFORCE + a baseline
#
# One state, two actions: 0 = "Left", 1 = "Right".
# "Right" pays reward +1, "Left" pays reward 0.
# Over many episodes the policy should learn to almost always pick Right.
#
# Steps:
# 1. Start with prefs = [0.0, 0.0] and learning_rate = 0.1
# 2. Keep a running 'baseline' = average reward seen so far (start at 0.0)
# 3. For 300 episodes:
# - softmax(prefs) -> probs
# - sa
...
📋 Quick Reference
| Term | What It Means | In Code / Maths |
|---|---|---|
| Policy π(a\|s) | Probability of each action in a state | softmax(prefs) |
| Softmax | Turns preferences into probabilities | exp(x) / sum(exp(x)) |
| Sampling | Pick an action by its probability | random.choices(a, weights=p) |
| REINFORCE | Simplest policy gradient update | ∇J ≈ R · ∇log π(a\|s) |
| Advantage | Reward relative to a baseline | reward − baseline |
| Actor-Critic | Policy (actor) + value baseline (critic) | A2C, PPO |
❓ Frequently Asked Questions
Q: What is the difference between policy-based and value-based reinforcement learning?
A: Value-based methods like Q-learning learn how good each action is (a value), then act greedily on those values. Policy-based methods skip the value step and directly learn the policy — the probability of each action in each state — adjusting those probabilities to earn more reward. Policy-based methods naturally output stochastic policies and handle continuous action spaces, which value-based methods struggle with.
Q: What is the REINFORCE algorithm in simple terms?
A: REINFORCE runs a full episode, then nudges the policy so actions that led to high reward become more likely and actions that led to low reward become less likely. The size of the nudge is the reward multiplied by the gradient of the log-probability of the chosen action. It is the simplest policy gradient method and the foundation everything else builds on.
Q: Why do policy gradient methods have such high variance?
A: REINFORCE uses the raw return of a whole episode as its learning signal, and returns swing wildly from one episode to the next because of randomness in both the policy and the environment. Two near-identical episodes can produce very different totals, so the gradient estimate is noisy and learning is slow and unstable.
Q: What is a baseline and why does subtracting it help?
A: A baseline is a reference reward you subtract from the actual reward to get the advantage (reward minus baseline). A common baseline is the average reward, or a learned value estimate. Subtracting it dramatically lowers variance without biasing the gradient, because actions are now judged as better-or-worse-than-expected rather than on their raw, noisy totals.
Q: What is actor-critic and how does it relate to policy gradients?
A: Actor-critic combines both worlds: the actor is a policy that chooses actions, and the critic is a value function that estimates how good states are. The critic's estimate serves as the baseline for the actor's policy gradient update, cutting variance. PPO and A2C are actor-critic methods, which is why they learn far more stably than plain REINFORCE.
Lesson complete — you can think in policies now!
You learned how policy-based RL differs from value-based RL, how softmax turns preferences into probabilities, how to sample an action, the REINFORCE update rule and its objective, and why advantages, baselines, and actor-critic methods make training stable. These ideas scale all the way up to PPO and the RLHF that aligns modern language models.
🚀 Up next: Computer Vision Pipelines — build end-to-end vision systems for real-world applications.