Lesson 30 • Advanced
Policy Gradient Methods
Master REINFORCE, PPO, and actor-critic methods — the algorithms that power ChatGPT's RLHF and modern game AI.
✅ What You'll Learn
- REINFORCE: the simplest policy gradient algorithm
- PPO: clipped objectives for stable policy updates
- Generalized Advantage Estimation (GAE)
- How RLHF uses PPO to align ChatGPT
🎯 Learning the Policy Directly
🎯 Real-World Analogy: Q-Learning is like a chess player who evaluates every possible board position — slow and memory-intensive. Policy gradients are like a player who directly learns habits: "in this type of position, I should usually castle." They learn what to do rather than how valuable each option is. This makes them better for continuous actions (robotics, self-driving) and high-dimensional problems.
Policy gradient methods directly optimise the policy π(a|s) — the probability of taking action a in state s. Unlike Q-learning (which learns values), they work with continuous action spaces and naturally handle stochastic policies.
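The update behind this idea fits in a few lines. For a softmax policy, the gradient of log π(a|s) with respect to the logits is one_hot(a) − π(·|s), and REINFORCE nudges the logits in that direction, scaled by the return. A minimal sketch (the action, return, and step size are made-up numbers):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.zeros(3)               # one state, three actions
probs = softmax(logits)            # uniform to start: [1/3, 1/3, 1/3]

action, G, lr = 1, 5.0, 0.1        # sampled action, its return, step size

# Score function for a softmax policy: grad_logits log pi(a|s) = one_hot(a) - pi
grad_log_pi = -probs.copy()
grad_log_pi[action] += 1.0

logits = logits + lr * G * grad_log_pi   # REINFORCE ascent step
new_probs = softmax(logits)
print(new_probs)                   # action 1 is now more likely than 1/3
```

Because the return G multiplies the whole gradient, actions followed by large returns become more probable, and actions followed by zero return are left untouched.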
Try It: REINFORCE
Watch a policy learn which actions are best in each state
import numpy as np

# REINFORCE: The Simplest Policy Gradient Algorithm
# Directly optimise the policy (action probabilities)
np.random.seed(42)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

class PolicyNetwork:
    """Simple policy: maps state → action probabilities"""
    def __init__(self, n_states, n_actions):
        self.W = np.zeros((n_states, n_actions))

    def get_action_probs(self, state_idx):
        return softmax(self.W[state_idx])

    def select_action(self, state_idx):
        probs = self.get_action_probs(state_idx)
        return np.random.choice(len(probs), p=probs)
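Here is what a full REINFORCE loop looks like end to end, on a deliberately tiny made-up problem: a one-state, two-action bandit where action 1 pays +1 and action 0 pays nothing. The setup and learning rate are illustrative choices, not part of the lesson's code:

```python
import numpy as np

np.random.seed(0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Made-up 1-state bandit: action 1 pays +1, action 0 pays 0
W = np.zeros(2)   # logits for the single state
lr = 0.5

for episode in range(200):
    probs = softmax(W)
    a = np.random.choice(2, p=probs)
    G = 1.0 if a == 1 else 0.0        # return of this one-step episode
    # REINFORCE update: grad log pi(a) = one_hot(a) - pi
    grad = -probs
    grad[a] += 1.0
    W += lr * G * grad

print(softmax(W))   # probability mass concentrates on action 1
```

Notice that episodes ending in zero return contribute nothing to the update — this is exactly the high-variance behaviour the Common Mistake callout below warns about, since rare large returns dominate the gradient.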
Try It: PPO & GAE
See how PPO clips updates and GAE estimates advantages
import numpy as np

# PPO: Proximal Policy Optimization — The Industry Standard
# Used to train ChatGPT (RLHF) and game-playing agents
np.random.seed(42)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def compute_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (GAE)"""
    advantages = []
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_val = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_val - values[t]   # TD error
        gae = delta + gamma * lam * gae                     # discounted sum of TD errors
        advantages.insert(0, gae)
    return advantages
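GAE is one half of PPO; the other half is the clipped surrogate objective the lesson title refers to. A minimal numpy sketch of the clipping rule (the probability ratios and advantages below are made-up numbers):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_range=0.2):
    """PPO's clipped surrogate: take the minimum of the unclipped and
    clipped terms, so moving the ratio past 1 ± clip_range earns no
    extra objective (and hence no extra gradient)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_range, 1 + clip_range) * advantage
    return np.minimum(unclipped, clipped)

# Made-up ratios pi_new(a|s) / pi_old(a|s) and advantages
ratios = np.array([0.8, 1.0, 1.5, 2.0])
advantages = np.array([1.0, 1.0, 1.0, -1.0])

objective = ppo_clip_objective(ratios, advantages)
print(objective)
```

For a positive advantage, a ratio of 1.5 is clipped back to 1.2, capping the incentive to push the policy further; for a negative advantage, the minimum keeps the full −2.0 penalty, so bad moves are never shielded by the clip.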
⚠️ Common Mistake: Using REINFORCE for real problems. It has very high variance and converges slowly. Always use PPO or A2C in practice — they add a value baseline that dramatically reduces variance while keeping the policy gradient unbiased.
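The variance claim is easy to check numerically: subtracting a baseline from the returns leaves the expected gradient unchanged but shrinks its spread. A made-up one-state, two-action example:

```python
import numpy as np

np.random.seed(0)

probs = np.array([0.5, 0.5])          # current softmax policy, 2 actions
returns = np.array([10.0, 12.0])      # made-up return for each action

def grad_samples(baseline, n=10_000):
    """Sample REINFORCE gradients (logit space) with an optional baseline."""
    grads = []
    for _ in range(n):
        a = np.random.choice(2, p=probs)
        score = -probs.copy()
        score[a] += 1.0               # grad log pi(a) for a softmax policy
        grads.append((returns[a] - baseline) * score)
    return np.array(grads)

no_base = grad_samples(0.0)
with_base = grad_samples(returns.mean())   # baseline = average return

print(no_base.mean(axis=0), with_base.mean(axis=0))   # ~same expected gradient
print(no_base.var(axis=0), with_base.var(axis=0))     # variance drops sharply
```

A2C and PPO use a learned value function V(s) as this baseline, which is why the GAE snippet above subtracts values[t] inside the TD error.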
💡 Pro Tip: PPO's hyperparameters are remarkably robust. Start with: clip_range=0.2, learning_rate=3e-4, n_epochs=10, batch_size=64. These defaults from Stable Baselines3 work well across most environments without tuning.
📋 Quick Reference
| Method | Type | Actions | Use Case |
|---|---|---|---|
| REINFORCE | On-policy | Both | Educational, simple tasks |
| A2C/A3C | On-policy | Both | Parallel training, fast |
| PPO | On-policy | Both | Default choice, RLHF |
| TRPO | On-policy | Both | Theory-optimal, complex |
| SAC | Off-policy | Continuous | Robotics, sample efficiency |
🎉 Lesson Complete!
You've mastered policy gradient methods and PPO! Next, learn how to build end-to-end computer vision pipelines for real-world applications.