One of the most popular algorithms for solving Reinforcement Learning (RL) problems is Proximal Policy Optimization (PPO). John Schulman, an OpenAI co-founder, introduced it in 2017.
At OpenAI, PPO has been used extensively, including to fine-tune models from human feedback. Because it is a reliable and effective algorithm, it has gained wide popularity, matching or outperforming earlier techniques such as Trust Region Policy Optimization (TRPO) while being much simpler to implement.
We take a close look at Proximal Policy Optimization (PPO) in this tutorial. We discuss the theory and show how to use PyTorch to implement it.
Understanding Proximal Policy Optimization (PPO)
In traditional supervised learning, parameters are updated in the direction of the steepest gradient. If an update turns out to be excessive, it can be corrected on subsequent training examples, which are independent of one another.
In reinforcement learning, by contrast, the training examples are the agent’s own actions and the returns they produce, so consecutive examples are correlated. The agent must explore its environment to discover the best course of action, and if large changes are made to the policy in a single gradient step, the policy can become stuck in a bad region with suboptimal rewards. Because the agent has to keep exploring, large policy changes make the training process unstable.
Trust-region-based approaches seek to prevent this issue by guaranteeing that policy updates take place within a trusted region: an artificially limited area of the policy space around the current policy. Because the updated policy may only move a limited distance from the previous one, updates remain incremental and instability is avoided.
Trust Region Policy Optimization (TRPO)
John Schulman (who also proposed Proximal Policy Optimization (PPO) in 2017) proposed the Trust Region Policy Optimization (TRPO) algorithm in 2015. TRPO uses the Kullback-Leibler (KL) divergence to quantify the difference between the old and the updated policy.
KL divergence measures the difference between two probability distributions, and TRPO worked well at establishing trust regions.
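For reference, the KL divergence between two discrete probability distributions P and Q is:

D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}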
The issue with TRPO is the computational complexity of enforcing the KL constraint: the KL divergence must be approximated to second order, using a Taylor expansion and other numerical techniques, which is computationally expensive. PPO was proposed as a simpler and more efficient substitute for TRPO.
Instead of intricate calculations involving KL divergence, PPO approximates the trust region by clipping the ratio of the new policy to the old one.
Proximal Policy Optimization (PPO)
PPO is frequently regarded as a subclass of actor-critic techniques, which use the value function to update the policy gradients. Advantage Actor-Critic (A2C) methods rely on the advantage, which measures the difference between the return actually obtained by following the policy and the return predicted by the critic.
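In its simplest estimator, the advantage compares the return actually observed after taking an action with the critic's value estimate for that state:

A(s_t, a_t) = Q(s_t, a_t) - V(s_t) \approx R_t - V(s_t)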
To comprehend PPO, you must be aware of its constituent parts:
- The actor carries out the policy. It is implemented as a neural network that takes a state as input and outputs the action to take.
- Another neural network is the critic. It receives the state as input and outputs the state’s expected value. The state-value function is thus expressed by the critic.
- Different objective functions can be used by policy-gradient-based methods. PPO specifically makes use of the advantage function.
- The primary innovation in PPO is the clipped objective function, which restricts how much the policy can change in a single training iteration and thereby prevents large, destructive updates. Policy-gradient methods quantify an incremental update using the probability ratio of the new policy to the old policy.
- The objective function in PPO is the surrogate loss, which combines the innovations above: compute the probability ratio between the new and old policies, multiply it by the advantage, and clip the ratio so that a single update cannot move the policy too far.
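For reference, this is the clipped surrogate objective from the PPO paper, where r_t(\theta) is the probability ratio, \hat{A}_t the advantage estimate, and \epsilon the clip range:

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}

L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]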
1. Setting Up the Environment
To begin using Proximal Policy Optimization, we must first install the required software packages and choose a suitable environment in which to run our PPO algorithm.
Installation of Required Software Libraries
To use the Proximal Policy Optimization (PPO) algorithm, we need to install the following software packages:
- PyTorch and its dependencies (such as numpy (mathematics/statistics) and matplotlib (graph plotting)).
- We will also install Gymnasium (the maintained fork of OpenAI Gym), an open-source Python library for simulating many different environments and reproducing Reinforcement Learning experiments.
The Gym API will allow us to set up the interactions between our algorithms and a Gym environment.
Install Required Libraries
pip install gymnasium torch numpy matplotlib
Import Packages
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.distributions import Categorical
import matplotlib.pyplot as plt
Choose an Environment
Use Gymnasium to create an instance of the CartPole environment (the evaluation code later creates its own instance for testing):
env = gym.make("CartPole-v1")
State & Action Spaces
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
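For CartPole-v1 this gives a 4-dimensional observation (cart position, cart velocity, pole angle, pole angular velocity) and 2 discrete actions (push left, push right):

print(state_dim, action_dim)  # CartPole-v1 prints: 4 2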
2. Implementing PPO in PyTorch
Defining the Policy Network
Proximal Policy Optimization (PPO) uses both an actor and a critic model. The actor selects the action to take at each time step according to the current policy, while the critic estimates the expected value of the state. Because both networks receive the same input (the state at time t), they can share a common backbone architecture with separate heads for the actor and the critic; for simplicity, the implementation below uses two small independent networks inside a single module.
- The actor-critic network
Next, we define the ActorCritic class. The actor produces the policy and predicts actions, while the critic learns the state-value function and predicts values; both take the state as input.
class ActorCritic(nn.Module):
def __init__(self, state_dim, action_dim):
super(ActorCritic, self).__init__()
self.actor = nn.Sequential(
nn.Linear(state_dim, 128),
nn.ReLU(),
nn.Linear(128, action_dim),
nn.Softmax(dim=-1)
)
self.critic = nn.Sequential(
nn.Linear(state_dim, 128),
nn.ReLU(),
nn.Linear(128, 1)
)
def forward(self, x):
value = self.critic(x)
probs = self.actor(x)
return probs, value
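As a quick sanity check, a minimal sketch of a forward pass through this network with a dummy CartPole state (assuming the ActorCritic class above has been defined):

model = ActorCritic(state_dim=4, action_dim=2)
dummy_state = torch.zeros(4)        # a single CartPole observation
probs, value = model(dummy_state)
print(probs.shape, value.shape)     # torch.Size([2]) torch.Size([1])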
- Memory Buffer for PPO
PPO requires temporary storage of:
states
actions
rewards
log probabilities
values
dones
class PPOMemory:
def __init__(self):
self.states = []
self.actions = []
self.probs = []
self.values = []
self.rewards = []
self.dones = []
def clear(self):
self.__init__()
- PPO Agent Implementation
class PPOAgent:
def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99, clip=0.2):
self.gamma = gamma
self.clip = clip
self.actor_critic = ActorCritic(state_dim, action_dim)
self.optimizer = optim.Adam(self.actor_critic.parameters(), lr=lr)
self.memory = PPOMemory()
- Selecting an Action
def select_action(self, state):
state = torch.tensor(state, dtype=torch.float32)
probs, value = self.actor_critic(state)
dist = Categorical(probs)
action = dist.sample()
self.memory.states.append(state)
self.memory.actions.append(action)
self.memory.probs.append(dist.log_prob(action))
self.memory.values.append(value)
return action.item()
- Computing Advantages
We compute advantages by walking backwards through the stored trajectory and accumulating discounted TD errors (equivalent to GAE with λ = 1), bootstrapping from the value of the last stored state:
def compute_advantages(self, next_value):
    # Convert stored value tensors to plain floats for the backward pass
    values = [v.item() for v in self.memory.values]
    rewards = self.memory.rewards
    dones = self.memory.dones
    advantages = []
    advantage = 0.0
    next_val = float(next_value)
    for i in reversed(range(len(rewards))):
        mask = 0.0 if dones[i] else 1.0          # stop bootstrapping at episode ends
        td_error = rewards[i] + self.gamma * next_val * mask - values[i]
        advantage = td_error + self.gamma * mask * advantage
        next_val = values[i]
        advantages.insert(0, advantage)
    return advantages
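The best-practices section later recommends Generalized Advantage Estimation (GAE). A minimal sketch of a GAE-style variant of the method above, assuming the same memory layout; gae_lambda is the extra bias/variance knob (not part of the original class):

def compute_gae(self, next_value, gae_lambda=0.95):
    values = [v.item() for v in self.memory.values] + [float(next_value)]
    advantages = []
    gae = 0.0
    for i in reversed(range(len(self.memory.rewards))):
        mask = 0.0 if self.memory.dones[i] else 1.0   # stop bootstrapping at episode ends
        delta = self.memory.rewards[i] + self.gamma * values[i + 1] * mask - values[i]
        gae = delta + self.gamma * gae_lambda * mask * gae
        advantages.insert(0, gae)
    return advantages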
- PPO Policy Update
The update method applies the clipped surrogate loss of Proximal Policy Optimization (PPO) over several epochs on the collected batch:
def update(self):
    states = torch.stack(self.memory.states)
    actions = torch.stack(self.memory.actions)
    old_probs = torch.stack(self.memory.probs).detach()        # log-probs under the old policy
    values = torch.stack(self.memory.values).squeeze().detach()
    next_value = values[-1]                                    # bootstrap from the last stored state
    advantages = torch.tensor(self.compute_advantages(next_value), dtype=torch.float32)
    returns = advantages + values                              # targets for the critic
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    for _ in range(5):  # multiple optimization epochs over the same batch
        probs, vals = self.actor_critic(states)
        dist = Categorical(probs)
        new_probs = dist.log_prob(actions)
        ratio = torch.exp(new_probs - old_probs)               # probability ratio new/old policy
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - self.clip, 1 + self.clip) * advantages
        actor_loss = -torch.min(surr1, surr2).mean()           # clipped surrogate objective
        critic_loss = nn.MSELoss()(vals.squeeze(), returns)
        loss = actor_loss + 0.5 * critic_loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
    self.memory.clear()
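The hyperparameter section below mentions an entropy bonus that encourages exploration. As a sketch, the loss line inside the epoch loop of update() could be extended like this (entropy_coef is a hypothetical coefficient, e.g. 0.01, not defined in the class above):

entropy = dist.entropy().mean()   # average policy entropy over the batch
loss = actor_loss + 0.5 * critic_loss - entropy_coef * entropy   # subtracting entropy rewards exploration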
3. Training the Agent
Training is the most important phase in the Proximal Policy Optimization (PPO) algorithm. This is where the agent interacts with the environment, collects experience, calculates advantages, and improves its policy. The goal is to let the agent learn how to behave in different states to maximize rewards over time.
Let’s break down the training process step-by-step in a clean and understandable way.
def train(agent, env, episodes=2000):
rewards_history = []
for episode in range(episodes):
state, _ = env.reset()
total_reward = 0
done = False
while not done:
action = agent.select_action(state)
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
agent.memory.rewards.append(reward)
agent.memory.dones.append(done)
state = next_state
total_reward += reward
rewards_history.append(total_reward)
agent.update()
if episode % 50 == 0:
print(f"Episode {episode}, Reward = {total_reward}")
return rewards_history
4. Running the PPO Agent
- What “Running the Proximal Policy Optimization (PPO) Agent” means
Running the agent covers two related activities:
Training run — interact with the environment, collect trajectories, compute advantages, and update the policy/critic (this is where learning happens).
Evaluation / inference run — run the trained policy without learning (no gradient updates) to measure performance in episodes, optionally render the environment to watch the agent behave.
Both steps are essential: training improves the policy; evaluation shows whether it learned the task.
- Practical checklist before you run
Install dependencies: gym or gymnasium, torch, numpy, matplotlib.
Choose an environment (CartPole / LunarLander / custom). Discrete vs. continuous actions affect the actor output (softmax vs. Gaussian).
Seed RNGs for reproducibility: torch.manual_seed, np.random.seed, and env.reset(seed=...) in Gymnasium (a seeding sketch follows this checklist).
Decide the device: CPU or CUDA. Move networks and tensors to the chosen device.
Logging: store episode rewards, loss values, and optionally TensorBoard logs.
Model saving: checkpoint the actor_critic weights periodically with torch.save.
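A minimal seeding sketch, assuming env was created with gym.make as above (Gymnasium seeds environments through reset rather than a separate env.seed call); the seed value is arbitrary:

SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
state, _ = env.reset(seed=SEED)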
- Full code you can run locally to train and evaluate the PPO agent
This code assumes you have already implemented ActorCritic, PPOMemory, and PPOAgent (as in the earlier sections). It includes an end-to-end main(), plus evaluation, save/load, and render functions.
import gymnasium as gym
import torch
import numpy as np
import matplotlib.pyplot as plt
# ---- Evaluation function (no learning) ----
def evaluate_agent(agent, env_name="CartPole-v1", episodes=10, render=False, device='cpu'):
    env = gym.make(env_name, render_mode="human" if render else None)
total_rewards = []
agent.actor_critic.to(device)
agent.actor_critic.eval()
with torch.no_grad():
for ep in range(episodes):
state = env.reset()[0]
ep_reward = 0
done = False
while not done:
state_t = torch.tensor(state, dtype=torch.float32, device=device)
probs, _ = agent.actor_critic(state_t)
# For deterministic evaluation: choose argmax
action = torch.argmax(probs).item()
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
ep_reward += reward
state = next_state
if render:
env.render()
total_rewards.append(ep_reward)
print(f"Eval Episode {ep+1}: Reward = {ep_reward}")
env.close()
agent.actor_critic.train()
return total_rewards
# ---- Save / Load utilities ----
def save_agent(agent, path="ppo_agent.pth"):
torch.save(agent.actor_critic.state_dict(), path)
def load_agent(agent, path="ppo_agent.pth", device='cpu'):
agent.actor_critic.load_state_dict(torch.load(path, map_location=device))
agent.actor_critic.to(device)
# ---- Example main block (train + save + evaluate) ----
def main():
env_name = "CartPole-v1"
env = gym.make(env_name)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
agent = PPOAgent(state_dim, action_dim, lr=3e-4, gamma=0.99, clip=0.2)
    # Keep the network on CPU during training: select_action builds CPU tensors.
    # The model is moved to `device` below for evaluation.
# Train (this calls agent.update inside)
rewards = train(agent, env, episodes=1000) # train() from earlier code
save_agent(agent, "ppo_cartpole.pth")
# Plot training rewards
plt.plot(rewards, label="Episode Reward")
plt.xlabel("Episode")
plt.ylabel("Reward")
plt.legend()
plt.show()
# Evaluate
load_agent(agent, "ppo_cartpole.pth", device=device)
eval_rewards = evaluate_agent(agent, env_name=env_name, episodes=10, render=False, device=device)
print("Evaluation mean reward:", np.mean(eval_rewards))
if __name__ == "__main__":
main()
Notes
Replace train(agent, env, episodes=1000) with your training function (provided earlier).
Use render=True inside evaluate_agent() to watch the agent play (works for simple desktop environments).
For continuous action spaces (e.g., Pendulum-v1), use a Gaussian actor and sample/clip actions accordingly.
Visualizing Training Results
plt.plot(rewards)
plt.xlabel("Episode")
plt.ylabel("Reward")
plt.title("PPO Training Performance")
plt.show()
5. Hyperparameter Tuning and Optimization
Although Proximal Policy Optimization (PPO) is known for being stable, robust, and relatively easy to tune, its performance still heavily depends on choosing the right hyperparameters. Each parameter controls a specific aspect of learning, stability, and exploration.
1. Learning Rate
Too high → instability
Too low → slow learning
Recommended: 3e-4
2. Clip Range (ε)
Controls how much the policy can change.
Default: 0.2
3. Discount Factor (γ)
Higher value → long-term planning
Recommended: 0.99
4. Update Epochs
More epochs = more stable but slower.
Common values: 4–10
5. Batch Size
Recommended: 2048 or 4096 steps per update
6. Entropy Bonus
Promotes exploration.
Typical: 0.01 – 0.02
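Putting the typical values above in one place, a reference configuration sketch; the minimal PPOAgent in this tutorial only exposes lr, gamma, and clip, so the remaining keys apply once you extend it:

ppo_defaults = {
    "lr": 3e-4,            # learning rate
    "clip": 0.2,           # clip range (epsilon)
    "gamma": 0.99,         # discount factor
    "n_epochs": 5,         # optimization epochs per collected batch
    "batch_size": 2048,    # timesteps collected per update
    "entropy_coef": 0.01,  # entropy bonus coefficient
}
agent = PPOAgent(state_dim, action_dim,
                 lr=ppo_defaults["lr"],
                 gamma=ppo_defaults["gamma"],
                 clip=ppo_defaults["clip"])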
Tuning Strategy
Start with default Proximal Policy Optimization (PPO) values
Optimize learning rate
Increase batch size
Adjust clip range
Tune actor/critic network sizes
Challenges and Best Practices in Proximal Policy Optimization (PPO)
Sensitive to advantage estimation
Why it’s a problem:
PPO’s policy gradients rely on the advantage to tell the actor which actions were better than expected. If the advantage estimates are noisy or biased, the gradient direction becomes unreliable and learning can slow or diverge.
How this manifests:
High variance in updates, unstable training loss, or sudden performance drops.
Practical mitigations:
Use Generalized Advantage Estimation (GAE) (tunable λ) to trade off bias/variance.
Normalize advantages before using them in the loss:
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
Improve value function fitting (a better critic architecture or more critic training) so values are closer to true returns.
Use larger batch sizes to reduce sampling noise.
Clipping can suppress beneficial updates
Why it’s a problem:
PPO’s clipped objective prevents policy updates that move the probability ratio outside the interval [1 - ε, 1 + ε]. While this stabilizes training, it can also prevent legitimate, helpful updates when the advantage is large.
How this manifests:
Slow learning in tasks where occasional large updates are needed to escape a poor local policy; policy improvement plateaus.
Practical mitigations:
Carefully tune the clip range (ε). Try smaller values (0.1) for very sensitive tasks, larger values (0.25–0.3) if learning is too slow, but monitor stability.
Use adaptive clipping (clip that decays or is computed per-batch) if you need more flexibility.
Increase batch size or epochs so the true advantage signal is clearer and not mistakenly clipped.
Combine clipping with a small penalty-based trust-region (hybrid approaches) if needed.
Requires tuning for continuous control
Why it’s a problem:
Continuous action spaces (e.g., robotics) often require Gaussian policies, action scaling, proper exploration-exploitation balance, and more precise critic estimates. Hyperparameters that work on discrete tasks (CartPole) often fail on continuous control.
How this manifests:
Large action variance, unstable actuations, oscillatory behavior, or failure to converge.
Practical mitigations:
Use separate learning rates for actor and critic (critic often needs higher LR or more updates).
Carefully initialize action standard deviation; consider learning a state-dependent std or using an annealed schedule.
Normalize observations and rewards; apply action clipping to keep outputs in feasible bounds.
Use a stronger critic (larger network, more update steps) and larger batch sizes.
Consider off-policy or hybrid algorithms (SAC, TD3) for very hard continuous tasks.
Not as sample efficient as offline RL
Why it’s a problem:
PPO is an on-policy algorithm: it discards collected trajectories after a few epochs of update. Offline RL and off-policy algorithms reuse past experience more extensively, making them more sample efficient.
How this manifests:
Requires many environment interactions (episodes/steps) to reach good performance, which is costly in real-world or slow simulators.
Practical mitigations:
Use parallel/vectorized environments to collect more samples per wall-clock second.
Increase epochs and minibatch reuse (carefully) to squeeze more value from trajectories.
For expensive environments, consider off-policy methods (SAC, DDPG) or hybrid approaches that combine on-policy stability and off-policy efficiency.
Use careful curriculum learning or shaped rewards to reduce sample complexity.
Struggles in sparse reward environments
Why it’s a problem:
PPO optimizes via gradient signals derived from rewards. If rewards are rare, the advantage estimates are mostly zeros and the policy receives little learning signal.
How this manifests:
Very slow or no learning, random exploration without meaningful progress.
Practical mitigations:
Introduce reward shaping or intermediate rewards to provide denser feedback (be careful to avoid overriding the desired objective).
Use intrinsic motivation or exploration bonuses (curiosity, count-based, intrinsic curiosity modules).
Employ demonstration data or imitation learning (pretrain with behavior cloning) to bootstrap learning.
Use hierarchical RL or options to decompose long-horizon tasks into smaller subtasks.
Best Practices — explained in detail (with practical tips)
Normalize advantages
What: Standardize advantages to zero mean and unit variance before use.
Why: Reduces gradient variance and stabilizes learning across minibatches.
How:
adv = (adv - adv.mean()) / (adv.std() + 1e-8)
Tip: Do this each update step (after computing advantages for the batch).
Use GAE instead of simple TD advantage
What: Generalized Advantage Estimation computes advantages with a λ parameter that interpolates between high-bias/low-variance (λ≈0) and low-bias/high-variance (λ≈1).
Why: GAE often gives better bias-variance tradeoff, improving stability and final performance.
How: Typical values: lambda = 0.95 or 0.97.
Tip: Tune λ together with γ: lower λ if advantages are noisy; increase λ for smoother advantage estimates.
Use separate learning rates for actor & critic
What: Give the actor and critic their own optimizers and learning rates.
Why: Critic often needs faster convergence (or vice versa) — separate LRs let you balance their learning speeds.
How:
actor_opt = Adam(actor.parameters(), lr=3e-4)
critic_opt = Adam(critic.parameters(), lr=1e-3)
Tip: Monitor critic loss; if value estimates lag, increase critic updates or LR.
Keep clipping range small (0.1–0.2)
What: Use a conservative clip range to retain trust-region behavior.
Why: Small clipping yields more stable updates and prevents destructive policy jumps.
How: Start with 0.2; for fragile or continuous tasks try 0.1.
Tip: If learning is too slow after other fixes, slowly relax the clip (e.g., 0.25) and monitor stability.
Use reward scaling for stability
What: Scale or normalize rewards so their magnitude is numerically reasonable.
Why: Large reward magnitudes cause large gradients and unstable updates; tiny rewards lead to vanishing signals.
How:
Clip rewards to a range (e.g., [-10, +10]), or
use running mean/std normalization:
r_norm = (r - mean) / (std + eps)
Tip: When using reward scaling, remember to adjust value loss weighting accordingly.
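A minimal sketch of running reward normalization with a small hypothetical helper class (not part of the code above); the commented lines show where it would plug into the training loop:

class RunningRewardNorm:
    """Tracks a running mean/variance of rewards and normalizes incoming rewards."""
    def __init__(self, eps=1e-8):
        self.mean, self.var, self.count, self.eps = 0.0, 1.0, 0, eps

    def normalize(self, r):
        # Incremental (Welford-style) update of the mean and population variance
        self.count += 1
        delta = r - self.mean
        self.mean += delta / self.count
        self.var += (delta * (r - self.mean) - self.var) / self.count
        return (r - self.mean) / (self.var ** 0.5 + self.eps)

reward_norm = RunningRewardNorm()
# inside the environment loop:
#   agent.memory.rewards.append(reward_norm.normalize(reward))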
Train with many epochs for complex environments
What: Increase the number of optimization epochs per collected batch.
Why: For complex tasks, you want to extract more learning signal from each sampled trajectory.
How: Try n_epochs = 8–10, or higher if stable.
Tip: Watch for overfitting to the batch; if performance degrades, reduce epochs or increase batch size.
Use larger batch sizes for continuous action environments
What: Collect more timesteps per update (2048, 4096, or more).
Why: Continuous control benefits from lower variance gradient estimates and better scaling of the clipping mechanism.
How: Increase n_steps per environment or use many parallel environments.
Tip: Use vectorized environments (gym.vector) or Stable-Baselines3’s VecEnv to collect large batches efficiently.
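A minimal sketch of stepping several CartPole instances in parallel with Gymnasium’s SyncVectorEnv; the training loop shown earlier would need small changes to handle batched states:

import gymnasium as gym
import numpy as np

# Eight synchronous copies of CartPole stepped together
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(8)])
states, _ = envs.reset(seed=42)
print(states.shape)  # (8, 4): one observation row per parallel environment

actions = np.array([envs.single_action_space.sample() for _ in range(8)])
next_states, rewards, terminated, truncated, infos = envs.step(actions)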
Conclusion
Proximal Policy Optimization (PPO) remains one of the most powerful and widely adopted reinforcement learning algorithms. It strikes the perfect balance between stability, simplicity, and performance. In this article, we implemented PPO from scratch using PyTorch, trained the agent, evaluated its performance, and explored visualization, challenges, and tuning strategies.
Whether you’re building robotics systems, trading agents, or game AI, PPO is an excellent starting point due to its trust-region inspired stability and clean architecture.
FAQs on Proximal Policy Optimization (PPO)
1. Why is PPO considered more stable than other reinforcement learning algorithms?
PPO uses a clipped objective function that limits how much the policy can change in a single update. This prevents large, destructive gradient steps and makes training more stable compared to algorithms like vanilla policy gradient or REINFORCE.
2. What is the purpose of the clipping parameter (ε) in PPO?
The clipping parameter controls the allowed deviation between the new policy and old policy.
If the update tries to change the policy too much, clipping restrains it—helping maintain a trust region and preventing policy collapse.
3. Why does PPO use Generalized Advantage Estimation (GAE)?
GAE reduces variance in advantage estimates while keeping bias relatively low.
This makes updates smoother, improves sample efficiency, and stabilizes training—especially in long-horizon environments.
4. Is PPO good for continuous action environments?
Yes, PPO is widely used for robotics and physics control tasks.
However, it requires more careful tuning of:
learning rate
entropy bonus
batch size
clipping range
compared to simpler discrete tasks.
5. What are the most important hyperparameters to tune in PPO?
The following hyperparameters have the strongest impact on performance:
Learning rate (3e-4 recommended)
Clip range (0.1–0.2)
Batch size (2048–4096)
GAE λ and γ
Entropy coefficient
Fine-tuning these parameters often leads to significantly better policy performance.