One of the most popular algorithms for solving Reinforcement Learning (RL) problems is Proximal Policy Optimization (PPO). John Schulman, an OpenAI co-founder, introduced it in 2017.
At OpenAI, PPO has been used extensively, including to fine-tune models from human feedback. Because it is a reliable and effective algorithm, it has gained wide popularity, matching or outperforming earlier techniques such as Trust Region Policy Optimization (TRPO) while being much simpler to implement.
We take a close look at Proximal Policy Optimization (PPO) in this tutorial. We discuss the theory and show how to use PyTorch to implement it.
Understanding Proximal Policy Optimization (PPO)
In traditional supervised learning, parameters are updated in the direction of the steepest gradient. If an update turns out to be excessive, it can be corrected on subsequent training examples, which are independent of one another.
In reinforcement learning, by contrast, the training examples are the agent’s own actions and the returns they produce, so consecutive examples are correlated. The agent must explore its environment to discover the best course of action, and if large changes are made to the policy in a single gradient step, the policy can become stuck in a bad region with suboptimal rewards. Because the agent has to keep exploring, large policy changes make the training process unstable.
Trust-region-based approaches seek to prevent this issue by guaranteeing that policy updates take place within a trusted region: an artificially limited area of the policy space around the current policy. Because the updated policy may only move a limited distance from the previous one, updates remain incremental and instability is avoided.
Trust Region Policy Optimization (TRPO)
John Schulman (who also proposed Proximal Policy Optimization (PPO) in 2017) proposed the Trust Region Policy Optimization (TRPO) algorithm in 2015. TRPO uses the Kullback-Leibler (KL) divergence to quantify the difference between the old and the updated policy.
KL divergence measures the difference between two probability distributions, and TRPO worked well at establishing trust regions.
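For reference, the KL divergence between two discrete probability distributions P and Q is:

D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}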
The issue with TRPO is the computational complexity of enforcing the KL constraint: the KL divergence must be approximated to second order, using a Taylor expansion and other numerical techniques, which is computationally expensive. PPO was proposed as a simpler and more efficient substitute for TRPO.
Instead of intricate calculations involving KL divergence, PPO approximates the trust region by clipping the ratio of the new policy to the old one.
Proximal Policy Optimization (PPO)
PPO is frequently regarded as a subclass of actor-critic techniques, which use the value function to update the policy gradients. Advantage Actor-Critic (A2C) methods rely on the advantage, which measures the difference between the return actually obtained by following the policy and the return predicted by the critic.
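In its simplest estimator, the advantage compares the return actually observed after taking an action with the critic's value estimate for that state:

A(s_t, a_t) = Q(s_t, a_t) - V(s_t) \approx R_t - V(s_t)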
To comprehend PPO, you must be aware of its constituent parts:
- The actor carries out the policy. It is implemented as a neural network that takes a state as input and outputs the action to take.
- Another neural network is the critic. It receives the state as input and outputs the state’s expected value. The state-value function is thus expressed by the critic.
- Different objective functions can be used by policy-gradient-based methods. PPO specifically makes use of the advantage function.
- The primary innovation in PPO is the clipped objective function, which restricts how much the policy can change in a single training iteration and thereby prevents large, destructive updates. Policy-gradient methods quantify an incremental update using the probability ratio of the new policy to the old policy.
- The objective function in PPO is the surrogate loss, which combines the innovations above: compute the probability ratio between the new and old policies, multiply it by the advantage, and clip the ratio so that a single update cannot move the policy too far.
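For reference, this is the clipped surrogate objective from the PPO paper, where r_t(\theta) is the probability ratio, \hat{A}_t the advantage estimate, and \epsilon the clip range:

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}

L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]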
1. Setting Up the Environment
To begin using Proximal Policy Optimization, we must first install the required software packages and choose a suitable environment in which to run our PPO algorithm.
Installation of Required Software Libraries
To use the Proximal Policy Optimization (PPO) algorithm, we need to install the following software packages:
- PyTorch and its dependencies (such as numpy (mathematics/statistics) and matplotlib (graph plotting)).
- We will also install Gymnasium (the maintained fork of OpenAI Gym), an open-source Python library for simulating many different environments and reproducing Reinforcement Learning experiments.
The Gym API will allow us to set up the interactions between our algorithms and a Gym environment.
Install Required Libraries
pip install gymnasium torch numpy matplotlib
Import Packages
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.distributions import Categorical
import matplotlib.pyplot as plt
Choose an Environment
Use Gymnasium to create an instance of the CartPole environment (the evaluation code later creates its own instance for testing):
env = gym.make("CartPole-v1")
State & Action Spaces
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
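For CartPole-v1 this gives a 4-dimensional observation (cart position, cart velocity, pole angle, pole angular velocity) and 2 discrete actions (push left, push right):

print(state_dim, action_dim)  # CartPole-v1 prints: 4 2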
2. Implementing PPO in PyTorch
Defining the Policy Network
Proximal Policy Optimization (PPO) uses both an actor and a critic model. The actor selects the action to take at each time step according to the current policy, while the critic estimates the expected value of the state. Because both networks receive the same input (the state at time t), they can share a common backbone architecture with separate heads for the actor and the critic; for simplicity, the implementation below uses two small independent networks inside a single module.
- The actor-critic network
Next, we define the ActorCritic class. The actor produces the policy and predicts actions, while the critic learns the state-value function and predicts values; both take the state as input.
class ActorCritic(nn.Module):
def __init__(self, state_dim, action_dim):
super(ActorCritic, self).__init__()
self.actor = nn.Sequential(
nn.Linear(state_dim, 128),
nn.ReLU(),
nn.Linear(128, action_dim),
nn.Softmax(dim=-1)
)
self.critic = nn.Sequential(
nn.Linear(state_dim, 128),
nn.ReLU(),
nn.Linear(128, 1)
)
def forward(self, x):
value = self.critic(x)
probs = self.actor(x)
return probs, value
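As a quick sanity check, a minimal sketch of a forward pass through this network with a dummy CartPole state (assuming the ActorCritic class above has been defined):

model = ActorCritic(state_dim=4, action_dim=2)
dummy_state = torch.zeros(4)        # a single CartPole observation
probs, value = model(dummy_state)
print(probs.shape, value.shape)     # torch.Size([2]) torch.Size([1])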
- Memory Buffer for PPO
PPO requires temporary storage of:
states
actions
rewards
log probabilities
values
dones
class PPOMemory:
def __init__(self):
self.states = []
self.actions = []
self.probs = []
self.values = []
self.rewards = []
self.dones = []
def clear(self):
self.__init__()
- PPO Agent Implementation
class PPOAgent:
def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99, clip=0.2):
self.gamma = gamma
self.clip = clip
self.actor_critic = ActorCritic(state_dim, action_dim)
self.optimizer = optim.Adam(self.actor_critic.parameters(), lr=lr)
self.memory = PPOMemory()
- Selecting an Action
def select_action(self, state):
state = torch.tensor(state, dtype=torch.float32)
probs, value = self.actor_critic(state)
dist = Categorical(probs)
action = dist.sample()
self.memory.states.append(state)
self.memory.actions.append(action)
self.memory.probs.append(dist.log_prob(action))
self.memory.values.append(value)
return action.item()
- Computing Advantages
We compute advantages by walking backwards through the stored trajectory and accumulating discounted TD errors (equivalent to GAE with λ = 1), bootstrapping from the value of the last stored state:
def compute_advantages(self, next_value):
    # Convert stored value tensors to plain floats for the backward pass
    values = [v.item() for v in self.memory.values]
    rewards = self.memory.rewards
    dones = self.memory.dones
    advantages = []
    advantage = 0.0
    next_val = float(next_value)
    for i in reversed(range(len(rewards))):
        mask = 0.0 if dones[i] else 1.0          # stop bootstrapping at episode ends
        td_error = rewards[i] + self.gamma * next_val * mask - values[i]
        advantage = td_error + self.gamma * mask * advantage
        next_val = values[i]
        advantages.insert(0, advantage)
    return advantages
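The best-practices section later recommends Generalized Advantage Estimation (GAE). A minimal sketch of a GAE-style variant of the method above, assuming the same memory layout; gae_lambda is the extra bias/variance knob (not part of the original class):

def compute_gae(self, next_value, gae_lambda=0.95):
    values = [v.item() for v in self.memory.values] + [float(next_value)]
    advantages = []
    gae = 0.0
    for i in reversed(range(len(self.memory.rewards))):
        mask = 0.0 if self.memory.dones[i] else 1.0   # stop bootstrapping at episode ends
        delta = self.memory.rewards[i] + self.gamma * values[i + 1] * mask - values[i]
        gae = delta + self.gamma * gae_lambda * mask * gae
        advantages.insert(0, gae)
    return advantages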
- PPO Policy Update
The update method applies the clipped surrogate loss of Proximal Policy Optimization (PPO) over several epochs on the collected batch:
def update(self):
    states = torch.stack(self.memory.states)
    actions = torch.stack(self.memory.actions)
    old_probs = torch.stack(self.memory.probs).detach()        # log-probs under the old policy
    values = torch.stack(self.memory.values).squeeze().detach()
    next_value = values[-1]                                    # bootstrap from the last stored state
    advantages = torch.tensor(self.compute_advantages(next_value), dtype=torch.float32)
    returns = advantages + values                              # targets for the critic
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    for _ in range(5):  # multiple optimization epochs over the same batch
        probs, vals = self.actor_critic(states)
        dist = Categorical(probs)
        new_probs = dist.log_prob(actions)
        ratio = torch.exp(new_probs - old_probs)               # probability ratio new/old policy
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - self.clip, 1 + self.clip) * advantages
        actor_loss = -torch.min(surr1, surr2).mean()           # clipped surrogate objective
        critic_loss = nn.MSELoss()(vals.squeeze(), returns)
        loss = actor_loss + 0.5 * critic_loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
    self.memory.clear()
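The hyperparameter section below mentions an entropy bonus that encourages exploration. As a sketch, the loss line inside the epoch loop of update() could be extended like this (entropy_coef is a hypothetical coefficient, e.g. 0.01, not defined in the class above):

entropy = dist.entropy().mean()   # average policy entropy over the batch
loss = actor_loss + 0.5 * critic_loss - entropy_coef * entropy   # subtracting entropy rewards exploration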
3. Training the Agent
Training is the most important phase in the Proximal Policy Optimization (PPO) algorithm. This is where the agent interacts with the environment, collects experience, calculates advantages, and improves its policy. The goal is to let the agent learn how to behave in different states to maximize rewards over time.
Let’s break down the training process step-by-step in a clean and understandable way.
def train(agent, env, episodes=2000):
rewards_history = []
for episode in range(episodes):
state, _ = env.reset()
total_reward = 0
done = False
while not done:
action = agent.select_action(state)
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
agent.memory.rewards.append(reward)
agent.memory.dones.append(done)
state = next_state
total_reward += reward
rewards_history.append(total_reward)
agent.update()
if episode % 50 == 0:
print(f"Episode {episode}, Reward = {total_reward}")
return rewards_history
4. Running the PPO Agent
- What “Running the Proximal Policy Optimization (PPO) Agent” means
Running the agent covers two related activities:
Training run — interact with the environment, collect trajectories, compute advantages, and update the policy/critic (this is where learning happens).
Evaluation / inference run — run the trained policy without learning (no gradient updates) to measure performance in episodes, optionally render the environment to watch the agent behave.
Both steps are essential: training improves the policy; evaluation shows whether it learned the task.
- Practical checklist before you run
Install dependencies: gym or gymnasium, torch, numpy, matplotlib.
Choose an environment (CartPole / LunarLander / custom). Discrete vs. continuous actions affect the actor output (softmax vs. Gaussian).
Seed RNGs for reproducibility: torch.manual_seed, np.random.seed, and env.reset(seed=...) in Gymnasium (a seeding sketch follows this checklist).
Decide the device: CPU or CUDA. Move networks and tensors to the chosen device.
Logging: store episode rewards, loss values, and optionally TensorBoard logs.
Model saving: checkpoint the actor_critic weights periodically with torch.save.
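A minimal seeding sketch, assuming env was created with gym.make as above (Gymnasium seeds environments through reset rather than a separate env.seed call); the seed value is arbitrary:

SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
state, _ = env.reset(seed=SEED)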
- Full code you can run locally to train and evaluate the PPO agent
This code assumes you have already implemented ActorCritic, PPOMemory, and PPOAgent (as in the earlier sections). It includes an end-to-end main(), plus evaluation, save/load, and render functions.
import gymnasium as gym
import torch
import numpy as np
import matplotlib.pyplot as plt
# ---- Evaluation function (no learning) ----
def evaluate_agent(agent, env_name="CartPole-v1", episodes=10, render=False, device='cpu'):
    env = gym.make(env_name, render_mode="human" if render else None)
total_rewards = []
agent.actor_critic.to(device)
agent.actor_critic.eval()
with torch.no_grad():
for ep in range(episodes):
state = env.reset()[0]
ep_reward = 0
done = False
while not done:
state_t = torch.tensor(state, dtype=torch.float32, device=device)
probs, _ = agent.actor_critic(state_t)
# For deterministic evaluation: choose argmax
action = torch.argmax(probs).item()
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
ep_reward += reward
state = next_state
if render:
env.render()
total_rewards.append(ep_reward)
print(f"Eval Episode {ep+1}: Reward = {ep_reward}")
env.close()
agent.actor_critic.train()
return total_rewards
# ---- Save / Load utilities ----
def save_agent(agent, path="ppo_agent.pth"):
torch.save(agent.actor_critic.state_dict(), path)
def load_agent(agent, path="ppo_agent.pth", device='cpu'):
agent.actor_critic.load_state_dict(torch.load(path, map_location=device))
agent.actor_critic.to(device)
# ---- Example main block (train + save + evaluate) ----
def main():
env_name = "CartPole-v1"
env = gym.make(env_name)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
agent = PPOAgent(state_dim, action_dim, lr=3e-4, gamma=0.99, clip=0.2)
    # Keep the network on CPU during training: select_action builds CPU tensors.
    # The model is moved to `device` below for evaluation.
# Train (this calls agent.update inside)
rewards = train(agent, env, episodes=1000) # train() from earlier code
save_agent(agent, "ppo_cartpole.pth")
# Plot training rewards
plt.plot(rewards, label="Episode Reward")
plt.xlabel("Episode")
plt.ylabel("Reward")
plt.legend()
plt.show()
# Evaluate
load_agent(agent, "ppo_cartpole.pth", device=device)
eval_rewards = evaluate_agent(agent, env_name=env_name, episodes=10, render=False, device=device)
print("Evaluation mean reward:", np.mean(eval_rewards))
if __name__ == "__main__":
main()
Notes
Replace train(agent, env, episodes=1000) with your training function (provided earlier).
Use render=True inside evaluate_agent() to watch the agent play (works for simple desktop environments).
For continuous action spaces (e.g., Pendulum-v1), use a Gaussian actor and sample/clip actions accordingly.
Visualizing Training Results
plt.plot(rewards)
plt.xlabel("Episode")
plt.ylabel("Reward")
plt.title("PPO Training Performance")
plt.show()
5. Hyperparameter Tuning and Optimization
Although Proximal Policy Optimization (PPO) is known for being stable, robust, and relatively easy to tune, its performance still heavily depends on choosing the right hyperparameters. Each parameter controls a specific aspect of learning, stability, and exploration.
1. Learning Rate
Too high → instability
Too low → slow learning
Recommended: 3e-4
2. Clip Range (ε)
Controls how much the policy can change.
Default: 0.2
3. Discount Factor (γ)
Higher value → long-term planning
Recommended: 0.99
4. Update Epochs
More epochs = more stable but slower.
Common values: 4–10
5. Batch Size
Recommended: 2048 or 4096 steps per update
6. Entropy Bonus
Promotes exploration.
Typical: 0.01 – 0.02
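Putting the typical values above in one place, a reference configuration sketch; the minimal PPOAgent in this tutorial only exposes lr, gamma, and clip, so the remaining keys apply once you extend it:

ppo_defaults = {
    "lr": 3e-4,            # learning rate
    "clip": 0.2,           # clip range (epsilon)
    "gamma": 0.99,         # discount factor
    "n_epochs": 5,         # optimization epochs per collected batch
    "batch_size": 2048,    # timesteps collected per update
    "entropy_coef": 0.01,  # entropy bonus coefficient
}
agent = PPOAgent(state_dim, action_dim,
                 lr=ppo_defaults["lr"],
                 gamma=ppo_defaults["gamma"],
                 clip=ppo_defaults["clip"])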
Tuning Strategy
Start with default Proximal Policy Optimization (PPO) values
Optimize learning rate
Increase batch size
Adjust clip range
Tune actor/critic network sizes
Challenges and Best Practices in Proximal Policy Optimization (PPO)
Sensitive to advantage estimation
Why it’s a problem:
PPO’s policy gradients rely on the advantage to tell the actor which actions were better than expected. If the advantage estimates are noisy or biased, the gradient direction becomes unreliable and learning can slow or diverge.
How this manifests:
High variance in updates, unstable training loss, or sudden performance drops.
Practical mitigations:
Use Generalized Advantage Estimation (GAE) (tunable λ) to trade off bias/variance.
Normalize advantages before using them in the loss:
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
Improve value function fitting (a better critic architecture or more critic training) so values are closer to true returns.
Use larger batch sizes to reduce sampling noise.
Clipping can suppress beneficial updates
Why it’s a problem:
PPO’s clipped objective prevents policy updates that move the probability ratio outside the interval [1 - ε, 1 + ε]. While this stabilizes training, it can also prevent legitimate, helpful updates when the advantage is large.
How this manifests:
Slow learning in tasks where occasional large updates are needed to escape a poor local policy; policy improvement plateaus.
Practical mitigations:
Carefully tune the clip range (ε). Try smaller values (0.1) for very sensitive tasks, larger values (0.25–0.3) if learning is too slow, but monitor stability.
Use adaptive clipping (clip that decays or is computed per-batch) if you need more flexibility.
Increase batch size or epochs so the true advantage signal is clearer and not mistakenly clipped.
Combine clipping with a small penalty-based trust-region (hybrid approaches) if needed.
Requires tuning for continuous control
Why it’s a problem:
Continuous action spaces (e.g., robotics) often require Gaussian policies, action scaling, proper exploration-exploitation balance, and more precise critic estimates. Hyperparameters that work on discrete tasks (CartPole) often fail on continuous control.
How this manifests:
Large action variance, unstable actuations, oscillatory behavior, or failure to converge.
Practical mitigations:
Use separate learning rates for actor and critic (critic often needs higher LR or more updates).
Carefully initialize action standard deviation; consider learning a state-dependent std or using an annealed schedule.
Normalize observations and rewards; apply action clipping to keep outputs in feasible bounds.
Use a stronger critic (larger network, more update steps) and larger batch sizes.
Consider off-policy or hybrid algorithms (SAC, TD3) for very hard continuous tasks.
Not as sample efficient as offline RL
Why it’s a problem:
PPO is an on-policy algorithm: it discards collected trajectories after a few epochs of update. Offline RL and off-policy algorithms reuse past experience more extensively, making them more sample efficient.
How this manifests:
Requires many environment interactions (episodes/steps) to reach good performance, which is costly in real-world or slow simulators.
Practical mitigations:
Use parallel/vectorized environments to collect more samples per wall-clock second.
Increase epochs and minibatch reuse (carefully) to squeeze more value from trajectories.
For expensive environments, consider off-policy methods (SAC, DDPG) or hybrid approaches that combine on-policy stability and off-policy efficiency.
Use careful curriculum learning or shaped rewards to reduce sample complexity.
Struggles in sparse reward environments
Why it’s a problem:
PPO optimizes via gradient signals derived from rewards. If rewards are rare, the advantage estimates are mostly zeros and the policy receives little learning signal.
How this manifests:
Very slow or no learning, random exploration without meaningful progress.
Practical mitigations:
Introduce reward shaping or intermediate rewards to provide denser feedback (be careful to avoid overriding the desired objective).
Use intrinsic motivation or exploration bonuses (curiosity, count-based, intrinsic curiosity modules).
Employ demonstration data or imitation learning (pretrain with behavior cloning) to bootstrap learning.
Use hierarchical RL or options to decompose long-horizon tasks into smaller subtasks.
Best Practices — explained in detail (with practical tips)
Normalize advantages
What: Standardize advantages to zero mean and unit variance before use.
Why: Reduces gradient variance and stabilizes learning across minibatches.
How:
adv = (adv - adv.mean()) / (adv.std() + 1e-8)
Tip: Do this each update step (after computing advantages for the batch).
Use GAE instead of simple TD advantage
What: Generalized Advantage Estimation computes advantages with a λ parameter that interpolates between high-bias/low-variance (λ≈0) and low-bias/high-variance (λ≈1).
Why: GAE often gives better bias-variance tradeoff, improving stability and final performance.
How: Typical values: lambda = 0.95 or 0.97.
Tip: Tune λ together with γ: lower λ if advantages are noisy; increase λ for smoother advantage estimates.
Use separate learning rates for actor & critic
What: Give the actor and critic their own optimizers and learning rates.
Why: Critic often needs faster convergence (or vice versa) — separate LRs let you balance their learning speeds.
How:
actor_opt = Adam(actor.parameters(), lr=3e-4)
critic_opt = Adam(critic.parameters(), lr=1e-3)
Tip: Monitor critic loss; if value estimates lag, increase critic updates or LR.
Keep clipping range small (0.1–0.2)
What: Use a conservative clip range to retain trust-region behavior.
Why: Small clipping yields more stable updates and prevents destructive policy jumps.
How: Start with 0.2; for fragile or continuous tasks try 0.1.
Tip: If learning is too slow after other fixes, slowly relax the clip (e.g., 0.25) and monitor stability.
Use reward scaling for stability
What: Scale or normalize rewards so their magnitude is numerically reasonable.
Why: Large reward magnitudes cause large gradients and unstable updates; tiny rewards lead to vanishing signals.
How:
Clip rewards to a range (e.g., [-10, +10]), or
use running mean/std normalization:
r_norm = (r - mean) / (std + eps)
Tip: When using reward scaling, remember to adjust value loss weighting accordingly.
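A minimal sketch of running reward normalization with a small hypothetical helper class (not part of the code above); the commented lines show where it would plug into the training loop:

class RunningRewardNorm:
    """Tracks a running mean/variance of rewards and normalizes incoming rewards."""
    def __init__(self, eps=1e-8):
        self.mean, self.var, self.count, self.eps = 0.0, 1.0, 0, eps

    def normalize(self, r):
        # Incremental (Welford-style) update of the mean and population variance
        self.count += 1
        delta = r - self.mean
        self.mean += delta / self.count
        self.var += (delta * (r - self.mean) - self.var) / self.count
        return (r - self.mean) / (self.var ** 0.5 + self.eps)

reward_norm = RunningRewardNorm()
# inside the environment loop:
#   agent.memory.rewards.append(reward_norm.normalize(reward))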
Train with many epochs for complex environments
What: Increase the number of optimization epochs per collected batch.
Why: For complex tasks, you want to extract more learning signal from each sampled trajectory.
How: Try n_epochs = 8–10, or higher if stable.
Tip: Watch for overfitting to the batch; if performance degrades, reduce epochs or increase batch size.
Use larger batch sizes for continuous action environments
What: Collect more timesteps per update (2048, 4096, or more).
Why: Continuous control benefits from lower variance gradient estimates and better scaling of the clipping mechanism.
How: Increase n_steps per environment or use many parallel environments.
Tip: Use vectorized environments (gym.vector) or Stable-Baselines3’s VecEnv to collect large batches efficiently.
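A minimal sketch of stepping several CartPole instances in parallel with Gymnasium’s SyncVectorEnv; the training loop shown earlier would need small changes to handle batched states:

import gymnasium as gym
import numpy as np

# Eight synchronous copies of CartPole stepped together
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(8)])
states, _ = envs.reset(seed=42)
print(states.shape)  # (8, 4): one observation row per parallel environment

actions = np.array([envs.single_action_space.sample() for _ in range(8)])
next_states, rewards, terminated, truncated, infos = envs.step(actions)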
Conclusion
Proximal Policy Optimization (PPO) remains one of the most powerful and widely adopted reinforcement learning algorithms. It strikes the perfect balance between stability, simplicity, and performance. In this article, we implemented PPO from scratch using PyTorch, trained the agent, evaluated its performance, and explored visualization, challenges, and tuning strategies.
Whether you’re building robotics systems, trading agents, or game AI, PPO is an excellent starting point due to its trust-region inspired stability and clean architecture.
FAQs on Proximal Policy Optimization (PPO)
1. Why is PPO considered more stable than other reinforcement learning algorithms?
PPO uses a clipped objective function that limits how much the policy can change in a single update. This prevents large, destructive gradient steps and makes training more stable compared to algorithms like vanilla policy gradient or REINFORCE.
2. What is the purpose of the clipping parameter (ε) in PPO?
The clipping parameter controls the allowed deviation between the new policy and old policy.
If the update tries to change the policy too much, clipping restrains it—helping maintain a trust region and preventing policy collapse.
3. Why does PPO use Generalized Advantage Estimation (GAE)?
GAE reduces variance in advantage estimates while keeping bias relatively low.
This makes updates smoother, improves sample efficiency, and stabilizes training—especially in long-horizon environments.
4. Is PPO good for continuous action environments?
Yes, PPO is widely used for robotics and physics control tasks.
However, it requires more careful tuning of:
learning rate
entropy bonus
batch size
clipping range
compared to simpler discrete tasks.
5. What are the most important hyperparameters to tune in PPO?
The following hyperparameters have the strongest impact on performance:
Learning rate (3e-4 recommended)
Clip range (0.1–0.2)
Batch size (2048–4096)
GAE λ and γ
Entropy coefficient
Fine-tuning these parameters often leads to significantly better policy performance.