Vanilla Actor-Critic: Coding, Theory, Examples, and 2025 Trends

If you are entering the world of Artificial Intelligence or reinforcement learning (RL), then Vanilla Actor-Critic is a concept that will win your heart! It is a hybrid RL algorithm that combines policy-based (actor) and value-based (critic) methods. Imagine one friend (the actor) making moves in a game while another friend (the critic) says, “this move was a hit or a flop!” This teamwork makes RL stable and powerful.

In this blog we will:

  • Explain the theory of Vanilla Actor-Critic with analogies and examples.
  • Give multiple Python code snippets (Gym, PyTorch) so that you can experiment yourself.
  • Discuss real-world applications and AI trends of 2025.
  • Share SEO tips so that you can take your blog (like aigreeks.com) to the top of Google.

Heavy on coding, and with enough depth in theory to make your head spin (in a good way)! Come on, let’s get started!

Vanilla Actor-Critic

Theory of Vanilla Actor-Critic (Deep Dive)

Actor: Decision Maker

The actor is a policy function, (\pi(a|s; \theta)), which chooses an action (a) by looking at the state (s). It is usually a neural network that handles two cases:

Discrete Actions: Outputs action probabilities via a softmax (e.g., left or right).

Continuous Actions: Outputs the mean and standard deviation of a Gaussian distribution.

Analogy: The actor is a gamer who presses buttons by looking at the game screen (state). His neural network (parameters (\theta)) represents his skills, which improve with practice.

Example: Suppose you are playing the CartPole game (you have to balance the pole). The state contains the pole’s angle and velocity. The actor sees this and decides: “Push left or right?” The decision is probabilistic, e.g., 70% left, 30% right.
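For intuition, here is a tiny sketch of my own (with made-up probabilities, not code from the original post) of how such a probabilistic decision looks in PyTorch:

import torch

# Hypothetical actor output for the current CartPole state: 70% left, 30% right
probs = torch.tensor([0.7, 0.3])
dist = torch.distributions.Categorical(probs)
action = dist.sample()  # 0 = push left, 1 = push right
print(action.item(), dist.log_prob(action).item())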

Critic: Evaluator

The critic is a value function, (V(s; w)), that estimates how good it is to be in a state. It is usually also a neural network, and it gives feedback to the actor.

Analogy: The critic is a coach who watches the game and says, “Your score in this situation is 8/10!” It tells the actor how effective his action was.

Example: In CartPole, the critic predicts how “safe” the state is by looking at the angle and velocity of the pole. If the pole is about to fall, the value will be low; if it is stable, the value will be high.

How Vanilla Actor-Critic Works

The core idea of Actor-Critic is:

  1. The actor chooses an action: (a_t \sim \pi(a|s_t; \theta)).
  2. The environment returns a reward ((r_t)) and the next state ((s_{t+1})).
  3. The critic computes the TD error: [ \delta_t = r_t + \gamma V(s_{t+1}; w) - V(s_t; w) ]
      • (r_t): Immediate reward.
      • (\gamma): Discount factor (0 to 1), which weights future rewards.
      • (V(s_t; w)): Predicted value of the current state.
  4. The actor improves with the policy gradient: [ \nabla_{\theta} J(\theta) \approx \delta_t \nabla_{\theta} \log \pi(a_t | s_t; \theta) ]
  5. The critic updates (w) to minimize the TD error, so its value predictions get better.

 Actor tries, critic scores, and both learn together to get more and more rewards!

Detailed Example: In CartPole:

  • State: Pole angle = 5°, velocity = 0.2.
  • Actor: Chooses “right push” with 60% probability.
  • Environment: Reward = +1 (pole balanced), next state angle 4°.
  • Critic: Current state’s value (V(s_t) = 10), next state’s (V(s_{t+1}) = 11). TD error: [ \delta_t = 1 + 0.99 \cdot 11 - 10 = 1.89 ]
  • Actor: The positive TD error tells the actor that “right push” was good, so it will choose it more often (see the one-step sketch below).
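The numbers above can be checked with a minimal one-step sketch (my own illustration, not code from the original post); the scalar tensors stand in for the outputs of the actor and critic networks defined later in Code 1:

import torch

# One actor-critic step with the numbers from the CartPole example above.
gamma = 0.99
reward = 1.0
v_s = torch.tensor(10.0, requires_grad=True)       # critic's value for the current state
v_next = torch.tensor(11.0)                         # critic's value for the next state (treated as a fixed target)
log_prob = torch.tensor(-0.51, requires_grad=True)  # log pi("right push" | s_t), roughly log(0.6)

# TD error: delta = r + gamma * V(s') - V(s)
delta = reward + gamma * v_next - v_s
print(f"TD error: {delta.item():.2f}")              # prints 1.89, matching the worked example

# Actor loss: -delta * log pi(a|s); delta is detached so it acts as a constant weight
actor_loss = -delta.detach() * log_prob
# Critic loss: squared TD error, which pushes V(s_t) toward r + gamma * V(s_{t+1})
critic_loss = delta.pow(2)
actor_loss.backward()
critic_loss.backward()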

Vanilla Policy Gradient (VPG) to Actor-Critic

VPG is a simple algorithm that optimizes only the policy, without a learned critic. Formula: [ \nabla_{\theta} J(\theta) = \mathbb{E} \left[ \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) A(s_t,a_t) \right] ] where the advantage (A(s_t,a_t)) is estimated from the rewards of the entire episode (Monte Carlo returns), so the variance is high. Actor-Critic improves on it by:

  • Reducing variance by using critic.
  • Faster learning from TD error.

Example: In VPG, the policy is updated only after collecting the rewards of an entire CartPole episode (e.g., 200 steps). Actor-Critic can update using the TD error at every step, which is faster and more stable; the sketch below contrasts the two targets.
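To make the difference concrete, here is a small sketch of my own (hypothetical numbers) contrasting the full-episode discounted return used by VPG with the one-step bootstrapped TD target used by the critic:

gamma = 0.99
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]     # a hypothetical 5-step episode
values = [10.0, 10.5, 10.8, 10.2, 9.0]  # hypothetical critic estimates V(s_t)
next_values = values[1:] + [0.0]        # V(s_{t+1}); 0 for the terminal state

# VPG target: discounted return from each step to the end of the episode
returns = []
R = 0.0
for r in reversed(rewards):
    R = r + gamma * R
    returns.insert(0, R)

# Actor-Critic target: one-step TD target r_t + gamma * V(s_{t+1})
td_targets = [r + gamma * v for r, v in zip(rewards, next_values)]

print("Monte Carlo returns:", [round(x, 2) for x in returns])
print("TD targets:         ", [round(x, 2) for x in td_targets])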

Limitations and Challenges

  • Bias in Critic: If the critic predicts wrong values, the actor learns in the wrong direction.
  • Local Optima: Policy gradient methods can sometimes end up in suboptimal solutions.
  • Hyperparameters: Learning rates ((\alpha_{\theta}), (\alpha_w)) and (\gamma) need to be tuned.

Vanilla Actor-Critic is a solid player, but it needs a little tuning and attention to become a champion!

Coding Vanilla Actor-Critic in Python (Multiple Examples)

Now the real game begins! We will provide multiple code snippets so you can understand Actor-Critic from different angles, all with Gym and PyTorch.

Setup and Dependencies

Prepare the environment first:

				
pip install gym==0.21.0 torch numpy

We will use the CartPole-v1 (pole balancing) and LunarLanderContinuous-v2 (rocket landing with continuous thrust) environments. (For LunarLander you will likely also need Box2D, e.g. pip install 'gym[box2d]==0.21.0'.)
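Before training, a quick sanity check helps. This is a minimal sketch assuming the old gym 0.21 API installed above, where reset() returns only the observation and step() returns four values:

import gym

env = gym.make('CartPole-v1')
print(env.observation_space.shape)  # (4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space.n)           # 2: push left or push right

state = env.reset()                 # gym 0.21: returns the observation only
next_state, reward, done, info = env.step(env.action_space.sample())
env.close()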

Code 1: Basic Vanilla Actor-Critic for CartPole

This is a simple implementation where the actor and critic are separate neural networks.

				
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# Actor Network
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, action_dim)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = self.fc2(x)
        probs = self.softmax(x)
        return probs

# Critic Network
class Critic(nn.Module):
    def __init__(self, state_dim):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 1)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        value = self.fc2(x)
        return value

# Hyperparameters
GAMMA = 0.99
LR_ACTOR = 0.001
LR_CRITIC = 0.001
EPISODES = 1000

# Environment and Models
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

actor = Actor(state_dim, action_dim)
critic = Critic(state_dim)
actor_optimizer = optim.Adam(actor.parameters(), lr=LR_ACTOR)
critic_optimizer = optim.Adam(critic.parameters(), lr=LR_CRITIC)

# Training Loop
for episode in range(EPISODES):
    state = env.reset()
    log_probs = []
    values = []
    rewards = []
    done = False

    while not done:
        state = torch.FloatTensor(state)
        probs = actor(state)
        value = critic(state)

        # Sample action
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)

        # Take action
        next_state, reward, done, _ = env.step(action.item())

        # Store data
        log_probs.append(log_prob)
        values.append(value)
        rewards.append(reward)

        state = next_state

    # Calculate returns
    returns = []
    R = 0
    for r in rewards[::-1]:
        R = r + GAMMA * R
        returns.insert(0, R)
    returns = torch.FloatTensor(returns)

    # Normalize returns
    returns = (returns - returns.mean()) / (returns.std() + 1e-5)

    # Update Actor and Critic
    actor_loss = 0
    critic_loss = 0
    for log_prob, value, R in zip(log_probs, values, returns):
        advantage = R - value.item()
        actor_loss += -log_prob * advantage
        critic_loss += (R - value) ** 2

    # Backpropagation
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()

    if episode % 100 == 0:
        print(f"Episode {episode}, Total Reward: {np.sum(rewards):.2f}")

env.close()
				
			

Explanation: This code trains Actor-Critic on CartPole. The actor outputs action probabilities (left or right), the critic predicts the value of each state, and at the end of each episode both are updated using the advantage (discounted return minus predicted value) rather than a per-step TD error. Normalizing the returns makes learning more stable.

Output: After 500–1000 episodes the total episode reward should reach 200+, meaning the pole stays balanced!
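Once training finishes, you can watch the learned policy. This is a minimal evaluation sketch of my own (it assumes the trained actor from Code 1 is still in memory and that rendering works on your machine):

import gym
import torch

env = gym.make('CartPole-v1')
state = env.reset()
done = False
total_reward = 0
while not done:
    env.render()                              # opens a window in gym 0.21
    with torch.no_grad():
        probs = actor(torch.FloatTensor(state))
    action = torch.argmax(probs).item()       # act greedily with the trained actor
    state, reward, done, _ = env.step(action)
    total_reward += reward
print("Evaluation reward:", total_reward)
env.close()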

Code 2: Advanced Actor-Critic for LunarLander

Now let’s try a more complex environment, LunarLanderContinuous-v2. This is a rocket landing game where the actions are continuous (engine thrust).

				
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# Actor for Continuous Actions
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.mu = nn.Linear(64, action_dim)
        self.sigma = nn.Linear(64, action_dim)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        mu = torch.tanh(self.mu(x))  # Action range [-1, 1]
        sigma = torch.exp(self.sigma(x))  # Standard deviation (kept positive via exp)
        return mu, sigma

# Critic
class Critic(nn.Module):
    def __init__(self, state_dim):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 1)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        value = self.fc2(x)
        return value

# Hyperparameters
GAMMA = 0.99
LR_ACTOR = 0.0001
LR_CRITIC = 0.001
EPISODES = 2000

# Environment and Models
env = gym.make('LunarLanderContinuous-v2')  # continuous-action version (action_space.shape is used below)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]

actor = Actor(state_dim, action_dim)
critic = Critic(state_dim)
actor_optimizer = optim.Adam(actor.parameters(), lr=LR_ACTOR)
critic_optimizer = optim.Adam(critic.parameters(), lr=LR_CRITIC)

# Training Loop
for episode in range(EPISODES):
    state = env.reset()
    log_probs = []
    values = []
    rewards = []
    done = False

    while not done:
        state = torch.FloatTensor(state)
        mu, sigma = actor(state)
        value = critic(state)

        # Sample action from Gaussian
        dist = torch.distributions.Normal(mu, sigma)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(dim=-1)

        # Take action
        next_state, reward, done, _ = env.step(action.numpy())

        # Store data
        log_probs.append(log_prob)
        values.append(value)
        rewards.append(reward)

        state = next_state

    # Calculate returns
    returns = []
    R = 0
    for r in rewards[::-1]:
        R = r + GAMMA * R
        returns.insert(0, R)
    returns = torch.FloatTensor(returns)

    # Normalize returns
    returns = (returns - returns.mean()) / (returns.std() + 1e-5)

    # Update Actor and Critic
    actor_loss = 0
    critic_loss = 0
    for log_prob, value, R in zip(log_probs, values, returns):
        advantage = R - value.item()
        actor_loss += -log_prob * advantage
        critic_loss += (R - value) ** 2

    # Backpropagation
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()

    if episode % 100 == 0:
        print(f"Episode {episode}, Total Reward: {np.sum(rewards):.2f}")

env.close()
				
			

Explanation: This code is for LunarLanderContinuous-v2, where the actor outputs continuous actions (engine thrust) by sampling from a Gaussian distribution. The critic works the same way as before: it predicts the value of the state. This is harder because the actions are continuous, but the logic is the same!

Output: After 1000–2000 episodes the episode reward should reach 200+, meaning the rocket lands safely.
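One practical detail not in the code above (my own suggestion): LunarLanderContinuous expects actions in [-1, 1], so it is safer to clip the sampled action before stepping the environment. Inside the training loop, the env.step call would become:

# Clip the Gaussian sample to the environment's valid action range before stepping
action = dist.sample()
log_prob = dist.log_prob(action).sum(dim=-1)
clipped_action = torch.clamp(action, -1.0, 1.0)
next_state, reward, done, _ = env.step(clipped_action.numpy())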

Code 3: Advantage Actor-Critic (A2C)

Now an upgrade: Advantage Actor-Critic (A2C), where an advantage function is used for better stability.

				
# A2C Modification (CartPole)
# Same Actor and Critic as in Code 1, but the update logic changes
for episode in range(EPISODES):
    state = env.reset()
    log_probs = []
    values = []
    rewards = []
    done = False

    while not done:
        state = torch.FloatTensor(state)
        probs = actor(state)
        value = critic(state)

        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)

        next_state, reward, done, _ = env.step(action.item())

        log_probs.append(log_prob)
        values.append(value)
        rewards.append(reward)

        state = next_state

    # Calculate advantage
    returns = []
    R = 0
    for r in rewards[::-1]:
        R = r + GAMMA * R
        returns.insert(0, R)
    returns = torch.FloatTensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-5)

    advantages = returns - torch.cat(values).detach()
    actor_loss = -torch.mean(torch.stack(log_probs) * advantages)
    critic_loss = torch.mean((returns - torch.cat(values)) ** 2)

    # Backpropagation
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()

    if episode % 100 == 0:
        print(f"Episode {episode}, Total Reward: {np.sum(rewards):.2f}")
				
			

Explanation: A2C uses the advantage ((A = R - V(s))) instead of the raw return, which reduces variance even further. It converges faster!

Real-World Examples and 2025 Trends

Practical Examples

  • Gaming: Actor-Critic methods have achieved superhuman performance in Atari games (e.g., Pong). The actor selects moves, the critic scores them.
  • Robotics: Boston Dynamics-style robots can learn tasks with Actor-Critic methods, such as opening a door or avoiding an obstacle.
  • SEO Optimization: Actor-Critic can optimize content strategies, e.g., adjusting keywords based on click-through rates.

Example: Imagine you are using Actor-Critic for a website. The actor decides which keyword to use (e.g., “AI tutorial” or “RL guide”). The critic evaluates how many clicks the website got. If clicks are low, the actor changes the keyword! (A toy sketch of this idea follows below.)
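As a purely hypothetical toy (the keywords and click-through rates are made up, and this is not code from the original post), here is that idea sketched as a two-armed bandit trained with a policy gradient: the “actor” is a softmax over two keywords, and the “critic” is just a running average of clicks used as a baseline.

import numpy as np

rng = np.random.default_rng(0)
keywords = ["AI tutorial", "RL guide"]
true_ctr = [0.05, 0.12]          # hypothetical click-through rates
theta = np.zeros(2)              # actor parameters (softmax preferences)
baseline = 0.0                   # "critic": running average of clicks
alpha, beta = 0.5, 0.05          # learning rates

for step in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()   # actor: softmax policy over keywords
    k = rng.choice(2, p=probs)                    # pick a keyword
    reward = float(rng.random() < true_ctr[k])    # 1 if the simulated user clicks
    advantage = reward - baseline                 # critic feedback (reward minus baseline)
    grad_log_pi = -probs                          # grad of log pi(k) w.r.t. theta ...
    grad_log_pi[k] += 1.0                         # ... is one_hot(k) - probs for a softmax
    theta += alpha * advantage * grad_log_pi
    baseline += beta * (reward - baseline)        # move the baseline toward the average click rate

final_probs = np.exp(theta) / np.exp(theta).sum()
print({kw: round(p, 2) for kw, p in zip(keywords, final_probs)})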

2025 RL Trends

  • Curiosity-Driven RL: OpenAI’s Random Network Distillation extends Actor-Critic by adding intrinsic rewards for exploration.
  • Multi-Agent RL: Actor-Critic now works in teams, e.g., autonomous drones collaborating with each other.
  • Modern Algorithms: PPO and SAC evolved from these ideas and are now the industry standard, but Actor-Critic is their base.

 Actor-Critic is an old player, but it is still the hero for the AI of 2025.

Advanced Actor-Critic and Future

Upgrades

  • A2C: Uses an advantage function for stability.
  • PPO: Safer updates via a clipped objective.
  • SAC: Adds an entropy bonus for exploration.

Code Snippet (PPO Idea):

				
# PPO Clip (Simplified): old_log_prob and advantage come from the previous policy's rollout
clip_epsilon = 0.2
ratio = torch.exp(log_prob - old_log_prob)                        # pi_new(a|s) / pi_old(a|s)
clipped = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon)  # limit how far the policy can move
actor_loss = -torch.min(ratio * advantage, clipped * advantage).mean()
				
			

Future of Actor-Critic

  • Scalability: Actor-Critic-based methods will see heavy use in large language models (LLMs) and robotics.
  • Social and web applications: real-time models for dynamic content optimization.

Conclusion

Now you are the boss of Vanilla Actor-Critic! You have understood the theory, tried multiple code examples, and seen real-world applications. Now what?

  • Run the codes in Jupyter Notebook.
  • Share the GitHub repo: Actor-Critic Demo.
  • Publish on Aigreeks.com and rank on Google!

If you have a question, leave a comment, and share the blog on WhatsApp, Reddit, or LinkedIn. Let’s win the game of RL and SEO!
