Vanilla Policy Gradient: A Complete Theoretical Foundation

Table of Contents

  • Introduction
  • Vanilla Policy Gradient
  • Formal Definition of a Stochastic Policy
  • Expected Return as an Optimization Objective
  • Monte Carlo Estimation Theory in Vanilla Policy Gradient
  • REINFORCE Algorithm as a Theoretical Instantiation of Vanilla Policy Gradient (VPG)
  • Case-Based Theoretical Analysis (Vanilla Policy Gradient & REINFORCE)
  • Baseline Theory and Variance Reduction in Vanilla Policy Gradient
  • Advantage Function: A Purely Theoretical Interpretation
  • Vanilla Policy Gradient as an Actor-Only Method (Theoretical Perspective)
  • Frequently Asked Questions (FAQs): Vanilla Policy Gradient
  • Conclusions

Introduction

Vanilla Policy Gradient (VPG) is one of the most fundamental and conceptually pure algorithms in policy-based Reinforcement Learning (RL). It represents the earliest formalization of directly optimizing a parameterized policy using gradient ascent on expected return. Unlike value-based methods—which first learn value functions and then derive policies indirectly—VPG operates by explicitly modeling the policy and updating its parameters in the direction that improves long-term performance.

At its core, Vanilla Policy Gradient answers a central question in reinforcement learning:

How can an agent adjust the parameters of a stochastic policy so as to maximize the expected cumulative reward obtained from interacting with an environment?

To address this, VPG leverages tools from probability theory, stochastic optimization, and differential calculus, culminating in the celebrated Policy Gradient Theorem. This theorem provides a mathematically rigorous expression for the gradient of the expected return with respect to policy parameters—without requiring gradients of the environment dynamics.

This property is crucial, as it allows VPG to be applied to unknown, non-differentiable, and stochastic environments, which are common in real-world decision-making problems.

Vanilla Policy Gradient

Formal Definition of a Stochastic Policy

In Reinforcement Learning (RL), a policy defines the agent’s decision-making mechanism—that is, how actions are selected based on observed states. When actions are chosen probabilistically rather than deterministically, the policy is referred to as a stochastic policy. Stochastic policies are foundational to policy-gradient methods such as Vanilla Policy Gradient, because they enable differentiable optimization and principled exploration.


1. Policy in the Markov Decision Process Framework

Formally, reinforcement learning problems are modeled as a Markov Decision Process (MDP), defined by the tuple:

\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)

where:

  • \mathcal{S} is the state space

  • \mathcal{A} is the action space

  • P(s' \mid s, a) is the state transition probability

  • R(s, a) is the reward function

  • \gamma \in (0,1] is the discount factor

Within this framework, a policy governs the agent’s interaction with the environment.


2. Formal Definition of a Stochastic Policy

A stochastic policy is defined as a conditional probability distribution over actions given a state:

\pi(a \mid s) = \mathbb{P}(A_t = a \mid S_t = s)

where:

  • s \in \mathcal{S} is the current state

  • a \in \mathcal{A} is a possible action

  • \pi(a \mid s) denotes the probability of selecting action a in state s

Key Properties

For every state s \in \mathcal{S}, the policy satisfies:

  1. Non-negativity

\pi(a \mid s) \ge 0 \quad \forall a \in \mathcal{A}

  2. Normalization

\sum_{a \in \mathcal{A}} \pi(a \mid s) = 1

(or \int_{\mathcal{A}} \pi(a \mid s)\, da = 1 for continuous action spaces)

These conditions ensure that \pi(\cdot \mid s) is a valid probability distribution.


3. Parameterized Stochastic Policies

In policy-gradient methods, policies are typically parameterized by a vector \theta \in \mathbb{R}^d, yielding a family of policies:

\pi_\theta(a \mid s)

The objective of learning is to adjust \theta so as to maximize expected cumulative reward.

Examples of Parameterizations

  • Discrete actions: Softmax policy

\pi_\theta(a \mid s) = \frac{\exp(f_\theta(s,a))}{\sum_{a'} \exp(f_\theta(s,a'))}

  • Continuous actions: Gaussian policy

\pi_\theta(a \mid s) = \mathcal{N}(a \mid \mu_\theta(s), \Sigma_\theta(s))

These parameterizations are chosen to be smooth and differentiable with respect to \theta, a critical requirement for gradient-based optimization.
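
To make the two parameterizations above concrete, here is a minimal Python/NumPy sketch (my own illustration, not part of the original text). The linear score f_\theta(s,a) = \theta_a^\top s and the fixed log-standard-deviation Gaussian are simplifying assumptions chosen only for brevity.

```python
# Minimal sketch of the two parameterizations above (assumptions: a linear score
# f_theta(s, a) = theta[a] @ s for the softmax case, and a Gaussian with a linear
# mean and a fixed log-std for the continuous case).
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(. | s) for a discrete action set: softmax over linear scores."""
    logits = theta @ s
    logits = logits - logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def gaussian_policy_sample(mu_weights, log_std, s, rng):
    """Sample a continuous action from N(mu_weights @ s, diag(exp(log_std))^2)."""
    mean = mu_weights @ s
    return mean + np.exp(log_std) * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
s = np.array([0.2, -1.0, 0.5])              # example state features
theta = rng.standard_normal((4, 3))         # 4 discrete actions, 3 features
probs = softmax_policy(theta, s)
print(probs, probs.sum())                   # non-negative and sums to 1

mu_w = rng.standard_normal((2, 3))          # 2-dimensional continuous action
print(gaussian_policy_sample(mu_w, np.zeros(2), s, rng))
```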


4. Stochastic Policy as a Randomized Decision Rule

From a theoretical standpoint, a stochastic policy can be interpreted as a randomized decision rule:

\pi: \mathcal{S} \rightarrow \Delta(\mathcal{A})

where \Delta(\mathcal{A}) denotes the probability simplex over the action space.

This formulation highlights an important conceptual point:

The policy does not choose an action directly—it defines a distribution from which actions are sampled.

This probabilistic structure induces randomness in the agent’s behavior, even when the environment itself is deterministic.


5. Why Stochastic Policies Are Essential in Policy Gradient Theory

Stochastic policies play a central theoretical role in VPG for several reasons:

5.1 Differentiability of the Objective

The expected return objective:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} [R(\tau)]

depends on \theta only through the policy distribution. The likelihood-ratio (log-derivative) trick used in policy gradients,

\nabla_\theta \log \pi_\theta(a \mid s),

is well-defined only when the policy assigns non-zero probability mass smoothly across actions.


5.2 Exploration Guarantee

Unlike deterministic policies, stochastic policies naturally ensure exploration:

\pi(a \mid s) > 0 \Rightarrow \text{action } a \text{ is eventually explored}

This is critical for convergence guarantees in theoretical analyses of policy-gradient algorithms.


5.3 Avoiding Non-Differentiability

Deterministic policies often lead to non-differentiable mappings from parameters to actions. In contrast, stochastic policies maintain a smooth dependence of action probabilities on \theta, enabling unbiased gradient estimation.


6. Relationship to Deterministic Policies

A deterministic policy \mu(s) can be viewed as a degenerate stochastic policy:

\pi(a \mid s) = \begin{cases} 1, & a = \mu(s) \\ 0, & \text{otherwise} \end{cases}

However, such policies lack differentiability almost everywhere, which is why classical VPG relies on stochastic rather than deterministic formulations.


7. Theoretical Role in Trajectory Distributions

A stochastic policy induces a probability distribution over trajectories:

p_\theta(\tau) = \rho_0(s_0)\prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)

Here, the policy is the only component dependent on \theta. This factorization is fundamental to deriving the Policy Gradient Theorem, as it allows gradients of expected return to be expressed entirely in terms of \nabla_\theta \log \pi_\theta(a \mid s).


8. Summary (Theoretical Perspective)

From a theoretical standpoint, a stochastic policy is:

  • A probability distribution over actions conditioned on states

  • A differentiable, parameterized object enabling gradient-based optimization

  • The core mechanism through which randomness, exploration, and learning are introduced in VPG

In Vanilla Policy Gradient, the stochastic policy is not merely a design choice—it is the mathematical object that makes policy optimization tractable, analyzable, and theoretically sound.

Expected Return as an Optimization Objective

In policy-based Reinforcement Learning, and particularly in Vanilla Policy Gradient (VPG), learning is framed as a direct optimization problem. The agent does not aim to approximate value functions as an end in themselves; instead, it seeks to optimize the parameters of a policy so that long-term performance is maximized. The quantity that formalizes this notion of performance is the expected return.

From a theoretical perspective, the expected return serves as the objective functional over the space of stochastic policies.


1. Return: Cumulative Reward Along a Trajectory

Consider an MDP \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma).
A trajectory (or episode) of length T is defined as:

\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T)

The return from time step t is the discounted cumulative reward:

G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k

For episodic tasks, the return from the initial state is:

G_0 = \sum_{t=0}^{T-1} \gamma^t r_t

The discount factor \gamma \in (0,1] ensures convergence of the sum and encodes a preference for immediate rewards over distant ones.
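
As a quick worked illustration of this definition, the snippet below (my own toy numbers, with \gamma = 0.9 and an arbitrary four-step reward sequence, neither taken from the article) evaluates G_t for a short episode.

```python
# Toy illustration of the discounted return G_t (assumed values: gamma = 0.9 and an
# arbitrary reward sequence; neither comes from the article).
def discounted_return(rewards, gamma, t=0):
    """G_t = sum_{k=t}^{T-1} gamma^(k-t) * r_k for a finite reward list."""
    return sum(gamma ** (k - t) * r for k, r in enumerate(rewards) if k >= t)

rewards = [1.0, 0.0, 0.0, 2.0]                      # r_0, r_1, r_2, r_3
print(discounted_return(rewards, 0.9, t=0))         # 1 + 0.9**3 * 2 = 2.458
print(discounted_return(rewards, 0.9, t=2))         # 0 + 0.9 * 2   = 1.8
```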


2. Expected Return: From Random Trajectories to Objective Function

Under a stochastic policy \pi_\theta, trajectories are random variables due to:

  • stochasticity in the policy

  • stochasticity in environment transitions

Therefore, performance cannot be measured by a single trajectory. Instead, it is defined in expectation.

Definition (Expected Return)

The expected return of a parameterized policy \pi_\theta is:

J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ G_0(\tau) \right]

where:

  • p_\theta(\tau) is the probability distribution over trajectories induced by \pi_\theta

  • G_0(\tau) is the return associated with trajectory \tau

This expectation transforms a stochastic interaction process into a deterministic optimization objective.


3. Trajectory Distribution Induced by a Policy

The trajectory distribution factorizes as:

p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)

Key theoretical insight:

The policy \pi_\theta is the only component of p_\theta(\tau) that depends on \theta.

This property is fundamental—it enables gradient-based optimization without requiring knowledge of environment dynamics.


4. Optimization Problem Formulation

The learning objective in Vanilla Policy Gradient is formally written as:

\theta^\ast = \arg\max_{\theta} J(\theta)

This is a stochastic optimization problem over a high-dimensional, non-convex objective landscape.

Notably:

  • J(\theta) is generally non-linear and non-convex

  • Closed-form solutions are almost never available

  • Optimization must rely on Monte Carlo gradient estimates


5. Alternative Equivalent Forms of Expected Return

5.1 State-Value Function Form

Using the state-value function under policy \pi_\theta:

V^{\pi_\theta}(s) = \mathbb{E}_{\pi_\theta}[G_t \mid S_t = s]

the objective can be written as:

J(\theta) = \mathbb{E}_{s_0 \sim \rho_0} \left[ V^{\pi_\theta}(s_0) \right]

This form emphasizes dependence on the initial-state distribution.


5.2 Expected Reward Over State–Action Occupancy Measure

Define the discounted state–action visitation distribution:

d^{\pi_\theta}(s,a) = \sum_{t=0}^{\infty} \gamma^t \mathbb{P}(S_t = s, A_t = a \mid \pi_\theta)

Then:

J(\theta) = \sum_{s,a} d^{\pi_\theta}(s,a)\, r(s,a)

This formulation is critical in theoretical analysis, connecting policy optimization to occupancy measures and fixed-point equations.


6. Why Expectation Is Essential (Theoretical Justification)

6.1 Randomness of Trajectories

Because both policy and environment are stochastic, any single trajectory is an unreliable performance estimate. The expectation ensures:

  • robustness to randomness

  • well-defined gradients

  • convergence in the limit of infinite samples

Monte Carlo Estimation Theory in Vanilla Policy Gradient

Monte Carlo (MC) estimation plays a central theoretical role in Vanilla Policy Gradient (VPG). Since the true expected return and its gradient are analytically intractable in most reinforcement learning problems, VPG relies on sample-based estimates obtained from complete trajectories. Understanding Monte Carlo estimation is therefore essential for grasping both the correctness and the limitations of VPG.

This section develops the theory behind Monte Carlo estimation as used in policy gradient methods.


1. Why Monte Carlo Estimation Is Necessary in VPG

The objective in VPG is the expected return:

 

J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[G_0(\tau)]

 

Computing this expectation exactly would require:

  • Full knowledge of the environment dynamics P(s' \mid s,a)

  • Summation or integration over all possible trajectories

In realistic MDPs, the trajectory space is exponentially large. Therefore, exact computation is infeasible. Monte Carlo estimation provides a principled way to approximate expectations using samples drawn from the true trajectory distribution.


2. Monte Carlo Estimation: General Theory

Let X be a random variable with distribution p(x), and let f(X) be a function of interest. The expectation

\mathbb{E}[f(X)] = \int f(x)\, p(x)\, dx

can be approximated using N samples \{x_i\}_{i=1}^N:

\hat{\mu}_N = \frac{1}{N} \sum_{i=1}^N f(x_i)

 

Fundamental Properties

  1. Unbiasedness

\mathbb{E}[\hat{\mu}_N] = \mathbb{E}[f(X)]

  2. Consistency (Law of Large Numbers)

\hat{\mu}_N \xrightarrow{a.s.} \mathbb{E}[f(X)] \quad \text{as } N \to \infty

  3. Variance

\mathrm{Var}(\hat{\mu}_N) = \frac{1}{N}\mathrm{Var}(f(X))

 

These properties directly carry over to trajectory-based estimation in VPG.
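
A quick numerical illustration of these three properties follows (a toy example of my own, not from the text): with X \sim \mathrm{Uniform}(0,1) and f(x) = x^2 we have \mathbb{E}[f(X)] = 1/3, and the sample mean approaches it as N grows.

```python
# Monte Carlo estimate of E[f(X)] for f(x) = x^2, X ~ Uniform(0, 1); true value 1/3.
# Illustrates unbiasedness/consistency: the estimate converges as N grows (LLN).
import numpy as np

rng = np.random.default_rng(1)
for N in (10, 1_000, 100_000):
    x = rng.random(N)                 # N i.i.d. samples from Uniform(0, 1)
    mu_hat = (x ** 2).mean()          # (1/N) * sum_i f(x_i)
    print(N, mu_hat)                  # approaches 1/3 as N increases
```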


3. Monte Carlo Estimation of Expected Return

In VPG, the random variable is the trajectory \tau, and the function of interest is the return G_0(\tau).

Given N trajectories sampled under policy \pi_\theta:

\{\tau^{(i)}\}_{i=1}^N \sim p_\theta(\tau)

the Monte Carlo estimator of the expected return is:

\hat{J}_N(\theta) = \frac{1}{N} \sum_{i=1}^N G_0(\tau^{(i)})

 

Theoretical Properties

  • Unbiased:

\mathbb{E}[\hat{J}_N(\theta)] = J(\theta)

  • Consistent:

\hat{J}_N(\theta) \to J(\theta) \quad \text{as } N \to \infty

Thus, Monte Carlo estimation provides a valid estimator of the policy objective.
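
A minimal sketch of this estimator is shown below (my own, under the simplifying assumption that each sampled trajectory is stored simply as its list of per-step rewards).

```python
# Monte Carlo estimate J_hat_N: average the discounted return G_0 over N trajectories.
# The list-of-rewards representation is an assumption chosen only for illustration.
def expected_return_estimate(trajectories, gamma):
    """Average discounted return G_0 over the sampled trajectories."""
    returns = [sum(gamma ** t * r for t, r in enumerate(rewards))
               for rewards in trajectories]
    return sum(returns) / len(returns)

# Two hand-made reward sequences, purely illustrative:
print(expected_return_estimate([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]], gamma=0.9))
```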


4. Monte Carlo Estimation of the Policy Gradient

The policy gradient is given by:

 

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]

Since the expectation is intractable, VPG uses a Monte Carlo estimator:

\widehat{\nabla_\theta J} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T_i-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, G_t^{(i)}

This estimator is known as the REINFORCE estimator.


5. Unbiasedness of the Monte Carlo Policy Gradient Estimator

A critical theoretical result is that the Monte Carlo estimator of the policy gradient is unbiased:

 

\mathbb{E}[\widehat{\nabla_\theta J}] = \nabla_\theta J(\theta)

This follows from:

  • Linearity of expectation

  • Correct sampling from p_\theta(\tau)

  • The likelihood-ratio identity

Unbiasedness ensures that, in expectation, gradient ascent steps move the policy parameters in a direction that improves expected return.
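
For a concrete sense of the per-timestep term being averaged, the score \nabla_\theta \log \pi_\theta(a \mid s) has a closed form for a linear-softmax policy. The sketch below (my own; the linear parameterization is an assumption) computes it directly.

```python
# Closed-form score grad_theta log pi_theta(a | s) for a linear-softmax policy
# (assumed parameterization: theta has one weight row per action).
# Row a' of the result equals (1{a' = a} - pi_theta(a' | s)) * s.
import numpy as np

def softmax_policy(theta, s):
    logits = theta @ s
    logits = logits - logits.max()
    p = np.exp(logits)
    return p / p.sum()

def score(theta, s, a):
    p = softmax_policy(theta, s)
    one_hot = np.zeros_like(p)
    one_hot[a] = 1.0
    return np.outer(one_hot - p, s)      # same shape as theta

theta = np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]])   # 3 actions, 2 features
s = np.array([1.0, -1.0])
print(score(theta, s, a=1))
```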

REINFORCE Algorithm as a Theoretical Instantiation of Vanilla Policy Gradient (VPG)

The REINFORCE algorithm is the earliest and most canonical realization of Vanilla Policy Gradient. From a theoretical standpoint, REINFORCE is not a separate algorithmic family but rather the direct, explicit instantiation of the policy gradient theorem using Monte Carlo estimation and likelihood-ratio gradients. It embodies the purest form of policy-gradient learning—free from approximations such as bootstrapping, trust regions, or critics.

This section develops REINFORCE as a mathematical consequence of VPG theory rather than as a procedural algorithm.


1. Conceptual Role of REINFORCE in Policy Gradient Theory

At a high level:

REINFORCE = Policy Gradient Theorem + Monte Carlo Estimation

REINFORCE operationalizes the theoretical gradient:

 

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]

 

by replacing the expectation with empirical averages over sampled trajectories.

Thus, REINFORCE is the minimal algorithmic embodiment of VPG.


2. Likelihood-Ratio Gradient: Theoretical Foundation

The core theoretical mechanism underlying REINFORCE is the likelihood-ratio (score function) identity:

 

\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)

 

This identity allows gradients of expectations to be written as expectations of gradients—without differentiating through the stochastic process itself.
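
Before applying the identity to trajectories, a quick finite-difference check can make it concrete (my own toy setup: a one-parameter Bernoulli distribution, not an example from the article).

```python
# Finite-difference check of grad p_theta(x) = p_theta(x) * grad log p_theta(x)
# on a one-parameter Bernoulli distribution p_theta(1) = sigmoid(theta) (toy choice).
import math

def p(theta, x):
    s = 1.0 / (1.0 + math.exp(-theta))
    return s if x == 1 else 1.0 - s

theta, x, eps = 0.3, 1, 1e-6
grad_p = (p(theta + eps, x) - p(theta - eps, x)) / (2 * eps)
grad_log_p = (math.log(p(theta + eps, x)) - math.log(p(theta - eps, x))) / (2 * eps)
print(abs(grad_p - p(theta, x) * grad_log_p) < 1e-8)   # True: identity holds numerically
```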

Applied to trajectories:

 

\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, G_0(\tau)\, d\tau = \mathbb{E}_{\tau} \left[ \nabla_\theta \log p_\theta(\tau)\, G_0(\tau) \right]

 


3. Factorization of the Trajectory Log-Likelihood

The trajectory distribution factorizes as:

 

p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)

 

Taking the logarithm:

 

\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1} \mid s_t, a_t)

 

Since only the policy depends on \theta:

\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)

 

This step is the key theoretical simplification that enables policy gradient methods.


4. REINFORCE Gradient Estimator

Substituting into the gradient expression yields:

 

\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_0 \right]

 

Using the causality principle, rewards before time t do not depend on a_t, allowing the return to be truncated:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]

 

This is the theoretical REINFORCE gradient.


5. Monte Carlo Instantiation

Given N sampled trajectories \{\tau^{(i)}\}, the REINFORCE estimator is:

\widehat{\nabla_\theta J} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T_i-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, G_t^{(i)}

Theoretical Properties

  • Unbiased:

\mathbb{E}[\widehat{\nabla_\theta J}] = \nabla_\theta J(\theta)

  • Consistent: converges to the true gradient as N \to \infty

     

Thus, REINFORCE exactly matches the theoretical objective of VPG in expectation.
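
As an end-to-end illustration, the sketch below (entirely my own construction: a two-armed bandit with Bernoulli rewards, i.e., single-step episodes, which is not an example from the article) applies this estimator together with a plain gradient-ascent update.

```python
# REINFORCE on a two-armed Bernoulli bandit (single-step episodes, so G_t = r_0).
# All quantities here are illustrative assumptions, not values from the article.
import numpy as np

rng = np.random.default_rng(0)
p_win = np.array([0.2, 0.8])              # hidden reward probabilities of the two arms
theta = np.zeros(2)                       # softmax logits: pi_theta = softmax(theta)
alpha, N = 0.1, 10                        # step size and episodes per update

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

for _ in range(2000):
    probs = pi(theta)
    grad = np.zeros(2)
    for _ in range(N):
        a = rng.choice(2, p=probs)                    # sample action from pi_theta
        G = float(rng.random() < p_win[a])            # return of the one-step episode
        score = np.eye(2)[a] - probs                  # grad_theta log pi_theta(a)
        grad += score * G
    theta += alpha * grad / N                         # stochastic gradient ascent on J
print(pi(theta))                                      # mass concentrates on the better arm
```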

Case-Based Theoretical Analysis (Vanilla Policy Gradient & REINFORCE)

A case-based theoretical analysis examines Vanilla Policy Gradient not by implementation details, but by analyzing how its mathematical structure behaves under different theoretical regimes. Each case isolates one assumption or structural property of the Markov Decision Process (MDP) and studies its consequences for gradient correctness, variance, convergence, and expressiveness.


Case 1: Finite-Horizon, Episodic MDP

Setup

  • Finite horizon T < \infty

  • Episodes terminate naturally

  • Return:

G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k

Theoretical Implications

  1. Well-Defined Objective

J(\theta) = \mathbb{E}[G_0]

is finite without requiring \gamma < 1.

  2. Unbiased Monte Carlo Gradient

\nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) G_t \right]

  3. Credit Assignment Clarity
    Each action influences only future rewards, enabling strict causal decomposition.

Conclusion

This is the ideal theoretical setting for Vanilla Policy Gradient—minimal assumptions, exact gradients, and clean convergence analysis.


Case 2: Infinite-Horizon, Discounted MDP

Setup

  • Infinite horizon T \to \infty

  • Discount factor \gamma \in (0,1)

Theoretical Challenges

  1. Convergence of Return

\sum_{t=0}^{\infty} \gamma^t r_t < \infty

requires bounded rewards.

  2. Interchanging Gradient and Expectation
    Requires regularity conditions (dominated convergence theorem).

  3. State Distribution Shift
    The state distribution d^{\pi_\theta}(s) depends on \theta, complicating analysis.

Result

Policy Gradient Theorem still holds, but proofs become measure-theoretic.

Conclusion

VPG remains valid, but theoretical guarantees rely on stronger assumptions.


Case 3: Deterministic Environment, Stochastic Policy

Setup

  • Transitions P(s' \mid s,a) deterministic

  • Policy \pi_\theta(a \mid s) stochastic

Key Insight

All randomness arises from the policy:

\mathrm{Var}(\nabla_\theta J) \propto \mathrm{Var}_\pi(G_t)

Implications

  • Exploration is policy-driven

  • Gradient estimator remains unbiased

  • Variance remains high if policy entropy is large

Conclusion

Stochastic policies are sufficient for learning even in deterministic worlds.


Case 4: Stochastic Environment, Deterministic Policy (Failure Case)

Setup

  • Deterministic policy \mu_\theta(s)

Theoretical Breakdown

\nabla_\theta \log \pi_\theta(a \mid s) \quad \text{undefined}

No likelihood-ratio gradient exists.

Consequence

  • VPG theory collapses

  • Requires alternative frameworks (Deterministic Policy Gradient)

Conclusion

Stochasticity of the policy is a theoretical necessity, not a design choice.


Case 5: Sparse-Reward MDPs

Setup

  • Rewards r_t = 0 for most timesteps

  • Terminal reward only

Theoretical Effect

G_t \approx 0 \quad \forall t \ll T

Gradient Degeneracy

\nabla_\theta J \approx 0

leading to:

  • Slow learning

  • High variance

  • Poor signal-to-noise ratio

Conclusion

VPG is theoretically correct but inefficient under sparse rewards.


Case 6: Long-Horizon Tasks

Setup

  • Large T or weak discounting

Variance Explosion

\mathrm{Var}(G_t) \text{ grows rapidly (in the worst case exponentially) with } T

Theoretical Result

Gradient variance grows faster than learning-rate decay can compensate for.

Conclusion

Explains empirical instability of VPG in robotics and control tasks.
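
A tiny simulation (my own toy setup: i.i.d. unit-variance rewards with no discounting, which is a best case) already shows the return variance growing with the horizon; correlated rewards and the score-weighted sums in the gradient estimator make the growth considerably worse.

```python
# Variance of the Monte Carlo return G_0 as a function of the horizon T, for i.i.d.
# unit-variance rewards and gamma = 1 (an assumed best case: growth is linear here).
import numpy as np

rng = np.random.default_rng(3)
for T in (10, 100, 1000):
    G0 = rng.standard_normal((5000, T)).sum(axis=1)   # 5000 sampled returns of length T
    print(T, round(G0.var(), 1))                      # roughly T
```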

Baseline Theory and Variance Reduction in Vanilla Policy Gradient

In Vanilla Policy Gradient and its canonical instantiation REINFORCE, variance—not bias—is the central theoretical obstacle. While Monte Carlo policy gradient estimators are unbiased, their variance can be prohibitively large, leading to slow convergence and unstable learning. Baseline theory provides a mathematically rigorous mechanism to reduce variance without altering the expected gradient.

This section develops baseline methods from first principles, emphasizing why they work theoretically, not how they are implemented.


1. Variance in Policy Gradient Estimation: The Core Problem

The policy gradient is:

\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]

Although unbiased, the estimator:

g = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t

has variance:

\mathrm{Var}(g) = \mathbb{E}\big[\|g\|^2\big] - \|\nabla_\theta J(\theta)\|^2

High variance arises because:

  • G_t aggregates many random future rewards

  • \nabla_\theta \log \pi_\theta(a_t \mid s_t) can be large

  • Long horizons amplify noise


2. Baseline Concept: Theoretical Definition

A baseline is any function b_t that does not depend on the action a_t. The baseline-modified gradient estimator is:

\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \left(G_t - b_t\right) \right]

The critical theoretical question is:

Why does subtracting b_t not change the expected gradient?


3. Baseline Unbiasedness Theorem

Theorem

For any baseline b_t independent of a_t:

\mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b_t \right] = 0

Proof

\begin{aligned}
\mathbb{E}_{a_t \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b_t \right]
&= b_t \sum_{a_t} \pi_\theta(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
&= b_t \sum_{a_t} \nabla_\theta \pi_\theta(a_t \mid s_t) \\
&= b_t\, \nabla_\theta \sum_{a_t} \pi_\theta(a_t \mid s_t) \\
&= b_t\, \nabla_\theta (1) = 0
\end{aligned}

Thus, subtracting a baseline preserves unbiasedness.


4. Variance Reduction Mechanism (Intuition + Theory)

Variance depends on the magnitude of the term multiplied by the score function:

\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(G_t - b_t)

If b_t approximates \mathbb{E}[G_t \mid s_t], then:

\mathrm{Var}(G_t - b_t) \ll \mathrm{Var}(G_t)

Hence, variance of the gradient estimator is reduced.
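
A small numerical check illustrates both facts at once (toy numbers of my own: one state, two actions, noisy returns): the mean of the baseline-corrected term is unchanged while its variance drops.

```python
# One-state, two-action toy check (all numbers are illustrative assumptions) that a
# baseline near E[G_t | s_t] leaves the estimator's mean intact and reduces its variance.
import numpy as np

rng = np.random.default_rng(2)
probs = np.array([0.3, 0.7])                 # fixed pi(. | s)
means = np.array([1.0, 2.0])                 # E[G | s, a] for the two actions

def mean_and_var(baseline, n=200_000):
    a = rng.choice(2, p=probs, size=n)
    G = means[a] + rng.standard_normal(n)      # noisy sampled returns
    score = (a == 1).astype(float) - probs[1]  # d log pi(a) / d theta_1 for a softmax
    g = score * (G - baseline)
    return g.mean(), g.var()

print(mean_and_var(baseline=0.0))            # mean ~0.21, larger variance
print(mean_and_var(baseline=probs @ means))  # same mean, noticeably smaller variance
```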


5. Optimal Baseline: Theoretical Derivation

Consider the single-timestep gradient estimator:

g_t = \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(G_t - b(s_t))

The variance-minimizing baseline satisfies:

b^\star(s_t) = \frac{ \mathbb{E} \left[ G_t\, \|\nabla_\theta \log \pi_\theta(a_t \mid s_t)\|^2 \mid s_t \right] }{ \mathbb{E} \left[ \|\nabla_\theta \log \pi_\theta(a_t \mid s_t)\|^2 \mid s_t \right] }

This is the theoretically optimal baseline in the mean-squared sense.


6. State-Value Function as a Baseline

A common and theoretically grounded choice is:

b(s_t) = V^{\pi}(s_t)

Then:

G_t - V^{\pi}(s_t) = A^{\pi}(s_t, a_t)

which is the advantage function.

Theoretical Interpretation

  • Removes predictable reward component

  • Leaves only action-dependent deviation

  • Minimizes variance under mild assumptions

Advantage Function: A Purely Theoretical Interpretation

The advantage function occupies a central conceptual position in modern policy-gradient theory. Although it is often introduced operationally as a variance-reduction tool, its true importance lies deeper: the advantage function provides a relative, state-conditioned measure of action quality, isolating the causal contribution of an action beyond what is already expected from the state itself.

This section develops the advantage function purely from theory, without reference to implementation or algorithms.


1. Motivation: Absolute vs Relative Action Evaluation

In a Markov Decision Process, the return following an action depends on two factors:

  1. The state in which the action is taken

  2. The choice of action itself

If a state is inherently good, every action taken in that state may lead to high return. Conversely, in a poor state, even the best action may yield a low return.

Thus, evaluating an action by its absolute return confounds state quality with action quality.

The advantage function resolves this confounding by asking:
“How much better (or worse) is this action compared to the average action in this state?”


2. Formal Definitions of Value Functions

Let \pi be a fixed stochastic policy.

State-Value Function

V^\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right]

This is the expected return starting from state s and thereafter following \pi.


Action-Value Function

Q^\pi(s,a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right]

This represents the expected return after taking action a in state s, then following \pi.


3. Definition of the Advantage Function

The advantage function is defined as:

A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)

This difference removes the state-dependent baseline V^\pi(s), leaving only the relative benefit of choosing action a.


4. Zero-Mean Property (Fundamental Theorem)

Theorem

For any state s:

\mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ A^\pi(s,a) \right] = 0

Proof

\begin{aligned}
\mathbb{E}_{a \sim \pi} [A^\pi(s,a)] &= \sum_a \pi(a \mid s) \left(Q^\pi(s,a) - V^\pi(s)\right) \\
&= \sum_a \pi(a \mid s)\, Q^\pi(s,a) - V^\pi(s) \sum_a \pi(a \mid s) \\
&= V^\pi(s) - V^\pi(s) = 0
\end{aligned}

This property is central: advantages measure deviations, not absolute value.


5. Advantage as a Centered Action-Value Function

From a functional perspective:

A^\pi(s,a) = Q^\pi(s,a) - \mathbb{E}_{a' \sim \pi}[Q^\pi(s,a')]

Thus, the advantage function is a mean-centered version of Q^\pi over the action distribution.

This centering is what makes advantage-weighted gradients:

  • Lower variance

  • Better conditioned

  • More stable


6. Advantage and Policy Gradient Theory

The policy gradient theorem can be written as:

\nabla_\theta J(\theta) = \mathbb{E}_{s,a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \right]

Substituting:

Q^{\pi_\theta}(s,a) = A^{\pi_\theta}(s,a) + V^{\pi_\theta}(s)

and using the baseline property:

\mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, V^{\pi_\theta}(s) \right] = 0

yields:

\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s,a) \right]

This shows that only the advantage matters for policy improvement.
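
The following check (a toy single-state example of my own, with arbitrary Q-values) verifies this numerically: weighting the score by the advantage instead of Q leaves the expected gradient unchanged, because the V term integrates to zero.

```python
# Toy numerical check (illustrative values only) that the Q-weighted and the
# advantage-weighted policy gradients coincide at a single state.
import numpy as np

probs = np.array([0.2, 0.5, 0.3])            # pi_theta(. | s) at one state
Q = np.array([1.0, 3.0, -2.0])               # Q^pi(s, a)
V = probs @ Q                                # V^pi(s) = E_a[Q^pi(s, a)]
A = Q - V                                    # advantage; E_a[A] = 0

scores = np.eye(3) - probs                   # row a: grad of log pi(a), one softmax logit per action
grad_Q = (probs[:, None] * scores * Q[:, None]).sum(axis=0)
grad_A = (probs[:, None] * scores * A[:, None]).sum(axis=0)
print(np.allclose(grad_Q, grad_A))           # True
```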

Vanilla Policy Gradient as an Actor-Only Method (Theoretical Perspective)

Vanilla Policy Gradient occupies a unique position in reinforcement learning theory: it is a pure actor-only optimization method. Unlike Actor–Critic architectures, which decompose learning into separate policy (actor) and value (critic) components, Vanilla Policy Gradient operates exclusively on the policy itself, without introducing any auxiliary value-function approximation as part of the learning dynamics.

This section presents a strictly theoretical interpretation of VPG as an actor-only method, clarifying what this means mathematically, why it is possible, and what fundamental limitations arise from this design choice.


1. Definition of an Actor-Only Method (Theory)

A learning algorithm is said to be actor-only if:

  1. The policy parameters \theta are the only optimized variables

  2. The learning objective is expressed directly in terms of the policy

  3. No separate parametric estimator of V^\pi, Q^\pi, or A^\pi is required for correctness

Vanilla Policy Gradient satisfies all three conditions.


2. Direct Optimization of the Policy Objective

VPG optimizes the expected return:

 

J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[G_0]

and updates \theta via stochastic gradient ascent:

\theta_{k+1} = \theta_k + \alpha_k \nabla_\theta J(\theta_k)

 

No auxiliary optimization problem is introduced. The policy itself is the sole object of optimization.
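
To make "the policy itself is the sole object of optimization" concrete, here is a schematic actor-only loop (my own sketch; the tiny two-step environment and the linear-softmax policy are invented for illustration and are not part of the article).

```python
# Schematic actor-only VPG loop: sample episodes with the current policy, form
# reward-to-go returns G_t, accumulate score * G_t, and take one ascent step.
# The environment and parameterization below are illustrative assumptions.
import numpy as np

class ToyChainEnv:
    """Two-step episodic environment invented solely so the loop is runnable."""
    def reset(self):
        self.t = 0
        return np.array([1.0, 0.0])
    def step(self, a):
        self.t += 1
        reward = 1.0 if a == 1 else 0.0          # action 1 is always better
        return np.array([0.0, 1.0]), reward, self.t >= 2

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def vpg_update(theta, env, rng, gamma=0.99, alpha=0.05, episodes=16):
    grad = np.zeros_like(theta)
    for _ in range(episodes):
        s, done, steps = env.reset(), False, []
        while not done:
            p = softmax(theta @ s)
            a = rng.choice(len(p), p=p)
            s_next, r, done = env.step(a)
            steps.append((s, a, r, p))
            s = s_next
        G = 0.0
        for s_t, a_t, r_t, p_t in reversed(steps):   # reward-to-go returns
            G = r_t + gamma * G
            score = np.outer(np.eye(len(p_t))[a_t] - p_t, s_t)
            grad += score * G
    return theta + alpha * grad / episodes           # one stochastic ascent step

rng = np.random.default_rng(0)
env, theta = ToyChainEnv(), np.zeros((2, 2))         # 2 actions, 2 state features
for _ in range(300):
    theta = vpg_update(theta, env, rng)
print(softmax(theta @ np.array([1.0, 0.0])))         # prefers action 1 in the start state
```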


3. Policy Gradient Without a Critic

The policy gradient theorem states:

 

\nabla_\theta J(\theta) = \mathbb{E}_{s,a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \right]

 

In VPG:

  • Q^{\pi_\theta}(s,a) is not learned

  • It is replaced by Monte Carlo returns G_t

Thus, the gradient estimator uses raw trajectory data, not a learned critic.


4. Monte Carlo Returns as Implicit Value Estimates

Although no critic is explicitly present, Vanilla Policy Gradient implicitly estimates values via:

 

Q^{\pi_\theta}(s_t,a_t) \approx G_t

This approximation is:

  • Unbiased

  • Consistent

  • High variance

Crucially, it does not introduce an independent parametric object—returns are computed directly from observed rewards.


5. Actor-Only Nature and Exactness of the Gradient

Because VPG uses full returns, the gradient estimator satisfies:

 

\mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right] = \nabla_\theta J(\theta)

 

This means:

  • No approximation error is introduced by a critic

  • No bootstrapping bias exists

  • The gradient is exact in expectation

This property is unique to actor-only Monte Carlo methods.


6. Baselines Do Not Create a Critic (Theoretical Clarification)

Even when baselines are introduced:

 

\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, (G_t - b(s_t)) \right]

 

Vanilla Policy Gradient remains actor-only as long as:

  • b(s_t) is not learned as a separate optimization target

  • The baseline does not define an independent objective

The baseline modifies the estimator, not the optimization problem.

Frequently Asked Questions (FAQs): Vanilla Policy Gradient

Q1. What is Vanilla Policy Gradient (VPG) in simple theoretical terms?

Vanilla Policy Gradient is a reinforcement learning method that directly optimizes a stochastic policy by performing gradient ascent on the expected return. The term “vanilla” indicates that it uses the pure policy gradient theorem with Monte Carlo estimation, without critics, trust regions, clipping, or second-order corrections.


Q2. Why is Vanilla Policy Gradient called an actor-only method?

Vanilla Policy Gradient is called actor-only because it optimizes only the policy parameters. It does not learn or maintain a separate value function (critic). All learning signals come directly from trajectory returns, making the policy the sole object of optimization.


Q3. Does VPG require a model of the environment?

No. Vanilla Policy Gradient is a model-free method. The policy gradient theorem allows gradients to be computed without differentiating through environment dynamics, relying only on sampled trajectories.


Q4. Why must the policy be stochastic in VPG?

The theoretical foundation of Vanilla Policy Gradient relies on the likelihood-ratio gradient:

 

\nabla_\theta \log \pi_\theta(a \mid s)

 

This expression is only well-defined for stochastic policies. Deterministic policies break the mathematical assumptions of Vanilla Policy Gradient and require a different framework (e.g., deterministic policy gradients).


Q5. What exactly is optimized in Vanilla Policy Gradient?

VPG optimizes the expected return:

 

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G_0]

 

This expectation is taken over all trajectories induced by the policy. Every policy update aims to increase this quantity.

Conclusions

Vanilla Policy Gradient represents one of the most conceptually important turning points in the theoretical development of reinforcement learning. Its significance does not lie in empirical efficiency or practical dominance, but in the clarity of its mathematical formulation. Vanilla Policy Gradient is the first framework that cleanly reframes reinforcement learning as a direct, differentiable optimization problem over stochastic policies, eliminating the need for argmax-based policy extraction and value-function dominance.

From a theoretical standpoint, VPG provides an unbiased gradient estimator of the expected return by operating directly on the trajectory distribution induced by the policy. The Policy Gradient Theorem demonstrates a profound result: environment dynamics vanish from the gradient, allowing policy optimization without an explicit model of the environment. This insight alone reshaped how researchers conceptualize learning in unknown and continuous domains.

However, the same properties that make Vanilla Policy Gradient theoretically elegant also expose its fundamental weaknesses. The reliance on Monte Carlo policy gradient estimation leads to severe variance, especially in long-horizon, sparse-reward, or high-dimensional settings. Credit assignment remains coarse, as all actions in a trajectory are reinforced equally by the total return. Consequently, convergence is slow, unstable, and sample-inefficient despite mathematical correctness.

VPG’s role as an actor-only reinforcement learning method highlights the core bias–variance tradeoff that governs all policy optimization techniques. By avoiding critics and bootstrapping, VPG achieves zero bias at the cost of maximal variance. This tradeoff is not a flaw but a theoretical baseline against which all later methods—Actor-Critic, Natural Policy Gradient, TRPO, and PPO—can be understood as structured compromises.

Historically, Vanilla Policy Gradient is the conceptual foundation upon which modern policy optimization is built. Nearly every advanced policy gradient method can be interpreted as a variance-reduced, geometry-aware, or constraint-stabilized extension of VPG. For this reason, it remains indispensable in graduate-level education and theoretical research, even if it is rarely deployed in real-world systems.

In summary, Vanilla Policy Gradient is not a practical algorithm to outperform others—it is a theoretical reference point. Mastering its assumptions, derivations, and limitations is essential for anyone seeking a deep understanding of reinforcement learning theory. Without VPG, modern stochastic policy optimization would lack both its mathematical grounding and its conceptual coherence.
