Introduction
Vanilla Policy Gradient (VPG) is one of the most fundamental and conceptually pure algorithms in policy-based Reinforcement Learning (RL). It represents the earliest formalization of directly optimizing a parameterized policy using gradient ascent on expected return. Unlike value-based methods—which first learn value functions and then derive policies indirectly—VPG operates by explicitly modeling the policy and updating its parameters in the direction that improves long-term performance.
At its core, Vanilla Policy Gradient answers a central question in reinforcement learning:
How can an agent adjust the parameters of a stochastic policy so as to maximize the expected cumulative reward obtained from interacting with an environment?
To address this, VPG leverages tools from probability theory, stochastic optimization, and differential calculus, culminating in the celebrated Policy Gradient Theorem. This theorem provides a mathematically rigorous expression for the gradient of the expected return with respect to policy parameters—without requiring gradients of the environment dynamics.
This property is crucial, as it allows VPG to be applied to unknown, non-differentiable, and stochastic environments, which are common in real-world decision-making problems.
Formal Definition of a Stochastic Policy
In Reinforcement Learning (RL), a policy defines the agent’s decision-making mechanism—that is, how actions are selected based on observed states. When actions are chosen probabilistically rather than deterministically, the policy is referred to as a stochastic policy. Stochastic policies are foundational to policy-gradient methods such as Vanilla Policy Gradient (VPG), because they enable differentiable optimization and principled exploration.
1. Policy in the Markov Decision Process Framework
Formally, reinforcement learning problems are modeled as a Markov Decision Process (MDP), defined by the tuple:
$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$
where:
$\mathcal{S}$ is the state space
$\mathcal{A}$ is the action space
$P(s' \mid s, a)$ is the state-transition probability
$R(s, a)$ is the reward function
$\gamma \in (0, 1]$ is the discount factor
Within this framework, a policy governs the agent’s interaction with the environment.
2. Formal Definition of a Stochastic Policy
A stochastic policy is defined as a conditional probability distribution over actions given a state:
$$\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$$
where:
$s \in \mathcal{S}$ is the current state
$a \in \mathcal{A}$ is a possible action
$\pi(a \mid s)$ denotes the probability of selecting action $a$ in state $s$
Key Properties
For every state $s \in \mathcal{S}$, the policy satisfies:
Non-negativity: $\pi(a \mid s) \ge 0$ for all $a \in \mathcal{A}$
Normalization: $\sum_{a \in \mathcal{A}} \pi(a \mid s) = 1$ (or $\int_{\mathcal{A}} \pi(a \mid s)\, da = 1$ for continuous action spaces)
These conditions ensure that $\pi(\cdot \mid s)$ is a valid probability distribution.
3. Parameterized Stochastic Policies
In policy-gradient methods, policies are typically parameterized by a vector $\theta \in \mathbb{R}^d$, yielding a family of policies:
$$\left\{\pi_\theta(a \mid s) : \theta \in \mathbb{R}^d\right\}$$
The objective of learning is to adjust $\theta$ so as to maximize expected cumulative reward.
Examples of Parameterizations
Discrete actions: Softmax policy
Continuous actions: Gaussian policy
These parameterizations are chosen to be smooth and differentiable with respect to $\theta$, a critical requirement for gradient-based optimization.
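To make these parameterizations concrete, the sketch below implements a linear-softmax policy for discrete actions and a Gaussian policy for continuous actions, each exposing the log-probability that later gradient computations rely on. The feature matrix, parameter shapes, and fixed standard deviation are illustrative assumptions, not details given in the text.

```python
import numpy as np

def softmax_probs(theta, features):
    """Discrete actions: pi_theta(a|s) proportional to exp(theta . phi(s, a)).
    `features` holds one feature vector phi(s, a) per action, shape (num_actions, d)."""
    logits = features @ theta
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def softmax_log_prob(theta, features, a):
    """log pi_theta(a|s): the quantity whose gradient drives policy-gradient updates."""
    return np.log(softmax_probs(theta, features)[a])

def gaussian_sample_and_log_prob(theta, state, rng, sigma=1.0):
    """Continuous actions: pi_theta(a|s) = N(a; mu_theta(s), sigma^2),
    with a linear mean mu_theta(s) = theta . s (an illustrative choice)."""
    mu = theta @ state
    a = rng.normal(mu, sigma)
    log_prob = -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    return a, log_prob

# Toy usage with arbitrary shapes: 3 discrete actions, 4-dimensional features.
rng = np.random.default_rng(0)
theta = rng.normal(size=4)
features = rng.normal(size=(3, 4))
print(softmax_probs(theta, features).sum())            # ~1.0: a valid distribution
print(gaussian_sample_and_log_prob(theta, rng.normal(size=4), rng))
```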
4. Stochastic Policy as a Randomized Decision Rule
From a theoretical standpoint, a stochastic policy can be interpreted as a randomized decision rule:
$$\pi : \mathcal{S} \to \Delta(\mathcal{A})$$
where $\Delta(\mathcal{A})$ denotes the probability simplex over the action space.
This formulation highlights an important conceptual point:
The policy does not choose an action directly—it defines a distribution from which actions are sampled.
This probabilistic structure induces randomness in the agent’s behavior, even when the environment itself is deterministic.
5. Why Stochastic Policies Are Essential in Policy Gradient Theory
Stochastic policies play a central theoretical role in VPG for several reasons:
5.1 Differentiability of the Objective
The expected return objective:
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right]$$
depends on $\theta$ only through the policy distribution. The likelihood-ratio (log-derivative) trick used in policy gradients:
$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$$
is well-defined only when the policy assigns non-zero probability mass smoothly across actions.
5.2 Exploration Guarantee
Unlike deterministic policies, stochastic policies naturally ensure exploration:
$$\pi_\theta(a \mid s) > 0 \quad \text{for all } a \in \mathcal{A}$$
This is critical for convergence guarantees in theoretical analyses of policy-gradient algorithms.
5.3 Avoiding Non-Differentiability
Deterministic policies often lead to non-differentiable mappings from parameters to actions. In contrast, stochastic policies maintain a smooth dependence of action probabilities on $\theta$, enabling unbiased gradient estimation.
6. Relationship to Deterministic Policies
A deterministic policy $\mu : \mathcal{S} \to \mathcal{A}$ can be viewed as a degenerate stochastic policy:
$$\pi(a \mid s) = \mathbb{1}\{a = \mu(s)\}$$
However, such policies lack differentiability almost everywhere, which is why classical VPG relies on stochastic rather than deterministic formulations.
7. Theoretical Role in Trajectory Distributions
A stochastic policy induces a probability distribution over trajectories:
$$p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$$
where $\rho_0$ denotes the initial-state distribution. Here, the policy is the only component dependent on $\theta$. This factorization is fundamental to deriving the Policy Gradient Theorem, as it allows gradients of expected return to be expressed entirely in terms of $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
8. Summary (Theoretical Perspective)
From a theoretical standpoint, a stochastic policy is:
A probability distribution over actions conditioned on states
A differentiable, parameterized object enabling gradient-based optimization
The core mechanism through which randomness, exploration, and learning are introduced in VPG
In Vanilla Policy Gradient, the stochastic policy is not merely a design choice—it is the mathematical object that makes policy optimization tractable, analyzable, and theoretically sound.
Expected Return as an Optimization Objective
In policy-based Reinforcement Learning, and particularly in Vanilla Policy Gradient (VPG), learning is framed as a direct optimization problem. The agent does not aim to approximate value functions as an end in themselves; instead, it seeks to optimize the parameters of a policy so that long-term performance is maximized. The quantity that formalizes this notion of performance is the expected return.
From a theoretical perspective, the expected return serves as the objective functional over the space of stochastic policies.
1. Return: Cumulative Reward Along a Trajectory
Consider an MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$.
A trajectory (or episode) of length $T$ is defined as:
$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$$
The return from time step $t$ is the discounted cumulative reward:
$$G_t = \sum_{k=t}^{T-1} \gamma^{k-t}\, r_k$$
For episodic tasks, the return from the initial state is:
$$R(\tau) = G_0 = \sum_{t=0}^{T-1} \gamma^t\, r_t$$
The discount factor ensures convergence of the sum and encodes a preference for immediate rewards over distant ones.
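As a concrete illustration of these definitions, the short sketch below computes the trajectory return $R(\tau)$ and the per-step returns $G_t$ from a reward sequence; the reward values are arbitrary placeholders.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """R(tau) = sum_t gamma^t * r_t for a finite episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def rewards_to_go(rewards, gamma):
    """G_t = sum_{k >= t} gamma^(k - t) * r_k, computed by a backward recursion."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

rewards = [0.0, 1.0, 0.0, 2.0]                 # placeholder reward sequence
print(discounted_return(rewards, gamma=0.99))  # equals rewards_to_go(...)[0]
print(rewards_to_go(rewards, gamma=0.99))
```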
2. Expected Return: From Random Trajectories to Objective Function
Under a stochastic policy $\pi_\theta$, trajectories are random variables due to:
stochasticity in the policy
stochasticity in environment transitions
Therefore, performance cannot be measured by a single trajectory. Instead, it is defined in expectation.
Definition (Expected Return)
The expected return of a parameterized policy $\pi_\theta$ is:
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right]$$
where:
$p_\theta(\tau)$ is the probability distribution over trajectories induced by $\pi_\theta$
$R(\tau)$ is the return associated with trajectory $\tau$
This expectation transforms a stochastic interaction process into a deterministic optimization objective.
3. Trajectory Distribution Induced by a Policy
The trajectory distribution factorizes as:
$$p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$$
Key theoretical insight:
The policy $\pi_\theta(a_t \mid s_t)$ is the only component of $p_\theta(\tau)$ that depends on $\theta$.
This property is fundamental—it enables gradient-based optimization without requiring knowledge of environment dynamics.
4. Optimization Problem Formulation
The learning objective in Vanilla Policy Gradient is formally written as:
$$\theta^\ast = \arg\max_{\theta}\, J(\theta) = \arg\max_{\theta}\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right]$$
This is a stochastic optimization problem over a high-dimensional, non-convex objective landscape.
Notably:
$J(\theta)$ is generally non-linear and non-convex
Closed-form solutions are almost never available
Optimization must rely on Monte Carlo gradient estimates
5. Alternative Equivalent Forms of Expected Return
5.1 State-Value Function Form
Using the state-value function under policy $\pi_\theta$:
$$V^{\pi_\theta}(s) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$
the objective can be written as:
$$J(\theta) = \mathbb{E}_{s_0 \sim \rho_0}\left[V^{\pi_\theta}(s_0)\right]$$
This form emphasizes dependence on the initial-state distribution $\rho_0$.
5.2 Expected Reward Over State–Action Occupancy Measure
Define the discounted state–action visitation distribution:
$$d^{\pi_\theta}(s, a) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t\, \Pr(s_t = s,\, a_t = a \mid \pi_\theta)$$
Then:
$$J(\theta) = \frac{1}{1 - \gamma}\, \mathbb{E}_{(s, a) \sim d^{\pi_\theta}}\left[R(s, a)\right]$$
This formulation is critical in theoretical analysis, connecting policy optimization to occupancy measures and fixed-point equations.
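The equivalence between the trajectory form and the occupancy-measure form follows from exchanging the sum over time with the expectation (valid under bounded rewards); a sketch of the standard manipulation:
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right] = \sum_{t=0}^{\infty} \gamma^t \sum_{s, a} \Pr(s_t = s,\, a_t = a \mid \pi_\theta)\, R(s, a) = \frac{1}{1 - \gamma} \sum_{s, a} d^{\pi_\theta}(s, a)\, R(s, a)$$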
6. Why Expectation Is Essential (Theoretical Justification)
6.1 Randomness of Trajectories
Because both policy and environment are stochastic, any single trajectory is an unreliable performance estimate. The expectation ensures:
robustness to randomness
well-defined gradients
convergence in the limit of infinite samples
Monte Carlo Estimation Theory in Vanilla Policy Gradient
Monte Carlo (MC) estimation plays a central theoretical role in Vanilla Policy Gradient (VPG). Since the true expected return and its gradient are analytically intractable in most reinforcement learning problems, VPG relies on sample-based estimates obtained from complete trajectories. Understanding Monte Carlo estimation is therefore essential for grasping both the correctness and the limitations of VPG.
This section develops the theory behind Monte Carlo estimation as used in policy gradient methods.
1. Why Monte Carlo Estimation Is Necessary in VPG
The objective in VPG is the expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right]$$
Computing this expectation exactly would require:
Full knowledge of the environment dynamics
Summation or integration over all possible trajectories
In realistic MDPs, the trajectory space is exponentially large. Therefore, exact computation is infeasible. Monte Carlo estimation provides a principled way to approximate expectations using samples drawn from the true trajectory distribution.
2. Monte Carlo Estimation: General Theory
Let $X$ be a random variable with distribution $p(x)$, and let $f$ be a function of interest. The expectation:
$$\mu = \mathbb{E}_{X \sim p}\left[f(X)\right]$$
can be approximated using $N$ i.i.d. samples $x_1, \dots, x_N \sim p$:
$$\hat{\mu}_N = \frac{1}{N} \sum_{i=1}^{N} f(x_i)$$
Fundamental Properties
Unbiasedness: $\mathbb{E}\left[\hat{\mu}_N\right] = \mu$
Consistency (Law of Large Numbers): $\hat{\mu}_N \to \mu$ almost surely as $N \to \infty$
Variance: $\operatorname{Var}\left(\hat{\mu}_N\right) = \operatorname{Var}\left(f(X)\right) / N$
These properties directly carry over to trajectory-based estimation in VPG.
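A quick numerical check of these properties on a toy example ($f(X) = X^2$ with $X$ standard normal, so the true expectation is 1 and $\operatorname{Var}(f(X)) = 2$; both choices are arbitrary): the estimator stays unbiased while its variance shrinks at the $1/N$ rate.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                    # toy integrand; E[f(X)] = 1 for X ~ N(0, 1)

for N in (10, 100, 10_000):
    # Repeat the N-sample estimator many times to observe its bias and spread.
    estimates = [f(rng.normal(size=N)).mean() for _ in range(1_000)]
    print(N, np.mean(estimates), np.var(estimates))   # mean ~ 1, variance ~ 2 / N
```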
3. Monte Carlo Estimation of Expected Return
In VPG, the random variable is the trajectory $\tau$, and the function of interest is the return $R(\tau)$.
Given $N$ trajectories sampled under policy $\pi_\theta$:
$$\tau^{(1)}, \tau^{(2)}, \dots, \tau^{(N)} \sim p_\theta(\tau)$$
the Monte Carlo estimator of the expected return is:
$$\hat{J}(\theta) = \frac{1}{N} \sum_{i=1}^{N} R\left(\tau^{(i)}\right)$$
Theoretical Properties
Unbiased: $\mathbb{E}\left[\hat{J}(\theta)\right] = J(\theta)$
Consistent: $\hat{J}(\theta) \to J(\theta)$ as $N \to \infty$
Thus, Monte Carlo estimation provides a valid estimator of the policy objective.
4. Monte Carlo Estimation of the Policy Gradient
The policy gradient is given by:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$
Since the expectation is intractable, VPG uses a Monte Carlo estimator:
$$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right) R\left(\tau^{(i)}\right)$$
This estimator is known as the REINFORCE estimator.
5. Unbiasedness of the Monte Carlo Policy Gradient Estimator
A critical theoretical result is that the Monte Carlo estimator of the policy gradient is unbiased:
$$\mathbb{E}_{\tau^{(1)}, \dots, \tau^{(N)} \sim p_\theta}\left[\hat{g}\right] = \nabla_\theta J(\theta)$$
This follows from:
Linearity of expectation
Correct sampling from $p_\theta(\tau)$
The likelihood-ratio identity
Unbiasedness ensures that, in expectation, gradient ascent steps move the policy parameters in a direction that improves expected return.
REINFORCE Algorithm as a Theoretical Instantiation of Vanilla Policy Gradient (VPG)
The REINFORCE algorithm is the earliest and most canonical realization of Vanilla Policy Gradient (VPG). From a theoretical standpoint, REINFORCE is not a separate algorithmic family but rather the direct, explicit instantiation of the policy gradient theorem using Monte Carlo estimation and likelihood-ratio gradients. It embodies the purest form of policy-gradient learning—free from approximations such as bootstrapping, trust regions, or critics.
This section develops REINFORCE as a mathematical consequence of VPG theory rather than as a procedural algorithm.
1. Conceptual Role of REINFORCE in Policy Gradient Theory
At a high level:
REINFORCE = Policy Gradient Theorem + Monte Carlo Estimation
REINFORCE operationalizes the theoretical gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right]$$
by replacing the expectation with empirical averages over sampled trajectories.
Thus, REINFORCE is the minimal algorithmic embodiment of VPG.
2. Likelihood-Ratio Gradient: Theoretical Foundation
The core theoretical mechanism underlying REINFORCE is the likelihood-ratio (score function) identity:
$$\nabla_\theta\, \mathbb{E}_{x \sim p_\theta}\left[f(x)\right] = \mathbb{E}_{x \sim p_\theta}\left[f(x)\, \nabla_\theta \log p_\theta(x)\right]$$
This identity allows gradients of expectations to be written as expectations of gradients—without differentiating through the stochastic process itself.
Applied to trajectories:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\, \nabla_\theta \log p_\theta(\tau)\right]$$
3. Factorization of the Trajectory Log-Likelihood
The trajectory distribution factorizes as:
$$p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$$
Taking the logarithm:
$$\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1} \mid s_t, a_t)$$
Since only the policy depends on $\theta$:
$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
This step is the key theoretical simplification that enables policy gradient methods.
4. REINFORCE Gradient Estimator
Substituting into the gradient expression yields:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$
Using the causality principle, rewards before time $t$ do not depend on $a_t$, allowing the return to be truncated to the reward-to-go:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right], \qquad G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$$
This is the theoretical REINFORCE gradient.
5. Monte Carlo Instantiation
Given $N$ sampled trajectories $\tau^{(1)}, \dots, \tau^{(N)} \sim p_\theta(\tau)$, the REINFORCE estimator is:
$$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right) G_t^{(i)}$$
Theoretical Properties
Unbiased
Consistent
Converges to the true gradient as $N \to \infty$
Thus, REINFORCE exactly matches the theoretical objective of VPG in expectation.
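Bringing the pieces together, the sketch below evaluates the REINFORCE estimator with reward-to-go weights for a linear-softmax policy, given a batch of recorded trajectories. Environment interaction is abstracted away: `trajectories` is assumed to be a list of (features, action, reward) steps filled with random placeholder values, and the feature dimension and action count are illustrative choices.

```python
import numpy as np

def softmax_probs(theta, features):
    """features: (num_actions, d) matrix of phi(s, a); returns pi_theta(.|s)."""
    logits = features @ theta
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, features, a):
    """grad_theta log pi_theta(a|s) = phi(s,a) - E_{a'~pi}[phi(s,a')] for a linear softmax."""
    probs = softmax_probs(theta, features)
    return features[a] - probs @ features

def reinforce_gradient(theta, trajectories, gamma):
    """Monte Carlo REINFORCE estimator with reward-to-go weights.
    trajectories: list of episodes, each a list of (features, action, reward) tuples."""
    grad = np.zeros_like(theta)
    for episode in trajectories:
        rewards = [r for (_, _, r) in episode]
        G, running = np.zeros(len(rewards)), 0.0
        for t in reversed(range(len(rewards))):          # reward-to-go, backwards
            running = rewards[t] + gamma * running
            G[t] = running
        for t, (features, a, _) in enumerate(episode):
            grad += grad_log_pi(theta, features, a) * G[t]
    return grad / len(trajectories)

# Toy usage with random placeholder data (2 episodes, 3 actions, 4 features).
# With theta = 0 the softmax policy is uniform, so uniform action sampling is on-policy here.
rng = np.random.default_rng(0)
theta = np.zeros(4)
trajectories = [
    [(rng.normal(size=(3, 4)), int(rng.integers(3)), float(rng.normal())) for _ in range(5)]
    for _ in range(2)
]
theta += 0.01 * reinforce_gradient(theta, trajectories, gamma=0.99)   # one ascent step
```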
Case-Based Theoretical Analysis (Vanilla Policy Gradient & REINFORCE)
A case-based theoretical analysis examines Vanilla Policy Gradient not by implementation details, but by analyzing how its mathematical structure behaves under different theoretical regimes. Each case isolates one assumption or structural property of the Markov Decision Process (MDP) and studies its consequences for gradient correctness, variance, convergence, and expressiveness.
Case 1: Finite-Horizon, Episodic MDP
Setup
Finite horizon $T < \infty$
Episodes terminate naturally
Return: $R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t$
Theoretical Implications
Well-Defined Objective
$J(\theta)$ is finite without requiring $\gamma < 1$.
Unbiased Monte Carlo Gradient
Credit Assignment Clarity
Each action influences only future rewards, enabling strict causal decomposition.
Conclusion
This is the ideal theoretical setting for Vanilla Policy Gradient—minimal assumptions, exact gradients, and clean convergence analysis.
Case 2: Infinite-Horizon, Discounted MDP
Setup
Discount factor $\gamma \in (0, 1)$
Theoretical Challenges
Convergence of Return
The infinite sum $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$ requires bounded rewards.
Interchanging Gradient and Expectation
Requires regularity conditions (dominated convergence theorem).
State Distribution Shift
The discounted state distribution $d^{\pi_\theta}(s)$ depends on $\theta$, complicating analysis.
Result
Policy Gradient Theorem still holds, but proofs become measure-theoretic.
Conclusion
VPG remains valid, but theoretical guarantees rely on stronger assumptions.
Case 3: Deterministic Environment, Stochastic Policy
Setup
Transition dynamics $P(s' \mid s, a)$ deterministic
Policy $\pi_\theta(a \mid s)$ stochastic
Key Insight
All randomness arises from the policy:
$$p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)$$
Implications
Exploration is policy-driven
Gradient estimator remains unbiased
Variance remains high if policy entropy is large
Conclusion
Stochastic policies are sufficient for learning even in deterministic worlds.
Case 4: Stochastic Environment, Deterministic Policy (Failure Case)
Setup
Deterministic policy $a = \mu_\theta(s)$
Theoretical Breakdown
No likelihood-ratio gradient exists: the score function $\nabla_\theta \log \pi_\theta(a \mid s)$ is undefined for a degenerate (Dirac) action distribution.
Consequence
VPG theory collapses
Requires alternative frameworks (Deterministic Policy Gradient)
Conclusion
Stochasticity of the policy is a theoretical necessity, not a design choice.
Case 5: Sparse-Reward MDPs
Setup
Rewards $r_t = 0$ at every non-terminal step
Terminal reward only
Theoretical Effect
Gradient Degeneracy
Every score term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is weighted by the same single terminal reward, leading to:
Slow learning
High variance
Poor signal-to-noise ratio
Conclusion
VPG is theoretically correct but inefficient under sparse rewards.
Case 6: Long-Horizon Tasks
Setup
Large horizon $T$ (or $\gamma \to 1$)
Variance Explosion
The estimator sums $T$ score terms, each scaled by a return that aggregates up to $T$ noisy rewards, so its variance grows rapidly with the horizon.
Theoretical Result
Gradient variance grows faster than learning rate decay.
Conclusion
Explains empirical instability of VPG in robotics and control tasks.
Baseline Theory and Variance Reduction in Vanilla Policy Gradient
In Vanilla Policy Gradient and its canonical instantiation REINFORCE, variance—not bias—is the central theoretical obstacle. While Monte Carlo policy gradient estimators are unbiased, their variance can be prohibitively large, leading to slow convergence and unstable learning. Baseline theory provides a mathematically rigorous mechanism to reduce variance without altering the expected gradient.
This section develops baseline methods from first principles, emphasizing why they work theoretically, not how they are implemented.
1. Variance in Policy Gradient Estimation: The Core Problem
The policy gradient is:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$
Although unbiased, the estimator:
$$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right) G_t^{(i)}$$
has variance:
$$\operatorname{Var}(\hat{g}) = \frac{1}{N} \operatorname{Var}_{\tau \sim p_\theta}\left(\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right)$$
High variance arises because:
$G_t$ aggregates many random future rewards
$\left\|\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right\|$ can be large
Long horizons amplify noise
2. Baseline Concept: Theoretical Definition
A baseline is any function $b(s_t)$ that does not depend on the action $a_t$.
The baseline-modified gradient estimator is:
$$\hat{g}_b = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right) \left(G_t^{(i)} - b\left(s_t^{(i)}\right)\right)$$
The critical theoretical question is:
Why does subtracting $b(s_t)$ not change the expected gradient?
3. Baseline Unbiasedness Theorem
Theorem
For any baseline $b(s)$ independent of the action $a$:
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right] = 0$$
Proof
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right] = b(s) \sum_{a} \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0$$
Thus, subtracting a baseline preserves unbiasedness.
4. Variance Reduction Mechanism (Intuition + Theory)
Variance depends on the magnitude of the term multiplied by the score function:
$$\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \left(G_t - b(s_t)\right)$$
If $b(s_t)$ approximates $\mathbb{E}\left[G_t \mid s_t\right]$, then:
$$\left|G_t - b(s_t)\right| \ll \left|G_t\right| \quad \text{on average}$$
Hence, variance of the gradient estimator is reduced.
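A small numerical illustration of this mechanism on a toy single-state problem (the return distribution, the two-action softmax, and the constant baseline value are all arbitrary assumptions): subtracting a baseline close to the mean return leaves the average gradient essentially unchanged while shrinking its variance by orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1])                  # logits of a toy 2-action softmax policy
probs = np.exp(theta - theta.max())
probs /= probs.sum()

def one_sample_gradient(baseline):
    """Single-sample score-function gradient for one state, with an optional baseline."""
    a = rng.choice(2, p=probs)
    score = -probs.copy()
    score[a] += 1.0                            # grad_theta log pi(a) for softmax logits
    G = 100.0 + (1.0 if a == 0 else 0.0) + rng.normal()   # large state offset + small action effect
    return score * (G - baseline)

for b in (0.0, 100.5):                         # no baseline vs. a baseline near E[G]
    samples = np.array([one_sample_gradient(b) for _ in range(20_000)])
    print(f"b={b}: mean={samples.mean(axis=0)}, var={samples.var(axis=0)}")
```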
5. Optimal Baseline: Theoretical Derivation
Consider the single-timestep gradient estimator:
$$\hat{g}_t = \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \left(G_t - b(s_t)\right)$$
The variance-minimizing baseline satisfies:
$$b^\ast(s_t) = \frac{\mathbb{E}\left[\left\|\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right\|^2 G_t\right]}{\mathbb{E}\left[\left\|\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right\|^2\right]}$$
This is the theoretically optimal baseline in the mean-squared sense.
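A sketch of the standard derivation: since the baseline does not affect the mean of the estimator, minimizing the variance is equivalent to minimizing the second moment $\mathbb{E}\left[\left\|\hat{g}_t\right\|^2\right]$ as a function of $b$; setting its derivative to zero,
$$\frac{\partial}{\partial b}\, \mathbb{E}\left[\left\|\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right\|^2 \left(G_t - b\right)^2\right] = -2\, \mathbb{E}\left[\left\|\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right\|^2 \left(G_t - b\right)\right] = 0$$
and solving for $b$ recovers the ratio above.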
6. State-Value Function as a Baseline
A common and theoretically grounded choice is:
$$b(s_t) = V^{\pi_\theta}(s_t)$$
Then:
$$G_t - V^{\pi_\theta}(s_t) \approx Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t) = A^{\pi_\theta}(s_t, a_t)$$
which is the advantage function.
Theoretical Interpretation
Removes predictable reward component
Leaves only action-dependent deviation
Minimizes variance under mild assumptions
Advantage Function: A Purely Theoretical Interpretation
The advantage function occupies a central conceptual position in modern policy-gradient theory. Although it is often introduced operationally as a variance-reduction tool, its true importance lies deeper: the advantage function provides a relative, state-conditioned measure of action quality, isolating the causal contribution of an action beyond what is already expected from the state itself.
This section develops the advantage function purely from theory, without reference to implementation or algorithms.
1. Motivation: Absolute vs Relative Action Evaluation
In a Markov Decision Process, the return following an action depends on two factors:
The state in which the action is taken
The choice of action itself
If a state is inherently good, every action taken in that state may lead to high return. Conversely, in a poor state, even the best action may yield a low return.
Thus, evaluating an action by its absolute return confounds state quality with action quality.
The advantage function resolves this confounding by asking:
“How much better (or worse) is this action compared to the average action in this state?”
2. Formal Definitions of Value Functions
Let π be a fixed stochastic policy.
State-Value Function
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$
This is the expected return starting from state $s$ and thereafter following $\pi$.
Action-Value Function
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\, a_0 = a\right]$$
This represents the expected return after taking action a in state s, then following π.
3. Definition of the Advantage Function
The advantage function is defined as:
$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$$
This difference removes the state-dependent baseline $V^{\pi}(s)$, leaving only the relative benefit of choosing action $a$.
4. Zero-Mean Property (Fundamental Theorem)
Theorem
For any state $s$:
$$\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[A^{\pi}(s, a)\right] = 0$$
Proof
$$\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[A^{\pi}(s, a)\right] = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q^{\pi}(s, a)\right] - V^{\pi}(s) = V^{\pi}(s) - V^{\pi}(s) = 0$$
This property is central: advantages measure deviations, not absolute value.
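A quick numerical confirmation on a randomly generated toy case (the Q-values and the policy are arbitrary placeholders): the policy-weighted average of the advantages vanishes up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=5)                 # toy Q^pi(s, a) for 5 actions in one state
pi = rng.dirichlet(np.ones(5))         # toy policy pi(.|s)

v = pi @ q                             # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
advantage = q - v                      # A^pi(s, a) = Q^pi(s, a) - V^pi(s)
print(pi @ advantage)                  # ~0.0: the zero-mean property
```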
5. Advantage as a Centered Action-Value Function
From a functional perspective:
$$A^{\pi}(s, a) = Q^{\pi}(s, a) - \mathbb{E}_{a' \sim \pi(\cdot \mid s)}\left[Q^{\pi}(s, a')\right]$$
Thus, the advantage function is a mean-centered version of $Q^{\pi}(s, \cdot)$ over the action distribution.
This centering is what makes advantage-weighted gradients:
Lower variance
Better conditioned
More stable
6. Advantage and Policy Gradient Theory
The policy gradient theorem can be written as:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$
Substituting:
$$Q^{\pi_\theta}(s, a) = A^{\pi_\theta}(s, a) + V^{\pi_\theta}(s)$$
and using the baseline property:
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, V^{\pi_\theta}(s)\right] = 0$$
yields:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)\right]$$
This shows that only the advantage matters for policy improvement.
Vanilla Policy Gradient as an Actor-Only Method (Theoretical Perspective)
Vanilla Policy Gradient occupies a unique position in reinforcement learning theory: it is a pure actor-only optimization method. Unlike Actor–Critic architectures, which decompose learning into separate policy (actor) and value (critic) components, Vanilla Policy Gradient operates exclusively on the policy itself, without introducing any auxiliary value-function approximation as part of the learning dynamics.
This section presents a strictly theoretical interpretation of VPG as an actor-only method, clarifying what this means mathematically, why it is possible, and what fundamental limitations arise from this design choice.
1. Definition of an Actor-Only Method (Theory)
A learning algorithm is said to be actor-only if:
The policy parameters $\theta$ are the only optimized variables
The learning objective is expressed directly in terms of the policy $\pi_\theta$
No separate parametric estimator of $V^{\pi}$, $Q^{\pi}$, or $A^{\pi}$ is required for correctness
Vanilla Policy Gradient satisfies all three conditions.
2. Direct Optimization of the Policy Objective
VPG optimizes the expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right]$$
and updates $\theta$ via stochastic gradient ascent:
$$\theta_{k+1} = \theta_k + \alpha\, \hat{g}_k, \qquad \mathbb{E}\left[\hat{g}_k\right] = \nabla_\theta J(\theta_k)$$
No auxiliary optimization problem is introduced. The policy itself is the sole object of optimization.
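A minimal sketch of this actor-only ascent loop. It reuses the `reinforce_gradient` estimator sketched earlier and replaces real environment interaction with a random placeholder rollout, so every helper here is illustrative rather than part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trajectories(theta, num_episodes):
    """Placeholder rollout: random (features, action, reward) steps.
    A real implementation would sample actions from pi_theta and step an environment."""
    return [
        [(rng.normal(size=(3, theta.size)), int(rng.integers(3)), float(rng.normal()))
         for _ in range(10)]
        for _ in range(num_episodes)
    ]

theta = np.zeros(4)                       # policy parameters: the only learned object
alpha = 0.01                              # step size (illustrative)

for iteration in range(100):
    batch = sample_trajectories(theta, num_episodes=8)
    grad = reinforce_gradient(theta, batch, gamma=0.99)   # Monte Carlo estimator from the earlier sketch
    theta = theta + alpha * grad                          # stochastic gradient ascent on J(theta)
```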
3. Policy Gradient Without a Critic
The policy gradient theorem states:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$
In VPG:
$Q^{\pi_\theta}(s, a)$ is not learned
It is replaced by Monte Carlo returns $G_t$
Thus, the gradient estimator uses raw trajectory data, not a learned critic.
4. Monte Carlo Returns as Implicit Value Estimates
Although no critic is explicitly present, Vanilla Policy Gradient implicitly estimates values via the Monte Carlo return:
$$G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k \approx Q^{\pi_\theta}(s_t, a_t)$$
This approximation is:
Unbiased
Consistent
High variance
Crucially, it does not introduce an independent parametric object—returns are computed directly from observed rewards.
5. Actor-Only Nature and Exactness of the Gradient
Because VPG uses full returns, the gradient estimator satisfies:
$$\mathbb{E}\left[\hat{g}\right] = \nabla_\theta J(\theta)$$
This means:
No approximation error is introduced by a critic
No bootstrapping bias exists
The gradient is exact in expectation
This property is unique to actor-only Monte Carlo methods.
6. Baselines Do Not Create a Critic (Theoretical Clarification)
Even when baselines are introduced:
$$\hat{g}_b = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right) \left(G_t^{(i)} - b\left(s_t^{(i)}\right)\right)$$
Vanilla Policy Gradient remains actor-only as long as:
$b(s)$ is not learned as a separate optimization target
The baseline does not define an independent objective
The baseline modifies the estimator, not the optimization problem.
Frequently Asked Questions (FAQs): Vanilla Policy Gradient
Q1. What is Vanilla Policy Gradient (VPG) in simple theoretical terms?
Vanilla Policy Gradient is a reinforcement learning method that directly optimizes a stochastic policy by performing gradient ascent on the expected return. The term “vanilla” indicates that it uses the pure policy gradient theorem with Monte Carlo estimation, without critics, trust regions, clipping, or second-order corrections.
Q2. Why is Vanilla Policy Gradient called an actor-only method?
Vanilla Policy Gradient is called actor-only because it optimizes only the policy parameters. It does not learn or maintain a separate value function (critic). All learning signals come directly from trajectory returns, making the policy the sole object of optimization.
Q3. Does VPG require a model of the environment?
No. Vanilla Policy Gradient is a model-free method. The policy gradient theorem allows gradients to be computed without differentiating through environment dynamics, relying only on sampled trajectories.
Q4. Why must the policy be stochastic in VPG?
The theoretical foundation of Vanilla Policy Gradient relies on the likelihood-ratio gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$
This expression is only well-defined for stochastic policies. Deterministic policies break the mathematical assumptions of Vanilla Policy Gradient and require a different framework (e.g., deterministic policy gradients).
Q5. What exactly is optimized in Vanilla Policy Gradient?
VPG optimizes the expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right]$$
This expectation is taken over all trajectories induced by the policy. Every policy update aims to increase this quantity.
Conclusions
Vanilla Policy Gradient represents one of the most conceptually important turning points in the theoretical development of reinforcement learning. Its significance does not lie in empirical efficiency or practical dominance, but in the clarity of its mathematical formulation. Vanilla Policy Gradient is the first framework that cleanly reframes reinforcement learning as a direct, differentiable optimization problem over stochastic policies, eliminating the need for argmax-based policy extraction and value-function dominance.
From a theoretical standpoint, VPG provides an unbiased gradient estimator of the expected return by operating directly on the trajectory distribution induced by the policy. The Policy Gradient Theorem demonstrates a profound result: environment dynamics vanish from the gradient, allowing policy optimization without an explicit model of the environment. This insight alone reshaped how researchers conceptualize learning in unknown and continuous domains.
However, the same properties that make Vanilla Policy Gradient theoretically elegant also expose its fundamental weaknesses. The reliance on Monte Carlo policy gradient estimation leads to severe variance, especially in long-horizon, sparse-reward, or high-dimensional settings. Credit assignment remains coarse, as all actions in a trajectory are reinforced equally by the total return. Consequently, convergence is slow, unstable, and sample-inefficient despite mathematical correctness.
VPG’s role as an actor-only reinforcement learning method highlights the core bias–variance tradeoff that governs all policy optimization techniques. By avoiding critics and bootstrapping, VPG achieves zero bias at the cost of maximal variance. This tradeoff is not a flaw but a theoretical baseline against which all later methods—Actor-Critic, Natural Policy Gradient, TRPO, and PPO—can be understood as structured compromises.
Historically, Vanilla Policy Gradient is the conceptual foundation upon which modern policy optimization is built. Nearly every advanced policy gradient method can be interpreted as a variance-reduced, geometry-aware, or constraint-stabilized extension of VPG. For this reason, it remains indispensable in graduate-level education and theoretical research, even if it is rarely deployed in real-world systems.
In summary, Vanilla Policy Gradient is not a practical algorithm to outperform others—it is a theoretical reference point. Mastering its assumptions, derivations, and limitations is essential for anyone seeking a deep understanding of reinforcement learning theory. Without VPG, modern stochastic policy optimization would lack both its mathematical grounding and its conceptual coherence.