Vanilla Policy Gradient: A Complete Theoretical Foundation

Table of Contents

  • Introduction
  • Vanilla Policy Gradient
  • Formal Definition of a Stochastic Policy
  • Expected Return as an Optimization Objective
  • Monte Carlo Estimation Theory in Vanilla Policy Gradient
  • REINFORCE Algorithm as a Theoretical Instantiation of Vanilla Policy Gradient (VPG)
  • Case-Based Theoretical Analysis (Vanilla Policy Gradient & REINFORCE)
  • Baseline Theory and Variance Reduction in Vanilla Policy Gradient
  • Advantage Function: A Purely Theoretical Interpretation
  • Vanilla Policy Gradient as an Actor-Only Method (Theoretical Perspective)
  • Frequently Asked Questions (FAQs): Vanilla Policy Gradient
  • Conclusions

Introduction

Vanilla Policy Gradient (VPG) is one of the most fundamental and conceptually pure algorithms in policy-based Reinforcement Learning (RL). It represents the earliest formalization of directly optimizing a parameterized policy using gradient ascent on expected return. Unlike value-based methods—which first learn value functions and then derive policies indirectly—VPG operates by explicitly modeling the policy and updating its parameters in the direction that improves long-term performance.

At its core, Vanilla Policy Gradient answers a central question in reinforcement learning:

How can an agent adjust the parameters of a stochastic policy so as to maximize the expected cumulative reward obtained from interacting with an environment?

To address this, VPG leverages tools from probability theory, stochastic optimization, and differential calculus, culminating in the celebrated Policy Gradient Theorem. This theorem provides a mathematically rigorous expression for the gradient of the expected return with respect to policy parameters—without requiring gradients of the environment dynamics.

This property is crucial, as it allows VPG to be applied to unknown, non-differentiable, and stochastic environments, which are common in real-world decision-making problems.

Vanilla Policy Gradient

Formal Definition of a Stochastic Policy

In Reinforcement Learning (RL), a policy defines the agent’s decision-making mechanism—that is, how actions are selected based on observed states. When actions are chosen probabilistically rather than deterministically, the policy is referred to as a stochastic policy. Stochastic policies are foundational to policy-gradient methods such as Vanilla Policy Gradient, because they enable differentiable optimization and principled exploration.


1. Policy in the Markov Decision Process Framework

Formally, reinforcement learning problems are modeled as a Markov Decision Process (MDP), defined by the tuple:

\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)

where:

  • \mathcal{S} is the state space

  • \mathcal{A} is the action space

  • P(s' \mid s, a) is the state transition probability

  • R(s, a) is the reward function

  • \gamma \in (0,1] is the discount factor

Within this framework, a policy governs the agent’s interaction with the environment.


2. Formal Definition of a Stochastic Policy

A stochastic policy is defined as a conditional probability distribution over actions given a state:

\pi(a \mid s) = \mathbb{P}(A_t = a \mid S_t = s)

where:

  • s \in \mathcal{S} is the current state

  • a \in \mathcal{A} is a possible action

  • \pi(a \mid s) denotes the probability of selecting action a in state s

Key Properties

For every state s \in \mathcal{S}, the policy satisfies:

  1. Non-negativity

\pi(a \mid s) \ge 0 \quad \forall a \in \mathcal{A}

  2. Normalization

\sum_{a \in \mathcal{A}} \pi(a \mid s) = 1

(or \int_{\mathcal{A}} \pi(a \mid s)\, da = 1 for continuous action spaces)

These conditions ensure that \pi(\cdot \mid s) is a valid probability distribution.


3. Parameterized Stochastic Policies

In policy-gradient methods, policies are typically parameterized by a vector \theta \in \mathbb{R}^d, yielding a family of policies:

\pi_\theta(a \mid s)

The objective of learning is to adjust \theta so as to maximize expected cumulative reward.

Examples of Parameterizations

  • Discrete actions: Softmax policy

\pi_\theta(a \mid s) = \frac{\exp(f_\theta(s,a))}{\sum_{a'} \exp(f_\theta(s,a'))}

  • Continuous actions: Gaussian policy

\pi_\theta(a \mid s) = \mathcal{N}(a \mid \mu_\theta(s), \Sigma_\theta(s))

These parameterizations are chosen to be smooth and differentiable with respect to \theta, a critical requirement for gradient-based optimization.
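
To make the two parameterizations above concrete, here is a minimal Python/NumPy sketch (my own illustration, not part of the original text). The linear score f_\theta(s,a) = \theta_a^\top s and the fixed log-standard-deviation Gaussian are simplifying assumptions chosen only for brevity.

```python
# Minimal sketch of the two parameterizations above (assumptions: a linear score
# f_theta(s, a) = theta[a] @ s for the softmax case, and a Gaussian with a linear
# mean and a fixed log-std for the continuous case).
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(. | s) for a discrete action set: softmax over linear scores."""
    logits = theta @ s
    logits = logits - logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def gaussian_policy_sample(mu_weights, log_std, s, rng):
    """Sample a continuous action from N(mu_weights @ s, diag(exp(log_std))^2)."""
    mean = mu_weights @ s
    return mean + np.exp(log_std) * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
s = np.array([0.2, -1.0, 0.5])              # example state features
theta = rng.standard_normal((4, 3))         # 4 discrete actions, 3 features
probs = softmax_policy(theta, s)
print(probs, probs.sum())                   # non-negative and sums to 1

mu_w = rng.standard_normal((2, 3))          # 2-dimensional continuous action
print(gaussian_policy_sample(mu_w, np.zeros(2), s, rng))
```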


4. Stochastic Policy as a Randomized Decision Rule

From a theoretical standpoint, a stochastic policy can be interpreted as a randomized decision rule:

\pi: \mathcal{S} \rightarrow \Delta(\mathcal{A})

where \Delta(\mathcal{A}) denotes the probability simplex over the action space.

This formulation highlights an important conceptual point:

The policy does not choose an action directly—it defines a distribution from which actions are sampled.

This probabilistic structure induces randomness in the agent’s behavior, even when the environment itself is deterministic.


5. Why Stochastic Policies Are Essential in Policy Gradient Theory

Stochastic policies play a central theoretical role in VPG for several reasons:

5.1 Differentiability of the Objective

The expected return objective:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} [R(\tau)]

depends on \theta only through the policy distribution. The likelihood-ratio (log-derivative) trick used in policy gradients,

\nabla_\theta \log \pi_\theta(a \mid s),

is well-defined only when the policy assigns non-zero probability mass smoothly across actions.


5.2 Exploration Guarantee

Unlike deterministic policies, stochastic policies naturally ensure exploration:

\pi(a \mid s) > 0 \Rightarrow \text{action } a \text{ is eventually explored}

This is critical for convergence guarantees in theoretical analyses of policy-gradient algorithms.


5.3 Avoiding Non-Differentiability

Deterministic policies often lead to non-differentiable mappings from parameters to actions. In contrast, stochastic policies maintain a smooth dependence of action probabilities on \theta, enabling unbiased gradient estimation.


6. Relationship to Deterministic Policies

A deterministic policy \mu(s) can be viewed as a degenerate stochastic policy:

\pi(a \mid s) = \begin{cases} 1, & a = \mu(s) \\ 0, & \text{otherwise} \end{cases}

However, such policies lack differentiability almost everywhere, which is why classical VPG relies on stochastic rather than deterministic formulations.


7. Theoretical Role in Trajectory Distributions

A stochastic policy induces a probability distribution over trajectories:

p_\theta(\tau) = \rho_0(s_0)\prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)

Here, the policy is the only component dependent on \theta. This factorization is fundamental to deriving the Policy Gradient Theorem, as it allows gradients of expected return to be expressed entirely in terms of \nabla_\theta \log \pi_\theta(a \mid s).


8. Summary (Theoretical Perspective)

From a theoretical standpoint, a stochastic policy is:

  • A probability distribution over actions conditioned on states

  • A differentiable, parameterized object enabling gradient-based optimization

  • The core mechanism through which randomness, exploration, and learning are introduced in VPG

In Vanilla Policy Gradient, the stochastic policy is not merely a design choice—it is the mathematical object that makes policy optimization tractable, analyzable, and theoretically sound.

Expected Return as an Optimization Objective

In policy-based Reinforcement Learning, and particularly in Vanilla Policy Gradient (VPG), learning is framed as a direct optimization problem. The agent does not aim to approximate value functions as an end in themselves; instead, it seeks to optimize the parameters of a policy so that long-term performance is maximized. The quantity that formalizes this notion of performance is the expected return.

From a theoretical perspective, the expected return serves as the objective functional over the space of stochastic policies.


1. Return: Cumulative Reward Along a Trajectory

Consider an MDP \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma).
A trajectory (or episode) of length T is defined as:

\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T)

The return from time step t is the discounted cumulative reward:

G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k

For episodic tasks, the return from the initial state is:

G_0 = \sum_{t=0}^{T-1} \gamma^t r_t

The discount factor \gamma \in (0,1] ensures convergence of the sum and encodes a preference for immediate rewards over distant ones.
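
As a quick worked illustration of this definition, the snippet below (my own toy numbers, with \gamma = 0.9 and an arbitrary four-step reward sequence, neither taken from the article) evaluates G_t for a short episode.

```python
# Toy illustration of the discounted return G_t (assumed values: gamma = 0.9 and an
# arbitrary reward sequence; neither comes from the article).
def discounted_return(rewards, gamma, t=0):
    """G_t = sum_{k=t}^{T-1} gamma^(k-t) * r_k for a finite reward list."""
    return sum(gamma ** (k - t) * r for k, r in enumerate(rewards) if k >= t)

rewards = [1.0, 0.0, 0.0, 2.0]                      # r_0, r_1, r_2, r_3
print(discounted_return(rewards, 0.9, t=0))         # 1 + 0.9**3 * 2 = 2.458
print(discounted_return(rewards, 0.9, t=2))         # 0 + 0.9 * 2   = 1.8
```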


2. Expected Return: From Random Trajectories to Objective Function

Under a stochastic policy \pi_\theta, trajectories are random variables due to:

  • stochasticity in the policy

  • stochasticity in environment transitions

Therefore, performance cannot be measured by a single trajectory. Instead, it is defined in expectation.

Definition (Expected Return)

The expected return of a parameterized policy \pi_\theta is:

J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ G_0(\tau) \right]

where:

  • p_\theta(\tau) is the probability distribution over trajectories induced by \pi_\theta

  • G_0(\tau) is the return associated with trajectory \tau

This expectation transforms a stochastic interaction process into a deterministic optimization objective.


3. Trajectory Distribution Induced by a Policy

The trajectory distribution factorizes as:

p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)

Key theoretical insight:

The policy \pi_\theta is the only component of p_\theta(\tau) that depends on \theta.

This property is fundamental—it enables gradient-based optimization without requiring knowledge of environment dynamics.


4. Optimization Problem Formulation

The learning objective in Vanilla Policy Gradient is formally written as:

\theta^\ast = \arg\max_{\theta} J(\theta)

This is a stochastic optimization problem over a high-dimensional, non-convex objective landscape.

Notably:

  • J(\theta) is generally non-linear and non-convex

  • Closed-form solutions are almost never available

  • Optimization must rely on Monte Carlo gradient estimates


5. Alternative Equivalent Forms of Expected Return

5.1 State-Value Function Form

Using the state-value function under policy \pi_\theta:

V^{\pi_\theta}(s) = \mathbb{E}_{\pi_\theta}[G_t \mid S_t = s]

the objective can be written as:

J(\theta) = \mathbb{E}_{s_0 \sim \rho_0} \left[ V^{\pi_\theta}(s_0) \right]

This form emphasizes dependence on the initial-state distribution.


5.2 Expected Reward Over State–Action Occupancy Measure

Define the discounted state–action visitation distribution:

d^{\pi_\theta}(s,a) = \sum_{t=0}^{\infty} \gamma^t \mathbb{P}(S_t = s, A_t = a \mid \pi_\theta)

Then:

J(\theta) = \sum_{s,a} d^{\pi_\theta}(s,a)\, r(s,a)

This formulation is critical in theoretical analysis, connecting policy optimization to occupancy measures and fixed-point equations.


6. Why Expectation Is Essential (Theoretical Justification)

6.1 Randomness of Trajectories

Because both policy and environment are stochastic, any single trajectory is an unreliable performance estimate. The expectation ensures:

  • robustness to randomness

  • well-defined gradients

  • convergence in the limit of infinite samples

Monte Carlo Estimation Theory in Vanilla Policy Gradient

Monte Carlo (MC) estimation plays a central theoretical role in Vanilla Policy Gradient (VPG). Since the true expected return and its gradient are analytically intractable in most reinforcement learning problems, VPG relies on sample-based estimates obtained from complete trajectories. Understanding Monte Carlo estimation is therefore essential for grasping both the correctness and the limitations of VPG.

This section develops the theory behind Monte Carlo estimation as used in policy gradient methods.


1. Why Monte Carlo Estimation Is Necessary in VPG

The objective in VPG is the expected return:

 

J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[G_0(\tau)]

 

Computing this expectation exactly would require:

  • Full knowledge of the environment dynamics P(s' \mid s,a)

  • Summation or integration over all possible trajectories

In realistic MDPs, the trajectory space is exponentially large. Therefore, exact computation is infeasible. Monte Carlo estimation provides a principled way to approximate expectations using samples drawn from the true trajectory distribution.


2. Monte Carlo Estimation: General Theory

Let X be a random variable with distribution p(x), and let f(X) be a function of interest. The expectation

\mathbb{E}[f(X)] = \int f(x)\, p(x)\, dx

can be approximated using N samples \{x_i\}_{i=1}^N:

\hat{\mu}_N = \frac{1}{N} \sum_{i=1}^N f(x_i)

 

Fundamental Properties

  1. Unbiasedness

\mathbb{E}[\hat{\mu}_N] = \mathbb{E}[f(X)]

  2. Consistency (Law of Large Numbers)

\hat{\mu}_N \xrightarrow{a.s.} \mathbb{E}[f(X)] \quad \text{as } N \to \infty

  3. Variance

\mathrm{Var}(\hat{\mu}_N) = \frac{1}{N}\mathrm{Var}(f(X))

 

These properties directly carry over to trajectory-based estimation in VPG.
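
A quick numerical illustration of these three properties follows (a toy example of my own, not from the text): with X \sim \mathrm{Uniform}(0,1) and f(x) = x^2 we have \mathbb{E}[f(X)] = 1/3, and the sample mean approaches it as N grows.

```python
# Monte Carlo estimate of E[f(X)] for f(x) = x^2, X ~ Uniform(0, 1); true value 1/3.
# Illustrates unbiasedness/consistency: the estimate converges as N grows (LLN).
import numpy as np

rng = np.random.default_rng(1)
for N in (10, 1_000, 100_000):
    x = rng.random(N)                 # N i.i.d. samples from Uniform(0, 1)
    mu_hat = (x ** 2).mean()          # (1/N) * sum_i f(x_i)
    print(N, mu_hat)                  # approaches 1/3 as N increases
```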


3. Monte Carlo Estimation of Expected Return

In VPG, the random variable is the trajectory \tau, and the function of interest is the return G_0(\tau).

Given N trajectories sampled under policy \pi_\theta:

\{\tau^{(i)}\}_{i=1}^N \sim p_\theta(\tau)

the Monte Carlo estimator of the expected return is:

\hat{J}_N(\theta) = \frac{1}{N} \sum_{i=1}^N G_0(\tau^{(i)})

 

Theoretical Properties

  • Unbiased:

\mathbb{E}[\hat{J}_N(\theta)] = J(\theta)

  • Consistent:

\hat{J}_N(\theta) \to J(\theta) \quad \text{as } N \to \infty

Thus, Monte Carlo estimation provides a valid estimator of the policy objective.
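
A minimal sketch of this estimator is shown below (my own, under the simplifying assumption that each sampled trajectory is stored simply as its list of per-step rewards).

```python
# Monte Carlo estimate J_hat_N: average the discounted return G_0 over N trajectories.
# The list-of-rewards representation is an assumption chosen only for illustration.
def expected_return_estimate(trajectories, gamma):
    """Average discounted return G_0 over the sampled trajectories."""
    returns = [sum(gamma ** t * r for t, r in enumerate(rewards))
               for rewards in trajectories]
    return sum(returns) / len(returns)

# Two hand-made reward sequences, purely illustrative:
print(expected_return_estimate([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]], gamma=0.9))
```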


4. Monte Carlo Estimation of the Policy Gradient

The policy gradient is given by:

 

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]

Since the expectation is intractable, VPG uses a Monte Carlo estimator:

\widehat{\nabla_\theta J} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T_i-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, G_t^{(i)}

This estimator is known as the REINFORCE estimator.


5. Unbiasedness of the Monte Carlo Policy Gradient Estimator

A critical theoretical result is that the Monte Carlo estimator of the policy gradient is unbiased:

 

\mathbb{E}[\widehat{\nabla_\theta J}] = \nabla_\theta J(\theta)

This follows from:

  • Linearity of expectation

  • Correct sampling from p_\theta(\tau)

  • The likelihood-ratio identity

Unbiasedness ensures that, in expectation, gradient ascent steps move the policy parameters in a direction that improves expected return.
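
For a concrete sense of the per-timestep term being averaged, the score \nabla_\theta \log \pi_\theta(a \mid s) has a closed form for a linear-softmax policy. The sketch below (my own; the linear parameterization is an assumption) computes it directly.

```python
# Closed-form score grad_theta log pi_theta(a | s) for a linear-softmax policy
# (assumed parameterization: theta has one weight row per action).
# Row a' of the result equals (1{a' = a} - pi_theta(a' | s)) * s.
import numpy as np

def softmax_policy(theta, s):
    logits = theta @ s
    logits = logits - logits.max()
    p = np.exp(logits)
    return p / p.sum()

def score(theta, s, a):
    p = softmax_policy(theta, s)
    one_hot = np.zeros_like(p)
    one_hot[a] = 1.0
    return np.outer(one_hot - p, s)      # same shape as theta

theta = np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]])   # 3 actions, 2 features
s = np.array([1.0, -1.0])
print(score(theta, s, a=1))
```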

REINFORCE Algorithm as a Theoretical Instantiation of Vanilla Policy Gradient (VPG)

The REINFORCE algorithm is the earliest and most canonical realization of Vanilla Policy Gradient. From a theoretical standpoint, REINFORCE is not a separate algorithmic family but rather the direct, explicit instantiation of the policy gradient theorem using Monte Carlo estimation and likelihood-ratio gradients. It embodies the purest form of policy-gradient learning—free from approximations such as bootstrapping, trust regions, or critics.

This section develops REINFORCE as a mathematical consequence of VPG theory rather than as a procedural algorithm.


1. Conceptual Role of REINFORCE in Policy Gradient Theory

At a high level:

REINFORCE = Policy Gradient Theorem + Monte Carlo Estimation

REINFORCE operationalizes the theoretical gradient:

 

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]

 

by replacing the expectation with empirical averages over sampled trajectories.

Thus, REINFORCE is the minimal algorithmic embodiment of VPG.


2. Likelihood-Ratio Gradient: Theoretical Foundation

The core theoretical mechanism underlying REINFORCE is the likelihood-ratio (score function) identity:

 

\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)

 

This identity allows gradients of expectations to be written as expectations of gradients—without differentiating through the stochastic process itself.
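
Before applying the identity to trajectories, a quick finite-difference check can make it concrete (my own toy setup: a one-parameter Bernoulli distribution, not an example from the article).

```python
# Finite-difference check of grad p_theta(x) = p_theta(x) * grad log p_theta(x)
# on a one-parameter Bernoulli distribution p_theta(1) = sigmoid(theta) (toy choice).
import math

def p(theta, x):
    s = 1.0 / (1.0 + math.exp(-theta))
    return s if x == 1 else 1.0 - s

theta, x, eps = 0.3, 1, 1e-6
grad_p = (p(theta + eps, x) - p(theta - eps, x)) / (2 * eps)
grad_log_p = (math.log(p(theta + eps, x)) - math.log(p(theta - eps, x))) / (2 * eps)
print(abs(grad_p - p(theta, x) * grad_log_p) < 1e-8)   # True: identity holds numerically
```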

Applied to trajectories:

 

\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, G_0(\tau)\, d\tau = \mathbb{E}_{\tau} \left[ \nabla_\theta \log p_\theta(\tau)\, G_0(\tau) \right]

 


3. Factorization of the Trajectory Log-Likelihood

The trajectory distribution factorizes as:

 

p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)

 

Taking the logarithm:

 

\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1} \mid s_t, a_t)

 

Since only the policy depends on \theta:

\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)

 

This step is the key theoretical simplification that enables policy gradient methods.


4. REINFORCE Gradient Estimator

Substituting into the gradient expression yields:

 

\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_0 \right]

 

Using the causality principle, rewards before time t do not depend on a_t, allowing the return to be truncated:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]

 

This is the theoretical REINFORCE gradient.


5. Monte Carlo Instantiation

Given N sampled trajectories \{\tau^{(i)}\}, the REINFORCE estimator is:

\widehat{\nabla_\theta J} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T_i-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, G_t^{(i)}

Theoretical Properties

  • Unbiased:

\mathbb{E}[\widehat{\nabla_\theta J}] = \nabla_\theta J(\theta)

  • Consistent: converges to the true gradient as N \to \infty

     

Thus, REINFORCE exactly matches the theoretical objective of VPG in expectation.
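
As an end-to-end illustration, the sketch below (entirely my own construction: a two-armed bandit with Bernoulli rewards, i.e., single-step episodes, which is not an example from the article) applies this estimator together with a plain gradient-ascent update.

```python
# REINFORCE on a two-armed Bernoulli bandit (single-step episodes, so G_t = r_0).
# All quantities here are illustrative assumptions, not values from the article.
import numpy as np

rng = np.random.default_rng(0)
p_win = np.array([0.2, 0.8])              # hidden reward probabilities of the two arms
theta = np.zeros(2)                       # softmax logits: pi_theta = softmax(theta)
alpha, N = 0.1, 10                        # step size and episodes per update

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

for _ in range(2000):
    probs = pi(theta)
    grad = np.zeros(2)
    for _ in range(N):
        a = rng.choice(2, p=probs)                    # sample action from pi_theta
        G = float(rng.random() < p_win[a])            # return of the one-step episode
        score = np.eye(2)[a] - probs                  # grad_theta log pi_theta(a)
        grad += score * G
    theta += alpha * grad / N                         # stochastic gradient ascent on J
print(pi(theta))                                      # mass concentrates on the better arm
```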

Case-Based Theoretical Analysis (Vanilla Policy Gradient & REINFORCE)

A case-based theoretical analysis examines Vanilla Policy Gradient not by implementation details, but by analyzing how its mathematical structure behaves under different theoretical regimes. Each case isolates one assumption or structural property of the Markov Decision Process (MDP) and studies its consequences for gradient correctness, variance, convergence, and expressiveness.


Case 1: Finite-Horizon, Episodic MDP

Setup

  • Finite horizon T < \infty

  • Episodes terminate naturally

  • Return:

G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k

Theoretical Implications

  1. Well-Defined Objective

J(\theta) = \mathbb{E}[G_0]

is finite without requiring \gamma < 1.

  2. Unbiased Monte Carlo Gradient

\nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) G_t \right]

  3. Credit Assignment Clarity
    Each action influences only future rewards, enabling strict causal decomposition.

Conclusion

This is the ideal theoretical setting for Vanilla Policy Gradient—minimal assumptions, exact gradients, and clean convergence analysis.


Case 2: Infinite-Horizon, Discounted MDP

Setup

  • Infinite horizon T \to \infty

  • Discount factor \gamma \in (0,1)

Theoretical Challenges

  1. Convergence of Return

\sum_{t=0}^{\infty} \gamma^t r_t < \infty

requires bounded rewards.

  2. Interchanging Gradient and Expectation
    Requires regularity conditions (dominated convergence theorem).

  3. State Distribution Shift
    The state distribution d^{\pi_\theta}(s) depends on \theta, complicating analysis.

Result

Policy Gradient Theorem still holds, but proofs become measure-theoretic.

Conclusion

VPG remains valid, but theoretical guarantees rely on stronger assumptions.


Case 3: Deterministic Environment, Stochastic Policy

Setup

  • Transitions P(s' \mid s,a) deterministic

  • Policy \pi_\theta(a \mid s) stochastic

Key Insight

All randomness arises from the policy:

\mathrm{Var}(\nabla_\theta J) \propto \mathrm{Var}_\pi(G_t)

Implications

  • Exploration is policy-driven

  • Gradient estimator remains unbiased

  • Variance remains high if policy entropy is large

Conclusion

Stochastic policies are sufficient for learning even in deterministic worlds.


Case 4: Stochastic Environment, Deterministic Policy (Failure Case)

Setup

  • Deterministic policy \mu_\theta(s)

Theoretical Breakdown

\nabla_\theta \log \pi_\theta(a \mid s) \quad \text{undefined}

No likelihood-ratio gradient exists.

Consequence

  • VPG theory collapses

  • Requires alternative frameworks (Deterministic Policy Gradient)

Conclusion

Stochasticity of the policy is a theoretical necessity, not a design choice.


Case 5: Sparse-Reward MDPs

Setup

  • Rewards r_t = 0 for most timesteps

  • Terminal reward only

Theoretical Effect

G_t \approx 0 \quad \forall t \ll T

Gradient Degeneracy

\nabla_\theta J \approx 0

leading to:

  • Slow learning

  • High variance

  • Poor signal-to-noise ratio

Conclusion

VPG is theoretically correct but inefficient under sparse rewards.


Case 6: Long-Horizon Tasks

Setup

  • Large T or weak discounting

Variance Explosion

\mathrm{Var}(G_t) \text{ grows rapidly (in the worst case exponentially) with } T

Theoretical Result

Gradient variance grows faster than learning-rate decay can compensate for.

Conclusion

Explains empirical instability of VPG in robotics and control tasks.
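
A tiny simulation (my own toy setup: i.i.d. unit-variance rewards with no discounting, which is a best case) already shows the return variance growing with the horizon; correlated rewards and the score-weighted sums in the gradient estimator make the growth considerably worse.

```python
# Variance of the Monte Carlo return G_0 as a function of the horizon T, for i.i.d.
# unit-variance rewards and gamma = 1 (an assumed best case: growth is linear here).
import numpy as np

rng = np.random.default_rng(3)
for T in (10, 100, 1000):
    G0 = rng.standard_normal((5000, T)).sum(axis=1)   # 5000 sampled returns of length T
    print(T, round(G0.var(), 1))                      # roughly T
```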

Baseline Theory and Variance Reduction in Vanilla Policy Gradient

In Vanilla Policy Gradient and its canonical instantiation REINFORCE, variance—not bias—is the central theoretical obstacle. While Monte Carlo policy gradient estimators are unbiased, their variance can be prohibitively large, leading to slow convergence and unstable learning. Baseline theory provides a mathematically rigorous mechanism to reduce variance without altering the expected gradient.

This section develops baseline methods from first principles, emphasizing why they work theoretically, not how they are implemented.


1. Variance in Policy Gradient Estimation: The Core Problem

The policy gradient is:

\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]

Although unbiased, the estimator:

g = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t

has variance:

\mathrm{Var}(g) = \mathbb{E}\big[\|g\|^2\big] - \|\nabla_\theta J(\theta)\|^2

High variance arises because:

  • G_t aggregates many random future rewards

  • \nabla_\theta \log \pi_\theta(a_t \mid s_t) can be large

  • Long horizons amplify noise


2. Baseline Concept: Theoretical Definition

A baseline is any function b_t that does not depend on the action a_t. The baseline-modified gradient estimator is:

\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \left(G_t - b_t\right) \right]

The critical theoretical question is:

Why does subtracting b_t not change the expected gradient?


3. Baseline Unbiasedness Theorem

Theorem

For any baseline b_t independent of a_t:

\mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b_t \right] = 0

Proof

\begin{aligned}
\mathbb{E}_{a_t \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b_t \right]
&= b_t \sum_{a_t} \pi_\theta(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
&= b_t \sum_{a_t} \nabla_\theta \pi_\theta(a_t \mid s_t) \\
&= b_t\, \nabla_\theta \sum_{a_t} \pi_\theta(a_t \mid s_t) \\
&= b_t\, \nabla_\theta (1) = 0
\end{aligned}

Thus, subtracting a baseline preserves unbiasedness.


4. Variance Reduction Mechanism (Intuition + Theory)

Variance depends on the magnitude of the term multiplied by the score function:

\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(G_t - b_t)

If b_t approximates \mathbb{E}[G_t \mid s_t], then:

\mathrm{Var}(G_t - b_t) \ll \mathrm{Var}(G_t)

Hence, variance of the gradient estimator is reduced.
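
A small numerical check illustrates both facts at once (toy numbers of my own: one state, two actions, noisy returns): the mean of the baseline-corrected term is unchanged while its variance drops.

```python
# One-state, two-action toy check (all numbers are illustrative assumptions) that a
# baseline near E[G_t | s_t] leaves the estimator's mean intact and reduces its variance.
import numpy as np

rng = np.random.default_rng(2)
probs = np.array([0.3, 0.7])                 # fixed pi(. | s)
means = np.array([1.0, 2.0])                 # E[G | s, a] for the two actions

def mean_and_var(baseline, n=200_000):
    a = rng.choice(2, p=probs, size=n)
    G = means[a] + rng.standard_normal(n)      # noisy sampled returns
    score = (a == 1).astype(float) - probs[1]  # d log pi(a) / d theta_1 for a softmax
    g = score * (G - baseline)
    return g.mean(), g.var()

print(mean_and_var(baseline=0.0))            # mean ~0.21, larger variance
print(mean_and_var(baseline=probs @ means))  # same mean, noticeably smaller variance
```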


5. Optimal Baseline: Theoretical Derivation

Consider the single-timestep gradient estimator:

g_t = \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(G_t - b(s_t))

The variance-minimizing baseline satisfies:

b^\star(s_t) = \frac{ \mathbb{E} \left[ G_t\, \|\nabla_\theta \log \pi_\theta(a_t \mid s_t)\|^2 \mid s_t \right] }{ \mathbb{E} \left[ \|\nabla_\theta \log \pi_\theta(a_t \mid s_t)\|^2 \mid s_t \right] }

This is the theoretically optimal baseline in the mean-squared sense.


6. State-Value Function as a Baseline

A common and theoretically grounded choice is:

b(s_t) = V^{\pi}(s_t)

Then:

G_t - V^{\pi}(s_t) = A^{\pi}(s_t, a_t)

which is the advantage function.

Theoretical Interpretation

  • Removes predictable reward component

  • Leaves only action-dependent deviation

  • Minimizes variance under mild assumptions

Advantage Function: A Purely Theoretical Interpretation

The advantage function occupies a central conceptual position in modern policy-gradient theory. Although it is often introduced operationally as a variance-reduction tool, its true importance lies deeper: the advantage function provides a relative, state-conditioned measure of action quality, isolating the causal contribution of an action beyond what is already expected from the state itself.

This section develops the advantage function purely from theory, without reference to implementation or algorithms.


1. Motivation: Absolute vs Relative Action Evaluation

In a Markov Decision Process, the return following an action depends on two factors:

  1. The state in which the action is taken

  2. The choice of action itself

If a state is inherently good, every action taken in that state may lead to high return. Conversely, in a poor state, even the best action may yield a low return.

Thus, evaluating an action by its absolute return confounds state quality with action quality.

The advantage function resolves this confounding by asking:
“How much better (or worse) is this action compared to the average action in this state?”


2. Formal Definitions of Value Functions

Let \pi be a fixed stochastic policy.

State-Value Function

V^\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right]

This is the expected return starting from state s and thereafter following \pi.


Action-Value Function

Q^\pi(s,a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right]

This represents the expected return after taking action a in state s, then following \pi.


3. Definition of the Advantage Function

The advantage function is defined as:

A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)

This difference removes the state-dependent baseline V^\pi(s), leaving only the relative benefit of choosing action a.


4. Zero-Mean Property (Fundamental Theorem)

Theorem

For any state s:

\mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ A^\pi(s,a) \right] = 0

Proof

\begin{aligned}
\mathbb{E}_{a \sim \pi} [A^\pi(s,a)] &= \sum_a \pi(a \mid s) \left(Q^\pi(s,a) - V^\pi(s)\right) \\
&= \sum_a \pi(a \mid s)\, Q^\pi(s,a) - V^\pi(s) \sum_a \pi(a \mid s) \\
&= V^\pi(s) - V^\pi(s) = 0
\end{aligned}

This property is central: advantages measure deviations, not absolute value.


5. Advantage as a Centered Action-Value Function

From a functional perspective:

A^\pi(s,a) = Q^\pi(s,a) - \mathbb{E}_{a' \sim \pi}[Q^\pi(s,a')]

Thus, the advantage function is a mean-centered version of Q^\pi over the action distribution.

This centering is what makes advantage-weighted gradients:

  • Lower variance

  • Better conditioned

  • More stable


6. Advantage and Policy Gradient Theory

The policy gradient theorem can be written as:

\nabla_\theta J(\theta) = \mathbb{E}_{s,a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \right]

Substituting:

Q^{\pi_\theta}(s,a) = A^{\pi_\theta}(s,a) + V^{\pi_\theta}(s)

and using the baseline property:

\mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, V^{\pi_\theta}(s) \right] = 0

yields:

\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s,a) \right]

This shows that only the advantage matters for policy improvement.
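
The following check (a toy single-state example of my own, with arbitrary Q-values) verifies this numerically: weighting the score by the advantage instead of Q leaves the expected gradient unchanged, because the V term integrates to zero.

```python
# Toy numerical check (illustrative values only) that the Q-weighted and the
# advantage-weighted policy gradients coincide at a single state.
import numpy as np

probs = np.array([0.2, 0.5, 0.3])            # pi_theta(. | s) at one state
Q = np.array([1.0, 3.0, -2.0])               # Q^pi(s, a)
V = probs @ Q                                # V^pi(s) = E_a[Q^pi(s, a)]
A = Q - V                                    # advantage; E_a[A] = 0

scores = np.eye(3) - probs                   # row a: grad of log pi(a), one softmax logit per action
grad_Q = (probs[:, None] * scores * Q[:, None]).sum(axis=0)
grad_A = (probs[:, None] * scores * A[:, None]).sum(axis=0)
print(np.allclose(grad_Q, grad_A))           # True
```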

Vanilla Policy Gradient as an Actor-Only Method (Theoretical Perspective)

Vanilla Policy Gradient occupies a unique position in reinforcement learning theory: it is a pure actor-only optimization method. Unlike Actor–Critic architectures, which decompose learning into separate policy (actor) and value (critic) components, Vanilla Policy Gradient operates exclusively on the policy itself, without introducing any auxiliary value-function approximation as part of the learning dynamics.

This section presents a strictly theoretical interpretation of VPG as an actor-only method, clarifying what this means mathematically, why it is possible, and what fundamental limitations arise from this design choice.


1. Definition of an Actor-Only Method (Theory)

A learning algorithm is said to be actor-only if:

  1. The policy parameters \theta are the only optimized variables

  2. The learning objective is expressed directly in terms of the policy

  3. No separate parametric estimator of V^\pi, Q^\pi, or A^\pi is required for correctness

Vanilla Policy Gradient satisfies all three conditions.


2. Direct Optimization of the Policy Objective

VPG optimizes the expected return:

 

J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[G_0]

and updates \theta via stochastic gradient ascent:

\theta_{k+1} = \theta_k + \alpha_k \nabla_\theta J(\theta_k)

 

No auxiliary optimization problem is introduced. The policy itself is the sole object of optimization.
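
To make "the policy itself is the sole object of optimization" concrete, here is a schematic actor-only loop (my own sketch; the tiny two-step environment and the linear-softmax policy are invented for illustration and are not part of the article).

```python
# Schematic actor-only VPG loop: sample episodes with the current policy, form
# reward-to-go returns G_t, accumulate score * G_t, and take one ascent step.
# The environment and parameterization below are illustrative assumptions.
import numpy as np

class ToyChainEnv:
    """Two-step episodic environment invented solely so the loop is runnable."""
    def reset(self):
        self.t = 0
        return np.array([1.0, 0.0])
    def step(self, a):
        self.t += 1
        reward = 1.0 if a == 1 else 0.0          # action 1 is always better
        return np.array([0.0, 1.0]), reward, self.t >= 2

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def vpg_update(theta, env, rng, gamma=0.99, alpha=0.05, episodes=16):
    grad = np.zeros_like(theta)
    for _ in range(episodes):
        s, done, steps = env.reset(), False, []
        while not done:
            p = softmax(theta @ s)
            a = rng.choice(len(p), p=p)
            s_next, r, done = env.step(a)
            steps.append((s, a, r, p))
            s = s_next
        G = 0.0
        for s_t, a_t, r_t, p_t in reversed(steps):   # reward-to-go returns
            G = r_t + gamma * G
            score = np.outer(np.eye(len(p_t))[a_t] - p_t, s_t)
            grad += score * G
    return theta + alpha * grad / episodes           # one stochastic ascent step

rng = np.random.default_rng(0)
env, theta = ToyChainEnv(), np.zeros((2, 2))         # 2 actions, 2 state features
for _ in range(300):
    theta = vpg_update(theta, env, rng)
print(softmax(theta @ np.array([1.0, 0.0])))         # prefers action 1 in the start state
```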


3. Policy Gradient Without a Critic

The policy gradient theorem states:

 

\nabla_\theta J(\theta) = \mathbb{E}_{s,a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \right]

 

In VPG:

  • Q^{\pi_\theta}(s,a) is not learned

  • It is replaced by Monte Carlo returns G_t

Thus, the gradient estimator uses raw trajectory data, not a learned critic.


4. Monte Carlo Returns as Implicit Value Estimates

Although no critic is explicitly present, Vanilla Policy Gradient implicitly estimates values via:

 

Q^{\pi_\theta}(s_t,a_t) \approx G_t

This approximation is:

  • Unbiased

  • Consistent

  • High variance

Crucially, it does not introduce an independent parametric object—returns are computed directly from observed rewards.


5. Actor-Only Nature and Exactness of the Gradient

Because VPG uses full returns, the gradient estimator satisfies:

 

\mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right] = \nabla_\theta J(\theta)

 

This means:

  • No approximation error is introduced by a critic

  • No bootstrapping bias exists

  • The gradient is exact in expectation

This property is unique to actor-only Monte Carlo methods.


6. Baselines Do Not Create a Critic (Theoretical Clarification)

Even when baselines are introduced:

 

\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, (G_t - b(s_t)) \right]

 

Vanilla Policy Gradient remains actor-only as long as:

  • b(s_t) is not learned as a separate optimization target

  • The baseline does not define an independent objective

The baseline modifies the estimator, not the optimization problem.

Frequently Asked Questions (FAQs): Vanilla Policy Gradient

Q1. What is Vanilla Policy Gradient (VPG) in simple theoretical terms?

Vanilla Policy Gradient is a reinforcement learning method that directly optimizes a stochastic policy by performing gradient ascent on the expected return. The term “vanilla” indicates that it uses the pure policy gradient theorem with Monte Carlo estimation, without critics, trust regions, clipping, or second-order corrections.


Q2. Why is Vanilla Policy Gradient called an actor-only method?

Vanilla Policy Gradient is called actor-only because it optimizes only the policy parameters. It does not learn or maintain a separate value function (critic). All learning signals come directly from trajectory returns, making the policy the sole object of optimization.


Q3. Does VPG require a model of the environment?

No. Vanilla Policy Gradient is a model-free method. The policy gradient theorem allows gradients to be computed without differentiating through environment dynamics, relying only on sampled trajectories.


Q4. Why must the policy be stochastic in VPG?

The theoretical foundation of Vanilla Policy Gradient relies on the likelihood-ratio gradient:

 

\nabla_\theta \log \pi_\theta(a \mid s)

 

This expression is only well-defined for stochastic policies. Deterministic policies break the mathematical assumptions of Vanilla Policy Gradient and require a different framework (e.g., deterministic policy gradients).


Q5. What exactly is optimized in Vanilla Policy Gradient?

VPG optimizes the expected return:

 

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G_0]

 

This expectation is taken over all trajectories induced by the policy. Every policy update aims to increase this quantity.

Conclusions

Vanilla Policy Gradient represents one of the most conceptually important turning points in the theoretical development of reinforcement learning. Its significance does not lie in empirical efficiency or practical dominance, but in the clarity of its mathematical formulation. Vanilla Policy Gradient is the first framework that cleanly reframes reinforcement learning as a direct, differentiable optimization problem over stochastic policies, eliminating the need for argmax-based policy extraction and value-function dominance.

From a theoretical standpoint, VPG provides an unbiased gradient estimator of the expected return by operating directly on the trajectory distribution induced by the policy. The Policy Gradient Theorem demonstrates a profound result: environment dynamics vanish from the gradient, allowing policy optimization without an explicit model of the environment. This insight alone reshaped how researchers conceptualize learning in unknown and continuous domains.

However, the same properties that make Vanilla Policy Gradient theoretically elegant also expose its fundamental weaknesses. The reliance on Monte Carlo policy gradient estimation leads to severe variance, especially in long-horizon, sparse-reward, or high-dimensional settings. Credit assignment remains coarse, as all actions in a trajectory are reinforced equally by the total return. Consequently, convergence is slow, unstable, and sample-inefficient despite mathematical correctness.

VPG’s role as an actor-only reinforcement learning method highlights the core bias–variance tradeoff that governs all policy optimization techniques. By avoiding critics and bootstrapping, VPG achieves zero bias at the cost of maximal variance. This tradeoff is not a flaw but a theoretical baseline against which all later methods—Actor-Critic, Natural Policy Gradient, TRPO, and PPO—can be understood as structured compromises.

Historically, Vanilla Policy Gradient is the conceptual foundation upon which modern policy optimization is built. Nearly every advanced policy gradient method can be interpreted as a variance-reduced, geometry-aware, or constraint-stabilized extension of VPG. For this reason, it remains indispensable in graduate-level education and theoretical research, even if it is rarely deployed in real-world systems.

In summary, Vanilla Policy Gradient is not a practical algorithm to outperform others—it is a theoretical reference point. Mastering its assumptions, derivations, and limitations is essential for anyone seeking a deep understanding of reinforcement learning theory. Without VPG, modern stochastic policy optimization would lack both its mathematical grounding and its conceptual coherence.
