Reinforcement learning (RL) is like teaching a child to ride a bike—you let them try, fail, and learn from their mistakes. The CartPole-v1 environment in OpenAI Gym (now Gymnasium) is a perfect playground for this. In CartPole-v1, you train an agent to balance a pole on a moving cart, a task that’s simple but surprisingly deep. This guide will walk you through solving CartPole-v1 with RL, offering a hands-on approach for beginners and curious coders alike. Let’s jump into the world of CartPole-v1!

Introduction to CartPole-v1 and Reinforcement Learning
Think of CartPole-v1 as a digital tightrope act. A pole sits on a cart, and your job is to keep it upright by nudging the cart left or right. It’s a classic problem in the OpenAI Gym, loved for its simplicity and ability to teach RL basics. Solving CartPole-v1 means building an agent that learns to balance the pole for as long as possible, earning rewards along the way. This article will guide you step-by-step, from setting up CartPole-v1 to mastering it with a practical RL algorithm.
Understanding the CartPole-v1 Environment
Before we code, let's unpack what makes CartPole-v1 tick.
What’s the Deal with CartPole-v1?
In CartPole-v1, a cart slides on a frictionless track with a pole attached via a hinge. Your goal? Keep the pole from falling by moving the cart. It's like balancing a broom on your palm: tricky but doable with practice. CartPole-v1 is a staple in RL because it's easy to grasp yet challenges you to think strategically.
The Mechanics:
- State: CartPole-v1 gives you four numbers:
  - Cart position (where it is on the track).
  - Cart velocity (how fast it's moving).
  - Pole angle (how much it's tilting).
  - Pole angular velocity (how fast that tilt is changing).
- Actions: Two options in CartPole-v1:
  - Push the cart left.
  - Push the cart right.
- Reward: You get +1 for every timestep the pole stays upright.
- Game Over: The episode ends if:
  - The pole tilts past ±12 degrees.
  - The cart strays beyond ±2.4 units from the center.
  - You hit 500 steps (the maximum episode length).
CartPole-v1's continuous state space makes it a fun puzzle for RL algorithms.
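If you want to check these spaces for yourself, Gymnasium exposes them directly on the environment. Here's a quick sketch (assuming Gymnasium is installed, as described in the next section):
import gymnasium as gym
env = gym.make('CartPole-v1')
print(env.observation_space)       # Box with 4 values: position, velocity, angle, angular velocity
print(env.observation_space.low)   # Lower bounds for each value
print(env.observation_space.high)  # Upper bounds for each value
print(env.action_space)            # Discrete(2): 0 = push left, 1 = push right
env.close()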
Getting Ready to Tackle CartPole-v1
Let’s set up your tools.
What You Need:
- Python 3.8+: The backbone for running CartPole-v1 (Gymnasium needs a reasonably recent Python).
- Gymnasium: The maintained successor to OpenAI Gym, which includes CartPole-v1.
- Python Basics: Know your way around loops and NumPy.
- RL Knowledge: A loose grasp of agents and rewards helps.
Setup Steps:
1. Install Gymnasium for CartPole-v1:
pip install gymnasium
2. (Optional) Use a virtual environment:
python -m venv myenv
source myenv/bin/activate # Windows: myenv\Scripts\activate
pip install gymnasium
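To confirm the install worked, a quick one-line check (run inside the environment you just activated):
python -c "import gymnasium; print(gymnasium.__version__)"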
You're now ready to dive into CartPole-v1!
Playing with the CartPole-v1 Environment
Let's see CartPole-v1 in action with a simple script.
First Experiment:
Here's code to run CartPole-v1 with random moves:
import gymnasium as gym
env = gym.make('CartPole-v1', render_mode='human')  # 'human' mode opens a window and renders every step
state, info = env.reset()
for _ in range(100):
    action = env.action_space.sample()  # Pick left or right randomly
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:  # The pole fell or the episode hit its limit
        state, info = env.reset()
env.close()
What Happens:
- The cart wobbles back and forth chaotically.
- The pole falls fast; random actions don't work.
- Episodes typically last only a few dozen steps, far from the 500-step maximum (the sketch below measures this).
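To put a rough number on that, you can average episode lengths over many random-policy runs. Here's a minimal sketch (no rendering, so it runs quickly):
import gymnasium as gym
import numpy as np
env = gym.make('CartPole-v1')
lengths = []
for _ in range(100):
    env.reset()
    steps, done = 0, False
    while not done:
        _, _, terminated, truncated, _ = env.step(env.action_space.sample())
        done = terminated or truncated
        steps += 1
    lengths.append(steps)
env.close()
print(f"Average random-policy episode length: {np.mean(lengths):.1f} steps")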
This shows CartPole-v1 needs a clever RL solution.
Reinforcement Learning for CartPole-v1
To conquer CartPole-v1, we'll use Q-Learning, a beginner-friendly RL method.
RL Options:
- Q-Learning: Tracks action values in a lookup table; our pick for CartPole-v1.
- Deep Q-Networks (DQN): Uses neural networks for tougher tasks.
- Policy Gradients: Learns actions directly.
Since CartPole-v1 has continuous states, we'll discretize them into bins so Q-Learning can work with a finite table.
Q-Learning Plan for CartPole-v1:
- Split states into bins.
- Create a Q-table for action values.
- Explore (try random moves) and exploit (use learned moves).
- Update the Q-table with the observed rewards (see the update rule below).
- Train over many episodes.
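The update in step 4 is the standard Q-Learning rule, which nudges each stored value toward the reward plus the discounted value of the best next action:
Q(s, a) ← Q(s, a) + α × [r + γ × max Q(s', a') - Q(s, a)]
Here α is the learning rate, γ the discount factor, s' the next state, and the max is taken over the next actions a'. You'll see this exact expression in the training loop below.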
Let’s build it.
Coding a Q-Learning Solution for CartPole-v1
Here's how to solve CartPole-v1 with Q-Learning, step by step.
Step 1: Discretize States
Turn continuous states into bins:
import numpy as np
bins = [12, 12, 12, 12]  # Number of bins per state variable; more bins = finer resolution
def discretize_state(state, bins):
    state_disc = []
    for i in range(len(state)):
        # bins[i] - 1 edges give bin indices in range(bins[i]),
        # so they fit the Q-table dimensions defined in Step 2
        edges = np.linspace(-1.2, 1.2, bins[i] - 1)
        state_disc.append(np.digitize(state[i], edges))
    return tuple(state_disc)
Tip: Tweak the number of bins, and consider widening the ±1.2 value range per state variable (cart position can reach ±2.4), for better CartPole-v1 results.
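A quick sanity check on an arbitrary made-up state (values chosen purely for illustration):
print(discretize_state([0.0, 0.1, 0.02, -0.1], bins))  # A tuple of four bin indices, one per state variable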
Step 2: Initialize
Set up CartPole-v1 and the Q-table:
import gymnasium as gym
env = gym.make('CartPole-v1')  # No render_mode here: training runs much faster without a window
q_table = np.zeros([*bins, env.action_space.n])  # Shape (12, 12, 12, 12, 2): one value per (state bin, action) pair
Step 3: Parameters
Define learning settings:
alpha = 0.08           # Learning rate
gamma = 0.98           # Discount factor for future rewards
epsilon = 1.0          # Exploration rate (starts fully random)
epsilon_decay = 0.99   # Multiplied into epsilon after every episode
min_epsilon = 0.02     # Floor so the agent never stops exploring entirely
episodes = 1200        # Number of training episodes
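With these settings, exploration fades fairly quickly. A quick arithmetic check shows roughly where epsilon sits over training (assuming it decays once per episode, as in the loop below):
for n in (100, 300, 600, 1200):
    print(n, round(max(min_epsilon, 1.0 * epsilon_decay ** n), 3))
# Roughly 0.366 after 100 episodes and 0.049 after 300, then clamped at the 0.02 floor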
Step 4: Train
Run the training loop for CartPole-v1:
rewards = []  # Per-episode totals, used for the learning-curve plot later
for episode in range(episodes):
    state, _ = env.reset()  # Gymnasium's reset() returns (observation, info)
    state = discretize_state(state, bins)
    done = False
    total_reward = 0
    while not done:
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(q_table[state])  # Exploit the best known action
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = discretize_state(next_obs, bins)
        # Q-Learning update: move the estimate toward reward + discounted best next value
        q_table[state][action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state][action]
        )
        state = next_state
        total_reward += reward
    epsilon = max(min_epsilon, epsilon * epsilon_decay)
    rewards.append(total_reward)
    print(f"Episode {episode + 1}: Reward = {total_reward}")
env.close()
Step 5: Test
Check your CartPole-v1 agent:
env = gym.make('CartPole-v1', render_mode='human')  # Fresh env with rendering so you can watch the agent
state, _ = env.reset()
state = discretize_state(state, bins)
done = False
total_reward = 0
while not done:
    action = np.argmax(q_table[state])  # Greedy: always take the best known action
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    state = discretize_state(obs, bins)
    total_reward += reward
print(f"Test Reward: {total_reward}")
env.close()
Evaluating Your CartPole-v1 Agent
Let's see how your CartPole-v1 solution performs.
Metrics:
- Reward Growth: Steadily rising episode rewards show the agent is learning.
- Stability: Consistently hitting 500-step episodes is the goal; a quick way to check this is sketched below.
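One simple stability check is to run a handful of greedy (no-exploration) episodes and average the scores. A minimal sketch, assuming q_table, bins, and discretize_state from the steps above are still in scope:
import gymnasium as gym
import numpy as np
def run_greedy_episode(env):
    state, _ = env.reset()
    state = discretize_state(state, bins)
    total, done = 0, False
    while not done:
        action = np.argmax(q_table[state])  # Always take the best known action
        obs, reward, terminated, truncated, _ = env.step(action)
        state = discretize_state(obs, bins)
        total += reward
        done = terminated or truncated
    return total
env = gym.make('CartPole-v1')
scores = [run_greedy_episode(env) for _ in range(20)]
env.close()
print(f"Mean greedy reward over 20 episodes: {np.mean(scores):.1f}")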
Plotting Progress:
Visualize training with a plot:
import matplotlib.pyplot as plt
# 'rewards' holds one total per episode, appended in the training loop above
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('CartPole-v1 Learning Curve')
plt.show()
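Raw per-episode totals are noisy, so a simple moving average often makes the trend easier to read. A small sketch, assuming rewards has at least window entries:
import numpy as np
import matplotlib.pyplot as plt
window = 50
smoothed = np.convolve(rewards, np.ones(window) / window, mode='valid')
plt.plot(smoothed)
plt.xlabel('Episode')
plt.ylabel(f'Mean reward over last {window} episodes')
plt.title('Smoothed CartPole-v1 Learning Curve')
plt.show()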
Fixes:
- Low Scores: Train longer or adjust alpha/gamma.
- Unstable: Increase the number of bins or slow the epsilon decay so the agent explores longer.
Next Steps After CartPole-v1
Nailed CartPole-v1? Try these:
- DQN: Neural networks that handle CartPole-v1's continuous states without discretization.
- Policy Gradients: Direct action optimization.
- New Environments: Explore MountainCar or LunarLander.
Each builds on your CartPole-v1 experience.
Conclusion
Solving CartPole-v1 is a big win in RL. You've set up the environment, coded a Q-Learning agent, and learned to refine it. Keep experimenting: tweak parameters, try new algorithms, and tackle more Gymnasium challenges to grow your RL skills.
FAQs
What’s the goal in CartPole-v1?
Balance a pole on a cart for as long as possible by moving the cart left or right.
Why start with CartPole-v1?
CartPole-v1 is simple but teaches RL essentials like states, actions, and rewards.
Which algorithms work for CartPole-v1?
Q-Learning, DQN, and Policy Gradients are all good fits for CartPole-v1.
How do I know my CartPole-v1 agent is good?
Consistent rewards near 500 show mastery of CartPole-v1.
What if my CartPole-v1 agent struggles?
Train longer, adjust the number of bins, or fine-tune the learning rate and exploration settings.