Reinforcement learning (RL) is like teaching a child to ride a bike—you let them try, fail, and learn from their mistakes. The CartPole-v1 environment in OpenAI Gym (now Gymnasium) is a perfect playground for this. In CartPole-v1, you train an agent to balance a pole on a moving cart, a task that’s simple but surprisingly deep. This guide will walk you through solving CartPole-v1 with RL, offering a hands-on approach for beginners and curious coders alike. Let’s jump into the world of CartPole-v1!

Introduction to CartPole-v1 and Reinforcement Learning
Think of CartPole-v1 as a digital tightrope act. A pole sits on a cart, and your job is to keep it upright by nudging the cart left or right. It’s a classic problem in the OpenAI Gym, loved for its simplicity and ability to teach RL basics. Solving CartPole-v1 means building an agent that learns to balance the pole for as long as possible, earning rewards along the way. This article will guide you step-by-step, from setting up CartPole-v1 to mastering it with a practical RL algorithm.
Understanding the CartPole-v1 Environment
Before we code, let's unpack what makes CartPole-v1 tick.
What’s the Deal with CartPole-v1?
In CartPole-v1, a cart slides on a frictionless track with a pole attached via a hinge. Your goal? Keep the pole from falling by moving the cart. It's like balancing a broom on your palm: tricky but doable with practice. CartPole-v1 is a staple in RL because it's easy to grasp yet challenges you to think strategically.
The Mechanics:
- State: CartPole-v1 gives you four numbers:
  - Cart position (where it is on the track).
  - Cart velocity (how fast it's moving).
  - Pole angle (how much it's tilting).
  - Pole angular velocity (how fast that tilt is changing).
- Actions: Two options in CartPole-v1:
  - Push the cart left.
  - Push the cart right.
- Reward: You get +1 for every timestep the pole stays upright.
- Game Over: The episode ends if:
  - The pole tilts past ±12 degrees.
  - The cart strays beyond ±2.4 units from the center.
  - You hit 500 steps (the maximum episode length).
CartPole-v1's continuous state space makes it a fun puzzle for RL algorithms.
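If you want to check these spaces for yourself, Gymnasium exposes them directly on the environment. Here's a quick sketch (assuming Gymnasium is installed, as described in the next section):
import gymnasium as gym
env = gym.make('CartPole-v1')
print(env.observation_space)       # Box with 4 values: position, velocity, angle, angular velocity
print(env.observation_space.low)   # Lower bounds for each value
print(env.observation_space.high)  # Upper bounds for each value
print(env.action_space)            # Discrete(2): 0 = push left, 1 = push right
env.close()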
Getting Ready to Tackle CartPole-v1
Let’s set up your tools.
What You Need:
- Python 3.8+: The backbone for running CartPole-v1 (Gymnasium needs a reasonably recent Python).
- Gymnasium: The maintained successor to OpenAI Gym, which includes CartPole-v1.
- Python Basics: Know your way around loops and NumPy.
- RL Knowledge: A loose grasp of agents and rewards helps.
Setup Steps:
1. Install Gymnasium for CartPole-v1:
pip install gymnasium
2. (Optional) Use a virtual environment:
python -m venv myenv
source myenv/bin/activate # Windows: myenv\Scripts\activate
pip install gymnasium
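To confirm the install worked, a quick one-line check (run inside the environment you just activated):
python -c "import gymnasium; print(gymnasium.__version__)"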
You're now ready to dive into CartPole-v1!
Playing with the CartPole-v1 Environment
Let's see CartPole-v1 in action with a simple script.
First Experiment:
Here's code to run CartPole-v1 with random moves:
import gymnasium as gym
env = gym.make('CartPole-v1', render_mode='human')  # 'human' mode opens a window and renders every step
state, info = env.reset()
for _ in range(100):
    action = env.action_space.sample()  # Pick left or right randomly
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:  # The pole fell or the episode hit its limit
        state, info = env.reset()
env.close()
What Happens:
- The cart wobbles back and forth chaotically.
- The pole falls fast; random actions don't work.
- Episodes typically last only a few dozen steps, far from the 500-step maximum (the sketch below measures this).
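To put a rough number on that, you can average episode lengths over many random-policy runs. Here's a minimal sketch (no rendering, so it runs quickly):
import gymnasium as gym
import numpy as np
env = gym.make('CartPole-v1')
lengths = []
for _ in range(100):
    env.reset()
    steps, done = 0, False
    while not done:
        _, _, terminated, truncated, _ = env.step(env.action_space.sample())
        done = terminated or truncated
        steps += 1
    lengths.append(steps)
env.close()
print(f"Average random-policy episode length: {np.mean(lengths):.1f} steps")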
This shows CartPole-v1 needs a clever RL solution.
Reinforcement Learning for CartPole-v1
To conquer CartPole-v1, we'll use Q-Learning, a beginner-friendly RL method.
RL Options:
- Q-Learning: Tracks action values in a lookup table; our pick for CartPole-v1.
- Deep Q-Networks (DQN): Uses neural networks for tougher tasks.
- Policy Gradients: Learns actions directly.
Since CartPole-v1 has continuous states, we'll discretize them into bins so Q-Learning can work with a finite table.
Q-Learning Plan for CartPole-v1:
- Split states into bins.
- Create a Q-table for action values.
- Explore (try random moves) and exploit (use learned moves).
- Update the Q-table with the observed rewards (see the update rule below).
- Train over many episodes.
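The update in step 4 is the standard Q-Learning rule, which nudges each stored value toward the reward plus the discounted value of the best next action:
Q(s, a) ← Q(s, a) + α × [r + γ × max Q(s', a') - Q(s, a)]
Here α is the learning rate, γ the discount factor, s' the next state, and the max is taken over the next actions a'. You'll see this exact expression in the training loop below.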
Let’s build it.
Coding a Q-Learning Solution for CartPole-v1
Here's how to solve CartPole-v1 with Q-Learning, step by step.
Step 1: Discretize States
Turn continuous states into bins:
import numpy as np
bins = [12, 12, 12, 12]  # Number of bins per state variable; more bins = finer resolution
def discretize_state(state, bins):
    state_disc = []
    for i in range(len(state)):
        # bins[i] - 1 edges give bin indices in range(bins[i]),
        # so they fit the Q-table dimensions defined in Step 2
        edges = np.linspace(-1.2, 1.2, bins[i] - 1)
        state_disc.append(np.digitize(state[i], edges))
    return tuple(state_disc)
Tip: Tweak the number of bins, and consider widening the ±1.2 value range per state variable (cart position can reach ±2.4), for better CartPole-v1 results.
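A quick sanity check on an arbitrary made-up state (values chosen purely for illustration):
print(discretize_state([0.0, 0.1, 0.02, -0.1], bins))  # A tuple of four bin indices, one per state variable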
Step 2: Initialize
Set up CartPole-v1 and the Q-table:
import gymnasium as gym
env = gym.make('CartPole-v1')  # No render_mode here: training runs much faster without a window
q_table = np.zeros([*bins, env.action_space.n])  # Shape (12, 12, 12, 12, 2): one value per (state bin, action) pair
Step 3: Parameters
Define learning settings:
alpha = 0.08           # Learning rate
gamma = 0.98           # Discount factor for future rewards
epsilon = 1.0          # Exploration rate (starts fully random)
epsilon_decay = 0.99   # Multiplied into epsilon after every episode
min_epsilon = 0.02     # Floor so the agent never stops exploring entirely
episodes = 1200        # Number of training episodes
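With these settings, exploration fades fairly quickly. A quick arithmetic check shows roughly where epsilon sits over training (assuming it decays once per episode, as in the loop below):
for n in (100, 300, 600, 1200):
    print(n, round(max(min_epsilon, 1.0 * epsilon_decay ** n), 3))
# Roughly 0.366 after 100 episodes and 0.049 after 300, then clamped at the 0.02 floor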
Step 4: Train
Run the training loop for CartPole-v1:
rewards = []  # Per-episode totals, used for the learning-curve plot later
for episode in range(episodes):
    state, _ = env.reset()  # Gymnasium's reset() returns (observation, info)
    state = discretize_state(state, bins)
    done = False
    total_reward = 0
    while not done:
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(q_table[state])  # Exploit the best known action
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = discretize_state(next_obs, bins)
        # Q-Learning update: move the estimate toward reward + discounted best next value
        q_table[state][action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state][action]
        )
        state = next_state
        total_reward += reward
    epsilon = max(min_epsilon, epsilon * epsilon_decay)
    rewards.append(total_reward)
    print(f"Episode {episode + 1}: Reward = {total_reward}")
env.close()
Step 5: Test
Check your CartPole-v1 agent:
env = gym.make('CartPole-v1', render_mode='human')  # Fresh env with rendering so you can watch the agent
state, _ = env.reset()
state = discretize_state(state, bins)
done = False
total_reward = 0
while not done:
    action = np.argmax(q_table[state])  # Greedy: always take the best known action
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    state = discretize_state(obs, bins)
    total_reward += reward
print(f"Test Reward: {total_reward}")
env.close()
Evaluating Your CartPole-v1 Agent
Let's see how your CartPole-v1 solution performs.
Metrics:
- Reward Growth: Steadily rising episode rewards show the agent is learning.
- Stability: Consistently hitting 500-step episodes is the goal; a quick way to check this is sketched below.
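One simple stability check is to run a handful of greedy (no-exploration) episodes and average the scores. A minimal sketch, assuming q_table, bins, and discretize_state from the steps above are still in scope:
import gymnasium as gym
import numpy as np
def run_greedy_episode(env):
    state, _ = env.reset()
    state = discretize_state(state, bins)
    total, done = 0, False
    while not done:
        action = np.argmax(q_table[state])  # Always take the best known action
        obs, reward, terminated, truncated, _ = env.step(action)
        state = discretize_state(obs, bins)
        total += reward
        done = terminated or truncated
    return total
env = gym.make('CartPole-v1')
scores = [run_greedy_episode(env) for _ in range(20)]
env.close()
print(f"Mean greedy reward over 20 episodes: {np.mean(scores):.1f}")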
Plotting Progress:
Visualize training with a plot:
import matplotlib.pyplot as plt
# 'rewards' holds one total per episode, appended in the training loop above
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('CartPole-v1 Learning Curve')
plt.show()
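Raw per-episode totals are noisy, so a simple moving average often makes the trend easier to read. A small sketch, assuming rewards has at least window entries:
import numpy as np
import matplotlib.pyplot as plt
window = 50
smoothed = np.convolve(rewards, np.ones(window) / window, mode='valid')
plt.plot(smoothed)
plt.xlabel('Episode')
plt.ylabel(f'Mean reward over last {window} episodes')
plt.title('Smoothed CartPole-v1 Learning Curve')
plt.show()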
Fixes:
- Low Scores: Train longer or adjust alpha/gamma.
- Unstable: Increase the number of bins or slow the epsilon decay so the agent explores longer.
Next Steps After CartPole-v1
Nailed CartPole-v1? Try these:
- DQN: Neural networks that handle CartPole-v1's continuous states without discretization.
- Policy Gradients: Direct action optimization.
- New Environments: Explore MountainCar or LunarLander.
Each builds on your CartPole-v1 experience.
Conclusion
Solving CartPole-v1 is a big win in RL. You've set up the environment, coded a Q-Learning agent, and learned to refine it. Keep experimenting: tweak parameters, try new algorithms, and tackle more Gymnasium challenges to grow your RL skills.
FAQs
What’s the goal in CartPole-v1?
Balance a pole on a cart for as long as possible by moving the cart left or right.
Why start with CartPole-v1?
CartPole-v1 is simple but teaches RL essentials like states, actions, and rewards.
Which algorithms work for CartPole-v1?
Q-Learning, DQN, and Policy Gradients are all good fits for CartPole-v1.
How do I know my CartPole-v1 agent is good?
Consistent rewards near 500 show mastery of CartPole-v1.
What if my CartPole-v1 agent struggles?
Train longer, adjust the number of bins, or fine-tune the learning rate and exploration settings.