Welcome to Stable Baselines docs! - RL Baselines Made Easy¶
Stable Baselines is a set of improved implementations of Reinforcement Learning (RL) algorithms based on OpenAI Baselines.
Github repository: https://github.com/hill-a/stable-baselines
You can read a detailed presentation of Stable Baselines in the Medium article: link
Main differences with OpenAI Baselines¶
This toolset is a fork of OpenAI Baselines, with a major structural refactoring, and code cleanups:
- Unified structure for all algorithms
- PEP8 compliant (unified code style)
- Documented functions and classes
- More tests & more code coverage
Installation¶
Prerequisites¶
Baselines requires python3 (>=3.5) with the development headers. You’ll also need system packages CMake, OpenMPI and zlib. Those can be installed as follows
Ubuntu¶
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev
Stable Release¶
pip install stable-baselines
Bleeding-edge version¶
With support for running tests and building the documentation.
git clone https://github.com/hill-a/stable-baselines && cd stable-baselines
pip install -e .[docs,tests]
Using Docker Images¶
Use Built Images¶
GPU image (requires nvidia-docker):
docker pull araffin/stable-baselines
CPU only:
docker pull araffin/stable-baselines-cpu
Build the Docker Images¶
Build GPU image (with nvidia-docker):
docker build . -f docker/Dockerfile.gpu -t stable-baselines
Build CPU image:
docker build . -f docker/Dockerfile.cpu -t stable-baselines-cpu
Note: if you are using a proxy, you need to pass extra params during build and do some tweaks:
--network=host --build-arg HTTP_PROXY=http://your.proxy.fr:8080/ --build-arg http_proxy=http://your.proxy.fr:8080/ --build-arg HTTPS_PROXY=https://your.proxy.fr:8080/ --build-arg https_proxy=https://your.proxy.fr:8080/
Run the images (CPU/GPU)¶
Run the nvidia-docker GPU image
docker run -it --runtime=nvidia --rm --network host --ipc=host --name test --mount src="$(pwd)",target=/root/code/stable-baselines,type=bind araffin/stable-baselines bash -c 'cd /root/code/stable-baselines/ && pytest tests/'
Or, with the shell file:
./run_docker_gpu.sh pytest tests/
Run the docker CPU image
docker run -it --rm --network host --ipc=host --name test --mount src="$(pwd)",target=/root/code/stable-baselines,type=bind araffin/stable-baselines-cpu bash -c 'cd /root/code/stable-baselines/ && pytest tests/'
Or, with the shell file:
./run_docker_cpu.sh pytest tests/
Explanation of the docker command:
- docker run -it: create an instance of an image (a container) and run it interactively (so Ctrl+C will work)
- --rm: remove the container once it exits/stops (otherwise, you will have to use docker rm)
- --network host: don't use network isolation; this allows the use of tensorboard/visdom on the host machine
- --ipc=host: use the host system's IPC namespace. The IPC (POSIX/SysV IPC) namespace provides separation of named shared memory segments, semaphores and message queues.
- --name test: explicitly give the name test to the container (otherwise it will be assigned a random name)
- --mount src=...: give the container access to the local directory (the pwd command), mapped to /root/code/stable-baselines, so all the logs created in this folder inside the container will be kept
- bash -c '...': the command run inside the docker image, here running the tests (pytest tests/)
Getting Started¶
Most of the library tries to follow a sklearn-like syntax for the Reinforcement Learning algorithms.
Here is a quick example of how to train and run PPO2 on a cartpole environment:
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2
env = gym.make('CartPole-v1')
env = DummyVecEnv([lambda: env]) # The algorithms require a vectorized environment to run
model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Or just train a model with a one-liner if the environment is registered in Gym and if the policy is registered:
from stable_baselines import PPO2
model = PPO2('MlpPolicy', 'CartPole-v1').learn(10000)

Define and train a RL agent in one line of code!
RL Algorithms¶
This table displays the RL algorithms that are implemented in the Stable Baselines project, along with some useful characteristics: support for recurrent policies, discrete/continuous actions, and multiprocessing.
Name | Refactored [1] | Recurrent | Box | Discrete | Multi Processing |
---|---|---|---|---|---|
A2C | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
ACER | ✔️ | ✔️ | ❌ [5] | ✔️ | ✔️ |
ACKTR | ✔️ | ✔️ | ❌ [5] | ✔️ | ✔️ |
DDPG | ✔️ | ✔️ | ✔️ | ❌ | ❌ |
DQN | ✔️ | ❌ | ❌ | ✔️ | ❌ |
GAIL [2] | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ [4] |
PPO1 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ [4] |
PPO2 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
TRPO | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ [4] |
[1] | Whether or not the algorithm has been refactored to fit the BaseRLModel class. |
[2] | Only implemented for TRPO. |
[3] | Only implemented for DDPG. |
[4] | (1, 2, 3) Multi Processing with MPI. |
[5] | (1, 2) TODO, in project scope. |
Actions gym.spaces:
- Box: An N-dimensional box that contains every point in the action space.
- Discrete: A list of possible actions, where each timestep only one of the actions can be used.
- MultiDiscrete: A list of possible actions, where each timestep only one action of each discrete set can be used.
- MultiBinary: A list of possible actions, where each timestep any of the actions can be used in any combination.
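As a brief illustration (a minimal sketch; the sizes below are arbitrary examples), these spaces can be constructed with gym.spaces as follows:
import numpy as np
from gym import spaces
# A 2-dimensional continuous box with values in [-1, 1]
box = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
# Three mutually exclusive discrete actions: 0, 1 or 2
discrete = spaces.Discrete(3)
# Two independent discrete sets, with 3 and 2 possible values respectively
multi_discrete = spaces.MultiDiscrete([3, 2])
# Four binary switches that can be combined freely
multi_binary = spaces.MultiBinary(4)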
Examples¶
Try it online with Colab Notebooks!¶
All the following examples can be executed online using Google colab
notebooks:
- Getting Started
- Training, Saving, Loading
- Multiprocessing
- Monitor Training and Plotting
- Atari Games
- Breakout (trained agent included)
Basic Usage: Training, Saving, Loading¶
In the following example, we will train, save and load an A2C model on the Lunar Lander environment.
Lunar Lander Environment
Note
LunarLander requires the python package box2d. You can install it using apt install swig and then pip install box2d box2d-kengz.
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C
# Create and wrap the environment
env = gym.make('LunarLander-v2')
env = DummyVecEnv([lambda: env])
model = A2C(MlpPolicy, env, ent_coef=0.1, verbose=1)
# Train the agent
model.learn(total_timesteps=100000)
# Save the agent
model.save("a2c_lunar")
del model # delete trained model to demonstrate loading
# Load the trained agent
model = A2C.load("a2c_lunar")
# Enjoy trained agent
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Multiprocessing: Unleashing the Power of Vectorized Environments¶
CartPole Environment
import gym
import numpy as np
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines.common import set_global_seeds
from stable_baselines import ACKTR
def make_env(env_id, rank, seed=0):
    """
    Utility function for multiprocessed env.

    :param env_id: (str) the environment ID
    :param seed: (int) the initial seed for RNG
    :param rank: (int) index of the subprocess
    """
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    set_global_seeds(seed)
    return _init
env_id = "CartPole-v1"
num_cpu = 4 # Number of processes to use
# Create the vectorized environment
env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])
model = ACKTR(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Using Callback: Monitoring Training¶
You can define a custom callback function that will be called inside the agent. This can be useful when you want to monitor training, for instance to display live learning curves in Tensorboard (or in Visdom) or to save the best agent.
Learning curve of DDPG on LunarLanderContinuous environment
import os
import gym
import numpy as np
import matplotlib.pyplot as plt
from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from stable_baselines.bench import Monitor
from stable_baselines.results_plotter import load_results, ts2xy
from stable_baselines import DDPG
from stable_baselines.ddpg.noise import AdaptiveParamNoiseSpec
best_mean_reward, n_steps = -np.inf, 0
def callback(_locals, _globals):
    """
    Callback called at each step (for DQN and others) or after n steps (see ACER or PPO2)

    :param _locals: (dict)
    :param _globals: (dict)
    """
    global n_steps, best_mean_reward
    # Print stats every 1000 calls
    if (n_steps + 1) % 1000 == 0:
        # Evaluate policy performance
        x, y = ts2xy(load_results(log_dir), 'timesteps')
        if len(x) > 0:
            mean_reward = np.mean(y[-100:])
            print(x[-1], 'timesteps')
            print("Best mean reward: {:.2f} - Last mean reward per episode: {:.2f}".format(best_mean_reward, mean_reward))

            # New best model, you could save the agent here
            if mean_reward > best_mean_reward:
                best_mean_reward = mean_reward
                # Example for saving best model
                print("Saving new best model")
                _locals['self'].save(log_dir + 'best_model.pkl')
    n_steps += 1
    return False
# Create log dir
log_dir = "/tmp/gym/"
os.makedirs(log_dir, exist_ok=True)
# Create and wrap the environment
env = gym.make('LunarLanderContinuous-v2')
env = Monitor(env, log_dir, allow_early_resets=True)
env = DummyVecEnv([lambda: env])
# Add some param noise for exploration
param_noise = AdaptiveParamNoiseSpec(initial_stddev=0.2, desired_action_stddev=0.2)
model = DDPG(MlpPolicy, env, param_noise=param_noise, memory_limit=int(1e6), verbose=0)
# Train the agent
model.learn(total_timesteps=200000, callback=callback)
Atari Games¶
Trained A2C agent on Breakout
Pong Environment
Training an RL agent on Atari games is straightforward thanks to the make_atari_env helper function. It will do all the preprocessing and multiprocessing for you.
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.policies import CnnPolicy
from stable_baselines.common.vec_env import VecFrameStack
from stable_baselines import ACER
# There already exists an environment generator
# that will make and wrap atari environments correctly.
# Here we are also multiprocessing training (num_env=4 => 4 processes)
env = make_atari_env('PongNoFrameskip-v4', num_env=4, seed=0)
# Frame-stacking with 4 frames
env = VecFrameStack(env, n_stack=4)
model = ACER(CnnPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Mujoco: Normalizing input features¶
Normalizing input features may be essential to successful training of an RL agent (by default, images are scaled but not other types of input), for instance when training on Mujoco. For that, a wrapper exists and will compute a running average and standard deviation of input features (it can do the same for rewards).
Note
We cannot provide a notebook for this example because Mujoco is a proprietary engine and requires a license.
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines import PPO2
env = DummyVecEnv([lambda: gym.make("Reacher-v2")])
# Automatically normalize the input features
env = VecNormalize(env, norm_obs=True, norm_reward=False,
                   clip_obs=10.)
model = PPO2(MlpPolicy, env)
model.learn(total_timesteps=2000)
# Don't forget to save the running average when saving the agent
log_dir = "/tmp/"
model.save(log_dir + "ppo_reacher")
env.save_running_average(log_dir)
Custom Policy Network¶
Stable Baselines provides default policy networks for images (CNNPolicies) and other types of inputs (MlpPolicies). However, you can also easily define a custom architecture for the policy network (see the custom policy section):
import gym
from stable_baselines.common.policies import FeedForwardPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C
# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           layers=[128, 128, 128],
                                           feature_extraction="mlp")
# Create and wrap the environment
env = gym.make('LunarLander-v2')
env = DummyVecEnv([lambda: env])
model = A2C(CustomPolicy, env, verbose=1)
# Train the agent
model.learn(total_timesteps=100000)
Continual Learning¶
You can also move from learning on one environment to another for continual learning (PPO2 on DemonAttack-v0, then transferred to SpaceInvaders-v0):
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.policies import CnnPolicy
from stable_baselines import PPO2
# There already exists an environment generator
# that will make and wrap atari environments correctly
env = make_atari_env('DemonAttackNoFrameskip-v4', num_env=8, seed=0)
model = PPO2(CnnPolicy, env, verbose=1)
model.learn(total_timesteps=10000)
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
# The number of environments must be identical when changing environments
env = make_atari_env('SpaceInvadersNoFrameskip-v4', num_env=8, seed=0)
# change env
model.set_env(env)
model.learn(total_timesteps=10000)
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Bonus: Make a GIF of a Trained Agent¶
Note
For Atari games, you need to use a screen recorder such as Kazam, and then convert the video using ffmpeg.
import imageio
import numpy as np
from stable_baselines.common.policies import MlpPolicy
from stable_baselines import A2C
model = A2C(MlpPolicy, "LunarLander-v2").learn(100000)
images = []
obs = model.env.reset()
img = model.env.render(mode='rgb_array')
for i in range(350):
    images.append(img)
    action, _ = model.predict(obs)
    obs, _, _, _ = model.env.step(action)
    img = model.env.render(mode='rgb_array')
imageio.mimsave('lander_a2c.gif', [np.array(img[0]) for i, img in enumerate(images) if i%2 == 0], fps=29)
Vectorized Environments¶
Vectorized Environments are a way to multiprocess training. Instead of training an RL agent on 1 environment, it allows you to train it on n environments using n processes. Because of that, actions passed to the environment are now a vector (of dimension n). It is the same for observations, rewards and end of episode signals (dones).
Note
Vectorized environments are required when using wrappers for frame-stacking or normalization.
Note
When using vectorized environments, the environments are automatically reset at the end of each episode.
Warning
It seems that Windows users are experiencing issues with SubprocVecEnv. We recommend using the docker image in that case (see Issue #42).
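As noted above, frame-stacking and normalization wrappers require a vectorized environment. A minimal sketch (using CartPole-v1 as an arbitrary example) of wrapping a single environment so that such wrappers can be applied:
import gym
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize
# A single environment still has to be wrapped in a (dummy) vectorized environment
env = DummyVecEnv([lambda: gym.make('CartPole-v1')])
# Only then can a wrapper such as VecNormalize be applied on top of it
env = VecNormalize(env, norm_obs=True, norm_reward=True)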
DummyVecEnv¶
-
class
stable_baselines.common.vec_env.
DummyVecEnv
(env_fns)[source]¶ Creates a simple vectorized wrapper for multiple environments
Parameters: env_fns – ([Gym Environment]) the list of environments to vectorize -
env_method
(method_name, *method_args, **method_kwargs)[source]¶ Provides an interface to call arbitrary class methods of vectorized environments
Parameters: - method_name – (str) The name of the env class method to invoke
- method_args – (tuple) Any positional arguments to provide in the call
- method_kwargs – (dict) Any keyword arguments to provide in the call
Returns: (list) List of items returned by the environment’s method call
-
get_attr
(attr_name)[source]¶ Provides a mechanism for getting class attributes from vectorized environments
Parameters: attr_name – (str) The name of the attribute whose value to return Returns: (list) List of values of ‘attr_name’ in all environments
-
render
(*args, **kwargs)[source]¶ Gym environment rendering
Parameters: mode – (str) the rendering type
-
reset
()[source]¶ Reset all the environments and return an array of observations, or a tuple of observation arrays.
If step_async is still doing work, that work will be cancelled and step_wait() should not be called until step_async() is invoked again.
Returns: ([int] or [float]) observation
-
set_attr
(attr_name, value, indices=None)[source]¶ Provides a mechanism for setting arbitrary class attributes inside vectorized environments
Parameters: - attr_name – (str) Name of attribute to assign new value
- value – (obj) Value to assign to ‘attr_name’
- indices – (list,int) Indices of envs to assign value
Returns: (list) in case env access methods might return something, they will be returned in a list
-
SubprocVecEnv¶
-
class
stable_baselines.common.vec_env.
SubprocVecEnv
(env_fns)[source]¶ Creates a multiprocess vectorized wrapper for multiple environments
Parameters: env_fns – ([Gym Environment]) Environments to run in subprocesses -
env_method
(method_name, *method_args, **method_kwargs)[source]¶ Provides an interface to call arbitrary class methods of vectorized environments
Parameters: - method_name – (str) The name of the env class method to invoke
- method_args – (tuple) Any positional arguments to provide in the call
- method_kwargs – (dict) Any keyword arguments to provide in the call
Returns: (list) List of items returned by each environment’s method call
-
get_attr
(attr_name)[source]¶ Provides a mechanism for getting class attributes from vectorized environments (note: attribute value returned must be picklable)
Parameters: attr_name – (str) The name of the attribute whose value to return Returns: (list) List of values of ‘attr_name’ in all environments
-
render
(mode='human', *args, **kwargs)[source]¶ Gym environment rendering
Parameters: mode – (str) the rendering type
-
reset
()[source]¶ Reset all the environments and return an array of observations, or a tuple of observation arrays.
If step_async is still doing work, that work will be cancelled and step_wait() should not be called until step_async() is invoked again.
Returns: ([int] or [float]) observation
-
set_attr
(attr_name, value, indices=None)[source]¶ Provides a mechanism for setting arbitrary class attributes inside vectorized environments (note: this is a broadcast of a single value to all instances) (note: the value must be picklable)
Parameters: - attr_name – (str) Name of attribute to assign new value
- value – (obj) Value to assign to ‘attr_name’
- indices – (list,tuple) Iterable containing indices of envs whose attr to set
Returns: (list) in case env access methods might return something, they will be returned in a list
-
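A rough usage sketch of these helpers (the attribute and method names below, spec and seed, are just examples of standard Gym members):
import gym
from stable_baselines.common.vec_env import SubprocVecEnv
env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for _ in range(2)])
# Read an attribute from every sub-environment (here the registered Gym spec)
specs = env.get_attr('spec')
# Call a method on every sub-environment (here seeding each one)
seeds = env.env_method('seed', 0)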
Wrappers¶
VecFrameStack¶
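VecFrameStack stacks the most recent observations (e.g. the last Atari frames). A minimal usage sketch, mirroring the Atari example above:
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.vec_env import VecFrameStack
env = make_atari_env('PongNoFrameskip-v4', num_env=4, seed=0)
# Stack the 4 most recent frames so the policy can infer motion
env = VecFrameStack(env, n_stack=4)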
VecNormalize¶
-
class
stable_baselines.common.vec_env.
VecNormalize
(venv, training=True, norm_obs=True, norm_reward=True, clip_obs=10.0, clip_reward=10.0, gamma=0.99, epsilon=1e-08)[source]¶ A moving average, normalizing wrapper for vectorized environments, with support for saving/loading the moving averages.
Parameters: - venv – (VecEnv) the vectorized environment to wrap
- training – (bool) Whether to update or not the moving average
- norm_obs – (bool) Whether to normalize observation or not (default: True)
- norm_reward – (bool) Whether to normalize rewards or not (default: True)
- clip_obs – (float) Max absolute value for observation
- clip_reward – (float) Max absolute value for discounted reward
- gamma – (float) discount factor
- epsilon – (float) To avoid division by zero
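A short sketch of reloading the normalization statistics at test time (assuming they were saved earlier with save_running_average, and that the matching load_running_average(path) method is available in this version):
import gym
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize
env = DummyVecEnv([lambda: gym.make("Reacher-v2")])
# Evaluation mode: do not update the moving averages, do not normalize rewards
env = VecNormalize(env, training=False, norm_reward=False)
# Restore the statistics written by save_running_average() during training
env.load_running_average("/tmp/")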
Using Custom Environments¶
To use the rl baselines with custom environments, they just need to follow the gym interface. That is to say, your environment must implement the following methods (and inherits from OpenAI Gym Class):
import gym
from gym import spaces
class CustomEnv(gym.Env):
    """Custom Environment that follows gym interface"""
    metadata = {'render.modes': ['human']}

    def __init__(self, arg1, arg2, ...):
        super(CustomEnv, self).__init__()
        # Define action and observation space
        # They must be gym.spaces objects
        # Example when using discrete actions:
        self.action_space = spaces.Discrete(N_DISCRETE_ACTIONS)
        # Example for using image as input:
        self.observation_space = spaces.Box(low=0, high=255,
                                            shape=(HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)

    def step(self, action):
        ...

    def reset(self):
        ...

    def render(self, mode='human', close=False):
        ...
Then you can define and train a RL agent with:
# Instantiate and wrap the env
env = DummyVecEnv([lambda: CustomEnv(arg1, ...)])
# Define and Train the agent
model = A2C(CnnPolicy, env).learn(total_timesteps=1000)
You can find a complete guide online on creating a custom Gym environment.
Optionally, you can also register the environment with gym, which will allow you to create the RL agent in one line (and use gym.make() to instantiate the env).
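As a sketch of what such a registration could look like (the id and entry point below are hypothetical):
from gym.envs.registration import register
# Register the custom environment under an id (hypothetical values),
# so that gym.make('CustomEnv-v0') can instantiate it afterwards
register(
    id='CustomEnv-v0',
    entry_point='my_package.envs:CustomEnv',
)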
In the project, for testing purposes, we use a custom environment named IdentityEnv
defined in this file.
An example of how to use it can be found here.
Custom Policy Network¶
Stable Baselines provides default policy networks (see Policies) for images (CNNPolicies) and other types of input features (MlpPolicies). However, you can also easily define a custom architecture for the policy (or value) network:
import gym
from stable_baselines.common.policies import FeedForwardPolicy, register_policy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C
# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           layers=[128, 128, 128],
                                           feature_extraction="mlp")
# Create and wrap the environment
env = gym.make('LunarLander-v2')
env = DummyVecEnv([lambda: env])
model = A2C(CustomPolicy, env, verbose=1)
# Train the agent
model.learn(total_timesteps=100000)
model.save("a2c_lunar")
del model
# When loading a model with a custom policy
# you MUST pass explicitly the policy when loading the saved model
model = A2C.load("a2c_lunar", policy=CustomPolicy)
Warning
When loading a model with a custom policy, you must pass the custom policy explicitly when loading the model. (cf previous example)
You can also register your policy, to help with code simplicity: you can then refer to your custom policy using a string.
import gym
from stable_baselines.common.policies import FeedForwardPolicy, register_policy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C
# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           layers=[128, 128, 128],
                                           feature_extraction="mlp")
# Register the policy, it will check that the name is not already taken
register_policy('CustomPolicy', CustomPolicy)
# Because the policy is now registered, you can pass
# a string to the agent constructor instead of passing a class
model = A2C(policy='CustomPolicy', env='LunarLander-v2', verbose=1).learn(total_timesteps=100000)
If, however, your task requires more granular control over the policy architecture, you can redefine the policy directly:
import gym
import tensorflow as tf
from stable_baselines.common.policies import ActorCriticPolicy, register_policy, nature_cnn
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C
# Custom MLP policy of three layers of size 128 each for the actor and 2 layers of 32 for the critic,
# with a nature_cnn feature extractor
class CustomPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(CustomPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256,
                                           reuse=reuse, scale=True)

        with tf.variable_scope("model", reuse=reuse):
            activ = tf.nn.relu

            extracted_features = nature_cnn(self.processed_x, **kwargs)
            extracted_features = tf.layers.flatten(extracted_features)

            pi_h = extracted_features
            for i, layer_size in enumerate([128, 128, 128]):
                pi_h = activ(tf.layers.dense(pi_h, layer_size, name='pi_fc' + str(i)))
            pi_latent = pi_h

            vf_h = extracted_features
            for i, layer_size in enumerate([32, 32]):
                vf_h = activ(tf.layers.dense(vf_h, layer_size, name='vf_fc' + str(i)))
            value_fn = tf.layers.dense(vf_h, 1, name='vf')
            vf_latent = vf_h

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None
        self._setup_init()

    def step(self, obs, state=None, mask=None):
        action, value, neglogp = self.sess.run([self.action, self._value, self.neglogp], {self.obs_ph: obs})
        return action, value, self.initial_state, neglogp

    def proba_step(self, obs, state=None, mask=None):
        return self.sess.run(self.policy_proba, {self.obs_ph: obs})

    def value(self, obs, state=None, mask=None):
        return self.sess.run(self._value, {self.obs_ph: obs})
# Create and wrap the environment
env = gym.make('Breakout-v0')
env = DummyVecEnv([lambda: env])
model = A2C(CustomPolicy, env, verbose=1)
# Train the agent
model.learn(total_timesteps=100000)
Tensorboard Integration¶
Basic Usage¶
To use Tensorboard with the rl baselines, you simply need to define a log location for the RL agent:
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C
env = gym.make('CartPole-v1')
env = DummyVecEnv([lambda: env]) # The algorithms require a vectorized environment to run
model = A2C(MlpPolicy, env, verbose=1, tensorboard_log="./a2c_cartpole_tensorboard/")
model.learn(total_timesteps=10000)
Or after loading an existing model (by default the log path is not saved):
import gym
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C
env = gym.make('CartPole-v1')
env = DummyVecEnv([lambda: env]) # The algorithms require a vectorized environment to run
model = A2C.load("./a2c_cartpole.pkl", env=env, tensorboard_log="./a2c_cartpole_tensorboard/")
model.learn(total_timesteps=10000)
You can also define a custom logging name when training (by default it is the algorithm name):
import gym
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.common.policies import MlpPolicy
from stable_baselines import A2C
env = gym.make('CartPole-v1')
env = DummyVecEnv([lambda: env]) # The algorithms require a vectorized environment to run
model = A2C(MlpPolicy, env, verbose=1, tensorboard_log="./a2c_cartpole_tensorboard/")
model.learn(total_timesteps=10000, tb_log_name="first_run")
model.learn(total_timesteps=10000, tb_log_name="second_run")
model.learn(total_timesteps=10000, tb_log_name="third_run")
Once the learn function is called, you can monitor the RL agent during or after the training, with the following bash command:
tensorboard --logdir ./a2c_cartpole_tensorboard/
You can also add past logging folders:
tensorboard --logdir ./a2c_cartpole_tensorboard/;./ppo2_cartpole_tensorboard/
It will display information such as the model graph, the episode reward, the model losses, the observation and other parameters unique to some models.
Legacy Integration¶
All the information displayed in the terminal (default logging) can also be logged in Tensorboard. For that, you need to define several environment variables:
# formats are comma-separated, but for tensorboard you only need the last one
# stdout -> terminal
export OPENAI_LOG_FORMAT='stdout,log,csv,tensorboard'
export OPENAI_LOGDIR=path/to/tensorboard/data
Then start tensorboard with:
tensorboard --logdir=$OPENAI_LOGDIR
Base RL Class¶
Common interface for all the RL algorithms
-
class
stable_baselines.common.base_class.
BaseRLModel
(policy, env, verbose=0, *, requires_vec_env, policy_base)[source]¶ The base RL model
Parameters: - policy – (BasePolicy) Policy object
- env – (Gym environment) The environment to learn from (if registered in Gym, can be str. Can be None for loading trained models)
- verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
- requires_vec_env – (bool) Does this model require a vectorized environment
- policy_base – (BasePolicy) the base policy used by this method
-
action_probability
(observation, state=None, mask=None)[source]¶ Get the model’s action probability distribution from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns: (np.ndarray) the model’s action probability distribution
-
get_env
()[source]¶ returns the current environment (can be None if not defined)
Returns: (Gym Environment) The current environment
-
learn
(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='run')[source]¶ Return a trained model.
Parameters: - total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict)) function called at every steps with state of the algorithm. It takes the local and global variables.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
Returns: (BaseRLModel) the trained model
-
classmethod
load
(load_path, env=None, **kwargs)[source]¶ Load the model from file
Parameters: - load_path – (str) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
-
predict
(observation, state=None, mask=None, deterministic=False)[source]¶ Get the model’s action from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
-
save
(save_path)[source]¶ Save the current parameters to file
Parameters: save_path – (str) the save location
Policy Networks¶
Stable Baselines provides a set of default policies that can be used with most action spaces. If you need more control over the policy architecture, you can also create a custom policy (see Custom Policy Network).
Note
CnnPolicies are for images only. MlpPolicies are made for other types of features (e.g. robot joints).
Warning
For all algorithms (except DDPG), continuous actions are only clipped during training (to avoid out-of-bound errors). However, you have to manually clip the action when using the predict() method.
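A minimal sketch of such manual clipping (using Pendulum-v0 as an arbitrary continuous-action environment):
import gym
import numpy as np
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2
env = DummyVecEnv([lambda: gym.make('Pendulum-v0')])
model = PPO2('MlpPolicy', env).learn(1000)
obs = env.reset()
action, _states = model.predict(obs)
# Clip the continuous action to the bounds of the Box action space before stepping
action = np.clip(action, env.action_space.low, env.action_space.high)
obs, reward, done, info = env.step(action)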
Available Policies
MlpPolicy |
Policy object that implements actor critic, using a MLP (2 layers of 64) |
MlpLstmPolicy |
Policy object that implements actor critic, using LSTMs with a MLP feature extraction |
MlpLnLstmPolicy |
Policy object that implements actor critic, using a layer normalized LSTMs with a MLP feature extraction |
CnnPolicy |
Policy object that implements actor critic, using a CNN (the nature CNN) |
CnnLstmPolicy |
Policy object that implements actor critic, using LSTMs with a CNN feature extraction |
CnnLnLstmPolicy |
Policy object that implements actor critic, using a layer normalized LSTMs with a CNN feature extraction |
Base Classes¶
-
class
stable_baselines.common.policies.
ActorCriticPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, scale=False)[source]¶ Policy object that implements actor critic
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batch to run (n_envs * n_steps)
- n_lstm – (int) The number of LSTM cells (for recurrent policies)
- reuse – (bool) If the policy is reusable or not
- scale – (bool) whether or not to scale the input
-
proba_step
(obs, state=None, mask=None)[source]¶ Returns the action probability for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability
-
step
(obs, state=None, mask=None, deterministic=False)[source]¶ Returns the policy for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: ([float], [float], [float], [float]) actions, values, states, neglogp
-
value
(obs, state=None, mask=None)[source]¶ Returns the value for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
-
class
stable_baselines.common.policies.
FeedForwardPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, layers=None, cnn_extractor=<function nature_cnn>, feature_extraction='cnn', **kwargs)[source]¶ Policy object that implements actor critic, using a feed forward neural network.
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batch to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- layers – ([int]) The size of the Neural network for the policy (if None, default to [64, 64])
- cnn_extractor – (function (TensorFlow Tensor,
**kwargs
): (TensorFlow Tensor)) the CNN feature extraction - feature_extraction – (str) The feature extraction type (“cnn” or “mlp”)
- kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
-
proba_step
(obs, state=None, mask=None)[source]¶ Returns the action probability for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability
-
step
(obs, state=None, mask=None, deterministic=False)[source]¶ Returns the policy for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: ([float], [float], [float], [float]) actions, values, states, neglogp
-
value
(obs, state=None, mask=None)[source]¶ Returns the value for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
-
class
stable_baselines.common.policies.
LstmPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, layers=None, cnn_extractor=<function nature_cnn>, layer_norm=False, feature_extraction='cnn', **kwargs)[source]¶ Policy object that implements actor critic, using LSTMs.
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batch to run (n_envs * n_steps)
- n_lstm – (int) The number of LSTM cells (for recurrent policies)
- reuse – (bool) If the policy is reusable or not
- layers – ([int]) The size of the Neural network before the LSTM layer (if None, default to [64, 64])
- cnn_extractor – (function (TensorFlow Tensor,
**kwargs
): (TensorFlow Tensor)) the CNN feature extraction - layer_norm – (bool) Whether or not to use layer normalizing LSTMs
- feature_extraction – (str) The feature extraction type (“cnn” or “mlp”)
- kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
-
proba_step
(obs, state=None, mask=None)[source]¶ Returns the action probability for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability
-
step
(obs, state=None, mask=None, deterministic=False)[source]¶ Returns the policy for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: ([float], [float], [float], [float]) actions, values, states, neglogp
-
value
(obs, state=None, mask=None)[source]¶ Returns the value for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
MLP Policies¶
-
class
stable_baselines.common.policies.
MlpPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)[source]¶ Policy object that implements actor critic, using a MLP (2 layers of 64)
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batch to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
-
class
stable_baselines.common.policies.
MlpLstmPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, **_kwargs)[source]¶ Policy object that implements actor critic, using LSTMs with a MLP feature extraction
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batch to run (n_envs * n_steps)
- n_lstm – (int) The number of LSTM cells (for recurrent policies)
- reuse – (bool) If the policy is reusable or not
- kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
-
class
stable_baselines.common.policies.
MlpLnLstmPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, **_kwargs)[source]¶ Policy object that implements actor critic, using a layer normalized LSTMs with a MLP feature extraction
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batch to run (n_envs * n_steps)
- n_lstm – (int) The number of LSTM cells (for recurrent policies)
- reuse – (bool) If the policy is reusable or not
- kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
CNN Policies¶
-
class
stable_baselines.common.policies.
CnnPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)[source]¶ Policy object that implements actor critic, using a CNN (the nature CNN)
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batch to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
-
class
stable_baselines.common.policies.
CnnLstmPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, **_kwargs)[source]¶ Policy object that implements actor critic, using LSTMs with a CNN feature extraction
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batch to run (n_envs * n_steps)
- n_lstm – (int) The number of LSTM cells (for recurrent policies)
- reuse – (bool) If the policy is reusable or not
- kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
-
class
stable_baselines.common.policies.
CnnLnLstmPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, **_kwargs)[source]¶ Policy object that implements actor critic, using a layer normalized LSTMs with a CNN feature extraction
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batch to run (n_envs * n_steps)
- n_lstm – (int) The number of LSTM cells (for recurrent policies)
- reuse – (bool) If the policy is reusable or not
- kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
A2C¶
A synchronous, deterministic variant of Asynchronous Advantage Actor Critic (A3C). It uses multiple workers to avoid the use of a replay buffer.
Notes¶
- Original paper: https://arxiv.org/abs/1602.01783
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
- python -m stable_baselines.ppo2.run_atari runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (-h) for more options.
- python -m stable_baselines.ppo2.run_mujoco runs the algorithm for 1M frames on a Mujoco environment.
Can I use?¶
- Recurrent policies: ✔️
- Multi processing: ✔️
- Gym spaces:
Space | Action | Observation |
---|---|---|
Discrete | ✔️ | ✔️ |
Box | ✔️ | ✔️ |
MultiDiscrete | ✔️ | ✔️ |
MultiBinary | ✔️ | ✔️ |
Example¶
Train an A2C agent on CartPole-v1 using 4 processes.
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import A2C
# multiprocess environment
n_cpu = 4
env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for i in range(n_cpu)])
model = A2C(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save("a2c_cartpole")
del model # remove to demonstrate saving and loading
model = A2C.load("a2c_cartpole")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Parameters¶
-
class
stable_baselines.a2c.
A2C
(policy, env, gamma=0.99, n_steps=5, vf_coef=0.25, ent_coef=0.01, max_grad_norm=0.5, learning_rate=0.0007, alpha=0.99, epsilon=1e-05, lr_schedule='linear', verbose=0, tensorboard_log=None, _init_setup_model=True)[source]¶ The A2C (Advantage Actor Critic) model class, https://arxiv.org/abs/1602.01783
Parameters: - policy – (ActorCriticPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, CnnLstmPolicy, …)
- env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
- gamma – (float) Discount factor
- n_steps – (int) The number of steps to run for each environment per update (i.e. batch size is n_steps * n_env where n_env is number of environment copies running in parallel)
- vf_coef – (float) Value function coefficient for the loss calculation
- ent_coef – (float) Entropy coefficient for the loss calculation
- max_grad_norm – (float) The maximum value for the gradient clipping
- learning_rate – (float) The learning rate
- alpha – (float) RMSProp decay parameter (default: 0.99)
- epsilon – (float) RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)
- lr_schedule – (str) The type of scheduler for the learning rate update (‘linear’, ‘constant’, ‘double_linear_con’, ‘middle_drop’ or ‘double_middle_drop’)
- verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
- tensorboard_log – (str) the log location for tensorboard (if None, no logging)
- _init_setup_model – (bool) Whether or not to build the network at the creation of the instance (used only for loading)
-
action_probability
(observation, state=None, mask=None)¶ Get the model’s action probability distribution from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns: (np.ndarray) the model’s action probability distribution
-
get_env
()¶ returns the current environment (can be None if not defined)
Returns: (Gym Environment) The current environment
-
learn
(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='A2C')[source]¶ Return a trained model.
Parameters: - total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict)) function called at every steps with state of the algorithm. It takes the local and global variables.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
Returns: (BaseRLModel) the trained model
-
classmethod
load
(load_path, env=None, **kwargs)¶ Load the model from file
Parameters: - load_path – (str) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
-
predict
(observation, state=None, mask=None, deterministic=False)¶ Get the model’s action from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
-
save
(save_path)[source]¶ Save the current parameters to file
Parameters: save_path – (str) the save location
-
set_env
(env)¶ Checks the validity of the environment, and if it is coherent, set it as the current environment.
Parameters: env – (Gym Environment) The environment for learning a policy
ACER¶
Sample Efficient Actor-Critic with Experience Replay (ACER) combines several ideas of previous algorithms: it uses multiple workers (as A2C), implements a replay buffer (as in DQN), uses Retrace for Q-value estimation, importance sampling and a trust region.
Notes¶
- Original paper: https://arxiv.org/abs/1611.01224
- python -m stable_baselines.acer.run_atari runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (-h) for more options.
Can I use?¶
- Recurrent policies: ✔️
- Multi processing: ✔️
- Gym spaces:
Space | Action | Observation |
---|---|---|
Discrete | ✔️ | ✔️ |
Box | ❌ | ✔️ |
MultiDiscrete | ❌ | ✔️ |
MultiBinary | ❌ | ✔️ |
Example¶
import gym
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import ACER
# multiprocess environment
n_cpu = 4
env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for i in range(n_cpu)])
model = ACER(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save("acer_cartpole")
del model # remove to demonstrate saving and loading
model = ACER.load("acer_cartpole")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Parameters¶
-
class
stable_baselines.acer.
ACER
(policy, env, gamma=0.99, n_steps=20, num_procs=1, q_coef=0.5, ent_coef=0.01, max_grad_norm=10, learning_rate=0.0007, lr_schedule='linear', rprop_alpha=0.99, rprop_epsilon=1e-05, buffer_size=5000, replay_ratio=4, replay_start=1000, correction_term=10.0, trust_region=True, alpha=0.99, delta=1, verbose=0, tensorboard_log=None, _init_setup_model=True)[source]¶ The ACER (Actor-Critic with Experience Replay) model class, https://arxiv.org/abs/1611.01224
Parameters: - policy – (ActorCriticPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, CnnLstmPolicy, …)
- env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
- gamma – (float) The discount value
- n_steps – (int) The number of steps to run for each environment per update (i.e. batch size is n_steps * n_env where n_env is number of environment copies running in parallel)
- num_procs – (int) The number of threads for TensorFlow operations
- q_coef – (float) The weight for the loss on the Q value
- ent_coef – (float) The weight for the entropic loss
- max_grad_norm – (float) The clipping value for the maximum gradient
- learning_rate – (float) The initial learning rate for the RMS prop optimizer
- lr_schedule – (str) The type of scheduler for the learning rate update (‘linear’, ‘constant’, ‘double_linear_con’, ‘middle_drop’ or ‘double_middle_drop’)
- rprop_epsilon – (float) RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)
- rprop_alpha – (float) RMSProp decay parameter (default: 0.99)
- buffer_size – (int) The buffer size in number of steps
- replay_ratio – (float) The number of replay learning per on policy learning on average, using a poisson distribution
- replay_start – (int) The minimum number of steps in the buffer, before learning replay
- correction_term – (float) Importance weight clipping factor (default: 10)
- trust_region – (bool) Whether or not algorithms estimates the gradient KL divergence between the old and updated policy and uses it to determine step size (default: True)
- alpha – (float) The decay rate for the Exponential moving average of the parameters
- delta – (float) max KL divergence between the old policy and updated policy (default: 1)
- verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
- tensorboard_log – (str) the log location for tensorboard (if None, no logging)
- _init_setup_model – (bool) Whether or not to build the network at the creation of the instance
-
action_probability
(observation, state=None, mask=None)¶ Get the model’s action probability distribution from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns: (np.ndarray) the model’s action probability distribution
-
get_env
()¶ returns the current environment (can be None if not defined)
Returns: (Gym Environment) The current environment
-
learn
(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='ACER')[source]¶ Return a trained model.
Parameters: - total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict)) function called at every steps with state of the algorithm. It takes the local and global variables.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
Returns: (BaseRLModel) the trained model
-
classmethod
load
(load_path, env=None, **kwargs)¶ Load the model from file
Parameters: - load_path – (str) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
-
predict
(observation, state=None, mask=None, deterministic=False)¶ Get the model’s action from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
-
save
(save_path)[source]¶ Save the current parameters to file
Parameters: save_path – (str) the save location
ACKTR¶
Actor Critic using Kronecker-Factored Trust Region (ACKTR) uses Kronecker-factored approximate curvature (K-FAC) for trust region optimization.
Notes¶
- Original paper: https://arxiv.org/abs/1708.05144
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- python -m stable_baselines.acktr.run_atari runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (-h) for more options.
Can I use?¶
- Recurrent policies: ✔️
- Multi processing: ✔️
- Gym spaces:
Space | Action | Observation |
---|---|---|
Discrete | ✔️ | ✔️ |
Box | ❌ | ✔️ |
MultiDiscrete | ❌ | ✔️ |
MultiBinary | ❌ | ✔️ |
Example¶
import gym
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import ACKTR
# multiprocess environment
n_cpu = 4
env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for i in range(n_cpu)])
model = ACKTR(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save("acktr_cartpole")
del model # remove to demonstrate saving and loading
model = ACKTR.load("acktr_cartpole")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Parameters¶
-
class
stable_baselines.acktr.
ACKTR
(policy, env, gamma=0.99, nprocs=1, n_steps=20, ent_coef=0.01, vf_coef=0.25, vf_fisher_coef=1.0, learning_rate=0.25, max_grad_norm=0.5, kfac_clip=0.001, lr_schedule='linear', verbose=0, tensorboard_log=None, _init_setup_model=True, async_eigen_decomp=False)[source]¶ The ACKTR (Actor Critic using Kronecker-Factored Trust Region) model class, https://arxiv.org/abs/1708.05144
Parameters: - policy – (ActorCriticPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, CnnLstmPolicy, …)
- env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
- gamma – (float) Discount factor
- nprocs – (int) The number of threads for TensorFlow operations
- n_steps – (int) The number of steps to run for each environment
- ent_coef – (float) The weight for the entropic loss
- vf_coef – (float) The weight for the loss on the value function
- vf_fisher_coef – (float) The weight for the fisher loss on the value function
- learning_rate – (float) The initial learning rate for the RMS prop optimizer
- max_grad_norm – (float) The clipping value for the maximum gradient
- kfac_clip – (float) gradient clipping for Kullback-Leibler
- lr_schedule – (str) The type of scheduler for the learning rate update (‘linear’, ‘constant’, ‘double_linear_con’, ‘middle_drop’ or ‘double_middle_drop’)
- verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
- tensorboard_log – (str) the log location for tensorboard (if None, no logging)
- _init_setup_model – (bool) Whether or not to build the network at the creation of the instance
- async_eigen_decomp – (bool) Use async eigen decomposition
-
action_probability
(observation, state=None, mask=None)¶ Get the model’s action probability distribution from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns: (np.ndarray) the model’s action probability distribution
-
get_env
()¶ returns the current environment (can be None if not defined)
Returns: (Gym Environment) The current environment
-
learn
(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='ACKTR')[source]¶ Return a trained model.
Parameters: - total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict)) function called at every step with the state of the algorithm. It takes the local and global variables.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
Returns: (BaseRLModel) the trained model
-
classmethod
load
(load_path, env=None, **kwargs)¶ Load the model from file
Parameters: - load_path – (str) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
-
predict
(observation, state=None, mask=None, deterministic=False)¶ Get the model’s action from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
-
save
(save_path)[source]¶ Save the current parameters to file
Parameters: save_path – (str) the save location
-
set_env
(env)¶ Checks the validity of the environment, and if it is coherent, set it as the current environment.
Parameters: env – (Gym Environment) The environment for learning a policy
DDPG¶
Deep Deterministic Policy Gradient (DDPG)
Warning
The DDPG model does not support stable_baselines.common.policies because it uses q-value instead of value estimation; as a result it must use its own policy models (see DDPG Policies).
Available Policies
MlpPolicy |
Policy object that implements actor critic, using a MLP (2 layers of 64) |
LnMlpPolicy |
Policy object that implements actor critic, using a MLP (2 layers of 64), with layer normalisation |
CnnPolicy |
Policy object that implements actor critic, using a CNN (the nature CNN) |
LnCnnPolicy |
Policy object that implements actor critic, using a CNN (the nature CNN), with layer normalisation |
Notes¶
- Original paper: https://arxiv.org/abs/1509.02971
- Baselines post: https://blog.openai.com/better-exploration-with-parameter-noise/
python -m stable_baselines.ddpg.main
runs the algorithm for 1M frames = 10M timesteps on a Mujoco environment. See help (-h) for more options.
Can I use?¶
- Recurrent policies: ❌
- Multi processing: ❌
- Gym spaces:
Space | Action | Observation |
---|---|---|
Discrete | ❌ | ✔️ |
Box | ✔️ | ✔️ |
MultiDiscrete | ❌ | ✔️ |
MultiBinary | ❌ | ✔️ |
Example¶
import gym
import numpy as np
from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.ddpg.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise, AdaptiveParamNoiseSpec
from stable_baselines import DDPG
env = gym.make('MountainCarContinuous-v0')
env = DummyVecEnv([lambda: env])
# the noise objects for DDPG
n_actions = env.action_space.shape[-1]
param_noise = None
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=float(0.5) * np.ones(n_actions))
model = DDPG(MlpPolicy, env, verbose=1, param_noise=param_noise, action_noise=action_noise)
model.learn(total_timesteps=400000)
model.save("ddpg_mountain")
del model # remove to demonstrate saving and loading
model = DDPG.load("ddpg_mountain")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Parameters¶
-
class
stable_baselines.ddpg.
DDPG
(policy, env, gamma=0.99, memory_policy=None, eval_env=None, nb_train_steps=50, nb_rollout_steps=100, nb_eval_steps=100, param_noise=None, action_noise=None, normalize_observations=False, tau=0.001, batch_size=128, param_noise_adaption_interval=50, normalize_returns=False, enable_popart=False, observation_range=(-5.0, 5.0), critic_l2_reg=0.0, return_range=(-inf, inf), actor_lr=0.0001, critic_lr=0.001, clip_norm=None, reward_scale=1.0, render=False, render_eval=False, memory_limit=100, verbose=0, tensorboard_log=None, _init_setup_model=True)[source]¶ Deep Deterministic Policy Gradient (DDPG) model
DDPG: https://arxiv.org/pdf/1509.02971.pdf
Parameters: - policy – (DDPGPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, LnMlpPolicy, …)
- env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
- gamma – (float) the discount rate
- memory_policy – (Memory) the replay buffer (if None, default to baselines.ddpg.memory.Memory)
- eval_env – (Gym Environment) the evaluation environment (can be None)
- nb_train_steps – (int) the number of training steps
- nb_rollout_steps – (int) the number of rollout steps
- nb_eval_steps – (int) the number of evaluation steps
- param_noise – (AdaptiveParamNoiseSpec) the parameter noise type (can be None)
- action_noise – (ActionNoise) the action noise type (can be None)
- param_noise_adaption_interval – (int) apply param noise every N steps
- tau – (float) the soft update coefficient (keep old values, between 0 and 1)
- normalize_returns – (bool) should the critic output be normalized
- enable_popart – (bool) enable pop-art normalization of the critic output (https://arxiv.org/pdf/1602.07714.pdf)
- normalize_observations – (bool) should the observation be normalized
- batch_size – (int) the size of the batch for learning the policy
- observation_range – (tuple) the bounding values for the observation
- return_range – (tuple) the bounding values for the critic output
- critic_l2_reg – (float) l2 regularizer coefficient
- actor_lr – (float) the actor learning rate
- critic_lr – (float) the critic learning rate
- clip_norm – (float) clip the gradients (disabled if None)
- reward_scale – (float) the value the reward should be scaled by
- render – (bool) enable rendering of the environment
- render_eval – (bool) enable rendering of the evaluation environment
- memory_limit – (int) the max number of transitions to store
- verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
- tensorboard_log – (str) the log location for tensorboard (if None, no logging)
- _init_setup_model – (bool) Whether or not to build the network at the creation of the instance
-
action_probability
(observation, state=None, mask=None)[source]¶ Get the model’s action probability distribution from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns: (np.ndarray) the model’s action probability distribution
-
get_env
()¶ returns the current environment (can be None if not defined)
Returns: (Gym Environment) The current environment
-
learn
(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='DDPG')[source]¶ Return a trained model.
Parameters: - total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict)) function called at every step with the state of the algorithm. It takes the local and global variables.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
Returns: (BaseRLModel) the trained model
-
classmethod
load
(load_path, env=None, **kwargs)[source]¶ Load the model from file
Parameters: - load_path – (str) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
-
predict
(observation, state=None, mask=None, deterministic=True)[source]¶ Get the model’s action from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
-
save
(save_path)[source]¶ Save the current parameters to file
Parameters: save_path – (str) the save location
-
set_env
(env)¶ Checks the validity of the environment, and if it is coherent, set it as the current environment.
Parameters: env – (Gym Environment) The environment for learning a policy
DDPG Policies¶
-
class
stable_baselines.ddpg.
MlpPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)[source]¶ Policy object that implements actor critic, using a MLP (2 layers of 64)
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
-
make_actor
(obs=None, reuse=False, scope='pi')¶ creates an actor object
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the actor
Returns: (TensorFlow Tensor) the output tensor
-
make_critic
(obs=None, action=None, reuse=False, scope='qf')¶ creates a critic object
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the critic
Returns: (TensorFlow Tensor) the output tensor
-
proba_step
(obs, state=None, mask=None)¶ Returns the action probability for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability
-
step
(obs, state=None, mask=None)¶ Returns the policy for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) actions
-
value
(obs, action, state=None, mask=None)¶ Returns the value for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- action – ([float] or [int]) The taken action
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
-
class
stable_baselines.ddpg.
LnMlpPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)[source]¶ Policy object that implements actor critic, using a MLP (2 layers of 64), with layer normalisation
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
-
make_actor
(obs=None, reuse=False, scope='pi')¶ creates an actor object
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the actor
Returns: (TensorFlow Tensor) the output tensor
-
make_critic
(obs=None, action=None, reuse=False, scope='qf')¶ creates a critic object
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the critic
Returns: (TensorFlow Tensor) the output tensor
-
proba_step
(obs, state=None, mask=None)¶ Returns the action probability for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability
-
step
(obs, state=None, mask=None)¶ Returns the policy for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) actions
-
value
(obs, action, state=None, mask=None)¶ Returns the value for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- action – ([float] or [int]) The taken action
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
-
class
stable_baselines.ddpg.
CnnPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)[source]¶ Policy object that implements actor critic, using a CNN (the nature CNN)
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
-
make_actor
(obs=None, reuse=False, scope='pi')¶ creates an actor object
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the actor
Returns: (TensorFlow Tensor) the output tensor
-
make_critic
(obs=None, action=None, reuse=False, scope='qf')¶ creates a critic object
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the critic
Returns: (TensorFlow Tensor) the output tensor
-
proba_step
(obs, state=None, mask=None)¶ Returns the action probability for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability
-
step
(obs, state=None, mask=None)¶ Returns the policy for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) actions
-
value
(obs, action, state=None, mask=None)¶ Returns the value for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- action – ([float] or [int]) The taken action
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
-
class
stable_baselines.ddpg.
LnCnnPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)[source]¶ Policy object that implements actor critic, using a CNN (the nature CNN), with layer normalisation
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
-
make_actor
(obs=None, reuse=False, scope='pi')¶ creates an actor object
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the actor
Returns: (TensorFlow Tensor) the output tensor
-
make_critic
(obs=None, action=None, reuse=False, scope='qf')¶ creates a critic object
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the critic
Returns: (TensorFlow Tensor) the output tensor
-
proba_step
(obs, state=None, mask=None)¶ Returns the action probability for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability
-
step
(obs, state=None, mask=None)¶ Returns the policy for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) actions
-
value
(obs, action, state=None, mask=None)¶ Returns the value for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- action – ([float] or [int]) The taken action
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
Action and Parameters Noise¶
-
class
stable_baselines.ddpg.
AdaptiveParamNoiseSpec
(initial_stddev=0.1, desired_action_stddev=0.1, adoption_coefficient=1.01)[source]¶ Implements adaptive parameter noise
Parameters: - initial_stddev – (float) the initial value for the standard deviation of the noise
- desired_action_stddev – (float) the desired value for the standard deviation of the noise
- adoption_coefficient – (float) the update coefficient for the standard deviation of the noise
-
class
stable_baselines.ddpg.
NormalActionNoise
(mean, sigma)[source]¶ A gaussian action noise
Parameters: - mean – (float) the mean value of the noise
- sigma – (float) the scale of the noise (std here)
-
reset
()¶ call end of episode reset for the noise
-
class
stable_baselines.ddpg.
OrnsteinUhlenbeckActionNoise
(mean, sigma, theta=0.15, dt=0.01, initial_noise=None)[source]¶ An Ornstein-Uhlenbeck action noise; this is designed to approximate Brownian motion with friction.
Based on http://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab
Parameters: - mean – (float) the mean of the noise
- sigma – (float) the scale of the noise
- theta – (float) the rate of mean reversion
- dt – (float) the timestep for the noise
- initial_noise – ([float]) the initial value for the noise output, (if None: 0)
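The noise objects above can be swapped into the DDPG example: action noise perturbs the actions directly, while parameter noise perturbs the policy weights (layer-normalized policies such as LnMlpPolicy are commonly used with parameter noise). A minimal sketch, not from the original examples:
import gym
import numpy as np
from stable_baselines.ddpg.policies import LnMlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.ddpg.noise import NormalActionNoise, AdaptiveParamNoiseSpec
from stable_baselines import DDPG

env = DummyVecEnv([lambda: gym.make('MountainCarContinuous-v0')])
n_actions = env.action_space.shape[-1]

# Gaussian action noise would be:
# NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
# Here, adaptive parameter noise is used instead (and no action noise at all)
param_noise = AdaptiveParamNoiseSpec(initial_stddev=0.1, desired_action_stddev=0.1)

model = DDPG(LnMlpPolicy, env, param_noise=param_noise, action_noise=None, verbose=1)
model.learn(total_timesteps=10000)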
Custom Policy Network¶
Similarly to the example given in the examples page, you can easily define a custom architecture for the policy network:
import gym
from stable_baselines.ddpg.policies import FeedForwardPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import DDPG
# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           layers=[128, 128, 128],
                                           layer_norm=False,
                                           feature_extraction="mlp")
# Create and wrap the environment
env = gym.make('Pendulum-v0')
env = DummyVecEnv([lambda: env])
model = DDPG(CustomPolicy, env, verbose=1)
# Train the agent
model.learn(total_timesteps=100000)
DQN¶
Deep Q Network (DQN) and its extensions (Double-DQN, Dueling-DQN, Prioritized Experience Replay).
Warning
The DQN model does not support stable_baselines.common.policies; as a result it must use its own policy models (see DQN Policies).
Available Policies
MlpPolicy |
Policy object that implements DQN policy, using a MLP (2 layers of 64) |
LnMlpPolicy |
Policy object that implements DQN policy, using a MLP (2 layers of 64), with layer normalisation |
CnnPolicy |
Policy object that implements DQN policy, using a CNN (the nature CNN) |
LnCnnPolicy |
Policy object that implements DQN policy, using a CNN (the nature CNN), with layer normalisation |
Notes¶
- Original paper: https://arxiv.org/abs/1312.5602
Can I use?¶
- Recurrent policies: ❌
- Multi processing: ❌
- Gym spaces:
Space | Action | Observation |
---|---|---|
Discrete | ✔️ | ✔️ |
Box | ❌ | ✔️ |
MultiDiscrete | ❌ | ✔️ |
MultiBinary | ❌ | ✔️ |
Example¶
import gym
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.deepq.policies import MlpPolicy
from stable_baselines import DQN
env = gym.make('CartPole-v1')
env = DummyVecEnv([lambda: env])
model = DQN(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save("deepq_cartpole")
del model # remove to demonstrate saving and loading
model = DQN.load("deepq_cartpole")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
With Atari:
from stable_baselines.common.atari_wrappers import make_atari
from stable_baselines.deepq.policies import MlpPolicy, CnnPolicy
from stable_baselines import DQN
env = make_atari('BreakoutNoFrameskip-v4')
model = DQN(CnnPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save("deepq_breakout")
del model # remove to demonstrate saving and loading
model = DQN.load("deepq_breakout")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
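The DQN extensions are switched on through the constructor parameters documented below; for example, prioritized experience replay and a longer exploration schedule can be enabled as in this minimal sketch (values are illustrative, not from the original examples):
import gym
from stable_baselines.deepq.policies import MlpPolicy
from stable_baselines import DQN

env = gym.make('CartPole-v1')
model = DQN(MlpPolicy, env,
            prioritized_replay=True,      # use the prioritized replay buffer
            exploration_fraction=0.2,     # anneal epsilon over 20% of training
            exploration_final_eps=0.05,   # final random-action probability
            verbose=1)
model.learn(total_timesteps=25000)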
Parameters¶
-
class
stable_baselines.deepq.
DQN
(policy, env, gamma=0.99, learning_rate=0.0005, buffer_size=50000, exploration_fraction=0.1, exploration_final_eps=0.02, train_freq=1, batch_size=32, checkpoint_freq=10000, checkpoint_path=None, learning_starts=1000, target_network_update_freq=500, prioritized_replay=False, prioritized_replay_alpha=0.6, prioritized_replay_beta0=0.4, prioritized_replay_beta_iters=None, prioritized_replay_eps=1e-06, param_noise=False, verbose=0, tensorboard_log=None, _init_setup_model=True)[source]¶ The DQN model class. DQN paper: https://arxiv.org/pdf/1312.5602.pdf
Parameters: - policy – (DQNPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, LnMlpPolicy, …)
- env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
- gamma – (float) discount factor
- learning_rate – (float) learning rate for adam optimizer
- buffer_size – (int) size of the replay buffer
- exploration_fraction – (float) fraction of entire training period over which the exploration rate is annealed
- exploration_final_eps – (float) final value of random action probability
- train_freq – (int) update the model every train_freq steps.
- batch_size – (int) size of a batched sampled from replay buffer for training
- checkpoint_freq – (int) how often to save the model. This is so that the best version is restored at the end of the training. If you do not wish to restore the best version at the end of the training, set this variable to None.
- checkpoint_path – (str) replacement path used if you need to log to somewhere else than a temporary directory.
- learning_starts – (int) how many steps of the model to collect transitions for before learning starts
- target_network_update_freq – (int) update the target network every target_network_update_freq steps.
- prioritized_replay – (bool) if True prioritized replay buffer will be used.
- prioritized_replay_alpha – (float) alpha parameter for prioritized replay buffer
- prioritized_replay_beta0 – (float) initial value of beta for prioritized replay buffer
- prioritized_replay_beta_iters – (int) number of iterations over which beta will be annealed from its initial value to 1.0. If set to None, it equals max_timesteps.
- prioritized_replay_eps – (float) epsilon to add to the TD errors when updating priorities.
- param_noise – (bool) Whether or not to apply noise to the parameters of the policy.
- verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
- tensorboard_log – (str) the log location for tensorboard (if None, no logging)
- _init_setup_model – (bool) Whether or not to build the network at the creation of the instance
-
action_probability
(observation, state=None, mask=None)[source]¶ Get the model’s action probability distribution from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns: (np.ndarray) the model’s action probability distribution
-
get_env
()¶ returns the current environment (can be None if not defined)
Returns: (Gym Environment) The current environment
-
learn
(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='DQN')[source]¶ Return a trained model.
Parameters: - total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict)) function called at every step with the state of the algorithm. It takes the local and global variables.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
Returns: (BaseRLModel) the trained model
-
classmethod
load
(load_path, env=None, **kwargs)[source]¶ Load the model from file
Parameters: - load_path – (str) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
-
predict
(observation, state=None, mask=None, deterministic=True)[source]¶ Get the model’s action from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
-
save
(save_path)[source]¶ Save the current parameters to file
Parameters: save_path – (str) the save location
-
set_env
(env)¶ Checks the validity of the environment, and if it is coherent, set it as the current environment.
Parameters: env – (Gym Environment) The environment for learning a policy
DQN Policies¶
-
class
stable_baselines.deepq.
MlpPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, obs_phs=None, dueling=True, **_kwargs)[source]¶ Policy object that implements DQN policy, using a MLP (2 layers of 64)
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- obs_phs – (TensorFlow Tensor, TensorFlow Tensor) a tuple containing an override for observation placeholder and the processed observation placeholder respectively
- dueling – (bool) if true double the output MLP to compute a baseline for action scores
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
-
proba_step
(obs, state=None, mask=None)¶ Returns the action probability for a single step
Parameters: - obs – (np.ndarray float or int) The current observation of the environment
- state – (np.ndarray float) The last states (used in recurrent policies)
- mask – (np.ndarray float) The last masks (used in recurrent policies)
Returns: (np.ndarray float) the action probability
-
step
(obs, state=None, mask=None, deterministic=True)¶ Returns the q_values for a single step
Parameters: - obs – (np.ndarray float or int) The current observation of the environment
- state – (np.ndarray float) The last states (used in recurrent policies)
- mask – (np.ndarray float) The last masks (used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray int, np.ndarray float, np.ndarray float) actions, q_values, states
-
class
stable_baselines.deepq.
LnMlpPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, obs_phs=None, dueling=True, **_kwargs)[source]¶ Policy object that implements DQN policy, using a MLP (2 layers of 64), with layer normalisation
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- obs_phs – (TensorFlow Tensor, TensorFlow Tensor) a tuple containing an override for observation placeholder and the processed observation placeholder respectively
- dueling – (bool) if true double the output MLP to compute a baseline for action scores
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
-
proba_step
(obs, state=None, mask=None)¶ Returns the action probability for a single step
Parameters: - obs – (np.ndarray float or int) The current observation of the environment
- state – (np.ndarray float) The last states (used in recurrent policies)
- mask – (np.ndarray float) The last masks (used in recurrent policies)
Returns: (np.ndarray float) the action probability
-
step
(obs, state=None, mask=None, deterministic=True)¶ Returns the q_values for a single step
Parameters: - obs – (np.ndarray float or int) The current observation of the environment
- state – (np.ndarray float) The last states (used in recurrent policies)
- mask – (np.ndarray float) The last masks (used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray int, np.ndarray float, np.ndarray float) actions, q_values, states
-
class
stable_baselines.deepq.
CnnPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, obs_phs=None, dueling=True, **_kwargs)[source]¶ Policy object that implements DQN policy, using a CNN (the nature CNN)
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- obs_phs – (TensorFlow Tensor, TensorFlow Tensor) a tuple containing an override for observation placeholder and the processed observation placeholder respectively
- dueling – (bool) if true double the output MLP to compute a baseline for action scores
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
-
proba_step
(obs, state=None, mask=None)¶ Returns the action probability for a single step
Parameters: - obs – (np.ndarray float or int) The current observation of the environment
- state – (np.ndarray float) The last states (used in recurrent policies)
- mask – (np.ndarray float) The last masks (used in recurrent policies)
Returns: (np.ndarray float) the action probability
-
step
(obs, state=None, mask=None, deterministic=True)¶ Returns the q_values for a single step
Parameters: - obs – (np.ndarray float or int) The current observation of the environment
- state – (np.ndarray float) The last states (used in recurrent policies)
- mask – (np.ndarray float) The last masks (used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray int, np.ndarray float, np.ndarray float) actions, q_values, states
-
class
stable_baselines.deepq.
LnCnnPolicy
(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, obs_phs=None, dueling=True, **_kwargs)[source]¶ Policy object that implements DQN policy, using a CNN (the nature CNN), with layer normalisation
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- obs_phs – (TensorFlow Tensor, TensorFlow Tensor) a tuple containing an override for observation placeholder and the processed observation placeholder respectively
- dueling – (bool) if true double the output MLP to compute a baseline for action scores
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
-
proba_step
(obs, state=None, mask=None)¶ Returns the action probability for a single step
Parameters: - obs – (np.ndarray float or int) The current observation of the environment
- state – (np.ndarray float) The last states (used in recurrent policies)
- mask – (np.ndarray float) The last masks (used in recurrent policies)
Returns: (np.ndarray float) the action probability
-
step
(obs, state=None, mask=None, deterministic=True)¶ Returns the q_values for a single step
Parameters: - obs – (np.ndarray float or int) The current observation of the environment
- state – (np.ndarray float) The last states (used in recurrent policies)
- mask – (np.ndarray float) The last masks (used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray int, np.ndarray float, np.ndarray float) actions, q_values, states
Custom Policy Network¶
Similarly to the example given in the examples page, you can easily define a custom architecture for the policy network:
import gym
from stable_baselines.deepq.policies import FeedForwardPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import DQN
# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           layers=[128, 128, 128],
                                           layer_norm=False,
                                           feature_extraction="mlp")
# Create and wrap the environment
env = gym.make('LunarLander-v2')
env = DummyVecEnv([lambda: env])
model = DQN(CustomPolicy, env, verbose=1)
# Train the agent
model.learn(total_timesteps=100000)
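The deepq FeedForwardPolicy also accepts the dueling flag documented on the built-in policies above; assuming it is forwarded like the other keyword arguments, dueling can be disabled from a custom policy as in this hedged sketch (not from the original docs):
from stable_baselines.deepq.policies import FeedForwardPolicy

class CustomDQNPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomDQNPolicy, self).__init__(*args, **kwargs,
                                              layers=[64, 64],
                                              dueling=False,  # assumption: same flag as MlpPolicy/CnnPolicy
                                              feature_extraction="mlp")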
GAIL¶
Generative Adversarial Imitation Learning (GAIL)
Notes¶
- Original paper: https://arxiv.org/abs/1606.03476
If you want to train an imitation learning agent¶
Step 1: Download expert data¶
Download the expert data into ./data, download link
Step 2: Run GAIL¶
Run with single thread:
python -m stable_baselines.gail.run_mujoco
Run with multiple threads:
mpirun -np 16 python -m stable_baselines.gail.run_mujoco
See help (-h) for more options.
In case you want to run Behavior Cloning (BC):
python -m stable_baselines.gail.behavior_clone
See help (-h) for more options.
OpenAI Maintainers:
- Yuan-Hong Liao, andrewliao11_at_gmail_dot_com
- Ryan Julian, ryanjulian_at_gmail_dot_com
Others
Thanks to these open source projects:
- @openai/imitation
- @carpedm20/deep-rl-tensorflow
Can I use?¶
- Recurrent policies: ✔️
- Multi processing: ✔️ (using MPI)
- Gym spaces:
Space | Action | Observation |
---|---|---|
Discrete | ❌ | ✔️ |
Box | ✔️ | ✔️ |
MultiDiscrete | ❌ | ✔️ |
MultiBinary | ❌ | ✔️ |
Parameters¶
-
class
stable_baselines.gail.
GAIL
(policy, env, pretrained_weight=False, hidden_size_adversary=100, adversary_entcoeff=0.001, expert_dataset=None, save_per_iter=1, checkpoint_dir='/tmp/gail/ckpt/', g_step=1, d_step=1, task_name='task_name', d_stepsize=0.0003, verbose=0, _init_setup_model=True, **kwargs)[source]¶ Generative Adversarial Imitation Learning (GAIL)
Parameters: - policy – (ActorCriticPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, CnnLstmPolicy, …)
- env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
- gamma – (float) the discount value
- timesteps_per_batch – (int) the number of timesteps to run per batch (horizon)
- max_kl – (float) the Kullback-Leibler loss threshold
- cg_iters – (int) the number of iterations for the conjugate gradient calculation
- lam – (float) GAE factor
- entcoeff – (float) the weight for the entropy loss
- cg_damping – (float) the compute gradient dampening factor
- vf_stepsize – (float) the value function stepsize
- vf_iters – (int) the value function’s number iterations for learning
- pretrained_weight – (str) the save location for the pretrained weights
- hidden_size – ([int]) the hidden dimension for the MLP
- expert_dataset – (Dset) the dataset manager
- save_per_iter – (int) the number of iterations before saving
- checkpoint_dir – (str) the location for saving checkpoints
- g_step – (int) number of steps to train policy in each epoch
- d_step – (int) number of steps to train discriminator in each epoch
- task_name – (str) the name of the task (can be None)
- d_stepsize – (float) the reward giver stepsize
- verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
- _init_setup_model – (bool) Whether or not to build the network at the creation of the instance
-
action_probability
(observation, state=None, mask=None)[source]¶ Get the model’s action probability distribution from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns: (np.ndarray) the model’s action probability distribution
-
get_env
()¶ returns the current environment (can be None if not defined)
Returns: (Gym Environment) The current environment
-
learn
(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='GAIL')[source]¶ Return a trained model.
Parameters: - total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict)) function called at every step with the state of the algorithm. It takes the local and global variables.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
Returns: (BaseRLModel) the trained model
-
classmethod
load
(load_path, env=None, **kwargs)[source]¶ Load the model from file
Parameters: - load_path – (str) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
-
predict
(observation, state=None, mask=None, deterministic=False)[source]¶ Get the model’s action from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
-
save
(save_path)[source]¶ Save the current parameters to file
Parameters: save_path – (str) the save location
HER¶
Hindsight Experience Replay (HER)
Warning
HER is not refactored yet. We are looking for contributors to help us.
How to use Hindsight Experience Replay¶
Getting started¶
Training an agent is very simple:
python -m stable_baselines.her.experiment.train
This will train a DDPG+HER agent on the FetchReach environment. You should see the success rate go up quickly to 1.0, which means that the agent achieves the desired goal in 100% of the cases. The training script logs other diagnostics as well and pickles the best policy so far (w.r.t. its test success rate), the latest policy, and, if enabled, a history of policies every K epochs.
To inspect what the agent has learned, use the play script:
python -m stable_baselines.her.experiment.play /path/to/an/experiment/policy_best.pkl
You can try it right now with the results of the training step (the script prints out the path for you). This should visualize the current policy for 10 episodes and will also print statistics.
Reproducing results¶
In order to reproduce the results from Plappert et al. (2018), run the following command:
python -m stable_baselines.her.experiment.train --num_cpu 19
This will require a machine with a sufficient number of physical CPU cores. In our experiments, we used Azure’s D15v2 instances, which have 20 physical cores. We only scheduled the experiment on 19 of those to leave some head-room on the system.
Parameters¶
-
class
stable_baselines.her.
HER
(policy, env, verbose=0, _init_setup_model=True)[source]¶ -
action_probability
(observation, state=None, mask=None)[source]¶ Get the model’s action probability distribution from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns: (np.ndarray) the model’s action probability distribution
-
learn
(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='HER')[source]¶ Return a trained model.
Parameters: - total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict)) function called at every step with the state of the algorithm. It takes the local and global variables.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
Returns: (BaseRLModel) the trained model
-
classmethod
load
(load_path, env=None, **kwargs)[source]¶ Load the model from file
Parameters: - load_path – (str) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
-
predict
(observation, state=None, mask=None, deterministic=False)[source]¶ Get the model’s action from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
-
PPO1¶
The Proximal Policy Optimization algorithm combines ideas from A2C (having multiple workers) and TRPO (it uses a trust region to improve the actor).
The main idea is that after an update, the new policy should not be too far from the old policy. For that, PPO uses clipping to avoid too large an update.
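To make the clipping idea concrete, here is a standalone numpy sketch (not library code) of the clipped surrogate objective: the probability ratio between the new and the old policy is kept inside [1 - clip_range, 1 + clip_range] before being multiplied by the advantage.
import numpy as np

def clipped_surrogate_loss(new_log_prob, old_log_prob, advantage, clip_range=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities
    ratio = np.exp(new_log_prob - old_log_prob)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage
    # taking the element-wise minimum makes the update pessimistic:
    # a large policy change cannot improve the objective beyond the clipped value
    return -np.mean(np.minimum(unclipped, clipped))

# toy batch of two transitions
loss = clipped_surrogate_loss(np.array([-0.9, -1.2]),
                              np.array([-1.0, -1.0]),
                              np.array([1.0, -0.5]))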
Note
PPO2 is the implementation OpenAI made for GPU. For multiprocessing, it uses vectorized environments, whereas PPO1 uses MPI.
Notes¶
- Original paper: https://arxiv.org/abs/1707.06347
- Clear explanation of PPO on Arxiv Insights channel: https://www.youtube.com/watch?v=5P7I-xPq8u8
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
- mpirun -np 8 python -m stable_baselines.ppo1.run_atari
runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (-h) for more options.
- python -m stable_baselines.ppo1.run_mujoco
runs the algorithm for 1M frames on a Mujoco environment.
- Train mujoco 3d humanoid (with optimal-ish hyperparameters):
mpirun -np 16 python -m stable_baselines.ppo1.run_humanoid --model-path=/path/to/model
- Render the 3d humanoid:
python -m stable_baselines.ppo1.run_humanoid --play --model-path=/path/to/model
Can I use?¶
- Recurrent policies: ✔️
- Multi processing: ✔️ (using MPI)
- Gym spaces:
Space | Action | Observation |
---|---|---|
Discrete | ✔️ | ✔️ |
Box | ✔️ | ✔️ |
MultiDiscrete | ✔️ | ✔️ |
MultiBinary | ✔️ | ✔️ |
Example¶
import gym
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO1
env = gym.make('CartPole-v1')
env = DummyVecEnv([lambda: env])
model = PPO1(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save("ppo1_cartpole")
del model # remove to demonstrate saving and loading
model = PPO1.load("ppo1_cartpole")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Parameters¶
-
class
stable_baselines.ppo1.
PPO1
(policy, env, gamma=0.99, timesteps_per_actorbatch=256, clip_param=0.2, entcoeff=0.01, optim_epochs=4, optim_stepsize=0.001, optim_batchsize=64, lam=0.95, adam_epsilon=1e-05, schedule='linear', verbose=0, tensorboard_log=None, _init_setup_model=True)[source]¶ Proximal Policy Optimization algorithm (MPI version). Paper: https://arxiv.org/abs/1707.06347
Parameters: - env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
- policy – (ActorCriticPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, CnnLstmPolicy, …)
- timesteps_per_actorbatch – (int) timesteps per actor per update
- clip_param – (float) clipping parameter epsilon
- entcoeff – (float) the entropy loss weight
- optim_epochs – (float) the optimizer’s number of epochs
- optim_stepsize – (float) the optimizer’s stepsize
- optim_batchsize – (int) the optimizer’s batch size
- gamma – (float) discount factor
- lam – (float) the GAE factor for advantage estimation
- adam_epsilon – (float) the epsilon value for the adam optimizer
- schedule – (str) The type of scheduler for the learning rate update (‘linear’, ‘constant’, ‘double_linear_con’, ‘middle_drop’ or ‘double_middle_drop’)
- verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
- tensorboard_log – (str) the log location for tensorboard (if None, no logging)
- _init_setup_model – (bool) Whether or not to build the network at the creation of the instance
-
action_probability
(observation, state=None, mask=None)¶ Get the model’s action probability distribution from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns: (np.ndarray) the model’s action probability distribution
-
get_env
()¶ returns the current environment (can be None if not defined)
Returns: (Gym Environment) The current environment
-
learn
(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='PPO1')[source]¶ Return a trained model.
Parameters: - total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict)) function called at every step with the state of the algorithm. It takes the local and global variables.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
Returns: (BaseRLModel) the trained model
-
classmethod
load
(load_path, env=None, **kwargs)¶ Load the model from file
Parameters: - load_path – (str) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
-
predict
(observation, state=None, mask=None, deterministic=False)¶ Get the model’s action from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
-
save
(save_path)[source]¶ Save the current parameters to file
Parameters: save_path – (str) the save location
-
set_env
(env)¶ Checks the validity of the environment, and if it is coherent, set it as the current environment.
Parameters: env – (Gym Environment) The environment for learning a policy
PPO2¶
The Proximal Policy Optimization algorithm combines ideas from A2C (having multiple workers) and TRPO (it uses a trust region to improve the actor).
The main idea is that after an update, the new policy should not be too far from the old policy. For that, PPO uses clipping to avoid too large an update.
Note
PPO2 is the implementation OpenAI made for GPU. For multiprocessing, it uses vectorized environments, whereas PPO1 uses MPI.
Note
PPO2 contains several modifications from the original algorithm not documented by OpenAI: value function is also clipped and advantages are normalized.
Notes¶
- Original paper: https://arxiv.org/abs/1707.06347
- Clear explanation of PPO on Arxiv Insights channel: https://www.youtube.com/watch?v=5P7I-xPq8u8
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
- python -m stable_baselines.ppo2.run_atari
runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (-h) for more options.
- python -m stable_baselines.ppo2.run_mujoco
runs the algorithm for 1M frames on a Mujoco environment.
Can I use?¶
- Recurrent policies: ✔️
- Multi processing: ✔️
- Gym spaces:
Space | Action | Observation |
---|---|---|
Discrete | ✔️ | ✔️ |
Box | ✔️ | ✔️ |
MultiDiscrete | ✔️ | ✔️ |
MultiBinary | ✔️ | ✔️ |
Example¶
Train a PPO agent on CartPole-v1 using 4 processes.
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import PPO2
# multiprocess environment
n_cpu = 4
env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for i in range(n_cpu)])
model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save("ppo2_cartpole")
del model # remove to demonstrate saving and loading
model = PPO2.load("ppo2_cartpole")
# Enjoy trained agent
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
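As documented in the parameters below, learning_rate and cliprange can also be callables. Here is a minimal sketch, under the assumption that the callable receives the remaining training progress (decreasing from 1 to 0):
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2

def linear_schedule(progress_remaining):
    # decay linearly from 2.5e-4 towards 0 over the course of training
    return 2.5e-4 * progress_remaining

env = DummyVecEnv([lambda: gym.make('CartPole-v1')])
model = PPO2(MlpPolicy, env, learning_rate=linear_schedule, cliprange=0.2, verbose=1)
model.learn(total_timesteps=25000)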
Parameters¶
-
class
stable_baselines.ppo2.
PPO2
(policy, env, gamma=0.99, n_steps=128, ent_coef=0.01, learning_rate=0.00025, vf_coef=0.5, max_grad_norm=0.5, lam=0.95, nminibatches=4, noptepochs=4, cliprange=0.2, verbose=0, tensorboard_log=None, _init_setup_model=True)[source]¶ Proximal Policy Optimization algorithm (GPU version). Paper: https://arxiv.org/abs/1707.06347
Parameters: - policy – (ActorCriticPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, CnnLstmPolicy, …)
- env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
- gamma – (float) Discount factor
- n_steps – (int) The number of steps to run for each environment per update (i.e. batch size is n_steps * n_env where n_env is number of environment copies running in parallel)
- ent_coef – (float) Entropy coefficient for the loss calculation
- learning_rate – (float or callable) The learning rate, it can be a function
- vf_coef – (float) Value function coefficient for the loss calculation
- max_grad_norm – (float) The maximum value for the gradient clipping
- lam – (float) Factor for trade-off of bias vs variance for Generalized Advantage Estimator
- nminibatches – (int) Number of training minibatches per update. For recurrent policies, the number of environments run in parallel should be a multiple of nminibatches.
- noptepochs – (int) Number of epoch when optimizing the surrogate
- cliprange – (float or callable) Clipping parameter, it can be a function
- verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
- tensorboard_log – (str) the log location for tensorboard (if None, no logging)
- _init_setup_model – (bool) Whether or not to build the network at the creation of the instance
-
action_probability
(observation, state=None, mask=None)¶ Get the model’s action probability distribution from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns: (np.ndarray) the model’s action probability distribution
-
get_env
()¶ returns the current environment (can be None if not defined)
Returns: (Gym Environment) The current environment
-
learn
(total_timesteps, callback=None, seed=None, log_interval=1, tb_log_name='PPO2')[source]¶ Return a trained model.
Parameters: - total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict)) function called at every step with the state of the algorithm. It takes the local and global variables.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
Returns: (BaseRLModel) the trained model
-
classmethod
load
(load_path, env=None, **kwargs)¶ Load the model from file
Parameters: - load_path – (str) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
-
predict
(observation, state=None, mask=None, deterministic=False)¶ Get the model’s action from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
-
save
(save_path)[source]¶ Save the current parameters to file
Parameters: save_path – (str) the save location
-
set_env
(env)¶ Checks the validity of the environment, and if it is coherent, set it as the current environment.
Parameters: env – (Gym Environment) The environment for learning a policy
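Because learning_rate and cliprange accept either a constant or a callable, a decaying schedule can be passed directly when creating the model. Below is a minimal, hedged sketch: it assumes the callable receives the remaining training progress (a float going from 1 at the start of training down to 0), which should be checked against the library version you are using; linear_decay is a hypothetical helper defined here only for illustration.
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv

def linear_decay(initial_value):
    # returns a schedule going linearly from initial_value to 0
    # (assumption: the callable receives the remaining progress, from 1 to 0)
    def schedule(progress):
        return progress * initial_value
    return schedule

env = DummyVecEnv([lambda: gym.make('CartPole-v1')])
model = PPO2(MlpPolicy, env,
             learning_rate=linear_decay(2.5e-4),  # decaying learning rate
             cliprange=linear_decay(0.2),         # decaying clip range
             verbose=1)
model.learn(total_timesteps=10000)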
TRPO¶
Trust Region Policy Optimization (TRPO) is an iterative approach for optimizing policies with guaranteed monotonic improvement.
Notes¶
- Original paper: https://arxiv.org/abs/1502.05477
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
mpirun -np 16 python -m stable_baselines.trpo_mpi.run_atari
runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (-h) for more options.
python -m stable_baselines.trpo_mpi.run_mujoco
runs the algorithm for 1M timesteps on a MuJoCo environment.
Can I use?¶
- Recurrent policies: ✔️
- Multi processing: ✔️ (using MPI)
- Gym spaces:
Space | Action | Observation |
---|---|---|
Discrete | ✔️ | ✔️ |
Box | ✔️ | ✔️ |
MultiDiscrete | ✔️ | ✔️ |
MultiBinary | ✔️ | ✔️ |
Example¶
import gym
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import TRPO
env = gym.make('CartPole-v1')
env = DummyVecEnv([lambda: env])
model = TRPO(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save("trpo_cartpole")
del model # remove to demonstrate saving and loading
model = TRPO.load("trpo_cartpole")
obs = env.reset()
while True:
action, _states = model.predict(obs)
obs, rewards, dones, info = env.step(action)
env.render()
Parameters¶
-
class
stable_baselines.trpo_mpi.
TRPO
(policy, env, gamma=0.99, timesteps_per_batch=1024, max_kl=0.01, cg_iters=10, lam=0.98, entcoeff=0.0, cg_damping=0.01, vf_stepsize=0.0003, vf_iters=3, verbose=0, tensorboard_log=None, _init_setup_model=True)[source]¶ -
action_probability
(observation, state=None, mask=None)¶ Get the model’s action probability distribution from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns: (np.ndarray) the model’s action probability distribution
-
get_env
()¶ returns the current environment (can be None if not defined)
Returns: (Gym Environment) The current environment
-
learn
(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='TRPO')[source]¶ Return a trained model.
Parameters: - total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict)) function called at every step with the state of the algorithm. It takes the local and global variables.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
Returns: (BaseRLModel) the trained model
-
classmethod
load
(load_path, env=None, **kwargs)¶ Load the model from file
Parameters: - load_path – (str) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
-
predict
(observation, state=None, mask=None, deterministic=False)¶ Get the model’s action from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
-
save
(save_path)[source]¶ Save the current parameters to file
Parameters: save_path – (str) the save location
-
set_env
(env)¶ Checks the validity of the environment, and if it is coherent, set it as the current environment.
Parameters: env – (Gym Environment) The environment for learning a policy
-
Probability Distributions¶
Probability distributions used for the different action spaces:
- CategoricalProbabilityDistribution -> Discrete
- DiagGaussianProbabilityDistribution -> Box (continuous actions)
- MultiCategoricalProbabilityDistribution -> MultiDiscrete
- BernoulliProbabilityDistribution -> MultiBinary
The policy networks output parameters for the distributions (named flat in the methods). Actions are then sampled from those distributions.
For instance, in the case of discrete actions, the policy network outputs the probability of taking each action. The CategoricalProbabilityDistribution allows sampling from it, computing the entropy and the negative log probability (neglogp), and backpropagating the gradient.
In the case of continuous actions, a Gaussian distribution is used. The policy network outputs
mean and (log) std of the distribution (assumed to be a DiagGaussianProbabilityDistribution
).
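As a short illustration of how these pieces fit together (a hedged sketch, not taken from the documented examples: the tensor shapes and the direct use of make_proba_dist_type here are assumptions), the distribution type matching an action space can be created and then instantiated from the flat parameters produced by a policy network:
import tensorflow as tf
from gym import spaces
from stable_baselines.common.distributions import make_proba_dist_type

# Discrete action space with 4 actions -> categorical distribution type
pd_type = make_proba_dist_type(spaces.Discrete(4))

# stand-in for the policy network output (the flat parameters, here 4 logits)
logits = tf.placeholder(tf.float32, shape=[None, 4])
pd = pd_type.proba_distribution_from_flat(logits)

action = pd.sample()          # sample an action from the distribution
neglogp = pd.neglogp(action)  # negative log probability of the sampled action
entropy = pd.entropy()        # entropy of the distribution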
-
class
stable_baselines.common.distributions.
BernoulliProbabilityDistribution
(logits)[source]¶ -
-
classmethod
fromflat
(flat)[source]¶ Create an instance of this from new bernoulli input
Parameters: flat – ([float]) the bernoulli input data Returns: (ProbabilityDistribution) the instance from the given bernoulli input data
-
kl
(other)[source]¶ Calculates the Kullback-Leibler divergence from the given probability distribution
Parameters: other – ([float]) the distribution to compare with Returns: (float) the KL divergence of the two distributions
-
classmethod
-
class
stable_baselines.common.distributions.
BernoulliProbabilityDistributionType
(size)[source]¶ -
-
proba_distribution_from_latent
(pi_latent_vector, vf_latent_vector, init_scale=1.0, init_bias=0.0)[source]¶ returns the probability distribution from latent values
Parameters: - pi_latent_vector – ([float]) the latent pi values
- vf_latent_vector – ([float]) the latent vf values
- init_scale – (float) the initial scale of the distribution
- init_bias – (float) the initial bias of the distribution
Returns: (ProbabilityDistribution) the instance of the ProbabilityDistribution associated
-
-
class
stable_baselines.common.distributions.
CategoricalProbabilityDistribution
(logits)[source]¶ -
-
classmethod
fromflat
(flat)[source]¶ Create an instance of this from new logits values
Parameters: flat – ([float]) the categorical logits input Returns: (ProbabilityDistribution) the instance from the given categorical input
-
kl
(other)[source]¶ Calculates the Kullback-Leibler divergence from the given probability distribution
Parameters: other – ([float]) the distribution to compare with Returns: (float) the KL divergence of the two distributions
-
classmethod
-
class
stable_baselines.common.distributions.
CategoricalProbabilityDistributionType
(n_cat)[source]¶ -
-
proba_distribution_from_latent
(pi_latent_vector, vf_latent_vector, init_scale=1.0, init_bias=0.0)[source]¶ returns the probability distribution from latent values
Parameters: - pi_latent_vector – ([float]) the latent pi values
- vf_latent_vector – ([float]) the latent vf values
- init_scale – (float) the initial scale of the distribution
- init_bias – (float) the initial bias of the distribution
Returns: (ProbabilityDistribution) the instance of the ProbabilityDistribution associated
-
-
class
stable_baselines.common.distributions.
DiagGaussianProbabilityDistribution
(flat)[source]¶ -
-
classmethod
fromflat
(flat)[source]¶ Create an instance of this from new multivariate gaussian input
Parameters: flat – ([float]) the multivariate gaussian input data Returns: (ProbabilityDistribution) the instance from the given multivariate gaussian input data
-
kl
(other)[source]¶ Calculates the Kullback-Leibler divergence from the given probability distribution
Parameters: other – ([float]) the distribution to compare with Returns: (float) the KL divergence of the two distributions
-
classmethod
-
class
stable_baselines.common.distributions.
DiagGaussianProbabilityDistributionType
(size)[source]¶ -
-
proba_distribution_from_flat
(flat)[source]¶ returns the probability distribution from flat probabilities
Parameters: flat – ([float]) the flat probabilities Returns: (ProbabilityDistribution) the instance of the ProbabilityDistribution associated
-
proba_distribution_from_latent
(pi_latent_vector, vf_latent_vector, init_scale=1.0, init_bias=0.0)[source]¶ returns the probability distribution from latent values
Parameters: - pi_latent_vector – ([float]) the latent pi values
- vf_latent_vector – ([float]) the latent vf values
- init_scale – (float) the initial scale of the distribution
- init_bias – (float) the initial bias of the distribution
Returns: (ProbabilityDistribution) the instance of the ProbabilityDistribution associated
-
-
class
stable_baselines.common.distributions.
MultiCategoricalProbabilityDistribution
(nvec, flat)[source]¶ -
-
classmethod
fromflat
(flat)[source]¶ Create an instance of this from new logits values
Parameters: flat – ([float]) the multi categorical logits input Returns: (ProbabilityDistribution) the instance from the given multi categorical input
-
kl
(other)[source]¶ Calculates the Kullback-Leibler divergence from the given probability distribution
Parameters: other – ([float]) the distribution to compare with Returns: (float) the KL divergence of the two distributions
-
classmethod
-
class
stable_baselines.common.distributions.
MultiCategoricalProbabilityDistributionType
(n_vec)[source]¶ -
-
proba_distribution_from_flat
(flat)[source]¶ Returns the probability distribution from flat probabilities, i.e. the flattened vector of parameters of the probability distribution
Parameters: flat – ([float]) the flat probabilities Returns: (ProbabilityDistribution) the instance of the ProbabilityDistribution associated
-
proba_distribution_from_latent
(pi_latent_vector, vf_latent_vector, init_scale=1.0, init_bias=0.0)[source]¶ returns the probability distribution from latent values
Parameters: - pi_latent_vector – ([float]) the latent pi values
- vf_latent_vector – ([float]) the latent vf values
- init_scale – (float) the initial scale of the distribution
- init_bias – (float) the initial bias of the distribution
Returns: (ProbabilityDistribution) the instance of the ProbabilityDistribution associated
-
-
class
stable_baselines.common.distributions.
ProbabilityDistribution
[source]¶ A particular probability distribution
-
kl
(other)[source]¶ Calculates the Kullback-Leibler divergence from the given probability distribution
Parameters: other – ([float]) the distribution to compare with Returns: (float) the KL divergence of the two distributions
-
logp
(x)[source]¶ returns the log likelihood
Parameters: x – (str) the labels of each index Returns: ([float]) The log likelihood of the distribution
-
-
class
stable_baselines.common.distributions.
ProbabilityDistributionType
[source]¶ Parametrized family of probability distributions
-
param_placeholder
(prepend_shape, name=None)[source]¶ returns the TensorFlow placeholder for the input parameters
Parameters: - prepend_shape – ([int]) the prepend shape
- name – (str) the placeholder name
Returns: (TensorFlow Tensor) the placeholder
-
proba_distribution_from_flat
(flat)[source]¶ Returns the probability distribution from flat probabilities, i.e. the flattened vector of parameters of the probability distribution
Parameters: flat – ([float]) the flat probabilities Returns: (ProbabilityDistribution) the instance of the ProbabilityDistribution associated
-
proba_distribution_from_latent
(pi_latent_vector, vf_latent_vector, init_scale=1.0, init_bias=0.0)[source]¶ returns the probability distribution from latent values
Parameters: - pi_latent_vector – ([float]) the latent pi values
- vf_latent_vector – ([float]) the latent vf values
- init_scale – (float) the initial scale of the distribution
- init_bias – (float) the initial bias of the distribution
Returns: (ProbabilityDistribution) the instance of the ProbabilityDistribution associated
-
probability_distribution_class
()[source]¶ returns the ProbabilityDistribution class of this type
Returns: (Type ProbabilityDistribution) the probability distribution class associated
-
-
stable_baselines.common.distributions.
make_proba_dist_type
(ac_space)[source]¶ return an instance of ProbabilityDistributionType for the correct type of action space
Parameters: ac_space – (Gym Space) the input action space Returns: (ProbabilityDistributionType) the appropriate instance of a ProbabilityDistributionType
Tensorflow Utils¶
-
stable_baselines.common.tf_util.
conv2d
(input_tensor, num_filters, name, filter_size=(3, 3), stride=(1, 1), pad='SAME', dtype=tf.float32, collections=None, summary_tag=None)[source]¶ Creates a 2d convolutional layer for TensorFlow
Parameters: - input_tensor – (TensorFlow Tensor) The input tensor for the convolution
- num_filters – (int) The number of filters
- name – (str) The TensorFlow variable scope
- filter_size – (tuple) The filter size
- stride – (tuple) The stride of the convolution
- pad – (str) The padding type (‘VALID’ or ‘SAME’)
- dtype – (type) The data type for the Tensors
- collections – (list) List of graph collections keys to add the Variable to
- summary_tag – (str) image summary name, can be None for no image summary
Returns: (TensorFlow Tensor) 2d convolutional layer
-
stable_baselines.common.tf_util.
display_var_info
(_vars)[source]¶ log variable information, for debug purposes
Parameters: _vars – ([TensorFlow Tensor]) the variables
-
stable_baselines.common.tf_util.
flatgrad
(loss, var_list, clip_norm=None)[source]¶ calculates the gradient and flattens it
Parameters: - loss – (float) the loss value
- var_list – ([TensorFlow Tensor]) the variables
- clip_norm – (float) clip the gradients (disabled if None)
Returns: ([TensorFlow Tensor]) flattened gradient
-
stable_baselines.common.tf_util.
flattenallbut0
(tensor)[source]¶ flattens all the dimensions except the first one
Parameters: tensor – (TensorFlow Tensor) the input tensor Returns: (TensorFlow Tensor) the flattened tensor
-
stable_baselines.common.tf_util.
function
(inputs, outputs, updates=None, givens=None)[source]¶ Just like Theano function. Take a bunch of tensorflow placeholders and expressions computed based on those placeholders and produces f(inputs) -> outputs. Function f takes values to be fed to the input’s placeholders and produces the values of the expressions in outputs.
Input values can be passed in the same order as inputs or can be provided as kwargs based on placeholder name (passed to constructor or accessible via placeholder.op.name).
- Example:
x = tf.placeholder(tf.int32, (), name="x")
y = tf.placeholder(tf.int32, (), name="y")
z = 3 * x + 2 * y
lin = function([x, y], z, givens={y: 0})
with single_threaded_session():
    initialize()
    assert lin(2) == 6
    assert lin(x=3) == 9
    assert lin(2, 2) == 10
    assert lin(x=2, y=3) == 12
Parameters: - inputs – (TensorFlow Tensor or Object with make_feed_dict) list of input arguments
- outputs – (TensorFlow Tensor) list of outputs or a single output to be returned from function. Returned value will also have the same shape.
- updates – (list) update functions
- givens – (dict) the values known for the output
-
stable_baselines.common.tf_util.
get_available_gpus
()[source]¶ Return a list of all the available GPUs
Returns: ([str]) the GPUs available
-
stable_baselines.common.tf_util.
get_globals_vars
(name)[source]¶ returns the global variables in the given scope
Parameters: name – (str) the scope Returns: ([TensorFlow Variable])
-
stable_baselines.common.tf_util.
get_trainable_vars
(name)[source]¶ returns the trainable variables
Parameters: name – (str) the scope Returns: ([TensorFlow Variable])
-
stable_baselines.common.tf_util.
huber_loss
(tensor, delta=1.0)[source]¶ Reference: https://en.wikipedia.org/wiki/Huber_loss
Parameters: - tensor – (TensorFlow Tensor) the input value
- delta – (float) huber loss delta value
Returns: (TensorFlow Tensor) huber loss output
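A hedged usage sketch (the tensor values below are illustrative, and the loss is assumed to be applied element-wise, so it is averaged here before use):
import tensorflow as tf
from stable_baselines.common.tf_util import huber_loss, single_threaded_session

td_errors = tf.constant([0.5, 2.0, -3.0])
# Huber loss: quadratic for |x| <= delta, linear beyond it
loss = tf.reduce_mean(huber_loss(td_errors, delta=1.0))

with single_threaded_session():
    print(loss.eval())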
-
stable_baselines.common.tf_util.
in_session
(func)[source]¶ wraps a function so that it runs inside a TensorFlow session
Parameters: func – (function) the function to wrap Returns: (function)
-
stable_baselines.common.tf_util.
initialize
(sess=None)[source]¶ Initialize all the uninitialized variables in the global scope.
Parameters: sess – (TensorFlow Session)
-
stable_baselines.common.tf_util.
intprod
(tensor)[source]¶ calculates the product of all the elements in a list
Parameters: tensor – ([Number]) the list of elements Returns: (int) the product, truncated to an int
-
stable_baselines.common.tf_util.
leaky_relu
(tensor, leak=0.2)[source]¶ Leaky ReLU http://web.stanford.edu/~awni/papers/relu_hybrid_icml2013_final.pdf
Parameters: - tensor – (float) the input value
- leak – (float) the leaking coefficient when the function is saturated
Returns: (float) Leaky ReLU output
-
stable_baselines.common.tf_util.
load_state
(fname, sess=None, var_list=None)[source]¶ Load a TensorFlow saved model
Parameters: - fname – (str) the graph name
- sess – (TensorFlow Session) the session, if None: get_default_session()
- var_list – ([TensorFlow Tensor] or dict(str: TensorFlow Tensor)) A list of Variable/SaveableObject, or a dictionary mapping names to SaveableObjects. If None, defaults to the list of all saveable objects.
-
stable_baselines.common.tf_util.
make_session
(num_cpu=None, make_default=False, graph=None)[source]¶ Returns a session that will use <num_cpu> CPUs only
Parameters: - num_cpu – (int) number of CPUs to use for TensorFlow
- make_default – (bool) if this should return an InteractiveSession or a normal Session
- graph – (TensorFlow Graph) the graph of the session
Returns: (TensorFlow session)
-
stable_baselines.common.tf_util.
normc_initializer
(std=1.0, axis=0)[source]¶ Return a parameter initializer for TensorFlow
Parameters: - std – (float) standard deviation
- axis – (int) the axis to normalize on
Returns: (function)
-
stable_baselines.common.tf_util.
numel
(tensor)[source]¶ get TensorFlow Tensor’s number of elements
Parameters: tensor – (TensorFlow Tensor) the input tensor Returns: (int) the number of elements
-
stable_baselines.common.tf_util.
outer_scope_getter
(scope, new_scope='')[source]¶ remove a scope layer for the getter
Parameters: - scope – (str) the layer to remove
- new_scope – (str) optional replacement name
Returns: (function (function, str, *args, **kwargs): TensorFlow Tensor)
-
stable_baselines.common.tf_util.
save_state
(fname, sess=None, var_list=None)[source]¶ Save a TensorFlow model
Parameters: - fname – (str) the graph name
- sess – (TensorFlow Session) The tf session, if None, get_default_session()
- var_list – ([TensorFlow Tensor] or dict(str: TensorFlow Tensor)) A list of Variable/SaveableObject, or a dictionary mapping names to SaveableObjects. If None, defaults to the list of all saveable objects.
-
stable_baselines.common.tf_util.
single_threaded_session
(make_default=False, graph=None)[source]¶ Returns a session which will only use a single CPU
Parameters: - make_default – (bool) if this should return an InteractiveSession or a normal Session
- graph – (TensorFlow Graph) the graph of the session
Returns: (TensorFlow session)
-
stable_baselines.common.tf_util.
switch
(condition, then_expression, else_expression)[source]¶ Switches between two operations depending on a scalar value (int or bool). Note that both then_expression and else_expression should be symbolic tensors of the same shape.
Parameters: - condition – (TensorFlow Tensor) scalar tensor.
- then_expression – (TensorFlow Operation)
- else_expression – (TensorFlow Operation)
Returns: (TensorFlow Operation) the switch output
Command Utils¶
Helpers for scripts like run_atari.py.
-
stable_baselines.common.cmd_util.
arg_parser
()[source]¶ Create an empty argparse.ArgumentParser.
Returns: (ArgumentParser)
-
stable_baselines.common.cmd_util.
atari_arg_parser
()[source]¶ Create an argparse.ArgumentParser for run_atari.py.
Returns: (ArgumentParser) parser {‘–env’: ‘BreakoutNoFrameskip-v4’, ‘–seed’: 0, ‘–num-timesteps’: int(1e7)}
-
stable_baselines.common.cmd_util.
make_atari_env
(env_id, num_env, seed, wrapper_kwargs=None, start_index=0, allow_early_resets=True)[source]¶ Create a wrapped, monitored SubprocVecEnv for Atari.
Parameters: - env_id – (str) the environment ID
- num_env – (int) the number of environments you wish to have in subprocesses
- seed – (int) the initial seed for the RNG
- wrapper_kwargs – (dict) the parameters for wrap_deepmind function
- start_index – (int) start rank index
- allow_early_resets – (bool) allows early reset of the environment
Returns: (Gym Environment) The atari environment
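A minimal sketch of feeding this helper directly into a model (the use of VecFrameStack and the 'CnnPolicy' string below are assumptions of this example, not part of make_atari_env itself):
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.vec_env import VecFrameStack
from stable_baselines import PPO2

# 4 wrapped, monitored Atari environments running in subprocesses
env = make_atari_env('BreakoutNoFrameskip-v4', num_env=4, seed=0)
# stack 4 frames, as is usual for Atari agents
env = VecFrameStack(env, n_stack=4)

model = PPO2('CnnPolicy', env, verbose=1)
model.learn(total_timesteps=10000)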
-
stable_baselines.common.cmd_util.
make_mujoco_env
(env_id, seed, allow_early_resets=True)[source]¶ Create a wrapped, monitored gym.Env for MuJoCo.
Parameters: - env_id – (str) the environment ID
- seed – (int) the initial seed for the RNG
- allow_early_resets – (bool) allows early reset of the environment
Returns: (Gym Environment) The mujoco environment
-
stable_baselines.common.cmd_util.
make_robotics_env
(env_id, seed, rank=0, allow_early_resets=True)[source]¶ Create a wrapped, monitored gym.Env for MuJoCo.
Parameters: - env_id – (str) the environment ID
- seed – (int) the initial seed for the RNG
- rank – (int) the rank of the environment (for logging)
- allow_early_resets – (bool) allows early reset of the environment
Returns: (Gym Environment) The robotic environment
Schedules¶
Schedules are used as hyperparameters for most of the algorithms, in order to change the value of a parameter over time (usually the learning rate).
This file is used for specifying various schedules that evolve over time throughout the execution of the algorithm, such as:
- learning rate for the optimizer
- exploration epsilon for the epsilon greedy exploration strategy
- beta parameter for prioritized replay
Each schedule has a function value(t) which returns the current value of the parameter given the timestep t of the optimization procedure.
-
class
stable_baselines.common.schedules.
ConstantSchedule
(value)[source]¶ Value remains constant over time.
Parameters: value – (float) Constant value of the schedule
-
class
stable_baselines.common.schedules.
LinearSchedule
(schedule_timesteps, final_p, initial_p=1.0)[source]¶ Linear interpolation between initial_p and final_p over schedule_timesteps. After this many timesteps pass final_p is returned.
Parameters: - schedule_timesteps – (int) Number of timesteps for which to linearly anneal initial_p to final_p
- initial_p – (float) initial output value
- final_p – (float) final output value
-
class
stable_baselines.common.schedules.
PiecewiseSchedule
(endpoints, interpolation=<function linear_interpolation>, outside_value=None)[source]¶ Piecewise schedule.
Parameters: - endpoints – ([(int, int)]) list of pairs (time, value) meanining that schedule should output value when t==time. All the values for time must be sorted in an increasing order. When t is between two times, e.g. (time_a, value_a) and (time_b, value_b), such that time_a <= t < time_b then value outputs interpolation(value_a, value_b, alpha) where alpha is a fraction of time passed between time_a and time_b for time t.
- interpolation – (lambda (float, float, float): float) a function that takes value to the left and to the right of t according to the endpoints. Alpha is the fraction of distance from left endpoint to right endpoint that t has covered. See linear_interpolation for example.
- outside_value – (float) if the value is requested outside of all the intervals specified in endpoints, this value is returned. If None, an AssertionError is raised when an outside value is requested.
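A minimal sketch of querying these schedules through value(t) (the numbers are illustrative):
from stable_baselines.common.schedules import ConstantSchedule, LinearSchedule, PiecewiseSchedule

const = ConstantSchedule(1e-3)
linear = LinearSchedule(schedule_timesteps=10000, initial_p=1.0, final_p=0.02)
piecewise = PiecewiseSchedule([(0, 1.0), (5000, 0.1), (10000, 0.01)],
                              outside_value=0.01)

for t in (0, 5000, 20000):
    # each schedule returns its current value at timestep t
    print(t, const.value(t), linear.value(t), piecewise.value(t))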
Changelog¶
For download links, please look at Github release page.
Release 2.2.0 (2018-11-07)¶
- Hotfix for ppo2, the wrong placeholder was used for the value function
Release 2.1.2 (2018-11-06)¶
- added async_eigen_decomp parameter for ACKTR and set it to False by default (remove deprecation warnings)
- added methods for calling env methods/setting attributes inside a VecEnv (thanks to @bjmuld)
- updated gym minimum version
Release 2.1.1 (2018-10-20)¶
- fixed MpiAdam synchronization issue in PPO1 (thanks to @brendenpetersen) issue #50
- fixed dependency issues (new mujoco-py requires a mujoco licence + gym broke MultiDiscrete space shape)
Release 2.1.0 (2018-10-2)¶
Warning
This version contains breaking changes for DQN policies, please read the full details
Bug fixes + doc update
- added patch fix for equal function using gym.spaces.MultiDiscrete and gym.spaces.MultiBinary
- fixes for DQN action_probability
- re-added double DQN + refactored DQN policies breaking changes
- replaced async with async_eigen_decomp in ACKTR/KFAC for python 3.7 compatibility
- removed action clipping for prediction of continuous actions (see issue #36)
- fixed NaN issue due to clipping the continuous action in the wrong place (issue #36)
- documentation was updated (policy + DDPG example hyperparameters)
Release 2.0.0 (2018-09-18)¶
Warning
This version contains breaking changes, please read the full details
Tensorboard, refactoring and bug fixes
- Renamed DeepQ to DQN breaking changes
- Renamed DeepQPolicy to DQNPolicy breaking changes
- fixed DDPG behavior breaking changes
- changed default policies for DDPG, so that DDPG now works correctly breaking changes
- added more documentation (some modules from common).
- added doc about using custom env
- added Tensorboard support for A2C, ACER, ACKTR, DDPG, DeepQ, PPO1, PPO2 and TRPO
- added episode reward to Tensorboard
- added documentation for Tensorboard usage
- added Identity for Box action space
- fixed render function ignoring parameters when using wrapped environments
- fixed PPO1 and TRPO done values for recurrent policies
- fixed image normalization not occurring when using images
- updated VecEnv objects for the new Gym version
- added test for DDPG
- refactored DQN policies
- added registry for policies, can be passed as string to the agent
- added documentation for custom policies + policy registration
- fixed numpy warning when using DDPG Memory
- fixed DummyVecEnv not copying the observation array when stepping and resetting
- added pre-built docker images + installation instructions
- added deterministic argument in the predict function
- added assert in PPO2 for recurrent policies
- fixed predict function to handle both vectorized and unwrapped environment
- added input check to the predict function
- refactored ActorCritic models to reduce code duplication
- refactored Off Policy models (to begin HER and replay_buffer refactoring)
- added tests for auto vectorization detection
- fixed render function, to handle positional arguments
Release 1.0.7 (2018-08-29)¶
Bug fixes and documentation
- added html documentation using sphinx + integration with read the docs
- cleaned up README + typos
- fixed normalization for DQN with images
- fixed DQN identity test
Release 1.0.1 (2018-08-20)¶
Refactored Stable Baselines
- refactored A2C, ACER, ACKTR, DDPG, DeepQ, GAIL, TRPO, PPO1 and PPO2 under a single constant class
- added callback to refactored algorithm training
- added saving and loading to refactored algorithms
- refactored ACER, DDPG, GAIL, PPO1 and TRPO to fit with A2C, PPO2 and ACKTR policies
- added new policies for most algorithms (Mlp, MlpLstm, MlpLnLstm, Cnn, CnnLstm and CnnLnLstm)
- added dynamic environment switching (so continual RL learning is now feasible)
- added prediction from observation and action probability from observation for all the algorithms
- fixed graph issues, so models won't collide in names
- fixed behavior_clone weight loading for GAIL
- fixed Tensorflow using all the GPU VRAM
- fixed models so that they are all compatible with vectorized environments
- fixed `set_global_seed` to update `gym.spaces`'s random seed
- fixed PPO1 and TRPO performance issues when learning identity function
- added new tests for loading, saving, continuous actions and learning the identity function
- fixed DQN wrapping for atari
- added saving and loading for Vecnormalize wrapper
- added automatic detection of action space (for the policy network)
- fixed ACER buffer with constant values assuming n_stack=4
- fixed some RL algorithms not clipping the action to be in the action_space, when using `gym.spaces.Box`
- refactored algorithms can take either a `gym.Environment` or a `str` ([if the environment name is registered](https://github.com/openai/gym/wiki/Environments))
- Hotfix in ACER (compared to v1.0.0)
Future Work :
- Finish refactoring HER
- Refactor ACKTR and ACER for continuous implementation
Release 0.1.6 (2018-07-27)¶
Deobfuscation of the code base + pep8 and fixes
- Fixed tf.session().__enter__() being used, rather than sess = tf.session() and passing the session to the objects
- Fixed uneven scoping of TensorFlow Sessions throughout the code
- Fixed rolling vecwrapper to handle observations that are not only grayscale images
- Fixed deepq saving the environment when trying to save itself
- Fixed ValueError: Cannot take the length of Shape with unknown rank. in acktr, when running the run_atari.py script
- Fixed calling baselines sequentially no longer creates graph conflicts
- Fixed mean on empty array warning with deepq
- Fixed kfac eigen decomposition not cast to float64, when the parameter use_float64 is set to True
- Fixed Dataset data loader, not correctly resetting id position if shuffling is disabled
- Fixed EOFError when reading from connection in the worker in subproc_vec_env.py
- Fixed behavior_clone weight loading and saving for GAIL
- Avoid taking the square root of a negative number in trpo_mpi.py
- Removed some duplicated code (a2cpolicy, trpo_mpi)
- Removed unused, undocumented and crashing function reset_task in subproc_vec_env.py
- Reformatted code to PEP8 style
- Documented all the codebase
- Added atari tests
- Added logger tests
Missing: tests for acktr continuous (+ HER, gail but they rely on mujoco…)
Maintainers¶
Stable-Baselines is currently maintained by Ashley Hill (aka @hill-a) and Antonin Raffin (aka @araffin).
Contributors (since v2.0.0):¶
In random order…
Thanks to @bjmuld @iambenzo @iandanforth @r7vme @brendenpetersen @huvar
Plotting Results¶
-
stable_baselines.results_plotter.
main
()[source]¶ Example usage in jupyter-notebook
from stable_baselines import log_viewer
%matplotlib inline
log_viewer.plot_results(["./log"], 10e6, log_viewer.X_TIMESTEPS, "Breakout")
Here ./log is a directory containing the monitor.csv files
-
stable_baselines.results_plotter.
plot_curves
(xy_list, xaxis, title)[source]¶ plot the curves
Parameters: - xy_list – ([(np.ndarray, np.ndarray)]) the x and y coordinates to plot
- xaxis – (str) the axis for the x and y output (can be X_TIMESTEPS=’timesteps’, X_EPISODES=’episodes’ or X_WALLTIME=’walltime_hrs’)
- title – (str) the title of the plot
-
stable_baselines.results_plotter.
plot_results
(dirs, num_timesteps, xaxis, task_name)[source]¶ plot the results
Parameters: - dirs – ([str]) the save location of the results to plot
- num_timesteps – (int) only plot the points below this value
- xaxis – (str) the axis for the x and y output (can be X_TIMESTEPS=’timesteps’, X_EPISODES=’episodes’ or X_WALLTIME=’walltime_hrs’)
- task_name – (str) the title of the task to plot
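A hedged usage sketch for plot_results (it assumes ./log contains monitor.csv files produced during training, e.g. by the Monitor wrapper):
import matplotlib.pyplot as plt
from stable_baselines import results_plotter

# plot episode rewards against timesteps for everything logged under ./log
results_plotter.plot_results(["./log"], int(1e6),
                             results_plotter.X_TIMESTEPS, "CartPole-v1")
plt.show()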
-
stable_baselines.results_plotter.
rolling_window
(array, window)[source]¶ apply a rolling window to a np.ndarray
Parameters: - array – (np.ndarray) the input Array
- window – (int) length of the rolling window
Returns: (np.ndarray) rolling window on the input array
-
stable_baselines.results_plotter.
ts2xy
(timesteps, xaxis)[source]¶ Decompose a timesteps variable into xs and ys
Parameters: - timesteps – (Pandas DataFrame) the input data
- xaxis – (str) the axis for the x and y output (can be X_TIMESTEPS=’timesteps’, X_EPISODES=’episodes’ or X_WALLTIME=’walltime_hrs’)
Returns: (np.ndarray, np.ndarray) the x and y output
-
stable_baselines.results_plotter.
window_func
(var_1, var_2, window, func)[source]¶ apply a function to the rolling window of 2 arrays
Parameters: - var_1 – (np.ndarray) variable 1
- var_2 – (np.ndarray) variable 2
- window – (int) length of the rolling window
- func – (numpy function) function to apply on the rolling window on variable 2 (such as np.mean)
Returns: (np.ndarray, np.ndarray) the rolling output with applied function
Citing Stable Baselines¶
To cite this project in publications:
@misc{stable-baselines,
author = {Hill, Ashley and Raffin, Antonin and Traore, Rene and Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
title = {Stable Baselines},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/hill-a/stable-baselines}},
}
Contributing¶
For anyone interested in making the RL baselines better, there are still some improvements that need to be done: good-to-have features like support for continuous actions in ACER, and more documentation on the RL algorithms.
If you want to contribute, please open an issue first and then propose your pull request on Github at https://github.com/hill-a/stable-baselines.