DDPG
Deep Deterministic Policy Gradient (DDPG)
Warning

The DDPG model does not support stable_baselines.common.policies, because it uses Q-value estimation instead of value estimation. As a result, it must use its own policy models (see DDPG Policies).
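For instance, the policy import for DDPG should come from the DDPG-specific module (a minimal sketch):

# Correct: DDPG policies live in stable_baselines.ddpg.policies
from stable_baselines.ddpg.policies import MlpPolicy

# Incorrect for DDPG: stable_baselines.common.policies.MlpPolicy is meant for
# the other actor-critic algorithms (A2C, PPO, ...) and will not work here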
Available Policies

Policy | Description
---|---
MlpPolicy | Policy object that implements actor critic, using a MLP (2 layers of 64)
LnMlpPolicy | Policy object that implements actor critic, using a MLP (2 layers of 64), with layer normalisation
CnnPolicy | Policy object that implements actor critic, using a CNN (the nature CNN)
LnCnnPolicy | Policy object that implements actor critic, using a CNN (the nature CNN), with layer normalisation
Notes

- Original paper: https://arxiv.org/abs/1509.02971
- Baselines post: https://blog.openai.com/better-exploration-with-parameter-noise/
- python -m stable_baselines.ddpg.main runs the algorithm for 1M frames = 10M timesteps on a Mujoco environment. See help (-h) for more options.
Can I use?

- Recurrent policies: ❌
- Multi processing: ❌
- Gym spaces:
Space | Action | Observation |
---|---|---|
Discrete | ❌ | ✔️ |
Box | ✔️ | ✔️ |
MultiDiscrete | ❌ | ✔️ |
MultiBinary | ❌ | ✔️ |
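Since DDPG only supports continuous (Box) action spaces, a quick sanity check on the environment can catch an incompatibility before training (a minimal sketch):

import gym

env = gym.make('MountainCarContinuous-v0')
# DDPG requires a Box action space; Discrete, MultiDiscrete and MultiBinary will fail
assert isinstance(env.action_space, gym.spaces.Box)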
Example

import gym
import numpy as np

from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.ddpg.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise, AdaptiveParamNoiseSpec
from stable_baselines import DDPG

env = gym.make('MountainCarContinuous-v0')
env = DummyVecEnv([lambda: env])

# the noise objects for DDPG
n_actions = env.action_space.shape[-1]
param_noise = None
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=float(0.5) * np.ones(n_actions))

model = DDPG(MlpPolicy, env, verbose=1, param_noise=param_noise, action_noise=action_noise)
model.learn(total_timesteps=400000)
model.save("ddpg_mountain")

del model  # remove to demonstrate saving and loading

model = DDPG.load("ddpg_mountain")

obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Parameters

class stable_baselines.ddpg.DDPG(policy, env, gamma=0.99, memory_policy=None, eval_env=None, nb_train_steps=50, nb_rollout_steps=100, nb_eval_steps=100, param_noise=None, action_noise=None, normalize_observations=False, tau=0.001, batch_size=128, param_noise_adaption_interval=50, normalize_returns=False, enable_popart=False, observation_range=(-5.0, 5.0), critic_l2_reg=0.0, return_range=(-inf, inf), actor_lr=0.0001, critic_lr=0.001, clip_norm=None, reward_scale=1.0, render=False, render_eval=False, memory_limit=100, verbose=0, tensorboard_log=None, _init_setup_model=True)
Deep Deterministic Policy Gradient (DDPG) model.
DDPG: https://arxiv.org/pdf/1509.02971.pdf
Parameters: - policy – (DDPGPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, LnMlpPolicy, …)
- env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
- gamma – (float) the discount rate
- memory_policy – (Memory) the replay buffer (if None, default to baselines.ddpg.memory.Memory)
- eval_env – (Gym Environment) the evaluation environment (can be None)
- nb_train_steps – (int) the number of training steps
- nb_rollout_steps – (int) the number of rollout steps
- nb_eval_steps – (int) the number of evaluation steps
- param_noise – (AdaptiveParamNoiseSpec) the parameter noise type (can be None)
- action_noise – (ActionNoise) the action noise type (can be None)
- param_noise_adaption_interval – (int) apply param noise every N steps
- tau – (float) the soft update coefficient (keep old values, between 0 and 1)
- normalize_returns – (bool) should the critic output be normalized
- enable_popart – (bool) enable pop-art normalization of the critic output (https://arxiv.org/pdf/1602.07714.pdf)
- normalize_observations – (bool) should the observation be normalized
- batch_size – (int) the size of the batch for learning the policy
- observation_range – (tuple) the bounding values for the observation
- return_range – (tuple) the bounding values for the critic output
- critic_l2_reg – (float) l2 regularizer coefficient
- actor_lr – (float) the actor learning rate
- critic_lr – (float) the critic learning rate
- clip_norm – (float) clip the gradients (disabled if None)
- reward_scale – (float) the value the reward should be scaled by
- render – (bool) enable rendering of the environment
- render_eval – (bool) enable rendering of the evaluation environment
- memory_limit – (int) the max number of transitions to store
- verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
- tensorboard_log – (str) the log location for tensorboard (if None, no logging)
- _init_setup_model – (bool) Whether or not to build the network at the creation of the instance
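For illustration, several of these hyperparameters can be set at construction time (a minimal sketch; the values below are arbitrary examples, not tuned recommendations):

from stable_baselines import DDPG

model = DDPG('MlpPolicy', 'Pendulum-v0',
             gamma=0.99,                   # discount rate
             tau=0.001,                    # soft target update coefficient
             batch_size=128,               # minibatch size for policy learning
             actor_lr=0.0001,              # actor learning rate
             critic_lr=0.001,              # critic learning rate
             normalize_observations=True,  # normalize incoming observations
             verbose=1)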
action_probability(observation, state=None, mask=None)
Get the model's action probability distribution from an observation.
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns: (np.ndarray) the model’s action probability distribution
get_env()
Returns the current environment (can be None if not defined).
Returns: (Gym Environment) The current environment
learn(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='DDPG')
Return a trained model.
Parameters: - total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict)) function called at every step with the state of the algorithm. It takes the local and global variables as arguments.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
Returns: (BaseRLModel) the trained model
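As an example, a callback receives the training loop's local and global variables at every step (a minimal sketch; the available keys depend on the implementation, so inspect locals_ before relying on any particular name):

def callback(locals_, globals_):
    # locals_ holds the local variables of DDPG.learn at the current step;
    # uncomment to see which keys are available in your version:
    # print(sorted(locals_.keys()))
    return True  # in recent versions, returning False stops training early

model.learn(total_timesteps=400000, callback=callback)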
classmethod load(load_path, env=None, **kwargs)
Load the model from file.
Parameters: - load_path – (str) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
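For example, a saved model can be reloaded and attached to a (compatible) environment in one call (a minimal sketch, reusing the env from the example above):

# Reload the trained model and attach an environment for further use
model = DDPG.load("ddpg_mountain", env=env)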
predict(observation, state=None, mask=None, deterministic=True)
Get the model's action from an observation.
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
save(save_path)
Save the current parameters to file.
Parameters: save_path – (str) the save location
set_env(env)
Checks the validity of the environment and, if it is coherent, sets it as the current environment.
Parameters: env – (Gym Environment) The environment for learning a policy
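Together with load, this allows training to continue on a fresh environment instance (a minimal sketch):

import gym
from stable_baselines import DDPG
from stable_baselines.common.vec_env import DummyVecEnv

model = DDPG.load("ddpg_mountain")
env = DummyVecEnv([lambda: gym.make('MountainCarContinuous-v0')])
model.set_env(env)
model.learn(total_timesteps=100000)  # continue training on the new environment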
DDPG Policies

class stable_baselines.ddpg.MlpPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)
Policy object that implements actor critic, using a MLP (2 layers of 64)
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction

make_actor(obs=None, reuse=False, scope='pi')
Creates an actor object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the actor
Returns: (TensorFlow Tensor) the output tensor

make_critic(obs=None, action=None, reuse=False, scope='qf')
Creates a critic object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the critic
Returns: (TensorFlow Tensor) the output tensor

proba_step(obs, state=None, mask=None)
Returns the action probability for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability

step(obs, state=None, mask=None)
Returns the policy for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) actions

value(obs, action, state=None, mask=None)
Returns the value for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- action – ([float] or [int]) The taken action
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
class stable_baselines.ddpg.LnMlpPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)
Policy object that implements actor critic, using a MLP (2 layers of 64), with layer normalisation
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction

make_actor(obs=None, reuse=False, scope='pi')
Creates an actor object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the actor
Returns: (TensorFlow Tensor) the output tensor

make_critic(obs=None, action=None, reuse=False, scope='qf')
Creates a critic object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the critic
Returns: (TensorFlow Tensor) the output tensor

proba_step(obs, state=None, mask=None)
Returns the action probability for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability

step(obs, state=None, mask=None)
Returns the policy for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) actions

value(obs, action, state=None, mask=None)
Returns the value for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- action – ([float] or [int]) The taken action
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
class stable_baselines.ddpg.CnnPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)
Policy object that implements actor critic, using a CNN (the nature CNN)
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction

make_actor(obs=None, reuse=False, scope='pi')
Creates an actor object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the actor
Returns: (TensorFlow Tensor) the output tensor

make_critic(obs=None, action=None, reuse=False, scope='qf')
Creates a critic object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the critic
Returns: (TensorFlow Tensor) the output tensor

proba_step(obs, state=None, mask=None)
Returns the action probability for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability

step(obs, state=None, mask=None)
Returns the policy for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) actions

value(obs, action, state=None, mask=None)
Returns the value for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- action – ([float] or [int]) The taken action
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
class stable_baselines.ddpg.LnCnnPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)
Policy object that implements actor critic, using a CNN (the nature CNN), with layer normalisation
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction

make_actor(obs=None, reuse=False, scope='pi')
Creates an actor object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the actor
Returns: (TensorFlow Tensor) the output tensor

make_critic(obs=None, action=None, reuse=False, scope='qf')
Creates a critic object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the critic
Returns: (TensorFlow Tensor) the output tensor

proba_step(obs, state=None, mask=None)
Returns the action probability for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability

step(obs, state=None, mask=None)
Returns the policy for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) actions

value(obs, action, state=None, mask=None)
Returns the value for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- action – ([float] or [int]) The taken action
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
Action and Parameter Noise

class stable_baselines.ddpg.AdaptiveParamNoiseSpec(initial_stddev=0.1, desired_action_stddev=0.1, adoption_coefficient=1.01)
Implements adaptive parameter noise
Parameters: - initial_stddev – (float) the initial value for the standard deviation of the noise
- desired_action_stddev – (float) the desired value for the standard deviation of the noise
- adoption_coefficient – (float) the update coefficient for the standard deviation of the noise
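For example, parameter noise can be used in place of action noise when constructing the model (a minimal sketch; layer-normalised policies such as LnMlpPolicy are generally recommended with parameter noise, as discussed in the Baselines post above):

from stable_baselines import DDPG
from stable_baselines.ddpg.noise import AdaptiveParamNoiseSpec

# perturb the policy parameters instead of the actions
param_noise = AdaptiveParamNoiseSpec(initial_stddev=0.1, desired_action_stddev=0.1)
model = DDPG('LnMlpPolicy', 'Pendulum-v0', param_noise=param_noise, verbose=1)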
class stable_baselines.ddpg.NormalActionNoise(mean, sigma)
A Gaussian action noise
Parameters: - mean – (float) the mean value of the noise
- sigma – (float) the scale of the noise (standard deviation here)

reset()
Call at the end of an episode to reset the noise.
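A Gaussian noise object can be used as a drop-in alternative to the Ornstein-Uhlenbeck noise in the example above (a minimal sketch, assuming env is the environment created there):

import numpy as np
from stable_baselines.ddpg.noise import NormalActionNoise

n_actions = env.action_space.shape[-1]
# zero-mean Gaussian noise with standard deviation 0.1 per action dimension
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))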
class stable_baselines.ddpg.OrnsteinUhlenbeckActionNoise(mean, sigma, theta=0.15, dt=0.01, initial_noise=None)
An Ornstein-Uhlenbeck action noise, designed to approximate Brownian motion with friction.
Based on http://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab
Parameters: - mean – (float) the mean of the noise
- sigma – (float) the scale of the noise
- theta – (float) the rate of mean reversion
- dt – (float) the timestep for the noise
- initial_noise – ([float]) the initial value for the noise output, (if None: 0)
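Each call advances the noise by one discretised Ornstein-Uhlenbeck step, which can be sketched as follows (an illustrative sketch, not the library's exact code):

import numpy as np

def ou_step(x, mean, sigma, theta=0.15, dt=0.01):
    # x_next = x + theta * (mean - x) * dt + sigma * sqrt(dt) * N(0, 1)
    return x + theta * (mean - x) * dt + sigma * np.sqrt(dt) * np.random.normal(size=mean.shape)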
Custom Policy Network

Similarly to the example given in the examples page, you can easily define a custom architecture for the policy network:
import gym

from stable_baselines.ddpg.policies import FeedForwardPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import DDPG

# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           layers=[128, 128, 128],
                                           layer_norm=False,
                                           feature_extraction="mlp")

# Create and wrap the environment
env = gym.make('Pendulum-v0')
env = DummyVecEnv([lambda: env])

model = DDPG(CustomPolicy, env, verbose=1)
# Train the agent
model.learn(total_timesteps=100000)