DDPG

Deep Deterministic Policy Gradient (DDPG)

Warning

The DDPG model does not support stable_baselines.common.policies because it uses Q-value estimation instead of value estimation; as a result, it must use its own policy models (see DDPG Policies).

Available Policies

MlpPolicy Policy object that implements actor critic, using a MLP (2 layers of 64)
LnMlpPolicy Policy object that implements actor critic, using a MLP (2 layers of 64), with layer normalisation
CnnPolicy Policy object that implements actor critic, using a CNN (the nature CNN)
LnCnnPolicy Policy object that implements actor critic, using a CNN (the nature CNN), with layer normalisation
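
The policy argument of the model accepts either one of these classes or its name as a string (see the policy parameter below); a minimal sketch, assuming the MountainCarContinuous-v0 environment used in the example further down:

from stable_baselines import DDPG
from stable_baselines.ddpg.policies import LnMlpPolicy

# The two lines below build the same model: the policy can be passed
# either as a class or as its registered name.
model = DDPG(LnMlpPolicy, 'MountainCarContinuous-v0', verbose=1)
model = DDPG('LnMlpPolicy', 'MountainCarContinuous-v0', verbose=1)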

Notes

Can I use?

  • Recurrent policies: ❌
  • Multi processing: ❌
  • Gym spaces:
Space          Action   Observation
Discrete       ❌       ✔️
Box            ✔️       ✔️
MultiDiscrete  ❌       ✔️
MultiBinary    ❌       ✔️
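
DDPG only acts on continuous (Box) action spaces, so it can help to check the environment before building the model; a minimal sketch:

import gym

env = gym.make('MountainCarContinuous-v0')
# DDPG requires a continuous (Box) action space
assert isinstance(env.action_space, gym.spaces.Box), "DDPG only supports Box action spaces"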

Example

import gym
import numpy as np

from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.ddpg.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise, AdaptiveParamNoiseSpec
from stable_baselines import DDPG

env = gym.make('MountainCarContinuous-v0')
env = DummyVecEnv([lambda: env])

# the noise objects for DDPG
n_actions = env.action_space.shape[-1]
param_noise = None
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=float(0.5) * np.ones(n_actions))

model = DDPG(MlpPolicy, env, verbose=1, param_noise=param_noise, action_noise=action_noise)
model.learn(total_timesteps=400000)
model.save("ddpg_mountain")

del model # remove to demonstrate saving and loading

model = DDPG.load("ddpg_mountain")

obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
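
The example above uses Ornstein-Uhlenbeck action noise; the other imported noise classes can be swapped in the same way. A minimal sketch of the alternatives, reusing env, n_actions and MlpPolicy from the example (the noise scales are illustrative, not tuned values):

from stable_baselines.ddpg.policies import LnMlpPolicy

# Gaussian action noise instead of Ornstein-Uhlenbeck noise
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
model = DDPG(MlpPolicy, env, verbose=1, action_noise=action_noise)

# Or adaptive parameter noise, which perturbs the actor weights instead of
# the actions; it is usually paired with a layer-normalised policy
param_noise = AdaptiveParamNoiseSpec(initial_stddev=0.1, desired_action_stddev=0.1)
model = DDPG(LnMlpPolicy, env, verbose=1, param_noise=param_noise)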

Parameters

class stable_baselines.ddpg.DDPG(policy, env, gamma=0.99, memory_policy=None, eval_env=None, nb_train_steps=50, nb_rollout_steps=100, nb_eval_steps=100, param_noise=None, action_noise=None, normalize_observations=False, tau=0.001, batch_size=128, param_noise_adaption_interval=50, normalize_returns=False, enable_popart=False, observation_range=(-5.0, 5.0), critic_l2_reg=0.0, return_range=(-inf, inf), actor_lr=0.0001, critic_lr=0.001, clip_norm=None, reward_scale=1.0, render=False, render_eval=False, memory_limit=100, verbose=0, tensorboard_log=None, _init_setup_model=True)[source]

Deep Deterministic Policy Gradient (DDPG) model

DDPG: https://arxiv.org/pdf/1509.02971.pdf

Parameters:
  • policy – (DDPGPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, LnMlpPolicy, …)
  • env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
  • gamma – (float) the discount factor
  • memory_policy – (Memory) the replay buffer (if None, default to baselines.ddpg.memory.Memory)
  • eval_env – (Gym Environment) the evaluation environment (can be None)
  • nb_train_steps – (int) the number of training steps
  • nb_rollout_steps – (int) the number of rollout steps
  • nb_eval_steps – (int) the number of evaluation steps
  • param_noise – (AdaptiveParamNoiseSpec) the parameter noise type (can be None)
  • action_noise – (ActionNoise) the action noise type (can be None)
  • param_noise_adaption_interval – (int) apply param noise every N steps
  • tau – (float) the soft update coefficient (keep old values, between 0 and 1)
  • normalize_returns – (bool) should the critic output be normalized
  • enable_popart – (bool) enable pop-art normalization of the critic output (https://arxiv.org/pdf/1602.07714.pdf)
  • normalize_observations – (bool) should the observation be normalized
  • batch_size – (int) the size of the batch for learning the policy
  • observation_range – (tuple) the bounding values for the observation
  • return_range – (tuple) the bounding values for the critic output
  • critic_l2_reg – (float) l2 regularizer coefficient
  • actor_lr – (float) the actor learning rate
  • critic_lr – (float) the critic learning rate
  • clip_norm – (float) clip the gradients (disabled if None)
  • reward_scale – (float) the value the reward should be scaled by
  • render – (bool) enable rendering of the environment
  • render_eval – (bool) enable rendering of the evaluation environment
  • memory_limit – (int) the max number of transitions to store
  • verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
  • tensorboard_log – (str) the log location for tensorboard (if None, no logging)
  • _init_setup_model – (bool) Whether or not to build the network at the creation of the instance
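
Most of these arguments keep their defaults in practice; a minimal sketch of overriding a few commonly tuned ones (the values below are illustrative, not recommendations):

from stable_baselines import DDPG

model = DDPG('MlpPolicy', 'MountainCarContinuous-v0',
             gamma=0.99,                   # discount factor
             tau=0.001,                    # soft target update coefficient
             batch_size=128,               # minibatch size per training step
             actor_lr=0.0001,              # actor learning rate
             critic_lr=0.001,              # critic learning rate
             normalize_observations=True,  # normalize observations online
             memory_limit=50000,           # replay buffer capacity
             verbose=1)
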
action_probability(observation, state=None, mask=None)[source]

Get the model’s action probability distribution from an observation

Parameters:
  • observation – (np.ndarray) the input observation
  • state – (np.ndarray) The last states (can be None, used in recurrent policies)
  • mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns:

(np.ndarray) the model’s action probability distribution

get_env()

returns the current environment (can be None if not defined)

Returns: (Gym Environment) The current environment
learn(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='DDPG')[source]

Return a trained model.

Parameters:
  • total_timesteps – (int) The total number of samples to train on
  • seed – (int) The initial seed for training, if None: keep current seed
  • callback – (function (dict, dict)) function called at every step with the state of the algorithm. It takes the local and global variables.
  • log_interval – (int) The number of timesteps before logging.
  • tb_log_name – (str) the name of the run for tensorboard log
Returns:

(BaseRLModel) the trained model
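
A minimal sketch of such a callback, reusing the model built in the example above; it only counts how often it is invoked, since the exact contents of the locals and globals dictionaries depend on the training loop and are not documented here:

n_calls = [0]  # mutable counter captured by the closure

def simple_callback(locals_, globals_):
    # called at every step with the algorithm's local and global variables
    n_calls[0] += 1
    if n_calls[0] % 1000 == 0:
        print("callback invoked {} times".format(n_calls[0]))

model.learn(total_timesteps=10000, callback=simple_callback)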

classmethod load(load_path, env=None, **kwargs)[source]

Load the model from file

Parameters:
  • load_path – (str) the saved parameter location
  • env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
  • kwargs – extra arguments to change the model when loading
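
A minimal sketch of reloading the model saved in the example above and attaching a fresh environment for further use:

import gym
from stable_baselines import DDPG
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: gym.make('MountainCarContinuous-v0')])
# env can be None if the model is only used for prediction
model = DDPG.load("ddpg_mountain", env=env)
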
predict(observation, state=None, mask=None, deterministic=True)[source]

Get the model’s action from an observation

Parameters:
  • observation – (np.ndarray) the input observation
  • state – (np.ndarray) The last states (can be None, used in recurrent policies)
  • mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
  • deterministic – (bool) Whether or not to return deterministic actions.
Returns:

(np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)

save(save_path)[source]

Save the current parameters to file

Parameters: save_path – (str) the save location
set_env(env)

Checks the validity of the environment and, if it is coherent, sets it as the current environment.

Parameters: env – (Gym Environment) The environment for learning a policy
setup_model()[source]

Create all the functions and tensorflow graphs necessary to train the model

DDPG Policies

class stable_baselines.ddpg.MlpPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)[source]

Policy object that implements actor critic, using a MLP (2 layers of 64)

Parameters:
  • sess – (TensorFlow session) The current TensorFlow session
  • ob_space – (Gym Space) The observation space of the environment
  • ac_space – (Gym Space) The action space of the environment
  • n_env – (int) The number of environments to run
  • n_steps – (int) The number of steps to run for each environment
  • n_batch – (int) The number of batches to run (n_envs * n_steps)
  • reuse – (bool) If the policy is reusable or not
  • _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
make_actor(obs=None, reuse=False, scope='pi')

creates an actor object

Parameters:
  • obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
  • reuse – (bool) whether or not to reuse parameters
  • scope – (str) the scope name of the actor
Returns:

(TensorFlow Tensor) the output tensor

make_critic(obs=None, action=None, reuse=False, scope='qf')

creates a critic object

Parameters:
  • obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
  • action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
  • reuse – (bool) whether or not to reuse parameters
  • scope – (str) the scope name of the critic
Returns:

(TensorFlow Tensor) the output tensor

proba_step(obs, state=None, mask=None)

Returns the action probability for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) the action probability

step(obs, state=None, mask=None)

Returns the policy for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) actions

value(obs, action, state=None, mask=None)

Returns the value for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • action – ([float] or [int]) The taken action
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) The associated value of the action

class stable_baselines.ddpg.LnMlpPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)[source]

Policy object that implements actor critic, using a MLP (2 layers of 64), with layer normalisation

Parameters:
  • sess – (TensorFlow session) The current TensorFlow session
  • ob_space – (Gym Space) The observation space of the environment
  • ac_space – (Gym Space) The action space of the environment
  • n_env – (int) The number of environments to run
  • n_steps – (int) The number of steps to run for each environment
  • n_batch – (int) The number of batches to run (n_envs * n_steps)
  • reuse – (bool) If the policy is reusable or not
  • _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
make_actor(obs=None, reuse=False, scope='pi')

creates an actor object

Parameters:
  • obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
  • reuse – (bool) whether or not to reuse parameters
  • scope – (str) the scope name of the actor
Returns:

(TensorFlow Tensor) the output tensor

make_critic(obs=None, action=None, reuse=False, scope='qf')

creates a critic object

Parameters:
  • obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
  • action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
  • reuse – (bool) whether or not to reuse parameters
  • scope – (str) the scope name of the critic
Returns:

(TensorFlow Tensor) the output tensor

proba_step(obs, state=None, mask=None)

Returns the action probability for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) the action probability

step(obs, state=None, mask=None)

Returns the policy for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) actions

value(obs, action, state=None, mask=None)

Returns the value for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • action – ([float] or [int]) The taken action
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) The associated value of the action

class stable_baselines.ddpg.CnnPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)[source]

Policy object that implements actor critic, using a CNN (the nature CNN)

Parameters:
  • sess – (TensorFlow session) The current TensorFlow session
  • ob_space – (Gym Space) The observation space of the environment
  • ac_space – (Gym Space) The action space of the environment
  • n_env – (int) The number of environments to run
  • n_steps – (int) The number of steps to run for each environment
  • n_batch – (int) The number of batches to run (n_envs * n_steps)
  • reuse – (bool) If the policy is reusable or not
  • _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
make_actor(obs=None, reuse=False, scope='pi')

creates an actor object

Parameters:
  • obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
  • reuse – (bool) whether or not to reuse parameters
  • scope – (str) the scope name of the actor
Returns:

(TensorFlow Tensor) the output tensor

make_critic(obs=None, action=None, reuse=False, scope='qf')

creates a critic object

Parameters:
  • obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
  • action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
  • reuse – (bool) whether or not to reuse parameters
  • scope – (str) the scope name of the critic
Returns:

(TensorFlow Tensor) the output tensor

proba_step(obs, state=None, mask=None)

Returns the action probability for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) the action probability

step(obs, state=None, mask=None)

Returns the policy for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) actions

value(obs, action, state=None, mask=None)

Returns the value for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • action – ([float] or [int]) The taken action
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) The associated value of the action

class stable_baselines.ddpg.LnCnnPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)[source]

Policy object that implements actor critic, using a CNN (the nature CNN), with layer normalisation

Parameters:
  • sess – (TensorFlow session) The current TensorFlow session
  • ob_space – (Gym Space) The observation space of the environment
  • ac_space – (Gym Space) The action space of the environment
  • n_env – (int) The number of environments to run
  • n_steps – (int) The number of steps to run for each environment
  • n_batch – (int) The number of batches to run (n_envs * n_steps)
  • reuse – (bool) If the policy is reusable or not
  • _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
make_actor(obs=None, reuse=False, scope='pi')

creates an actor object

Parameters:
  • obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
  • reuse – (bool) whether or not to reuse parameters
  • scope – (str) the scope name of the actor
Returns:

(TensorFlow Tensor) the output tensor

make_critic(obs=None, action=None, reuse=False, scope='qf')

creates a critic object

Parameters:
  • obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
  • action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
  • reuse – (bool) whether or not to reuse parameters
  • scope – (str) the scope name of the critic
Returns:

(TensorFlow Tensor) the output tensor

proba_step(obs, state=None, mask=None)

Returns the action probability for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) the action probability

step(obs, state=None, mask=None)

Returns the policy for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) actions

value(obs, action, state=None, mask=None)

Returns the value for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • action – ([float] or [int]) The taken action
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) The associated value of the action

Action and Parameters Noise

class stable_baselines.ddpg.AdaptiveParamNoiseSpec(initial_stddev=0.1, desired_action_stddev=0.1, adoption_coefficient=1.01)[source]

Implements adaptive parameter noise

Parameters:
  • initial_stddev – (float) the initial value for the standard deviation of the noise
  • desired_action_stddev – (float) the desired value for the standard deviation of the noise
  • adoption_coefficient – (float) the update coefficient for the standard deviation of the noise
adapt(distance)[source]

update the standard deviation for the parameter noise

Parameters: distance – (float) the noise distance applied to the parameters
get_stats()[source]

return the statistics of the parameter noise (its current standard deviation)

Returns: (dict) the stats of the noise
class stable_baselines.ddpg.NormalActionNoise(mean, sigma)[source]

A Gaussian action noise

Parameters:
  • mean – (float) the mean value of the noise
  • sigma – (float) the scale of the noise (std here)
reset()

called at the end of an episode to reset the noise

class stable_baselines.ddpg.OrnsteinUhlenbeckActionNoise(mean, sigma, theta=0.15, dt=0.01, initial_noise=None)[source]

An Ornstein Uhlenbeck action noise; it is designed to approximate Brownian motion with friction.

Based on http://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab

Parameters:
  • mean – (float) the mean of the noise
  • sigma – (float) the scale of the noise
  • theta – (float) the rate of mean reversion
  • dt – (float) the timestep for the noise
  • initial_noise – ([float]) the initial value for the noise output (if None: 0)
reset()[source]

reset the Ornstein Uhlenbeck noise to the initial position
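
For a multi-dimensional action space the action-noise objects take vector-valued means and scales, one entry per action dimension, as in the example at the top of this page; a minimal sketch (values are illustrative):

import numpy as np
from stable_baselines.ddpg.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise, AdaptiveParamNoiseSpec

n_actions = 2  # e.g. a 2-dimensional continuous action space

gaussian_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
ou_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=0.2 * np.ones(n_actions), theta=0.15, dt=0.01)
param_noise = AdaptiveParamNoiseSpec(initial_stddev=0.1, desired_action_stddev=0.1, adoption_coefficient=1.01)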

Custom Policy Network

As in the example given on the examples page, you can easily define a custom architecture for the policy network:

import gym

from stable_baselines.ddpg.policies import FeedForwardPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import DDPG

# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           layers=[128, 128, 128],
                                           layer_norm=False,
                                           feature_extraction="mlp")

# Create and wrap the environment
env = gym.make('Pendulum-v0')
env = DummyVecEnv([lambda: env])

model = DDPG(CustomPolicy, env, verbose=1)
# Train the agent
model.learn(total_timesteps=100000)
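
The same FeedForwardPolicy hooks should also cover image observations by switching the feature extractor; a hedged sketch, assuming feature_extraction="cnn" selects the nature CNN as it does for the built-in CnnPolicy and LnCnnPolicy:

# Custom CNN policy with layer normalisation
# (requires an environment with image observations)
class CustomCnnPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomCnnPolicy, self).__init__(*args, **kwargs,
                                              layer_norm=True,
                                              feature_extraction="cnn")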