DDPG
Deep Deterministic Policy Gradient (DDPG)
Warning

The DDPG model does not support stable_baselines.common.policies, because it uses Q-value estimation instead of value estimation. As a result, it must use its own policy models (see DDPG Policies).
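For instance, the policy import for DDPG should come from the DDPG-specific module (a minimal sketch):

# Correct: DDPG policies live in stable_baselines.ddpg.policies
from stable_baselines.ddpg.policies import MlpPolicy

# Incorrect for DDPG: stable_baselines.common.policies.MlpPolicy is meant for
# the other actor-critic algorithms (A2C, PPO, ...) and will not work here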
Available Policies

Policy | Description
---|---
MlpPolicy | Policy object that implements actor critic, using a MLP (2 layers of 64)
LnMlpPolicy | Policy object that implements actor critic, using a MLP (2 layers of 64), with layer normalisation
CnnPolicy | Policy object that implements actor critic, using a CNN (the nature CNN)
LnCnnPolicy | Policy object that implements actor critic, using a CNN (the nature CNN), with layer normalisation
Notes

- Original paper: https://arxiv.org/abs/1509.02971
- Baselines post: https://blog.openai.com/better-exploration-with-parameter-noise/
- python -m stable_baselines.ddpg.main runs the algorithm for 1M frames = 10M timesteps on a Mujoco environment. See help (-h) for more options.
Can I use?

- Recurrent policies: ❌
- Multi processing: ❌
- Gym spaces:
Space | Action | Observation |
---|---|---|
Discrete | ❌ | ✔️ |
Box | ✔️ | ✔️ |
MultiDiscrete | ❌ | ✔️ |
MultiBinary | ❌ | ✔️ |
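Since DDPG only supports continuous (Box) action spaces, a quick sanity check on the environment can catch an incompatibility before training (a minimal sketch):

import gym

env = gym.make('MountainCarContinuous-v0')
# DDPG requires a Box action space; Discrete, MultiDiscrete and MultiBinary will fail
assert isinstance(env.action_space, gym.spaces.Box)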
Example

import gym
import numpy as np

from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.ddpg.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise, AdaptiveParamNoiseSpec
from stable_baselines import DDPG

env = gym.make('MountainCarContinuous-v0')
env = DummyVecEnv([lambda: env])

# the noise objects for DDPG
n_actions = env.action_space.shape[-1]
param_noise = None
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=float(0.5) * np.ones(n_actions))

model = DDPG(MlpPolicy, env, verbose=1, param_noise=param_noise, action_noise=action_noise)
model.learn(total_timesteps=400000)
model.save("ddpg_mountain")

del model  # remove to demonstrate saving and loading

model = DDPG.load("ddpg_mountain")

obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
Parameters

class stable_baselines.ddpg.DDPG(policy, env, gamma=0.99, memory_policy=None, eval_env=None, nb_train_steps=50, nb_rollout_steps=100, nb_eval_steps=100, param_noise=None, action_noise=None, normalize_observations=False, tau=0.001, batch_size=128, param_noise_adaption_interval=50, normalize_returns=False, enable_popart=False, observation_range=(-5.0, 5.0), critic_l2_reg=0.0, return_range=(-inf, inf), actor_lr=0.0001, critic_lr=0.001, clip_norm=None, reward_scale=1.0, render=False, render_eval=False, memory_limit=100, verbose=0, tensorboard_log=None, _init_setup_model=True)
Deep Deterministic Policy Gradient (DDPG) model.
DDPG: https://arxiv.org/pdf/1509.02971.pdf
Parameters: - policy – (DDPGPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, LnMlpPolicy, …)
- env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
- gamma – (float) the discount rate
- memory_policy – (Memory) the replay buffer (if None, default to baselines.ddpg.memory.Memory)
- eval_env – (Gym Environment) the evaluation environment (can be None)
- nb_train_steps – (int) the number of training steps
- nb_rollout_steps – (int) the number of rollout steps
- nb_eval_steps – (int) the number of evaluation steps
- param_noise – (AdaptiveParamNoiseSpec) the parameter noise type (can be None)
- action_noise – (ActionNoise) the action noise type (can be None)
- param_noise_adaption_interval – (int) apply param noise every N steps
- tau – (float) the soft update coefficient (keep old values, between 0 and 1)
- normalize_returns – (bool) should the critic output be normalized
- enable_popart – (bool) enable pop-art normalization of the critic output (https://arxiv.org/pdf/1602.07714.pdf)
- normalize_observations – (bool) should the observation be normalized
- batch_size – (int) the size of the batch for learning the policy
- observation_range – (tuple) the bounding values for the observation
- return_range – (tuple) the bounding values for the critic output
- critic_l2_reg – (float) l2 regularizer coefficient
- actor_lr – (float) the actor learning rate
- critic_lr – (float) the critic learning rate
- clip_norm – (float) clip the gradients (disabled if None)
- reward_scale – (float) the value the reward should be scaled by
- render – (bool) enable rendering of the environment
- render_eval – (bool) enable rendering of the evaluation environment
- memory_limit – (int) the max number of transitions to store
- verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
- tensorboard_log – (str) the log location for tensorboard (if None, no logging)
- _init_setup_model – (bool) Whether or not to build the network at the creation of the instance
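For illustration, several of these hyperparameters can be set at construction time (a minimal sketch; the values below are arbitrary examples, not tuned recommendations):

from stable_baselines import DDPG

model = DDPG('MlpPolicy', 'Pendulum-v0',
             gamma=0.99,                   # discount rate
             tau=0.001,                    # soft target update coefficient
             batch_size=128,               # minibatch size for policy learning
             actor_lr=0.0001,              # actor learning rate
             critic_lr=0.001,              # critic learning rate
             normalize_observations=True,  # normalize incoming observations
             verbose=1)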
action_probability(observation, state=None, mask=None)
Get the model's action probability distribution from an observation.
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns: (np.ndarray) the model’s action probability distribution
get_env()
Returns the current environment (can be None if not defined).
Returns: (Gym Environment) The current environment
learn(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='DDPG')
Return a trained model.
Parameters: - total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict)) function called at every step with the state of the algorithm. It takes the local and global variables as arguments.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
Returns: (BaseRLModel) the trained model
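As an example, a callback receives the training loop's local and global variables at every step (a minimal sketch; the available keys depend on the implementation, so inspect locals_ before relying on any particular name):

def callback(locals_, globals_):
    # locals_ holds the local variables of DDPG.learn at the current step;
    # uncomment to see which keys are available in your version:
    # print(sorted(locals_.keys()))
    return True  # in recent versions, returning False stops training early

model.learn(total_timesteps=400000, callback=callback)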
classmethod load(load_path, env=None, **kwargs)
Load the model from file.
Parameters: - load_path – (str) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
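For example, a saved model can be reloaded and attached to a (compatible) environment in one call (a minimal sketch, reusing the env from the example above):

# Reload the trained model and attach an environment for further use
model = DDPG.load("ddpg_mountain", env=env)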
predict(observation, state=None, mask=None, deterministic=True)
Get the model's action from an observation.
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
save(save_path)
Save the current parameters to file.
Parameters: save_path – (str) the save location
set_env(env)
Checks the validity of the environment and, if it is coherent, sets it as the current environment.
Parameters: env – (Gym Environment) The environment for learning a policy
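Together with load, this allows training to continue on a fresh environment instance (a minimal sketch):

import gym
from stable_baselines import DDPG
from stable_baselines.common.vec_env import DummyVecEnv

model = DDPG.load("ddpg_mountain")
env = DummyVecEnv([lambda: gym.make('MountainCarContinuous-v0')])
model.set_env(env)
model.learn(total_timesteps=100000)  # continue training on the new environment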
DDPG Policies

class stable_baselines.ddpg.MlpPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)
Policy object that implements actor critic, using a MLP (2 layers of 64)
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction

make_actor(obs=None, reuse=False, scope='pi')
Creates an actor object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the actor
Returns: (TensorFlow Tensor) the output tensor

make_critic(obs=None, action=None, reuse=False, scope='qf')
Creates a critic object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the critic
Returns: (TensorFlow Tensor) the output tensor

proba_step(obs, state=None, mask=None)
Returns the action probability for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability

step(obs, state=None, mask=None)
Returns the policy for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) actions

value(obs, action, state=None, mask=None)
Returns the value for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- action – ([float] or [int]) The taken action
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
class stable_baselines.ddpg.LnMlpPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)
Policy object that implements actor critic, using a MLP (2 layers of 64), with layer normalisation
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction

make_actor(obs=None, reuse=False, scope='pi')
Creates an actor object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the actor
Returns: (TensorFlow Tensor) the output tensor

make_critic(obs=None, action=None, reuse=False, scope='qf')
Creates a critic object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the critic
Returns: (TensorFlow Tensor) the output tensor

proba_step(obs, state=None, mask=None)
Returns the action probability for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability

step(obs, state=None, mask=None)
Returns the policy for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) actions

value(obs, action, state=None, mask=None)
Returns the value for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- action – ([float] or [int]) The taken action
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
class stable_baselines.ddpg.CnnPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)
Policy object that implements actor critic, using a CNN (the nature CNN)
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction

make_actor(obs=None, reuse=False, scope='pi')
Creates an actor object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the actor
Returns: (TensorFlow Tensor) the output tensor

make_critic(obs=None, action=None, reuse=False, scope='qf')
Creates a critic object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the critic
Returns: (TensorFlow Tensor) the output tensor

proba_step(obs, state=None, mask=None)
Returns the action probability for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability

step(obs, state=None, mask=None)
Returns the policy for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) actions

value(obs, action, state=None, mask=None)
Returns the value for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- action – ([float] or [int]) The taken action
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
class stable_baselines.ddpg.LnCnnPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)
Policy object that implements actor critic, using a CNN (the nature CNN), with layer normalisation
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The number of batches to run (n_envs * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction

make_actor(obs=None, reuse=False, scope='pi')
Creates an actor object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the actor
Returns: (TensorFlow Tensor) the output tensor

make_critic(obs=None, action=None, reuse=False, scope='qf')
Creates a critic object.
Parameters: - obs – (TensorFlow Tensor) The observation placeholder (can be None for default placeholder)
- action – (TensorFlow Tensor) The action placeholder (can be None for default placeholder)
- reuse – (bool) whether or not to reuse parameters
- scope – (str) the scope name of the critic
Returns: (TensorFlow Tensor) the output tensor

proba_step(obs, state=None, mask=None)
Returns the action probability for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability

step(obs, state=None, mask=None)
Returns the policy for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) actions

value(obs, action, state=None, mask=None)
Returns the value for a single step.
Parameters: - obs – ([float] or [int]) The current observation of the environment
- action – ([float] or [int]) The taken action
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
Action and Parameter Noise

class stable_baselines.ddpg.AdaptiveParamNoiseSpec(initial_stddev=0.1, desired_action_stddev=0.1, adoption_coefficient=1.01)
Implements adaptive parameter noise
Parameters: - initial_stddev – (float) the initial value for the standard deviation of the noise
- desired_action_stddev – (float) the desired value for the standard deviation of the noise
- adoption_coefficient – (float) the update coefficient for the standard deviation of the noise
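For example, parameter noise can be used in place of action noise when constructing the model (a minimal sketch; layer-normalised policies such as LnMlpPolicy are generally recommended with parameter noise, as discussed in the Baselines post above):

from stable_baselines import DDPG
from stable_baselines.ddpg.noise import AdaptiveParamNoiseSpec

# perturb the policy parameters instead of the actions
param_noise = AdaptiveParamNoiseSpec(initial_stddev=0.1, desired_action_stddev=0.1)
model = DDPG('LnMlpPolicy', 'Pendulum-v0', param_noise=param_noise, verbose=1)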
class stable_baselines.ddpg.NormalActionNoise(mean, sigma)
A Gaussian action noise
Parameters: - mean – (float) the mean value of the noise
- sigma – (float) the scale of the noise (standard deviation here)

reset()
Call at the end of an episode to reset the noise.
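A Gaussian noise object can be used as a drop-in alternative to the Ornstein-Uhlenbeck noise in the example above (a minimal sketch, assuming env is the environment created there):

import numpy as np
from stable_baselines.ddpg.noise import NormalActionNoise

n_actions = env.action_space.shape[-1]
# zero-mean Gaussian noise with standard deviation 0.1 per action dimension
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))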
class stable_baselines.ddpg.OrnsteinUhlenbeckActionNoise(mean, sigma, theta=0.15, dt=0.01, initial_noise=None)
An Ornstein-Uhlenbeck action noise, designed to approximate Brownian motion with friction.
Based on http://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab
Parameters: - mean – (float) the mean of the noise
- sigma – (float) the scale of the noise
- theta – (float) the rate of mean reversion
- dt – (float) the timestep for the noise
- initial_noise – ([float]) the initial value for the noise output, (if None: 0)
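Each call advances the noise by one discretised Ornstein-Uhlenbeck step, which can be sketched as follows (an illustrative sketch, not the library's exact code):

import numpy as np

def ou_step(x, mean, sigma, theta=0.15, dt=0.01):
    # x_next = x + theta * (mean - x) * dt + sigma * sqrt(dt) * N(0, 1)
    return x + theta * (mean - x) * dt + sigma * np.sqrt(dt) * np.random.normal(size=mean.shape)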
Custom Policy Network

Similarly to the example given in the examples page, you can easily define a custom architecture for the policy network:
import gym

from stable_baselines.ddpg.policies import FeedForwardPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import DDPG

# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           layers=[128, 128, 128],
                                           layer_norm=False,
                                           feature_extraction="mlp")

# Create and wrap the environment
env = gym.make('Pendulum-v0')
env = DummyVecEnv([lambda: env])

model = DDPG(CustomPolicy, env, verbose=1)
# Train the agent
model.learn(total_timesteps=100000)