Policy Networks

Stable Baselines provides a set of default policies that can be used with most action spaces. If you need more control over the policy architecture, you can also create a custom policy (see Custom Policy Network).
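For example, here is a minimal sketch of training with one of the default policies (the CartPole-v1 environment and the hyperparameters are arbitrary choices for illustration):

    import gym

    from stable_baselines import A2C
    from stable_baselines.common.policies import MlpPolicy
    from stable_baselines.common.vec_env import DummyVecEnv

    # wrap the environment in a (dummy) vectorized environment
    env = DummyVecEnv([lambda: gym.make('CartPole-v1')])

    # MlpPolicy is one of the default policies listed below
    model = A2C(MlpPolicy, env, verbose=1)
    model.learn(total_timesteps=10000)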

Note

CnnPolicies are for images only. MlpPolicies are made for other types of features (e.g. robot joints)

Warning

For all algorithms (except DDPG), continuous actions are only clipped during training (to avoid out-of-bound errors). However, you have to clip the action manually when using the predict() method, as shown in the sketch below.
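A minimal sketch of manual clipping at test time (assuming model is a trained model and env an environment with a continuous Box action space):

    import numpy as np

    obs = env.reset()
    action, _states = model.predict(obs)
    # clip the predicted action to the action space bounds before stepping
    action = np.clip(action, env.action_space.low, env.action_space.high)
    obs, reward, done, info = env.step(action)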

Available Policies

MlpPolicy – Policy object that implements actor-critic, using an MLP (2 layers of 64)
MlpLstmPolicy – Policy object that implements actor-critic, using LSTMs with MLP feature extraction
MlpLnLstmPolicy – Policy object that implements actor-critic, using layer-normalized LSTMs with MLP feature extraction
CnnPolicy – Policy object that implements actor-critic, using a CNN (the nature CNN)
CnnLstmPolicy – Policy object that implements actor-critic, using LSTMs with CNN feature extraction
CnnLnLstmPolicy – Policy object that implements actor-critic, using layer-normalized LSTMs with CNN feature extraction
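The recurrent policies (the Lstm variants) keep an internal state between steps and require a vectorized environment. As a sketch, reusing the env defined above: with PPO2, the number of environments must be a multiple of nminibatches when using a recurrent policy, so a single environment requires nminibatches=1.

    from stable_baselines import PPO2
    from stable_baselines.common.policies import MlpLstmPolicy

    # recurrent policies need a vectorized env; with PPO2 the number of
    # environments must be a multiple of nminibatches, hence nminibatches=1
    model = PPO2(MlpLstmPolicy, env, nminibatches=1)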

Base Classes

class stable_baselines.common.policies.ActorCriticPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, scale=False)[source]

Policy object that implements actor-critic

Parameters:
  • sess – (TensorFlow session) The current TensorFlow session
  • ob_space – (Gym Space) The observation space of the environment
  • ac_space – (Gym Space) The action space of the environment
  • n_env – (int) The number of environments to run
  • n_steps – (int) The number of steps to run for each environment
  • n_batch – (int) The batch size to run (n_envs * n_steps)
  • n_lstm – (int) The number of LSTM cells (for recurrent policies)
  • reuse – (bool) Whether or not the policy is reusable
  • scale – (bool) Whether or not to scale the input
proba_step(obs, state=None, mask=None)[source]

Returns the action probability for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) the action probability

step(obs, state=None, mask=None, deterministic=False)[source]

Returns the policy for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
  • deterministic – (bool) Whether or not to return deterministic actions.
Returns:

([float], [float], [float], [float]) actions, values, states, neglogp

value(obs, state=None, mask=None)[source]

Returns the value for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) The associated value of the given observation
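These methods are normally called by the algorithms themselves rather than by user code. As a sketch, the user-facing entry point corresponding to proba_step is the model's action_probability() method (reusing the model and env defined above):

    # query the action probabilities for the current observation;
    # internally this goes through the policy's proba_step
    obs = env.reset()
    action_probabilities = model.action_probability(obs)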

class stable_baselines.common.policies.FeedForwardPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, layers=None, net_arch=None, act_fun=tf.tanh, cnn_extractor=<function nature_cnn>, feature_extraction='cnn', **kwargs)[source]

Policy object that implements actor-critic, using a feed-forward neural network.

Parameters:
  • sess – (TensorFlow session) The current TensorFlow session
  • ob_space – (Gym Space) The observation space of the environment
  • ac_space – (Gym Space) The action space of the environment
  • n_env – (int) The number of environments to run
  • n_steps – (int) The number of steps to run for each environment
  • n_batch – (int) The batch size to run (n_envs * n_steps)
  • reuse – (bool) Whether or not the policy is reusable
  • layers – ([int]) (deprecated, use net_arch instead) The sizes of the hidden layers of the policy network (if None, defaults to [64, 64])
  • net_arch – (list) Specification of the actor-critic policy network architecture (see the mlp_extractor documentation for details, and the sketch after this class)
  • act_fun – (tf.func) The activation function to use in the neural network (default: tf.tanh)
  • cnn_extractor – (function (TensorFlow Tensor, **kwargs): (TensorFlow Tensor)) The CNN feature extraction
  • feature_extraction – (str) The feature extraction type (“cnn” or “mlp”)
  • kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
proba_step(obs, state=None, mask=None)[source]

Returns the action probability for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) the action probability

step(obs, state=None, mask=None, deterministic=False)[source]

Returns the policy for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
  • deterministic – (bool) Whether or not to return deterministic actions.
Returns:

([float], [float], [float], [float]) actions, values, states, neglogp

value(obs, state=None, mask=None)[source]

Returns the value for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) The associated value of the given observation
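As a sketch of how the net_arch and feature_extraction parameters are typically used, one can subclass FeedForwardPolicy and pass the subclass to a model (the layer sizes below are arbitrary; this mirrors the Custom Policy Network example):

    from stable_baselines import A2C
    from stable_baselines.common.policies import FeedForwardPolicy

    # custom MLP policy: separate branches of two layers of 128 units
    # for the policy (pi) and value function (vf) networks
    class CustomPolicy(FeedForwardPolicy):
        def __init__(self, *args, **kwargs):
            super(CustomPolicy, self).__init__(*args, **kwargs,
                                               net_arch=[dict(pi=[128, 128],
                                                              vf=[128, 128])],
                                               feature_extraction="mlp")

    model = A2C(CustomPolicy, env, verbose=1)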

class stable_baselines.common.policies.LstmPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, layers=None, cnn_extractor=<function nature_cnn>, layer_norm=False, feature_extraction='cnn', **kwargs)[source]

Policy object that implements actor-critic, using LSTMs.

Parameters:
  • sess – (TensorFlow session) The current TensorFlow session
  • ob_space – (Gym Space) The observation space of the environment
  • ac_space – (Gym Space) The action space of the environment
  • n_env – (int) The number of environments to run
  • n_steps – (int) The number of steps to run for each environment
  • n_batch – (int) The batch size to run (n_envs * n_steps)
  • n_lstm – (int) The number of LSTM cells (for recurrent policies)
  • reuse – (bool) Whether or not the policy is reusable
  • layers – ([int]) The sizes of the hidden layers before the LSTM layer (if None, defaults to [64, 64])
  • cnn_extractor – (function (TensorFlow Tensor, **kwargs): (TensorFlow Tensor)) The CNN feature extraction
  • layer_norm – (bool) Whether or not to use layer-normalized LSTMs
  • feature_extraction – (str) The feature extraction type (“cnn” or “mlp”)
  • kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
proba_step(obs, state=None, mask=None)[source]

Returns the action probability for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) the action probability

step(obs, state=None, mask=None, deterministic=False)[source]

Returns the policy for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
  • deterministic – (bool) Whether or not to return deterministic actions.
Returns:

([float], [float], [float], [float]) actions, values, states, neglogp

value(obs, state=None, mask=None)[source]

Returns the value for a single step

Parameters:
  • obs – ([float] or [int]) The current observation of the environment
  • state – ([float]) The last states (used in recurrent policies)
  • mask – ([float]) The last masks (used in recurrent policies)
Returns:

([float]) The associated value of the given observation
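At test time, a recurrent policy's LSTM state and the episode-start mask have to be fed back into predict() at every step. A minimal sketch (assuming model wraps an LSTM policy and env is a vectorized environment):

    obs = env.reset()
    state = None
    done = [False for _ in range(env.num_envs)]
    for _ in range(1000):
        # pass the LSTM state and the episode-start mask back in
        action, state = model.predict(obs, state=state, mask=done)
        obs, reward, done, info = env.step(action)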

MLP Policies

class stable_baselines.common.policies.MlpPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)[source]

Policy object that implements actor-critic, using an MLP (2 layers of 64)

Parameters:
  • sess – (TensorFlow session) The current TensorFlow session
  • ob_space – (Gym Space) The observation space of the environment
  • ac_space – (Gym Space) The action space of the environment
  • n_env – (int) The number of environments to run
  • n_steps – (int) The number of steps to run for each environment
  • n_batch – (int) The batch size to run (n_envs * n_steps)
  • reuse – (bool) Whether or not the policy is reusable
  • _kwargs – (dict) Extra keyword arguments forwarded to the FeedForwardPolicy constructor
class stable_baselines.common.policies.MlpLstmPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, **_kwargs)[source]

Policy object that implements actor-critic, using LSTMs with MLP feature extraction

Parameters:
  • sess – (TensorFlow session) The current TensorFlow session
  • ob_space – (Gym Space) The observation space of the environment
  • ac_space – (Gym Space) The action space of the environment
  • n_env – (int) The number of environments to run
  • n_steps – (int) The number of steps to run for each environment
  • n_batch – (int) The batch size to run (n_envs * n_steps)
  • n_lstm – (int) The number of LSTM cells (for recurrent policies)
  • reuse – (bool) Whether or not the policy is reusable
  • _kwargs – (dict) Extra keyword arguments forwarded to the LstmPolicy constructor
class stable_baselines.common.policies.MlpLnLstmPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, **_kwargs)[source]

Policy object that implements actor-critic, using layer-normalized LSTMs with MLP feature extraction

Parameters:
  • sess – (TensorFlow session) The current TensorFlow session
  • ob_space – (Gym Space) The observation space of the environment
  • ac_space – (Gym Space) The action space of the environment
  • n_env – (int) The number of environments to run
  • n_steps – (int) The number of steps to run for each environment
  • n_batch – (int) The batch size to run (n_envs * n_steps)
  • n_lstm – (int) The number of LSTM cells (for recurrent policies)
  • reuse – (bool) Whether or not the policy is reusable
  • _kwargs – (dict) Extra keyword arguments forwarded to the LstmPolicy constructor
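In practice these policy classes are rarely instantiated directly: the constructor arguments documented above can be supplied through the model's policy_kwargs argument. A sketch (the activation function and layer sizes are arbitrary):

    import tensorflow as tf
    from stable_baselines import PPO2

    # forward constructor arguments (here FeedForwardPolicy's act_fun
    # and net_arch) to the policy through policy_kwargs
    model = PPO2('MlpPolicy', env,
                 policy_kwargs=dict(act_fun=tf.nn.relu, net_arch=[32, 32]))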

CNN Policies

class stable_baselines.common.policies.CnnPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)[source]

Policy object that implements actor-critic, using a CNN (the nature CNN)

Parameters:
  • sess – (TensorFlow session) The current TensorFlow session
  • ob_space – (Gym Space) The observation space of the environment
  • ac_space – (Gym Space) The action space of the environment
  • n_env – (int) The number of environments to run
  • n_steps – (int) The number of steps to run for each environment
  • n_batch – (int) The batch size to run (n_envs * n_steps)
  • reuse – (bool) Whether or not the policy is reusable
  • _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
class stable_baselines.common.policies.CnnLstmPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, **_kwargs)[source]

Policy object that implements actor-critic, using LSTMs with CNN feature extraction

Parameters:
  • sess – (TensorFlow session) The current TensorFlow session
  • ob_space – (Gym Space) The observation space of the environment
  • ac_space – (Gym Space) The action space of the environment
  • n_env – (int) The number of environments to run
  • n_steps – (int) The number of steps to run for each environment
  • n_batch – (int) The batch size to run (n_envs * n_steps)
  • n_lstm – (int) The number of LSTM cells (for recurrent policies)
  • reuse – (bool) Whether or not the policy is reusable
  • _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
class stable_baselines.common.policies.CnnLnLstmPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, **_kwargs)[source]

Policy object that implements actor-critic, using layer-normalized LSTMs with CNN feature extraction

Parameters:
  • sess – (TensorFlow session) The current TensorFlow session
  • ob_space – (Gym Space) The observation space of the environment
  • ac_space – (Gym Space) The action space of the environment
  • n_env – (int) The number of environments to run
  • n_steps – (int) The number of steps to run for each environment
  • n_batch – (int) The batch size to run (n_envs * n_steps)
  • n_lstm – (int) The number of LSTM cells (for recurrent policies)
  • reuse – (bool) Whether or not the policy is reusable
  • _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
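CNN policies expect image observations. A minimal sketch on an Atari environment (the environment id and hyperparameters are arbitrary):

    from stable_baselines import A2C
    from stable_baselines.common.cmd_util import make_atari_env
    from stable_baselines.common.policies import CnnPolicy

    # make_atari_env creates a vectorized, preprocessed Atari environment
    env = make_atari_env('BreakoutNoFrameskip-v4', num_env=4, seed=0)
    model = A2C(CnnPolicy, env, verbose=1)
    model.learn(total_timesteps=10000)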