Policy Networks
Stable-baselines provides a set of default policies that can be used with most action spaces.
To customize the default policies, you can pass the policy_kwargs parameter to the model class you use.
Those kwargs are then passed to the policy on instantiation (see Custom Policy Network for an example).
If you need more control over the policy architecture, you can also create a custom policy (see Custom Policy Network).
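As a minimal sketch of the policy_kwargs mechanism, the snippet below builds a kwargs dictionary and passes it to a model. The choice of PPO2, the CartPole-v1 environment, and the layer sizes are assumptions made for illustration; net_arch is the FeedForwardPolicy parameter documented further down this page.

```python
# Keyword arguments forwarded to the policy constructor.
# The layer sizes are arbitrary illustrative values.
policy_kwargs = dict(net_arch=[dict(pi=[128, 128], vf=[128, 128])])


def make_model():
    """Create a PPO2 model whose MlpPolicy uses the custom architecture.

    The import is local so that defining this function does not require
    stable-baselines to be installed. PPO2 and CartPole-v1 are assumptions.
    """
    from stable_baselines import PPO2

    return PPO2("MlpPolicy", "CartPole-v1", policy_kwargs=policy_kwargs, verbose=0)
```

Calling make_model() requires stable-baselines and Gym; the dictionary itself is plain Python and can be inspected or reused with another model class.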
Note
CnnPolicies are for images only. MlpPolicies are made for other types of features (e.g. robot joints).
Warning
For all algorithms (except DDPG and SAC), continuous actions are clipped during training and testing (to avoid out-of-bounds errors).
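The clipping mentioned in the warning amounts to bounding each action component by the limits of the action space. The pure-Python sketch below illustrates the idea only; the helper name clip_action and the example bounds are hypothetical, and the library's own implementation may differ.

```python
def clip_action(action, low, high):
    """Clip each component of a continuous action to the range [low, high]."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]


# Hypothetical bounds of a 2-dimensional continuous action space.
low, high = [-1.0, -2.0], [1.0, 2.0]

# The first component exceeds its upper bound and is clipped;
# the second is already within bounds and is left unchanged.
clipped = clip_action([1.5, -0.3], low, high)
```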
Available Policies
MlpPolicy | Policy object that implements actor critic, using a MLP (2 layers of 64)
MlpLstmPolicy | Policy object that implements actor critic, using LSTMs with a MLP feature extraction
MlpLnLstmPolicy | Policy object that implements actor critic, using layer-normalized LSTMs with a MLP feature extraction
CnnPolicy | Policy object that implements actor critic, using a CNN (the nature CNN)
CnnLstmPolicy | Policy object that implements actor critic, using LSTMs with a CNN feature extraction
CnnLnLstmPolicy | Policy object that implements actor critic, using layer-normalized LSTMs with a CNN feature extraction
Base Classes
class stable_baselines.common.policies.BasePolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, scale=False, obs_phs=None, add_action_ph=False)
The base policy object
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The batch size (n_env * n_steps)
- reuse – (bool) If the policy is reusable or not
- scale – (bool) whether or not to scale the input
- obs_phs – (TensorFlow Tensor, TensorFlow Tensor) a tuple containing an override for the observation placeholder and the processed observation placeholder, respectively
- add_action_ph – (bool) whether or not to create an action placeholder
- action_ph – tf.Tensor: placeholder for actions, shape (self.n_batch, ) + self.ac_space.shape.
- initial_state – The initial state of the policy. For feedforward policies, None. For a recurrent policy, a NumPy array of shape (self.n_env, ) + state_shape.
- is_discrete – bool: whether the action space is discrete.
- obs_ph – tf.Tensor: placeholder for observations, shape (self.n_batch, ) + self.ob_space.shape.
- proba_step(obs, state=None, mask=None) – Returns the action probability for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability
- processed_obs – tf.Tensor: processed observations, shape (self.n_batch, ) + self.ob_space.shape. The form of processing depends on the type of the observation space and on the scale parameter passed to the constructor; see observation_input for more information.
- step(obs, state=None, mask=None) – Returns the policy for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float], [float], [float], [float]) actions, values, states, neglogp
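The placeholder shapes listed for BasePolicy follow directly from n_env, n_steps, and the shapes of the spaces. The sketch below works through that arithmetic in plain Python; the function name and the example numbers (4 environments, 5 steps, an 84x84x4 image observation, a discrete action space with an empty shape) are assumptions for illustration.

```python
def placeholder_shape(n_env, n_steps, space_shape):
    """Return (n_batch,) + space_shape, with n_batch = n_env * n_steps."""
    n_batch = n_env * n_steps
    return (n_batch,) + tuple(space_shape)


# Image observations: one row per (environment, step) pair.
obs_shape = placeholder_shape(n_env=4, n_steps=5, space_shape=(84, 84, 4))

# A space with an empty shape yields a flat batch of scalars.
act_shape = placeholder_shape(n_env=4, n_steps=5, space_shape=())
```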
class stable_baselines.common.policies.ActorCriticPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, scale=False)
Policy object that implements actor critic
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The batch size (n_env * n_steps)
- reuse – (bool) If the policy is reusable or not
- scale – (bool) whether or not to scale the input
- action – tf.Tensor: stochastic action, of shape (self.n_batch, ) + self.ac_space.shape.
- deterministic_action – tf.Tensor: deterministic action, of shape (self.n_batch, ) + self.ac_space.shape.
- neglogp – tf.Tensor: negative log likelihood of the action sampled by self.action.
- pdtype – ProbabilityDistributionType: type of the distribution for stochastic actions.
- policy – tf.Tensor: policy output, e.g. logits.
- policy_proba – tf.Tensor: parameters of the probability distribution. Depends on pdtype.
- proba_distribution – ProbabilityDistribution: distribution of stochastic actions.
- step(obs, state=None, mask=None, deterministic=False) – Returns the policy for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: ([float], [float], [float], [float]) actions, values, states, neglogp
- value(obs, state=None, mask=None) – Returns the value for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
- value_flat – tf.Tensor: value estimate, of shape (self.n_batch, )
- value_fn – tf.Tensor: value estimate, of shape (self.n_batch, 1)
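To make the deterministic_action and neglogp attributes more concrete, here is a plain-Python sketch for a discrete action space. It assumes the policy output is a vector of logits over a categorical distribution and that the deterministic action is the most likely one (argmax). The function names are hypothetical and the code does not call the library.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deterministic_action(logits):
    """Pick the index of the largest logit (the mode of the distribution)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def neglogp(logits, action):
    """Negative log likelihood of `action` under the categorical distribution."""
    return -math.log(softmax(logits)[action])


logits = [2.0, 0.5, -1.0]
best = deterministic_action(logits)
nll = neglogp(logits, best)
```

A lower neglogp means the action was more likely under the current policy; the most likely action therefore has the smallest neglogp.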
class stable_baselines.common.policies.FeedForwardPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, layers=None, net_arch=None, act_fun=tf.tanh, cnn_extractor=nature_cnn, feature_extraction='cnn', **kwargs)
Policy object that implements actor critic, using a feed forward neural network.
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The batch size (n_env * n_steps)
- reuse – (bool) If the policy is reusable or not
- layers – ([int]) (deprecated, use net_arch instead) The size of the Neural network for the policy (if None, default to [64, 64])
- net_arch – (list) Specification of the actor-critic policy network architecture (see mlp_extractor documentation for details).
- act_fun – (tf.func) the activation function to use in the neural network.
- cnn_extractor – (function (TensorFlow Tensor, **kwargs): (TensorFlow Tensor)) the CNN feature extraction
- feature_extraction – (str) The feature extraction type (“cnn” or “mlp”)
- kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
- proba_step(obs, state=None, mask=None) – Returns the action probability for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability
- step(obs, state=None, mask=None, deterministic=False) – Returns the policy for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: ([float], [float], [float], [float]) actions, values, states, neglogp
- value(obs, state=None, mask=None) – Returns the value for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) The associated value of the action
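A custom policy of the kind mentioned at the top of this page can be sketched by subclassing FeedForwardPolicy and fixing its net_arch and feature_extraction arguments. The class name CustomPolicy and the layer sizes are illustrative assumptions, and the import is done lazily so the snippet can be loaded without stable-baselines installed.

```python
# Separate policy (pi) and value (vf) networks, three layers of 128 each.
# These sizes are arbitrary illustrative values.
NET_ARCH = [dict(pi=[128, 128, 128], vf=[128, 128, 128])]


def build_custom_policy():
    """Return a FeedForwardPolicy subclass with a fixed MLP architecture."""
    from stable_baselines.common.policies import FeedForwardPolicy

    class CustomPolicy(FeedForwardPolicy):
        def __init__(self, *args, **kwargs):
            # Use MLP feature extraction instead of the default "cnn".
            super().__init__(*args, net_arch=NET_ARCH,
                             feature_extraction="mlp", **kwargs)

    return CustomPolicy
```

The returned class can then be given to a model class as its policy argument, in place of a name such as "MlpPolicy".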
class stable_baselines.common.policies.LstmPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, layers=None, net_arch=None, act_fun=tf.tanh, cnn_extractor=nature_cnn, layer_norm=False, feature_extraction='cnn', **kwargs)
Policy object that implements actor critic, using LSTMs.
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The batch size (n_env * n_steps)
- n_lstm – (int) The number of LSTM cells (for recurrent policies)
- reuse – (bool) If the policy is reusable or not
- layers – ([int]) The size of the Neural network before the LSTM layer (if None, default to [64, 64])
- net_arch – (list) Specification of the actor-critic policy network architecture. Notation similar to the format described in mlp_extractor but with additional support for a ‘lstm’ entry in the shared network part.
- act_fun – (tf.func) the activation function to use in the neural network.
- cnn_extractor – (function (TensorFlow Tensor, **kwargs): (TensorFlow Tensor)) the CNN feature extraction
- layer_norm – (bool) Whether or not to use layer normalizing LSTMs
- feature_extraction – (str) The feature extraction type (“cnn” or “mlp”)
- kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
- proba_step(obs, state=None, mask=None) – Returns the action probability for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
Returns: ([float]) the action probability
- step(obs, state=None, mask=None, deterministic=False) – Returns the policy for a single step
Parameters: - obs – ([float] or [int]) The current observation of the environment
- state – ([float]) The last states (used in recurrent policies)
- mask – ([float]) The last masks (used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: ([float], [float], [float], [float]) actions, values, states, neglogp
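The state and mask arguments matter only for recurrent policies: the state carries the LSTM memory between calls, and the mask marks environments whose memory should be reset. The plain-Python sketch below shows one way such a reset can work, assuming a mask value of 1 means the episode in that environment has just ended. The function name and the state width are hypothetical, and this is not the library's code.

```python
def reset_done_states(states, masks):
    """Zero the recurrent state of every environment whose mask is 1."""
    return [[s * (1.0 - m) for s in env_state]
            for env_state, m in zip(states, masks)]


n_env, state_size = 3, 4

# One state vector per environment, shape (n_env, state_size).
states = [[1.0] * state_size for _ in range(n_env)]

# Only the second environment finished its episode.
masks = [0.0, 1.0, 0.0]

new_states = reset_done_states(states, masks)
```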
MLP Policies
class stable_baselines.common.policies.MlpPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)
Policy object that implements actor critic, using a MLP (2 layers of 64)
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The batch size (n_env * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
class stable_baselines.common.policies.MlpLstmPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, **_kwargs)
Policy object that implements actor critic, using LSTMs with a MLP feature extraction
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The batch size (n_env * n_steps)
- n_lstm – (int) The number of LSTM cells (for recurrent policies)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
class stable_baselines.common.policies.MlpLnLstmPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, **_kwargs)
Policy object that implements actor critic, using layer-normalized LSTMs with a MLP feature extraction
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The batch size (n_env * n_steps)
- n_lstm – (int) The number of LSTM cells (for recurrent policies)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
CNN Policies
class stable_baselines.common.policies.CnnPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **_kwargs)
Policy object that implements actor critic, using a CNN (the nature CNN)
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The batch size (n_env * n_steps)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
class stable_baselines.common.policies.CnnLstmPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, **_kwargs)
Policy object that implements actor critic, using LSTMs with a CNN feature extraction
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The batch size (n_env * n_steps)
- n_lstm – (int) The number of LSTM cells (for recurrent policies)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction
class stable_baselines.common.policies.CnnLnLstmPolicy(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=256, reuse=False, **_kwargs)
Policy object that implements actor critic, using layer-normalized LSTMs with a CNN feature extraction
Parameters: - sess – (TensorFlow session) The current TensorFlow session
- ob_space – (Gym Space) The observation space of the environment
- ac_space – (Gym Space) The action space of the environment
- n_env – (int) The number of environments to run
- n_steps – (int) The number of steps to run for each environment
- n_batch – (int) The batch size (n_env * n_steps)
- n_lstm – (int) The number of LSTM cells (for recurrent policies)
- reuse – (bool) If the policy is reusable or not
- _kwargs – (dict) Extra keyword arguments for the nature CNN feature extraction