HER¶
Hindsight Experience Replay (HER)
HER is a method wrapper that works with off-policy methods (for example DQN, SAC and DDPG).
Note
HER was re-implemented from scratch in Stable-Baselines rather than ported from the original OpenAI Baselines. If you want to reproduce the results from the paper, please use the RL Baselines Zoo to get the correct hyperparameters, and use at least 8 MPI workers with DDPG.
Warning
HER requires the environment to inherit from gym.GoalEnv.
Warning
You must pass an environment, or wrap it with HERGoalEnvWrapper, in order to use the predict method.
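To make the gym.GoalEnv requirement concrete: a goal-based environment returns a dict observation with the keys observation, achieved_goal and desired_goal, and exposes compute_reward(achieved_goal, desired_goal, info) so that rewards can be recomputed for a substituted goal. The sketch below mimics that interface in plain Python without importing gym; the class name TinyBitFlipEnv and its internals are hypothetical and only meant as an illustration.

```python
import random


class TinyBitFlipEnv:
    """Hypothetical stand-in that mimics the gym.GoalEnv interface (no gym import)."""

    def __init__(self, n_bits=4, seed=0):
        self.n_bits = n_bits
        self.rng = random.Random(seed)
        self.state = [0] * n_bits
        self.goal = [1] * n_bits

    def _get_obs(self):
        # GoalEnv-style observation: a dict with three fixed keys.
        return {
            "observation": list(self.state),
            "achieved_goal": list(self.state),
            "desired_goal": list(self.goal),
        }

    def compute_reward(self, achieved_goal, desired_goal, info):
        # Sparse reward (assumed for this sketch): 0 on success, -1 otherwise.
        return 0.0 if achieved_goal == desired_goal else -1.0

    def reset(self):
        self.state = [self.rng.randint(0, 1) for _ in range(self.n_bits)]
        self.goal = [self.rng.randint(0, 1) for _ in range(self.n_bits)]
        return self._get_obs()

    def step(self, action):
        # The action is the index of the bit to flip.
        self.state[action] = 1 - self.state[action]
        obs = self._get_obs()
        reward = self.compute_reward(obs["achieved_goal"], obs["desired_goal"], {})
        done = reward == 0.0
        return obs, reward, done, {}


env = TinyBitFlipEnv(n_bits=4)
obs = env.reset()
obs, reward, done, _ = env.step(0)
print(sorted(obs.keys()), reward, done)
```

Because compute_reward takes the goals as arguments instead of reading them from the environment state, the same function can score a transition against any goal, which is what hindsight relabeling relies on.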
Notes¶
- Original paper: https://arxiv.org/abs/1707.01495
- OpenAI paper: Plappert et al. (2018)
- OpenAI blog post: https://openai.com/blog/ingredients-for-robotics-research/
Can I use?¶
Please refer to the wrapped model (DQN, SAC or DDPG) for that section.
Example¶
from stable_baselines import HER, DQN, SAC, DDPG
from stable_baselines.her import GoalSelectionStrategy, HERGoalEnvWrapper
from stable_baselines.common.bit_flipping_env import BitFlippingEnv
N_BITS = 15  # example value (assumption): number of bits in the bit-flipping task
model_class = DQN  # also works with SAC and DDPG
env = BitFlippingEnv(N_BITS, continuous=model_class in [DDPG, SAC], max_steps=N_BITS)
# Available strategies (cf paper): future, final, episode, random
goal_selection_strategy = 'future' # equivalent to GoalSelectionStrategy.FUTURE
# Wrap the model
model = HER('MlpPolicy', env, model_class, n_sampled_goal=4,
            goal_selection_strategy=goal_selection_strategy, verbose=1)
# Train the model
model.learn(1000)
model.save("./her_bit_env")
# WARNING: you must pass an env
# or wrap your environment with HERGoalEnvWrapper to use the predict method
model = HER.load('./her_bit_env', env=env)
obs = env.reset()
for _ in range(100):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    if done:
        obs = env.reset()
Parameters¶
class stable_baselines.her.HER(policy, env, model_class, n_sampled_goal=4, goal_selection_strategy='future', *args, **kwargs)
Hindsight Experience Replay (HER) https://arxiv.org/abs/1707.01495
Parameters:
- policy – (BasePolicy or str) The policy model to use (MlpPolicy, CnnPolicy, CnnLstmPolicy, …)
- env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
- model_class – (OffPolicyRLModel) The off-policy RL model to apply Hindsight Experience Replay to. Currently supported: DQN, DDPG, SAC
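- n_sampled_goal – (int) The number of artificial transitions to generate for each actual transition
- goal_selection_strategy – (GoalSelectionStrategy or str) The method that will be used to generate the goals for the artificial transitions (future, final, episode or random)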
action_probability(observation, state=None, mask=None, actions=None)
If actions is None, then get the model’s action probability distribution from a given observation. Depending on the action space, the output is:
- Discrete: probability for each possible action
- Box: mean and standard deviation of the action output
However, if actions is not None, this function will return the probability that the given actions are taken with the given parameters (observation, state, …) on this model.
Warning
When working with a continuous probability distribution (e.g. a Gaussian distribution for continuous actions), the probability of taking a particular action is exactly zero. See http://blog.christianperone.com/2019/01/ for a good explanation.
Parameters:
- observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- actions – (np.ndarray) (OPTIONAL) For calculating the likelihood that the given actions are chosen by the model for each of the given parameters. Must have the same number of actions and observations. (set to None to return the complete action probability distribution)
Returns: (np.ndarray) the model’s action probability
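The warning above about continuous distributions can be checked numerically. The following is a small sketch using only the standard library, assuming a Gaussian action distribution described by a mean and a standard deviation (as in the Box case); the helper names are hypothetical and are not part of Stable-Baselines.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x (a density, not a probability)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gaussian_cdf(x, mean, std):
    """Probability that a normal sample is less than or equal to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def interval_probability(low, high, mean, std):
    """Probability that a normal sample falls in [low, high]."""
    return gaussian_cdf(high, mean, std) - gaussian_cdf(low, mean, std)


mean, std = 0.0, 0.1
# The density at the mean exceeds 1 here, so it cannot be read as a probability.
density_at_mean = gaussian_pdf(0.0, mean, std)
# A single point is a zero-width interval, so its probability is zero.
point_probability = interval_probability(0.0, 0.0, mean, std)
# Only an interval of actions carries positive probability.
band_probability = interval_probability(-0.1, 0.1, mean, std)
print(density_at_mean, point_probability, band_probability)
```

For continuous actions it is therefore more meaningful to compare densities or interval probabilities than to ask for the probability of one exact action.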
get_env()
Returns the current environment (can be None if not defined)
Returns: (Gym Environment) The current environment
get_parameter_list()
Get tensorflow Variables of model’s parameters
This includes all variables necessary for continuing training (saving / loading).
Returns: (list) List of tensorflow Variables
learn(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='HER', reset_num_timesteps=True)
Return a trained model.
Parameters:
- total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict) -> boolean) function called at every step with the state of the algorithm. It takes the local and global variables. If it returns False, training is aborted.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
- reset_num_timesteps – (bool) whether or not to reset the current timestep number (used in logging)
Returns: (BaseRLModel) the trained model
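The callback contract described above (called at every step with the local and global variables, training aborted when it returns False) can be sketched without any RL library. In the example below, fake_learn is a hypothetical stand-in for a training loop and make_stop_after is an illustrative helper; neither is part of Stable-Baselines.

```python
def make_stop_after(max_calls):
    """Build a callback that returns False once it has been called max_calls times."""
    calls = {"n": 0}

    def callback(local_vars, global_vars):
        calls["n"] += 1
        # Returning False asks the training loop to abort.
        return calls["n"] < max_calls

    return callback


def fake_learn(total_timesteps, callback=None):
    """Hypothetical stand-in for a training loop; it only counts steps."""
    steps_done = 0
    for step in range(total_timesteps):
        steps_done += 1
        if callback is not None and callback(locals(), globals()) is False:
            break
    return steps_done


steps = fake_learn(1000, callback=make_stop_after(10))
print(steps)
```

Under the signature documented here, the same callback object would presumably be passed as model.learn(1000, callback=make_stop_after(10)).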
classmethod load(load_path, env=None, **kwargs)
Load the model from file
Parameters:
- load_path – (str or file-like) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
predict(observation, state=None, mask=None, deterministic=True)
Get the model’s action from an observation
Parameters:
- observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
save(save_path)
Save the current parameters to file
Parameters: save_path – (str or file-like object) the save location
Goal Selection Strategies¶
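The four strategies named in the example above (future, final, episode, random) differ only in where the substitute goal is drawn from. The sketch below follows the descriptions in the original paper and is not the Stable-Baselines implementation; the function sample_goal and the list-based data layout are assumptions made for illustration.

```python
import random


def sample_goal(strategy, episode, t, replay, rng):
    """Pick a substitute goal for the state at index t of `episode`.

    `episode` is the list of goals achieved during one episode and `replay`
    is the list of all goals achieved so far (hypothetical layout).
    """
    if strategy == "future":
        # A goal achieved later in the same episode (requires t < len(episode) - 1).
        return episode[rng.randrange(t + 1, len(episode))]
    if strategy == "final":
        # The goal achieved at the end of the episode.
        return episode[-1]
    if strategy == "episode":
        # Any goal achieved during the same episode.
        return rng.choice(episode)
    if strategy == "random":
        # Any goal achieved so far during training.
        return rng.choice(replay)
    raise ValueError("unknown strategy: {}".format(strategy))


rng = random.Random(0)
episode = ["g0", "g1", "g2", "g3"]
replay = episode + ["h0", "h1"]
future_goal = sample_goal("future", episode, 1, replay, rng)
final_goal = sample_goal("final", episode, 1, replay, rng)
print(future_goal, final_goal)
```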
Goal Env Wrapper¶
class stable_baselines.her.HERGoalEnvWrapper(env)
A wrapper that allows using a dict observation space (coming from GoalEnv) with the RL algorithms. It assumes that all the spaces of the dict space are of the same type.
Parameters: env – (gym.GoalEnv)
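The idea behind such a wrapper can be sketched as a pair of conversions between the dict observation and a single flat vector that a standard policy can consume. The key order and the helper names below are assumptions for illustration; they are not taken from the Stable-Baselines source.

```python
KEY_ORDER = ("observation", "achieved_goal", "desired_goal")  # assumed ordering


def dict_to_flat(obs_dict):
    """Concatenate the three parts of a goal observation into one flat list."""
    flat = []
    for key in KEY_ORDER:
        flat.extend(obs_dict[key])
    return flat


def flat_to_dict(flat, obs_dim, goal_dim):
    """Split a flat observation back into its three parts."""
    return {
        "observation": flat[:obs_dim],
        "achieved_goal": flat[obs_dim:obs_dim + goal_dim],
        "desired_goal": flat[obs_dim + goal_dim:obs_dim + 2 * goal_dim],
    }


obs = {"observation": [0, 1, 1], "achieved_goal": [0, 1], "desired_goal": [1, 1]}
flat = dict_to_flat(obs)
restored = flat_to_dict(flat, obs_dim=3, goal_dim=2)
print(flat)
```

Concatenation only makes sense when the three parts share a type, which is consistent with the assumption stated above that all spaces of the dict space are of the same type.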
Replay Wrapper¶
class stable_baselines.her.HindsightExperienceReplayWrapper(replay_buffer, n_sampled_goal, goal_selection_strategy, wrapped_env)
Wrapper around a replay buffer in order to use HER. This implementation is inspired by the one found in https://github.com/NervanaSystems/coach/.
Parameters:
- replay_buffer – (ReplayBuffer)
- n_sampled_goal – (int) The number of artificial transitions to generate for each actual transition
- goal_selection_strategy – (GoalSelectionStrategy) The method that will be used to generate the goals for the artificial transitions.
- wrapped_env – (HERGoalEnvWrapper) the GoalEnv wrapped using HERGoalEnvWrapper, which makes it possible to convert observations to dicts and back
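The role of n_sampled_goal can be illustrated with a simplified relabeling loop: every actual transition is stored once, followed by n_sampled_goal artificial copies whose desired goal is replaced and whose reward is recomputed. This is a sketch under stated assumptions (a plain list as buffer, dict transitions with a next_achieved_goal field, a sparse reward, and a "future"-style goal choice), not the wrapper's actual code.

```python
import random


def reward_fn(achieved_goal, desired_goal):
    """Sparse reward (assumed): 0 when the goal is reached, -1 otherwise."""
    return 0.0 if achieved_goal == desired_goal else -1.0


def store_episode(buffer, episode, n_sampled_goal, rng):
    """Append each real transition plus n_sampled_goal relabeled copies.

    Each transition is a dict with the keys obs, action, next_achieved_goal and
    desired_goal (a hypothetical layout chosen for this sketch).
    """
    for t, tr in enumerate(episode):
        # 1) Store the transition as it actually happened.
        real = dict(tr, reward=reward_fn(tr["next_achieved_goal"], tr["desired_goal"]))
        buffer.append(real)
        # 2) Store artificial transitions with goals achieved at this step or later.
        for _ in range(n_sampled_goal):
            k = rng.randrange(t, len(episode))
            new_goal = episode[k]["next_achieved_goal"]
            fake = dict(tr, desired_goal=new_goal,
                        reward=reward_fn(tr["next_achieved_goal"], new_goal))
            buffer.append(fake)
    return buffer


episode = [
    {"obs": 0, "action": 1, "next_achieved_goal": (1,), "desired_goal": (9,)},
    {"obs": 1, "action": 1, "next_achieved_goal": (2,), "desired_goal": (9,)},
    {"obs": 2, "action": 1, "next_achieved_goal": (3,), "desired_goal": (9,)},
]
buffer = store_episode([], episode, n_sampled_goal=4, rng=random.Random(0))
print(len(buffer))
```

In this toy episode the real goal (9,) is never reached, so every actual transition has reward -1, while some relabeled transitions receive reward 0. That extra learning signal is the point of the technique.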