HER
Hindsight Experience Replay (HER)
HER is a method wrapper that works with off-policy methods (DQN, SAC, TD3 and DDPG for example).
Note
HER was reimplemented from scratch in Stable Baselines compared to the original OpenAI Baselines. If you want to reproduce results from the paper, please use the RL Baselines Zoo in order to have the correct hyperparameters, and use at least 8 MPI workers with DDPG.
Warning
HER requires the environment to inherit from gym.GoalEnv.
Warning
You must pass an environment or wrap it with HERGoalEnvWrapper in order to use the predict method.
Notes
 Original paper: https://arxiv.org/abs/1707.01495
 OpenAI paper: Plappert et al. (2018)
 OpenAI blog post: https://openai.com/blog/ingredients-for-robotics-research/
Can I use?
Please refer to the wrapped model (DQN, SAC, TD3 or DDPG) for that section.
Example
from stable_baselines import HER, DQN, SAC, DDPG, TD3
from stable_baselines.her import GoalSelectionStrategy, HERGoalEnvWrapper
from stable_baselines.common.bit_flipping_env import BitFlippingEnv
N_BITS = 15  # number of bits to flip (example value)
model_class = DQN  # works also with SAC, DDPG and TD3
env = BitFlippingEnv(N_BITS, continuous=model_class in [DDPG, SAC, TD3], max_steps=N_BITS)
# Available strategies (cf paper): future, final, episode, random
goal_selection_strategy = 'future' # equivalent to GoalSelectionStrategy.FUTURE
# Wrap the model
model = HER('MlpPolicy', env, model_class, n_sampled_goal=4, goal_selection_strategy=goal_selection_strategy,
            verbose=1)
# Train the model
model.learn(1000)
model.save("./her_bit_env")
# WARNING: you must pass an env
# or wrap your environment with HERGoalEnvWrapper to use the predict method
model = HER.load('./her_bit_env', env=env)
obs = env.reset()
for _ in range(100):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    if done:
        obs = env.reset()
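The example above depends on the environment following the gym.GoalEnv conventions: observations are dicts with the keys observation, achieved_goal and desired_goal, and the environment exposes compute_reward(achieved_goal, desired_goal, info). The sketch below illustrates that interface in plain Python. TinyBitFlipEnv is a hypothetical stand-in written for this illustration; it is neither BitFlippingEnv nor a gym.GoalEnv subclass, and the sparse 0 / -1 reward is an assumption.

```python
import random


class TinyBitFlipEnv:
    """Hypothetical goal-based environment (illustration only, not gym.GoalEnv)."""

    def __init__(self, n_bits=4, seed=0):
        self.n_bits = n_bits
        self.rng = random.Random(seed)
        self.desired_goal = [1] * n_bits  # target: every bit set to 1
        self.state = [0] * n_bits

    def _get_obs(self):
        # Goal-based observations are dicts with these three keys.
        return {
            "observation": list(self.state),
            "achieved_goal": list(self.state),
            "desired_goal": list(self.desired_goal),
        }

    def compute_reward(self, achieved_goal, desired_goal, info):
        # Assumed sparse reward: 0 when the goal is reached, -1 otherwise.
        return 0.0 if achieved_goal == desired_goal else -1.0

    def reset(self):
        self.state = [self.rng.randint(0, 1) for _ in range(self.n_bits)]
        return self._get_obs()

    def step(self, action):
        # The action is the index of the bit to flip.
        self.state[action] = 1 - self.state[action]
        obs = self._get_obs()
        reward = self.compute_reward(obs["achieved_goal"], obs["desired_goal"], None)
        done = reward == 0.0
        return obs, reward, done, {"is_success": done}
```

Because compute_reward takes the goals as arguments, the reward can be recomputed for a goal other than the one the agent was pursuing, which is what hindsight relabeling needs.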
Parameters

class stable_baselines.her.HER(policy, env, model_class, n_sampled_goal=4, goal_selection_strategy='future', *args, **kwargs)
Hindsight Experience Replay (HER) https://arxiv.org/abs/1707.01495
Parameters:  policy – (BasePolicy or str) The policy model to use (MlpPolicy, CnnPolicy, CnnLstmPolicy, …)
 env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
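 model_class – (OffPolicyRLModel) The off-policy RL model to apply Hindsight Experience Replay to. Currently supported: DQN, DDPG, SAC, TD3
 n_sampled_goal – (int) The number of artificial transitions to generate for each actual transition
 goal_selection_strategy – (GoalSelectionStrategy or str) The method used to generate the goals for the artificial transitions (future, final, episode or random)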

action_probability(observation, state=None, mask=None, actions=None, logp=False)
If actions is None, then get the model’s action probability distribution from a given observation. Depending on the action space the output is:
 Discrete: probability for each possible action
 Box: mean and standard deviation of the action output
However, if actions is not None, this function will return the probability that the given actions are taken with the given parameters (observation, state, …) on this model. For discrete action spaces, it returns the probability mass; for continuous action spaces, the probability density. This is because the probability mass will always be zero in continuous spaces, see http://blog.christianperone.com/2019/01/ for a good explanation.
Parameters:  observation – (np.ndarray) the input observation
 state – (np.ndarray) The last states (can be None, used in recurrent policies)
 mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
 actions – (np.ndarray) (OPTIONAL) For calculating the likelihood that the given actions are chosen by the model for each of the given parameters. Must have the same number of actions and observations. (set to None to return the complete action probability distribution)
 logp – (bool) (OPTIONAL) When specified with actions, returns probability in logspace. This has no effect if actions is None.
Returns: (np.ndarray) the model’s (log) action probability

get_env()
Returns the current environment (can be None if not defined)
Returns: (Gym Environment) The current environment

get_parameter_list()
Get tensorflow Variables of model’s parameters
This includes all variables necessary for continuing training (saving / loading).
Returns: (list) List of tensorflow Variables

learn(total_timesteps, callback=None, log_interval=100, tb_log_name='HER', reset_num_timesteps=True)
Return a trained model.
Parameters:  total_timesteps – (int) The total number of samples to train on
 callback – (Union[callable, [callable], BaseCallback]) function called at every step with the state of the algorithm. It takes the local and global variables. If it returns False, training is aborted. When the callback inherits from BaseCallback, you will have access to additional stages of the training (training start/end); please read the documentation for more details.
 log_interval – (int) The number of timesteps before logging.
 tb_log_name – (str) the name of the run for tensorboard log
 reset_num_timesteps – (bool) whether or not to reset the current timestep number (used in logging)
Returns: (BaseRLModel) the trained model

classmethod load(load_path, env=None, custom_objects=None, **kwargs)
Load the model from file
Parameters:  load_path – (str or filelike) the saved parameter location
 env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
 custom_objects – (dict) Dictionary of objects to replace upon loading. If a variable is present in this dictionary as a key, it will not be deserialized and the corresponding item will be used instead. Similar to custom_objects in keras.models.load_model. Useful when you have an object in file that can not be deserialized.
 kwargs – extra arguments to change the model when loading

predict(observation, state=None, mask=None, deterministic=True)
Get the model’s action from an observation
Parameters:  observation – (np.ndarray) the input observation
 state – (np.ndarray) The last states (can be None, used in recurrent policies)
 mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
 deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)

save(save_path, cloudpickle=False)
Save the current parameters to file
Parameters:  save_path – (str or filelike) The save location
 cloudpickle – (bool) Use the older cloudpickle format instead of zip archives.
Goal Selection Strategies
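The four strategies named in the example (future, final, episode, random) differ only in where the substitute goal is drawn from. The sketch below follows their description in the HER paper. The Strategy enum and the sample_goal function are hypothetical names used for illustration and are not the library's GoalSelectionStrategy implementation.

```python
import random
from enum import Enum


class Strategy(Enum):
    """Hypothetical stand-in for GoalSelectionStrategy."""
    FUTURE = "future"
    FINAL = "final"
    EPISODE = "episode"
    RANDOM = "random"


def sample_goal(strategy, episode_goals, t, all_goals, rng):
    """Pick a substitute goal for the transition at step t.

    episode_goals[i] is the goal achieved after step i of the current episode;
    all_goals holds achieved goals from every episode seen so far.
    """
    if strategy is Strategy.FUTURE:
        # A goal achieved at step t or later in the same episode.
        return episode_goals[rng.randint(t, len(episode_goals) - 1)]
    if strategy is Strategy.FINAL:
        # The goal achieved at the end of the episode.
        return episode_goals[-1]
    if strategy is Strategy.EPISODE:
        # Any goal achieved during the same episode.
        return rng.choice(episode_goals)
    if strategy is Strategy.RANDOM:
        # Any goal achieved so far during training.
        return rng.choice(all_goals)
    raise ValueError("Unknown strategy: {}".format(strategy))
```

In this sketch a string such as 'future' maps to an enum member with Strategy('future').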
Goal Env Wrapper

class stable_baselines.her.HERGoalEnvWrapper(env)
A wrapper that allows a dict observation space (coming from GoalEnv) to be used with the RL algorithms. It assumes that all the spaces of the dict space are of the same type.
Parameters: env – (gym.GoalEnv)
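The wrapped algorithms expect a flat observation, while a GoalEnv returns a dict, so the wrapper has to convert between the two. A minimal sketch of that idea in plain Python lists follows. The function names, the key order and the explicit obs_dim / goal_dim arguments are assumptions made for illustration, not the wrapper's actual API.

```python
# Assumed concatenation order of the dict entries.
KEY_ORDER = ("observation", "achieved_goal", "desired_goal")


def dict_to_flat(obs_dict):
    """Concatenate the three dict entries into one flat list."""
    flat = []
    for key in KEY_ORDER:
        flat.extend(obs_dict[key])
    return flat


def flat_to_dict(flat, obs_dim, goal_dim):
    """Split a flat observation back into the three dict entries."""
    return {
        "observation": flat[:obs_dim],
        "achieved_goal": flat[obs_dim:obs_dim + goal_dim],
        "desired_goal": flat[obs_dim + goal_dim:obs_dim + 2 * goal_dim],
    }
```

The round trip only works when every entry can live in the same flat container, which matches the stated assumption that all spaces in the dict are of the same type.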
Replay Wrapper

class stable_baselines.her.HindsightExperienceReplayWrapper(replay_buffer, n_sampled_goal, goal_selection_strategy, wrapped_env)
Wrapper around a replay buffer in order to use HER. This implementation is inspired by the one found in https://github.com/NervanaSystems/coach/.
Parameters:  replay_buffer – (ReplayBuffer)
 n_sampled_goal – (int) The number of artificial transitions to generate for each actual transition
 goal_selection_strategy – (GoalSelectionStrategy) The method that will be used to generate the goals for the artificial transitions.
 wrapped_env – (HERGoalEnvWrapper) the GoalEnv wrapped using HERGoalEnvWrapper, which can convert an observation to a dict and vice versa

add(obs_t, action, reward, obs_tp1, done, info)
Add a new transition to the buffer
Parameters:  obs_t – (np.ndarray) the last observation
 action – ([float]) the action
 reward – (float) the reward of the transition
 obs_tp1 – (np.ndarray) the new observation
 done – (bool) is the episode done
 info – (dict) extra values used to compute reward
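For each stored transition, the wrapper generates n_sampled_goal artificial transitions in which the desired goal is replaced and the reward is recomputed. The sketch below illustrates this for the 'future' strategy on dict observations. relabel_future and sparse_reward are hypothetical helpers and the 0 / -1 reward is an assumption; the library's own code may differ in details such as which future steps are eligible.

```python
import random


def sparse_reward(achieved_goal, desired_goal, info=None):
    # Assumed sparse reward: 0 when the goal is reached, -1 otherwise.
    return 0.0 if achieved_goal == desired_goal else -1.0


def relabel_future(episode, n_sampled_goal, compute_reward, rng):
    """Create artificial transitions for one finished episode.

    episode is a list of (obs, action, reward, next_obs) tuples whose
    observations are dicts with achieved_goal and desired_goal keys.
    """
    artificial = []
    last = len(episode) - 1
    for t, (obs, action, _reward, next_obs) in enumerate(episode):
        for _ in range(n_sampled_goal):
            # 'future': reuse a goal achieved at step t or later in the episode.
            goal = episode[rng.randint(t, last)][3]["achieved_goal"]
            new_obs = dict(obs, desired_goal=goal)
            new_next_obs = dict(next_obs, desired_goal=goal)
            # Recompute the reward as if `goal` had been the target all along.
            reward = compute_reward(new_next_obs["achieved_goal"], goal, None)
            artificial.append((new_obs, action, reward, new_next_obs))
    return artificial


def make_obs(state, goal):
    return {"observation": state, "achieved_goal": state, "desired_goal": goal}


# A three-step episode that never reaches the original target.
states = [[0, 0], [1, 0], [1, 1], [0, 1]]
target = [0, 0]
episode = [
    (make_obs(states[i], target), i % 2,
     sparse_reward(states[i + 1], target), make_obs(states[i + 1], target))
    for i in range(3)
]
extra = relabel_future(episode, n_sampled_goal=4,
                       compute_reward=sparse_reward, rng=random.Random(0))
```

Every real transition in this episode has reward -1, yet some of the relabeled ones have reward 0, which gives the off-policy learner a useful signal even when the original goal is never reached.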