HER¶
Hindsight Experience Replay (HER)
HER is a method wrapper that works with off-policy methods (for example DQN, SAC, TD3 and DDPG).
Note
HER was re-implemented from scratch in Stable-Baselines compared to the original OpenAI baselines. If you want to reproduce results from the paper, please use the RL Baselines Zoo to get the correct hyperparameters, and use at least 8 MPI workers with DDPG.
Warning
HER requires the environment to inherit from gym.GoalEnv.
Warning
To use the predict method, you must either pass an environment when loading the model or wrap your environment with HERGoalEnvWrapper.
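To make the gym.GoalEnv requirement concrete, here is a minimal sketch of the interface such an environment exposes: dict observations with the keys observation, achieved_goal and desired_goal, plus a compute_reward method. The class name ToyBitFlipGoalEnv and its internals are hypothetical; it does not inherit from gym.GoalEnv and is not the library's BitFlippingEnv.

```python
import random


class ToyBitFlipGoalEnv:
    """Hypothetical stand-in that mimics the GoalEnv interface (no gym dependency)."""

    def __init__(self, n_bits=4, seed=0):
        self.n_bits = n_bits
        self.rng = random.Random(seed)
        self.state = [0] * n_bits
        self.goal = [1] * n_bits

    def _get_obs(self):
        # Goal-based observations are dicts with these three keys.
        return {
            "observation": list(self.state),
            "achieved_goal": list(self.state),
            "desired_goal": list(self.goal),
        }

    def compute_reward(self, achieved_goal, desired_goal, info=None):
        # Sparse reward: 0 when the goal is reached, -1 otherwise.
        return 0.0 if achieved_goal == desired_goal else -1.0

    def reset(self):
        self.state = [self.rng.randint(0, 1) for _ in range(self.n_bits)]
        self.goal = [self.rng.randint(0, 1) for _ in range(self.n_bits)]
        return self._get_obs()

    def step(self, action):
        # The action is the index of the bit to flip.
        self.state[action] = 1 - self.state[action]
        obs = self._get_obs()
        reward = self.compute_reward(obs["achieved_goal"], obs["desired_goal"])
        done = reward == 0.0
        return obs, reward, done, {}


env = ToyBitFlipGoalEnv(n_bits=4)
obs = env.reset()
```

Because compute_reward takes the goal as an explicit argument, the reward of a stored transition can be recomputed for a different goal, which is what HER relies on.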
Notes¶
- Original paper: https://arxiv.org/abs/1707.01495
- OpenAI paper: Plappert et al. (2018)
- OpenAI blog post: https://openai.com/blog/ingredients-for-robotics-research/
Can I use?¶
Please refer to the wrapped model (DQN, SAC, TD3 or DDPG) for that section.
Example¶
from stable_baselines import HER, DQN, SAC, DDPG, TD3
from stable_baselines.her import GoalSelectionStrategy, HERGoalEnvWrapper
from stable_baselines.common.bit_flipping_env import BitFlippingEnv

# Number of bits in the environment (not defined in the original snippet; 15 is an illustrative value)
N_BITS = 15

model_class = DQN  # works also with SAC, DDPG and TD3

# Continuous actions are only needed for DDPG, SAC and TD3
env = BitFlippingEnv(N_BITS, continuous=model_class in [DDPG, SAC, TD3], max_steps=N_BITS)

# Available strategies (cf paper): future, final, episode, random
goal_selection_strategy = 'future'  # equivalent to GoalSelectionStrategy.FUTURE

# Wrap the model
model = HER('MlpPolicy', env, model_class, n_sampled_goal=4,
            goal_selection_strategy=goal_selection_strategy, verbose=1)
# Train the model
model.learn(1000)

model.save("./her_bit_env")

# WARNING: you must pass an env
# or wrap your environment with HERGoalEnvWrapper to use the predict method
model = HER.load('./her_bit_env', env=env)

obs = env.reset()
for _ in range(100):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)

    if done:
        obs = env.reset()
Parameters¶
class stable_baselines.her.HER(policy, env, model_class, n_sampled_goal=4, goal_selection_strategy='future', *args, **kwargs)
Hindsight Experience Replay (HER) https://arxiv.org/abs/1707.01495
Parameters:
- policy – (BasePolicy or str) The policy model to use (MlpPolicy, CnnPolicy, CnnLstmPolicy, …)
- env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
- model_class – (OffPolicyRLModel) The off-policy RL model to apply Hindsight Experience Replay to. Currently supported: DQN, DDPG, SAC, TD3
- n_sampled_goal – (int) The number of artificial transitions to generate for each actual transition
- goal_selection_strategy – (GoalSelectionStrategy or str) The method that will be used to generate the goals for the artificial transitions
action_probability(observation, state=None, mask=None, actions=None, logp=False)
If actions is None, then get the model’s action probability distribution from a given observation.
Depending on the action space the output is:
- Discrete: probability for each possible action
- Box: mean and standard deviation of the action output
However, if actions is not None, this function will return the probability that the given actions are taken with the given parameters (observation, state, …) on this model. For discrete action spaces, it returns the probability mass; for continuous action spaces, the probability density. This is because the probability mass is always zero in continuous spaces; see http://blog.christianperone.com/2019/01/ for a good explanation.
Parameters:
- observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- actions – (np.ndarray) (OPTIONAL) For calculating the likelihood that the given actions are chosen by the model for each of the given parameters. Must have the same number of actions and observations. (set to None to return the complete action probability distribution)
- logp – (bool) (OPTIONAL) When specified with actions, returns probability in log-space. This has no effect if actions is None.
Returns: (np.ndarray) the model’s (log) action probability
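The distinction between probability mass and probability density can be illustrated with a small standalone sketch. It assumes a softmax over logits for the discrete case and a one-dimensional Gaussian for the continuous case; the helper names softmax and gaussian_density are illustrative and are not part of the library, which may compute these values differently.

```python
import math


def softmax(logits):
    # Probability mass for each discrete action; the values sum to 1.
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_density(action, mean, std):
    # Probability density of a 1-D Gaussian action distribution.
    z = (action - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Discrete case: each entry is a probability between 0 and 1.
probs = softmax([1.0, 2.0, 0.5])

# Continuous case: a density can exceed 1 when the distribution is narrow.
density = gaussian_density(0.0, mean=0.0, std=0.1)

# Log-space value, analogous to what logp=True requests.
log_density = math.log(density)
```

A density above 1 is not an error: only its integral over an interval is a probability, which is why the two action space types are reported differently.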
get_env()
Returns the current environment (can be None if not defined)
Returns: (Gym Environment) The current environment
get_parameter_list()
Get tensorflow Variables of model’s parameters
This includes all variables necessary for continuing training (saving / loading).
Returns: (list) List of tensorflow Variables
learn(total_timesteps, callback=None, log_interval=100, tb_log_name='HER', reset_num_timesteps=True)
Return a trained model.
Parameters:
- total_timesteps – (int) The total number of samples to train on
- callback – (function (dict, dict) -> bool) Function called at every step with the state of the algorithm. It takes the local and global variables. If it returns False, training is aborted.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
- reset_num_timesteps – (bool) whether or not to reset the current timestep number (used in logging)
Returns: (BaseRLModel) the trained model
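The callback contract described above (called with the local and global variables, returning False to abort) can be sketched as follows. The function toy_learn is a hypothetical stand-in for a training loop, used only so the example runs without the library; make_stop_after is likewise an illustrative helper.

```python
def make_stop_after(max_calls):
    """Build a callback that aborts training after max_calls invocations."""
    state = {"calls": 0}

    def callback(_locals, _globals):
        state["calls"] += 1
        # Returning False asks the training loop to stop.
        return state["calls"] < max_calls

    return callback


def toy_learn(total_timesteps, callback=None):
    """Hypothetical stand-in for a training loop; not the library's learn()."""
    steps_done = 0
    for _ in range(total_timesteps):
        steps_done += 1
        # The callback receives the local and global variables as dicts.
        if callback is not None and callback(locals(), globals()) is False:
            break
    return steps_done


steps = toy_learn(1000, callback=make_stop_after(10))
```

With a real model, a callback of the same shape would be passed through the callback argument of learn.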
classmethod load(load_path, env=None, custom_objects=None, **kwargs)
Load the model from file
Parameters:
- load_path – (str or file-like) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- custom_objects – (dict) Dictionary of objects to replace upon loading. If a variable is present in this dictionary as a key, it will not be deserialized and the corresponding item will be used instead. Similar to custom_objects in keras.models.load_model. Useful when you have an object in file that can not be deserialized.
- kwargs – extra arguments to change the model when loading
predict(observation, state=None, mask=None, deterministic=True)
Get the model’s action from an observation
Parameters:
- observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
save(save_path, cloudpickle=False)
Save the current parameters to file
Parameters:
- save_path – (str or file-like) The save location
- cloudpickle – (bool) Use older cloudpickle format instead of zip-archives.
Goal Selection Strategies¶
Goal Env Wrapper¶
class stable_baselines.her.HERGoalEnvWrapper(env)
A wrapper that allows using a dict observation space (coming from GoalEnv) with the RL algorithms. It assumes that all the spaces of the dict space are of the same type.
Parameters: env – (gym.GoalEnv)
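The idea behind such a wrapper can be sketched as a conversion between the dict observation and a single flat vector. This is an illustration under stated assumptions, not the wrapper's actual code: the key ordering in KEY_ORDER and the helper names dict_to_flat and flat_to_dict are hypothetical.

```python
# Assumed concatenation order for this sketch.
KEY_ORDER = ["observation", "achieved_goal", "desired_goal"]


def dict_to_flat(obs_dict):
    """Concatenate the three parts of a goal-based observation."""
    flat = []
    for key in KEY_ORDER:
        flat.extend(obs_dict[key])
    return flat


def flat_to_dict(flat, obs_dim, goal_dim):
    """Split a flat vector back into the three named parts."""
    return {
        "observation": flat[:obs_dim],
        "achieved_goal": flat[obs_dim:obs_dim + goal_dim],
        "desired_goal": flat[obs_dim + goal_dim:obs_dim + 2 * goal_dim],
    }


obs_dict = {
    "observation": [0, 1, 1],
    "achieved_goal": [0, 1, 1],
    "desired_goal": [1, 1, 0],
}
flat = dict_to_flat(obs_dict)
restored = flat_to_dict(flat, obs_dim=3, goal_dim=3)
```

Concatenation into one vector only makes sense when all parts share the same space type, which matches the assumption stated in the wrapper's description.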
Replay Wrapper¶
class stable_baselines.her.HindsightExperienceReplayWrapper(replay_buffer, n_sampled_goal, goal_selection_strategy, wrapped_env)
Wrapper around a replay buffer in order to use HER. This implementation is inspired by the one found in https://github.com/NervanaSystems/coach/.
Parameters:
- replay_buffer – (ReplayBuffer)
- n_sampled_goal – (int) The number of artificial transitions to generate for each actual transition
- goal_selection_strategy – (GoalSelectionStrategy) The method that will be used to generate the goals for the artificial transitions.
- wrapped_env – (HERGoalEnvWrapper) the GoalEnv wrapped using HERGoalEnvWrapper, which enables converting observations to dicts and vice versa
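How n_sampled_goal and goal_selection_strategy interact can be shown with a simplified relabeling sketch. It is written from the description in the HER paper and is not the wrapper's implementation: the transition layout, the sparse_reward function and relabel_episode are hypothetical, and the random strategy (which samples goals from the whole buffer) is left out.

```python
import random


def sparse_reward(achieved_goal, desired_goal):
    # 0 when the goal is reached, -1 otherwise.
    return 0.0 if achieved_goal == desired_goal else -1.0


def relabel_episode(episode, n_sampled_goal=4, strategy="future", rng=None):
    """Create artificial transitions by substituting goals achieved in the episode."""
    rng = rng or random.Random(0)
    artificial = []
    for t, transition in enumerate(episode):
        if strategy == "future":
            candidates = episode[t + 1:]  # states observed after this transition
        elif strategy == "final":
            candidates = [episode[-1]]  # last state of the episode
        elif strategy == "episode":
            candidates = episode  # any state of the episode
        else:
            raise ValueError("unsupported strategy in this sketch: " + strategy)
        if not candidates:
            continue  # the last transition has no future state
        for _ in range(n_sampled_goal):
            new_goal = rng.choice(candidates)["achieved_goal"]
            relabeled = dict(transition, desired_goal=new_goal)
            # The reward is recomputed for the substituted goal.
            relabeled["reward"] = sparse_reward(transition["achieved_goal"], new_goal)
            artificial.append(relabeled)
    return artificial


episode = [
    {"action": 0, "achieved_goal": [1, 0], "desired_goal": [1, 1], "reward": -1.0},
    {"action": 1, "achieved_goal": [1, 1], "desired_goal": [1, 1], "reward": 0.0},
    {"action": 0, "achieved_goal": [0, 1], "desired_goal": [1, 1], "reward": -1.0},
]
extra = relabel_episode(episode, n_sampled_goal=2, strategy="future")
```

Each real transition yields up to n_sampled_goal artificial ones, so the replay buffer receives extra transitions whose goals were actually reached during the episode.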