HER¶

Hindsight Experience Replay (HER) is a method wrapper that works with off-policy methods (for example DQN, SAC, TD3 and DDPG).

Note

Compared to the original OpenAI Baselines, HER was re-implemented from scratch in Stable-Baselines. To reproduce the results from the paper, use the RL Baselines Zoo to get the correct hyperparameters, and use at least 8 MPI workers with DDPG.

Warning

HER requires the environment to inherit from gym.GoalEnv.
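
To make this requirement concrete, the sketch below mimics the interface that gym.GoalEnv prescribes, without importing gym: observations are dicts with the keys observation, achieved_goal and desired_goal, and the reward comes from compute_reward(achieved_goal, desired_goal, info). The class name TinyBitFlipEnv, its seed argument and its sparse 0 / -1 reward are assumptions made for illustration; a real environment would subclass gym.GoalEnv and declare a Dict observation space.

```python
import random


class TinyBitFlipEnv:
    """Gym-free sketch of a goal-based environment (hypothetical class)."""

    def __init__(self, n_bits=4, seed=0):
        self.n_bits = n_bits
        self.rng = random.Random(seed)
        self.state = [0] * n_bits
        self.goal = [1] * n_bits
        self.steps = 0

    def _get_obs(self):
        # These three keys are the ones a GoalEnv observation must contain.
        return {
            "observation": list(self.state),
            "achieved_goal": list(self.state),
            "desired_goal": list(self.goal),
        }

    def compute_reward(self, achieved_goal, desired_goal, info):
        # Sparse reward (an assumption of this sketch): 0 on success, -1 otherwise.
        return 0.0 if achieved_goal == desired_goal else -1.0

    def reset(self):
        self.state = [self.rng.randint(0, 1) for _ in range(self.n_bits)]
        self.goal = [1] * self.n_bits
        self.steps = 0
        return self._get_obs()

    def step(self, action):
        # The action is the index of the bit to flip.
        self.state[action] = 1 - self.state[action]
        self.steps += 1
        obs = self._get_obs()
        reward = self.compute_reward(obs["achieved_goal"], obs["desired_goal"], None)
        done = reward == 0.0 or self.steps >= self.n_bits
        return obs, reward, done, {"is_success": reward == 0.0}
```

Because compute_reward depends only on the two goals, the reward of a stored transition can be recomputed for any substitute goal, which is what HER relies on.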

Warning

To use the predict method, you must either pass an environment to the model or wrap your environment with HERGoalEnvWrapper.

Notes¶

Original paper: https://arxiv.org/abs/1707.01495
OpenAI paper: Plappert et al. (2018)
OpenAI blog post: https://openai.com/blog/ingredients-for-robotics-research/

Can I use?¶

Please refer to the wrapped model (DQN, SAC, TD3 or DDPG) for that section.

Example¶

from stable_baselines import HER, DQN, SAC, DDPG, TD3
from stable_baselines.her import GoalSelectionStrategy, HERGoalEnvWrapper
from stable_baselines.common.bit_flipping_env import BitFlippingEnv

model_class = DQN  # works also with SAC, DDPG and TD3

# Number of bits in the bit-flipping task (illustrative value)
N_BITS = 15

env = BitFlippingEnv(N_BITS, continuous=model_class in [DDPG, SAC, TD3], max_steps=N_BITS)

# Available strategies (cf paper): future, final, episode, random
goal_selection_strategy = 'future' # equivalent to GoalSelectionStrategy.FUTURE

# Wrap the model
model = HER('MlpPolicy', env, model_class, n_sampled_goal=4,
            goal_selection_strategy=goal_selection_strategy, verbose=1)
# Train the model
model.learn(1000)

model.save("./her_bit_env")

# WARNING: you must pass an env
# or wrap your environment with HERGoalEnvWrapper to use the predict method
model = HER.load('./her_bit_env', env=env)

obs = env.reset()
for _ in range(100):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)

    if done:
        obs = env.reset()

Parameters¶

class stable_baselines.her.HER(policy, env, model_class, n_sampled_goal=4, goal_selection_strategy='future', *args, **kwargs)[source]¶

Hindsight Experience Replay (HER) https://arxiv.org/abs/1707.01495

Parameters:

policy – (BasePolicy or str) The policy model to use (MlpPolicy, CnnPolicy, CnnLstmPolicy, …)
env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
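model_class – (OffPolicyRLModel) The off-policy RL model to apply Hindsight Experience Replay to. Currently supported: DQN, DDPG, SAC, TD3
n_sampled_goal – (int) The number of artificial transitions to generate for each actual transition
goal_selection_strategy – (GoalSelectionStrategy or str) The method used to generate the goals for the artificial transitions (future, final, episode or random)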

action_probability(observation, state=None, mask=None, actions=None, logp=False)[source]¶

If actions is None, then get the model’s action probability distribution from a given observation.

Depending on the action space the output is:

Discrete: probability for each possible action
Box: mean and standard deviation of the action output

However, if actions is not None, this function returns the probability that the given actions are taken with the given parameters (observation, state, …) on this model. For discrete action spaces, it returns the probability mass; for continuous action spaces, the probability density. This is because the probability mass of any single action is always zero in continuous spaces; see http://blog.christianperone.com/2019/01/ for a good explanation.

Parameters:

observation – (np.ndarray) the input observation
state – (np.ndarray) The last states (can be None, used in recurrent policies)
mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
actions – (np.ndarray) (OPTIONAL) For calculating the likelihood that the given actions are chosen by the model for each of the given parameters. Must have the same number of actions and observations. (set to None to return the complete action probability distribution)
logp – (bool) (OPTIONAL) When specified with actions, returns probability in log-space. This has no effect if actions is None.

Returns:

(np.ndarray) the model’s (log) action probability
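
The difference between probability mass and probability density can be illustrated without the library. The sketch below assumes a diagonal Gaussian policy described by a mean and a standard deviation (the Box output listed above); gaussian_log_density is a hypothetical helper, not part of Stable-Baselines.

```python
import math


def gaussian_log_density(action, mean, std):
    """Log-density of a diagonal Gaussian evaluated at `action`."""
    log_p = 0.0
    for a, m, s in zip(action, mean, std):
        log_p += -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
    return log_p


mean, std = [0.0, 0.0], [0.1, 0.1]
log_p = gaussian_log_density([0.0, 0.0], mean, std)
# Exponentiating the log-density gives the density itself.
density = math.exp(log_p)
```

With a small standard deviation the density at the mean is greater than 1, which a probability mass can never be. This is one reason the log-space output (logp=True) is often easier to work with for continuous actions.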

get_env()[source]¶

Returns the current environment (can be None if not defined).

Returns:	(Gym Environment) The current environment

get_parameter_list()[source]¶

Get the TensorFlow Variables of the model’s parameters.

This includes all variables necessary for continuing training (saving / loading).

Returns:	(list) List of tensorflow Variables

learn(total_timesteps, callback=None, log_interval=100, tb_log_name='HER', reset_num_timesteps=True)[source]¶

Return a trained model.

Parameters:

total_timesteps – (int) The total number of samples to train on
callback – (function (dict, dict) -> boolean) Function called at every step with the state of the algorithm. It takes the local and global variables. If it returns False, training is aborted.
log_interval – (int) The number of timesteps before logging.
tb_log_name – (str) the name of the run for tensorboard log
reset_num_timesteps – (bool) whether or not to reset the current timestep number (used in logging)

Returns:

(BaseRLModel) the trained model
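
A callback matching the (dict, dict) -> boolean signature described above might look like the following sketch. make_stop_callback and max_calls are hypothetical names. The sketch deliberately reads no key from the locals dict, since the available keys depend on the wrapped model.

```python
def make_stop_callback(max_calls):
    """Build a callback that aborts training after `max_calls` invocations."""
    n_calls = 0

    def callback(locals_, globals_):
        nonlocal n_calls
        n_calls += 1
        # Returning False asks learn() to stop; True lets training continue.
        return n_calls < max_calls

    return callback
```

It would be passed to training as model.learn(total_timesteps, callback=make_stop_callback(500)).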

classmethod load(load_path, env=None, custom_objects=None, **kwargs)[source]¶

Load the model from file

Parameters:

load_path – (str or file-like) the saved parameter location
env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
custom_objects – (dict) Dictionary of objects to replace upon loading. If a variable is present in this dictionary as a key, it will not be deserialized and the corresponding item will be used instead. Similar to custom_objects in keras.models.load_model. Useful when you have an object in file that can not be deserialized.
kwargs – extra arguments to change the model when loading

predict(observation, state=None, mask=None, deterministic=True)[source]¶

Get the model’s action from an observation

Parameters:

observation – (np.ndarray) the input observation
state – (np.ndarray) The last states (can be None, used in recurrent policies)
mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
deterministic – (bool) Whether or not to return deterministic actions.

Returns:

(np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)

save(save_path, cloudpickle=False)[source]¶

Save the current parameters to file

Parameters:

save_path – (str or file-like) The save location
cloudpickle – (bool) Use older cloudpickle format instead of zip-archives.

set_env(env)[source]¶

Checks the validity of the environment, and if it is coherent, set it as the current environment.

Parameters:	env – (Gym Environment) The environment for learning a policy

setup_model()[source]¶

Create all the functions and tensorflow graphs necessary to train the model

Goal Selection Strategies¶

class stable_baselines.her.GoalSelectionStrategy[source]¶

The strategies for selecting new goals when creating artificial transitions.
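
As a rough illustration of what the four strategies from the paper (future, final, episode, random) select, the sketch below picks a substitute goal from lists of achieved goals. sample_goal and its arguments are hypothetical and do not reflect the library's internal code. For the future strategy, the caller is assumed to pass an index t that is not the last step of the episode.

```python
import random


def sample_goal(strategy, episode, t, all_achieved, rng):
    """Pick a substitute goal for the transition at index `t`.

    `episode` is the list of achieved goals of the current episode and
    `all_achieved` holds achieved goals from every stored episode.
    """
    if strategy == "future":
        # A state observed in the same episode, after step t.
        return rng.choice(episode[t + 1:])
    if strategy == "final":
        # The last state of the same episode.
        return episode[-1]
    if strategy == "episode":
        # Any state of the same episode.
        return rng.choice(episode)
    if strategy == "random":
        # Any state encountered so far during training.
        return rng.choice(all_achieved)
    raise ValueError("Unknown strategy: {}".format(strategy))
```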

Goal Env Wrapper¶

class stable_baselines.her.HERGoalEnvWrapper(env)[source]¶

A wrapper that allows using a dict observation space (coming from GoalEnv) with the RL algorithms. It assumes that all the spaces of the dict space are of the same type.

Parameters:	env – (gym.GoalEnv)

convert_dict_to_obs(obs_dict)[source]¶

Parameters:	obs_dict – (dict<np.ndarray>)
Returns:	(np.ndarray)

convert_obs_to_dict(observations)[source]¶

Inverse operation of convert_dict_to_obs

Parameters:	observations – (np.ndarray)
Returns:	(OrderedDict<np.ndarray>)
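
The two conversions can be pictured as a concatenation and its inverse. The sketch below uses plain Python lists in place of np.ndarray and assumes the parts are concatenated in the order observation, achieved_goal, desired_goal. Passing obs_dim and goal_dim explicitly is a simplification of this sketch, not the wrapper's signature.

```python
from collections import OrderedDict

# Assumed concatenation order of the dict entries.
KEY_ORDER = ("observation", "achieved_goal", "desired_goal")


def convert_dict_to_obs(obs_dict):
    """Concatenate the three parts of a goal observation into one flat list."""
    flat = []
    for key in KEY_ORDER:
        flat.extend(obs_dict[key])
    return flat


def convert_obs_to_dict(flat, obs_dim, goal_dim):
    """Split a flat observation back into its three named parts."""
    return OrderedDict([
        ("observation", flat[:obs_dim]),
        ("achieved_goal", flat[obs_dim:obs_dim + goal_dim]),
        ("desired_goal", flat[obs_dim + goal_dim:]),
    ])
```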

Replay Wrapper¶

class stable_baselines.her.HindsightExperienceReplayWrapper(replay_buffer, n_sampled_goal, goal_selection_strategy, wrapped_env)[source]¶

Wrapper around a replay buffer in order to use HER. This implementation is inspired by the one found in https://github.com/NervanaSystems/coach/.

Parameters:

replay_buffer – (ReplayBuffer)
n_sampled_goal – (int) The number of artificial transitions to generate for each actual transition
goal_selection_strategy – (GoalSelectionStrategy) The method that will be used to generate the goals for the artificial transitions.
wrapped_env – (HERGoalEnvWrapper) the GoalEnv wrapped using HERGoalEnvWrapper, which makes it possible to convert an observation to a dict and vice versa

add(obs_t, action, reward, obs_tp1, done)[source]¶

Add a new transition to the buffer.

Parameters:

obs_t – (np.ndarray) the last observation
action – ([float]) the action
reward – (float) the reward of the transition
obs_tp1 – (np.ndarray) the new observation
done – (bool) is the episode done
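
To show what generating artificial transitions means in practice, the sketch below relabels one finished episode using the future strategy. It is a simplified stand-in, not the wrapper's actual code: relabel_episode, the transition layout (dicts with the keys obs, action, achieved_next and desired_goal) and the sparse 0 / -1 reward are all assumptions made for illustration.

```python
import random


def compute_reward(achieved_goal, desired_goal):
    # Sparse reward assumed for this sketch: 0 on success, -1 otherwise.
    return 0.0 if achieved_goal == desired_goal else -1.0


def relabel_episode(episode, n_sampled_goal, rng):
    """Return the real transitions plus artificial ones ('future' strategy)."""
    stored = []
    for t, transition in enumerate(episode):
        # Store the real transition with its original goal.
        reward = compute_reward(transition["achieved_next"], transition["desired_goal"])
        stored.append(dict(transition, reward=reward))
        future = episode[t + 1:]
        if not future:
            continue  # the last step has no future state to sample from
        for _ in range(n_sampled_goal):
            # Replace the goal with a state achieved later in the episode
            # and recompute the reward for that substitute goal.
            new_goal = rng.choice(future)["achieved_next"]
            new_reward = compute_reward(transition["achieved_next"], new_goal)
            stored.append(dict(transition, desired_goal=new_goal, reward=new_reward))
    return stored
```

Each real transition that has at least one later step contributes n_sampled_goal extra entries, so the buffer grows by roughly a factor of 1 + n_sampled_goal.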

can_sample(n_samples)[source]¶

Check if n_samples samples can be sampled from the buffer.

Parameters:	n_samples – (int)
Returns:	(bool)