ACER¶
Sample Efficient Actor-Critic with Experience Replay (ACER) combines several ideas from previous algorithms: it uses multiple workers (as in A2C), a replay buffer (as in DQN), Retrace for Q-value estimation, importance sampling and a trust region.
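The Retrace target can be written as a backward recursion over a sampled trajectory, using truncated importance weights. The sketch below is illustrative only, not the library's internal implementation; the function name retrace_targets and its argument layout are invented for this example.
import numpy as np

def retrace_targets(rewards, q_values, values, rho, dones, v_final, gamma=0.99):
    # rewards[t]: r_t, q_values[t]: Q(s_t, a_t), values[t]: V(s_t)
    # rho[t]: pi(a_t|s_t) / mu(a_t|s_t), dones[t]: episode ended after step t
    # v_final: value of the state following the last step (bootstrap value)
    n_steps = len(rewards)
    targets = np.zeros(n_steps)
    q_ret = v_final
    for t in reversed(range(n_steps)):
        q_ret = rewards[t] + gamma * q_ret * (1.0 - dones[t])
        targets[t] = q_ret
        rho_bar = min(1.0, rho[t])  # truncated importance weight
        q_ret = rho_bar * (q_ret - q_values[t]) + values[t]
    return targets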
Notes¶
- Original paper: https://arxiv.org/abs/1611.01224
python -m stable_baselines.acer.run_atari
runs the algorithm for 40M frames (= 10M timesteps) on an Atari game. See help (-h) for more options.
Can I use?¶
- Recurrent policies: ✔️
- Multi processing: ✔️
- Gym spaces:
Space | Action | Observation |
---|---|---|
Discrete | ✔️ | ✔️ |
Box | ❌ | ✔️ |
MultiDiscrete | ❌ | ✔️ |
MultiBinary | ❌ | ✔️ |
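Since Box action spaces are not supported, a simple guard before instantiating ACER can avoid a confusing error later. This check is purely illustrative:
import gym
from gym import spaces

env = gym.make('CartPole-v1')
# ACER only supports Discrete action spaces (see the table above)
assert isinstance(env.action_space, spaces.Discrete), "ACER requires a Discrete action space"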
Example¶
import gym
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import ACER
# multiprocess environment
n_cpu = 4
env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for i in range(n_cpu)])
model = ACER(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save("acer_cartpole")
del model # remove to demonstrate saving and loading
model = ACER.load("acer_cartpole")
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
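If multiprocessing is not needed, the same example can run in a single process with DummyVecEnv; this is a minimal variation on the example above, reusing its imports:
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: gym.make('CartPole-v1')])  # single, in-process environment
model = ACER(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)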
Parameters¶
- class stable_baselines.acer.ACER(policy, env, gamma=0.99, n_steps=20, num_procs=1, q_coef=0.5, ent_coef=0.01, max_grad_norm=10, learning_rate=0.0007, lr_schedule='linear', rprop_alpha=0.99, rprop_epsilon=1e-05, buffer_size=5000, replay_ratio=4, replay_start=1000, correction_term=10.0, trust_region=True, alpha=0.99, delta=1, verbose=0, tensorboard_log=None, _init_setup_model=True)¶
The ACER (Actor-Critic with Experience Replay) model class, https://arxiv.org/abs/1611.01224
Parameters: - policy – (ActorCriticPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, CnnLstmPolicy, …)
- env – (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
- gamma – (float) The discount factor
- n_steps – (int) The number of steps to run for each environment per update (i.e. batch size is n_steps * n_env where n_env is number of environment copies running in parallel)
- num_procs – (int) The number of threads for TensorFlow operations
- q_coef – (float) The weight for the loss on the Q value
- ent_coef – (float) The weight for the entropy loss
- max_grad_norm – (float) The clipping value for the maximum gradient
- learning_rate – (float) The initial learning rate for the RMS prop optimizer
- lr_schedule – (str) The type of scheduler for the learning rate update (‘linear’, ‘constant’, ‘double_linear_con’, ‘middle_drop’ or ‘double_middle_drop’)
- rprop_epsilon – (float) RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)
- rprop_alpha – (float) RMSProp decay parameter (default: 0.99)
- buffer_size – (int) The buffer size in number of steps
- replay_ratio – (float) The average number of replay (off-policy) updates per on-policy update, sampled from a Poisson distribution
- replay_start – (int) The minimum number of steps in the buffer before replay learning starts
- correction_term – (float) Importance weight clipping factor (default: 10)
- trust_region – (bool) Whether or not the algorithm estimates the gradient of the KL divergence between the old and updated policy and uses it to determine the step size (default: True)
- alpha – (float) The decay rate for the Exponential moving average of the parameters
- delta – (float) max KL divergence between the old policy and updated policy (default: 1)
- verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
- tensorboard_log – (str) the log location for tensorboard (if None, no logging)
- _init_setup_model – (bool) Whether or not to build the network at the creation of the instance
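As an illustration of how these arguments fit together, here is a hypothetical configuration that tunes the replay-related parameters; the specific values are arbitrary, not recommendations:
from stable_baselines import ACER
from stable_baselines.common.policies import MlpPolicy

model = ACER(
    MlpPolicy,
    'CartPole-v1',          # env can be a Gym id string, per the env parameter above
    gamma=0.99,
    n_steps=20,
    replay_ratio=4,         # on average 4 off-policy updates per on-policy update (Poisson mean)
    buffer_size=5000,       # replay buffer size in steps
    replay_start=1000,      # wait until the buffer holds 1000 steps before replaying
    lr_schedule='constant',
    verbose=1,
)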
- action_probability(observation, state=None, mask=None)¶
Get the model’s action probability distribution from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
Returns: (np.ndarray) the model’s action probability distribution
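For example (illustrative usage, assuming the model and env from the example above):
obs = env.reset()
probs = model.action_probability(obs)
# for a Discrete action space: one probability per action, per environment copy
print(probs)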
- get_env()¶
Returns the current environment (can be None if not defined)
Returns: (Gym Environment) The current environment
- learn(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='ACER')¶
Return a trained model.
Parameters: - total_timesteps – (int) The total number of samples to train on
- seed – (int) The initial seed for training, if None: keep current seed
- callback – (function (dict, dict) -> bool) function called at every step with the state of the algorithm. It takes the local and global variables. If it returns False, training is aborted.
- log_interval – (int) The number of timesteps before logging.
- tb_log_name – (str) the name of the run for tensorboard log
Returns: (BaseRLModel) the trained model
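A minimal callback sketch, assuming the model from the example above; the exact keys available in the local variables depend on the algorithm:
def training_callback(locals_, globals_):
    # inspect the algorithm state here; returning False aborts training
    return True

model.learn(total_timesteps=25000, callback=training_callback, log_interval=50)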
- classmethod load(load_path, env=None, **kwargs)¶
Load the model from file
Parameters: - load_path – (str or file-like) the saved parameter location
- env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
- kwargs – extra arguments to change the model when loading
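For instance, to reload the model saved in the example above and attach a fresh environment (illustrative sketch, reusing the imports from that example):
env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for i in range(4)])
model = ACER.load("acer_cartpole", env=env, verbose=0)  # extra kwargs such as verbose override the saved settings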
- predict(observation, state=None, mask=None, deterministic=False)¶
Get the model’s action from an observation
Parameters: - observation – (np.ndarray) the input observation
- state – (np.ndarray) The last states (can be None, used in recurrent policies)
- mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
- deterministic – (bool) Whether or not to return deterministic actions.
Returns: (np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
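For example, to act greedily with the trained model (illustrative, using the env from the example above):
obs = env.reset()
action, _states = model.predict(obs, deterministic=True)  # pick the most likely action instead of sampling
obs, rewards, dones, info = env.step(action)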
- save(save_path)¶
Save the current parameters to file
Parameters: save_path – (str or file-like object) the save location