TRPO

Trust Region Policy Optimization (TRPO) is an iterative method for optimizing policies. Monotonic improvement is guaranteed for its theoretical form; the practical algorithm relies on several approximations to that procedure.
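The "trust region" in the name refers to keeping each policy update close to the previous policy, commonly measured by KL divergence (compare the `max_kl` argument in the constructor signature on this page). The sketch below is a toy illustration of that idea only, not the library's implementation: it interpolates directly between two probability vectors instead of stepping in parameter space, and it omits the check that the surrogate objective improves. The names `categorical_kl` and `backtrack_step` are hypothetical.

```python
import math


def categorical_kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Assumes every entry of q is strictly positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def backtrack_step(old_probs, proposed_probs, max_kl=0.01, shrink=0.5, max_backtracks=10):
    """Shrink the step from old_probs toward proposed_probs until KL <= max_kl."""
    frac = 1.0
    for _ in range(max_backtracks):
        candidate = [o + frac * (n - o) for o, n in zip(old_probs, proposed_probs)]
        if categorical_kl(old_probs, candidate) <= max_kl:
            return candidate, frac
        frac *= shrink
    # No acceptable step found: keep the old policy unchanged.
    return list(old_probs), 0.0


old = [0.5, 0.5]
proposed = [0.9, 0.1]  # a large jump that violates the KL limit
new, frac = backtrack_step(old, proposed)
print(frac, new, categorical_kl(old, new))
```

The full step is rejected and halved repeatedly until the candidate distribution falls inside the KL limit, which is the role a backtracking line search plays in a trust-region update.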

Notes

Original paper: https://arxiv.org/abs/1502.05477
OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
mpirun -np 16 python -m stable_baselines.trpo_mpi.run_atari runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (-h) for more options.
python -m stable_baselines.trpo_mpi.run_mujoco runs the algorithm for 1M timesteps on a MuJoCo environment.
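The "40M frames = 10M timesteps" figure for the Atari run implies four emulator frames per agent timestep. The arithmetic can be written out as below, assuming a frame skip of 4 (the usual Atari preprocessing choice; the note above does not state it explicitly). `FRAME_SKIP` and `frames_to_timesteps` are hypothetical names used only for this illustration.

```python
# Assumption: each agent timestep repeats the chosen action for 4 emulator frames.
FRAME_SKIP = 4


def frames_to_timesteps(frames, frame_skip=FRAME_SKIP):
    """Convert a count of emulator frames into agent timesteps."""
    return frames // frame_skip


timesteps = frames_to_timesteps(40_000_000)
print(timesteps)
```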

Can I use?

Recurrent policies: ✔️
Multi processing: ✔️ (using MPI)
Gym spaces:

Space	Action	Observation
Discrete	✔️	✔️
Box	✔️	✔️
MultiDiscrete	✔️	✔️
MultiBinary	✔️	✔️

Example

import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import TRPO

# Wrap the Gym environment in a single-process vectorized environment
env = gym.make('CartPole-v1')
env = DummyVecEnv([lambda: env])

# Train for 25,000 timesteps, then save the parameters to disk
model = TRPO(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save("trpo_cartpole")

del model # remove to demonstrate saving and loading

model = TRPO.load("trpo_cartpole")

# Run the loaded policy until interrupted (Ctrl+C);
# DummyVecEnv resets the environment automatically when an episode ends
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

Parameters

class stable_baselines.trpo_mpi.TRPO(policy, env, gamma=0.99, timesteps_per_batch=1024, max_kl=0.01, cg_iters=10, lam=0.98, entcoeff=0.0, cg_damping=0.01, vf_stepsize=0.0003, vf_iters=3, verbose=0, tensorboard_log=None, _init_setup_model=True)
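The signature lists `gamma` and `lam` without describing them on this page. Assuming they are the discount factor and the Generalized Advantage Estimation (GAE) factor, which is how TRPO implementations commonly use a gamma/lambda pair, the sketch below shows how the two defaults would combine when computing advantages. It is a pure-Python illustration over a single trajectory segment with no episode terminations, not the library's code; `gae_advantages` is a hypothetical name.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.98):
    """Generalized Advantage Estimation for one segment without terminations.

    rewards[t] and values[t] are the reward and value estimate at step t;
    last_value is the value estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


adv = gae_advantages(rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0], last_value=0.0)
print(adv)
```

With `lam` close to 1 the estimate leans on observed returns (lower bias, higher variance); smaller values lean more on the value function.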

action_probability(observation, state=None, mask=None)

Get the model’s action probability distribution from an observation

Parameters:

observation – (np.ndarray) the input observation
state – (np.ndarray) The last states (can be None, used in recurrent policies)
mask – (np.ndarray) The last masks (can be None, used in recurrent policies)

Returns:

(np.ndarray) the model’s action probability distribution

get_env()

Return the current environment (can be None if not defined).

Returns:	(Gym Environment) The current environment

learn(total_timesteps, callback=None, seed=None, log_interval=100, tb_log_name='TRPO')

Return a trained model.

Parameters:

total_timesteps – (int) The total number of samples to train on
seed – (int) The initial seed for training, if None: keep current seed
callback – (function (dict, dict)) function called at every step with the state of the algorithm. It takes the local and global variables.
log_interval – (int) The number of timesteps before logging.
tb_log_name – (str) the name of the run for tensorboard log

Returns:

(BaseRLModel) the trained model
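The `callback` parameter described above receives two dictionaries (the local and global variables of the training loop). A minimal sketch of such a function follows. The closure-based call counter is an illustrative design, and the convention that returning False asks training to stop early is an assumption not stated on this page. `make_step_limit_callback` is a hypothetical helper.

```python
def make_step_limit_callback(max_calls):
    """Build a callback(locals_, globals_) that counts how often it is called."""
    state = {"calls": 0}

    def callback(locals_, globals_):
        state["calls"] += 1
        # Assumption: returning False requests that training stop early.
        return state["calls"] < max_calls

    return callback, state


callback, state = make_step_limit_callback(max_calls=3)

# Simulate three invocations with empty variable dictionaries.
results = [callback({}, {}) for _ in range(3)]
print(results, state["calls"])
```

Under those assumptions it would be passed as `model.learn(total_timesteps=25000, callback=callback)`.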

classmethod load(load_path, env=None, **kwargs)

Load the model from file

Parameters:

load_path – (str) the saved parameter location
env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
kwargs – extra arguments to change the model when loading

predict(observation, state=None, mask=None, deterministic=False)

Get the model’s action from an observation

Parameters:

observation – (np.ndarray) the input observation
state – (np.ndarray) The last states (can be None, used in recurrent policies)
mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
deterministic – (bool) Whether or not to return deterministic actions.

Returns:

(np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
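For a Discrete action space, one way to picture the `deterministic` flag is as a choice between taking the most probable action and sampling from the action distribution. The sketch below illustrates that reading with plain Python. It is an assumption about the flag's meaning, not the library's code, and `pick_action` is a hypothetical name.

```python
import random


def pick_action(action_probs, deterministic=False, rng=random):
    """Choose an action index from a list of action probabilities."""
    actions = range(len(action_probs))
    if deterministic:
        # Always take the most probable action.
        return max(actions, key=lambda a: action_probs[a])
    # Otherwise sample an action in proportion to its probability.
    return rng.choices(actions, weights=action_probs, k=1)[0]


probs = [0.1, 0.7, 0.2]
greedy = pick_action(probs, deterministic=True)
sampled = pick_action(probs, rng=random.Random(0))
print(greedy, sampled)
```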

save(save_path)

Save the current parameters to file

Parameters:	save_path – (str) the save location

set_env(env)

Check the validity of the environment and, if it is coherent, set it as the current environment.

Parameters:	env – (Gym Environment) The environment for learning a policy

setup_model()

Create all the functions and TensorFlow graphs necessary to train the model