Paper Reading Group 9/7/2018
This week’s topic is Hierarchical Reinforcement Learning. We read two papers: Option-Critic and Self-Consistent Trajectory Autoencoder (SeCTAR).
Option-Critic
The Option-Critic Architecture
Learns options: temporally extended actions that span multiple time steps; essentially sub-policies or primitives
Provides mathematical backing (policy gradient theorems) for policy gradient over options
Introduces the idea of a meta-controller that decides which option to use, then uses "call-and-return" style control
Lets the meta-controller operate over a coarser time scale, which helps with temporal abstraction
Similar to the Policy Iteration algorithm: alternates Policy Evaluation and Policy Improvement, except over options
Uses Q_u(s, w, a) instead of Q(s, a), where w is the current option (sub-policy)
Experiments show that Option-Critic can recover faster from a goal shift
KR points out that it is still faster to retrain the baselines from scratch than to rely on Option-Critic's faster recovery
Results show that Option-Critic isn't necessarily better than learning from scratch, but also not worse
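The "call-and-return" control style above can be sketched as a simple loop. This is a minimal illustration, assuming hypothetical interfaces: meta_policy(s) returns an option, option.act(s) returns a primitive action, and option.should_terminate(s) is the option's termination condition beta(s). It is not the paper's actual code.

```python
def run_episode(env, meta_policy, max_steps=1000):
    s = env.reset()
    option = meta_policy(s)              # meta-controller picks an option
    for _ in range(max_steps):
        a = option.act(s)                # intra-option policy picks the action
        s, reward, done = env.step(a)
        if done:
            break
        if option.should_terminate(s):   # beta(s) fired: return control
            option = meta_policy(s)      # meta-controller chooses again
    return s
```

Note that the meta-controller only acts when an option terminates, which is what lets it operate on the coarser time scale.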
Self-Consistent Trajectory Autoencoder: Hierarchical Reinforcement Learning with Trajectory Embeddings
A representation-learning approach to HRL that learns trajectory embeddings; outperforms many HRL baselines on sparse-reward, long-horizon tasks
Assuming we have a trajectory embedding z, we train a state decoder and a policy decoder
The state decoder outputs the trajectory of states represented by z, written p(tau | z), where tau is the trajectory
The policy decoder outputs a policy pi(a | s, z) whose behavior reproduces the state decoder's output trajectory (this is the "self-consistency" in the name)
We can use the state decoder to do Model Predictive Control (MPC):
Sample K sequences of latent vectors z from the prior
Use the state decoder to turn each sequence of latents into a sequence of states
Evaluate each predicted state sequence with the reward function and choose the highest-reward one
Use the policy decoder on the best sequence and execute that policy for T steps, then replan
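The MPC steps above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: state_decoder, policy_decoder, and reward_fn are assumed interfaces (decode latents to states, decode a latent to a policy, and score a predicted trajectory with R), and the shapes are made up for the example.

```python
import numpy as np

def latent_mpc_step(state_decoder, policy_decoder, reward_fn,
                    latent_dim, K=100, H=3, rng=None):
    rng = np.random.default_rng(rng)
    # 1. Sample K sequences of H latent vectors from the prior N(0, I)
    z_seqs = rng.standard_normal((K, H, latent_dim))
    # 2. Decode each latent sequence into a predicted state trajectory
    trajs = [state_decoder(z) for z in z_seqs]
    # 3. Score each predicted trajectory with the reward function R
    scores = [reward_fn(traj) for traj in trajs]
    best = int(np.argmax(scores))
    # 4. Decode the first latent of the best sequence into a policy,
    #    to be executed for T steps before replanning
    return policy_decoder(z_seqs[best, 0])
```

Because planning happens in latent space, each candidate is a whole T-step behavior rather than a single primitive action, which is what makes the search tractable over long horizons.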
Finally, we need a way to discover trajectories and to train the VAE that produces the trajectory embeddings
Use pi_e, an explorer policy, to discover trajectories
Maximize entropy over the trajectory distribution to get a diverse set of trajectories
Since the entropy of p(tau) is intractable, use a variational lower-bound approximation to represent log p(tau)
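The lower bound on log p(tau) is the usual VAE evidence lower bound, here with two reconstruction terms because both decoders reconstruct from the same z. The sketch below is an assumption-laden illustration, not the paper's loss: it takes the two decoders' log-likelihoods as given scalars and assumes a diagonal-Gaussian encoder q(z | tau) with mean mu and log-variance logvar.

```python
import numpy as np

def sectar_elbo(state_recon_ll, policy_ll, mu, logvar):
    # KL( q(z|tau) || N(0, I) ) in closed form for diagonal Gaussians
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    # ELBO on log p(tau): state-decoder reconstruction + policy-decoder
    # (behavioral cloning) likelihood, minus the KL regularizer
    return state_recon_ll + policy_ll - kl
```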
Overall Algorithm: alternate between collecting trajectories with the explorer pi_e, training the VAE on them to refine the embeddings, and acting with the latent-space MPC procedure
They are given a function R(s) that returns the reward of any arbitrary state; it is used in the MPC step to evaluate the K sampled trajectories. We would like to see whether R can be learned instead, as demonstrated by the auxiliary rewards in UNREAL
I wonder what interpolating trajectory embeddings would look like. Could this be used to create a transition policy between two trajectories?
T is fixed, so each primitive must be T steps long; it might be better if this length could be learned