Hierarchical Reinforcement Learning: Option-Critic and SECTAR

September 7, 2018 - 3 minute read
Machine Learning · Reinforcement Learning · Option-Critic · SECTAR

SECTAR jointly trains a state decoder and a policy decoder

Paper Reading Group 9/7/2018

This week’s topic is Hierarchical Reinforcement Learning. We read two papers: The Option-Critic Architecture and the Self-Consistent Trajectory Autoencoder (SECTAR).

Option-Critic

The Option-Critic Architecture

  • Learns options, which are actions that span multiple time steps; essentially sub-policies or temporally extended primitives
  • Provides mathematical backing for policy gradients over options (both the intra-option policies and their termination conditions)
  • Introduces a meta-controller (policy over options) that decides which option to use, then follows “call-and-return” control: the chosen option runs until its termination condition fires, and control returns to the meta-controller
    • This lets the meta-controller operate at a coarser time scale, which is the point of temporal abstraction
  • Similar to the policy iteration algorithm: it alternates policy evaluation and policy improvement, except over options
  • Uses Q_U(s, w, a) instead of Q(s, a), where w is the currently executing option (sub-policy)
for i in training_iterations:
    # policy evaluation, but over options: estimate Q_U(s, w, a)
    options_evaluation(Q_U)
    # policy improvement, but over options: update intra-option policies and terminations
    options_improvement()
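To make the call-and-return control flow concrete, here is a minimal sketch (my own, not code from the paper). The gym-style env, meta_policy, and the Option class are placeholder assumptions, and the learning updates for Q_U, the intra-option policies, and the terminations are omitted:

import random

class Option:
    # A hypothetical option: an intra-option policy plus a termination condition.
    def __init__(self, policy, termination_prob):
        self.policy = policy                      # maps state -> primitive action
        self.termination_prob = termination_prob  # maps state -> probability of terminating

    def act(self, state):
        return self.policy(state)

    def terminates(self, state):
        return random.random() < self.termination_prob(state)

def run_episode(env, meta_policy, options):
    # Call-and-return execution: the meta-controller picks an option, the option
    # runs until its termination condition fires, then control returns.
    state = env.reset()
    done = False
    while not done:
        w = meta_policy(state)            # meta-controller selects an option index
        option = options[w]
        while not done:
            action = option.act(state)    # the option's sub-policy picks primitive actions
            state, reward, done, _ = env.step(action)
            if option.terminates(state):  # hand control back to the meta-controller
                break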
  • Experiments show that Option-Critic can recover faster when the goal is moved (goal shift)
    • KR points out that in some cases it is still faster to retrain the baseline from scratch than to let Option-Critic recover
  • Overall, the results show that Option-Critic isn’t necessarily better than learning from scratch, but it also isn’t worse

SECTAR

Self-Consistent Trajectory Autoencoder: Hierarchical Reinforcement Learning with Trajectory Embeddings

  • A representation-learning approach to HRL that learns trajectory embeddings; it outperforms many HRL baselines on sparse-reward, long-horizon tasks
  • Given a trajectory embedding z, SECTAR trains a state decoder and a policy decoder
    • the state decoder outputs the trajectory of states represented by z
      • written as p(tau | z), where tau is the trajectory
    • the policy decoder outputs a policy whose behavior reproduces the state trajectory that the state decoder outputs
      • written as pi(a | s, z)
  • We can use the state decoder to do Model Predictive Control (MPC) in latent space (see the sketch after this list)
    • sample K sequences of latent vectors z from the prior
    • use the state decoder to turn each sequence of latents into a sequence of states
    • evaluate each predicted state sequence and choose the one with the highest reward
    • run the policy decoder on the highest-reward latent sequence and execute that policy for T steps
  • Finally, we need a way to discover trajectories and to train the VAE that produces the trajectory embeddings
    • use an explorer policy pi_e to discover trajectories
    • train pi_e to maximize the entropy of the trajectory distribution, so that it finds a diverse set of trajectories
      • since p(tau) is intractable, approximate it with the VAE’s variational lower bound
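A minimal sketch of the MPC procedure described above, assuming a gym-style env, a given reward function R(s), and hypothetical state_decoder / policy_decoder objects (the names, interfaces, and default values are mine, not the paper’s):

import numpy as np

def latent_mpc_step(env, state, state_decoder, policy_decoder, reward_fn,
                    K=100, H=4, z_dim=8, T=20):
    # Sample K candidate plans, each a sequence of H latent vectors drawn from the prior.
    plans = np.random.randn(K, H, z_dim)

    # Score a plan by decoding each latent into a predicted state sequence
    # and summing the given reward function R(s) over the predicted states.
    def plan_return(plan):
        total, start = 0.0, state
        for z in plan:
            predicted_states = state_decoder.decode(z, start=start)  # hypothetical interface
            total += sum(reward_fn(s) for s in predicted_states)
            start = predicted_states[-1]                             # chain decoded segments
        return total

    best_plan = max(plans, key=plan_return)

    # Execute the policy decoder conditioned on the best plan's first latent for T steps.
    z = best_plan[0]
    for _ in range(T):
        action = policy_decoder.act(state, z)
        state, reward, done, _ = env.step(action)
        if done:
            break
    return state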

Overall Algorithm:

Initialize experience buffer with trajectories from pi_e
for i in training_iterations:
    execute MPC in latent space (essentially acting out one episode)
    run pi_e starting from states visited during MPC (this discovers new trajectories)
    update pi_e to maximize the entropy of the trajectory distribution
    train the state decoder with supervised learning
    train the policy decoder with RL
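A rough sketch of the last two updates, assuming PyTorch. The architecture, dimensions, and squared-error losses are my own simplifications (the state decoder is really a VAE decoder trained with the usual ELBO, and the paper’s exact policy-decoder reward may differ), the trajectory encoder q(z | tau) that would supply z_batch is omitted, and the per-step rewards would feed whatever policy-gradient method trains the policy decoder:

import torch
import torch.nn as nn

# Hypothetical dimensions: states are s_dim vectors, latents are z_dim vectors,
# and each decoded segment is T states long.
s_dim, z_dim, T = 4, 8, 20

# State decoder p(tau | z): maps a latent to a predicted sequence of T states.
state_decoder = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, T * s_dim))
optimizer = torch.optim.Adam(state_decoder.parameters(), lr=1e-3)

def state_decoder_update(z_batch, traj_batch):
    # Supervised update: reconstruct stored trajectories from their embeddings.
    pred = state_decoder(z_batch).view(-1, T, s_dim)
    loss = ((pred - traj_batch) ** 2).mean()    # Gaussian reconstruction loss (KL term omitted)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def policy_decoder_reward(visited_states, z):
    # Self-consistency reward for the policy decoder (trained with RL): higher when the
    # states actually visited match what the state decoder predicts for the same z.
    with torch.no_grad():
        target = state_decoder(z).view(T, s_dim)
    return -((visited_states - target) ** 2).sum(dim=-1)    # one reward per time step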
  • They assume access to a function R(s) that gives the reward of any arbitrary state; it is used in the MPC step to evaluate the K sampled trajectories. We would like to see whether R could be learned instead, similar to the auxiliary reward prediction in UNREAL
  • I wonder what interpolating between trajectory embeddings would look like; could it be used to create a transition policy between trajectories?
  • T is fixed, so each primitive must be exactly T steps long; it might be better if the primitive length could be learned