This paper presents a theoretically grounded representation-learning objective for HRL that extends Data-Efficient HRL (Nachum et al.) and outperforms other HRL methods. I liked the goal-conditioned formulation and the theoretical analysis used to derive the objective. The objective suggests a connection to maximizing mutual information between the future state and the current state and policy, which is task-reward agnostic and intuitively appealing. The experiments are simple AntWalker tasks, but they follow the setups of previous HRL baselines, against which the authors compare.
This paper first formulates hierarchical reinforcement learning from a goal-conditioned perspective and presents a metric of usefulness, or sub-optimality, of a representation. The representation maps the full state s to a goal space g. The low-level controller's job is to reach g by minimizing the distance between repr(s) and g.
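To make the setup concrete, the low-level controller's intrinsic reward can be sketched as the negative distance between the representation of the reached state and the commanded goal (a minimal sketch; the function and variable names here are my own, not the paper's):

```python
import numpy as np

def low_level_reward(f, s_next, g):
    """Hypothetical intrinsic reward for the low-level controller:
    negative distance between the representation of the next state
    and the subgoal g set by the high-level policy."""
    return -np.linalg.norm(f(s_next) - g)

# Toy usage with a linear representation f(s) = s[:2] (an xy-projection):
f = lambda s: s[:2]
s_next = np.array([1.0, 2.0, 0.5, -0.3])
g = np.array([1.0, 2.0])
r = low_level_reward(f, s_next, g)  # reward is 0 once repr(s) reaches g
```

The high-level policy then only needs to act in the (typically much smaller) goal space rather than the raw state space.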
Given this setup, the authors present a metric for the quality of a representation: the quality is inversely related to the loss in value incurred when the representation is used by the low-level controller.
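Roughly (the exact notation here is my paraphrase, not copied from the paper), the sub-optimality of a representation f can be written as the worst-case value gap:

```latex
\mathrm{SubOpt}(f) \;=\; \sup_{s} \; V^{\pi^{*}}(s) \;-\; V^{\pi^{*}_{f}}(s)
```

where \pi^{*} is the optimal policy on the original MDP and \pi^{*}_{f} is the best hierarchical policy restricted to communicating subgoals through f. A small SubOpt(f) certifies that little is lost by planning in goal space.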
The authors bound the loss in optimality by bounding the mismatch between the next-state distribution induced by a locally optimal low-level policy and the optimal next-state distribution. They then minimize this bound with a proposed gradient estimator.
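As a loose illustration of what minimizing such a bound by gradient descent looks like (this is NOT the paper's actual objective, which is a likelihood-ratio-style term derived from the bound; here I substitute a much simpler surrogate where a linear representation is trained so that subgoal-directed transitions become predictable in goal space):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy transitions: only the first two state dims respond to the subgoal g,
# i.e. s'[:2] = s[:2] + g. We fit a linear representation f(s) = W s so that
# f(s') ~= f(s) + g, a simplified stand-in for bound minimization.
S = rng.normal(size=(256, 4))
G = rng.normal(size=(256, 2))
S_next = S.copy()
S_next[:, :2] += G

W = rng.normal(scale=0.1, size=(2, 4))
lr = 0.05
for _ in range(500):
    # prediction error f(s') - (f(s) + g), shape (N, 2)
    pred_err = (S_next - S) @ W.T - G
    # gradient of the mean squared error w.r.t. W
    grad = 2 * pred_err.T @ (S_next - S) / len(S)
    W -= lr * grad

loss = np.mean(np.sum(((S_next - S) @ W.T - G) ** 2, axis=1))
```

After training, W recovers (up to noise) the xy-projection on the responsive dims, which mirrors the qualitative finding in the experiments that the learned representation approaches the xy oracle.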
They first show that their representation is more data-efficient than HRL baselines that use either no representation or other representation objectives (VAE, etc.). They then analyze the learned representations and compare against baselines on the AntWalker task, visually showing that their learned representation is closest to the oracle (xy coordinates). What is nice is that they can extract a good representation from raw images, while the other baselines fail to.