TLDR: This paper induces learned options to terminate in a small, concentrated set of states, which makes option transitions more predictable and planning easier.
Hierarchical Reinforcement Learning, specifically end-to-end option discovery, suffers from option collapse, where options degenerate into single-action primitives. One cause is the standard termination objective: an option should terminate whenever another option has higher value, and noise in the learned option-value function makes this trigger far too often.
In this work, the authors propose making the termination objective completely independent of the task reward. Instead, they use an information-theoretic objective that induces termination on only a small set of states, which makes planning easier.
What is nice is how this objective is optimized. The authors derive a termination gradient that can be expressed in terms of the option model, so they can jointly train the option model on the task and the termination condition on the termination objective.
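To make the option-model connection concrete, here is a tabular sketch of how an option's final-state distribution follows in closed form from a one-step option model and per-state termination probabilities. The chain MDP, the variable names, and the closed form below are my own illustration, not the paper's notation:

```python
import numpy as np

# Hypothetical 4-state chain under the option's policy:
# P[s, s'] is the one-step option model, beta[s] the termination
# probability in state s (state 3 always terminates).
P = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]])
beta = np.array([0.1, 0.1, 0.2, 1.0])

def final_state_dist(P, beta, s0):
    """Distribution over the state in which the option terminates,
    from the option model: f = d0 P (I - diag(1-beta) P)^{-1} diag(beta)."""
    n = len(beta)
    d0 = np.zeros(n)
    d0[s0] = 1.0
    cont = np.diag(1.0 - beta)                 # continuation probabilities
    occ = d0 @ P @ np.linalg.inv(np.eye(n) - cont @ P)  # arrival occupancies
    return occ * beta                          # terminate on arrival w.p. beta

f = final_state_dist(P, beta, s0=0)
# f = [0, 0.1, 0.18, 0.72]: most termination mass lands on state 3
```

Because the final-state distribution is a differentiable function of `beta`, a gradient of any objective on `f` with respect to the termination probabilities can be pushed through this expression, which is the spirit of training the termination condition from the option model.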
Their proposed overall objective is built around the predictability objective: desirable options are targeted and have small terminating regions, which is captured by minimizing the entropy of the distribution over an option's final states. While this may seem simple and obvious, the way the authors optimize this loss is novel.
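A minimal sketch of this predictability loss, assuming a sigmoid parameterization of the termination probabilities and a finite-difference gradient in place of the paper's analytic termination gradient (the MDP and all names here are invented for illustration):

```python
import numpy as np

# Toy 3-state MDP under the option's policy: state 0 branches to 1 or 2,
# and both paths eventually reach state 2.
P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])

def entropy_of_final_states(logits, P, s0=0):
    """Predictability loss: entropy of the option's final-state
    distribution, computed from the option model."""
    beta = 1.0 / (1.0 + np.exp(-logits))       # termination probabilities
    n = len(beta)
    d0 = np.zeros(n)
    d0[s0] = 1.0
    occ = d0 @ P @ np.linalg.inv(np.eye(n) - np.diag(1.0 - beta) @ P)
    f = occ * beta                              # final-state distribution
    f = f / f.sum()
    return -np.sum(f * np.log(f + 1e-12))

logits = np.zeros(3)                            # beta = 0.5 everywhere
h_init = entropy_of_final_states(logits, P)
for _ in range(200):                            # finite-difference descent
    g = np.array([(entropy_of_final_states(logits + 1e-5 * e, P)
                   - entropy_of_final_states(logits - 1e-5 * e, P)) / 2e-5
                  for e in np.eye(3)])
    logits -= 1.0 * g
h_final = entropy_of_final_states(logits, P)
```

Minimizing the entropy drives the termination mass onto a single state (here, state 2), so `h_final` ends up well below `h_init`; the paper's contribution is doing this with an analytic gradient through the option model rather than the brute-force numerical gradient used in this sketch.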
The authors experiment on simple HRL environments such as the four-rooms gridworld, where their model outperforms standard option-critic, though not by much (roughly 10-20%). They also attempt to show that more directed termination helps planning performance. Their figure is hard to read, but the argument makes sense to me: more concentrated termination states make value prediction and value iteration easier, which yields faster convergence and improvement.