Hybrid Reward Architecture

October 19, 2018 - 3 minute read
Paper Summary | Machine Learning | Reinforcement Learning

This paper focuses on improving RL algorithms by decomposing the scalar reward function into a combination of multiple, simpler reward functions. The key idea is that in challenging domains the optimal value function cannot easily be reduced to a low-dimensional representation, and decomposing the reward into components that are each easier to learn can alleviate this.

Some advantages of decomposition are:

- more efficient learning, since each sub-reward usually depends only on a subset of the features
- sub-reward functions that are more generalizable / transferable
- a multimodal representation of rewards

Note that by decomposing the reward, we are encoding domain knowledge through the given decomposition. An interesting direction is the automatic discovery of these sub-reward / subgoal functions.

To learn the sub-rewards efficiently, each sub-reward is assigned its own agent (a separate value-function head). These heads are trained in parallel on the same sample sequence using off-policy learning. An aggregator module combines the Q-values of all the heads and selects the current action based on the aggregated values.
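
As a rough illustration, here is a minimal tabular sketch of this architecture. The class name, learning rate, and mean aggregator are assumptions made for illustration; the paper itself uses a deep network with a separate head per reward component rather than tabular Q-learning.

```python
import numpy as np

class HRAAgent:
    """Minimal tabular sketch of HRA: one Q-table per sub-reward, one aggregator."""

    def __init__(self, n_heads, n_states, n_actions, lr=0.1, gamma=0.99):
        self.q = np.zeros((n_heads, n_states, n_actions))  # one Q-table per sub-reward
        self.lr, self.gamma = lr, gamma

    def act(self, state):
        # Aggregator: combine the heads' Q-values (here a simple mean) and act greedily.
        q_agg = self.q[:, state, :].mean(axis=0)
        return int(np.argmax(q_agg))

    def update(self, state, action, sub_rewards, next_state):
        # All heads learn in parallel from the same transition; each head only
        # sees its own sub-reward r_k and bootstraps off its own greedy value,
        # which makes the per-head updates off-policy with respect to the
        # aggregator's behavior.
        for k, r_k in enumerate(sub_rewards):
            target = r_k + self.gamma * self.q[k, next_state].max()
            self.q[k, state, action] += self.lr * (target - self.q[k, state, action])
```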

The authors improve performance further by removing irrelevant features, identifying terminal states, and using “pseudo-reward” functions (general value functions, GVFs), which are rewards that do not come from the environment. Note again that all of these improvements require domain knowledge.

The authors experiment on a fruit collection task, where 5 fruits are randomly placed among 10 fixed candidate locations on a 10x10 grid. They decompose the reward into 10 sub-rewards, one per potential fruit location. Each sub-reward function depends only on whether a fruit is present at its location and on the agent’s current position. The aggregator simply takes the mean of the heads’ Q-values. The authors show that HRA beats the baselines and augmented baselines.
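
A hedged sketch of what this decomposition could look like; the coordinates and function names below are made up for illustration and are not taken from the paper.

```python
# Illustrative sub-reward decomposition for the fruit-collection task:
# one head per candidate fruit location; head k pays +1 only when the agent
# steps onto location k while a fruit is still there.
FRUIT_LOCATIONS = [(0, 3), (2, 7), (4, 1), (5, 5), (9, 9),
                   (1, 8), (3, 2), (6, 6), (7, 0), (8, 4)]  # 10 arbitrary fixed cells

def sub_rewards(agent_pos, fruit_present):
    """fruit_present[k] is True while a fruit remains at FRUIT_LOCATIONS[k]."""
    return [1.0 if fruit_present[k] and agent_pos == loc else 0.0
            for k, loc in enumerate(FRUIT_LOCATIONS)]
```

These 10 sub-reward signals would feed the 10 heads of an agent like the tabular sketch above, with the aggregator averaging the heads’ Q-values before the greedy action choice.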

The authors also experiment on Ms. Pac-Man and achieve state-of-the-art results, effectively solving the game. They create sub-rewards for pellets, fruits, ghosts, and blue ghosts. Initially, HRA starts with zero GVFs and zero pellet reward functions. As the agent wanders around the maze and discovers new map locations it can reach, a GVF is created for each new location. When the agent finds a pellet at a new location, it creates a new head corresponding to that pellet.
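
The bookkeeping for this dynamic head creation might look roughly like the following; the GVF class and the dictionaries are assumptions made for illustration, not the paper’s implementation.

```python
class GVF:
    """Pseudo-reward head: learns 'how do I reach this map location?'
    (pseudo-reward of +1 when the agent is at the location)."""

    def __init__(self, location):
        self.location = location
        # value-function parameters for this location would live here

gvfs = {}          # map location -> GVF head, created on first visit
pellet_heads = {}  # pellet location -> the GVF it reuses

def on_new_observation(agent_pos, pellet_here):
    if agent_pos not in gvfs:                  # newly reachable location -> new GVF
        gvfs[agent_pos] = GVF(agent_pos)
    if pellet_here and agent_pos not in pellet_heads:
        pellet_heads[agent_pos] = agent_pos    # new pellet -> new head tied to that GVF
```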

The Q-values for an object (pellet, fruit, ghost, blue ghost) are set to the output of the GVF corresponding to the object’s location (moving objects use a different GVF at each time step), multiplied by a weight equal to the reward received when the object is eaten.
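
Putting this together, the per-object aggregation could be sketched as below. The `q_values` method and the numbers in `EAT_REWARD` are illustrative assumptions rather than the paper’s exact weights; per the description above, the real weights equal the reward obtained when the corresponding object is eaten.

```python
# Placeholder weights; the paper sets each weight to the reward for eating the object.
EAT_REWARD = {"pellet": 10, "fruit": 200, "ghost": -1000, "blue_ghost": 1000}

def aggregate_q(objects, gvfs, state, n_actions):
    """objects: list of (object_type, location) currently on screen."""
    q = [0.0] * n_actions
    for obj_type, loc in objects:
        gvf_q = gvfs[loc].q_values(state)   # GVF head at the object's current location
        w = EAT_REWARD[obj_type]            # weight = reward for eating the object
        q = [q_a + w * g_a for q_a, g_a in zip(q, gvf_q)]
    return q
```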

The authors also implement a special reward function, the executive memory function. When a human reaches the limit of their ability, they look for favorable situations and memorize them, akin to forming a “habit”. The executive memory function records every sequence of actions that leads to completing a level without dying. When faced with the same level again, the function outputs a very high value for the recorded action sequence, forcing the aggregator module to choose the recorded actions. This simplified module does not generalize, but it could be an interesting direction to explore further.
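
A hedged sketch of this executive-memory mechanism, assuming a dictionary keyed by level and a hand-picked HIGH_VALUE constant; both are illustrative rather than the paper’s exact design.

```python
HIGH_VALUE = 1e6  # large enough to dominate the aggregated Q-values

class ExecutiveMemory:
    def __init__(self):
        self.recorded = {}   # level id -> action sequence that cleared it without dying
        self.current = []    # actions taken so far in the current attempt

    def record_step(self, action):
        self.current.append(action)

    def on_level_end(self, level_id, died):
        # Memorize the first deathless clear of each level, then reset the buffer.
        if not died and level_id not in self.recorded:
            self.recorded[level_id] = list(self.current)
        self.current = []

    def q_bonus(self, level_id, step, n_actions):
        # Output a very large value for the memorized action at this step,
        # so the aggregator is forced to replay the recorded sequence.
        q = [0.0] * n_actions
        seq = self.recorded.get(level_id)
        if seq and step < len(seq):
            q[seq[step]] = HIGH_VALUE
        return q
```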

They evaluate a “Fixed Start” and a “Random Start” mode, where the executive memory is able to exploit the fixed start. In both modes, the authors show that HRA beats the benchmarks by a large margin because it exploits domain knowledge more efficiently than its competitors.