Planning Goals for Exploration

University of Pennsylvania

PEG is an unsupervised exploration algorithm that collects diverse data to train generalist agents.

PEG exhibits sophisticated exploration behavior in complex environments without any supervision like rewards or demonstrations.


Dropped into an unknown environment, what should an agent do to quickly learn about the environment and how to accomplish diverse tasks within it?

We address this question within the goal-conditioned reinforcement learning paradigm, by identifying how the agent should set its goals at training time to maximize exploration. We propose "planning exploratory goals" (PEG), a method that sets goals for each training episode to directly optimize an intrinsic exploration reward.

PEG first chooses goal commands such that the agent's goal-conditioned policy, at its current level of training, will end up in states with high exploration potential. It then launches an exploration policy starting at those promising states. To enable this direct optimization, PEG learns world models and adapts sampling-based planning algorithms to "plan goal commands". In challenging simulated robotics environments, including a multi-legged ant robot in a maze and a robot arm on a cluttered tabletop, PEG exploration enables more efficient and effective training of goal-conditioned policies relative to baselines and ablations. Our ant successfully navigates a long maze, and the robot arm successfully builds a stack of three blocks upon command.
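To make the idea of "planning goal commands" concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy 2D point agent; `rollout_final_state` is a hypothetical stand-in for imagining the goal-conditioned policy inside a learned world model, `exploration_value` is a hypothetical novelty score (mean distance to the nearest visited states) standing in for a learned exploration value, and a cross-entropy-method-style sampler stands in for whichever sampling-based planner is used. All names and parameters are illustrative assumptions.

```python
import numpy as np

def rollout_final_state(goal, start, steps=20, step_size=0.3, lo=0.0, hi=10.0):
    """Toy stand-in for a world-model rollout of the goal-conditioned policy:
    the agent walks toward the goal with a bounded step inside a square arena."""
    state = np.array(start, dtype=float)
    for _ in range(steps):
        delta = goal - state
        dist = np.linalg.norm(delta)
        if dist < 1e-8:
            break
        state = np.clip(state + delta / dist * min(step_size, dist), lo, hi)
    return state

def exploration_value(state, visited, k=5):
    """Toy novelty score (an assumption): mean distance to the k nearest visited states."""
    dists = np.linalg.norm(visited - state, axis=1)
    return np.sort(dists)[:k].mean()

def plan_goal(start, visited, iters=5, pop=64, elite=8, seed=0):
    """Sample candidate goals, score each by the novelty of the state the policy
    is predicted to reach, and refit the sampling distribution to the best ones."""
    rng = np.random.default_rng(seed)
    mean = np.array(start, dtype=float)
    std = np.full_like(mean, 3.0)
    best_goal, best_score = mean.copy(), -np.inf
    for _ in range(iters):
        goals = rng.normal(mean, std, size=(pop, mean.size))
        scores = np.array([
            exploration_value(rollout_final_state(g, start), visited) for g in goals
        ])
        order = np.argsort(scores)
        elites = goals[order[-elite:]]
        if scores[order[-1]] > best_score:
            best_goal, best_score = goals[order[-1]].copy(), scores[order[-1]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return best_goal, best_score

if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    visited = data_rng.uniform(0.0, 2.0, size=(200, 2))  # states seen so far
    goal, score = plan_goal(np.array([1.0, 1.0]), visited)
    print("planned goal:", goal, "predicted novelty:", score)
```

Note that the score is computed on the state the policy is predicted to reach, not on the goal itself. Under this scoring, a goal the agent cannot actually attain can still be useful if trying to reach it leads somewhere novel.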


We evaluate PEG and other goal-conditioned RL agents on four different continuous-control environments ranging from navigation to manipulation.

  • Point Maze: The agent (red dot) is spawned in the lower left corner. The evaluation goal (blue star) is in the top right corner, which is the farthest point possible.
  • Walker: The agent must learn locomotion of the humanoid to move around. The evaluation goals (colored orbs) are standing poses placed ±6 and ±12 meters away from the spawn.
  • Ant Maze: The agent controls a four-legged ant robot in a maze. The evaluation goals (red orbs) are placed in the middle and end of the maze with varying leg orientations.
  • Block stacking: The agent controls a 2-gripper robot on a tabletop with 3 blocks. The evaluation goals (colored orbs) are 3-block towers. This is our hardest exploration challenge, since the agent must learn pushing, picking, and stacking, and must discover that 3-block stacks are a possible configuration of the environment.

Below, we show the evaluation success rates throughout training for each method, as well as PEG's evaluation episodes for each environment. PEG compares favorably to baselines in both final success rate and learning speed. On the hardest task, block stacking, PEG is the only method to achieve any significant non-zero task performance: it reaches a success rate of about 30%, while all baselines remain close to 0%.

Exploration Episodes

We attribute PEG's superior evaluation performance to its sophisticated exploration, which lets the agent learn from more informative data.

  • Point Maze: PEG explores the maze by frequently setting goals (red dots) just beyond the distribution of seen states (blue dots).
  • Walker: PEG discovers walking and flipping in both directions. To maximize exploration, it commands the Go-policy to walk as far as possible in one direction, after which the Explore-policy performs flips.
  • Ant Maze: PEG sets goals (red dots) in all parts of the maze, even inside walls. Although a wall goal is infeasible, it still causes the ant to explore the entire maze.
  • Block stacking: PEG discovers pushing, picking, and stacking. To maximize exploration, it commands the Go-policy to stack blocks into a tower, after which the Explore-policy throws them out.
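
The episodes above follow a two-phase structure: a Go-policy first pursues the commanded goal, then an Explore-policy takes over from wherever the agent ended up. The sketch below illustrates only that structure, under loud assumptions: a toy 2D point agent, a hand-written `go_policy` that steps straight toward the goal, and an `explore_policy` that acts at random. In PEG both policies are learned, so every function and parameter name here is hypothetical.

```python
import numpy as np

def go_policy(state, goal, max_step=0.5):
    """Toy goal-conditioned policy (assumption): step straight toward the goal."""
    delta = goal - state
    dist = np.linalg.norm(delta)
    return delta if dist <= max_step else delta / dist * max_step

def explore_policy(state, rng, max_step=0.5):
    """Toy exploration policy (assumption): a bounded random step."""
    return rng.uniform(-max_step, max_step, size=state.shape)

def run_episode(start, goal, go_steps=20, explore_steps=20, seed=0):
    """Run the Go phase toward `goal`, then the Explore phase from the reached state."""
    rng = np.random.default_rng(seed)
    state = np.array(start, dtype=float)
    goal = np.array(goal, dtype=float)
    trajectory = [state.copy()]
    for t in range(go_steps + explore_steps):
        if t < go_steps:
            action = go_policy(state, goal)       # Go phase
        else:
            action = explore_policy(state, rng)   # Explore phase
        state = state + action
        trajectory.append(state.copy())
    return np.array(trajectory)

if __name__ == "__main__":
    traj = run_episode(start=[0.0, 0.0], goal=[3.0, 4.0])
    print("state at end of Go phase:", traj[20])
    print("state at end of episode:", traj[-1])
```

All states from both phases would be added to the training data. The goal matters because it determines where the Explore phase begins.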

How does PEG set goals for exploration?

Below, we visualize the goals (red dots) chosen by various methods, along with the states they explored (green dots), halfway through training in the Ant Maze. PEG explores the deepest part of the maze, whereas other methods barely reach the middle. A trend across tasks is that PEG consistently picks goals beyond the frontier of seen data, such as the top left corner of the Ant Maze, driving deep exploration of the maze. Baselines like MEGA pick goals near the frontier, which does result in a few goals in the top left, but the resulting exploration trajectories do not penetrate the top left.
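
For contrast, here is a minimal sketch of one simple reading of "picking goals near the frontier": choose the already-seen state with the lowest estimated density. This is an illustrative assumption, not the implementation of MEGA or of any other baseline. The Gaussian kernel density estimate, the bandwidth, and the function names are all hypothetical.

```python
import numpy as np

def kde_density(points, data, bandwidth=0.5):
    """Unnormalized Gaussian kernel density of each point under `data`."""
    diffs = points[:, None, :] - data[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2)).mean(axis=1)

def frontier_goal(seen):
    """Return the seen state with the lowest estimated density."""
    return seen[np.argmin(kde_density(seen, seen))]

if __name__ == "__main__":
    xs = np.linspace(0.0, 1.0, 5)
    cluster = np.array([[x, y] for x in xs for y in xs])  # well-visited region
    seen = np.vstack([cluster, [[3.0, 3.0]]])             # one rarely visited state
    print("frontier-style goal:", frontier_goal(seen))
```

A rule of this kind can only return states that already appear in the data, so its goals never lie beyond the frontier. A planner that searches the whole goal space has no such restriction.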
