Discovering Temporal Structure: An Overview of Hierarchical Reinforcement Learning

📄 Read the paper


📘 Overview

From its humble beginnings in late 2023 to a detailed 80+ page deep dive, this project has evolved through many collaborative discussions with Martin and Akhil, and the steady guidance of George, Doina, and Marlos. I would also like to thank Xujie for his continued support throughout this project. I really appreciate all the support and insights along this journey, and I learned a lot about the fascinating world of Hierarchical Reinforcement Learning (HRL)!

What follows is an overview of the work:


📑 Table of Contents


🎯 Motivation & What is HRL?

Developing agents that can explore, plan, and learn in open-ended environments is a key desideratum in AI. Hierarchical Reinforcement Learning (HRL) tackles this challenge by identifying and utilizing temporal structure1—patterns in how decisions unfold over time.

Various terms such as options, skills, behaviors, subgoals, subroutines, and subtasks are often used interchangeably in the RL literature to describe temporal structures.

Yet, a key question remains:
What makes a temporal structure “good” or “useful”?

Our work provides insights into this question by examining the benefits of temporal abstraction through the lens of four fundamental decision-making challenges. In this framing, temporal structures are considered useful if they enable agents to effectively realize the associated benefits. Conversely, effective HRL algorithms that discover such structures inherently provide these benefits.


🧠 Core Benefits of (Useful) Temporal Structure / HRL

We identify four central benefits that useful temporal structures enable:

🔍 Exploration

  • Facilitates structured exploration by enabling agents to (a) discover and pursue meaningful subgoals, (b) diversify behavior across multiple directions or strategies, and (c) explore over longer time horizons than individual actions allow.

🔗 Credit Assignment

  • Supports more effective credit propagation by allowing learning signals (e.g., gradients or temporal-difference errors) to be assigned at higher levels of abstraction, thereby improving the identification of causally relevant decisions.

🔄 Transferability

  • Enables generalization across tasks and environments by learning behaviors or representations that can be reused under novel conditions.

👁️ Interpretability

  • Enhances interpretability by imposing structure on behavior, making the agent’s decision-making process easier for humans to analyze and explain in terms of high-level temporal components.

🧭 How Is Temporal Structure Discovered?

We survey methods that discover useful temporal structure in three settings (based on the availability of data / prior knowledge):

  • 🟢 Online interaction
  • 🟡 Offline datasets
  • 🔵 Foundation models (e.g., LLMs)

Each method is discussed based on its focus on the four key benefits.



📊 Summary of Methods & Their Focus on the Benefits

Methods are grouped below by discovery setting; the paper’s summary table additionally marks, for each method, which of the four benefits (credit assignment, exploration, transferability, interpretability) it has a general focus on (•) versus is explicitly designed for (••).

  • 🟢 Online: Bottleneck Discovery, Spectral Methods, Skill Chaining, Empowerment Maximization, Via Environment Reward, Optimizing HRL Benefits Directly, Meta-Learning, Curriculum Learning, Intrinsic Motivation
  • 🟡 Offline: Variational Inference, Hindsight Subgoal Relabeling
  • 🔵 Foundation Models: Embedding Similarity, Providing Feedback, Reward as Code, Direct Policy Modeling

↑ Back to TOC

Below is a brief description of each method. Methods are categorized based on the underlying principles used to discover options.


🌱 Discovery from Online Experience

Bottleneck Discovery Methods

Bottleneck discovery methods aim to identify critical states, known as bottlenecks, that connect otherwise distinct regions of the state space. These methods typically represent the environment as a graph with states as nodes and edges denoting single-step reachability, and employ strategies such as: (1) diverse density, selecting states prevalent in successful but not unsuccessful trajectories; (2) graph partitioning, using min-cut formulations to separate subgraphs and reveal bottleneck nodes; and (3) centrality-based measures like betweenness centrality, which highlight states appearing on many shortest paths.

While these approaches enhance sample efficiency and generalization, scaling to large or continuous spaces remains a key challenge, motivating more localized or clustered variants.
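
To make the graph-based view concrete, here is a minimal sketch of centrality-based bottleneck detection on a toy two-room layout, using networkx’s betweenness centrality. The layout, the doorway node, and the top-3 cutoff are illustrative choices, not any specific published algorithm.

```python
# Toy centrality-based bottleneck discovery on a state-transition graph.
import networkx as nx

# Two 3x3 "rooms" connected by a single doorway state ("door").
G = nx.Graph()
rooms = [[(r, c) for r in range(3) for c in range(3)],
         [(r, c + 4) for r in range(3) for c in range(3)]]
for room in rooms:
    for (r, c) in room:
        for (dr, dc) in [(0, 1), (1, 0)]:
            if (r + dr, c + dc) in room:
                G.add_edge((r, c), (r + dr, c + dc))
G.add_edge((1, 2), "door")   # doorway links the two rooms
G.add_edge("door", (1, 4))

# States on many shortest paths are candidate bottlenecks (option subgoals).
centrality = nx.betweenness_centrality(G)
bottlenecks = sorted(centrality, key=centrality.get, reverse=True)[:3]
print("candidate bottleneck states:", bottlenecks)
```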

↑ Back to TOC

Spectral Methods

Spectral methods discover options by analyzing the eigenstructure of matrix representations of the environment, such as the graph Laplacian or successor representation, to extract topological information about state connectivity and diffusion. By leveraging the eigenvectors (or eigenfunctions) of these matrices,2 these methods produce temporally extended actions (like eigenoptions) that guide agents along meaningful directions in the state space, often facilitating efficient exploration and transfer across tasks. Advances allow these representations to be approximated with neural networks in high-dimensional or continuous settings, enabling scalability.

Despite their strengths in capturing environment structure and enabling option reuse, open directions remain: improving representation learning, extending applicability to planning and partially observable domains, and integrating reward information for more nuanced behaviors.
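
As a concrete illustration, the sketch below computes proto-value functions from the combinatorial Laplacian of a small ring graph and uses the change in an eigenvector’s value along a transition as an eigenoption-style intrinsic reward. The graph size and the choice of the second eigenvector are illustrative assumptions.

```python
# Eigenoption-style discovery on a toy ring graph.
import numpy as np

n = 20                                   # states on a ring
A = np.zeros((n, n))
for s in range(n):                       # adjacency of the ring graph
    A[s, (s + 1) % n] = A[(s + 1) % n, s] = 1
L = np.diag(A.sum(axis=1)) - A           # combinatorial graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues sorted ascending
pvf = eigvecs[:, 1]                      # first non-constant eigenvector (a PVF)

# Intrinsic reward for a transition s -> s': the change in the eigenvector
# value; an eigenoption terminates once no positive reward remains.
def intrinsic_reward(s, s_next):
    return pvf[s_next] - pvf[s]

print(intrinsic_reward(0, 1), intrinsic_reward(1, 0))
```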

↑ Back to TOC

Sequentially Composable Options (Skill Chaining)

Sequentially composable options are temporally extended actions constructed so that each option terminates in a region from which another option can be reliably initiated, enabling robust sequential planning. The skill chaining algorithm is a canonical approach that incrementally learns options backward from the goal: starting with an option that reaches the goal, it recursively creates new options whose subgoals correspond to the initiation regions of previously learned options, continuing until the start state is covered. Initiation functions are learned as probabilistic classifiers or value functions, often with explicit uncertainty estimation.

This explicit composability facilitates efficient planning and accelerates credit assignment and exploration, but current methods are best suited to tasks with clearly defined goals or subgoals. Open directions include generalizing to non-goal-reaching settings and exploiting factored state representations for more modular skill composition.
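
The backward-chaining idea can be sketched on a toy 1-D chain of states, where each new option’s termination region is the initiation region of the previously created option. The fixed option reach and the interval-based initiation sets are simplifications; real skill chaining learns initiation classifiers and option policies from experience.

```python
# Backward chaining of options on a 1-D chain of states.
N, K = 30, 7                       # number of states, reach of one option
goal = N - 1
options = []                       # each option: (initiation set, termination set)

termination = {goal}               # the first option must reach the goal
covered = {goal}
while 0 not in covered:
    lo = max(0, min(termination) - K)
    initiation = set(range(lo, min(termination)))   # states the option starts from
    options.append((initiation, termination))
    covered |= initiation
    termination = initiation       # the next option targets this initiation region

for i, (init, term) in enumerate(reversed(options)):
    print(f"option {i}: start in {min(init)}..{max(init)}, "
          f"terminate in {min(term)}..{max(term)}")
```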

↑ Back to TOC

Empowerment Maximization

Empowerment maximization methods discover diverse and controllable skills by maximizing the agent’s influence over its future observations, formalized as the mutual information between action sequences (or skill variables) and future states. Practically, this is achieved via variational inference and neural estimators, resulting in algorithms that learn a set of skills distinguishable by their effects on the environment, often using discriminators and mutual information objectives as intrinsic rewards.

These methods have demonstrated strong exploratory capabilities and can adapt rapidly to new tasks, but often produce skills localized to specific regions of the state space, limiting transfer. Recent work focuses on improving the scalability and composability of skills, connecting empowerment to causal representation learning and goal-based exploration, and developing methods to generalize skills beyond specific states to more global or transferable behavioral changes.
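
A minimal sketch of the mutual-information objective, in the spirit of DIAYN: a discriminator tries to infer the skill from the visited state, and the intrinsic reward is log q(z|s) − log p(z) under a uniform skill prior. Network sizes and dimensions are illustrative assumptions.

```python
# DIAYN-style skill discovery: discriminator-based intrinsic reward.
import torch
import torch.nn.functional as F

n_skills, state_dim = 8, 4
discriminator = torch.nn.Sequential(      # q(z | s), sizes are illustrative
    torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, n_skills),
)
log_p_z = torch.log(torch.tensor(1.0 / n_skills))   # uniform prior over skills

def intrinsic_reward(state, skill):
    """Reward the skill-conditioned policy for visiting states that reveal its skill."""
    log_q = F.log_softmax(discriminator(state), dim=-1)[skill]
    return (log_q - log_p_z).item()

def discriminator_loss(states, skills):
    """Train the discriminator to classify which skill produced each state."""
    return F.cross_entropy(discriminator(states), skills)

states = torch.randn(32, state_dim)
skills = torch.randint(0, n_skills, (32,))
print(intrinsic_reward(states[0], skills[0].item()),
      discriminator_loss(states, skills).item())
```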

↑ Back to TOC

Via Environment Rewards

This class of methods learns hierarchical policies directly from the environment’s external reward, rather than relying on intrinsic or pseudo-rewards. Feudal methods decompose control into managers and workers, with higher-level managers setting subgoals for workers to achieve, enabling temporal abstraction and goal decomposition, and implemented in deep RL as Feudal Networks (FuN), HIRO, and Director. Option-critic methods instead jointly learn intra-option policies, termination functions, and the high-level policy using policy gradient theorems derived directly from the environmental reward signal.

Both approaches produce options or skills that are interpretable and transferable, often capturing meaningful structure in the environment. However, they can suffer from degeneracy (e.g., options collapsing to primitive actions) and rely heavily on informative environment rewards, motivating recent work on regularization, intrinsic bonuses, or leveraging demonstrations to support learning in sparse reward settings.
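
To illustrate the feudal decomposition, here is a sketch of how the two levels are rewarded in a HIRO-style setup: the manager’s SMDP transition accumulates the environment reward over its decision interval, while the worker receives a goal-reaching reward for realizing the commanded (relative) subgoal. The interval length and state shapes are illustrative assumptions.

```python
# Feudal/HIRO-style reward split between manager and worker.
import numpy as np

def worker_reward(state, subgoal, next_state):
    # HIRO's worker reward: negative distance between the achieved
    # displacement and the commanded relative subgoal.
    return -np.linalg.norm(state + subgoal - next_state)

def manager_transition(states, env_rewards, subgoal):
    # The manager observes the state every k steps and is credited with the
    # environment reward accumulated in between (an SMDP transition).
    return states[0], subgoal, sum(env_rewards), states[-1]

k = 10
states = [np.random.randn(3) for _ in range(k + 1)]
env_rewards = np.random.rand(k)
subgoal = np.random.randn(3)
print(worker_reward(states[0], subgoal, states[1]))
print(manager_transition(states, env_rewards, subgoal)[2])
```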

↑ Back to TOC

Optimizing HRL Benefits Directly

This class of methods addresses the limitations of proxy-based objectives in option discovery by explicitly formulating and optimizing for agent-level benefits, such as planning efficiency, exploration coverage, credit assignment speed, and transferability, with formal guarantees. Rather than targeting bottleneck states or maximizing empowerment as surrogates, these approaches define precise performance criteria (e.g., minimizing planning or cover time, accelerating policy evaluation, or reducing transfer sample complexity), and then derive algorithms that optimize option sets under these metrics, often with provable bounds or approximation guarantees. Key developments include option discovery for minimizing planning iterations (subject to cardinality constraints), reducing expected time to reach all states, and enhancing value propagation using concepts from matrix preconditioning.

While such formulations offer rigorous foundations for HRL, the resulting optimization problems are often NP-hard in practice. Current research seeks to extend guarantees to more general settings, including function approximation and richer option classes.

↑ Back to TOC

Meta Learning

Meta RL aims to learn the RL algorithm itself, or parts of it, rather than just a policy. This results in a bilevel optimization: the outer loop learns the parameters of the algorithm, while the inner loop adapts a policy for a specific task.

A common approach uses meta-gradients: the inner loop updates parameters for policies or options using standard RL objectives, while the outer loop updates meta-parameters by differentiating through the inner loop, capturing how changes in meta-parameters influence learning progress. Black box meta RL, instead, leverages recurrent models, such as RNNs or Transformers, to encode experience across multiple episodes and tasks. The agent interacts with a sequence of tasks, updating its internal memory based on observations and rewards, and only resets this memory between tasks.

Meta-gradient methods and black box meta RL both facilitate the discovery of options in HRL by optimizing for reusable behaviors across tasks. Specifically, meta-gradients can be used to learn option policies, reward functions, and termination conditions that generalize well, while black box meta RL enables agents to adaptively select or compose skills based on accumulated experience, thus discovering and refining temporally abstract options that accelerate learning in new tasks.
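
The bilevel structure can be illustrated with a toy example: the inner loop takes a gradient step on its own objective, and the outer loop adjusts a meta-parameter (here, just the inner step size) by differentiating through that step with finite differences. The quadratic losses are stand-ins for the RL objectives used by actual meta-RL methods.

```python
# Toy bilevel optimization: a meta-gradient on the inner-loop step size.
def inner_update(theta, eta):
    grad = theta - 1.0            # gradient of the inner loss 0.5 * (theta - 1)^2
    return theta - eta * grad     # one inner-loop adaptation step

def outer_loss(theta):
    return 0.5 * (theta - 1.0) ** 2   # evaluate the adapted parameters

eta, theta, eps, meta_lr = 0.1, 2.0, 1e-4, 0.1
for step in range(50):
    # Meta-gradient of the outer loss w.r.t. eta, through the inner update
    # (finite differences stand in for backpropagating through the update).
    g = (outer_loss(inner_update(theta, eta + eps))
         - outer_loss(inner_update(theta, eta - eps))) / (2 * eps)
    eta -= meta_lr * g                  # outer loop: update the meta-parameter
    theta = inner_update(theta, eta)    # inner loop: adapt with the new eta

print(f"learned inner step size: {eta:.3f}, adapted theta: {theta:.3f}")
```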

A key opportunity for research is to relax the specialized multi-task assumption required by many meta RL approaches. Broadening the framework to more general, less structured settings, possibly via in-context learning or more flexible curricula, would further increase the applicability and scalability of meta learning in RL.

↑ Back to TOC

Curriculum Learning

Curriculum learning enables agents to master challenging goals by progressing through a sequence of tasks with increasing complexity, allowing gradual skill acquisition.

In HRL, the high-level policy selects goals based on measures of learning progress, such as recent improvements in achieving specific goals. The objective is to optimize goal selection so that iterative updates lead to high performance across the entire goal space. Local approximations of learning progress and bandit-based goal selection are common, while implicit curricula, such as hindsight experience replay, further guide agents by relabeling experiences with alternative goals. These approaches enhance exploration by continually pushing the agent’s capabilities, support the discovery of diverse behaviors, and facilitate transfer to novel tasks.
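
A minimal sketch of learning-progress-based goal selection: per-goal success rates are tracked over a sliding window, and goals whose success rate is changing the most are sampled more often. The window size and the epsilon-greedy mixing are illustrative choices.

```python
# Goal selection driven by an estimate of learning progress.
import numpy as np
from collections import defaultdict, deque

n_goals, window = 5, 20
history = defaultdict(lambda: deque(maxlen=2 * window))   # per-goal success outcomes

def learning_progress(goal):
    h = list(history[goal])
    if len(h) < 2 * window:
        return 1.0                       # treat under-explored goals as promising
    old, recent = np.mean(h[:window]), np.mean(h[window:])
    return abs(recent - old)             # absolute change in success rate

def select_goal(eps=0.2, rng=np.random.default_rng(0)):
    if rng.random() < eps:               # keep some uniform exploration
        return int(rng.integers(n_goals))
    lp = np.array([learning_progress(g) for g in range(n_goals)])
    probs = lp / lp.sum() if lp.sum() > 0 else np.ones(n_goals) / n_goals
    return int(rng.choice(n_goals, p=probs))

history[2].append(1.0)                   # record whether the attempted goal was achieved
print("next goal to practice:", select_goal())
```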

Open research directions include developing more reliable metrics for goal difficulty and interestingness, and leveraging unsupervised environment design or human knowledge for more effective curricula.

↑ Back to TOC

Intrinsic Motivation

Intrinsic motivation in RL drives agents to explore and learn new skills without explicit external rewards, focusing instead on curiosity, novelty, or information gain. This developmental approach enables agents to autonomously discover reusable skills, often represented as options with subgoals that are interesting or novel.

Common intrinsic signals include bonuses for visiting new or rarely seen states, which push agents to explore unfamiliar areas. Some methods, like Relative Novelty or First Return then Explore, set subgoals whenever the agent experiences something new, while factored approaches, like HEXQ, break the environment into smaller tasks based on how its parts interact. Together, these methods help agents explore more thoroughly, learn a wide range of behaviors, and make it easier to reuse learned skills in new situations.
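
As a simple illustration, the sketch below implements a count-based novelty bonus and flags rarely visited states as candidate subgoals. The bonus scale and the visit-count threshold are illustrative assumptions.

```python
# Count-based novelty bonus with novel states flagged as candidate subgoals.
from collections import Counter

visit_counts = Counter()
beta = 0.1

def novelty_bonus(state):
    visit_counts[state] += 1
    return beta / visit_counts[state] ** 0.5   # decays as the state becomes familiar

def is_candidate_subgoal(state, threshold=2):
    # e.g., states that are still rarely visited are potential option subgoals
    return visit_counts[state] <= threshold

for s in ["room1", "room1", "doorway", "room2"]:
    print(s, round(novelty_bonus(s), 3), is_candidate_subgoal(s))
```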

Ongoing challenges include scaling factored skill discovery to high-dimensional observation spaces, developing robust novelty estimation in continuous or stochastic settings, and formally connecting intrinsic motivation with other option discovery techniques.

↑ Back to TOC


💾 Discovery through Offline Datasets

Variational Inference of Skill Latents

This class of offline skill discovery methods defines skills as latent variables inferred from unlabeled trajectories via reconstruction objectives. A common approach maximizes the likelihood of trajectories by introducing latent skill sequences that segment the data, often using a variational autoencoder (VAE) framework. Each trajectory is explained through a sequence of skill (and boundary) indicators, with the evidence lower bound (ELBO) guiding training.

Extensions incorporate regularization via minimum description length (MDL), encouraging concise and composable skill sets. These skills can be reused hierarchically, adapted across tasks, or used to augment the action space. More recent formulations impose mutual information constraints to balance skill diversity with fidelity to offline data.

This approach improves credit assignment, transfer, and exploration by distilling temporally coherent behaviors into representations in a latent skill space. Open challenges include improving optimization stability, leveraging reward signals more effectively, and scaling to more diverse and realistic domains.
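
The following sketch shows the shape of such a VAE objective on trajectory segments: a recurrent encoder infers a latent skill from a segment, a decoder reconstructs actions from states and the latent, and the ELBO combines reconstruction with a KL penalty toward a standard normal prior. The architecture, the fixed segment length, and the Gaussian likelihood are illustrative assumptions.

```python
# A VAE-style skill model over offline trajectory segments.
import torch
import torch.nn.functional as F

state_dim, action_dim, latent_dim, seg_len = 6, 2, 4, 10

encoder = torch.nn.GRU(state_dim + action_dim, 32, batch_first=True)
to_stats = torch.nn.Linear(32, 2 * latent_dim)              # -> (mu, log_var)
decoder = torch.nn.Sequential(
    torch.nn.Linear(state_dim + latent_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, action_dim),
)

def elbo(states, actions):
    # Encode the whole segment into q(z | tau).
    _, h = encoder(torch.cat([states, actions], dim=-1))
    mu, log_var = to_stats(h[-1]).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()    # reparameterization
    # Decode: predict each action from its state and the skill latent.
    z_tiled = z.unsqueeze(1).expand(-1, seg_len, -1)
    pred = decoder(torch.cat([states, z_tiled], dim=-1))
    recon = -F.mse_loss(pred, actions)                       # Gaussian log-likelihood up to a constant
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1).mean()
    return recon - kl                                        # quantity to maximize

states = torch.randn(8, seg_len, state_dim)
actions = torch.randn(8, seg_len, action_dim)
print("ELBO:", elbo(states, actions).item())
```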

↑ Back to TOC

Hindsight Subgoal Relabeling

Hindsight subgoal relabeling methods use offline datasets to identify and relabel important intermediate states (subgoals) within trajectories. By treating these subgoals as waypoints, agents can learn hierarchical policies that solve complex tasks more efficiently, even when external rewards are sparse or delayed. For example, some algorithms partition trajectories into segments and train policies to reach each segment in sequence, while others relabel future states as short-term goals for low-level controllers. This decomposition improves credit assignment and sample efficiency, and makes learned behaviors easier to interpret. Key research directions include making subgoal choices more interpretable and scaling these approaches to environments with high-dimensional observations.
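
A minimal sketch of the relabeling step: states a few steps ahead in an offline trajectory are treated as subgoals, yielding goal-conditioned transitions with goal-reaching rewards even when the environment reward is sparse. The fixed lookahead horizon is an illustrative choice.

```python
# Hindsight relabeling of future states as subgoals in an offline trajectory.
def relabel_with_subgoals(states, horizon=5):
    """states: the sequence of states from one offline trajectory."""
    transitions = []
    for t in range(len(states) - 1):
        g = min(t + horizon, len(states) - 1)
        subgoal = states[g]                        # a future state becomes the goal
        reward = 1.0 if t + 1 == g else 0.0        # reward for reaching the relabeled subgoal
        transitions.append((states[t], subgoal, states[t + 1], reward))
    return transitions

trajectory = [(i, i % 3) for i in range(12)]       # toy 2-D state sequence
for s, g, s_next, r in relabel_with_subgoals(trajectory)[-3:]:
    print(f"state={s} subgoal={g} next={s_next} reward={r}")
```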

↑ Back to TOC


🧱 Discovery with Foundation Models

Embedding Similarity

Embedding similarity methods use foundation models to define goal-conditioned rewards based on the similarity between pretrained embeddings of goals and current observations. Agents can be trained using either fine-tuned or frozen encoders. Open challenges include comparing different embedding models, expanding beyond text-image goals, and improving reward quality in complex or sparse environments.
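
As a sketch, the reward below is the cosine similarity between the embedded goal and the embedded observation. The random projection stands in for a frozen pretrained (e.g., vision-language) encoder and is purely illustrative.

```python
# Goal-conditioned reward from embedding similarity.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 128))        # stand-in for a frozen pretrained encoder

def embed(x):
    v = W @ np.resize(np.asarray(x, dtype=float), 128)
    return v / np.linalg.norm(v)

def similarity_reward(goal, observation):
    return float(embed(goal) @ embed(observation))   # cosine similarity in [-1, 1]

goal = rng.random(128)                    # e.g., the embedding of a text instruction
obs = goal + 0.1 * rng.standard_normal(128)
print(similarity_reward(goal, obs), similarity_reward(goal, rng.random(128)))
```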

↑ Back to TOC

Providing Feedback

Foundation models, especially LLMs, can provide reward feedback to RL agents by either directly evaluating the success of a behavior (e.g., “Did the agent achieve the goal?”) or by expressing preferences over pairs of trajectories or states. These scalar signals or preferences are then distilled into reward models, guiding agents to explore, improve credit assignment, and transfer skills, all based on natural language goals or descriptions. This approach leverages the generalization and reasoning abilities of LLMs while simplifying reward design.
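
One common way to distill such pairwise preferences into a reward model is a Bradley-Terry objective, sketched below. The reward-model architecture and feature dimensions are illustrative assumptions.

```python
# Distilling LLM preferences into a reward model with a Bradley-Terry loss.
import torch
import torch.nn.functional as F

reward_model = torch.nn.Sequential(       # maps trajectory/state features to a scalar reward
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1),
)

def preference_loss(preferred, rejected):
    """preferred / rejected: feature batches for the item the LLM preferred vs. the other."""
    r_pos = reward_model(preferred).squeeze(-1)
    r_neg = reward_model(rejected).squeeze(-1)
    # Maximize the probability that the preferred item receives the higher reward.
    return -F.logsigmoid(r_pos - r_neg).mean()

preferred = torch.randn(16, 32)
rejected = torch.randn(16, 32)
loss = preference_loss(preferred, rejected)
loss.backward()                           # gradients for training the reward model
print("preference loss:", loss.item())
```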

↑ Back to TOC

Reward as Code

It is possible to use LLMs to generate reward functions as executable code, given a goal description and symbolic information from the environment. The LLM receives the task and environment details (such as Python class representations of objects) and outputs code that computes rewards for the agent’s states. This allows agents to optimize behavior using automatically generated rewards, often matching or exceeding human-crafted rewards in robotics and simulation tasks. This approach enables rapid transfer across tasks but relies on the availability of meaningful symbolic representations, motivating future work to extend code-based rewards to high-dimensional, less-structured domains.
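
For intuition, here is a hypothetical example of the kind of reward function an LLM might emit for a goal like "place the cube on the shelf", given symbolic object states. The CubeState/ShelfState fields and the distance-based shaping are invented for illustration; real systems define the exact interface the LLM writes against.

```python
# A hypothetical LLM-generated reward function over symbolic state objects.
from dataclasses import dataclass

@dataclass
class CubeState:            # illustrative symbolic state, not a real API
    position: tuple         # (x, y, z) of the cube
    grasped: bool

@dataclass
class ShelfState:
    position: tuple         # (x, y, z) of the shelf surface

def compute_reward(cube: CubeState, shelf: ShelfState) -> float:
    dist = sum((a - b) ** 2 for a, b in zip(cube.position, shelf.position)) ** 0.5
    on_shelf = dist < 0.05 and not cube.grasped
    return 10.0 if on_shelf else -dist    # sparse success bonus plus distance shaping

print(compute_reward(CubeState((0.0, 0.0, 0.5), False), ShelfState((0.0, 0.02, 0.5))))
```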

↑ Back to TOC

Directly Modeling the Policy

LLMs can directly model goal-conditioned policies by generating code or action sequences conditioned on natural language goals and state information, bypassing the need for manually designed reward functions. In robotics and simulated domains, LLMs produce skill policies as executable code that interacts with APIs or environment abstractions, while in complex domains like Minecraft or web navigation, LLMs continually expand skill libraries through auto-curriculum and self-reflection. Some approaches directly prompt LLMs for low-level actions, demonstrating the ability to solve diverse tasks zero-shot. This paradigm enables rapid adaptation and compositional skill reuse but raises open challenges in adapting to new action spaces, varying embodiments, and maintaining performance without expensive fine-tuning.

↑ Back to TOC


🛠️ How to Use the Discovered Temporal Structure?

After discovering a library of temporally abstract behaviors, the next challenge is how to use them effectively. A common approach is the call-and-return model, where the agent selects one option that runs until termination. This simplifies control but limits flexibility. Alternatives like Generalized Policy Improvement (GPI) allow agents to evaluate multiple options simultaneously, selecting actions that outperform any single option’s policy.

High-level policies over options can be trained in different ways. Model-free methods extend standard RL algorithms to operate over options, often enhanced by learning progress signals, exploration bonuses, or diversity incentives. Model-based methods learn predictive models over options to enable long-horizon planning with fewer steps. Combined with state abstraction, these models support more efficient planning in complex environments, especially when abstractions align with the agent’s own perspective, allowing for transfer across tasks.3
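
The contrast between call-and-return execution and GPI can be sketched with tabular option value functions: GPI acts greedily with respect to the maximum value over all options in every state, whereas call-and-return commits to a single option’s greedy policy until it terminates. The random Q-values below are placeholders.

```python
# Generalized Policy Improvement (GPI) over a set of option value functions.
import numpy as np

n_states, n_actions, n_options = 10, 4, 3
rng = np.random.default_rng(0)
Q = rng.random((n_options, n_states, n_actions))   # one Q-function per option (placeholder values)

def gpi_action(state):
    # Max over options first, then greedy over actions.
    return int(np.argmax(Q[:, state, :].max(axis=0)))

def call_and_return_action(state, active_option):
    # For contrast: follow one option's greedy policy until it terminates.
    return int(np.argmax(Q[active_option, state]))

print("GPI action:", gpi_action(0), "| option-0 action:", call_and_return_action(0, 0))
```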

↑ Back to TOC


🚧 Final Remarks

We conclude by identifying open challenges in discovering temporal structure, and draw connections to related areas such as state and action abstraction, continual RL, programmatic RL, and multi-agent RL. We also highlight promising domains, like robotics, web agents, and open-ended games, where HRL may have the most transformative impact.


  1. More formally, we call it temporal abstraction. ↩︎

  2. The eigenvectors of the graph Laplacian are also known as proto-value functions (PVFs) in RL literature. ↩︎

  3. For example, a home robot may encounter many different houses with various layouts and furniture. If the robot builds a representation based on its own sensors and actuators (e.g., “object is one meter in front of me” or “this action makes me turn right”), that abstraction is agent-centric. ↩︎
