Discovering Temporal Structure: An Overview of Hierarchical Reinforcement Learning
📄 Read the paper
📘 Overview
From its humble beginnings in late 2023 to a detailed 80+ page deep dive, this project has evolved through many collaborative discussions with Martin and Akhil, and the steady guidance of George, Doina, and Marlos. I would also like to thank Xujie for his continued support throughout this project. I really appreciate all the support and insights along this journey, and I learned a lot about the fascinating world of Hierarchical Reinforcement Learning (HRL)!
What follows is an overview of the work:
📑 Table of Contents
- 🎯 Motivation & What is HRL
- 🧠 Core Benefits of Temporal Structure
- 🧭 How Is Temporal Structure Discovered?
- 📊 Summary of Methods & Their Focus on the Benefits
- 🌱 Discovery from Online Experience
- 💾 Discovery through Offline Datasets
- 🧱 Discovery with Foundation Models
- 🛠️ How to Use the Discovered Temporal Structure?
- 🚧 Final Remarks
🎯 Motivation & What is HRL?
Developing agents that can explore, plan, and learn in open-ended environments is a key desideratum in AI. Hierarchical Reinforcement Learning (HRL) tackles this challenge by identifying and utilizing temporal structure1—patterns in how decisions unfold over time.
Various terms such as options, skills, behaviors, subgoals, subroutines, and subtasks are often used interchangeably in the RL literature to describe temporal structures.
Yet, a key question remains:
What makes a temporal structure “good” or “useful”?
Our work provides insights into this question by examining the benefits of temporal abstraction through the lens of four fundamental decision-making challenges. In this framing, temporal structures are considered useful if they enable agents to effectively realize the associated benefits. Conversely, effective HRL algorithms that discover such structures inherently provide these benefits.
🧠 Core Benefits of (Useful) Temporal Structure / HRL
We identify four central benefits that useful temporal structures enable:
🔍 Exploration
- Facilitates structured exploration by enabling agents to (a) discover and pursue meaningful subgoals, (b) diversify behavior across multiple directions and strategies, and (c) explore over longer time horizons than individual actions allow.
🔗 Credit Assignment
- Supports more effective credit propagation by allowing learning signals (e.g., gradients or temporal-difference errors) to be assigned at higher levels of abstraction, thereby improving the identification of causally relevant decisions.
🔄 Transferability
- Enables generalization across tasks and environments by learning behaviors or representations that can be reused under novel conditions.
👁️ Interpretability
- Enhances interpretability by imposing structure on behavior, making the agent’s decision-making process easier for humans to analyze and explain in terms of high-level temporal components.
🧭 How Is Temporal Structure Discovered?
We survey methods that discover useful temporal structure in three settings (based on the availability of data / prior knowledge):
- 🟢 Online interaction
- 🟡 Offline datasets
- 🔵 Foundation models (e.g., LLMs)
Each method is discussed based on its focus on the four key benefits.
📊 Summary of Methods & Their Focus on the Benefits
• = general focus, •• = explicitly designed for the benefit
| Category | Methods | 🔗 Credit Assign. | 🔍 Explor. | 🔄 Transf. | 👁️ Interpr. |
|---|---|---|---|---|---|
| 🟢 Online | Bottleneck Discovery | • | • | • | • |
| | Spectral Methods | • | •• | • | |
| | Skill Chaining | •• | •• | • | • |
| | Empowerment Maximization | • | •• | • | |
| | Via Environment Reward | •• | • | • | |
| | Optimizing HRL Benefits Directly | • | • | | |
| | Meta-Learning | • | •• | | |
| | Curriculum Learning | •• | • | | |
| | Intrinsic Motivation | •• | • | | |
| 🟡 Offline | Variational Inference | •• | • | • | |
| | Hindsight Subgoal Relabeling | •• | • | | |
| 🔵 Foundation Models | Embedding Similarity | •• | • | • | |
| | Providing Feedback | • | •• | • | • |
| | Reward as Code | •• | • | | |
| | Direct Policy Modeling | • | •• | • | |
Below is a brief description of each method. Methods are categorized based on the underlying principles used to discover options.
🌱 Discovery from Online Experience
Bottleneck Discovery Methods
Bottleneck discovery methods aim to identify critical states, known as bottlenecks, that connect otherwise distinct regions of the state space. These methods typically represent the environment as a graph with states as nodes and edges denoting single-step reachability, and employ strategies such as: (1) diverse density, selecting states prevalent in successful but not unsuccessful trajectories; (2) graph partitioning, using min-cut formulations to separate subgraphs and reveal bottleneck nodes; and (3) centrality-based measures like betweenness centrality, which highlight states appearing on many shortest paths.
While these approaches enhance sample efficiency and generalization, scaling to large or continuous spaces remains a key challenge, motivating more localized or clustered variants.
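To make the graph-based view concrete, here is a minimal sketch of centrality-based bottleneck discovery on a toy two-rooms gridworld. The layout, the `two_rooms_graph` helper, and the choice of networkx's betweenness centrality are illustrative assumptions, not the setup of any particular method.

```python
# Sketch: find bottleneck states via betweenness centrality on a toy
# "two rooms" gridworld where nodes are states and edges denote
# single-step reachability.
import networkx as nx

def two_rooms_graph(width=5, height=5):
    """Two width x height rooms connected through a single doorway state."""
    G = nx.Graph()
    def add_room(offset_x):
        for x in range(width):
            for y in range(height):
                if x + 1 < width:
                    G.add_edge((offset_x + x, y), (offset_x + x + 1, y))
                if y + 1 < height:
                    G.add_edge((offset_x + x, y), (offset_x + x, y + 1))
    add_room(0)
    add_room(width + 1)                      # second room, one column to the right
    doorway = (width, height // 2)
    G.add_edge((width - 1, height // 2), doorway)
    G.add_edge(doorway, (width + 1, height // 2))
    return G, doorway

G, doorway = two_rooms_graph()
# States lying on many shortest paths are candidate bottlenecks (option subgoals).
centrality = nx.betweenness_centrality(G)
top_states = sorted(centrality, key=centrality.get, reverse=True)[:3]
print("Candidate bottlenecks:", top_states)   # the doorway should rank first
```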
Spectral Methods
Spectral methods discover options by analyzing the eigenstructure of matrix representations of the environment, such as the graph Laplacian or successor representation, to extract topological information about state connectivity and diffusion. By leveraging the eigenvectors (or eigenfunctions) of these matrices,2 these methods produce temporally extended actions (like eigenoptions) that guide agents along meaningful directions in the state space, often facilitating efficient exploration and transfer across tasks. Advances allow these representations to be approximated with neural networks in high-dimensional or continuous settings, enabling scalability.
Despite their strengths in capturing environment structure and enabling option reuse, current research focuses on improving representational learning, extending applicability to planning and partially-observable domains, and integrating reward information for more nuanced behaviors.
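Below is a minimal sketch of the eigenoption idea on a small open gridworld; the intrinsic-reward form and the 5x5 layout are illustrative assumptions, and the code only shows how a Laplacian eigenvector can define an option's reward and termination, not a full algorithm.

```python
# Sketch: derive an eigenoption-style intrinsic reward from the graph Laplacian
# of a 5x5 gridworld. Eigenvectors of the Laplacian are the PVFs mentioned above.
import numpy as np
import networkx as nx

G = nx.grid_2d_graph(5, 5)                      # states are (x, y) cells
nodes = list(G.nodes)
L = nx.laplacian_matrix(G, nodelist=nodes).toarray().astype(float)

eigvals, eigvecs = np.linalg.eigh(L)            # eigenvalues in ascending order
e = eigvecs[:, 1]                               # a smoothest non-constant eigenvector

def eigenoption_reward(s, s_next):
    """Intrinsic reward: improvement in the eigenvector value along a transition."""
    return e[nodes.index(s_next)] - e[nodes.index(s)]

# The eigenoption maximizes this reward and terminates where no neighbor improves
# it further, driving the agent toward one extreme of the state space.
print("Option targets:", nodes[int(np.argmax(e))], "and", nodes[int(np.argmin(e))])
```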
Sequentially Composable Options (Skill Chaining)
Sequentially composable options are temporally extended actions constructed so that each option terminates in a region from which another option can be reliably initiated, enabling robust sequential planning. The skill chaining algorithm is a canonical approach that incrementally learns options backward from the goal: starting with an option that reaches the goal, it recursively creates new options whose subgoals correspond to the initiation regions of previously learned options, continuing until the start state is covered. Initiation functions are learned as probabilistic classifiers or value functions, often with explicit uncertainty estimation.
This explicit composability facilitates efficient planning and accelerates credit assignment and exploration, but current methods are best suited for tasks with clearly defined goals or subgoals. Open directions include generalizing to non-goal-reaching settings and exploiting factored state representations for more modular skill composition.
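A minimal sketch of the backward-chaining loop is given below. It assumes a discrete goal set and a hypothetical `train_option_to_reach` routine that returns an option along with the initiation set it was able to learn; the real algorithm learns these initiation classifiers from experience.

```python
# Sketch: build a chain of options backward from the goal until the start
# state is covered by some option's initiation set.
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    policy: Callable       # maps state -> action
    initiation_set: Set    # states from which the option reliably succeeds
    termination_set: Set   # states where the option stops

def skill_chaining(goal_set, start_state, train_option_to_reach, max_options=10):
    chain, target = [], goal_set
    for _ in range(max_options):
        option = train_option_to_reach(target)   # hypothetical option learner
        chain.append(option)
        if start_state in option.initiation_set:
            break                                 # the chain now reaches the start
        target = option.initiation_set            # next subgoal: this initiation region
    return list(reversed(chain))                  # execute from start toward goal
```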
Empowerment Maximization
Empowerment maximization methods discover diverse and controllable skills by maximizing the agent’s influence over its future observations, formalized as the mutual information between action sequences (or skill variables) and future states. Practically, this is achieved via variational inference and neural estimators, resulting in algorithms that learn a set of skills distinguishable by their effects on the environment, often using discriminators and mutual information objectives as intrinsic rewards.
These methods have demonstrated strong exploratory capabilities and can adapt rapidly to new tasks, but often produce skills localized to specific regions of the state space, limiting transfer. Recent work focuses on improving the scalability and composability of skills, connecting empowerment to causal representation learning and goal-based exploration, and developing methods to generalize skills beyond specific states to more global or transferable behavioral changes.
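The sketch below illustrates the discriminator-based intrinsic reward used by DIAYN-style methods, assuming PyTorch, a small state space, and a uniform prior over a fixed number of discrete skills; network sizes are arbitrary choices for illustration.

```python
# Sketch: mutual-information skill discovery via a skill discriminator q(z | s).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_SKILLS, STATE_DIM = 8, 4

discriminator = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_SKILLS)
)

def intrinsic_reward(state, skill_id):
    """r = log q(z|s) - log p(z), with a uniform prior p(z) over skills."""
    log_q = F.log_softmax(discriminator(state), dim=-1)[skill_id]
    log_p = -torch.log(torch.tensor(float(NUM_SKILLS)))
    return (log_q - log_p).detach()

def discriminator_loss(states, skill_ids):
    """Train the discriminator to recognize which skill produced each state."""
    return F.cross_entropy(discriminator(states), skill_ids)

# The skill-conditioned policy is trained with any RL algorithm on
# intrinsic_reward, which pushes skills toward distinguishable state regions.
```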
Via Environment Rewards
This class of methods learns hierarchical policies directly from the environment’s external reward, rather than relying on intrinsic or pseudo-rewards. Feudal methods decompose control into managers and workers, with higher-level managers setting subgoals for workers to achieve, enabling temporal abstraction and goal decomposition, and implemented in deep RL as Feudal Networks (FuN), HIRO, and Director. Option-critic methods instead jointly learn intra-option policies, termination functions, and the high-level policy using policy gradient theorems derived directly from the environmental reward signal.
Both approaches produce options or skills that are interpretable and transferable, often capturing meaningful structure in the environment. However, they can suffer from degeneracy (e.g., options collapsing to primitive actions) and rely heavily on informative environment rewards, motivating recent work on regularization, intrinsic bonuses, or leveraging demonstrations to support learning in sparse reward settings.
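To illustrate the manager-worker split, here is a minimal feudal-style rollout sketch in which the worker is rewarded for realizing the manager's subgoal (expressed as a desired change in state) while the manager collects the environment reward. The `manager` and `worker` interfaces, the distance-based worker reward, and the gym-style `env.step` are assumptions for illustration.

```python
# Sketch: feudal/HIRO-style two-level rollout with a subgoal issued every k steps.
import numpy as np

def worker_intrinsic_reward(state, next_state, subgoal):
    """Negative distance between the achieved state change and the subgoal."""
    return -np.linalg.norm(state + subgoal - next_state)

def rollout_segment(env, manager, worker, state, k=10):
    """Run the worker for k steps under one subgoal; gather data for both levels."""
    subgoal = manager.sample_subgoal(state)             # hypothetical manager API
    manager_return, worker_transitions = 0.0, []
    for _ in range(k):
        action = worker.act(state, subgoal)             # hypothetical worker API
        next_state, reward, done, _ = env.step(action)  # classic gym-style step
        worker_transitions.append(
            (state, action, worker_intrinsic_reward(state, next_state, subgoal))
        )
        manager_return += reward                        # manager learns from env reward
        state = next_state
        if done:
            break
    return state, subgoal, manager_return, worker_transitions
```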
Optimizing HRL Benefits Directly
This class of methods addresses the limitations of proxy-based objectives in option discovery by explicitly formulating and optimizing for agent-level benefits, such as planning efficiency, exploration coverage, credit assignment speed, and transferability, with formal guarantees. Rather than targeting bottleneck states or maximizing empowerment as surrogates, these approaches define precise performance criteria (e.g., minimizing planning or cover time, accelerating policy evaluation, or reducing transfer sample complexity), and then derive algorithms that optimize option sets under these metrics, often with provable bounds or approximation guarantees. Key developments include option discovery for minimizing planning iterations (subject to cardinality constraints), reducing expected time to reach all states, and enhancing value propagation using concepts from matrix preconditioning.
While such formulations offer rigorous foundations for HRL, the resulting optimization problems remain challenging and are often NP-hard in practice. Current research seeks to extend guarantees to more general settings, including function approximation and richer option classes.
Meta Learning
Meta RL aims to learn the RL algorithm itself, or parts of it, rather than just a policy. This results in a bilevel optimization: the outer loop learns the parameters of the algorithm, while the inner loop adapts a policy for a specific task.
A common approach uses meta-gradients: the inner loop updates parameters for policies or options using standard RL objectives, while the outer loop updates meta-parameters by differentiating through the inner loop, capturing how changes in meta-parameters influence learning progress. Black box meta RL, instead, leverages recurrent models, such as RNNs or Transformers, to encode experience across multiple episodes and tasks. The agent interacts with a sequence of tasks, updating its internal memory based on observations and rewards, and only resets this memory between tasks.
Meta-gradient methods and black box meta RL both facilitate the discovery of options in HRL by optimizing for reusable behaviors across tasks. Specifically, meta-gradients can be used to learn option policies, reward functions, and termination conditions that generalize well, while black box meta RL enables agents to adaptively select or compose skills based on accumulated experience, thus discovering and refining temporally abstract options that accelerate learning in new tasks.
A key opportunity for research is to relax the specialized multi-task assumption required by many meta RL approaches. Broadening the framework to more general, less structured settings, possibly via in-context learning or more flexible curricula, would further increase the applicability and scalability of meta learning in RL.
Curriculum Learning
Curriculum learning enables agents to master challenging goals by progressing through a sequence of tasks with increasing complexity, allowing gradual skill acquisition.
In HRL, the high-level policy selects goals based on measures of learning progress, such as recent improvements in achieving specific goals. The objective is to optimize goal selection so that iterative updates lead to high performance across the entire goal space. Local approximations of learning progress and bandit-based goal selection are common, while implicit curricula, such as hindsight experience replay, further guide agents by relabeling experiences with alternative goals. These approaches enhance exploration by continually pushing the agent’s capabilities, support the discovery of diverse behaviors, and facilitate transfer to novel tasks.
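As a sketch of such goal selection, the snippet below keeps a success history per goal and samples goals in proportion to recent learning progress; the window size, the optimistic default, and the epsilon mixing are illustrative assumptions.

```python
# Sketch: bandit-style curriculum that favors goals with high learning progress.
import random
from collections import defaultdict, deque

class GoalCurriculum:
    def __init__(self, goals, window=20, eps=0.2):
        self.goals = list(goals)
        self.history = defaultdict(lambda: deque(maxlen=2 * window))
        self.window, self.eps = window, eps

    def learning_progress(self, goal):
        h = list(self.history[goal])
        if len(h) < 2 * self.window:
            return 1.0                                # optimistic for rarely tried goals
        old, recent = h[: self.window], h[self.window:]
        return abs(sum(recent) / self.window - sum(old) / self.window)

    def sample_goal(self):
        if random.random() < self.eps:                # keep probing the whole goal space
            return random.choice(self.goals)
        scores = [self.learning_progress(g) for g in self.goals]
        total = sum(scores) or 1.0
        return random.choices(self.goals, weights=[s / total for s in scores])[0]

    def report(self, goal, success):
        self.history[goal].append(float(success))
```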
Open research directions include developing more reliable metrics for goal difficulty and interestingness, and leveraging unsupervised environment design or human knowledge for more effective curricula.
Intrinsic Motivation
Intrinsic motivation in RL drives agents to explore and learn new skills without explicit external rewards, focusing instead on curiosity, novelty, or information gain. This developmental approach enables agents to autonomously discover reusable skills, often represented as options with subgoals that are interesting or novel.
Common intrinsic signals include bonuses for visiting new or rarely seen states, which push agents to explore unfamiliar areas. Some methods, like Relative Novelty or First Return then Explore, set subgoals whenever the agent experiences something new, while factored approaches, like HEXQ, break the environment into smaller tasks based on how its parts interact. Together, these methods help agents explore more thoroughly, learn a wide range of behaviors, and make it easier to reuse learned skills in new situations.
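As a concrete example of such novelty bonuses, here is a minimal count-based sketch for discrete (or discretized) states; the 1/sqrt(N) form and the bonus scale are common but illustrative choices.

```python
# Sketch: count-based novelty bonus that decays as states are revisited.
import math
from collections import Counter

visit_counts = Counter()

def novelty_bonus(state, beta=0.1):
    """Intrinsic bonus added to the environment reward for visiting `state`."""
    visit_counts[state] += 1
    return beta / math.sqrt(visit_counts[state])

# States whose bonus stays high relative to their neighbors are "relatively
# novel" and are natural candidates for option subgoals.
```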
Ongoing challenges include scaling factored skill discovery to high-dimensional observation spaces, developing robust novelty estimation in continuous or stochastic settings, and formally connecting intrinsic motivation with other option discovery techniques.
💾 Discovery through Offline Datasets
Variational Inference of Skill Latents
This class of offline skill discovery methods defines skills as latent variables inferred from unlabeled trajectories via reconstruction objectives. A common approach maximizes the likelihood of trajectories by introducing latent skill sequences that segment the data, often using a variational autoencoder (VAE) framework. Each trajectory is explained through a sequence of skill (and boundary) indicators, with the evidence lower bound (ELBO) guiding training.
Extensions incorporate regularization via minimum description length (MDL), encouraging concise and composable skill sets. These skills can be reused hierarchically, adapted across tasks, or used to augment the action space. More recent formulations impose mutual information constraints to balance skill diversity with fidelity to offline data.
This approach improves credit assignment, transfer, and exploration by distilling temporally coherent behaviors into representations in a latent skill space. Open challenges include improving optimization stability, leveraging reward signals more effectively, and scaling to more diverse and realistic domains.
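To make the reconstruction objective concrete, below is a minimal PyTorch sketch of a trajectory-segment VAE: a recurrent encoder infers a skill latent from a segment, and a skill-conditioned decoder reconstructs the actions. Segment length, network sizes, and the MSE reconstruction term are illustrative assumptions, not the exact objective of any cited method.

```python
# Sketch: VAE over fixed-length trajectory segments with a latent skill variable.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, SKILL_DIM, SEG_LEN = 4, 2, 8, 10

encoder = nn.GRU(STATE_DIM + ACTION_DIM, 64, batch_first=True)
to_latent = nn.Linear(64, 2 * SKILL_DIM)            # outputs mean and log-variance
skill_policy = nn.Sequential(                        # decoder: (state, skill) -> action
    nn.Linear(STATE_DIM + SKILL_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM)
)

def elbo_loss(states, actions, beta=0.1):
    """states: (B, SEG_LEN, STATE_DIM); actions: (B, SEG_LEN, ACTION_DIM)."""
    _, h = encoder(torch.cat([states, actions], dim=-1))
    mu, log_var = to_latent(h[-1]).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)    # reparameterization
    z_seq = z.unsqueeze(1).expand(-1, SEG_LEN, -1)
    pred_actions = skill_policy(torch.cat([states, z_seq], dim=-1))
    recon = F.mse_loss(pred_actions, actions)                    # action reconstruction
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl                                     # negative ELBO
```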
Hindsight Subgoal Relabeling
Hindsight subgoal relabeling methods use offline datasets to identify and relabel important intermediate states (subgoals) within trajectories. By treating these subgoals as waypoints, agents can learn hierarchical policies that solve complex tasks more efficiently, even when external rewards are sparse or delayed. For example, algorithms partition trajectories into segments and train policies to reach each segment in sequence, while others relabel future states as short-term goals for low-level controllers. This decomposition improves credit assignment, sample efficiency, and makes learned behaviors easier to interpret. Key research directions include making subgoal choices more interpretable and scaling these approaches to environments with high-dimensional observations.
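A minimal sketch of the relabeling step is shown below: each trajectory is split into fixed-length segments and the final state of each segment is treated as the subgoal for the low-level policy. The fixed segment length and the sparse 0/1 goal-reaching reward are illustrative assumptions.

```python
# Sketch: hindsight subgoal relabeling of an offline trajectory of (state, action) pairs.
def relabel_with_subgoals(trajectory, segment_len=10):
    relabeled = []
    for start in range(0, len(trajectory), segment_len):
        segment = trajectory[start:start + segment_len]
        subgoal = segment[-1][0]                     # last state of the segment
        for t, (state, action) in enumerate(segment):
            reached = (t == len(segment) - 1)
            reward = 1.0 if reached else 0.0         # sparse goal-reaching reward
            relabeled.append((state, action, subgoal, reward, reached))
    return relabeled

# The relabeled tuples train a goal-conditioned low-level policy, while the
# high-level policy learns to propose the extracted subgoals in sequence.
```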
🧱 Discovery with Foundation Models
Embedding Similarity
Embedding similarity methods use foundation models to define goal-conditioned rewards based on the similarity between pretrained embeddings of goals and current observations. Agents can be trained using either fine-tuned or frozen encoders. Open challenges include comparing different embedding models, expanding beyond text-image goals, and improving reward quality in complex or sparse environments.
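A minimal sketch of the reward computation is below, assuming hypothetical `embed_text` and `embed_image` functions from a frozen, CLIP-style pretrained encoder that map the goal description and the current observation into a shared embedding space.

```python
# Sketch: goal-conditioned reward as cosine similarity between pretrained embeddings.
import numpy as np

def embedding_reward(goal_text, observation, embed_text, embed_image):
    g = embed_text(goal_text)        # assumed: returns a 1-D numpy array
    o = embed_image(observation)     # assumed: same embedding space as the goal
    return float(np.dot(g, o) / (np.linalg.norm(g) * np.linalg.norm(o) + 1e-8))

# The similarity can be used directly as a dense shaping reward, or thresholded
# into a sparse success signal for goal-conditioned option learning.
```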
Providing Feedback
Foundation models, especially LLMs, can provide reward feedback to RL agents by either directly evaluating the success of a behavior (e.g., “Did the agent achieve the goal?”) or by expressing preferences over pairs of trajectories or states. These scalar signals or preferences are then distilled into reward models, guiding agents to explore, improve credit assignment, and transfer skills, all based on natural language goals or descriptions. This approach leverages the generalization and reasoning abilities of LLMs while simplifying reward design.
Reward as Code
It is possible to use LLMs to generate reward functions as executable code, given a goal description and symbolic information from the environment. The LLM receives the task and environment details (such as Python class representations of objects) and outputs code that computes rewards for the agent’s states. This allows agents to optimize behavior using automatically generated rewards, often matching or exceeding human-crafted rewards in robotics and simulation tasks. This approach enables rapid transfer across tasks but relies on the availability of meaningful symbolic representations, motivating future work to extend code-based rewards to high-dimensional, less-structured domains.
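The loop is easy to sketch: prompt the model with the task and a symbolic state description, then execute the returned source to obtain a reward function. The prompt template and the `llm_complete` call are hypothetical placeholders; in practice the generated code should be sandboxed and validated.

```python
# Sketch: obtain an executable reward function from an LLM given a task description.
PROMPT_TEMPLATE = """You are writing a reward function for an RL agent.
Task: {task}
The state is a Python dict with keys: {state_keys}.
Return only a Python function `compute_reward(state) -> float`."""

def build_reward_fn(task, state_keys, llm_complete):
    prompt = PROMPT_TEMPLATE.format(task=task, state_keys=", ".join(state_keys))
    source = llm_complete(prompt)           # hypothetical LLM client call
    namespace = {}
    exec(source, namespace)                 # sandbox this in any real system
    return namespace["compute_reward"]

# Usage with a hypothetical client:
# reward_fn = build_reward_fn(
#     "stack the red block on the blue block",
#     ["red_block_pos", "blue_block_pos", "gripper_pos"],
#     llm_complete=my_llm,
# )
# r = reward_fn(current_state)
```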
Directly Modeling the Policy
LLMs can directly model goal-conditioned policies by generating code or action sequences conditioned on natural language goals and state information, bypassing the need for manually designed reward functions. In robotics and simulated domains, LLMs produce skill policies as executable code that interacts with APIs or environment abstractions, while in complex domains like Minecraft or web navigation, LLMs continually expand skill libraries through auto-curriculum and self-reflection. Some approaches directly prompt LLMs for low-level actions, demonstrating the ability to solve diverse tasks zero-shot. This paradigm enables rapid adaptation and compositional skill reuse but raises open challenges in adapting to new action spaces, varying embodiments, and maintaining performance without expensive fine-tuning.
🛠️ How to Use the Discovered Temporal Structure?
After discovering a library of temporally abstract behaviors, the next challenge is how to use them effectively. A common approach is the call-and-return model, where the agent selects one option that runs until termination. This simplifies control but limits flexibility. Alternatives like Generalized Policy Improvement (GPI) allow agents to evaluate multiple options simultaneously, selecting actions that perform at least as well as any single option’s policy.
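The sketch below contrasts the two execution models, assuming each option exposes a policy, a termination probability, and a Q-function; these interfaces and the gym-style `env.step` are illustrative assumptions.

```python
# Sketch: call-and-return execution vs. GPI-style action selection over options.
import numpy as np

def call_and_return(env, options, high_level_policy, state):
    """Pick one option and follow it until its termination condition fires."""
    option = high_level_policy(state, options)
    done = False
    while not done:
        action = option.policy(state)
        state, reward, done, _ = env.step(action)       # gym-style step assumed
        if np.random.rand() < option.termination_prob(state):
            break
    return state

def gpi_action(state, options, actions):
    """Act greedily w.r.t. the max of all options' Q-values, so the resulting
    policy is at least as good as any single option's policy."""
    q = np.array([[opt.q_value(state, a) for a in actions] for opt in options])
    return actions[int(q.max(axis=0).argmax())]
```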
High-level policies over options can be trained in different ways. Model-free methods extend standard RL algorithms to operate over options, often enhanced by learning progress signals, exploration bonuses, or diversity incentives. Model-based methods learn predictive models over options to enable long-horizon planning with fewer steps. Combined with state abstraction, these models support more efficient planning in complex environments, especially when abstractions align with the agent’s own perspective, allowing for transfer across tasks.3
🚧 Final Remarks
We conclude by identifying open challenges in discovering temporal structure, and draw connections to related areas such as state and action abstraction, continual RL, programmatic RL, and multi-agent RL. We also highlight promising domains, like robotics, web agents, and open-ended games, where HRL may have the most transformative impact.
More formally, we call it temporal abstraction. ↩︎
The eigenvectors of the graph Laplacian are also known as proto-value functions (PVFs) in RL literature. ↩︎
For example, a home robot may encounter many different houses with various layouts and furniture. If the robot builds a representation based on its own sensors and actuators (e.g., “object is one meter in front of me” or “this action makes me turn right”), that abstraction is agent-centric. ↩︎