Discovering Temporal Structure: An Overview of Hierarchical Reinforcement Learning
📄 Read the paper
📘 Overview
From its humble beginnings in late 2023 to a detailed 80+ page deep dive, this project has evolved through many collaborative discussions with Martin and Akhil, and the steady guidance of George, Doina, and Marlos. I would also like to thank Xujie for his continued support during this project. I really appreciate all the support and insights throughout this journey, and I learned a lot about the fascinating world of Hierarchical Reinforcement Learning (HRL)!
What follows is an overview of the work:
📑 Table of Contents
- 🎯 Motivation & What is HRL
- 🧠 Core Benefits of Temporal Structure
- ⭐ What Is Temporal Structure?
- 🧭 How Is Temporal Structure Discovered?
- 📊 Summary of Methods & Their Focus on the Benefits
- 🌱 Discovery from Online Experience
- 💾 Discovery through Offline Datasets
- 🧱 Discovery with Foundation Models
- 🛠️ How to Use the Discovered Temporal Structure?
- 🚧 Final Remarks
🎯 Motivation & What is HRL?
Developing agents that can explore, plan, and learn in open-ended environments is a key desideratum in AI. Hierarchical Reinforcement Learning (HRL) tackles this challenge by identifying and utilizing temporal structure1—patterns in how decisions unfold over time.
Various terms such as options, skills, behaviors, subgoals, subroutines, and subtasks are often used interchangeably in the RL literature to describe temporal structures.
Yet, a key question remains: What makes a temporal structure “good” or “useful”?
Our work provides insight into this question by examining the benefits of temporal abstraction through the lens of four fundamental decision-making challenges. In this framing, temporal structures are considered useful if they enable agents to effectively realize the associated benefits. Conversely, effective HRL algorithms that discover such structures inherently provide these benefits.
🧠 Core Benefits of (Useful) Temporal Structure / HRL
We identify four central benefits that useful temporal structures enable:
🔍 Exploration
Facilitates structured exploration by enabling agents to:
Discover and pursue meaningful subgoals, or explore in different meaningful directions;
Explore within more abstract state / action spaces;
Explore over extended time horizons, rather than individual actions.
🔗 Credit Assignment
Supports more effective credit propagation by allowing learning signals (e.g., gradients or temporal-difference errors) to be assigned at higher levels of abstraction, thereby improving the identification of causally relevant decisions (e.g., bottlenecks, subgoals).
🔄 Transferability
Transferability has the following instantiations in the HRL context:
Options can be reused across different tasks, serving as components of different solutions;
Options can generalize in similar but novel conditions;
Similar tasks (goals) may require a similar set of options to complete.
Together, the three aforementioned fundamental benefits add up to faster learning and planning. Beyond these, HRL also has the potential to tackle the additional challenge of interpretability.
👁️ Interpretability
Enhances interpretability by imposing structure on behavior, making the agent’s decision-making process easier for humans to analyze and explain through high-level temporal components.
⭐ What Is Temporal Structure? The Option Formulation
What Is an Option?
An option is a temporally extended action executed over multiple timesteps. It is general enough to describe a wide variety of temporally extended behaviors. Each option $o \in O$ is defined by three components (Sutton et al., 1999):
Initiation Function
Determines where (and to what degree) the option can start: $$ I(s, o) \in [0,1]. $$
Intra-Option Policy
Specifies how the agent behaves while the option is active: $$ \pi(a \mid s, o). $$
Termination Function
Specifies when (and to what degree) the option stops: $$ \beta(s, o) \in [0,1]. $$
Once selected, an option continues executing its intra-option policy until its termination condition triggers.
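As a concrete sketch (all names here are illustrative, not from a specific library), the three components and call-and-return execution can be written as:

```python
# Minimal sketch of an option and its call-and-return execution.
import random

class Option:
    def __init__(self, initiation, policy, termination):
        self.initiation = initiation    # I(s, o) in [0, 1]
        self.policy = policy            # pi(a | s, o)
        self.termination = termination  # beta(s, o) in [0, 1]

def run_option(option, state, step_fn, max_steps=100):
    """Execute the intra-option policy until the termination condition triggers."""
    for _ in range(max_steps):
        action = option.policy(state)
        state = step_fn(state, action)
        if random.random() < option.termination(state):
            break
    return state

# Toy 1-D corridor: the option walks right and terminates at state 5.
walk_right = Option(
    initiation=lambda s: 1.0,
    policy=lambda s: +1,
    termination=lambda s: 1.0 if s >= 5 else 0.0,
)
final = run_option(walk_right, 0, step_fn=lambda s, a: s + a)
```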
High-Level Policy
A hierarchical agent chooses among options and primitive actions, often using a high-level policy:
$$
\mu(o \mid s).
$$
After selection, control is delegated to the option until termination.
Subgoal Options
Many discovery methods interpret options as subgoal-reaching behaviors, defined via an option-specific (intrinsic) reward: $$ r_o(s, a, s'). $$ Optimizing this intrinsic reward yields an option that reliably reaches a target region.
Relation to Other HRL Formalisms
Skills = Options
Many papers use “skill” (as $z$) interchangeably with “option” (as $o$). Notably, many Bayesian approaches formalize a skill as a latent vector, $z$, which explains trajectories $\tau = (s_0, a_0, s_1, \ldots, s_T)$, and the skill-dependent policy corresponds directly to an intra-option policy $$ p(a_t \mid s_t, z) \leftrightarrow \pi_o(a_t \mid s_t). $$
Some latent-variable models additionally infer task boundaries or termination signals, which correspond to initiation and termination functions in the option framework.
Goal-Conditioned RL (GCRL) is HRL
A goal can be specified by $$ (g, r_g, \gamma_g). $$
GCRL augments the observation with a goal and uses the goal to drive temporally extended behavior. Subgoals act exactly like options: they define behaviors, induce termination, support compositionality, and enable modular reuse. Note that GCRL largely assumes goals are given rather than discovered.
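To make the correspondence concrete, here is a hedged sketch (names illustrative, not from a specific paper) of wrapping a goal-conditioned policy as an option:

```python
# A goal-conditioned policy viewed as an option: the goal g induces the
# intra-option policy, and reaching g induces termination.
class GoalOption:
    def __init__(self, goal, goal_policy):
        self.goal = goal
        self.goal_policy = goal_policy

    def initiation(self, s):
        return 1.0                                # I(s, o): startable anywhere

    def policy(self, s):
        return self.goal_policy(s, self.goal)     # pi(a | s, g)

    def termination(self, s):
        return 1.0 if s == self.goal else 0.0     # beta(s, o): stop at g

# Toy 1-D example: move toward the goal at state 3.
opt = GoalOption(3, goal_policy=lambda s, g: +1 if s < g else -1)
```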
Overall, skills, subgoals, and goal-conditioned policies are all alternative views of the same temporal abstraction mechanism. Different framings, same core idea: temporal abstraction through multi-step behaviors.
Why the Formalism Matters
The option framework provides:
- A structured representation for multi-step actions.
- Modular and reusable behavioral units.
- A unified mathematical description across HRL approaches.
It forms the foundation for modern HRL.
🧭 How Is Temporal Structure Discovered?
We survey methods that discover useful temporal structure in three settings (based on the availability of data / prior knowledge):
- 🟢 Online interaction: the agent constructs useful options simply by interacting with the environment.
- 🟡 Offline datasets: assumes access to a fixed dataset of pre-collected trajectories.
- 🔵 Foundation models (e.g., LLMs): assumes access to pretrained models that contain knowledge about the environment.
Each method is discussed based on its focus on the four key benefits.

📊 Summary of Methods & Their Focus on the Benefits
• = general focus &nbsp; •• = explicitly designed for the benefit
| Category | Methods | 🔗 Credit Assign. | 🔍 Explor. | 🔄 Transf. | 👁️ Interpr. |
|---|---|---|---|---|---|
| 🟢 Online | Bottleneck Discovery | • | • | • | • |
| | Spectral Methods | • | •• | • | |
| | Skill Chaining | •• | •• | • | • |
| | Empowerment Maximization | • | •• | • | |
| | Via Environment Reward | •• | • | • | |
| | Optimizing HRL Benefits Directly | • | • | | |
| | Meta-Learning | • | •• | | |
| | Curriculum Learning | •• | • | | |
| | Intrinsic Motivation | •• | • | | |
| 🟡 Offline | Variational Inference | •• | • | • | |
| | Hindsight Subgoal Relabeling | •• | • | | |
| 🔵 Foundation Models | Embedding Similarity | •• | • | • | |
| | Providing Feedback | • | •• | • | • |
| | Reward as Code | •• | • | | |
| | Direct Policy Modeling | • | •• | • | |
Below is a brief description of each class of methods. Methods are categorized based on the underlying principles used to discover options.
🌱 Discovery from Online Experience
Discovery from online experience is the most fundamental setting for temporal-structure learning: the agent extracts skills directly from its own interaction stream, without assuming prior knowledge. This makes it broadly applicable and compatible with additional assumptions when available, while remaining principled at its core. Moreover, because skills are acquired incrementally as experience grows, this setting naturally scales to continual and open-ended environments where an agent must keep expanding its behavioral repertoire over time.
Bottleneck Discovery Methods
Bottleneck discovery methods aim to identify critical states (bottlenecks) that connect otherwise distinct regions of the state space. These methods typically represent the environment as a graph with states as nodes and edges denoting single-step reachability, and employ strategies such as:
- Diverse density: selecting states prevalent in successful but not unsuccessful trajectories;
- Graph partitioning: using min-cut algorithms iteratively to separate loosely-connected subgraphs and reveal bottleneck nodes;
- Graph Centrality: using metrics like betweenness centrality to identify bottlenecks, which highlight states appearing on many shortest paths.
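
The centrality idea can be sketched numerically (a toy illustration, not a specific paper's algorithm): nodes lying on many shortest paths score highest, so the "doorway" between two rooms stands out.

```python
# Toy centrality-based bottleneck detection on a two-room graph.
from collections import deque
from itertools import combinations

def bfs_dist(adj, src):
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def centrality_scores(adj):
    dist = {v: bfs_dist(adj, v) for v in adj}
    score = {v: 0 for v in adj}
    for s, t in combinations(adj, 2):
        for v in adj:
            if v not in (s, t) and dist[s][v] + dist[v][t] == dist[s][t]:
                score[v] += 1          # v lies on a shortest s-t path
    return score

# Two triangle "rooms" {0,1,2} and {3,4,5} joined by the edge (2, 3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
scores = centrality_scores(adj)       # bridge nodes 2 and 3 score highest
```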

While an understanding of bottleneck states enhances exploration, credit assignment, and transferability, scaling to large or continuous spaces remains a key challenge, motivating more localized or clustered variants.
Spectral Methods
While bottleneck discovery methods explicitly seek critical connecting states, spectral methods reveal them indirectly by analyzing the eigenstructure of matrix representations of the environment. These approaches treat the Markov Decision Process (MDP) as a graph, where states are nodes and transitions define edges. By examining how information diffuses through this graph, spectral methods extract global structural patterns that reflect the environment’s topology.
A central construct is the graph Laplacian:
$$ \mathcal{L} = \mathbf{D}^{-1/2} (\mathbf{D} - \mathbf{A}) \mathbf{D}^{-1/2}, $$
where $\mathbf{A}$ is the adjacency matrix ($A_{ij} = 1$ if edge $(i,j)$ exists, else $0$) and $\mathbf{D} = \mathrm{diag}(d_1, \ldots, d_n)$ is the degree matrix with $d_i = \sum_j A_{ij}$. The eigenvectors of $\mathcal{L}$ capture modes of variation across the state space, representing directions along which states change smoothly according to the graph’s connectivity2.
Notably, the second smallest eigenvalue’s eigenvector (the Fiedler vector) identifies bottlenecks and partitions in the environment, corresponding to natural subregions or pathways between them. These eigenvectors define eigenfunctions $f(s)$ that can be learned from sampled experience, allowing continuous approximations of the discrete spectral structure.
Agents can then define eigenoptions by acting along the increasing or decreasing direction of $f(s)$, producing temporally extended actions that traverse meaningful regions of the environment:
$$ \pi_{\text{eigen}}(s) \propto \nabla f(s) \quad \text{or} \quad -\nabla f(s). $$
Instead of directly following the gradient of an eigenfunction, eigenoptions can be learned by defining an intrinsic reward that encourages movement along the eigenvector direction:
$$ r_t = f(s_{t+1}) - f(s_t). $$
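These quantities are easy to compute in the tabular case. The following numerical sketch (assumptions: a small two-room graph standing in for an MDP's state graph) extracts the Fiedler vector and defines the eigenoption reward above:

```python
# Fiedler vector of a normalized graph Laplacian, and the induced
# eigenoption reward r_t = f(s_{t+1}) - f(s_t).
import numpy as np

def normalized_laplacian(A):
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ (np.diag(d) - A) @ D_inv_sqrt

# Two triangle "rooms" joined by a single bottleneck edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0

eigvals, eigvecs = np.linalg.eigh(normalized_laplacian(A))
f = eigvecs[:, 1]                      # Fiedler vector (2nd-smallest eigenvalue)

def eigenoption_reward(s, s_next):
    return f[s_next] - f[s]            # rewards movement along the eigenfunction
```

The Fiedler vector takes opposite signs on the two rooms, so maximizing this reward drives the agent across the bottleneck edge.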
Modern variants employ neural networks to approximate $f(s)$ directly from experience, extending spectral reasoning to high-dimensional and continuous domains. The resulting eigenoptions can traverse high-level structures in the environment, such as corridors or bridges between regions. This enables scalable structure-aware exploration and transferable representations, preserving the geometric intuition of spectral analysis while integrating with deep learning frameworks.
Despite their strengths in capturing environment structure and enabling option reuse, current research focuses on improving representation learning, extending applicability to planning and partially-observable domains, and integrating reward information for more nuanced behaviors.
Skill Chaining: Building Sequentially Composable Options
Skill chaining methods discover sequentially composable options: each option ends where another can begin (Konidaris and Barto, 2009).
The core idea assumes access to successful trajectories that reach a desired goal. Starting from the final goal, the algorithm constructs options backward:
- Learn an option that reaches the final goal.
- Define new options whose termination regions correspond to the initiation sets of previously learned ones.
- Repeat recursively until the initial states are covered.
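
The backward construction above can be sketched as follows (a hedged illustration over recorded trajectories; the names and the `horizon` cutoff are simplifications, not Konidaris and Barto's exact construction):

```python
# Backward skill chaining: build options from the goal toward the start.
def chain_skills(goal_region, trajectories, start_states, horizon=2):
    """Chain options backward from the goal until start states are covered."""
    options, target = [], set(goal_region)
    while not (set(start_states) & target):
        # States (outside the target) from which the target is reached
        # within `horizon` steps form the new option's initiation set.
        init_set = {
            traj[i]
            for traj in trajectories
            for i in range(len(traj))
            if traj[i] not in target
            and any(traj[j] in target
                    for j in range(i + 1, min(i + 1 + horizon, len(traj))))
        }
        if not init_set:
            break                      # no trajectory reaches the target
        options.append({"initiation": init_set, "termination": set(target)})
        target = init_set              # next option must reach this region
    return options
```

On a single trajectory 0 → 4 with goal {4}, this yields a chain of two options whose termination regions match the previous options' initiation sets.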

Initiation functions are typically learned as probabilistic classifiers or value-based predictors, often incorporating uncertainty estimation to decide where a skill can be reliably executed (Bagaria and Konidaris, 2019).
By explicitly enforcing composability, skill chaining enhances long-horizon planning, improves credit assignment, and encourages structured exploration. However, current implementations remain best suited to goal-oriented environments with well-defined subgoals. Open research directions include generalizing to non-goal-directed tasks, leveraging factored representations for modular composition, and integrating skill chaining with model-based or hierarchical planning frameworks.
Empowerment Maximization
Empowerment-based methods aim to discover diverse and controllable skills by maximizing the agent’s influence over its future observations. Empowerment is formally defined as the mutual information between an action sequence (or skill) and the resulting future state:
$$ \mathcal{E}_n(s_t) = \max_{p(a)} I(a; s_{t+n} \mid s_t), $$
with the decomposition
$$ I(a; s_{t+n}\mid s_t) = H(s_{t+n}\mid s_t) - H(s_{t+n}\mid s_t, a), $$
where the first term captures the diversity of reachable outcomes, and the second captures controllability by penalizing unpredictability once a skill is chosen.

Practically, empowerment is estimated using variational inference. Methods introduce a latent skill $z$ and a discriminator $q_\phi(z\mid s)$, producing an intrinsic reward
$$ r_z(s,a) = \log q_\phi(z \mid s) - \log p(z), $$
so that each skill leads the agent to visit states that make it easy for the discriminator to tell which skill was executed.
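As a toy instantiation of this reward (the linear softmax discriminator below is a stand-in for the learned $q_\phi$ network; weights are random rather than trained):

```python
# Skill-discriminator intrinsic reward: r_z(s) = log q_phi(z|s) - log p(z).
import numpy as np

rng = np.random.default_rng(0)
n_skills, state_dim = 4, 8
W = rng.normal(size=(state_dim, n_skills))   # stand-in discriminator weights

def q_phi(s):
    logits = np.asarray(s, float) @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()                        # q_phi(z | s)

def intrinsic_reward(s, z, p_z=1.0 / n_skills):
    # Positive exactly when the state identifies the executed skill
    # better than chance (the prior p(z)).
    return float(np.log(q_phi(s)[z]) - np.log(p_z))
```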
Influential algorithms instantiate this idea:
VIC (Gregor et al., 2017): replaces open-loop actions with latent skills and maximizes
$$J_{\text{VIC}} = I(z; s_{t+n}\mid s_t),$$
encouraging each skill to reach a distinguishable future state region.
DIAYN (Eysenbach et al., 2019): focuses on distinguishing skills by their current state distributions; policies maximize stochasticity while keeping skills identifiable: $$ J_{\mathrm{DIAYN}} = I(s; z) + \mathcal{H}(a \mid s,z) = I(s; z) + \big(\mathcal{H}(a \mid s) - I(a; z \mid s)\big). $$
DADS (Sharma et al., 2020): discovers skills that produce distinguishable and predictable state transitions: $$ J_{\mathrm{DADS}} = I(s'; z \mid s). $$
These methods share the insight that useful skills should be distinguishable by the effects they produce on the environment. Whether through terminal outcomes (VIC), state visitation patterns (DIAYN), or skill-conditioned transitions (DADS), each method uses a discriminator to identify the executed skill from observed behavior, ensuring that different skills induce reliably different trajectories.
These methods learn a bottom-up decomposition of behavior: usually, an unsupervised phase constructs diverse and controllable skills that aid exploration, support high-level decision-making over skills (better credit assignment), and transfer across reward structures. The primary limitation is that empowerment is hard to estimate and optimize, often yielding skills that remain localized and not fully composable across the entire state space.
Via Environment Rewards
Methods in this class discover hierarchical structure directly from external reward, rather than relying on intrinsic or pseudo-rewards.
Feudal methods. Feudal RL (Dayan and Hinton, 1992) introduces a manager–worker hierarchy in which higher-level managers set goals and lower-level workers attempt to achieve them. Each level defines its own MDP: managers act over abstract states and longer horizons, while workers operate over primitive states and shorter horizons. Two key principles enforce abstraction:
- Information hiding: each manager sees an abstracted state representation, enabling state abstraction across levels.
- Reward hiding: managers optimize the environmental reward, while workers optimize an intrinsic reward measuring progress toward the manager’s goal.
Deep instantiations such as Feudal Networks (FuN) (Vezhnevets et al., 2017) implement this by having the manager emit a latent goal-direction vector $g_{t-i}$, and training the worker using an intrinsic reward $$ r^{\mathrm{worker}}(s_{t}, g_{t-i}, s_{t-i}) = d_{\cos}\bigl((s_{t} - s_{t-i}), g_{t-i}\bigr), $$ so that transitions follow the manager’s intended direction.
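This directional worker reward can be sketched directly (a simplified, hedged version; FuN averages such terms over a horizon, which is omitted here):

```python
# Cosine-similarity worker reward: +1 when the realized state change
# aligns with the manager's goal direction, -1 when opposite.
import numpy as np

def worker_reward(s_t, s_prev, g_prev):
    delta = np.asarray(s_t, float) - np.asarray(s_prev, float)
    g = np.asarray(g_prev, float)
    denom = np.linalg.norm(delta) * np.linalg.norm(g)
    return float(delta @ g / denom) if denom > 0 else 0.0
```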
Option-Critic. The option-critic architecture (Bacon, Harb, and Precup, 2017), instead, frames option discovery itself as a learning problem: intra-option policies, option terminations, and the policy over options are all differentiable and jointly optimized from the external reward.
Each option $o$ has:
- an intra-option policy $\pi_o(a \mid s)$,
- a termination function $\beta_o(s)$,
- and the agent follows a policy over options $\mu(o \mid s)$.
The key value equation linking primitive actions to option values is: $$ q_{u}(s,o,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) u_{\beta}(o,s'). $$
The option-value upon arrival is:
$$ u_{\beta}(o,s') = (1 - \beta_o(s')) q_{\pi}(s',o) + \beta_o(s') v_{\mu}(s'), $$
where
$$ v_{\mu}(s) = \sum_{o} \mu(o \mid s) q_{\pi}(s,o), \qquad q_{\pi}(s,o) = \sum_{a} \pi_o(a \mid s) q_{u}(s,o,a). $$
These expressions make both $\pi_o$ and $\beta_o$ differentiable with respect to return, allowing standard policy-gradient updates. This converts option discovery into a fully learnable, reward-driven component of the agent.
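The coupled equations above can be checked numerically in the tabular case. The following sketch (random MDP, sizes and fixed-point iteration purely illustrative) computes $q_\pi$, $v_\mu$, $u_\beta$, and $q_u$ and iterates the backup to convergence:

```python
# Tabular option-critic value quantities on a random MDP.
import numpy as np

nS, nO, nA, gamma = 3, 2, 2, 0.9
rng = np.random.default_rng(1)
pi = rng.dirichlet(np.ones(nA), size=(nO, nS))   # pi_o(a | s)
beta = rng.uniform(size=(nO, nS))                # beta_o(s)
mu = rng.dirichlet(np.ones(nO), size=nS)         # mu(o | s)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # p(s' | s, a)
R = rng.normal(size=(nS, nA))                    # r(s, a)

def backup(q_u):
    q_pi = np.einsum('osa,soa->so', pi, q_u)           # q_pi(s, o)
    v_mu = np.einsum('so,so->s', mu, q_pi)             # v_mu(s)
    u = (1 - beta) * q_pi.T + beta * v_mu[None, :]     # u_beta(o, s')
    return R[:, None, :] + gamma * np.einsum('sap,op->soa', P, u)

q_u = np.zeros((nS, nO, nA))
for _ in range(300):       # gamma-contraction: converges to the fixed point
    q_u = backup(q_u)
```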
Properties and limitations. Both feudal and option-critic frameworks aim only to maximize return. When the reward structure is supportive, they can yield:
- improved credit assignment,
- transferable structure when reward correlates with meaningful subgoals,
- sometimes, interpretable multi-step skills.
However, they also exhibit:
- option degeneracy (e.g., collapsing to primitive actions or a single dominant option),
- strong dependence on informative reward, which limits performance in sparse-reward environments.
These issues motivate augmenting external-reward–based discovery with regularization, intrinsic objectives, or demonstrations to promote more robust temporal abstractions.
Optimizing HRL Benefits Directly
Most option-discovery methods rely on proxy objectives, and the formal link between these proxies and the agent’s actual capabilities often remains unclear. In contrast, this line of work directly optimizes quantifiable benefits of HRL, formulating objectives over metrics such as planning time, exploration coverage, credit-assignment speed, and transfer performance. These methods specify explicit criteria (e.g., minimizing cover time, reducing planning iterations, or accelerating value propagation), and derive option sets that provably improve these metrics, sometimes with provable bounds or approximation guarantees. See our paper Section 4.6 for details.
Although such formulations provide a rigorous foundation for HRL, many are computationally hard (often NP-hard). Extending guarantees to more general settings, including function approximation and richer option classes, remains a key open direction.
Meta Learning
Meta RL aims to learn the RL algorithm itself, or parts of it, rather than just a policy. This results in a bilevel optimization: the outer loop learns the parameters of the algorithm, while the inner loop adapts a policy for a specific task.
A common approach uses meta-gradients: the inner loop updates parameters for policies or options using standard RL objectives, while the outer loop updates meta-parameters by differentiating through the inner loop, capturing how changes in meta-parameters influence learning progress.
Two option-focused examples:
Meta Learning Shared Hierarchies (Frans et al., 2018): high-level policy as meta-parameters
Discovery of Options via Meta-Learned Subgoals (Veeriah et al., 2021): parameterized subgoals as meta-parameters
Black-box meta RL, instead, leverages recurrent models, such as RNNs or Transformers, to encode experience across multiple episodes and tasks (Duan et al., 2016). The agent interacts with a sequence of tasks, updating its internal memory (the hidden state) based on observations and rewards, and only resets this memory between tasks. Adaptive Agent (AdA) (Bauer et al., 2023) demonstrates how large-scale black-box meta RL can master an open-ended task space.
Meta-gradient methods and black-box meta RL both facilitate the discovery of options in HRL mostly by optimizing for reusable behaviors across tasks. Specifically, meta-gradients can be used to learn option policies, reward functions, and termination conditions that generalize well, while black-box meta RL enables agents to adaptively select or compose skills based on accumulated experience, thus discovering and refining temporally abstract options that accelerate learning in new tasks.
A key opportunity for research is to relax the specialized multi-task assumption required by many meta RL approaches. Broadening the framework to more general, less structured settings, possibly via in-context learning or more flexible curricula, would further increase the applicability and scalability of meta learning in RL.
Curriculum Learning
Curriculum learning enables agents to master challenging goals by progressing through a sequence of tasks (usually specified by subgoals) with increasing complexity. Those subgoals can be seen as options with increasing difficulty.
In this category, one common approach has the high-level policy select goals based on measures of learning progress, such as recent improvements in achieving specific goals. The objective is to optimize goal selection so that iterative updates lead to high performance across the entire goal space.
Another prominent approach is the use of implicit curricula, notably implemented via Hindsight Experience Replay (HER, Andrychowicz et al., 2017). HER enhances learning by relabeling experiences collected when seeking an intended goal with the actual state reached.
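The relabeling step is simple enough to sketch directly (a minimal illustration; the transition-tuple layout and `reward_fn` are assumptions, not HER's exact buffer format):

```python
# Minimal HER relabeling: replace the intended goal with the achieved one.
def her_relabel(episode, reward_fn):
    """episode: list of (state, action, next_state, original_goal).
    Relabels every transition with the goal actually reached at the end."""
    achieved = episode[-1][2]                 # final state, in hindsight
    return [(s, a, s2, achieved, reward_fn(s2, achieved))
            for (s, a, s2, _g) in episode]

# Toy use: the agent aimed for goal 5 but only reached state 2.
episode = [(0, +1, 1, 5), (1, +1, 2, 5)]
relabeled = her_relabel(episode, reward_fn=lambda s, g: 1.0 if s == g else 0.0)
```

Even a "failed" episode now ends in a successful transition for the relabeled goal, turning sparse-reward data into useful learning signal.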
These approaches enhance exploration by continually going towards different subgoals and pushing the agent’s capabilities, and facilitate transfer to novel tasks. Open research directions include developing more reliable metrics for goal difficulty and interestingness, and leveraging unsupervised environment design or human knowledge for more effective curricula.
Intrinsic Motivation
Intrinsic motivation in RL drives agents to learn new skills without explicit external rewards, focusing instead on curiosity, novelty, or information gain.
Common intrinsic signals include bonuses for visiting new or rarely seen states, which push agents to explore unfamiliar areas. Some methods, like Relative Novelty (Simsek and Barto, 2004) or First Return then Explore (Ecoffet et al., 2020), set subgoals whenever the agent experiences something new, while factored approaches, like HEXQ (Hengst et al., 2002), break the environment into smaller tasks based on how its parts interact. Together, these methods help agents explore more thoroughly, learn a wide range of behaviors, and make it easier to reuse learned skills in new situations.
Ongoing challenges include scaling factored skill discovery to high-dimensional observation spaces, developing robust novelty estimation in continuous or stochastic settings, and formally connecting intrinsic motivation with other option discovery techniques.
💾 Discovery through Offline Datasets
In this section, we assume access to a fixed dataset of pre-collected trajectories, $D$, generated by expert or arbitrary behavior policies. The dataset may lack rewards (“unsupervised” RL) or even actions.
$$ D = (\tau_{i})_{i=1}^N,\qquad \tau_{i} = (s_{t}^i, a_{t}^i, r_{t}^i)_{t=1}^T. $$
Variational Inference of Skill Latents
This class of offline skill discovery methods defines skills as latent variables inferred from unlabeled trajectories via reconstruction objectives. A common approach maximizes the likelihood of trajectories by introducing latent skill sequences that segment the data, often using a variational autoencoder (VAE) framework. Each trajectory is explained through a sequence of skill (and boundary) indicators, with the evidence lower bound (ELBO) guiding training.
Extensions incorporate regularization via minimum description length (MDL), encouraging concise and composable skill sets. These skills can be reused hierarchically, adapted across tasks, or used to augment the action space. More recent formulations impose mutual information constraints to balance skill diversity with fidelity to offline data.
This approach improves credit assignment, transfer, and exploration by distilling temporally coherent behaviors into representations in a latent skill space. Open challenges include improving optimization stability, leveraging reward signals more effectively, and scaling to more diverse and realistic domains.
Hindsight Subgoal Relabeling
Hindsight subgoal relabeling methods use offline datasets to identify and relabel important intermediate states (subgoals) within trajectories. By treating these subgoals as waypoints, agents can learn hierarchical policies that solve complex tasks more efficiently, even when external rewards are sparse or delayed. For example, algorithms partition trajectories into segments and train policies to reach each segment in sequence, while others relabel future states as short-term goals for low-level controllers. This decomposition improves credit assignment, sample efficiency, and makes learned behaviors easier to interpret. Key research directions include making subgoal choices more interpretable and scaling these approaches to environments with high-dimensional observations.
🧱 Discovery with Foundation Models
Discovery with foundation models leverages the priors encoded in large pretrained models to guide skill learning, reducing the burden of extracting structure purely from raw interaction. Although it assumes that pretrained models contain relevant domain knowledge, this is increasingly plausible as pretraining broadens. Because these models operate in natural language, they provide an interpretable and compositional abstraction space, enabling them to propose subgoals, decompose tasks, shape rewards, or act as goal-conditioned policies, with skill learning focused on realizing these model-generated structures.
Embedding Similarity
Embedding similarity methods use foundation models to define goal-conditioned rewards based on the similarity between pretrained embeddings of goals and current observations. Agents can be trained using either fine-tuned or frozen encoders. Open challenges include comparing different embedding models, expanding beyond text-image goals, and improving reward quality in complex or sparse environments.
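A minimal sketch of such a reward, assuming a frozen encoder (the random linear map below stands in for a pretrained model such as a CLIP-style encoder):

```python
# Embedding-similarity reward: cosine similarity between goal and
# observation embeddings under a frozen encoder.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))                  # stand-in frozen encoder weights

def embed(x):
    v = np.asarray(x, float) @ W
    return v / np.linalg.norm(v)              # unit-norm embedding

def similarity_reward(obs, goal):
    return float(embed(obs) @ embed(goal))    # cosine similarity in [-1, 1]
```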
Providing Feedback
Foundation models, especially LLMs, can provide reward feedback to RL agents by either directly evaluating the success of a behavior (e.g., “Did the agent achieve the goal?”) or by expressing preferences over pairs of trajectories or states. These scalar signals or preferences are then distilled into reward models, guiding agents to explore, improve credit assignment, and transfer skills, all based on natural language goals or descriptions. This approach leverages the generalization and reasoning abilities of LLMs while simplifying reward design.
Reward as Code
It is possible to use LLMs to generate reward functions as executable code, given a goal description and symbolic information from the environment. The LLM receives the task and environment details (such as Python class representations of objects) and outputs code that computes rewards for the agent’s states. This allows agents to optimize behavior using automatically generated rewards, often matching or exceeding human-crafted rewards in robotics and simulation tasks. This approach enables rapid transfer across tasks but relies on the availability of meaningful symbolic representations, motivating future work to extend code-based rewards to high-dimensional, less-structured domains.
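For concreteness, here is the kind of reward function an LLM might emit for a hypothetical "place the cube on the shelf" task (the `state` schema is entirely illustrative):

```python
# Illustrative LLM-generated reward code: sparse success bonus plus a
# dense distance-shaping term, computed from symbolic state features.
def reward(state):
    cube, shelf = state["cube_pos"], state["shelf_pos"]
    dist = sum((c - s) ** 2 for c, s in zip(cube, shelf)) ** 0.5
    on_shelf = dist < 0.05                          # success threshold
    return (1.0 if on_shelf else 0.0) - 0.1 * dist  # sparse + shaped terms
```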
Directly Modeling the Policy
LLMs can directly model goal-conditioned policies by generating code or action sequences conditioned on natural language goals and state information, bypassing the need for manually designed reward functions. In robotics and simulated domains, LLMs produce skill policies as executable code that interacts with APIs or environment abstractions, while in complex domains like Minecraft or web navigation, LLMs continually expand skill libraries through auto-curriculum and self-reflection. Some approaches directly prompt LLMs for low-level actions, demonstrating the ability to solve diverse tasks zero-shot. This paradigm enables rapid adaptation and compositional skill reuse but raises open challenges in adapting to new action spaces, varying embodiments, and maintaining performance without expensive fine-tuning.
🛠️ How to Use the Discovered Temporal Structure?
After discovering a library of temporally abstract behaviors, the next challenge is how to use them effectively. A common approach is the call-and-return model, where the agent selects one option that runs until termination. This simplifies control but limits flexibility. Alternatives like Generalized Policy Improvement (GPI) allow agents to evaluate multiple options simultaneously, selecting actions that outperform any single option’s policy.
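The GPI idea reduces, in the tabular case, to taking a max over options before taking a max over actions (a sketch; the array layout is an assumption):

```python
# GPI over options: act greedily w.r.t. the best option value per action.
import numpy as np

def gpi_action(q_options, s):
    """q_options: array [n_options, n_states, n_actions]."""
    return int(np.argmax(q_options[:, s, :].max(axis=0)))

q = np.zeros((2, 1, 3))
q[0, 0, 1] = 1.0          # option 0 values action 1 highly
q[1, 0, 2] = 0.5          # option 1 prefers action 2
```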
High-level policies over options can be trained in different ways. Model-free methods extend standard RL algorithms to operate over options, often enhanced by learning progress signals, exploration bonuses, or diversity incentives. Model-based methods learn predictive models over options to enable long-horizon planning with fewer steps. Combined with state abstraction, these models support more efficient planning in complex environments, especially when abstractions align with the agent’s own perspective, allowing for transfer across tasks.3
🚧 Final Remarks
We conclude by identifying open challenges in discovering temporal structure, and draw connections to related areas such as state and action abstraction, continual RL, programmatic RL, and multi-agent RL. We also highlight promising domains, like robotics, web agents, and open-ended games, where HRL may have the most transformative impact.
More formally, we call it temporal abstraction. In RL, we care about obtaining a good policy as the solution to a decision-making problem. Thus, in Hierarchical RL, we want to come up with a hierarchical or structured solution (policy), which is usually formed by action abstraction across time (i.e., temporal abstraction). That’s how temporal abstraction connects with hierarchy. ↩︎
The eigenvectors of the graph Laplacian are also known as proto-value functions (PVFs) in RL literature. ↩︎
For example, a home robot may encounter many different houses with various layouts and furniture. If the robot builds a representation based on its own sensors and actuators (e.g., “object is one meter in front of me” or “this action makes me turn right”), that abstraction is agent-centric. ↩︎