Discovering Temporal Structure: An Overview of Hierarchical Reinforcement Learning

📄 Read the paper


📘 Overview

From its humble beginnings in late 2023 to a detailed 80+ page deep dive, this project has evolved through many collaborative discussions with Martin and Akhil, and the steady guidance of George, Doina, and Marlos. I would also like to thank Xujie for his continued support throughout this project. I really appreciate all the support and insights along this journey, and I learned a lot about the fascinating world of Hierarchical Reinforcement Learning (HRL)!

What follows is an overview of the work:


📑 Table of Contents


🎯 Motivation & What is HRL?

Developing agents that can explore, plan, and learn in open-ended environments is a key desideratum in AI. Hierarchical Reinforcement Learning (HRL) tackles this challenge by identifying and utilizing temporal structure1—patterns in how decisions unfold over time.

Various terms such as options, skills, behaviors, subgoals, subroutines, and subtasks are often used interchangeably in the RL literature to describe temporal structures.

Yet, a key question remains:
What makes a temporal structure “good” or “useful”?

Our work provides insights into this question by examining the benefits of temporal abstraction through the lens of four fundamental decision-making challenges. In this framing, temporal structures are considered useful if they enable agents to effectively realize the associated benefits. Conversely, effective HRL algorithms that discover such structures inherently provide these benefits.


🧠 Core Benefits of (Useful) Temporal Structure / HRL

We identify four central benefits that useful temporal structures enable:

🔍 Exploration

  • Facilitates structured exploration by enabling agents to (a) discover and pursue meaningful subgoals, (b) diversify behavior across multiple directions or strategies, and (c) explore over longer time horizons than individual actions allow.

🔗 Credit Assignment

  • Supports more effective credit propagation by allowing learning signals (e.g., gradients or temporal-difference errors) to be assigned at higher levels of abstraction, thereby improving the identification of causally relevant decisions.

🔄 Transferability

  • Enables generalization across tasks and environments by learning behaviors or representations that can be reused under novel conditions.

👁️ Interpretability

  • Enhances interpretability by imposing structure on behavior, making the agent’s decision-making process easier for humans to analyze and explain in terms of high-level temporal components.

🧭 How Is Temporal Structure Discovered?

We survey methods that discover useful temporal structure in three settings (based on the availability of data / prior knowledge):

  • 🟢 Online interaction
  • 🟡 Offline datasets
  • 🔵 Foundation models (e.g., LLMs)

Each method is discussed based on its focus on the four key benefits.



📊 Summary of Methods & Their Focus on the Benefits

Methods are grouped below by discovery setting; the paper’s summary table additionally marks, for each method, which of the four benefits (credit assignment, exploration, transferability, interpretability) it has a general focus on (•) versus is explicitly designed for (••).

  • 🟢 Online: Bottleneck Discovery, Spectral Methods, Skill Chaining, Empowerment Maximization, Via Environment Reward, Optimizing HRL Benefits Directly, Meta-Learning, Curriculum Learning, Intrinsic Motivation
  • 🟡 Offline: Variational Inference, Hindsight Subgoal Relabeling
  • 🔵 Foundation Models: Embedding Similarity, Providing Feedback, Reward as Code, Direct Policy Modeling

↑ Back to TOC

Below is a brief description of each method. Methods are categorized based on the underlying principles used to discover options.


🌱 Discovery from Online Experience

Bottleneck Discovery Methods

Bottleneck discovery methods aim to identify critical states, known as bottlenecks, that connect otherwise distinct regions of the state space. These methods typically represent the environment as a graph with states as nodes and edges denoting single-step reachability, and employ strategies such as: (1) diverse density, selecting states prevalent in successful but not unsuccessful trajectories; (2) graph partitioning, using min-cut formulations to separate subgraphs and reveal bottleneck nodes; and (3) centrality-based measures like betweenness centrality, which highlight states appearing on many shortest paths.

While these approaches enhance sample efficiency and generalization, scaling to large or continuous spaces remains a key challenge, motivating more localized or clustered variants.
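
To make the graph-based view concrete, here is a minimal sketch of centrality-based bottleneck detection on a toy two-room layout, using networkx’s betweenness centrality. The layout, the doorway node, and the top-3 cutoff are illustrative choices, not any specific published algorithm.

```python
# Toy centrality-based bottleneck discovery on a state-transition graph.
import networkx as nx

# Two 3x3 "rooms" connected by a single doorway state ("door").
G = nx.Graph()
rooms = [[(r, c) for r in range(3) for c in range(3)],
         [(r, c + 4) for r in range(3) for c in range(3)]]
for room in rooms:
    for (r, c) in room:
        for (dr, dc) in [(0, 1), (1, 0)]:
            if (r + dr, c + dc) in room:
                G.add_edge((r, c), (r + dr, c + dc))
G.add_edge((1, 2), "door")   # doorway links the two rooms
G.add_edge("door", (1, 4))

# States on many shortest paths are candidate bottlenecks (option subgoals).
centrality = nx.betweenness_centrality(G)
bottlenecks = sorted(centrality, key=centrality.get, reverse=True)[:3]
print("candidate bottleneck states:", bottlenecks)
```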

↑ Back to TOC

Spectral Methods

Spectral methods discover options by analyzing the eigenstructure of matrix representations of the environment, such as the graph Laplacian or successor representation, to extract topological information about state connectivity and diffusion. By leveraging the eigenvectors (or eigenfunctions) of these matrices,2 these methods produce temporally extended actions (like eigenoptions) that guide agents along meaningful directions in the state space, often facilitating efficient exploration and transfer across tasks. Advances allow these representations to be approximated with neural networks in high-dimensional or continuous settings, enabling scalability.

Despite their strengths in capturing environment structure and enabling option reuse, open directions remain: improving representation learning, extending applicability to planning and partially observable domains, and integrating reward information for more nuanced behaviors.
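
As a concrete illustration, the sketch below computes proto-value functions from the combinatorial Laplacian of a small ring graph and uses the change in an eigenvector’s value along a transition as an eigenoption-style intrinsic reward. The graph size and the choice of the second eigenvector are illustrative assumptions.

```python
# Eigenoption-style discovery on a toy ring graph.
import numpy as np

n = 20                                   # states on a ring
A = np.zeros((n, n))
for s in range(n):                       # adjacency of the ring graph
    A[s, (s + 1) % n] = A[(s + 1) % n, s] = 1
L = np.diag(A.sum(axis=1)) - A           # combinatorial graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues sorted ascending
pvf = eigvecs[:, 1]                      # first non-constant eigenvector (a PVF)

# Intrinsic reward for a transition s -> s': the change in the eigenvector
# value; an eigenoption terminates once no positive reward remains.
def intrinsic_reward(s, s_next):
    return pvf[s_next] - pvf[s]

print(intrinsic_reward(0, 1), intrinsic_reward(1, 0))
```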

↑ Back to TOC

Sequentially Composable Options (Skill Chaining)

Sequentially composable options are temporally extended actions constructed so that each option terminates in a region from which another option can be reliably initiated, enabling robust sequential planning. The skill chaining algorithm is a canonical approach that incrementally learns options backward from the goal: starting with an option that reaches the goal, it recursively creates new options whose subgoals correspond to the initiation regions of previously learned options, continuing until the start state is covered. Initiation functions are learned as probabilistic classifiers or value functions, often with explicit uncertainty estimation.

This explicit composability facilitates efficient planning and accelerates credit assignment and exploration, but current methods are best suited to tasks with clearly defined goals or subgoals. Open directions include generalizing to non-goal-reaching settings and exploiting factored state representations for more modular skill composition.
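
The backward-chaining idea can be sketched on a toy 1-D chain of states, where each new option’s termination region is the initiation region of the previously created option. The fixed option reach and the interval-based initiation sets are simplifications; real skill chaining learns initiation classifiers and option policies from experience.

```python
# Backward chaining of options on a 1-D chain of states.
N, K = 30, 7                       # number of states, reach of one option
goal = N - 1
options = []                       # each option: (initiation set, termination set)

termination = {goal}               # the first option must reach the goal
covered = {goal}
while 0 not in covered:
    lo = max(0, min(termination) - K)
    initiation = set(range(lo, min(termination)))   # states the option starts from
    options.append((initiation, termination))
    covered |= initiation
    termination = initiation       # the next option targets this initiation region

for i, (init, term) in enumerate(reversed(options)):
    print(f"option {i}: start in {min(init)}..{max(init)}, "
          f"terminate in {min(term)}..{max(term)}")
```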

↑ Back to TOC

Empowerment Maximization

Empowerment maximization methods discover diverse and controllable skills by maximizing the agent’s influence over its future observations, formalized as the mutual information between action sequences (or skill variables) and future states. Practically, this is achieved via variational inference and neural estimators, resulting in algorithms that learn a set of skills distinguishable by their effects on the environment, often using discriminators and mutual information objectives as intrinsic rewards.

These methods have demonstrated strong exploratory capabilities and can adapt rapidly to new tasks, but often produce skills localized to specific regions of the state space, limiting transfer. Recent work focuses on improving the scalability and composability of skills, connecting empowerment to causal representation learning and goal-based exploration, and developing methods to generalize skills beyond specific states to more global or transferable behavioral changes.
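
A minimal sketch of the mutual-information objective, in the spirit of DIAYN: a discriminator tries to infer the skill from the visited state, and the intrinsic reward is log q(z|s) − log p(z) under a uniform skill prior. Network sizes and dimensions are illustrative assumptions.

```python
# DIAYN-style skill discovery: discriminator-based intrinsic reward.
import torch
import torch.nn.functional as F

n_skills, state_dim = 8, 4
discriminator = torch.nn.Sequential(      # q(z | s), sizes are illustrative
    torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, n_skills),
)
log_p_z = torch.log(torch.tensor(1.0 / n_skills))   # uniform prior over skills

def intrinsic_reward(state, skill):
    """Reward the skill-conditioned policy for visiting states that reveal its skill."""
    log_q = F.log_softmax(discriminator(state), dim=-1)[skill]
    return (log_q - log_p_z).item()

def discriminator_loss(states, skills):
    """Train the discriminator to classify which skill produced each state."""
    return F.cross_entropy(discriminator(states), skills)

states = torch.randn(32, state_dim)
skills = torch.randint(0, n_skills, (32,))
print(intrinsic_reward(states[0], skills[0].item()),
      discriminator_loss(states, skills).item())
```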

↑ Back to TOC

Via Environment Rewards

This class of methods learns hierarchical policies directly from the environment’s external reward, rather than relying on intrinsic or pseudo-rewards. Feudal methods decompose control into managers and workers, with higher-level managers setting subgoals for workers to achieve, enabling temporal abstraction and goal decomposition, and implemented in deep RL as Feudal Networks (FuN), HIRO, and Director. Option-critic methods instead jointly learn intra-option policies, termination functions, and the high-level policy using policy gradient theorems derived directly from the environmental reward signal.

Both approaches produce options or skills that are interpretable and transferable, often capturing meaningful structure in the environment. However, they can suffer from degeneracy (e.g., options collapsing to primitive actions) and rely heavily on informative environment rewards, motivating recent work on regularization, intrinsic bonuses, or leveraging demonstrations to support learning in sparse reward settings.
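
To illustrate the feudal decomposition, here is a sketch of how the two levels are rewarded in a HIRO-style setup: the manager’s SMDP transition accumulates the environment reward over its decision interval, while the worker receives a goal-reaching reward for realizing the commanded (relative) subgoal. The interval length and state shapes are illustrative assumptions.

```python
# Feudal/HIRO-style reward split between manager and worker.
import numpy as np

def worker_reward(state, subgoal, next_state):
    # HIRO's worker reward: negative distance between the achieved
    # displacement and the commanded relative subgoal.
    return -np.linalg.norm(state + subgoal - next_state)

def manager_transition(states, env_rewards, subgoal):
    # The manager observes the state every k steps and is credited with the
    # environment reward accumulated in between (an SMDP transition).
    return states[0], subgoal, sum(env_rewards), states[-1]

k = 10
states = [np.random.randn(3) for _ in range(k + 1)]
env_rewards = np.random.rand(k)
subgoal = np.random.randn(3)
print(worker_reward(states[0], subgoal, states[1]))
print(manager_transition(states, env_rewards, subgoal)[2])
```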

↑ Back to TOC

Optimizing HRL Benefits Directly

This class of methods addresses the limitations of proxy-based objectives in option discovery by explicitly formulating and optimizing for agent-level benefits, such as planning efficiency, exploration coverage, credit assignment speed, and transferability, with formal guarantees. Rather than targeting bottleneck states or maximizing empowerment as surrogates, these approaches define precise performance criteria (e.g., minimizing planning or cover time, accelerating policy evaluation, or reducing transfer sample complexity), and then derive algorithms that optimize option sets under these metrics, often with provable bounds or approximation guarantees. Key developments include option discovery for minimizing planning iterations (subject to cardinality constraints), reducing expected time to reach all states, and enhancing value propagation using concepts from matrix preconditioning.

While such formulations offer rigorous foundations for HRL, the resulting optimization problems are often NP-hard in practice. Current research seeks to extend guarantees to more general settings, including function approximation and richer option classes.

↑ Back to TOC

Meta Learning

Meta RL aims to learn the RL algorithm itself, or parts of it, rather than just a policy. This results in a bilevel optimization: the outer loop learns the parameters of the algorithm, while the inner loop adapts a policy for a specific task.

A common approach uses meta-gradients: the inner loop updates parameters for policies or options using standard RL objectives, while the outer loop updates meta-parameters by differentiating through the inner loop, capturing how changes in meta-parameters influence learning progress. Black box meta RL, instead, leverages recurrent models, such as RNNs or Transformers, to encode experience across multiple episodes and tasks. The agent interacts with a sequence of tasks, updating its internal memory based on observations and rewards, and only resets this memory between tasks.

Meta-gradient methods and black box meta RL both facilitate the discovery of options in HRL by optimizing for reusable behaviors across tasks. Specifically, meta-gradients can be used to learn option policies, reward functions, and termination conditions that generalize well, while black box meta RL enables agents to adaptively select or compose skills based on accumulated experience, thus discovering and refining temporally abstract options that accelerate learning in new tasks.
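
The bilevel structure can be illustrated with a toy example: the inner loop takes a gradient step on its own objective, and the outer loop adjusts a meta-parameter (here, just the inner step size) by differentiating through that step with finite differences. The quadratic losses are stand-ins for the RL objectives used by actual meta-RL methods.

```python
# Toy bilevel optimization: a meta-gradient on the inner-loop step size.
def inner_update(theta, eta):
    grad = theta - 1.0            # gradient of the inner loss 0.5 * (theta - 1)^2
    return theta - eta * grad     # one inner-loop adaptation step

def outer_loss(theta):
    return 0.5 * (theta - 1.0) ** 2   # evaluate the adapted parameters

eta, theta, eps, meta_lr = 0.1, 2.0, 1e-4, 0.1
for step in range(50):
    # Meta-gradient of the outer loss w.r.t. eta, through the inner update
    # (finite differences stand in for backpropagating through the update).
    g = (outer_loss(inner_update(theta, eta + eps))
         - outer_loss(inner_update(theta, eta - eps))) / (2 * eps)
    eta -= meta_lr * g                  # outer loop: update the meta-parameter
    theta = inner_update(theta, eta)    # inner loop: adapt with the new eta

print(f"learned inner step size: {eta:.3f}, adapted theta: {theta:.3f}")
```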

A key opportunity for research is to relax the specialized multi-task assumption required by many meta RL approaches. Broadening the framework to more general, less structured settings, possibly via in-context learning or more flexible curricula, would further increase the applicability and scalability of meta learning in RL.

↑ Back to TOC

Curriculum Learning

Curriculum learning enables agents to master challenging goals by progressing through a sequence of tasks with increasing complexity, allowing gradual skill acquisition.

In HRL, the high-level policy selects goals based on measures of learning progress, such as recent improvements in achieving specific goals. The objective is to optimize goal selection so that iterative updates lead to high performance across the entire goal space. Local approximations of learning progress and bandit-based goal selection are common, while implicit curricula, such as hindsight experience replay, further guide agents by relabeling experiences with alternative goals. These approaches enhance exploration by continually pushing the agent’s capabilities, support the discovery of diverse behaviors, and facilitate transfer to novel tasks.
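
A minimal sketch of learning-progress-based goal selection: per-goal success rates are tracked over a sliding window, and goals whose success rate is changing the most are sampled more often. The window size and the epsilon-greedy mixing are illustrative choices.

```python
# Goal selection driven by an estimate of learning progress.
import numpy as np
from collections import defaultdict, deque

n_goals, window = 5, 20
history = defaultdict(lambda: deque(maxlen=2 * window))   # per-goal success outcomes

def learning_progress(goal):
    h = list(history[goal])
    if len(h) < 2 * window:
        return 1.0                       # treat under-explored goals as promising
    old, recent = np.mean(h[:window]), np.mean(h[window:])
    return abs(recent - old)             # absolute change in success rate

def select_goal(eps=0.2, rng=np.random.default_rng(0)):
    if rng.random() < eps:               # keep some uniform exploration
        return int(rng.integers(n_goals))
    lp = np.array([learning_progress(g) for g in range(n_goals)])
    probs = lp / lp.sum() if lp.sum() > 0 else np.ones(n_goals) / n_goals
    return int(rng.choice(n_goals, p=probs))

history[2].append(1.0)                   # record whether the attempted goal was achieved
print("next goal to practice:", select_goal())
```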

Open research directions include developing more reliable metrics for goal difficulty and interestingness, and leveraging unsupervised environment design or human knowledge for more effective curricula.

↑ Back to TOC

Intrinsic Motivation

Intrinsic motivation in RL drives agents to explore and learn new skills without explicit external rewards, focusing instead on curiosity, novelty, or information gain. This developmental approach enables agents to autonomously discover reusable skills, often represented as options with subgoals that are interesting or novel.

Common intrinsic signals include bonuses for visiting new or rarely seen states, which push agents to explore unfamiliar areas. Some methods, like Relative Novelty or First Return then Explore, set subgoals whenever the agent experiences something new, while factored approaches, like HEXQ, break the environment into smaller tasks based on how its parts interact. Together, these methods help agents explore more thoroughly, learn a wide range of behaviors, and make it easier to reuse learned skills in new situations.
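
As a simple illustration, the sketch below implements a count-based novelty bonus and flags rarely visited states as candidate subgoals. The bonus scale and the visit-count threshold are illustrative assumptions.

```python
# Count-based novelty bonus with novel states flagged as candidate subgoals.
from collections import Counter

visit_counts = Counter()
beta = 0.1

def novelty_bonus(state):
    visit_counts[state] += 1
    return beta / visit_counts[state] ** 0.5   # decays as the state becomes familiar

def is_candidate_subgoal(state, threshold=2):
    # e.g., states that are still rarely visited are potential option subgoals
    return visit_counts[state] <= threshold

for s in ["room1", "room1", "doorway", "room2"]:
    print(s, round(novelty_bonus(s), 3), is_candidate_subgoal(s))
```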

Ongoing challenges include scaling factored skill discovery to high-dimensional observation spaces, developing robust novelty estimation in continuous or stochastic settings, and formally connecting intrinsic motivation with other option discovery techniques.

↑ Back to TOC


💾 Discovery through Offline Datasets

Variational Inference of Skill Latents

This class of offline skill discovery methods defines skills as latent variables inferred from unlabeled trajectories via reconstruction objectives. A common approach maximizes the likelihood of trajectories by introducing latent skill sequences that segment the data, often using a variational autoencoder (VAE) framework. Each trajectory is explained through a sequence of skill (and boundary) indicators, with the evidence lower bound (ELBO) guiding training.

Extensions incorporate regularization via minimum description length (MDL), encouraging concise and composable skill sets. These skills can be reused hierarchically, adapted across tasks, or used to augment the action space. More recent formulations impose mutual information constraints to balance skill diversity with fidelity to offline data.

This approach improves credit assignment, transfer, and exploration by distilling temporally coherent behaviors into representations in a latent skill space. Open challenges include improving optimization stability, leveraging reward signals more effectively, and scaling to more diverse and realistic domains.
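
The following sketch shows the shape of such a VAE objective on trajectory segments: a recurrent encoder infers a latent skill from a segment, a decoder reconstructs actions from states and the latent, and the ELBO combines reconstruction with a KL penalty toward a standard normal prior. The architecture, the fixed segment length, and the Gaussian likelihood are illustrative assumptions.

```python
# A VAE-style skill model over offline trajectory segments.
import torch
import torch.nn.functional as F

state_dim, action_dim, latent_dim, seg_len = 6, 2, 4, 10

encoder = torch.nn.GRU(state_dim + action_dim, 32, batch_first=True)
to_stats = torch.nn.Linear(32, 2 * latent_dim)              # -> (mu, log_var)
decoder = torch.nn.Sequential(
    torch.nn.Linear(state_dim + latent_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, action_dim),
)

def elbo(states, actions):
    # Encode the whole segment into q(z | tau).
    _, h = encoder(torch.cat([states, actions], dim=-1))
    mu, log_var = to_stats(h[-1]).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()    # reparameterization
    # Decode: predict each action from its state and the skill latent.
    z_tiled = z.unsqueeze(1).expand(-1, seg_len, -1)
    pred = decoder(torch.cat([states, z_tiled], dim=-1))
    recon = -F.mse_loss(pred, actions)                       # Gaussian log-likelihood up to a constant
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1).mean()
    return recon - kl                                        # quantity to maximize

states = torch.randn(8, seg_len, state_dim)
actions = torch.randn(8, seg_len, action_dim)
print("ELBO:", elbo(states, actions).item())
```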

↑ Back to TOC

Hindsight Subgoal Relabeling

Hindsight subgoal relabeling methods use offline datasets to identify and relabel important intermediate states (subgoals) within trajectories. By treating these subgoals as waypoints, agents can learn hierarchical policies that solve complex tasks more efficiently, even when external rewards are sparse or delayed. For example, some algorithms partition trajectories into segments and train policies to reach each segment in sequence, while others relabel future states as short-term goals for low-level controllers. This decomposition improves credit assignment and sample efficiency, and makes learned behaviors easier to interpret. Key research directions include making subgoal choices more interpretable and scaling these approaches to environments with high-dimensional observations.
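
A minimal sketch of the relabeling step: states a few steps ahead in an offline trajectory are treated as subgoals, yielding goal-conditioned transitions with goal-reaching rewards even when the environment reward is sparse. The fixed lookahead horizon is an illustrative choice.

```python
# Hindsight relabeling of future states as subgoals in an offline trajectory.
def relabel_with_subgoals(states, horizon=5):
    """states: the sequence of states from one offline trajectory."""
    transitions = []
    for t in range(len(states) - 1):
        g = min(t + horizon, len(states) - 1)
        subgoal = states[g]                        # a future state becomes the goal
        reward = 1.0 if t + 1 == g else 0.0        # reward for reaching the relabeled subgoal
        transitions.append((states[t], subgoal, states[t + 1], reward))
    return transitions

trajectory = [(i, i % 3) for i in range(12)]       # toy 2-D state sequence
for s, g, s_next, r in relabel_with_subgoals(trajectory)[-3:]:
    print(f"state={s} subgoal={g} next={s_next} reward={r}")
```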

↑ Back to TOC


🧱 Discovery with Foundation Models

Embedding Similarity

Embedding similarity methods use foundation models to define goal-conditioned rewards based on the similarity between pretrained embeddings of goals and current observations. Agents can be trained using either fine-tuned or frozen encoders. Open challenges include comparing different embedding models, expanding beyond text-image goals, and improving reward quality in complex or sparse environments.
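
As a sketch, the reward below is the cosine similarity between the embedded goal and the embedded observation. The random projection stands in for a frozen pretrained (e.g., vision-language) encoder and is purely illustrative.

```python
# Goal-conditioned reward from embedding similarity.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 128))        # stand-in for a frozen pretrained encoder

def embed(x):
    v = W @ np.resize(np.asarray(x, dtype=float), 128)
    return v / np.linalg.norm(v)

def similarity_reward(goal, observation):
    return float(embed(goal) @ embed(observation))   # cosine similarity in [-1, 1]

goal = rng.random(128)                    # e.g., the embedding of a text instruction
obs = goal + 0.1 * rng.standard_normal(128)
print(similarity_reward(goal, obs), similarity_reward(goal, rng.random(128)))
```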

↑ Back to TOC

Providing Feedback

Foundation models, especially LLMs, can provide reward feedback to RL agents by either directly evaluating the success of a behavior (e.g., “Did the agent achieve the goal?”) or by expressing preferences over pairs of trajectories or states. These scalar signals or preferences are then distilled into reward models, guiding agents to explore, improve credit assignment, and transfer skills, all based on natural language goals or descriptions. This approach leverages the generalization and reasoning abilities of LLMs while simplifying reward design.
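
One common way to distill such pairwise preferences into a reward model is a Bradley-Terry objective, sketched below. The reward-model architecture and feature dimensions are illustrative assumptions.

```python
# Distilling LLM preferences into a reward model with a Bradley-Terry loss.
import torch
import torch.nn.functional as F

reward_model = torch.nn.Sequential(       # maps trajectory/state features to a scalar reward
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1),
)

def preference_loss(preferred, rejected):
    """preferred / rejected: feature batches for the item the LLM preferred vs. the other."""
    r_pos = reward_model(preferred).squeeze(-1)
    r_neg = reward_model(rejected).squeeze(-1)
    # Maximize the probability that the preferred item receives the higher reward.
    return -F.logsigmoid(r_pos - r_neg).mean()

preferred = torch.randn(16, 32)
rejected = torch.randn(16, 32)
loss = preference_loss(preferred, rejected)
loss.backward()                           # gradients for training the reward model
print("preference loss:", loss.item())
```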

↑ Back to TOC

Reward as Code

It is possible to use LLMs to generate reward functions as executable code, given a goal description and symbolic information from the environment. The LLM receives the task and environment details (such as Python class representations of objects) and outputs code that computes rewards for the agent’s states. This allows agents to optimize behavior using automatically generated rewards, often matching or exceeding human-crafted rewards in robotics and simulation tasks. This approach enables rapid transfer across tasks but relies on the availability of meaningful symbolic representations, motivating future work to extend code-based rewards to high-dimensional, less-structured domains.
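
For intuition, here is a hypothetical example of the kind of reward function an LLM might emit for a goal like "place the cube on the shelf", given symbolic object states. The CubeState/ShelfState fields and the distance-based shaping are invented for illustration; real systems define the exact interface the LLM writes against.

```python
# A hypothetical LLM-generated reward function over symbolic state objects.
from dataclasses import dataclass

@dataclass
class CubeState:            # illustrative symbolic state, not a real API
    position: tuple         # (x, y, z) of the cube
    grasped: bool

@dataclass
class ShelfState:
    position: tuple         # (x, y, z) of the shelf surface

def compute_reward(cube: CubeState, shelf: ShelfState) -> float:
    dist = sum((a - b) ** 2 for a, b in zip(cube.position, shelf.position)) ** 0.5
    on_shelf = dist < 0.05 and not cube.grasped
    return 10.0 if on_shelf else -dist    # sparse success bonus plus distance shaping

print(compute_reward(CubeState((0.0, 0.0, 0.5), False), ShelfState((0.0, 0.02, 0.5))))
```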

↑ Back to TOC

Directly Modeling the Policy

LLMs can directly model goal-conditioned policies by generating code or action sequences conditioned on natural language goals and state information, bypassing the need for manually designed reward functions. In robotics and simulated domains, LLMs produce skill policies as executable code that interacts with APIs or environment abstractions, while in complex domains like Minecraft or web navigation, LLMs continually expand skill libraries through auto-curriculum and self-reflection. Some approaches directly prompt LLMs for low-level actions, demonstrating the ability to solve diverse tasks zero-shot. This paradigm enables rapid adaptation and compositional skill reuse but raises open challenges in adapting to new action spaces, varying embodiments, and maintaining performance without expensive fine-tuning.

↑ Back to TOC


🛠️ How to Use the Discovered Temporal Structure?

After discovering a library of temporally abstract behaviors, the next challenge is how to use them effectively. A common approach is the call-and-return model, where the agent selects one option that runs until termination. This simplifies control but limits flexibility. Alternatives like Generalized Policy Improvement (GPI) allow agents to evaluate multiple options simultaneously, selecting actions that outperform any single option’s policy.

High-level policies over options can be trained in different ways. Model-free methods extend standard RL algorithms to operate over options, often enhanced by learning progress signals, exploration bonuses, or diversity incentives. Model-based methods learn predictive models over options to enable long-horizon planning with fewer steps. Combined with state abstraction, these models support more efficient planning in complex environments, especially when abstractions align with the agent’s own perspective, allowing for transfer across tasks.3
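
The contrast between call-and-return execution and GPI can be sketched with tabular option value functions: GPI acts greedily with respect to the maximum value over all options in every state, whereas call-and-return commits to a single option’s greedy policy until it terminates. The random Q-values below are placeholders.

```python
# Generalized Policy Improvement (GPI) over a set of option value functions.
import numpy as np

n_states, n_actions, n_options = 10, 4, 3
rng = np.random.default_rng(0)
Q = rng.random((n_options, n_states, n_actions))   # one Q-function per option (placeholder values)

def gpi_action(state):
    # Max over options first, then greedy over actions.
    return int(np.argmax(Q[:, state, :].max(axis=0)))

def call_and_return_action(state, active_option):
    # For contrast: follow one option's greedy policy until it terminates.
    return int(np.argmax(Q[active_option, state]))

print("GPI action:", gpi_action(0), "| option-0 action:", call_and_return_action(0, 0))
```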

↑ Back to TOC


🚧 Final Remarks

We conclude by identifying open challenges in discovering temporal structure, and draw connections to related areas such as state and action abstraction, continual RL, programmatic RL, and multi-agent RL. We also highlight promising domains, like robotics, web agents, and open-ended games, where HRL may have the most transformative impact.


  1. More formally, we call it temporal abstraction. ↩︎

  2. The eigenvectors of the graph Laplacian are also known as proto-value functions (PVFs) in RL literature. ↩︎

  3. For example, a home robot may encounter many different houses with various layouts and furniture. If the robot builds a representation based on its own sensors and actuators (e.g., “object is one meter in front of me” or “this action makes me turn right”), that abstraction is agent-centric. ↩︎
