Understanding Behavioral Metric Learning

📄 Read the RLC paper

A 50+ pages of deep dive, acceped by Reinforcement Learning Conference (RLC) 2025. Here are the slides for the RLC talk (10 min talk slides).

See a detailed version of slides (for Mila RL Sofa) as well (60 min talk slides).

Tianwei and I have been drawn to a central question in RL: How can we design algorithms to build principled, general, and scalable state abstractions? As deep RL begins to tackle domains with high-dimensional observations, whether raw pixels, rich sensor data, or symbolic programs, the need for structured, compact representations becomes urgent.

📑 Table of Contents

🛤️ Story behind this work
- Key Research Questions
🔍 Conceptual Analysis
📏 Study Design on Metric Learning: Noise and Denoising
🧪 Experiment
- Benchmarking Results
- Case Study: What Matters in Metric & Representation Learning?
  - Normalization Effects
  - ZP Effects
  - Level-up: Challenging Setting Evaluation
- Isolated Metric Evaluation: Does Metric Learning Help with Denoising?
- OOD Generalization on Pixel-based Tasks
🪄 Takeaways

🛤️ Story behind this work

In our search for principled solutions for state abstraction (without assuming any prior knowledge or predefined structure), we found ourselves captivated by behavioral metric learning. The idea is elegant yet powerful:

📦 Core Idea: Define a distance between states based on how the current state “behaves” — the next observation and reward it leads to, given an action.

An important example is the well-known bisimulation metric, which is defined as:

📐 Definition: Bisimulation Metric in MDP

Consider an MDP with state space $X$ , action space $A$ , transition kernel $P (\cdot ∣ x, a)$ , reward function $R (x, a)$ , and discount factor $γ \in [0, 1)$ .

There exists a unique pseudometric $d^{\sim} : X \times X \to R$ , called the bisimulation metric (BSM) (Ferns et al., 2004)¹, defined as:

$d^{\sim} (x_{1}, x_{2}) ≜ max_{a \in A} (c_{R} | R (x_{1}, a) - R (x_{2}, a) | + c_{T} W_{1} (d^{\sim}) (P (\cdot ∣ x_{1}, a), P (\cdot ∣ x_{2}, a))),$

with $W_{1}$ referring to 1-Wasserstein (Kantorovich) distance, $c_{R}$ and $c_{T}$ are coefficients of short-term and long-term behavioral differences.

This behavior-grounded geometry echoes classical abstractions in formal systems (e.g., labelled transition systems²) grounded by equivalence relations, while enabling a more flexible, data-driven approach. Below is an example of a bisimulation relation, which underlies the bisimulation metric³. Recent methods propose more easy-to-compute metrics, often trading off theoretical guarantees for reduced approximation bias (see target metrics section).

bisim_relation

We can also define such a distance for state-action pairs, which relates to MDP homomorphism and lax bisimulation metrics.

In practice, many recent works implement this idea by viewing learning metrics as an auxiliary task to guide the representation toward capturing meaningful behavioral state equivalences.

For brevity, we call this “behavioral metric learning (in deep RL, mostly)” as “metric learning” for short.

As we dug deeper, we noticed an interesting trend. Most papers emphasize denoising in visually distracting settings as their main validation strategy. The field appears driven by the empirical observation that better representations — mainly measured by final return — emerge in noisy environments when metric learning is applied. Motivated by this, we initially set out to propose novel extensions of these ideas in broader RL contexts, e.g., for more efficient planning, better generalization, etc.

But as we read further, several major structural issues became hard to ignore.

Fragmented empirical landscape. Results vary widely across papers. Experimental setups are often underspecified, and discrepancies between codebases and reported results make it difficult to assess what truly works.
Opaque evaluation. Most methods are evaluated solely through final episodic return. It is rarely clear whether the learned metric is the cause of the performance gain, or merely correlated. Metric learning method. The chain from metric learning to representation improvement to return is seldom examined in detail.
Limited unification across methods. While many recent methods share common underlying ideas, explicit conceptual connections between them are seldom articulated. Formal links between different metrics, or between metrics and representation learning in RL, remain underexplored in the literature.

Key Research Questions

We are now thinking more critically about how to move this area forward. This motivation led to a one-year effort on this work, aiming at addressing the following core research questions:

❓ Q1: First of all, at the very beginning, why / why not does metric learning denoise?
- Q1.1: Which metrics theoretically support denoising, and which do not?
- Q1.2: Does it still denoise in the presence of approximation error? What are the sources of approximation error?
❓ Q2: What is the connection between metric learning and representation learning in recent methods?
- Q2.1: How are recent methods related or unified under a common perspective?
- Q2.2: Other recent representation learning approaches, such as self-prediction and DeepMDP, have also demonstrated denoising capabilities⁴. How are these methods connected to metric learning?

❓ Q3: ⭐ How Are Metrics Learned in Deep RL? ⭐ How should we understand abstraction quality beyond return?
- Q3.1: How can we evaluate a behavioral metric when exact computation is infeasible in large observation spaces?
❓ Q4: How do we isolate the effectiveness of metric learning from other contributing factors or design choices?
❓ Q5: Can we develop a more robust codebase that addresses the aforementioned issues and better supports the community?

↑ Back to TOC

🔍 Conceptual Analysis

We aim to find an encoder that maps noisy observations into a structured representation space, which facilitates RL by ensuring that task‑relevant variations are captured.

motivation

A natural way to formalize this goal is through an isometric embedding⁵:

Definition: Isometric Embedding

An encoder $ϕ : X \to Ψ$ is an isometric embedding if the distances in the original space $(X, d_{X})$ are preserved in the representation space $(Ψ, d_{Ψ})$ . Formally,

$d_{X} (x_{1}, x_{2}) = d_{Ψ} (ϕ (x_{1}), ϕ (x_{2})), \forall x_{1}, x_{2} \in X,$

where $d_{X}$ is the target metric and $d_{Ψ}$ is the representational metric.

Target Metrics $d_{X}$

A target metric, inherent in an MDP (like an optimal value function), captures differences in rewards and transition dynamics, with a general form⁶ of:

$d_{X} (x_{1}, x_{2}) := c_{R} d_{R} (x_{1}, x_{2}) + c_{T} d_{T} (d_{X}) (P (\cdot ∣ x_{1}), P (\cdot ∣ x_{2})) .$

To approximate this,

$d_{X} (x_{1}, x_{2}) \approx {\hat{d}}_{X} (x_{1}, x_{2}) = c_{R} {\hat{d}}_{R} (r_{1}, r_{2}) + c_{T} {\hat{d}}_{T} ({\hat{d}}_{X}) (\hat{P} (\cdot ∣ x_{1}), \hat{P} (\cdot ∣ x_{2})) .$

Here $r_{1}, r_{2}$ are sampled immediate rewards, $d_{R}$ and $d_{T}$ denote immediate and long‑term similarity, and ${\hat{d}}_{R}$ , ${\hat{d}}_{T}$ are their approximations.

We provide some examples of recent advances of how target metrics are defined:

📊 Summary of Key Implementation Choices for Benchmarked Methods

This table summarizes metric learning methods and their key design choices, based on their open-source implementation. For each method, see paper’s Appendix C.2 and C.3 for details.

Aside from inherent metric design choice, mind the key implementation choice: self-prediction (ZP), reward prediction (RP), and normalization (Norm.)! And see our paper for a detailed discussion.

Method	$\hat{d_{R}}$	$\hat{d_{T}}$	$d_{Ψ}$	Metric Loss	Target Trick	Other Losses	Transition Model	Norm.
SAC	—	—	—	—	—	—	—	—
DeepMDP	—	—	—	—	—	RP + ZP	Probabilistic	—
DBC	Huber	$W_{2}$ closed-form	Huber	MSE	—	RP + ZP	Probabilistic	—
DBC-normed	Huber	$W_{2}$ closed-form	Huber	MSE	—	RP + ZP	Deterministic	MaxNorm
MICo	Abs.	Sample-based	Angular	Huber	✓	—	—	—
RAP	RAP	$W_{2}$ closed-form	Angular	Huber	—	RP + ZP	Probabilistic	—
SimSR	Abs.	Sample-based	Cosine	Huber	—	ZP	Prob. ensemble	L2Norm

What is Denoising?

We define denoising as learning to ignore task-irrelevant noise in observations, so the model can recognize different-looking inputs that actually represent the same underlying situation.

An encoder $ϕ$ achieves perfect denoising if:

It maps noisy observationshat correspond to the same task-relevant state ( $x$ and $x_{+}$ we define later) to the same representation.
It maps observations from different (underlying) states ( $x$ and $x_{-}$ we define later) to different representations.

This mirrors the behavior of an oracle encoder $ϕ^{*}$ , which sees through the noise and keeps only what matters for the task.

Why (why not) Metrics Help with Denoise?

Here, we show connections between the target metric $d_{X}$ and its role in denoising.

Metrics may aid denoising:
- Bisimulation metric (BSM) yields perfect denoising in EX-BMDPs⁷.
- PBSM supports denoising when the policy is exo-free, i.e., noise-irrelevant.
- MICo does not guarantee zero distance between bisimilar observations unless both policy and transition are deterministic, yet it helps empirically according to prior work.

Metrics have limitations in denoising:
- Exact BSM is hard to approximate, so methods use PBSM or MICo instead.
- PBSM may fail to denoise under non-exo-free or even optimal policies due to differing rewards or transitions. In fact, an exo-free policy is something we want as the output of the algorithm!
- Practical challenges in metric approximation:
  - Metrics like PBSM/MICo are on-policy, but rewards used are off-policy (from replay buffer). Thus, the reward difference term is biased.
  - Reward/transition models are approximated, introducing errors.
  - Multiple losses (metric, ZP, critic) jointly shape the representation, which may confound the source of performance gain, or interfere with denoising.

These motivate further empirical investigation into whether metric learning facilitates denoising and learning.

↑ Back to TOC

📏 Study Design on Metric Learning: Noise and Denoising

This table summarizes the limitations of prior metric-learning work and our corresponding amendments.

Aspect	Prior Work	Our Study Design
Task Diversity	Limited test environments: few tasks with grayscale natural video backgrounds	Diverse state-based and pixel-based noise settings across tasks
Generalization Evaluation	Entangled: evaluation only on unseen videos (OOD), hard to know the source of difficulty	Clear separation of ID and OOD generalization via distinct train/test noise
Evaluation Measure	Indirect: impact mainly on evaluation return	Direct: proposed Denoising Factor (DF) as a targeted representation measure
Loss Attribution	Mixed losses (e.g., critic, ZP, RP) obscure metric learning effect	Isolated metric evaluation disentangles representation from RL objectives

Noise Settings

state_noise

State-based Noise
- IID Gaussian: Concatenates IID Gaussian (univariate) noise directly to the state. Difficulty is adjusted via noise dimension $m$ or standard deviation $σ$ .
- IID Gaussian Noise with Random Projection: Combines state and noise vectors, then applies a full-rank random projection $x_{t} = A z_{t}$ . Enables “decryption” but increases complexity by entangling signal and noise.

pixel_noise

Pixel-based Noise
- Natural Images: Replaces clean backgrounds with static natural images (grayscale or colored) for each training run; introduces visual complexity.
- Natural Videos: Replaces background with dynamic natural videos; frame index $ξ_{t}$ advances cyclically over time.
- IID Gaussian: IID Gaussian noise is added per pixel to the background, overlaid by the robot foreground.
ID vs. OOD Evaluation
- In-distribution (ID): Same noise distribution during training and testing (e.g., Gaussian noise with fixed $σ$ ).
- Out-of-distribution (OOD): Task-relevant components remain fixed, but noise distributions shift between training and testing (e.g., train/test with different video datasets). Common in prior metric learning evaluations.

Denoising Factor: Quantifying Denoising

The denoising factor (DF) measures how well an encoder $ϕ$ removes task-irrelevant noise while preserving task-relevant structure in its learned representation. It is evaluated under a specific policy $π$ and is based on comparing two types of observation pairs:

Positive Example ( $x_{+}$ ): A different observation of the same underlying state as an anchor $x$ , but with different noise (e.g., different background or sensor noise).
Negative Example ( $x_{-}$ ): An observation sampled independently from a different state.

pos_neg

We define:

Positive Score: Measures the average distance between an anchor $x$ and a positive example $x_{+}$ that shares the same task-relevant state ( $ϕ^{*} (x) = ϕ^{*} (x_{+})$ ):

${Pos}_{d_{Ψ}}^{π} (ϕ) := E_{x \sim ρ_{π}, ξ_{+} \sim ρ (ξ_{+})} [d_{Ψ} (ϕ (x), ϕ (x_{+}))],$

where $x_{+} \sim q (\cdot ∣ ϕ^{*} (x), ξ_{+})$ .

Negative Score: Measures the average distance between independently sampled observations:

${Neg}_{d_{Ψ}}^{π} (ϕ) := E_{x, x_{-} \sim ρ_{π}} [d_{Ψ} (ϕ (x), ϕ (x_{-}))] .$

Denoising Factor (DF): The normalized difference:

${DF}_{d_{Ψ}}^{π} (ϕ) := \frac{Neg - Pos}{Neg + Pos} \in [- 1, 1] .$

A higher DF indicates better denoising performance. The oracle encoder $ϕ^{*}$ achieves the maximum DF of 1.

Isolated Metric Evaluation

In many methods, the encoder $ϕ$ is jointly optimized using RL loss and auxiliary objectives (e.g., reward/self-prediction and metric losses), making it difficult to isolate the effect of metric learning on representation quality.

To address this, we propose the isolated metric estimation setting:

sac_dmdp

metric_iso

Introduce a separate metric encoder $\tilde{ϕ}$ optimized only with the metric loss $J_{M} (\tilde{ϕ})$ .
The agent encoder $ϕ$ is trained via standard RL objectives (e.g., $J_{SAC}$ ).
A SAC agent collects data used for training $\tilde{ϕ}$ across all methods.
Denoising is evaluated via ${DF}_{d_{Ψ}}^{π} (\tilde{ϕ})$ , enabling fair comparison across metric methods.
For self-prediction-based approaches, an isolated transition model is used without backpropagating gradients to $\tilde{ϕ}$ .

This decoupling allows rigorous evaluation of metric learning independent of downstream policy behavior.

↑ Back to TOC

🧪 Experiment

Our experimental result covers:

Benchmarking results on various tasks and noise settings

Evaluates agent performance across 250+ combinations of tasks and noise. Also include task difficulty examination: identifies easy and challenging scenarios.

Protocol

Aggregated over (all per-task result in Appendix):
- All tasks
- All random seeds
- 10 evaluation points per run from 1.95M to 2.05M environment steps
Result
- Agent performance (return and DF)
- Task difficulty (by aggregating agents’ performance, see Appendix E)

State-based settings

20 tasks × 5 × 2 IID Gaussian noise configurations (varying dimension or standard deviation)
12 random seeds per task

bench_st_std

bench_st_d

bench

Pixel-based settings

14 tasks × 6 background noise types
5 random seeds per task

bench_pixel

Key findings

SimSR performs best in most state-based tasks, excelling in both return and denoising factor (DF).
RAP leads in pixel-based tasks but drops moderately in state-based settings.
→ Notably, both were only evaluated on pixel domains in their original papers.
SAC and DeepMDP, though not metric learning methods, show decent performance across both domains.
→ These are often overlooked in prior work.
DBC, despite being a standard baseline, does not perform well.
In state-based tasks, increasing either:
- the noise dimension (with $σ = 1.0$ ), or
- the noise standard deviation (with $m = 32$ )
  leads to moderate performance drops.
→ Strong methods remain robust in both return and DF.
In pixel-based tasks, the widely used grayscale natural video setting is not significantly harder than the clean background.
→ Surprisingly, IID Gaussian noise is the most challenging and deserves further investigation.
Algorithm effectiveness varies across tasks, for example,
- RAP excels in reacher/easy
- MICo leads in point_mass/easy (see paper Appendix for details)
→ Broad task coverage is crucial for generalizable conclusions.
Optimizing a metric loss (e.g., in MICo) is as expensive as a ZP loss (e.g., in DeepMDP), per runtime comparison.
→ Adding metric-related objectives trades off computation efficiency.

Case Study: What Matters in Metric & Representation Learning?

Why SimSR works here? Pinpoints key design choices that drive performance improvements in metric and representation learning.

Setup: Six representative state-based DMC tasks for case study. A challenging IID Gaussian noise setting (dims=32, std=8.0) is chosen.

Normalization Effects

Pixel-based tasks use normalization by default; state-based do not (aligned with code provided in prior work).
SimSR uses $L_{2}$ normalization and performs best in state-based settings.
To generalize the insight, LayerNorm is applied to methods that don’t rely on $L_{2}$ normalization, avoiding potential metric misspecification⁸.

layer_norm

LayerNorm consistently improves both return and DF (Fig. 29 in paper) across methods.
DeepMDP + LayerNorm performs comparably to SimSR.

ZP Effects

ZP loss is essential for SimSR’s robustness under noisy state-based settings.

Level-up: Challenging Setting Evaluation

Tested methods with LayerNorm in IID Gaussian + random projection environments (varying dimension or standard deviation).

All methods suffer from increased noise in the IID Gaussian + random projection setting.
DeepMDP and SimSR remain relatively stable even under high noise variance.

Isolated Metric Evaluation: Does Metric Learning Help with Denoising?

Investigates the direct impact of metric learning on denoising by decoupling it from RL training.

Setup:

Evaluated in 6 selected tasks under ID generalization.
All use SAC with LayerNorm as the base agent.
Three experimental configurations:
1. (Fig.5, Row 1) We compare isolated encoders $\tilde{ϕ}$ trained under following settings:
- Metric learning methods: optimized with metric loss.
- SAC (Q loss): optimized with critic (Q) loss.
- DeepMDP (ZP / ZP + RP): optimized with self-prediction (ZP) loss or both reward-prediction (RP) and ZP losses.
1. (Fig.5, Row 2) Same as (1) with LayerNorm applied to $\tilde{ϕ}$ .
2. (Fig.5, Row 3) Same as (2) plus ZP loss applied to all metric learning methods’ $\tilde{ϕ}$ .

iso

Key Findings:

Isolated metric learning provides moderate denoising but underperforms compared to applying ZP loss alone.
Adding RP loss to ZP (DeepMDP) gives limited improvement over ZP-only.
Applying LayerNorm to $\tilde{ϕ}$ substantially boosts DF for DeepMDP, but only modestly helps metric methods.
Adding metric loss to ZP does not further improve DF.
MICo’s DF remains low due to its non-zero self-distance design.

OOD Generalization on Pixel-based Tasks

Focuses on evaluation under distribution shifts, which was prior work’s primary setting of interest.

Setup:

Evaluated all 14 pixel-based tasks with distracting video backgrounds.
Training and evaluation use distinct video samples to test OOD generalization.

ood

gen_gap

Key Findings:

All methods struggle with OOD generalization under static backgrounds, lacking temporal variation.
SAC and DeepMDP remain competitive even under OOD evaluation.
Colored video backgrounds pose a significantly harder generalization challenge than grayscale ones.
Surprisingly, SAC shows minimal reward drop in the grayscale setting, questioning the added value of metric learning there.

↑ Back to TOC

🪄 Takeaways

Simple but noisy environments (e.g., random projection) are useful for verifying metric learning effects.
Evaluate with direct metrics like the denoising factor, and clarify the problem setting as ID or OOD generalization.
Normalization and self-prediction loss (ZP) are critical for learning strong representations. Future methods should design metric objectives that complement these components or ensure fair comparison against baselines that incorporate them.
Metric learning’s added value diminishes when strong design choices are already present—warranting deeper analysis of when it offers unique benefits.

As noted by Ferns et al. (2004), BSM relates to the largest bisimulation relation $\sim$ . For brevity, we simplify the original definition based on the fixed-point of a contraction operator and omit the existence proof. ↩︎
Lecture by Prakash: slides link. ↩︎
Metrics can be viewed as a relaxation of abstraction. Traditional abstraction enforces a dichotomy: two states are either bisimilar or not. This strict equivalence is often unsuitable for high-dimensional observations and intractable to compute online, motivating the use of metrics to provide a smoother, more flexible alternative. ↩︎
Here, self-prediction basically applies reward prediction (RP) and next latent prediction (ZP) as auxiliary losses; RP can be replaced by Q loss, see Ni et al., 2024 and Voelcker et al., 2024 for more. Why does self-prediction also achieve effective abstraction empirically? Intuitively, from the perspective of abstraction, the method does not enforce a single, specific abstraction. Instead, it only requires the learned representation to retain information relevant to rewards and latent transitions. Many abstractions can satisfy this criterion (even trivial or identical mappings), and the algorithm itself does not guarantee convergence to the minimal sufficient abstraction in general cases. However, when the representation is under a dimensional bottleneck (where latent dim < observation dim), self-prediction need to compress information while keeping essential information. This could partly explain why self-prediction performs well empirically despite lacking explicit regularization toward minimality. The specific outcome also depends on architectural choices and the inductive biases of the encoder (such as sparsity preferences). ↩︎
Isometry — Wikipedia ↩︎
Recent methods managed to get rid of the max operator in the bisimulation metric, and we use this form to summarize those methods. See detailed discussion in our paper’s Appendix C.2. ↩︎
Formulation used for our test environment. See our paper Section 2 for details. ↩︎
Applying $L_{2}$ normalization can cause metric misspecification, as it restricts the representational space, preventing it from faithfully preserving target distances when $d_{X}$ exceeds the expressiveness of the normalized space; see Appendix C.4 for more. ↩︎

Last updated on Nov 17, 2025

Understanding Behavioral Metric Learning

📄 Read the RLC paper

💻 View the code

📑 Table of Contents

🛤️ Story behind this work

📐 Definition: Bisimulation Metric in MDP

Key Research Questions

🔍 Conceptual Analysis

Definition: Isometric Embedding

Target Metrics dX

📊 Summary of Key Implementation Choices for Benchmarked Methods

What is Denoising?

Why (why not) Metrics Help with Denoise?

📏 Study Design on Metric Learning: Noise and Denoising

Noise Settings

Denoising Factor: Quantifying Denoising

Isolated Metric Evaluation

🧪 Experiment

Benchmarking results on various tasks and noise settings

Case Study: What Matters in Metric & Representation Learning?

Isolated Metric Evaluation: Does Metric Learning Help with Denoising?

OOD Generalization on Pixel-based Tasks

🪄 Takeaways

Target Metrics $d_{X}$