Understanding Behavioral Metric Learning
📄 Read the RLC paper
A 50+ page deep dive, accepted by the Reinforcement Learning Conference (RLC) 2025. Here are the slides for the RLC talk (10 min talk slides).
See also a more detailed version of the slides (for Mila RL Sofa) (60 min talk slides).
💻 View the code
Tianwei and I have been drawn to a central question in RL: How can we design algorithms to build principled, general, and scalable state abstractions? As deep RL begins to tackle domains with high-dimensional observations, whether raw pixels, rich sensor data, or symbolic programs, the need for structured, compact representations becomes urgent.
📑 Table of Contents
- 🛤️ Story behind this work
- 🔍 Conceptual Analysis
- 📏 Study Design on Metric Learning: Noise and Denoising
- 🧪 Experiment
- Benchmarking Results
- Case Study: What Matters in Metric & Representation Learning?
- Normalization Effects
- ZP Effects
- Level-up: Challenging Setting Evaluation
- Isolated Metric Evaluation: Does Metric Learning Help with Denoising?
- OOD Generalization on Pixel-based Tasks
- 🪄 Takeaways
🛤️ Story behind this work
In our search for principled solutions for state abstraction (without assuming any prior knowledge or predefined structure), we found ourselves captivated by behavioral metric learning. The idea is elegant yet powerful:
📦 Core Idea: Define a distance between states based on how the current state “behaves” — the next observation and reward it leads to, given an action.
An important example is the well-known bisimulation metric, which is defined as:
📐 Definition: Bisimulation Metric in MDP
Consider an MDP with state space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $P$, and reward function $R$. There exists a unique pseudometric $d_{\sim}: \mathcal{S} \times \mathcal{S} \to \mathbb{R}_{\ge 0}$ satisfying, for all $s_i, s_j \in \mathcal{S}$,
$$
d_{\sim}(s_i, s_j) = \max_{a \in \mathcal{A}} \Big( |R(s_i, a) - R(s_j, a)| + c \cdot W_1(d_{\sim})\big(P(\cdot \mid s_i, a),\, P(\cdot \mid s_j, a)\big) \Big),
$$
with $W_1(d_{\sim})$ the 1-Wasserstein distance under the ground metric $d_{\sim}$ and $c \in [0, 1)$ a discount on the transition term [1].
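To make the fixed-point definition concrete, here is a minimal sketch that iterates the (simplified) operator on a toy MDP. It assumes deterministic transitions so that the $W_1$ term reduces to the distance between successor states; all quantities here are illustrative, not from the paper.

```python
import numpy as np

# Toy MDP: S states, A actions, deterministic transitions and rewards.
S, A, c = 4, 2, 0.9  # c discounts the transition term
next_state = np.array([[1, 2], [1, 3], [3, 2], [3, 3]])              # next_state[s, a]
reward = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])  # reward[s, a]

d = np.zeros((S, S))  # initialize the pseudometric at zero
for _ in range(500):  # iterate the contraction toward its fixed point
    d_new = np.zeros_like(d)
    for i in range(S):
        for j in range(S):
            # max over actions of reward gap + discounted successor distance;
            # with deterministic transitions, W1 collapses to d(s'_i, s'_j).
            d_new[i, j] = max(
                abs(reward[i, a] - reward[j, a]) + c * d[next_state[i, a], next_state[j, a]]
                for a in range(A)
            )
    if np.abs(d_new - d).max() < 1e-8:
        break
    d = d_new

print(np.round(d, 3))  # d[i, j] == 0 for behaviorally equivalent (bisimilar) states
```

In this toy MDP, states 0/1 and 2/3 are bisimilar by construction, so their pairwise distances converge to zero while cross-pair distances stay positive.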
This behavior-grounded geometry echoes classical abstractions in formal systems (e.g., labelled transition systems [2]) grounded in equivalence relations, while enabling a more flexible, data-driven approach. Below is an example of a bisimulation relation, which underlies the bisimulation metric [3]. Recent methods propose easier-to-compute metrics, often trading off theoretical guarantees for reduced approximation bias (see the target metrics section).

We can also define such a distance for state-action pairs, which relates to MDP homomorphism and lax bisimulation metrics.
In practice, many recent works implement this idea by viewing learning metrics as an auxiliary task to guide the representation toward capturing meaningful behavioral state equivalences.
For brevity, we refer to behavioral metric learning (mostly in deep RL) simply as “metric learning.”
As we dug deeper, we noticed an interesting trend. Most papers emphasize denoising in visually distracting settings as their main validation strategy. The field appears driven by the empirical observation that better representations — mainly measured by final return — emerge in noisy environments when metric learning is applied. Motivated by this, we initially set out to propose novel extensions of these ideas in broader RL contexts, e.g., for more efficient planning, better generalization, etc.
But as we read further, several major structural issues became hard to ignore.
- Fragmented empirical landscape. Results vary widely across papers. Experimental setups are often underspecified, and discrepancies between codebases and reported results make it difficult to assess what truly works.
- Opaque evaluation. Most methods are evaluated solely through final episodic return. It is rarely clear whether the learned metric causes the performance gain or is merely correlated with it. The chain from metric learning to representation improvement to return is seldom examined in detail.
- Limited unification across methods. While many recent methods share common underlying ideas, explicit conceptual connections between them are seldom articulated. Formal links between different metrics, or between metrics and representation learning in RL, remain underexplored in the literature.
Key Research Questions
We are now thinking more critically about how to move this area forward. This motivation led to a one-year effort on this work, aiming to address the following core research questions:
❓ Q1: First of all, why does (or doesn’t) metric learning denoise?
❓ Q2: What is the connection between metric learning and representation learning in recent methods?
- Q2.1: How are recent methods related or unified under a common perspective?
- Q2.2: Other recent representation learning approaches, such as self-prediction and DeepMDP, have also demonstrated denoising capabilities [4]. How are these methods connected to metric learning?
❓ Q3: ⭐ How Are Metrics Learned in Deep RL? ⭐ How should we understand abstraction quality beyond return?
- Q3.1: How can we evaluate a behavioral metric when exact computation is infeasible in large observation spaces?
❓ Q4: How do we isolate the effectiveness of metric learning from other contributing factors or design choices?
❓ Q5: Can we develop a more robust codebase that addresses the aforementioned issues and better supports the community?
🔍 Conceptual Analysis
We aim to find an encoder that maps noisy observations into a structured representation space, which facilitates RL by ensuring that task‑relevant variations are captured.

A natural way to formalize this goal is through an isometric embedding [5]:
Definition: Isometric Embedding
An encoder $\phi: \mathcal{X} \to \mathcal{Z}$ is an isometric embedding if, for all observations $x_i, x_j \in \mathcal{X}$,
$$
d_{\mathcal{Z}}\big(\phi(x_i), \phi(x_j)\big) = d^*(x_i, x_j),
$$
where $d^*$ is a target behavioral metric on observations and $d_{\mathcal{Z}}$ is the distance used in the representation space.
Target Metrics
A target metric $d^*$, inherent in an MDP (like an optimal value function), captures differences in rewards and transition dynamics, with a general form [6] of
$$
d^*(x_i, x_j) \;=\; c_R \, \mathrm{DIST}\big(r_{x_i}, r_{x_j}\big) \;+\; c_T \, \mathrm{DIST}\big(P(\cdot \mid x_i),\, P(\cdot \mid x_j)\big).
$$
To approximate this, the encoder is trained so that distances in representation space match sampled, bootstrapped estimates of the target metric. Here, the main design choices that distinguish recent methods are the reward distance (e.g., absolute or Huber), the transition distance (e.g., Wasserstein or sample-based), the representation distance (e.g., L1, angular, cosine), and the weights $c_R, c_T$.
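To make this concrete, below is a hedged sketch of a generic metric learning update in this spirit (PyTorch-style). The L1 representation distance, the Huber regression, the target-encoder bootstrap, and the coefficient names are illustrative choices rather than any single benchmarked method’s implementation.

```python
import torch
import torch.nn.functional as F

def metric_loss(encoder, target_encoder, obs_i, obs_j, r_i, r_j,
                next_obs_i, next_obs_j, c_r=1.0, c_t=0.99):
    """Illustrative behavioral-metric loss on a batch of observation pairs."""
    z_i, z_j = encoder(obs_i), encoder(obs_j)
    # Online distance in representation space (L1 here; angular/cosine are common too).
    d_online = (z_i - z_j).abs().sum(dim=-1)
    with torch.no_grad():  # bootstrapped target without gradients (the "target trick")
        nz_i, nz_j = target_encoder(next_obs_i), target_encoder(next_obs_j)
        d_next = (nz_i - nz_j).abs().sum(dim=-1)
        target = c_r * (r_i - r_j).abs() + c_t * d_next
    return F.huber_loss(d_online, target)
```

In practice the paired observations are typically obtained by permuting a replay batch, and this loss is added to the RL objectives as an auxiliary term.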
We provide some examples of how recent methods define their target metrics:

📊 Summary of Key Implementation Choices for Benchmarked Methods
This table summarizes the benchmarked metric learning methods and their key design choices, based on their open-source implementations. For details on each method, see Appendix C.2 and C.3 of the paper.
Aside from the inherent metric design choices, mind the key implementation choices: self-prediction (ZP), reward prediction (RP), and normalization (Norm.)! See our paper for a detailed discussion.
| Method | Reward Dist. | Transition Dist. | Repr. Dist. | Metric Loss | Target Trick | Other Losses | Transition Model | Norm. |
|---|---|---|---|---|---|---|---|---|
| SAC | — | — | — | — | — | — | — | — |
| DeepMDP | — | — | — | — | — | RP + ZP | Probabilistic | — |
| DBC | Huber | | Huber | MSE | — | RP + ZP | Probabilistic | — |
| DBC-normed | Huber | | Huber | MSE | — | RP + ZP | Deterministic | MaxNorm |
| MICo | Abs. | Sample-based | Angular | Huber | ✓ | — | — | — |
| RAP | RAP | | Angular | Huber | — | RP + ZP | Probabilistic | — |
| SimSR | Abs. | Sample-based | Cosine | Huber | — | ZP | Prob. ensemble | L2Norm |
What is Denoising?
We define denoising as learning to ignore task-irrelevant noise in observations, so the model can recognize different-looking inputs that actually represent the same underlying situation.
An encoder denoises if:
- It maps noisy observations that correspond to the same task-relevant state (positive examples, defined later) to the same representation.
- It maps observations from different underlying states (negative examples, defined later) to different representations.
This mirrors the behavior of an oracle encoder that maps each observation to its underlying task-relevant state.
Why Do (or Don’t) Metrics Help with Denoising?
Here, we show connections between the target metrics and denoising:
- Metrics may aid denoising:
- Bisimulation metric (BSM) yields perfect denoising in EX-BMDPs [7].
- PBSM (the policy-dependent bisimulation metric) supports denoising when the policy is exo-free, i.e., independent of the noise.
- MICo does not guarantee zero distance between bisimilar observations unless both the policy and the transitions are deterministic, yet it helps empirically according to prior work.
- Metrics have limitations in denoising:
- Exact BSM is hard to approximate, so methods use PBSM or MICo instead.
- PBSM may fail to denoise under non-exo-free or even optimal policies due to differing rewards or transitions. In fact, an exo-free policy is something we want as the output of the algorithm!
- Practical challenges in metric approximation:
- Metrics like PBSM/MICo are on-policy, but the rewards used are off-policy (from the replay buffer). Thus, the reward-difference term is biased.
- Reward/transition models are approximated, introducing errors.
- Multiple losses (metric, ZP, critic) jointly shape the representation, which may confound the source of performance gain, or interfere with denoising.
These motivate further empirical investigation into whether metric learning facilitates denoising and learning.
📏 Study Design on Metric Learning: Noise and Denoising
This table summarizes the limitations of prior metric-learning work and our corresponding amendments.
| Aspect | Prior Work | Our Study Design |
|---|---|---|
| Task Diversity | Limited test environments: few tasks with grayscale natural video backgrounds | Diverse state-based and pixel-based noise settings across tasks |
| Generalization Evaluation | Entangled: evaluation only on unseen videos (OOD), hard to know the source of difficulty | Clear separation of ID and OOD generalization via distinct train/test noise |
| Evaluation Measure | Indirect: impact mainly on evaluation return | Direct: proposed Denoising Factor (DF) as a targeted representation measure |
| Loss Attribution | Mixed losses (e.g., critic, ZP, RP) obscure metric learning effect | Isolated metric evaluation disentangles representation from RL objectives |
Noise Settings

- State-based Noise (constructed as sketched below)
  - IID Gaussian: Concatenates IID Gaussian (univariate) noise directly to the state. Difficulty is adjusted via the noise dimension or the standard deviation.
  - IID Gaussian Noise with Random Projection: Combines the state and noise vectors, then applies a full-rank random projection. Enables “decryption” in principle but increases complexity by entangling signal and noise.
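A minimal sketch of how these two state-based noise constructions can be implemented (dimensions, seeds, and function names are illustrative, not the paper’s exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_iid_gaussian(state, noise_dim=32, std=8.0):
    """Concatenate IID Gaussian noise dimensions to a clean state vector."""
    noise = rng.normal(0.0, std, size=noise_dim)
    return np.concatenate([state, noise])

def make_random_projection(state_dim, noise_dim=32):
    """A fixed full-rank matrix that entangles state and noise dimensions."""
    n = state_dim + noise_dim
    A = rng.normal(size=(n, n))
    while np.linalg.matrix_rank(A) < n:  # resample until full rank (invertible)
        A = rng.normal(size=(n, n))
    return A

def add_projected_noise(state, A, noise_dim=32, std=8.0):
    """IID Gaussian + random projection: observation = A @ [state; noise]."""
    return A @ add_iid_gaussian(state, noise_dim, std)
```

The projection matrix is sampled once and held fixed, so the clean state remains recoverable in principle even though the observation entangles signal and noise.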

Pixel-based Noise
- Natural Images: Replaces the clean background with static natural images (grayscale or colored), one per training run; introduces visual complexity.
- Natural Videos: Replaces the background with dynamic natural videos; the frame index advances cyclically over time.
- IID Gaussian: IID Gaussian noise is added per pixel to the background, with the robot foreground overlaid on top.
ID vs. OOD Evaluation
- In-distribution (ID): The same noise distribution is used during training and testing (e.g., Gaussian noise with a fixed standard deviation).
- Out-of-distribution (OOD): Task-relevant components remain fixed, but the noise distribution shifts between training and testing (e.g., training and testing with different video datasets). This is the common setup in prior metric learning evaluations.
Denoising Factor: Quantifying Denoising
The denoising factor (DF) measures how well an encoder $\phi$ separates task-relevant signal from noise. Given an anchor observation $x$:
- Positive Example ($x^+$): A different observation of the same underlying state as the anchor $x$, but with different noise (e.g., a different background or sensor noise).
- Negative Example ($x^-$): An observation sampled independently from a different state.

We define:
- Positive Score: the average representation distance between an anchor $x$ and a positive example $x^+$ that shares the same task-relevant state:
$$
S_{\text{pos}}(\phi) = \mathbb{E}\big[\, d_{\mathcal{Z}}\big(\phi(x), \phi(x^+)\big) \,\big],
$$
where $d_{\mathcal{Z}}$ is the distance used in representation space.
- Negative Score: the average representation distance between independently sampled observations:
$$
S_{\text{neg}}(\phi) = \mathbb{E}\big[\, d_{\mathcal{Z}}\big(\phi(x), \phi(x^-)\big) \,\big].
$$
- Denoising Factor (DF): the normalized difference between the negative and positive scores:
$$
\mathrm{DF}(\phi) = \frac{S_{\text{neg}}(\phi) - S_{\text{pos}}(\phi)}{S_{\text{neg}}(\phi) + S_{\text{pos}}(\phi)}.
$$
A higher DF indicates better denoising performance. The oracle encoder, which maps every observation to its underlying task-relevant state, attains a positive score of zero and hence the maximal DF.
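As a small utility sketch, DF can be estimated from sampled anchor/positive/negative batches as below; the L2 representation distance and the function names are illustrative assumptions following the normalized-difference definition above, not the paper’s exact code.

```python
import numpy as np

def denoising_factor(encode, anchors, positives, negatives):
    """DF = (negative score - positive score) / (negative score + positive score).

    encode: maps a batch of observations to representations (NumPy arrays).
    anchors/positives share the same underlying state (different noise);
    negatives are sampled independently from other states.
    """
    z_a, z_p, z_n = encode(anchors), encode(positives), encode(negatives)
    pos_score = np.linalg.norm(z_a - z_p, axis=-1).mean()  # same state, different noise
    neg_score = np.linalg.norm(z_a - z_n, axis=-1).mean()  # independent states
    return (neg_score - pos_score) / (neg_score + pos_score + 1e-8)
```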
Isolated Metric Evaluation
In many methods, the encoder is shaped jointly by the metric loss and the other training objectives (critic, ZP, RP losses), so the metric’s individual contribution to denoising is hard to attribute.
To address this, we propose the isolated metric estimation setting:


- Introduce a separate metric encoder, optimized only with the metric loss.
- The agent encoder is trained via the standard RL objectives (e.g., the critic loss).
- A SAC agent collects the data used for training the metric encoder across all methods.
- Denoising is evaluated via the DF of the metric encoder, enabling fair comparison across metric methods.
- For self-prediction-based approaches, an isolated transition model is used, without backpropagating gradients to the agent encoder.
This decoupling allows rigorous evaluation of metric learning independent of downstream policy behavior.
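Below is a self-contained sketch of this decoupling with illustrative names and random stand-in data: only the isolated metric encoder receives gradients from the metric loss (here an L1-distance, Huber-regressed target like the earlier sketch), while the agent encoder is reserved for the RL losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, z_dim, B = 12, 8, 256
agent_encoder = nn.Linear(obs_dim, z_dim)    # updated by the SAC losses (not shown)
metric_encoder = nn.Linear(obs_dim, z_dim)   # updated only by the metric loss
opt = torch.optim.Adam(metric_encoder.parameters(), lr=3e-4)

# Stand-in replay batch; in the actual protocol this data is collected by a SAC agent.
obs_i, obs_j = torch.randn(B, obs_dim), torch.randn(B, obs_dim)
next_i, next_j = torch.randn(B, obs_dim), torch.randn(B, obs_dim)
r_i, r_j = torch.randn(B), torch.randn(B)

d_online = (metric_encoder(obs_i) - metric_encoder(obs_j)).abs().sum(-1)
with torch.no_grad():                        # bootstrapped target, no gradient
    d_next = (metric_encoder(next_i) - metric_encoder(next_j)).abs().sum(-1)
    target = (r_i - r_j).abs() + 0.99 * d_next

loss = F.huber_loss(d_online, target)
opt.zero_grad()
loss.backward()                              # gradients never reach agent_encoder
opt.step()
```

The denoising factor is then computed on `metric_encoder` alone, so comparisons across metric methods are not confounded by the RL objectives.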
🧪 Experiment
Our experimental results cover:
Benchmarking results on various tasks and noise settings
Evaluates agent performance across 250+ combinations of tasks and noise settings, and examines task difficulty to identify easy and challenging scenarios.
Protocol
Aggregated over (all per-task results are in the Appendix):
- All tasks
- All random seeds
- 10 evaluation points per run from 1.95M to 2.05M environment steps
Result
- Agent performance (return and DF)
- Task difficulty (by aggregating agents’ performance, see Appendix E)
State-based settings
- 20 tasks × 5 × 2 IID Gaussian noise configurations (varying dimension or standard deviation)
- 12 random seeds per task



Pixel-based settings
- 14 tasks × 6 background noise types
- 5 random seeds per task
Key findings
SimSR performs best in most state-based tasks, excelling in both return and denoising factor (DF).
RAP leads in pixel-based tasks but drops moderately in state-based settings.
→ Notably, both were only evaluated on pixel domains in their original papers.
SAC and DeepMDP, though not metric learning methods, show decent performance across both domains.
→ These are often overlooked in prior work.
DBC, despite being a standard baseline, does not perform well.
In state-based tasks, increasing either:
- the noise dimension (with the standard deviation fixed), or
- the noise standard deviation (with the dimension fixed)
leads to moderate performance drops.
→ Strong methods remain robust in both return and DF.
In pixel-based tasks, the widely used grayscale natural video setting is not significantly harder than the clean background.
→ Surprisingly, IID Gaussian noise is the most challenging and deserves further investigation.
Algorithm effectiveness varies across tasks, for example,
- RAP excels in `reacher/easy`
- MICo leads in `point_mass/easy` (see the paper's Appendix for details)
→ Broad task coverage is crucial for generalizable conclusions.
Optimizing a metric loss (e.g., in MICo) is as expensive as a ZP loss (e.g., in DeepMDP), per our runtime comparison.
→ Adding metric-related objectives comes at a cost in computational efficiency.
Case Study: What Matters in Metric & Representation Learning?
Why does SimSR work here? This case study pinpoints the key design choices that drive performance improvements in metric and representation learning.
Setup: Six representative state-based DMC tasks for case study. A challenging IID Gaussian noise setting (dims=32, std=8.0) is chosen.
Normalization Effects
- Pixel-based tasks use normalization by default; state-based tasks do not (aligned with the code provided in prior work).
- SimSR uses $\ell_2$ normalization and performs best in state-based settings.
- To generalize the insight, LayerNorm is applied to the methods that do not rely on $\ell_2$ normalization, avoiding potential metric misspecification [8].

- LayerNorm consistently improves both return and DF (Fig. 29 in paper) across methods.
- DeepMDP + LayerNorm performs comparably to SimSR.
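For concreteness, here is a minimal sketch of the two normalization variants compared here: an $\ell_2$-normalized output (as in SimSR-style setups) versus LayerNorm appended to the encoder output. The architecture sizes and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """MLP encoder whose output can be left raw, L2-normalized, or LayerNorm-ed."""
    def __init__(self, obs_dim, z_dim, norm="layernorm"):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, z_dim))
        self.norm = norm
        self.ln = nn.LayerNorm(z_dim) if norm == "layernorm" else None

    def forward(self, obs):
        z = self.net(obs)
        if self.norm == "l2":          # project onto the unit sphere (SimSR-style)
            return F.normalize(z, dim=-1)
        if self.norm == "layernorm":   # the drop-in addition studied in this case study
            return self.ln(z)
        return z                       # raw output (no normalization)
```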
ZP Effects

- ZP loss is essential for SimSR’s robustness under noisy state-based settings.
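For reference, here is a minimal sketch of a ZP (latent self-prediction) auxiliary loss with a deterministic latent transition model and a stop-gradient target encoder; this is an illustrative variant, not any specific codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDynamics(nn.Module):
    """Deterministic latent transition model used for self-prediction (ZP)."""
    def __init__(self, z_dim, a_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + a_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, z_dim))

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

def zp_loss(encoder, target_encoder, dynamics, obs, action, next_obs):
    """Predict the next latent and regress it onto a stop-gradient target."""
    z_next_pred = dynamics(encoder(obs), action)
    with torch.no_grad():              # stop-gradient target representation
        z_next = target_encoder(next_obs)
    return F.mse_loss(z_next_pred, z_next)
```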
Level-up: Challenging Setting Evaluation
- Tested methods with LayerNorm in IID Gaussian + random projection environments (varying dimension or standard deviation).

All methods degrade as the noise level increases in the IID Gaussian + random projection setting.
DeepMDP and SimSR remain relatively stable even under high noise variance.
Isolated Metric Evaluation: Does Metric Learning Help with Denoising?
Investigates the direct impact of metric learning on denoising by decoupling it from RL training.
Setup:
- Evaluated in 6 selected tasks under ID generalization.
- All use SAC with LayerNorm as the base agent.
- Three experimental configurations:
- (Fig. 5, Row 1) We compare isolated encoders trained under the following settings:
  - Metric learning methods: optimized with the metric loss.
  - SAC (Q loss): optimized with the critic (Q) loss.
  - DeepMDP (ZP / ZP + RP): optimized with the self-prediction (ZP) loss, or with both reward prediction (RP) and ZP losses.
- (Fig. 5, Row 2) Same as (1), with LayerNorm applied to the isolated encoders.
- (Fig. 5, Row 3) Same as (2), plus a ZP loss applied to all metric learning methods' isolated encoders.

Key Findings:
- Isolated metric learning provides moderate denoising but underperforms compared to applying ZP loss alone.
- Adding RP loss to ZP (DeepMDP) gives limited improvement over ZP-only.
- Applying LayerNorm to the isolated encoder substantially boosts DF for DeepMDP, but only modestly helps the metric methods.
- Adding a metric loss on top of ZP does not further improve DF.
- MICo’s DF remains low due to its non-zero self-distance design.
OOD Generalization on Pixel-based Tasks
Focuses on evaluation under distribution shifts, which was prior work’s primary setting of interest.
Setup:
- Evaluated all 14 pixel-based tasks with distracting video backgrounds.
- Training and evaluation use distinct video samples to test OOD generalization.


Key Findings:
- All methods struggle with OOD generalization under static image backgrounds, which lack temporal variation.
- SAC and DeepMDP remain competitive even under OOD evaluation.
- Colored video backgrounds pose a significantly harder generalization challenge than grayscale ones.
- Surprisingly, SAC shows minimal reward drop in the grayscale setting, questioning the added value of metric learning there.
🪄 Takeaways
Simple but noisy environments (e.g., random projection) are useful for verifying metric learning effects.
Evaluate with direct metrics like the denoising factor, and clarify the problem setting as ID or OOD generalization.
Normalization and self-prediction loss (ZP) are critical for learning strong representations. Future methods should design metric objectives that complement these components or ensure fair comparison against baselines that incorporate them.
Metric learning’s added value diminishes when strong design choices are already present—warranting deeper analysis of when it offers unique benefits.
As noted by Ferns et al. (2004), BSM relates to the largest bisimulation relation $\sim$. For brevity, we simplify the original definition based on the fixed point of a contraction operator and omit the existence proof. ↩︎
Lecture by Prakash: slides link. ↩︎
Metrics can be viewed as a relaxation of abstraction. Traditional abstraction enforces a dichotomy: two states are either bisimilar or not. This strict equivalence is often unsuitable for high-dimensional observations and intractable to compute online, motivating the use of metrics to provide a smoother, more flexible alternative. ↩︎
Here, self-prediction basically applies reward prediction (RP) and next-latent prediction (ZP) as auxiliary losses; RP can be replaced by a Q loss, see Ni et al., 2024 and Voelcker et al., 2024 for more. Why does self-prediction also achieve effective abstraction empirically? Intuitively, from the perspective of abstraction, the method does not enforce a single, specific abstraction. Instead, it only requires the learned representation to retain information relevant to rewards and latent transitions. Many abstractions can satisfy this criterion (even trivial or identity mappings), and the algorithm itself does not guarantee convergence to the minimal sufficient abstraction in general cases. However, when the representation is under a dimensional bottleneck (latent dim < observation dim), self-prediction needs to compress information while keeping what is essential. This could partly explain why self-prediction performs well empirically despite lacking explicit regularization toward minimality. The specific outcome also depends on architectural choices and the inductive biases of the encoder (such as sparsity preferences). ↩︎
Recent methods managed to get rid of the max operator in the bisimulation metric, and we use this form to summarize those methods. See detailed discussion in our paper’s Appendix C.2. ↩︎
Formulation used for our test environment. See our paper Section 2 for details. ↩︎
Applying $\ell_2$ normalization can cause metric misspecification, as it restricts the representation space and prevents it from faithfully preserving target distances when the target metric exceeds the expressiveness of the normalized space; see Appendix C.4 for more. ↩︎