arxiv.org/abs/2412.06655
With Adrien Bolland and Damien Ernst, we propose a new intrinsic reward. Instead of encouraging the agent to visit states uniformly, we encourage it to visit *future* states uniformly, from every state.
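For intuition, here is a schematic way to write such an objective (my notation, not necessarily the paper's exact formulation): from every visited state, reward the policy for making its discounted distribution over future states as uniform, i.e. high-entropy, as possible.

```latex
% Schematic objective, my notation (not necessarily the paper's exact formulation).
% d^\pi_\gamma(\cdot \mid s) denotes the discounted distribution over future states
% when starting in s and following \pi; maximizing its entropy from every visited
% state encourages uniform coverage of future states.
\max_{\pi} \; \mathbb{E}_{s \sim d^{\pi}} \Big[ \mathcal{H}\big( d^{\pi}_{\gamma}(\cdot \mid s) \big) \Big]
```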
arxiv.org/abs/2402.00162
With Adrien Bolland and Damien Ernst, we decided to frame the exploration problem for policy-gradient methods from an optimization point of view.
openreview.net/forum?id=wNV...
With Daniel Ebi and Damien Ernst, we looked for a reason why asymmetric actor-critic performs better, even when using RNN-based policies that take the full observation history as input (no aliasing).
arxiv.org/abs/2501.19116
With Damien Ernst and Aditya Mahajan, we looked for a reason why asymmetric actor-critic algorithms perform better than their symmetric counterparts.
📝 Paper: arxiv.org/abs/2501.19116
🎤 Talk: orbi.uliege.be/handle/2268/...
A warm thank-you to Aditya Mahajan for welcoming me to McGill University and for his valuable supervision.
Despite not matching the usual recurrent actor-critic setting, this analysis still provides insights into the effectiveness of asymmetric actor-critic algorithms.
Now, what is aliasing exactly?
The aliasing and inference terms arise from z = f(h) not being Markovian. They can be bounded by the difference between the approximate p(s|z) and exact p(s|h) beliefs.
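Schematically, and in my notation (constants and the exact norms used in the paper are omitted), both terms vanish when the approximate belief induced by z matches the exact belief induced by h:

```latex
% Schematic form, my notation (constants and exact norms omitted):
% the aliasing and inference terms are controlled by the average distance
% between the aggregated belief p(s|z) and the exact belief p(s|h).
\text{aliasing} + \text{inference} \;\lesssim\;
\mathbb{E}_{h} \Big[ \big\| p(\cdot \mid z) - p(\cdot \mid h) \big\|_{1} \Big],
\qquad z = f(h).
```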
In addition to the average critic error, which is also present in the actor bound, the symmetric actor-critic algorithm suffers from an extra "inference term".
The symmetric temporal difference learning algorithm has an additional "aliasing term".
Does asymmetric learning really learn faster than symmetric learning?
In this paper, we provide theoretical evidence that it does, based on a finite-time analysis adapted from Cayci et al. (2024).
As a result, the state can be an input of the critic, which becomes Q(s, z, a) in the asymmetric setting instead of Q(z, a) in the symmetric setting.
In a privileged POMDP, the state can be used to learn a policy π(a|z) faster.
Note, however, that the state cannot be an input to the policy, since it is not available at execution time.
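As a minimal illustration (a PyTorch-style sketch under my own assumptions, not the paper's implementation): the policy only consumes the history feature z, the symmetric critic is Q(z, a), and the asymmetric critic Q(s, z, a) additionally receives the privileged state during training.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

class HistoryPolicy(nn.Module):
    """pi(a|z): conditioned only on the history feature z (e.g., an RNN summary of past observations)."""
    def __init__(self, z_dim, n_actions):
        super().__init__()
        self.net = mlp(z_dim, n_actions)

    def forward(self, z):
        return torch.distributions.Categorical(logits=self.net(z))

class SymmetricCritic(nn.Module):
    """Q(z, a): uses the same information as the policy."""
    def __init__(self, z_dim, n_actions):
        super().__init__()
        self.net = mlp(z_dim, n_actions)

    def forward(self, z):
        return self.net(z)  # one value per action

class AsymmetricCritic(nn.Module):
    """Q(s, z, a): also conditioned on the state s, available at training time only."""
    def __init__(self, s_dim, z_dim, n_actions):
        super().__init__()
        self.net = mlp(s_dim + z_dim, n_actions)

    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=-1))
```

At execution time, only the history-based policy is needed, so the privileged state never has to be available outside training.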
RL methods usually assume one of two settings:
- MDP: full state observability (too optimistic),
- POMDP: partial state observability (too pessimistic).
Instead, asymmetric RL methods assume:
- Privileged POMDP: asymmetric state observability (full at training, partial at execution).
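Schematically, in my notation (not necessarily the paper's exact definition), the privileged POMDP keeps the POMDP dynamics but changes what the learner observes at each phase:

```latex
% Schematic setting, my notation (not necessarily the paper's exact definition):
% the dynamics are those of a POMDP, but the state is revealed during training only.
\underbrace{(\mathcal{S}, \mathcal{A}, \mathcal{O}, T, R, \Omega, \gamma)}_{\text{POMDP}}
\quad \text{with} \quad
\text{training: } (s_t, o_t) \text{ observed}, \qquad
\text{execution: } o_t \text{ only}.
```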