Gaspard Lambrechts (@gsprd.be)
PhD Student doing RL in POMDPs at the University of Liège - Intern at McGill - gsprd.be
At #ICML2025, we will present a theoretical justification for the benefits of "asymmetric actor-critic" algorithms (#W1008, Wednesday at 11am).

📝 Paper: hdl.handle.net/2268/326874
💻 Blog: damien-ernst.be/2025/06/10/a...
July 16, 2025 at 2:32 PM
Last week, I gave an invited talk on "asymmetric reinforcement learning" at the BeNeRL workshop. I was happy to draw attention to this niche topic, which I think can be useful to any reinforcement learning researcher.

Slides: hdl.handle.net/2268/333931.
July 11, 2025 at 9:22 AM
Two months after my PhD defense on RL in POMDPs, I finally uploaded the final version of my thesis :)

You can find it here: hdl.handle.net/2268/328700 (manuscript and slides).

Many thanks to my advisors and to the jury members.
June 13, 2025 at 11:44 AM
Now, as far as the actor suboptimality is concerned, we obtained the following finite-time bounds.

In addition to the average critic error, which appears in the actor bound for both settings, the symmetric actor-critic algorithm suffers from an additional "inference term".
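
Schematically (with placeholder notation of my own, not the exact constants or rates from the paper), the comparison reads:

```latex
% Schematic comparison only. \varepsilon_{\text{common}} is a placeholder for the
% terms shared by both bounds; it is not the paper's notation.
\begin{align*}
  \text{asymmetric:} \quad J(\pi^\ast) - J(\pi)
    &\lesssim \bar{\varepsilon}_{\text{critic}} + \varepsilon_{\text{common}}, \\
  \text{symmetric:} \quad J(\pi^\ast) - J(\pi)
    &\lesssim \bar{\varepsilon}_{\text{critic}} + \varepsilon_{\text{common}}
       + \varepsilon_{\text{inference}},
\end{align*}
```

where \bar{\varepsilon}_{\text{critic}} is the average critic error and \varepsilon_{\text{inference}} is the inference term.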
June 9, 2025 at 2:43 PM
By adapting the finite-time bound from the symmetric setting to the asymmetric setting, we obtain the following error bounds for the critic estimates.

The bound for the symmetric temporal difference learning algorithm has an additional "aliasing term".
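
In schematic form (placeholder notation of my own, not the exact bound from the paper):

```latex
% Schematic comparison only. \varepsilon_{\text{TD}} is a placeholder for the
% terms shared by both finite-time bounds; it is not the paper's notation.
\begin{align*}
  \text{asymmetric TD:} \quad \mathbb{E}\!\left[\text{critic error}\right]
    &\lesssim \varepsilon_{\text{TD}}, \\
  \text{symmetric TD:} \quad \mathbb{E}\!\left[\text{critic error}\right]
    &\lesssim \varepsilon_{\text{TD}} + \varepsilon_{\text{aliasing}},
\end{align*}
```

where \varepsilon_{\text{aliasing}} is the aliasing term that only the symmetric algorithm suffers from.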
June 9, 2025 at 2:43 PM
While this algorithm is valid and unbiased (Baisero & Amato, 2022), a theoretical justification for its benefits is still missing.

Does it really learn faster than symmetric learning?

In this paper, we provide theoretical evidence that it does, building on an adapted finite-time analysis (Cayci et al., 2024).
June 9, 2025 at 2:43 PM
However, with actor-critic algorithms, the critic is not needed at execution!

As a result, the state can be an input to the critic, which becomes Q(s, z, a) in the asymmetric setting instead of Q(z, a) in the symmetric setting.
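
To make the difference concrete, here is a minimal sketch of the two critics in PyTorch, assuming a learned history feature z (the architecture choices are mine, not the paper's):

```python
import torch
import torch.nn as nn


class SymmetricCritic(nn.Module):
    """Q(z, a): conditioned only on the history feature z, like the actor."""

    def __init__(self, z_dim, a_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + a_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))


class AsymmetricCritic(nn.Module):
    """Q(s, z, a): also conditioned on the state s, which is available
    during training but never needed at execution."""

    def __init__(self, s_dim, z_dim, a_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + z_dim + a_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, s, z, a):
        return self.net(torch.cat([s, z, a], dim=-1))
```

Since the critic is discarded after training, conditioning it on s costs nothing at execution.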
June 9, 2025 at 2:43 PM
In a POMDP, the goal is to find an optimal policy π(a|z) that maps a feature z = f(h) of the history h to an action a.

In a privileged POMDP, the state can be used during training to learn the policy π(a|z) faster.

But note that the state cannot be an input to the policy, since it is not available at execution.
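
As a minimal sketch (my own illustration, not the paper's architecture), f could be a recurrent encoder that turns the observation history into z, with a policy head on top:

```python
import torch
import torch.nn as nn


class HistoryPolicy(nn.Module):
    """pi(a | z) with z = f(h): a GRU summarizes the history h into a
    feature z, and only z (never the state s) feeds the policy head."""

    def __init__(self, obs_dim, act_dim, z_dim=64):
        super().__init__()
        self.f = nn.GRU(obs_dim, z_dim, batch_first=True)  # history feature f(h)
        self.head = nn.Linear(z_dim, act_dim)               # action logits

    def forward(self, history):
        # history: (batch, time, obs_dim) tensor of past observations
        _, z = self.f(history)                              # z: (1, batch, z_dim)
        return torch.distributions.Categorical(logits=self.head(z.squeeze(0)))
```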
June 9, 2025 at 2:43 PM
Typically, classical RL methods assume:
- MDP: full state observability (too optimistic),
- POMDP: partial state observability (too pessimistic).

Instead, asymmetric RL methods assume:
- Privileged POMDP: asymmetric state observability (full at training, partial at execution); see the sketch below.
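
For intuition, here is a minimal sketch of what "asymmetric state observability" looks like in practice, using a hypothetical interface (not a standard API): training exposes both the observation and the state, while execution exposes the observation only.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TrainingStep:
    """Privileged POMDP at training time: the state is exposed alongside
    the observation, e.g. to feed an asymmetric critic Q(s, z, a)."""
    observation: Any  # partial view, also available at execution
    state: Any        # privileged full state, training only
    reward: float
    done: bool


@dataclass
class ExecutionStep:
    """At execution, only the observation stream is available."""
    observation: Any
    reward: float
    done: bool
```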
June 9, 2025 at 2:43 PM
📝 Our paper "A Theoretical Justification for Asymmetric Actor-Critic Algorithms" was accepted at #ICML!

Never heard of "asymmetric actor-critic" algorithms? Yet, many successful #RL applications use them (see image).

But these algorithms are not fully understood. Below, we provide some insights.
June 9, 2025 at 2:43 PM
Slydst, my Typst package for making simple slides, just got its 100th star on GitHub.

While I would not yet advise using Typst for papers, its markdown-like syntax lets you create slides in a few minutes, while supporting everything we love from LaTeX: equations.

github.com/glambrechts/...
April 3, 2025 at 1:51 PM