I finally gave in and made a nice blog post about my most recent paper. This was a surprising amount of work, so please be nice and go read it!
Some tuning is still needed, but we are seeing results roughly on par with #PQN.
If you want to test out #REPPO (Atari is not integrated due to issues with envpool and the JAX version), check out github.com/cvoelcker/re...
#reinforcementlearning
* Replicable Reinforcement Learning with Linear Function Approximation
* Relative Entropy Pathwise Policy Optimization
We already posted about the second one (below); I'll get to the first one in a bit.
Off-policy #RL (e.g. #TD3) trains by differentiating a critic, while on-policy #RL (e.g. #PPO) uses Monte-Carlo gradients. But is that necessary? Turns out: No! We show how to get critic gradients on-policy. arxiv.org/abs/2507.11019
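For anyone curious what "differentiating a critic" versus "Monte-Carlo gradients" looks like in code, here is a minimal JAX sketch of the two estimators of E[Q(s, a)] under a Gaussian policy. This is not the REPPO implementation: the toy policy, the quadratic `q_critic`, and all names here are made up for illustration, and the score-function case reuses the critic value where PPO would use a return-based advantage, just to keep the snippet self-contained.

```python
# Minimal sketch (not REPPO): two gradient estimators of the policy objective
# J(theta) = E_{a ~ pi_theta(.|s)}[Q(s, a)] for a toy linear-Gaussian policy.
import jax
import jax.numpy as jnp

def policy(theta, s):
    # Gaussian policy: mean = W s, state-independent std.
    mean = theta["W"] @ s
    std = jnp.exp(theta["log_std"])
    return mean, std

def q_critic(s, a):
    # Stand-in critic; in practice this would be a learned Q-network.
    return -jnp.sum((a - 0.5 * s) ** 2)

def score_function_objective(theta, s, eps):
    # Likelihood-ratio / "Monte-Carlo" style gradient (the PPO family):
    # the sampled action and its value are treated as fixed data, so the
    # gradient only flows through log pi(a|s). (PPO would weight by a
    # return-based advantage; the toy critic value stands in here.)
    mean, std = policy(theta, s)
    a = jax.lax.stop_gradient(mean + std * eps)
    logp = -0.5 * jnp.sum(((a - mean) / std) ** 2) - jnp.sum(jnp.log(std))
    return jax.lax.stop_gradient(q_critic(s, a)) * logp

def pathwise_objective(theta, s, eps):
    # Pathwise / reparameterized gradient (the TD3 family): the gradient
    # flows from the critic through the action into the policy parameters.
    mean, std = policy(theta, s)
    a = mean + std * eps
    return q_critic(s, a)

key = jax.random.PRNGKey(0)
s = jnp.ones(2)
theta = {"W": 0.1 * jnp.eye(2), "log_std": jnp.zeros(2)}
eps = jax.random.normal(key, (2,))

g_score = jax.grad(score_function_objective)(theta, s, eps)
g_path = jax.grad(pathwise_objective)(theta, s, eps)
print("score-function grad of W:\n", g_score["W"])
print("pathwise grad of W:\n", g_path["W"])
```

Both estimators target the same objective; the difference is only where the gradient flows. The paper is about making the second, critic-differentiating kind work in the on-policy setting.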
arxiv.org/abs/2403.17805