Mirco Mutti
@mircomutti.bsky.social
Reinforcement learning, but without rewards.
Postdoc at the Technion. PhD from Politecnico di Milano.
https://muttimirco.github.io
No, but since the PC explicitly suggested posting on the 20th, I think most people will comply
November 18, 2025 at 7:42 AM
Looks interesting, but I cannot access the URL or find the report anywhere
July 29, 2025 at 6:54 AM
That’s my little #ICML2025 convex RL roundup!

If you know of other cool work in this space (or are working on one), feel free to reply and share.

Hope to see even more work on convex RL variations 🚀

n/n
July 24, 2025 at 1:09 PM
📄Flow density control – @desariky.bsky.social et al

Bridging convex RL with generative models: how to steer diffusion/flow models to optimize non-linear, user-specified utilities (beyond just entropy-regularized fine-tuning)?

📍 EXAIT workshop
🔗 openreview.net/pdf?id=zOgAx...

7/n
July 24, 2025 at 1:09 PM
📄Towards unsupervised multi-agent RL – @ricczamboni.bsky.social et al (yours truly!)

Still in the convex Markov games space—this work explores more tractable objectives for the learning setting.

📍EXAIT workshop
🔗https://openreview.net/pdf?id=A1518D1Pp9

6/n
July 24, 2025 at 1:09 PM
📄Convex Markov games – Ian Gemp et al

If you can 'convexify' MDPs, you can do the same for Markov games.
These two papers lay out a general framework + algorithms for the zero-sum version.
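My loose guess at what the zero-sum objective looks like (notation mine, not necessarily the papers'):

```latex
% Hedged sketch of a zero-sum convex Markov game objective (notation mine):
% d^{\pi,\mu} is the joint occupancy induced by the two players' policies.
\max_{\pi}\ \min_{\mu}\ F\!\left(d^{\pi,\mu}\right)
% Standard zero-sum Markov games are recovered when F is linear,
% i.e. F(d^{\pi,\mu}) = \langle d^{\pi,\mu}, r \rangle.
```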

🔗https://openreview.net/pdf?id=yIfCq03hsM
🔗https://openreview.net/pdf?id=dSJo5X56KQ

5/n
July 24, 2025 at 1:09 PM
📄The number of trials matters in infinite-horizon MDPs – @pedrosantospps.bsky.social et al

A deeper look at how the number of realizations used to compute F affects the convex RL problem in infinite horizon settings.
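For context, the usual finite- vs infinite-trials distinction in convex RL (my notation): with a non-linear F, the two objectives below generally differ by a Jensen gap, and the question is how the number of realizations n plays out in infinite-horizon MDPs.

```latex
% Infinite-trials objective: F of the expected occupancy d^\pi
F\!\left(d^{\pi}\right), \qquad d^{\pi} = \mathbb{E}_{\tau \sim \pi}\!\left[d_{\tau}\right]
% Finite-trials objective over n realizations: expected F of the empirical occupancy
\mathbb{E}_{\tau_1,\dots,\tau_n \sim \pi}\!\left[\, F\!\left( \tfrac{1}{n}\textstyle\sum_{i=1}^{n} d_{\tau_i} \right) \right]
```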

🔗https://openreview.net/pdf?id=I4jNAbqHnM

4/n
July 24, 2025 at 1:09 PM
📄Online episodic convex RL – Bianca Marni Moreno et al

Regret bounds for online convex RL, where F^t is adversarial and revealed only after each episode (or just evaluated on the given trajectory in a bandit-feedback variation).
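A rough sketch of the kind of regret at stake here (my notation, assuming full-information feedback over T episodes):

```latex
% Regret against the best fixed policy in hindsight, with adversarial utilities F^t
R_T \;=\; \max_{\pi}\ \sum_{t=1}^{T} F^{t}\!\left(d^{\pi}\right) \;-\; \sum_{t=1}^{T} F^{t}\!\left(d^{\pi_t}\right)
% In the bandit-feedback variation, only F^t evaluated on the realized trajectory is observed.
```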

🔗https://openreview.net/pdf?id=d8xnwqslqq

3/n
July 24, 2025 at 1:09 PM
🔍 Convex RL

Standard RL optimizes a linear objective: ⟨d^π, r⟩.
Convex RL generalizes this to any F(d^π), where F is non-linear (originally assumed convex—hence the name).

This framework subsumes:
• Imitation
• Risk sensitivity
• State coverage
• RLHF
...and more.
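
To make those concrete, a rough sketch of how some of the listed settings instantiate F (notation mine; d^E is an expert's occupancy):

```latex
% Standard RL is the linear special case
F(d^{\pi}) = \langle d^{\pi}, r \rangle
% State coverage / pure exploration: entropy of the state occupancy
F(d^{\pi}) = \mathcal{H}(d^{\pi}) = -\sum_{s} d^{\pi}(s)\,\log d^{\pi}(s)
% Imitation: (negative) divergence from an expert occupancy d^{E}
F(d^{\pi}) = -\,D_{\mathrm{KL}}\!\left(d^{\pi}\,\|\,d^{E}\right)
```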

2/n
July 24, 2025 at 1:09 PM
To learn more:

- come to our poster (no. 908) in the Thursday morning session #ICML2025

- read the preprint arxiv.org/abs/2504.04505

- watch the seminar youtube.com/watch?v=pNos...

n/n
A Classification View on Meta Learning Bandits
Contextual multi-armed bandits are a popular choice to model sequential decision-making. E.g., in a healthcare application we may perform various tests to assess a patient's condition (exploration) and t...
arxiv.org
July 15, 2025 at 3:50 PM
This is how we got "A classification view on meta learning bandits", a joint work with awesome collaborators Jeongyeol, Shie, and @aviv-tamar.bsky.social

7/n
July 15, 2025 at 3:50 PM
The regret bounds depend on an instance-dependent "classification coefficient", which suggests classification really captures the complexity of the problem rather than being a mere implementation tool

6/n
July 15, 2025 at 3:50 PM
For the latter, we show exploration is *interpretable*, as it is implemented by a shallow decision tree of simple constant-action policies, and *efficient*, giving upper/lower bounds on the regret

5/n
July 15, 2025 at 3:50 PM
Yes, apparently!
A simple algorithm that classifies the latent (condition) with a decision tree (img above right) and then exploits the best action for the classified latent does the job
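
A minimal sketch of that classify-then-exploit idea (my own illustration with made-up names, assuming scikit-learn; not the paper's code):

```python
# Hedged illustration: learn a shallow decision tree that maps test outcomes
# to the latent condition, then commit to the best treatment for the prediction.
from sklearn.tree import DecisionTreeClassifier

def classify_then_exploit(test_outcomes, conditions, best_treatment, max_depth=3):
    """test_outcomes: (n_patients, n_tests) past exploration results.
    conditions: (n_patients,) ground-truth latent conditions.
    best_treatment: dict mapping a condition to its best action."""
    tree = DecisionTreeClassifier(max_depth=max_depth)  # shallow => interpretable
    tree.fit(test_outcomes, conditions)

    def policy(patient_tests):
        condition_hat = tree.predict([patient_tests])[0]  # explore: diagnose the latent
        return best_treatment[condition_hat]              # exploit: best action for it
    return policy
```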

4/n
July 15, 2025 at 3:50 PM
Humans typically develop a standard strategy prescribing a sequence of tests to diagnose the condition before committing to the best treatment (see img left). Can we design a bandit algorithm that learns a similarly interpretable exploration strategy but is also provably efficient?

3/n
July 15, 2025 at 3:50 PM
Think about a setting in which we aim to converge on the best treatment (action) for a given patient (context) with some unknown condition (latent). The difference between how humans and bandits approach this same problem is striking:

2/n
July 15, 2025 at 3:50 PM
Congratulations, well deserved!
May 3, 2025 at 11:55 AM
All stick, no carrot
May 3, 2025 at 6:35 AM