Postdoc at the Technion. PhD from Politecnico di Milano.
https://muttimirco.github.io
If you know of other cool work in this space (or are working on one), feel free to reply and share.
Hope to see even more work on convex RL variations 🚀
n/n
Bridging convex RL with generative models: how to steer diffusion/flow models to optimize non-linear, user-specified utilities (beyond entropy-regularized fine-tuning)?
📍 EXAIT workshop
🔗 openreview.net/pdf?id=zOgAx...
7/n
Still in the convex Markov games space—this work explores more tractable objectives for the learning setting.
📍EXAIT workshop
🔗https://openreview.net/pdf?id=A1518D1Pp9
6/n
If you can 'convexify' MDPs, you can do the same for Markov games.
These two papers lay out a general framework + algorithms for the zero-sum version.
🔗https://openreview.net/pdf?id=yIfCq03hsM
🔗https://openreview.net/pdf?id=dSJo5X56KQ
5/n
A deeper look at how the number of realizations used to compute F affects the convex RL problem in infinite horizon settings.
🔗https://openreview.net/pdf?id=I4jNAbqHnM
4/n
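A toy illustration of why the number of realizations matters (my own example, not the paper's): for a convex F, the utility of a single-trajectory empirical occupancy differs from F applied to the expected occupancy, by Jensen's inequality.

```python
# Toy illustration (my own example, not the paper's): with a convex F,
# E[F(d)] over single-realization occupancies differs from F(E[d]).
import random

random.seed(0)

def F(d):                      # a convex utility: squared norm of d
    return sum(x * x for x in d)

def one_hot(i, n=2):
    return [1.0 if j == i else 0.0 for j in range(n)]

# Two equally likely deterministic "trajectories", each visiting one state
expected_d = [0.5, 0.5]
f_of_mean = F(expected_d)                     # F(E[d]) = 0.5

samples = [F(one_hot(random.randrange(2))) for _ in range(1000)]
mean_of_f = sum(samples) / len(samples)       # E[F(d)] = 1.0 here

print(f_of_mean, mean_of_f)  # Jensen: E[F(d)] >= F(E[d])
```

With infinitely many realizations the empirical occupancy concentrates on E[d], so the two objectives coincide only in that limit.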
Regret bounds for online convex RL, where F^t is adversarial and revealed only after each episode (or, in a bandit-feedback variant, evaluated only on the realized trajectory).
🔗https://openreview.net/pdf?id=d8xnwqslqq
3/n
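A minimal sketch of the online protocol (my own toy, with linearized per-round losses rather than the paper's convex F^t): the learner commits to a distribution, the adversary reveals the loss only afterwards, and a multiplicative-weights update keeps regret sublinear.

```python
# Hedge over a 3-point simplex with losses revealed only after each round.
import math, random

random.seed(1)
n, T, eta = 3, 2000, 0.05
w = [1.0] * n
cum_alg, cum_loss = 0.0, [0.0] * n

for t in range(T):
    Z = sum(w)
    p = [wi / Z for wi in w]                     # learner's play for round t
    loss = [random.random() for _ in range(n)]   # adversary reveals l^t post hoc
    cum_alg += sum(pi * li for pi, li in zip(p, loss))
    for i in range(n):
        cum_loss[i] += loss[i]
        w[i] *= math.exp(-eta * loss[i])         # multiplicative-weights update

regret = cum_alg - min(cum_loss)
print(regret)  # bounded by ln(n)/eta + eta*T/8, sublinear for tuned eta
```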
Standard RL optimizes a linear objective: ⟨d^π, r⟩.
Convex RL generalizes this to any F(d^π), where F is non-linear (originally assumed convex—hence the name).
This framework subsumes:
• Imitation
• Risk sensitivity
• State coverage
• RLHF
...and more.
2/n
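A tiny numerical sketch of the distinction above (my own numbers): the same occupancy measure d^π scored under the standard linear objective and under two non-linear utilities from the list, coverage (entropy) and imitation (KL to a hypothetical expert occupancy d_E).

```python
# Toy illustration: linear vs. non-linear objectives on one occupancy d^pi.
import math

d = [0.5, 0.3, 0.2]        # hypothetical state-occupancy measure d^pi
r = [1.0, 0.0, 2.0]        # reward vector

# Standard RL: linear objective <d^pi, r>
linear_return = sum(di * ri for di, ri in zip(d, r))

# State coverage: entropy of d^pi (non-linear in d^pi)
entropy = -sum(di * math.log(di) for di in d)

# Imitation: KL divergence to an expert occupancy d_E (also non-linear)
d_E = [0.4, 0.4, 0.2]
kl = sum(di * math.log(di / de) for di, de in zip(d, d_E))

print(linear_return, entropy, kl)
```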
- come by our poster (no. 908) during the Thursday morning session #ICML2025
- read the preprint arxiv.org/abs/2504.04505
- watch the seminar youtube.com/watch?v=pNos...
n/n
A simple algorithm that classifies the latent (condition) with a decision tree (image above, right) and then exploits the best action for the classified latent does the job.
4/n
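A hypothetical sketch of that recipe (names, thresholds, and actions are mine, not from the paper): a depth-1 "tree" infers the latent from a context feature, then the policy plays the best action stored for that latent.

```python
# Hypothetical sketch: classify the latent with a decision stump,
# then exploit the best action for the classified latent.
# All names and thresholds are illustrative, not from the paper.

BEST_ACTION = {0: "a_left", 1: "a_right"}   # per-latent optimal action

def classify_latent(x):
    # A trivial decision stump standing in for a learned tree:
    # split on the first context feature.
    return 0 if x[0] < 0.5 else 1

def act(x):
    return BEST_ACTION[classify_latent(x)]

print(act([0.2, 0.9]), act([0.8, 0.1]))
```

A learned tree (e.g. fit on rollout data) would replace the hand-set threshold, but the exploit step is the same table lookup.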