🌟Huge thanks to my advisor @rajammanabrolu.bsky.social for his invaluable guidance throughout this work! 🙏
Questions/feedback welcome below 👇
🖇️Paper: arxiv.org/abs/2510.01132
💻Code: github.com/pearls-lab/m...
To help the community, we're releasing 🐈🍵Meow-Tea-Taro 💜: a modular framework where you can configure 🌎environments, 🤖policies, and ⭐rewards.
We also provide our recipes, analyses, and tutorials on building agentic multi-turn RL pipelines in the codebase.
Code: github.com/pearls-lab/m...
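To give a flavor of the modularity, here's a rough sketch of what a pipeline config could look like. Every name below is illustrative, not the actual Meow-Tea-Taro API; see the tutorials in the repo for the real config format.

```python
# Hypothetical sketch only -- field names are illustrative, NOT the actual
# Meow-Tea-Taro API. The point is that environment, policy, and reward are
# independently swappable pieces of one multi-turn RL pipeline.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineConfig:
    # 🌎 environment: which multi-turn task suite the agent acts in
    env_name: str = "example_text_env"      # hypothetical environment id
    max_turns: int = 20                     # episode horizon in turns

    # 🤖 policy: base model, optional SFT warm-start, and RL algorithm
    base_model: str = "Qwen/Qwen2.5-7B-Instruct"  # example model id
    sft_checkpoint: Optional[str] = None    # a good SFT prior cuts RL episodes needed
    algorithm: str = "grpo"                 # e.g. "ppo", "grpo", or "rloo"

    # ⭐ reward: how densely intermediate progress is credited
    reward_density: str = "per_subgoal"     # e.g. "terminal", "per_turn", "per_subgoal"

config = PipelineConfig(algorithm="rloo", reward_density="per_turn")
print(config)
```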
⭐Reward:
Dense rewards significantly improve multi-turn RL performance, with optimal density varying by RL algorithm.
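As a toy illustration of what "reward density" means here (not the paper's implementation): the same multi-turn episode can be credited only at the end (sparse) or every time a subgoal completes (dense).

```python
# Toy illustration of reward density, not the paper's implementation.
def episode_rewards(subgoals_done_per_turn: list, task_solved: bool,
                    density: str = "dense") -> list:
    """Return a per-turn reward sequence for one multi-turn episode."""
    n_turns = len(subgoals_done_per_turn)
    if density == "sparse":
        # terminal-only reward: all learning signal arrives at the last turn
        return [0.0] * (n_turns - 1) + [1.0 if task_solved else 0.0]
    # dense reward: partial credit each turn a subgoal is completed
    rewards = [0.1 * k for k in subgoals_done_per_turn]
    rewards[-1] += 1.0 if task_solved else 0.0
    return rewards

print(episode_rewards([0, 1, 0, 2], task_solved=True, density="sparse"))
print(episode_rewards([0, 1, 0, 2], task_solved=True, density="dense"))
```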
🤖Policy:
1. Good SFT priors achieve the same performance with fewer RL episodes; however, RL is needed for generalization.
2. Given a fixed compute budget, there's an optimal SFT:RL data ratio.
3. Both PPO/GRPO (biased) and RLOO (unbiased) achieve improvements over the base models (toy sketch of the baseline difference below).
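On point 3, a toy sketch (not the paper's code) of where the bias comes from: GRPO normalizes against a group baseline that includes a rollout's own reward, while RLOO uses a leave-one-out baseline that excludes it.

```python
# Toy illustration of GRPO vs RLOO advantage estimation for one prompt's
# group of rollouts (not the paper's code).
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    # group-normalized advantage: (r_i - mean(group)) / std(group);
    # the baseline includes r_i itself, which introduces bias
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    # leave-one-out baseline: mean of the *other* rollouts' rewards,
    # independent of r_i, so the gradient estimate stays unbiased
    n = len(rewards)
    baselines = (rewards.sum() - rewards) / (n - 1)
    return rewards - baselines

group = np.array([1.0, 0.0, 0.0, 1.0])  # rewards of 4 rollouts for one prompt
print(grpo_advantages(group))
print(rloo_advantages(group))
```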
🌎Environment:
1. Agents trained on simpler environments can generalize to more complex environments.
2. Agents trained on a subset of tasks can generalize to unseen tasks.