https://yining610.github.io/
The gradient-based method consistently converges faster, reducing the required training steps by 6.1 on average across RL algorithms.
We further extend experiments to different math datasets and model families. Our two methods yield superior Pareto fronts compared to the baseline, with the gradient-based weighting showing the best overall performance.
Our method generates superior Pareto fronts that dominate all baseline approaches under both GRPO and REINFORCE training.
Across all three online RL algorithms, there is consistently at least one weight configuration under which our method outperforms the baselines on all objectives.
4/8
Different objectives vary in learning difficulty, and each reaches saturation at a different training stage.
- Rebalance multiple objectives during training through dynamic reward weighting (minimal sketch below)
- Build a Pareto-dominant front over static baselines across online RL algorithms, datasets, and model families
- Faster convergence
1/8
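For intuition, here is a minimal sketch of what gradient-based dynamic reward weighting can look like. The function names (`gradient_based_weights`, `scalarize`), the softmax-over-gradient-norms rule, and the `temperature` parameter are illustrative assumptions, not the paper's exact update.

```python
import numpy as np

def gradient_based_weights(grad_norms, temperature=1.0):
    """Illustrative reweighting rule (an assumption, not the paper's exact update):
    give more weight to objectives whose reward gradients are still large,
    i.e. objectives that have not yet saturated."""
    g = np.asarray(grad_norms, dtype=float)
    logits = g / temperature
    logits -= logits.max()              # numerical stability for the softmax
    w = np.exp(logits)
    return w / w.sum()

def scalarize(per_objective_rewards, weights):
    """Collapse per-objective rewards into the single scalar used by the RL update."""
    return float(np.dot(weights, per_objective_rewards))

# Toy usage: objective 0 still improves quickly, objective 2 has nearly saturated,
# so the dynamic weights shift credit toward the harder, unsaturated objectives.
weights = gradient_based_weights([0.9, 0.4, 0.05])
reward = scalarize([0.2, 0.6, 0.95], weights)
print(weights, reward)
```

A static baseline is the special case where `weights` never changes; the dynamic variant recomputes it during training as the per-objective gradient statistics drift.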
TL;DR: We identify a mismatch between the decomposition policy and the LLM verifier, and propose a dynamic training paradigm to bridge the gap.
📅 11AM-12:30PM, Fri, May 2
📍 Hall 3
📝 arxiv.org/abs/2407.09007
🎥 www.youtube.com/watch?v=v1c...