https://yining610.github.io/
The gradient-based method consistently converges faster, reducing the required training steps by 6.1 on average across RL algorithms.
We further extend experiments to different math datasets and model families. Our two methods yield superior Pareto fronts compared to the baseline, with the gradient-based weighting showing the best overall performance.
Our method generates superior Pareto fronts that dominate all baseline approaches under both GRPO and REINFORCE training.
Across all three online RL algorithms, there is consistently at least one weight configuration under which our method outperforms the baselines on all objectives.
4/8
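Concretely, "outperforms the baselines on all objectives" is Pareto dominance. A minimal sketch of the dominance check and front extraction (illustrative Python only; the objective values are made up and this is not the paper's evaluation code):

```python
from typing import Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` Pareto-dominates `b`: at least as good on every
    objective and strictly better on at least one (all maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points: list) -> list:
    """Keep only the non-dominated points."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Toy scores on two objectives, purely for illustration.
print(pareto_front([(0.8, 0.6), (0.7, 0.5), (0.6, 0.9)]))
# -> [(0.8, 0.6), (0.6, 0.9)]   ((0.7, 0.5) is dominated by (0.8, 0.6))
```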
Different objectives vary in learning difficulty. Each objective reaches saturation at different training stages.
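One way to act on this observation is to re-weight objectives by their recent rate of improvement, so that objectives which have already saturated stop dominating the reward. The sketch below is only an illustration of that general idea of dynamic reward weighting, not the paper's method; the function name, window size, and temperature are all assumptions.

```python
import numpy as np

def dynamic_weights(reward_history: np.ndarray,
                    window: int = 10, temp: float = 1.0) -> np.ndarray:
    """reward_history: (num_steps, num_objectives) per-objective rewards.
    Weight each objective by its recent rate of improvement, so objectives
    that have saturated receive less weight. Hyperparameters are arbitrary."""
    recent = reward_history[-window:]
    t = np.arange(len(recent))
    # Slope of a least-squares line per objective (np.polyfit accepts 2D y).
    slopes = np.polyfit(t, recent, deg=1)[0]
    # Softmax over improvement rates -> weights that sum to 1.
    z = slopes / temp
    w = np.exp(z - z.max())
    return w / w.sum()
```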
Answer:
- If the user's preference over the objectives is given, use our hypervolume-based method (see the hypervolume sketch after this list).
- If the user preference is unknown, use our gradient-based method.
2/8
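For intuition on the first recommendation: the hypervolume indicator measures how much of objective space a Pareto front dominates relative to a reference point. A minimal 2D maximization version (a sketch for intuition, not the paper's implementation; the example front and reference point are invented):

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2D maximization front w.r.t. reference point `ref`,
    which must be dominated by every point on the front."""
    # Sweep points by first objective, descending, accumulating the
    # rectangle each point adds beyond the best second objective so far.
    pts = sorted(front, key=lambda p: p[0], reverse=True)
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:  # dominated points contribute no new area
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

# Example: front [(0.6, 0.9), (0.8, 0.6)] with reference (0, 0).
print(hypervolume_2d([(0.6, 0.9), (0.8, 0.6)], (0.0, 0.0)))  # 0.66
```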
youtu.be/TgloG4Oefeg
www.youtube.com/watch?v=v1c...