We also show that our proposed reward calibration method is a strong baseline for optimizing the standard win rate on all considered datasets, performing comparably to or better than other SOTA methods and demonstrating the benefits of the reward calibration step.
For Worst-of-N, we use the Anthropic harmlessness dataset and observe similar improvements. The best improvement is achieved by an exponential transformation with a negative exponent.
We empirically compare InfAlign-CTRL with other SOTA alignment methods. For Best-of-N, we use the Anthropic helpfulness and Reddit summarization quality datasets. We show that it offers up to 3-8% improvement in inference-time win rates, achieved by an exponential transformation with a positive exponent.
We provide an analytical tool to compare InfAlign-CTRL with different transformation functions for a given inference-time procedure. We find that exponential transformations, which optimize different quantiles of the reward for different values of the exponent t, achieve close-to-optimal performance for BoN and WoN.
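One way to instantiate such a family on a calibrated reward in [0, 1] is sketched below (our illustration; the exact parameterization used in the paper may differ):

    import numpy as np

    def exp_transform(calibrated_reward, t):
        # Exponential transform of a calibrated (quantile) reward in [0, 1].
        # Positive t puts more weight on the upper tail of the reward distribution
        # (suited to Best-of-N); negative t emphasizes the lower tail (suited to
        # Worst-of-N); as t -> 0 the transform becomes nearly affine in the reward.
        return np.exp(t * np.asarray(calibrated_reward))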
We then particularize the study to two popular inference-time strategies: Best-of-N sampling (BoN) and Best-of-N jailbreaking (equivalently Worst-of-N, WoN). Despite its simplicity, BoN is known to be an effective procedure for inference-time alignment and scaling. Variants of WoN are effective for evaluating safety against jailbreaks.
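For concreteness, here is a minimal sketch of the two procedures (policy_sample and reward_model are hypothetical stand-ins for a decoder and a scoring model, not the paper's code):

    def best_of_n(prompt, policy_sample, reward_model, n=4):
        # Best-of-N: draw n candidates and keep the one the reward model scores highest.
        candidates = [policy_sample(prompt) for _ in range(n)]
        return max(candidates, key=lambda y: reward_model(prompt, y))

    def worst_of_n(prompt, policy_sample, reward_model, n=4):
        # Worst-of-N / BoN jailbreaking: an adversary keeps the lowest-scoring
        # (e.g. least safe) candidate, probing the model's worst case.
        candidates = [policy_sample(prompt) for _ in range(n)]
        return min(candidates, key=lambda y: reward_model(prompt, y))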
The reward calibration step makes the method more robust to imperfections in how the reward model was learned. We show empirically that it can help mitigate reward hacking. The transformation function allows us to further tailor the alignment objective to different inference-time procedures.
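As an illustration of what such a calibration step can look like in practice, here is a minimal Monte Carlo sketch under our reading (mapping a raw reward to its quantile among base-policy samples); reward_model and sample_base are hypothetical stand-ins:

    import numpy as np

    def calibrated_reward(prompt, response, reward_model, sample_base, n_samples=32):
        # Score n_samples responses drawn from the base policy for the same prompt.
        base_scores = np.array([
            reward_model(prompt, sample_base(prompt)) for _ in range(n_samples)
        ])
        # The calibrated reward is the fraction of base samples scoring at or below
        # the candidate response, i.e. an empirical quantile in [0, 1]. It is
        # invariant to any monotone rescaling of the underlying reward model.
        return float(np.mean(base_scores <= reward_model(prompt, response)))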
To enable practical solutions, we provide the calibrate-and-transform RL (InfAlign-CTRL) algorithm to solve this problem. It involves a reward calibration step followed by a KL-regularized reward maximization step with a transformation Φ of the calibrated reward.
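Putting the two steps together, the objective being optimized looks roughly like the following (a sketch in our own notation, reading calibration as mapping the raw reward r to its quantile under the base policy π_ref, with β the KL-regularization strength; the paper's exact parameterization may differ):

    \max_{\pi}\ \mathbb{E}_{x,\, y \sim \pi(\cdot\mid x)}\Big[\Phi\big(c_{r}(x, y)\big)\Big] \;-\; \beta\,\mathrm{KL}\big(\pi \,\big\|\, \pi_{\mathrm{ref}}\big),
    \qquad
    c_{r}(x, y) \;=\; \mathbb{P}_{y' \sim \pi_{\mathrm{ref}}(\cdot\mid x)}\big[r(x, y') \le r(x, y)\big].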
We show that the optimal reward transformation satisfies a coupled transformed-reward/policy optimization objective, which lends itself to iterative optimization. However, this approach is computationally inefficient and infeasible for real-world models.
Somewhat surprisingly, we prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a transformed reward. The challenge therefore reduces to designing a suitable reward transformation.
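For context, KL-regularized maximization of any transformed reward Φ∘r has the familiar exponential-tilting solution (a standard identity, written in our notation with regularization strength β):

    \pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, \Phi\big(r(x, y)\big)\Big).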
To characterize this, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize the inference-time win rate of the aligned policy. We show that the standard RLHF framework is sub-optimal with respect to this metric.
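One way to make this metric concrete (a sketch in our own notation, not necessarily the paper's exact definition; here we assume the comparison is against a standard sample from the base policy, and T denotes the inference-time procedure applied to the aligned policy):

    \mathrm{WinRate}_{T}(\pi; \pi_{\mathrm{ref}}) \;=\; \mathbb{P}_{x,\; y \sim (T\circ\pi)(\cdot\mid x),\; y' \sim \pi_{\mathrm{ref}}(\cdot\mid x)}\big[\, y \succ y' \,\big].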
RLHF generally entails training a reward model and then solving a KL-regularized reward maximization problem. Success is typically measured by the win rate of samples from the aligned model against samples from the base model under standard sampling.
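Concretely, the standard objective and metric can be written as follows (a sketch in our own notation; β is the regularization strength, π_ref the base policy, r the learned reward, and ≻ a judged preference):

    \max_{\pi}\ \mathbb{E}_{x,\, y \sim \pi(\cdot\mid x)}\big[r(x, y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot\mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot\mid x)\big),
    \qquad
    \mathrm{WinRate}(\pi; \pi_{\mathrm{ref}}) \;=\; \mathbb{P}_{x,\; y \sim \pi,\; y' \sim \pi_{\mathrm{ref}}}\big[\, y \succ y' \,\big].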
Inference-time procedures (e.g., Best-of-N, chain-of-thought) have been instrumental in the recent development of LLMs. Standard RLHF, however, focuses only on improving the trained model, without accounting for how it will be used at inference time. This creates a train/inference mismatch.