Conor Durkan
@conormdurkan.bsky.social
Generative modeling person
https://conordurkan.com
https://arxiv.org/abs/1805.00909
January 3, 2025 at 5:02 PM
https://arxiv.org/abs/2205.11275
January 3, 2025 at 5:02 PM
This means post-training (of this kind at least) optimizes KL(model || posterior), whereas pre-training optimizes KL(data || model). It also means post-training is mode-seeking (as opposed to mode-covering like pre-training), so those rewards better be well calibrated.
January 3, 2025 at 5:02 PM
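A minimal numerical sketch of the mode-seeking vs mode-covering point above (toy example, notation and numbers mine, not from the thread): fit a single Gaussian to a bimodal target on a grid, once by minimizing reverse KL(q || p) and once by forward KL(p || q).

```python
# Toy sketch: mode-seeking (reverse KL) vs mode-covering (forward KL).
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target with two well-separated modes.
p = 0.5 * gauss(x, -4.0, 1.0) + 0.5 * gauss(x, 4.0, 1.0)
p /= p.sum() * dx

def kl(a, b):
    # Grid approximation of KL(a || b) for densities a, b on x.
    eps = 1e-12
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

mus = np.linspace(-8, 8, 321)
reverse, forward = [], []
for mu in mus:
    q = gauss(x, mu, 1.0)
    q /= q.sum() * dx
    reverse.append(kl(q, p))  # KL(model || target): what this kind of post-training optimizes
    forward.append(kl(p, q))  # KL(target || model): what maximum-likelihood pre-training optimizes

print("reverse-KL optimum mu:", mus[np.argmin(reverse)])  # collapses onto one mode (~±4)
print("forward-KL optimum mu:", mus[np.argmin(forward)])  # averages across modes (~0)
```

With well-separated modes, the reverse-KL fit latches onto a single mode while the forward-KL fit spreads across both, which is why miscalibrated rewards are a bigger risk in the post-training direction.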
Reward functions are log-likelihoods, and the pre-trained model is a prior. The posterior target is the product of the likelihoods and prior (the prior weighting can equivalently sharpen or smooth your likelihoods). Rewards can be hard, as in math/code verification, or soft, as in subjective preference.
January 3, 2025 at 5:02 PM
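Spelling that out (notation mine, a sketch of the standard identity rather than anything quoted from the thread): with prior pi_0, reward r, and KL weight beta, the posterior target and the KL-regularized objective line up as

```latex
% Requires amsmath and amssymb.
\begin{align}
  \pi^{*}(y \mid x)
    &\propto \pi_{0}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big)
    && \text{posterior $\propto$ prior $\times$ likelihood} \\
  \operatorname*{arg\,max}_{\pi}\Big\{ \mathbb{E}_{\pi}\!\big[r(x, y)\big]
    - \beta\, \mathrm{KL}\big(\pi \,\|\, \pi_{0}\big) \Big\}
    &= \operatorname*{arg\,min}_{\pi}\; \mathrm{KL}\big(\pi \,\|\, \pi^{*}\big)
    && \text{post-training as reverse KL}
\end{align}
```

The beta here is the prior weighting mentioned in the post: a larger beta leans on the prior and smooths the likelihood term, a smaller one sharpens it.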
Ironically on my wrist
November 26, 2024 at 5:29 PM