I created four data domains: math, code, instruction-following, and general chat, so you can study their interactions during RL finetuning.
Online DPO slightly outperforms PPO on GSM8K, but more importantly, 1-step Async runs 68% faster than Sync and matches its performance 🔥
We run training and generation at the same time, but now we're training on samples from a previous timestep, aka *off-policy* RL!
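For intuition, here's a tiny sequential simulation of the 1-step async schedule. `generate` and `train` are illustrative stand-ins (not APIs from any real library); in the real system the two calls inside the loop run concurrently, which is where the speedup comes from, and the sketch just tracks which policy version produced each training batch:

```python
# A toy simulation of 1-step async RL. `generate` and `train` are hypothetical
# stand-ins; the point is to track which weights produced each batch.

def generate(weights_version, step):
    # On the real inference workers this runs concurrently with training;
    # here we only record which weights the rollouts came from.
    return {"generated_with": weights_version, "step": step}

def train(batch, weights_version):
    # One "gradient step"; staleness > 0 means the batch is off-policy.
    staleness = weights_version - batch["generated_with"]
    print(f"step {batch['step']}: training v{weights_version} on rollouts "
          f"from v{batch['generated_with']} (off-policy by {staleness})")
    return weights_version + 1  # updated weights

weights = 0
buffer = generate(weights, step=0)         # warm-up: one synchronous generation
for step in range(1, 5):
    # In 1-step async, these two calls run in parallel:
    next_buffer = generate(weights, step)  # sample with the current weights...
    weights = train(buffer, weights)       # ...while training on the previous batch
    buffer = next_buffer
```

After the warm-up batch, every gradient step consumes rollouts that are exactly one policy version stale, which is why 1-step async is only mildly off-policy and can still match the sync run's performance.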