and I created four data domains: math, code, instruction-following, and general chat, so you can study their interaction during RL finetuning
I think people are naturally adapting to LLMs giving them half-truths
bsky.app/profile/mnou...
Used by @ai2.bsky.social for OLMo-2 32B 🔥
New results show ~70% speedups for LLM + RL math and reasoning 🧠
🧵below or hear my DLCT talk online on March 28!
@vwxyzjn.bsky.social
@sophie-xhonneux.bsky.social
@arianh.bsky.social
Rishabh and Aaron who have not yet migrated 🦋
DMs open 📲 let's chat about everything LLM + RL @ ICLR and check out
Paper 📰 arxiv.org/abs/2410.18252
Code 🧑💻 github.com/mnoukhov/asy...
Would love critiques from any engineers working on RLHF if they feel I missed something!
Online DPO slightly outperforms PPO on GSM8k, but more importantly, 1-step Async runs 68% faster than Sync and matches performance 🔥
We run training and generation at the same time, but now we're training on samples from a previous timestep aka *off-policy* RL!
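Roughly, the 1-step async loop looks like this (a toy Python sketch with made-up helper names, not the actual repo code): while the trainer updates on the last batch, the generator is already sampling the next one, so every update uses data that is one policy version stale.

```python
import random

# Toy sketch of 1-step asynchronous (off-policy) RLHF -- illustrative stubs,
# not the code from the paper's repo.

def generate(policy):
    """Stand-in for sampling completions + rewards with the current policy."""
    return [("completion", policy, random.random()) for _ in range(4)]

def update(policy, batch):
    """Stand-in for one PPO / online-DPO gradient step on an off-policy batch."""
    return policy + 1  # pretend the weights changed

def async_rl(policy=0, num_steps=3):
    pending = generate(policy)            # batch from the initial policy
    for _ in range(num_steps):
        next_batch = generate(policy)     # in the real system this runs concurrently with the update
        policy = update(policy, pending)  # training data is one policy version old -> off-policy
        pending = next_batch
    return policy

print(async_rl())
```

The staleness is capped at a single step, which is what seems to keep 1-step Async close enough to on-policy to match Sync performance.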