and I created four data domains: math, code, instruction-following, and general chat, so you can study their interaction during RL finetuning
I think people are naturally adapting to LLMs giving them half-truths
bsky.app/profile/mnou...
Used by @ai2.bsky.social for OLMo-2 32B 🔥
New results show ~70% speedups for LLM + RL math and reasoning 🧠
🧵below or hear my DLCT talk online on March 28!
@vwxyzjn.bsky.social
@sophie-xhonneux.bsky.social
@arianh.bsky.social
Rishabh and Aaron who have not yet migrated 🦋
DMs open 📲 let's chat about everything LLM + RL @ ICLR and check out
Paper 📰 arxiv.org/abs/2410.18252
Code 🧑💻 github.com/mnoukhov/asy...
Would love critiques from any engineers working on RLHF if they feel I missed something!
Online DPO slightly outperforms PPO on GSM8k, but more importantly, 1-step Async runs 68% faster than Sync and matches performance 🔥
We run training and generation at the same time, but now we're training on samples from a previous timestep aka *off-policy* RL!
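Roughly, the 1-step async loop looks like this (a toy Python sketch with made-up helper names, not the actual repo code): while the trainer updates on the last batch, the generator is already sampling the next one, so every update uses data that is one policy version stale.

```python
import random

# Toy sketch of 1-step asynchronous (off-policy) RLHF -- illustrative stubs,
# not the code from the paper's repo.

def generate(policy):
    """Stand-in for sampling completions + rewards with the current policy."""
    return [("completion", policy, random.random()) for _ in range(4)]

def update(policy, batch):
    """Stand-in for one PPO / online-DPO gradient step on an off-policy batch."""
    return policy + 1  # pretend the weights changed

def async_rl(policy=0, num_steps=3):
    pending = generate(policy)            # batch from the initial policy
    for _ in range(num_steps):
        next_batch = generate(policy)     # in the real system this runs concurrently with the update
        policy = update(policy, pending)  # training data is one policy version old -> off-policy
        pending = next_batch
    return policy

print(async_rl())
```

The staleness is capped at a single step, which is what seems to keep 1-step Async close enough to on-policy to match Sync performance.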