Michael Noukhovitch...🏄 NeurIPS 2025
@mnoukhov.bsky.social
PhD in AI @mila-quebec.bsky.social RLHF and language grounding, whatever that means. Whitespace aficionado. mnoukhov.github.io
Finally, for those studying midtraining and cognitive behaviours, you can ablate different midtraining mixes to see how they affect the ability to learn reasoning in the RL-Zero setup
November 20, 2025 at 8:38 PM
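A minimal sketch of what that ablation looks like end to end; the mix names and all three helpers are hypothetical stand-ins, stubbed out here so the loop actually runs (none of these are real Olmo 3 APIs):

```python
# illustrative midtraining mixes to ablate; not the actual Olmo 3 mixes
MIXES = {
    "no-math":    {"web": 0.90, "code": 0.10, "math": 0.00},
    "math-heavy": {"web": 0.50, "code": 0.20, "math": 0.30},
}

def midtrain(base, data_mix):
    # stub: would continue pretraining `base` on `data_mix`
    return {"base": base, "mix": data_mix}

def rl_zero_finetune(checkpoint):
    # stub: would run RLVR directly from the base model (no SFT first)
    return {"policy_from": checkpoint}

def eval_reasoning(policy):
    # stub: would report a reasoning score, e.g. AIME accuracy
    return 0.0

results = {name: eval_reasoning(rl_zero_finetune(midtrain("olmo3-base", mix)))
           for name, mix in MIXES.items()}
print(results)
```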
It's also a great setup for multi-objective RL! @saurabhshah2.bsky.social and I created four data domains: math, code, instruction-following, and general chat, so you can study their interaction during RL finetuning
November 20, 2025 at 8:38 PM
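A sketch of one simple way to study that interaction: weight the prompt distribution over the four domains and vary the weights between runs. The weights and prompt names below are illustrative, not the actual Dolci splits:

```python
import random

# illustrative weights; vary these between RL runs to study domain interaction
DOMAIN_WEIGHTS = {"math": 0.25, "code": 0.25,
                  "instruction_following": 0.25, "general_chat": 0.25}

def sample_prompt(prompts_by_domain):
    # pick a domain in proportion to its weight, then a prompt from it
    domains, weights = zip(*DOMAIN_WEIGHTS.items())
    domain = random.choices(domains, weights=weights, k=1)[0]
    return domain, random.choice(prompts_by_domain[domain])

# toy prompt pools standing in for the real datasets
prompts_by_domain = {d: [f"{d}-prompt-{i}" for i in range(3)]
                     for d in DOMAIN_WEIGHTS}
print(sample_prompt(prompts_by_domain))
```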
Olmo 3 RL-Zero is also a great setup for studying RL Infra. We use it to ablate active sampling and find it really stabilizes loss!
November 20, 2025 at 8:38 PM
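For context, a sketch of the basic idea behind active sampling (simplified, not the actual implementation): keep rolling out prompt groups and skip any group whose rewards have zero variance, so every training batch carries a usable advantage signal:

```python
import random

def rollout_group(prompt, group_size=8):
    # stand-in for sampling `group_size` completions and scoring each 0/1
    return [random.random() < 0.5 for _ in range(group_size)]

def active_sample_batch(prompt_stream, batch_size=16):
    # all-correct or all-wrong groups give zero advantage, so drop them
    # and keep sampling until the batch is full of informative groups
    batch = []
    for prompt in prompt_stream:
        rewards = rollout_group(prompt)
        if 0 < sum(rewards) < len(rewards):
            batch.append((prompt, rewards))
            if len(batch) == batch_size:
                break
    return batch

batch = active_sample_batch(f"q{i}" for i in range(10_000))
print(len(batch), "prompt groups with a nonzero advantage signal")
```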
But with RLVR on our curated datasets Dolci (huggingface.co/datasets/all...), Olmo 3 base can really improve on reasoning. Look at those AIME curves go!
November 20, 2025 at 8:38 PM
Because Olmo 3 is fully open, we decontaminate our evals from our pretraining and midtraining data. @stellali.bsky.social proves this with spurious rewards: RL trained on a random reward signal can't improve on the evals, unlike in some previous setups
November 20, 2025 at 8:38 PM
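The control, in sketch form (an assumed shape for the experiment, not the exact code): replace the verifier with a reward that ignores the completion entirely. If RL on that still lifts a benchmark, the benchmark has likely leaked into the training data:

```python
import random

def verifiable_reward(completion, answer):
    # the real signal: did the model produce the verified answer?
    return float(completion.strip() == answer)

def spurious_reward(completion, answer):
    # the control: a coin flip carrying no information about the completion
    return float(random.random() < 0.5)
```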
@dnllvy.bsky.social @oumarkaba.bsky.social presenting cool work at #ICLR2025 on generative models for crystals leveraging symmetry ❄️🪞, repping @mila-quebec.bsky.social
April 24, 2025 at 7:07 AM
Llama 4 uses async RLHF and I would just like to announce that I called it t.co/w9qJxr944C
April 7, 2025 at 7:39 PM
We showed great results on RLHF but reviewers wanted reasoning + math 🧠🤔 Thanks to my labmates Amirhossein and Milad, we got Rho-1B training on GSM8k!
Online DPO slightly outperforms PPO on GSM8k, but more importantly, 1-step Async runs 68% faster than Sync and matches performance🔥
March 18, 2025 at 8:45 PM
Recap⌛️RL training of LLMs is frequently online and *on-policy*, but training and generation alternate, each idling while it waits for the other to finish.
We run training and generation at the same time, but now we're training on samples from a previous timestep aka *off-policy* RL!
March 18, 2025 at 8:45 PM
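A toy sketch of 1-step async, with an integer standing in for the policy weights (illustrative only, not the actual training code): generation and training run in parallel threads, and a size-1 queue bounds staleness at roughly one policy update:

```python
import threading, queue

NUM_STEPS = 5
sample_q = queue.Queue(maxsize=1)  # one batch in flight -> ~1-step staleness
policy_version = 0                 # toy stand-in for the policy weights
lock = threading.Lock()

def generate():
    for _ in range(NUM_STEPS):
        with lock:
            version = policy_version              # snapshot current policy
        batch = [f"sample-{i}@v{version}" for i in range(4)]
        sample_q.put((version, batch))            # blocks if the trainer lags

def train():
    global policy_version
    for step in range(NUM_STEPS):
        version, batch = sample_q.get()           # samples from a stale policy
        with lock:
            policy_version += 1                   # gradient step -> new policy
        print(f"step {step}: updated to v{policy_version} "
              f"using samples from v{version}")

threads = [threading.Thread(target=generate), threading.Thread(target=train)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```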
Programming using an AI assistant in order to improve AI assistants is giving me strong sci-fi vibes. Specifically Isaac Asimov, who clearly invented vibe coding in 1956 users.ece.cmu.edu/~gamvrosi/th...
February 11, 2025 at 12:17 AM