I created four data domains: math, code, instruction-following, and general chat, so you can study their interactions during RL finetuning.
Online DPO slightly outperforms PPO on GSM8K, but more importantly, 1-step Async runs 68% faster than Sync and matches its performance 🔥
We run training and generation at the same time, but now we're training on samples from a previous timestep, aka *off-policy* RL!
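For intuition, here's a tiny sequential simulation of the 1-step async schedule. `generate` and `train` are illustrative stand-ins (not APIs from any real library); in the real system the two calls inside the loop run concurrently, which is where the speedup comes from, and the sketch just tracks which policy version produced each training batch:

```python
# A toy simulation of 1-step async RL. `generate` and `train` are hypothetical
# stand-ins; the point is to track which weights produced each batch.

def generate(weights_version, step):
    # On the real inference workers this runs concurrently with training;
    # here we only record which weights the rollouts came from.
    return {"generated_with": weights_version, "step": step}

def train(batch, weights_version):
    # One "gradient step"; staleness > 0 means the batch is off-policy.
    staleness = weights_version - batch["generated_with"]
    print(f"step {batch['step']}: training v{weights_version} on rollouts "
          f"from v{batch['generated_with']} (off-policy by {staleness})")
    return weights_version + 1  # updated weights

weights = 0
buffer = generate(weights, step=0)         # warm-up: one synchronous generation
for step in range(1, 5):
    # In 1-step async, these two calls run in parallel:
    next_buffer = generate(weights, step)  # sample with the current weights...
    weights = train(buffer, weights)       # ...while training on the previous batch
    buffer = next_buffer
```

After the warm-up batch, every gradient step consumes rollouts that are exactly one policy version stale, which is why 1-step async is only mildly off-policy and can still match the sync run's performance.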