Petr Baudis (pasky)
@xpasky.bsky.social
Rossum.ai. A variety of OSS & AI things in the past (Git, glibc, pre-AlphaGo Pachi, OpenTTD, ...). Your computer might be running some of my code (sorry).
Is this about SotA AI in general, or compared to Gemini and Claude?
May 24, 2025 at 2:05 PM
Reposted by Petr Baudis (pasky)
Definitely believe them regarding technical capabilities of the models. (Ok maybe add a 6-12m buffer.)

Where they are imho over-indexing is the maximum realistic speed at which the real world can adapt.

Adoption of even the most amazing stuff will take time and need a lot of infra.
January 6, 2025 at 1:27 PM
I reposted the thread here! :)

bsky.app/profile/xpas...
Quick primer for non-wizards about the post-MCTS LLM reasoning future:

How will LLMs learn to reason efficiently?

No math in this thread, ~simple words only! Let's go through the "Process Reinforcement through IMplicit REwards" (PRIME) method. 1/n

curvy-check-498.notion.site/Process-Rein...
January 5, 2025 at 12:30 AM
Will be back with more later - by losing MCTS we also lost the exploration policy; how do we plug it back in?

This is a repost of a Twitter thread I made yesterday - my experiment on whether I can reach the BSky DL audience. Twitter's LLM scene is very lively; I'd love to see more of that here.

'nite! 16/16
January 5, 2025 at 12:29 AM
And the pseudocode algorithm for quick reference. 15/n
January 5, 2025 at 12:26 AM
But this is the gist of the magic. And it results in a reported 38x convergence speedup compared to MCTS & impressive benchmark gains. 14/n
January 5, 2025 at 12:25 AM
PRIME of course also contains tons of important technical details. (A PPO-style policy update with alternative-normalized advantages instead of raw rewards. The initial finetuning LLM snapshot staying around as a reference, and considering only token logit changes relative to it, is what makes the math work. Formally proving it's >> MCTS...) 13/n
January 5, 2025 at 12:25 AM
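A minimal sketch of one plausible reading of the "alternative-normalized advantages" above: baseline each rollout's reward against the mean reward of the other rollouts sampled for the same prompt (a leave-one-out baseline), rather than using the raw reward. The function name and exact normalization are illustrative, not PRIME's precise formula.

import numpy as np

def loo_advantages(rewards):
    # Leave-one-out baseline: compare each rollout's reward to the mean reward
    # of the *other* rollouts for the same prompt, instead of the raw reward.
    rewards = np.asarray(rewards, dtype=float)   # K rewards for one prompt, K >= 2
    k = len(rewards)
    baselines = (rewards.sum() - rewards) / (k - 1)
    return rewards - baselines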
Unlike using the ORM approach alone, this introduces an accretive effect - information about what works is shared across training batches through the reward LLM, and as it learns, it produces better guidance and the convergence of the main LLM speeds up. 12/n
January 5, 2025 at 12:23 AM
"Continuously" learned? The reward model LLM and the main LLM epochs are interleaved - the estimates are learned in parallel with finetuning the main model, a sort of expectation-minimization dance. 11/n
January 5, 2025 at 12:23 AM
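A rough sketch of that interleaving, where every helper (sample_rollouts, update_reward_model, compute_token_rewards, update_policy) is a hypothetical stand-in rather than PRIME's actual code:

def prime_style_training(policy, reward_model, ref_model, prompts,
                         sample_rollouts, update_reward_model,
                         compute_token_rewards, update_policy,
                         num_iterations=100):
    # Interleaved loop: sample CoT rollouts from the current policy, refresh
    # the reward-model LLM on their outcome labels, then update the policy
    # with the freshly estimated per-token rewards.
    for _ in range(num_iterations):
        rollouts = sample_rollouts(policy, prompts)        # CoTs + correct/incorrect labels
        update_reward_model(reward_model, rollouts)        # ORM-style objective on outcomes
        token_rewards = [compute_token_rewards(reward_model, ref_model, r)
                         for r in rollouts]
        update_policy(policy, rollouts, token_rewards)     # PPO/RLOO-style policy step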
...and this extra LLM is then used as a Process RM, assigning a reward to each token based on its continuously learned estimate of how helpful that token is. 10/n
January 5, 2025 at 12:23 AM
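A sketch of how the per-token reward can fall out of the two models' token log-probabilities: each token is rewarded by how much more likely the ORM-finetuned copy makes it than the frozen reference snapshot does. The beta scale and argument names are illustrative; see the PRIME write-up for the exact formulation.

import torch

def implicit_token_rewards(rm_logprobs: torch.Tensor,
                           ref_logprobs: torch.Tensor,
                           beta: float = 0.05) -> torch.Tensor:
    # rm_logprobs / ref_logprobs: per-token log-probs of one rollout under the
    # ORM-finetuned copy and under the frozen reference snapshot ([seq_len]).
    # Each token's reward is the scaled log-likelihood ratio between the two.
    return beta * (rm_logprobs - ref_logprobs)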
Well, how do we know how to reward each token then? Why, by finetuning an *extra* copy of your LLM internally to use as a per-token reward model. This extra LLM copy is finetuned using the Outcome RM approach (so sparse rewards just encouraging tokens that lead to good final outcomes)... 9/n
January 5, 2025 at 12:22 AM
5. Finally, PRIME (Implicit Rewards PRM)!

The basic question is: instead of evaluating each CoT step MCTS-style with N rollouts, could we just run a beam search of N CoT rollouts from start to end? 8/n
January 5, 2025 at 12:22 AM
The problem now is that you need to roll out the CoT ten times for each candidate - a Monte Carlo approach. This is not efficient as you are wasting a lot of time on stupid CoT step candidates and lost causes. 7/n
January 5, 2025 at 12:21 AM
4. Enter MCTS-inspired approaches for automated PRM supervision. Given a few possible next CoT steps, can we tell automatically which one is more helpful? Well, try rolling out the rest of the CoT ten times for each candidate, and see which one reaches the right answer most frequently! 6/n
January 5, 2025 at 12:21 AM
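A sketch of that Monte Carlo step scoring, with continue_cot (finishes the CoT from a given prefix) and extract_answer as hypothetical helpers:

def score_step_candidates(continue_cot, extract_answer,
                          prompt, prefix_steps, candidates, gold, n=10):
    # For each candidate next step, finish the CoT n times and use the
    # empirical success rate as that step's value estimate.
    scores = {}
    for cand in candidates:
        hits = sum(
            extract_answer(continue_cot(prompt, prefix_steps + [cand])) == gold
            for _ in range(n)
        )
        scores[cand] = hits / n
    return scores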
3. So let's give per-CoT-step reinforcement using a PRM (Process Reward Model). Like teaching humans: don't just look at the final result, tell them if their approach was good.

Naive idea: just use per-step human supervision. But that's obviously unsustainable - too little data. 5/n
January 5, 2025 at 12:21 AM
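In code terms, a PRM is just a scorer called once per reasoning step rather than once per final answer - a sketch with a hypothetical prm_score function (which could be backed by human labels, per the naive idea, or by a learned model):

def process_rewards(prm_score, prompt, cot_steps):
    # Score every reasoning step given the steps so far, not just the outcome.
    # prm_score(prompt, steps_so_far) -> float, "was the latest step helpful?"
    return [prm_score(prompt, cot_steps[: i + 1]) for i in range(len(cot_steps))]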
The problem is that each rollout gives you only final-outcome info - no sense of whether any particular CoT step actually helped move towards the result. Convergence is slow, and so is OOD generalization, etc.

4/n
January 5, 2025 at 12:19 AM
2. The basic approach is to use an ORM (Outcome Reward Model) - try answering a query by rolling out a CoT and checking if it led to the right answer. This gives positive/negative reinforcement to each token in the CoT (every token in that particular CoT gets the same reward).

3/n
January 5, 2025 at 12:18 AM
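A minimal sketch of the ORM-style reward assignment, assuming hypothetical sample_cot and extract_answer helpers:

def outcome_rewards(sample_cot, extract_answer, prompt, gold_answer):
    # One rollout, one outcome check; every token in that rollout
    # receives the same +1/-1 reward.
    tokens = sample_cot(prompt)                  # full chain-of-thought rollout (token list)
    reward = 1.0 if extract_answer(tokens) == gold_answer else -1.0
    return [reward] * len(tokens)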
1. We are RL tuning an LLM to produce good CoTs (chains of thought).

(Good == leading step by step to correct answers to complex queries.)

2/n
January 5, 2025 at 12:18 AM
Reposted by Petr Baudis (pasky)
Science/maths/programming have a tendency to depreciate the value of grinding - smart people don't grind! - yes, that project basically took me only 30 minutes. The 3 and a half hours of dead ends I went down obviously don't count. Or the hour I spent installing the wrong package.
December 24, 2024 at 1:27 PM
Yes
December 14, 2024 at 12:56 AM