Petr Baudis (pasky)
@xpasky.bsky.social
Rossum.ai. A variety of OSS & AI things in the past (Git, glibc, pre-AlphaGo Pachi, OpenTTD, ...). Your computer might be running some of my code (sorry).
Is this about SotA AI in general, or compared to Gemini and Claude?
May 24, 2025 at 2:05 PM
Reposted by Petr Baudis (pasky)
Definitely believe them regarding technical capabilities of the models. (Ok maybe add a 6-12m buffer.)

Where they are imho over-indexing is the maximum realistic speed at which the real world can adapt.

Adoption of even the most amazing stuff will take time and need a lot of infra.
January 6, 2025 at 1:27 PM
I reposted the thread here! :)

bsky.app/profile/xpas...
Quick primer for non-wizards about the post-MCTS LLM reasoning future:

How will LLMs learn to reason efficiently?

No math in this thread, ~simple words only! Let's go through the "Process Reinforcement through IMplicit REwards" (PRIME) method. 1/n

curvy-check-498.notion.site/Process-Rein...
January 5, 2025 at 12:30 AM
Will be back with more later - by losing MCTS we also lost the exploration policy; how do we plug it back in?

This is a repost of a Twitter thread I made yesterday - my experiment on whether I can reach the BSky DL audience. Twitter's LLM scene is very lively; I'd love to see more of that here.

'nite! 16/16
January 5, 2025 at 12:29 AM
And the pseudocode algorithm for quick reference. 15/n
January 5, 2025 at 12:26 AM
But this is the gist of the magic. And it results in a reported 38x convergence speedup compared to MCTS & impressive benchmark gains. 14/n
January 5, 2025 at 12:25 AM
PRIME of course also contains tons of important technical details. (A PPO-style policy update with alternative-normalized advantages instead of raw rewards. The initial finetuning LLM snapshot staying around as a reference, and considering only token logit changes relative to it, is what makes the math work. Formally proving it's >> MCTS...) 13/n
January 5, 2025 at 12:25 AM
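A minimal sketch of one plausible reading of the "alternative-normalized advantages" above: baseline each rollout's reward against the mean reward of the other rollouts sampled for the same prompt (a leave-one-out baseline), rather than using the raw reward. The function name and exact normalization are illustrative, not PRIME's precise formula.

import numpy as np

def loo_advantages(rewards):
    # Leave-one-out baseline: compare each rollout's reward to the mean reward
    # of the *other* rollouts for the same prompt, instead of the raw reward.
    rewards = np.asarray(rewards, dtype=float)   # K rewards for one prompt, K >= 2
    k = len(rewards)
    baselines = (rewards.sum() - rewards) / (k - 1)
    return rewards - baselines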
Unlike using the ORM approach alone, this introduces an accretive effect - information about what works is shared across training batches through the reward LLM, and as it learns, it produces better guidance and the convergence of the main LLM speeds up. 12/n
January 5, 2025 at 12:23 AM
"Continuously" learned? The reward model LLM and the main LLM epochs are interleaved - the estimates are learned in parallel with finetuning the main model, a sort of expectation-minimization dance. 11/n
January 5, 2025 at 12:23 AM
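A rough sketch of that interleaving, where every helper (sample_rollouts, update_reward_model, compute_token_rewards, update_policy) is a hypothetical stand-in rather than PRIME's actual code:

def prime_style_training(policy, reward_model, ref_model, prompts,
                         sample_rollouts, update_reward_model,
                         compute_token_rewards, update_policy,
                         num_iterations=100):
    # Interleaved loop: sample CoT rollouts from the current policy, refresh
    # the reward-model LLM on their outcome labels, then update the policy
    # with the freshly estimated per-token rewards.
    for _ in range(num_iterations):
        rollouts = sample_rollouts(policy, prompts)        # CoTs + correct/incorrect labels
        update_reward_model(reward_model, rollouts)        # ORM-style objective on outcomes
        token_rewards = [compute_token_rewards(reward_model, ref_model, r)
                         for r in rollouts]
        update_policy(policy, rollouts, token_rewards)     # PPO/RLOO-style policy step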
...and this extra LLM is then used as a Process RM, assigning a reward to each token based on its continuously learned estimate of how helpful that token is. 10/n
January 5, 2025 at 12:23 AM
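A sketch of how the per-token reward can fall out of the two models' token log-probabilities: each token is rewarded by how much more likely the ORM-finetuned copy makes it than the frozen reference snapshot does. The beta scale and argument names are illustrative; see the PRIME write-up for the exact formulation.

import torch

def implicit_token_rewards(rm_logprobs: torch.Tensor,
                           ref_logprobs: torch.Tensor,
                           beta: float = 0.05) -> torch.Tensor:
    # rm_logprobs / ref_logprobs: per-token log-probs of one rollout under the
    # ORM-finetuned copy and under the frozen reference snapshot ([seq_len]).
    # Each token's reward is the scaled log-likelihood ratio between the two.
    return beta * (rm_logprobs - ref_logprobs)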
Well, how do we know how to reward each token then? Why, by finetuning an *extra* copy of your LLM internally to use as a per-token reward model. This extra LLM copy is finetuned using the Outcome RM approach (so sparse rewards just encouraging tokens that lead to good final outcomes)... 9/n
January 5, 2025 at 12:22 AM
5. Finally, PRIME (Implicit Rewards PRM)!

The basic question is: instead of evaluating each CoT step MCTS-style with N rollouts, could we just run a beam search of N CoT rollouts from start to end? 8/n
January 5, 2025 at 12:22 AM
The problem now is that you need to roll out the CoT ten times for each candidate - a Monte Carlo approach. This is not efficient as you are wasting a lot of time on stupid CoT step candidates and lost causes. 7/n
January 5, 2025 at 12:21 AM
4. Enter MCTS-inspired approaches for automated PRM supervision. Given a few possible next CoT steps, can we tell automatically which one is more helpful? Well, try rolling out the rest of the CoT ten times for each candidate, and see which one reaches the right answer most frequently! 6/n
January 5, 2025 at 12:21 AM
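A sketch of that Monte Carlo step scoring, with continue_cot (finishes the CoT from a given prefix) and extract_answer as hypothetical helpers:

def score_step_candidates(continue_cot, extract_answer,
                          prompt, prefix_steps, candidates, gold, n=10):
    # For each candidate next step, finish the CoT n times and use the
    # empirical success rate as that step's value estimate.
    scores = {}
    for cand in candidates:
        hits = sum(
            extract_answer(continue_cot(prompt, prefix_steps + [cand])) == gold
            for _ in range(n)
        )
        scores[cand] = hits / n
    return scores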
3. So let's give per-CoT-step reinforcement using a PRM (Process Reward Model). Like teaching humans: don't just look at the final result, tell them if their approach was good.

Naive idea: just use per-step human supervision. But that's obviously unsustainable - too little data. 5/n
January 5, 2025 at 12:21 AM
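In code terms, a PRM is just a scorer called once per reasoning step rather than once per final answer - a sketch with a hypothetical prm_score function (which could be backed by human labels, per the naive idea, or by a learned model):

def process_rewards(prm_score, prompt, cot_steps):
    # Score every reasoning step given the steps so far, not just the outcome.
    # prm_score(prompt, steps_so_far) -> float, "was the latest step helpful?"
    return [prm_score(prompt, cot_steps[: i + 1]) for i in range(len(cot_steps))]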
The problem is that each rollout gives you only final-outcome info - no sense of whether any particular CoT step actually helped move towards the result. Convergence is slow, and so is OOD generalization, etc.

4/n
January 5, 2025 at 12:19 AM
2. The basic approach is to use an ORM (Outcome Reward Model) - try answering a query by rolling out a CoT and checking if it led to the right answer. This gives positive/negative reinforcement to each token in the CoT (every token in that particular CoT gets the same reward).

3/n
January 5, 2025 at 12:18 AM
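A minimal sketch of the ORM-style reward assignment, assuming hypothetical sample_cot and extract_answer helpers:

def outcome_rewards(sample_cot, extract_answer, prompt, gold_answer):
    # One rollout, one outcome check; every token in that rollout
    # receives the same +1/-1 reward.
    tokens = sample_cot(prompt)                  # full chain-of-thought rollout (token list)
    reward = 1.0 if extract_answer(tokens) == gold_answer else -1.0
    return [reward] * len(tokens)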
1. We are RL tuning an LLM to produce good CoTs (chains of thought).

(Good == leading step by step to correct answers to complex queries.)

2/n
January 5, 2025 at 12:18 AM
Reposted by Petr Baudis (pasky)
Science/maths/programming have a tendency to depreciate the value of grinding - smart people don't grind! - yes, that project basically took me only 30 minutes. The 3 and a half hours of dead ends I went down obviously don't count. Or the hour I spent installing the wrong package.
December 24, 2024 at 1:27 PM
Yes
December 14, 2024 at 12:56 AM