When fine-tuned with GRPO, the backtracking model shines: it discovers new, efficient strategies. 🌟
The no-backtracking model?
✅ Great at low compute (pass@1)
❌ But loses the ability to generate diverse solutions, hurting pass@k performance.
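To make the pass@1 vs pass@k tradeoff concrete, here's the standard unbiased pass@k estimator (the common evaluation formula, not anything specific to this thread; the numbers below are purely illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    are correct, solves the problem."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: a low-diversity model can top pass@1 while a more
# diverse model catches up as k grows.
print(pass_at_k(100, 40, 1))   # 0.4
print(pass_at_k(100, 40, 10))
```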
🔀 Mix-backtracking: trained on more diverse search traces
🧠 Think-backtracking: skips steps to encourage implicit reasoning
Both help! But with enough compute, the direct solution model still wins.
Instead of thinking internally without outputting CoT, they learn to spell out every step, even when it’s unnecessary.
It talks more…🤯📝 but thinks less, and that hurts test-time efficiency!
1️⃣ Teaching models to search via CoT can backfire: they learn to make mistakes. On many problems, our backtracking model makes more mistakes before finding the right answer than the direct solution model does!
🔢 On CountDown, the direct solution model—no self-reflection, just raw diversity—outperforms backtracking
🧮 But on Sudoku, the result flips: backtracking wins.
So backtracking isn't universally beneficial: it depends on the nature of the reasoning the task requires.
We train:
1️⃣ A backtracking model using CoT to perform search
2️⃣ A direct solution model that learns from the optimal solution
Equating test-time compute, which will win? 🤔
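One way to picture the compute-matched comparison (a hedged sketch with illustrative numbers, not the actual experimental setup): the backtracking model spends the whole token budget on one long search trace, while the direct solution model spends the same budget on many short independent samples, whose success probabilities compound:

```python
def compare_at_budget(budget_tokens: int, trace_len: int,
                      solution_len: int, p_trace: float,
                      p_direct: float) -> tuple[float, float]:
    """Compare two models under one shared token budget.

    Backtracking: a single search trace of trace_len tokens that
    succeeds with probability p_trace (zero if it doesn't fit).
    Direct: k = budget // solution_len independent short samples,
    each correct with probability p_direct.
    All parameters are hypothetical knobs for illustration.
    """
    k_direct = budget_tokens // solution_len
    p_backtrack = p_trace if budget_tokens >= trace_len else 0.0
    p_direct_at_k = 1.0 - (1.0 - p_direct) ** k_direct
    return p_backtrack, p_direct_at_k

# e.g. a 1000-token budget: one full search trace vs ten 100-token samples
print(compare_at_budget(1000, 1000, 100, 0.6, 0.2))
```

The point of the sketch: even a weaker per-sample model can overtake a single long trace once the budget buys it enough diverse attempts.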