David Alvarez-Melis
@dmelis.bsky.social
Professoring at Harvard || Researching at MSR || Previously: MIT CSAIL, NYU, IBM Research, ITAM
We push further with reinforcement learning 🚀

When fine-tuned with GRPO, the backtracking model shines: it discovers new, efficient strategies. 🌟

The no-backtracking model?
✅ Great at low compute (pass@1)
❌ But it loses the ability to generate diverse solutions, hurting pass@k performance (see the pass@k sketch below).
April 11, 2025 at 4:29 PM
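A quick aside on pass@k, since it drives this comparison: pass@k is the probability that at least one of k sampled solutions is correct. Below is a minimal sketch of the standard unbiased estimator (the one popularized by the Codex/HumanEval evaluations); the function name and the example numbers are illustrative, not taken from our experiments.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    solves the problem. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers: 16 samples per problem, 3 of them correct.
print(pass_at_k(n=16, c=3, k=1))  # ≈ 0.19 (pass@1)
print(pass_at_k(n=16, c=3, k=8))  # ≈ 0.90 (pass@8)
```

A model whose individual samples are weaker can still dominate at larger k if its samples are diverse; losing that diversity after RL is exactly what hurts the no-backtracking model's pass@k.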
Can we fix backtracking on CountDown by tackling these 2 issues? 🔧 We try two variations:
🔀 Mix-backtracking: trained on more diverse search traces
🧠 Think-backtracking: skips steps to encourage implicit reasoning
Both help! But with enough compute, the direct-solution model still wins.
April 11, 2025 at 4:29 PM
2️⃣ Backtracking makes models verbose—often at the expense of “actual” reasoning 💬
Instead of thinking internally without outputting CoT, they learn to spell out every step, even when it’s unnecessary.
It talks more… 🤯📝 but thinks less, which hurts test-time efficiency!
April 11, 2025 at 4:29 PM
But what goes wrong when backtracking fails (e.g., in CountDown)? 🤔 We find 2 pitfalls:
1️⃣ Teaching models to search via CoT can backfire: they learn to make mistakes. On many problems, our backtracking model makes more mistakes before finding the right answer than the direct-solution model!
April 11, 2025 at 4:29 PM
Here’s what we found:
🔢 On CountDown, the direct solution model—no self-reflection, just raw diversity—outperforms backtracking
🧮 But on Sudoku, the result flips: backtracking wins.
So, backtracking isn't universally beneficial: it depends on the nature of the reasoning required.
April 11, 2025 at 4:29 PM
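For context on the task itself: in CountDown, the model gets a few numbers and a target, and must combine the numbers with +, -, *, / to hit the target. Here is a minimal, hypothetical sketch of what a backtracking search over that space looks like (illustrative code, not the models' actual CoT traces):

```python
from fractions import Fraction

def countdown(nums, target):
    """Backtracking search for the CountDown game: combine two numbers with an
    arithmetic op, recurse on the reduced list, and undo (backtrack) whenever
    a branch cannot reach the target."""
    target = Fraction(target)

    def search(vals, trace):
        if target in vals:       # success: an intermediate result hits the target
            return trace
        if len(vals) == 1:
            return None          # dead end: caller backtracks to its next option
        for i in range(len(vals)):
            for j in range(len(vals)):
                if i == j:
                    continue
                a, b = vals[i], vals[j]
                rest = [v for k, v in enumerate(vals) if k not in (i, j)]
                candidates = [("+", a + b), ("-", a - b), ("*", a * b)]
                if b != 0:
                    candidates.append(("/", a / b))
                for op, res in candidates:
                    found = search(rest + [res], trace + [f"{a}{op}{b}={res}"])
                    if found is not None:
                        return found
        return None

    return search([Fraction(n) for n in nums], [])

print(countdown([3, 7, 25, 50], 22))  # ['3+25=28', '50-28=22']
```

The "undo at a dead end" step is the backtracking behavior in question; whether spelling that search out in the CoT helps appears to depend on the task, per the results above.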
We compare backtracking (BT) to an alternative way to scale test-time compute: parallel sampling + best-of-N.
We train:
1️⃣ A backtracking model using CoT to perform search
2️⃣ A direct solution model that learns from the optimal solution
Equating test-time compute, who will win? 🤔
April 11, 2025 at 4:29 PM
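Roughly, the compute-matched comparison works like the sketch below (illustrative Python, not our actual evaluation harness; the sampling and verification callables are hypothetical stand-ins): the backtracking model spends the whole token budget on one long search trace, while the direct-solution model spends it on N short parallel samples and succeeds if any of them verifies.

```python
from typing import Callable

def best_of_n(sample_fn: Callable[[int], str],
              verify_fn: Callable[[str], bool],
              n: int, tokens_per_sample: int) -> bool:
    """Parallel sampling + best-of-N: draw n independent samples of at most
    tokens_per_sample tokens each, succeed if any one of them verifies."""
    return any(verify_fn(sample_fn(tokens_per_sample)) for _ in range(n))

def compare_at_equal_compute(bt_sample_fn: Callable[[int], str],
                             direct_sample_fn: Callable[[int], str],
                             verify_fn: Callable[[str], bool],
                             budget_tokens: int, tokens_per_sample: int):
    """Equate test-time compute between the two strategies:
    - backtracking model: one long sample that uses the full token budget
    - direct-solution model: best-of-N with N = budget // tokens_per_sample."""
    bt_ok = verify_fn(bt_sample_fn(budget_tokens))
    n = budget_tokens // tokens_per_sample
    direct_ok = best_of_n(direct_sample_fn, verify_fn, n, tokens_per_sample)
    return bt_ok, direct_ok
```

Best-of-N only needs a verifier to select among samples; the backtracking model instead folds the whole search into a single chain of thought.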