Our work shows when backtracking helps, when it hurts, and why.
If you’re training models to reason—especially with RL—how you teach them to search matters!
Check out our paper 👉 arxiv.org/abs/2504.07052 🙏📄
When fine-tuning with GRPO, backtracking shines: it discovers new, efficient strategies. 🌟
The no-backtracking model?
✅ Great at low compute (pass@1)
❌ But it loses the ability to generate diverse solutions, hurting pass@k performance.
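For reference, here's the standard unbiased pass@k estimator from Chen et al. (2021), the usual way pass@k numbers like these are computed; a minimal sketch, not code from our paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c correct), is correct."""
    if n - c < k:  # fewer than k incorrect samples -> guaranteed hit
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. 16 samples per problem, 3 correct, estimating pass@4
print(pass_at_k(n=16, c=3, k=4))  # ≈ 0.607
```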
🔀 Mix-backtracking: trained on more diverse search traces
🧠 Think-backtracking: skips steps to encourage implicit reasoning
Both help! But with enough compute, the direct solution model still wins. (Toy sketch of the two data tweaks below 👇)
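To make the two interventions concrete, here's a toy, hypothetical sketch; the trace format and function names are ours for illustration, not the paper's actual data pipeline:

```python
import random

# A search trace is a list of CoT steps ending in the answer (our toy format).
SearchTrace = list[str]  # e.g. ["try a", "fail", "backtrack", "try b", "answer"]

def mix_backtracking_dataset(traces_per_problem: dict[str, list[SearchTrace]],
                             k: int = 4) -> list[SearchTrace]:
    """Mix-backtracking idea: sample several *different* search traces per
    problem (varied search orders) instead of one canonical trace."""
    return [t for traces in traces_per_problem.values()
            for t in random.sample(traces, min(k, len(traces)))]

def think_backtracking_trace(trace: SearchTrace, keep_every: int = 2) -> SearchTrace:
    """Think-backtracking idea: drop some intermediate steps (keeping the
    final answer) so the model must bridge the gaps with implicit reasoning."""
    return trace[:-1][::keep_every] + trace[-1:]
```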
Instead of reasoning internally (without emitting CoT), the models learn to spell out every step, even when it's unnecessary.
They talk more…🤯📝 but think less, and that hurts test-time efficiency!
1️⃣ Teaching models to search via CoT can backfire: they learn to make mistakes. On many problems, our backtracking model makes more mistakes before finding the right answer than the direct solution model does!
🔢 On CountDown, the direct solution model (no self-reflection, just raw sample diversity) outperforms backtracking.
🧮 But on Sudoku, the result flips: backtracking wins.
So backtracking isn't universally beneficial; it depends on the nature of the reasoning required.
We train:
1️⃣ A backtracking model that uses CoT to perform search (toy sketch below 👇)
2️⃣ A direct solution model that learns from the optimal solution
Equating test-time compute, who will win? 🤔
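For intuition, here's a minimal, self-contained sketch of what the two kinds of supervision could look like on CountDown; the solver, trace format, and example are our toy illustration, not the paper's actual training data:

```python
from itertools import combinations

def countdown_search(nums: list[int], target: int, trace: list[str]) -> bool:
    """Toy depth-first CountDown solver whose emitted trace mimics a
    backtracking-style CoT: attempts, dead ends, and retreats."""
    if target in nums:
        trace.append(f"reached {target} ✅")
        return True
    for a, b in combinations(nums, 2):
        rest = list(nums)
        rest.remove(a); rest.remove(b)
        # try a few ops (division omitted to keep the toy short)
        for val, expr in [(a + b, f"{a}+{b}"), (a * b, f"{a}*{b}"),
                          (abs(a - b), f"{max(a, b)}-{min(a, b)}")]:
            trace.append(f"try {expr}={val}")
            if countdown_search(rest + [val], target, trace):
                return True
            trace.append("backtrack")
    return False

trace: list[str] = []
countdown_search([3, 5, 2], 13, trace)
print(" | ".join(trace))  # backtracking supervision: full search, mistakes included
print("3*5=15, 15-2=13")  # direct-solution supervision: the optimal path only
```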
With @sunnytqin.bsky.social, @emalach.bsky.social, and Samy Jelassi, we investigate a core question for LLMs, "𝑡𝑜 𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘 𝑜𝑟 𝑛𝑜𝑡 𝑡𝑜 𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘", in two prototypical logic-heavy puzzles: CountDown and Sudoku.