When fine-tuned with GRPO, the backtracking model shines: it discovers new, efficient strategies. 🌟
The no-backtracking model?
✅ Great at low compute (pass@1)
❌ But loses the ability to generate diverse solutions, hurting pass@k performance.
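To make the pass@1 vs pass@k tradeoff concrete, here's the standard unbiased pass@k estimator (the common evaluation formula, not anything specific to this thread; the numbers below are purely illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    are correct, solves the problem."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: a low-diversity model can top pass@1 while a more
# diverse model catches up as k grows.
print(pass_at_k(100, 40, 1))   # 0.4
print(pass_at_k(100, 40, 10))
```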
🔀 Mix-backtracking: trained on more diverse search traces
🧠 Think-backtracking: skips steps to encourage implicit reasoning
Both help! But with enough compute, the direct solution model still wins.
Instead of thinking internally without outputting CoT, they learn to spell out every step, even when it’s unnecessary.
It talks more…🤯📝 but thinks less, and that hurts test-time efficiency!
1️⃣ Teaching models to search via CoT can backfire: they learn to make mistakes. On many problems, our backtracking model makes more mistakes before finding the right answer than the direct solution model does!
🔢 On CountDown, the direct solution model—no self-reflection, just raw diversity—outperforms backtracking
🧮 But on Sudoku, the result flips: backtracking wins.
So backtracking isn't universally beneficial: it depends on the nature of the reasoning the task requires.
We train:
1️⃣ A backtracking model using CoT to perform search
2️⃣ A direct solution model that learns from the optimal solution
Equating test-time compute, which will win? 🤔
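One way to picture the compute-matched comparison (a hedged sketch with illustrative numbers, not the actual experimental setup): the backtracking model spends the whole token budget on one long search trace, while the direct solution model spends the same budget on many short independent samples, whose success probabilities compound:

```python
def compare_at_budget(budget_tokens: int, trace_len: int,
                      solution_len: int, p_trace: float,
                      p_direct: float) -> tuple[float, float]:
    """Compare two models under one shared token budget.

    Backtracking: a single search trace of trace_len tokens that
    succeeds with probability p_trace (zero if it doesn't fit).
    Direct: k = budget // solution_len independent short samples,
    each correct with probability p_direct.
    All parameters are hypothetical knobs for illustration.
    """
    k_direct = budget_tokens // solution_len
    p_backtrack = p_trace if budget_tokens >= trace_len else 0.0
    p_direct_at_k = 1.0 - (1.0 - p_direct) ** k_direct
    return p_backtrack, p_direct_at_k

# e.g. a 1000-token budget: one full search trace vs ten 100-token samples
print(compare_at_budget(1000, 1000, 100, 0.6, 0.2))
```

The point of the sketch: even a weaker per-sample model can overtake a single long trace once the budget buys it enough diverse attempts.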