Many supposed gains don’t hold up under scrutiny.
Progress is possible—but let’s build on reproducible foundations.
🧠 Full paper: arxiv.org/abs/2504.07086
🧑🔬 By: @hrdkbhatnagar.bsky.social @vishaalurao.bsky.social @samuelalbanie.bsky.social @bayesiankitten.bsky.social @MatthiasBethge
– Tune decoding per model
– Use appropriate prompts/templates
– Standardize hardware/software (we use Docker)
– Open-source everything
📦 Code, prompts, outputs: github.com/bethgelab/so...
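The checklist above can be sketched as a minimal evaluation loop. This is an illustrative sketch, not the paper's actual harness: `generate` and `is_correct` are hypothetical stand-ins for a model's decoding call and an answer checker, and the seed/decoding values are placeholders.

```python
# Hypothetical sketch of a multi-seed, fixed-decoding Pass@1 protocol.
# `generate` and `is_correct` are placeholders, not a real library API.
import statistics

def pass_at_1(problems, generate, is_correct, seeds=(0, 1, 2, 3, 4),
              temperature=0.6, top_p=0.95):
    """Average Pass@1 over several seeds with pinned decoding params."""
    per_seed = []
    for seed in seeds:
        correct = sum(
            is_correct(problem, generate(problem, seed=seed,
                                         temperature=temperature,
                                         top_p=top_p))
            for problem in problems
        )
        per_seed.append(correct / len(problems))
    # Report mean and spread, not a single-seed point estimate.
    return statistics.mean(per_seed), statistics.stdev(per_seed)
```

Pinning the environment (e.g. via Docker) keeps the software/hardware term of the variance fixed as well.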
🔹 RL methods over distillations? Often negligible gains, prone to overfitting.
🔹 Supervised finetuning (SFT) on reasoning traces? Stable & generalizable.
– Random seed: swings Pass@1 by 5–15 percentage points
– Temperature/top-p: another ±10 percentage points
– Software and hardware stack? Yes, even that changes scores
🎯 Single-seed results on small datasets are essentially noise.
➡️ Performance dropped by up to 17%
➡️ Improvements fall within variance range of the base model
➡️ Some models don’t beat the baseline!
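One way to operationalize "improvements fall within the base model's variance range" is to compare the mean gain against the base model's seed-to-seed spread. A minimal sketch, with illustrative numbers (not results from the paper):

```python
# Hypothetical check: does a reported gain exceed base-model seed noise?
import statistics

def gain_exceeds_noise(base_scores, new_scores, k=2.0):
    """True only if the mean improvement over the base model exceeds
    k standard deviations of the base model's across-seed scores."""
    base_mean = statistics.mean(base_scores)
    noise = statistics.stdev(base_scores)  # seed-to-seed spread
    gain = statistics.mean(new_scores) - base_mean
    return gain > k * noise
```

With a base model swinging several percentage points per seed, a 1–2 point "improvement" fails this check.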
We find that many celebrated gains from RL methods vanish once you:
✅ average over multiple seeds
✅ control decoding
✅ standardize prompt & infra
@hrdkbhatnagar.bsky.social, @vishaalurao.bsky.social, @bayesiankitten.bsky.social, Matthias Bethge [6/6]
🔸 Some questions contain subquestions, but only one answer is labeled. The model may get penalized for "wrong" but valid reasoning. [2/6]