Miriam Wanner
miriamsw.bsky.social
Reposted by Miriam Wanner
🚨 You are only evaluating a slice of your test-time scaling model's performance! 🚨

📈 We consider how models’ confidence in their answers changes as test-time compute increases. Reasoning longer helps models answer more confidently!

📝: arxiv.org/abs/2502.13962
February 20, 2025 at 3:14 PM