Big thanks to my co-authors @minhducbui.bsky.social & Katharina von der Wense!
📄 Read the full paper here: arxiv.org/abs/2509.15020
✅ Shifts model accuracy by up to 11%
✅ Changes which model tops the leaderboard – raising serious concerns about the comparability of LLM leaderboards in prior work
✅ Affects calibration (reliability of confidence estimates)
Practice is currently split: no community-wide standard exists, and even popular evaluation frameworks differ.