Mario Sanz
msanz.bsky.social
PhD student in #NLProc
What looks like a trivial formatting choice can actually alter research conclusions, so mind the gap!

Big thanks to my co-authors @minhducbui.bsky.social & Katharina von der Wense!

📄 Read the full paper here: arxiv.org/abs/2509.15020
Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs
When evaluating large language models (LLMs) with multiple-choice question answering (MCQA), it is common to end the prompt with the string "Answer:" to facilitate automated answer extraction via next...
September 26, 2025 at 9:18 AM
Surprisingly, this small detail:
✅ Shifts model accuracy by up to 11%
✅ Changes which model tops the leaderboard – raising serious concerns about comparability of LLM leaderboards in prior work
✅ Affects calibration (reliability of confidence estimates)
September 26, 2025 at 9:18 AM
In our #EMNLP2025 paper we study how the space before the answer letter (e.g., "A" vs. "␣A") is tokenized.

Practice is currently split: no community-wide standard exists, and even popular evaluation frameworks differ.
September 26, 2025 at 9:18 AM
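
To make the two conventions concrete, here is a minimal sketch of how the same MCQA item can be split into prompt and answer candidates in two ways. The function name `build_variants` and the variant labels are hypothetical, chosen for illustration only; they are not taken from the paper or from any evaluation framework. No real tokenizer is involved: the sketch only shows where the space ends up, which is what determines whether the model scores "A" or "␣A" as the next token.

```python
def build_variants(question, options):
    """Return two (prompt, candidates) splits for one MCQA item.

    Hypothetical helper for illustration; the labels below are assumptions.
    """
    letters = [chr(ord("A") + i) for i in range(len(options))]
    lines = [question] + [f"{l}. {o}" for l, o in zip(letters, options)]
    stem = "\n".join(lines) + "\nAnswer:"
    return {
        # Space belongs to the prompt; the model scores bare letters ("A").
        "space_in_prompt": (stem + " ", letters),
        # Space belongs to the candidate; the model scores " A", " B", ...
        "space_with_letter": (stem, [" " + l for l in letters]),
    }


variants = build_variants("What is 2 + 2?", ["3", "4"])
for name, (prompt, candidates) in variants.items():
    # Show the tail of the prompt and the strings that would be scored.
    print(name, repr(prompt[-8:]), candidates)
```

Both splits concatenate to the same final text, so the difference is invisible in the rendered prompt and only shows up in which strings are passed to the tokenizer as answer candidates.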