Nishant Balepur
@nbalepur.bsky.social
CS PhD Student. Trying to find that dog in me at UMD. Babysitting (aligning) + Bullying (evaluating) LLMs

nbalepur.github.io
😂
September 25, 2025 at 12:25 PM
if it is truly helpful, honest, and harmless, yes 🙏
February 26, 2025 at 1:12 AM
The alignment is a system prompt saying "if the user asks X, do Y" 😝
February 26, 2025 at 1:04 AM
And huge thanks to my friends and labmates who let me bother them to find the right people, review the paper, and have useful discussions 🙏
@saxon.me @lasha.bsky.social @yysung.bsky.social @maharshigor.bsky.social @matthewshu.com @houyu0930.bsky.social

(and many more I'm forgetting, sorry!)
February 24, 2025 at 9:04 PM
This was a really fun paper to put together with Rachel and @boydgraber.bsky.social, allowing me to vent many of my frustrations from working with MCQA over the past year 😪🫡

Please check out the paper, we would love to hear your feedback! 📄👇
February 24, 2025 at 9:04 PM
In short, here’s how to build better evals:
✅ Check if MCQA is the right format for what you want to test
✅ Use design choices to limit leakage/errors/shortcuts
✅ Keep questions easy for humans, hard for models

If we don’t put in this effort, what is MCQA even testing? 🤷‍♂️
February 24, 2025 at 9:04 PM
Lastly, we discuss persistent flaws of LLMs when running MCQA:
🔩Robustness Issues
🌎 Biases
💬 Unfaithful Explanations

Many of the solutions we proposed earlier for MCQA's format/datasets can also help address or evaluate these issues 😁
February 24, 2025 at 9:04 PM
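(For the robustness point above, a rough sketch of one common check: average accuracy over every ordering of the answer choices to expose position bias. The `predict` callable is a hypothetical stand-in for any MCQA model call; none of this is from the paper.)

```python
from itertools import permutations

def order_robust_accuracy(predict, question, options, gold_idx):
    """Average accuracy over all reorderings of the options.
    `predict(question, options) -> int` is a hypothetical stand-in for an MCQA model call.
    A model that truly knows the answer should score the same under any ordering."""
    hits, trials = 0, 0
    for perm in permutations(range(len(options))):
        reordered = [options[i] for i in perm]
        pick = predict(question, reordered)   # position picked in the reordered list
        hits += (perm[pick] == gold_idx)      # map back to the original index
        trials += 1
    return hits / trials

# Toy usage: a "model" that always picks the first option is maximally position-biased.
always_first = lambda q, opts: 0
print(order_robust_accuracy(always_first, "Capital of France?",
                            ["Paris", "Rome", "Berlin", "Madrid"], gold_idx=0))  # 0.25
```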
Two of the most pressing and promising dataset improvements include:
📋 Writing MCQs using educators' rubrics to improve question quality
🧑‍🎓 Designing MCQs that are hard for models but easy for humans (adversarial), rather than creating needlessly impossible or obscure questions
February 24, 2025 at 9:04 PM
Next, we show that even when MCQA is a good format, our datasets still have issues 🥲

We discuss:
🔓 Dataset Leakage
❓ Unanswerable Questions
⚡️ Shortcuts
📈 Saturation

More good news: once again, educators already have solutions! We also discuss recent work tackling these problems! 💪
February 24, 2025 at 9:04 PM
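(For the shortcuts point above, a minimal sketch of one common diagnostic from the MCQA literature: a choices-only baseline, where the model answers without ever seeing the question. Accuracy well above random chance suggests the options alone give away the answer. The predictions and the 10-point margin below are illustrative assumptions, not numbers from the paper.)

```python
def flag_choices_only_shortcuts(choices_only_preds, gold_answers, num_options=4, margin=0.10):
    """Flag a dataset if a choices-only baseline (the model never sees the question)
    beats random chance by more than `margin` -- a sign the options leak the answer."""
    acc = sum(p == g for p, g in zip(choices_only_preds, gold_answers)) / len(gold_answers)
    chance = 1.0 / num_options
    return acc, acc > chance + margin

# Illustrative usage with made-up predictions from a model shown only the options:
gold  = ["A", "C", "B", "D", "A", "B"]
preds = ["A", "C", "B", "A", "A", "B"]
acc, suspicious = flag_choices_only_shortcuts(preds, gold)
print(f"choices-only accuracy: {acc:.2f}, possible shortcuts: {suspicious}")
```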
So what's better? ❤️‍🩹

We explore two possible improvements:
1️⃣ Constructed Response (short-form QA)
2️⃣ Explanation MCQA (justifying answers)

Both are grounded in education research, better align with LLM use cases, and test deeper levels of knowledge than MCQA ⭐️
February 24, 2025 at 9:04 PM
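(To make the two formats above concrete, a sketch of how the prompts might differ; the wording and question are illustrative, not taken from the paper.)

```python
question = "Why does ice float on water?"
options = ["A) Ice is denser than liquid water",
           "B) Ice is less dense than liquid water",
           "C) Surface tension holds it up",
           "D) Dissolved salts push it up"]

# 1) Constructed response: short-form generation, no options to choose from.
cr_prompt = f"Answer in one or two sentences: {question}"

# 2) Explanation MCQA: pick an option AND justify the choice.
emcqa_prompt = (f"{question}\n" + "\n".join(options)
                + "\nChoose one option and explain your reasoning.")

print(cr_prompt, emcqa_prompt, sep="\n\n")
```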
First, we show MCQA is flawed as a standardized LLM eval format because it often fails to:
🔒 Test subjectivity and generation
👥 Align with real LLM use cases
🧠 Assess knowledge (based on education research)

When's the last time you asked ChatGPT to answer an MCQ? 🤔
February 24, 2025 at 9:04 PM
We break our position into three points:
1️⃣ Flaws in MCQA’s format
2️⃣ Issues in datasets
3️⃣ Weaknesses in how LLMs run MCQA

The good news? Best practices from education, designed for effective student testing, can help fix these 🧑‍🏫

Yet, we rarely use these insights in LLM evaluation 🤦
February 24, 2025 at 9:04 PM
Namely, @boydgraber.bsky.social, @lasha.bsky.social, Rachel, Feng, and folks from Adobe Research 🫡
January 31, 2025 at 2:32 PM