Nishant Balepur
@nbalepur.bsky.social
CS PhD Student. Trying to find that dog in me at UMD. Babysitting (aligning) + Bullying (evaluating) LLMs

nbalepur.github.io
😂
September 25, 2025 at 12:25 PM
if it is truly helpful, honest, and harmless, yes 🙏
February 26, 2025 at 1:12 AM
The alignment is a system prompt saying "if the user asks X, do Y" 😝
February 26, 2025 at 1:04 AM
And huge thanks to my friends and labmates who let me bother them to find the right people, review the paper, and have useful discussions 🙏
@saxon.me @lasha.bsky.social @yysung.bsky.social @maharshigor.bsky.social @matthewshu.com @houyu0930.bsky.social

(and many more I'm forgetting, sorry!)
February 24, 2025 at 9:04 PM
This was a really fun paper to put together with Rachel and @boydgraber.bsky.social, allowing me to vent many of my frustrations from working with MCQA over the past year 😪🫡

Please check out the paper, we would love to hear your feedback! 📄👇
February 24, 2025 at 9:04 PM
In short, here’s how to build better evals:
✅ Check if MCQA is the right format for what you want to test
✅ Use design choices to limit leakage/errors/shortcuts
✅ Keep questions easy for humans, hard for models

If we don’t put in this effort, what is MCQA even testing? 🤷‍♂️
February 24, 2025 at 9:04 PM
Lastly, we discuss persistent flaws of LLMs when running MCQA:
🔩Robustness Issues
🌎 Biases
💬 Unfaithful Explanations

Many of the solutions we proposed earlier for MCQA's format/datasets can also help address or evaluate these issues 😁
February 24, 2025 at 9:04 PM
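(For the robustness point above, a rough sketch of one common check: average accuracy over every ordering of the answer choices to expose position bias. The `predict` callable is a hypothetical stand-in for any MCQA model call; none of this is from the paper.)

```python
from itertools import permutations

def order_robust_accuracy(predict, question, options, gold_idx):
    """Average accuracy over all reorderings of the options.
    `predict(question, options) -> int` is a hypothetical stand-in for an MCQA model call.
    A model that truly knows the answer should score the same under any ordering."""
    hits, trials = 0, 0
    for perm in permutations(range(len(options))):
        reordered = [options[i] for i in perm]
        pick = predict(question, reordered)   # position picked in the reordered list
        hits += (perm[pick] == gold_idx)      # map back to the original index
        trials += 1
    return hits / trials

# Toy usage: a "model" that always picks the first option is maximally position-biased.
always_first = lambda q, opts: 0
print(order_robust_accuracy(always_first, "Capital of France?",
                            ["Paris", "Rome", "Berlin", "Madrid"], gold_idx=0))  # 0.25
```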
Two of the most pressing and promising dataset improvements include:
📋 Writing MCQs using educators' rubrics to improve question quality
🧑‍🎓 Designing MCQs that are hard for models but easy for humans (adversarial), rather than creating needlessly impossible or obscure questions
February 24, 2025 at 9:04 PM
Next, we show that even when MCQA is a good format, our datasets still have issues 🥲

We discuss:
🔓 Dataset Leakage
❓ Unanswerable Questions
⚡️ Shortcuts
📈 Saturation

More good news: once again, educators already have solutions! We also discuss recent work tackling these problems! 💪
February 24, 2025 at 9:04 PM
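(For the shortcuts point above, a minimal sketch of one common diagnostic from the MCQA literature: a choices-only baseline, where the model answers without ever seeing the question. Accuracy well above random chance suggests the options alone give away the answer. The predictions and the 10-point margin below are illustrative assumptions, not numbers from the paper.)

```python
def flag_choices_only_shortcuts(choices_only_preds, gold_answers, num_options=4, margin=0.10):
    """Flag a dataset if a choices-only baseline (the model never sees the question)
    beats random chance by more than `margin` -- a sign the options leak the answer."""
    acc = sum(p == g for p, g in zip(choices_only_preds, gold_answers)) / len(gold_answers)
    chance = 1.0 / num_options
    return acc, acc > chance + margin

# Illustrative usage with made-up predictions from a model shown only the options:
gold  = ["A", "C", "B", "D", "A", "B"]
preds = ["A", "C", "B", "A", "A", "B"]
acc, suspicious = flag_choices_only_shortcuts(preds, gold)
print(f"choices-only accuracy: {acc:.2f}, possible shortcuts: {suspicious}")
```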
So what's better? ❤️‍🩹

We explore two possible improvements:
1️⃣ Constructed Response (short-form QA)
2️⃣ Explanation MCQA (justifying answers)

Both are grounded in education research, better align with LLM use cases, and test deeper levels of knowledge than MCQA ⭐️
February 24, 2025 at 9:04 PM
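(To make the two formats above concrete, a sketch of how the prompts might differ; the wording and question are illustrative, not taken from the paper.)

```python
question = "Why does ice float on water?"
options = ["A) Ice is denser than liquid water",
           "B) Ice is less dense than liquid water",
           "C) Surface tension holds it up",
           "D) Dissolved salts push it up"]

# 1) Constructed response: short-form generation, no options to choose from.
cr_prompt = f"Answer in one or two sentences: {question}"

# 2) Explanation MCQA: pick an option AND justify the choice.
emcqa_prompt = (f"{question}\n" + "\n".join(options)
                + "\nChoose one option and explain your reasoning.")

print(cr_prompt, emcqa_prompt, sep="\n\n")
```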
First, we show MCQA is flawed as a standardized LLM eval format because it often fails to:
🔒 Test subjectivity and generation
👥 Align with real LLM use cases
🧠 Assess knowledge (based on education research)

When's the last time you asked ChatGPT to answer an MCQ? 🤔
February 24, 2025 at 9:04 PM
We break our position into three points:
1️⃣ Flaws in MCQA’s format
2️⃣ Issues in datasets
3️⃣ Weaknesses in how LLMs run MCQA

The good news? Best practices from education, designed for effective student testing, can help fix these 🧑‍🏫

Yet, we rarely use these insights in LLM evaluation 🤦
February 24, 2025 at 9:04 PM
Namely, @boydgraber.bsky.social, @lasha.bsky.social, Rachel, Feng, and folks from Adobe Research 🫡
January 31, 2025 at 2:32 PM