nbalepur.github.io
@saxon.me @lasha.bsky.social @yysung.bsky.social @maharshigor.bsky.social @matthewshu.com @houyu0930.bsky.social
(and many more I'm forgetting, sorry!)
Please check out the paper, we would love to hear your feedback! 📄👇
✅ Check if MCQA is the right format for what you want to test
✅ Use design choices to limit leakage/errors/shortcuts
✅ Keep questions easy for humans, hard for models
If we don’t put in this effort, what is MCQA even testing? 🤷‍♂️
🔩 Robustness Issues
🌎 Biases
💬 Unfaithful Explanations
Many of the solutions we proposed for MCQA's format and datasets can also address or evaluate these issues 😁 (one robustness probe is sketched below)
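Not from the paper, but here is a minimal sketch of one such robustness probe: shuffle the answer options and check whether the model keeps picking the same option content. `ask_model` is a hypothetical wrapper around your LLM that returns the index of the option it picked:

```python
import random
from typing import Callable, List

def permutation_consistency(
    ask_model: Callable[[str], int],  # hypothetical: returns chosen option index
    question: str,
    options: List[str],
    n_trials: int = 5,
    seed: int = 0,
) -> float:
    """Fraction of shuffles on which the model picks its most common answer."""
    rng = random.Random(seed)
    picks: List[str] = []
    for _ in range(n_trials):
        order = list(range(len(options)))
        rng.shuffle(order)
        prompt = question + "\n" + "\n".join(
            f"{chr(65 + i)}. {options[j]}" for i, j in enumerate(order)
        )
        choice = ask_model(prompt)            # index into the shuffled list
        picks.append(options[order[choice]])  # map back to the option text
    modal = max(set(picks), key=picks.count)
    return picks.count(modal) / len(picks)
```

A score well below 1.0 means the answer moves with option order, i.e., a positional bias.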
📋 Writing MCQs using educators' rubrics to improve question quality
🧑‍🎓 Designing MCQs that are hard for models but easy for humans (adversarial), rather than creating needlessly impossible/obscure questions (a minimal filter for this is sketched below)
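As a hedged illustration (the names are mine, not the paper's), the "human-easy, model-hard" selection step reduces to a simple filter, given placeholder predicates for human and model accuracy:

```python
from typing import Callable, Iterable, List, TypeVar

Item = TypeVar("Item")

def adversarial_filter(
    items: Iterable[Item],
    human_correct: Callable[[Item], bool],  # hypothetical human-accuracy check
    model_correct: Callable[[Item], bool],  # hypothetical model-accuracy check
) -> List[Item]:
    """Keep only questions humans answer correctly but the model does not."""
    return [it for it in items if human_correct(it) and not model_correct(it)]
```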
We discuss:
🔓 Dataset Leakage
❓ Unanswerable Questions
⚡️ Shortcuts
📈 Saturation
More good news: once again, educators already have solutions! We also discuss recent work tackling these problems! 💪 (see the shortcut probe sketched below)
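For instance, a partial-input baseline is a standard shortcut check: hide the question and see whether the options alone give the answer away. A minimal sketch, with `ask_model` as a hypothetical LLM hook:

```python
from typing import Callable, List, Sequence, Tuple

def choices_only_accuracy(
    ask_model: Callable[[str], int],              # hypothetical: returns chosen index
    items: Sequence[Tuple[str, List[str], int]],  # (question, options, gold index)
) -> float:
    """Accuracy when the question is withheld and only the options are shown."""
    correct = 0
    for _question, options, gold in items:
        prompt = "Pick the most likely answer:\n" + "\n".join(
            f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)
        )
        correct += int(ask_model(prompt) == gold)
    return correct / len(items)
```

If this lands far above chance (1 / number of options), the options are leaking the answer.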
We explore two possible improvements:
1️⃣ Constructed Response (short-form QA)
2️⃣ Explanation MCQA (justifying answers)
Both are grounded in education research, better align with LLM use cases, and test deeper levels of knowledge than MCQA ⭐️ (illustrative prompts below)
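Purely as illustration (the exact wording is mine, not the paper's), the two formats differ only in the prompt contract:

```python
from typing import List

def constructed_response_prompt(question: str) -> str:
    """Short-form QA: no options, so the model must generate the answer itself."""
    return f"Answer in a short phrase.\nQ: {question}\nA:"

def explanation_mcqa_prompt(question: str, options: List[str]) -> str:
    """Explanation MCQA: the justification itself becomes gradable."""
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (
        f"Q: {question}\n{opts}\n"
        "Pick one option and explain why it is correct and why the others are not."
    )
```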
🔒 Test subjectivity and generation
👥 Align with real LLM use cases
🧠 Assess knowledge (based on education research)
When's the last time you asked ChatGPT to answer an MCQ? 🤔
1️⃣ Flaws in MCQA’s format
2️⃣ Issues in datasets
3️⃣ Weaknesses in how LLMs run MCQA
The good news? Best practices from education, designed for effective student testing, can help fix these 🧑‍🏫
Yet, we rarely use these insights in LLM evaluation 🤦
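On 3️⃣: many evaluation harnesses score only the log-probability of the letter tokens, which can disagree with what the same model writes when allowed to answer freely. A hedged sketch of that discrepancy check, with hypothetical `letter_logprobs` and `generate` hooks:

```python
from typing import Callable, Dict

def scoring_mismatch(
    letter_logprobs: Callable[[str], Dict[str, float]],  # e.g. {"A": -1.2, "B": -0.4, ...}
    generate: Callable[[str], str],                      # free-form completion
    prompt: str,
) -> bool:
    """True when log-prob scoring and free generation disagree on the letter."""
    lp = letter_logprobs(prompt)
    by_logprob = max(lp, key=lp.get)                     # highest-scoring letter token
    by_generation = generate(prompt).strip()[:1].upper() # first letter of the free reply
    return by_logprob != by_generation
```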