nbalepur.github.io
Our first paper designs a preference training method to boost LLM personalization 🎨
While the second outlines our position on why MCQA evals are terrible and how to make them better 🙏
Grateful for amazing collaborators!
And loved visiting London+Edinburgh this week, hope to be back soon! 🙏
🔩 Robustness Issues
🌎 Biases
💬 Unfaithful Explanations
Many of the fixes we proposed for MCQA's format and datasets can also help address or evaluate these issues 😁
📋 Writing MCQs using educators' rubrics to improve question quality
🧑🎓 Designing MCQs that are hard for models but easy for humans (adversarial), rather than creating needlessly impossible or obscure questions
We explore two possible improvements:
1️⃣ Constructed Response (short-form QA)
2️⃣ Explanation MCQA (justifying answers)
Both are grounded in education research, better align with LLM use cases, and test deeper levels of knowledge than MCQA ⭐️
🔒 Test subjectivity and generation
👥 Align with real LLM use cases
🧠 Assess knowledge (based on education research)
When's the last time you asked ChatGPT to answer an MCQ? 🤔
1️⃣ Flaws in MCQA’s format
2️⃣ Issues in datasets
3️⃣ Weaknesses in how LLMs run MCQA
The good news? Best practices from education, developed for effective student testing, can help fix these 🧑🏫
Yet we rarely apply these insights to LLM evaluation 🤦
Multiple choice evals for LLMs are simple and popular, but we know they are awful 😬
We complain they're full of errors, saturated, and test nothing meaningful, so why do we still use them? 🫠
Here's why MCQA evals are broken, and how to fix them 🧵
📄✍️ MoDS: Multi-Doc Summarization for Debatable Queries (Adobe intern work, coming soon!)
🤔❓Reverse QA: LLMs struggle with the simple task of giving questions for answers
Grateful for all my collaborators 😁
Best of luck to anyone submitting tmrw :)