umd.wd1.myworkdayjobs.com/UMCP/job/Uni...
Please join us!
youtu.be/87OBxEM8a9E
HellaSwag is currently one of the most widely used LLM benchmarks in the world. We introduce a new critical method to assess the validity of standard LLM evals and show that HellaSwag does not accurately measure common sense reasoning. arxiv.org/abs/2504.07825
Please check out the paper, we would love to hear your feedback! 📄👇
📈 We consider how models’ confidence in their answers changes as test-time compute increases. Reasoning longer helps models answer more confidently!
📝: arxiv.org/abs/2502.13962
youtube.com/live/jVjTbPH...
Check your e-mail for "Reminder: ACL 2024 Elections - Please Vote".