anaryegen.github.io
Presenting Wicked: a simple automated method to make MCQA benchmarks more challenging. Wicked shook up 18 open-weight LLMs on 6 benchmarks, with performance drops of up to 19.7% under direct prompting 🤯
Paper: shorturl.at/1CGq0
Code: shorturl.at/n2nCU
@clefourrier.bsky.social - a platform that lets you easily compare models as judges side-by-side and vote for the best evaluation
Check out the live leaderboard and start voting now 🤗