EvalEval Coalition
@eval-eval.bsky.social
We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.

https://evalevalai.com/
🚨 AI keeps scaling, but social impact evaluations aren’t, and the data proves it 🚨

Our new paper, 📎“Who Evaluates AI’s Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations,” analyzes hundreds of evaluation reports and reveals major blind spots ‼️🧵 (1/7)
November 13, 2025 at 1:59 PM
🚨 EvalEval is back, now in San Diego! 🚨

🧠 Join us for the 2025 Workshop on "Evaluating AI in Practice: Bridging Statistical Rigor, Sociotechnical Insights, and Ethical Boundaries" (Co-hosted with UKAISI)

📅 Dec 8, 2025
📝 Abstract due: Nov 20, 2025

Details below! ⬇️
evalevalai.com/events/works...
November 6, 2025 at 9:19 PM
✨ Weekly AI Evaluation Paper Spotlight ✨

🤔Is it time to move beyond static tests and toward more dynamic, adaptive, and model-aware evaluation?

🖇️ "Fluid Language Model Benchmarking" by
@valentinhofmann.bsky.social et. al introduces a dynamic benchmarking method for evaluating language models
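
The general idea behind this style of adaptive evaluation comes from item response theory (IRT): estimate a model's latent ability and pick the next test item that is most informative at that estimate. Here is a minimal, generic sketch of that loop, not the paper's actual procedure; all item parameters and counts below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 200
a = rng.uniform(0.5, 2.0, n_items)   # illustrative item discriminations
b = rng.normal(0.0, 1.0, n_items)    # illustrative item difficulties

def p_correct(theta, a, b):
    # 2PL IRT: P(correct | model ability theta, item params a, b)
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_theta(responses, grid=np.linspace(-4, 4, 401)):
    # Maximum-likelihood ability estimate over a coarse grid
    ll = np.zeros_like(grid)
    for item, y in responses:
        p = p_correct(grid, a[item], b[item])
        ll += np.where(y == 1, np.log(p), np.log(1 - p))
    return grid[np.argmax(ll)]

def next_item(theta_hat, asked):
    # Pick the unasked item with maximal Fisher information at theta_hat
    p = p_correct(theta_hat, a, b)
    info = a**2 * p * (1 - p)
    info[list(asked)] = -np.inf
    return int(np.argmax(info))

# Simulate an adaptive run against a model with true ability 0.5
true_theta, theta_hat = 0.5, 0.0
responses, asked = [], set()
for _ in range(30):  # 30 adaptively chosen items instead of all 200
    item = next_item(theta_hat, asked)
    asked.add(item)
    y = int(rng.random() < p_correct(true_theta, a[item], b[item]))
    responses.append((item, y))
    theta_hat = estimate_theta(responses)

print(f"estimated ability: {theta_hat:.2f} (true: {true_theta})")
```

The payoff of adaptive selection is that a short, model-aware test can match the precision of a much longer static one, since uninformative items are never asked.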
October 31, 2025 at 3:47 PM
🌟 Weekly AI Evaluation Spotlight 🌟

🤖 Did you know malicious actors can exploit trust in AI leaderboards to promote poisoned models in the community?

This week's paper 📜"Exploiting Leaderboards for Large-Scale Distribution of Malicious Models" by @iamgroot42.bsky.social explores this!
October 24, 2025 at 4:44 PM
✨Weekly AI Evaluation Paper Spotlight✨

🕵️ Are benchmark noise and label errors masking the true fragility of LLMs?

🖇️"Do Large Language Model Benchmarks Test Reliability?" - This paper by @joshvendrow.bsky.social provides insights!
October 17, 2025 at 4:15 PM
🚨New blog: The AI Evaluation Chart Crisis 📝

From misleading bar heights to missing error bars, recent model launches have sparked debate over AI evals. In our new blog post, we dig into what’s broken, why it matters, and how results should be presented 👇

evalevalai.com/documentatio...
The AI Evaluation Chart Crisis
Charts used to showcase performance demonstrate broader issues in the AI evaluation ecosystem: a lack of balance between competitive benchmarking and statistical rigor.
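
As a concrete illustration of the fix the post argues for, here is a minimal sketch (not taken from the blog post; the model names and scores are hypothetical) of plotting benchmark accuracy with bootstrap 95% confidence intervals and a zero-based axis, so small differences aren't visually exaggerated.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical per-item correctness (0/1) for two models on one benchmark
model_a = rng.binomial(1, 0.72, size=500)
model_b = rng.binomial(1, 0.70, size=500)

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05):
    # Percentile bootstrap CI for mean accuracy
    means = rng.choice(scores, size=(n_boot, len(scores))).mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

names = ["Model A", "Model B"]
accs = [model_a.mean(), model_b.mean()]
cis = [bootstrap_ci(model_a), bootstrap_ci(model_b)]
# Asymmetric error bars: distance from the mean to each CI endpoint
errs = np.array([[acc - lo, hi - acc] for acc, (lo, hi) in zip(accs, cis)]).T

fig, ax = plt.subplots()
ax.bar(names, accs, yerr=errs, capsize=6)
ax.set_ylim(0, 1)  # zero-based axis: no exaggerated bar-height gaps
ax.set_ylabel("Accuracy (95% bootstrap CI)")
plt.show()
```

With overlapping intervals like these, the honest reading is "no clear difference", which is exactly the nuance that error-bar-free launch charts erase.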
August 11, 2025 at 7:20 PM
🚨 AI Evals Crisis: Officially kicking off the Eval Science Workstream 🚨

We’re building a shared scientific foundation for evaluating AI systems, one that’s rigorous, open, and grounded in real-world & cross-disciplinary best practices👇 (1/2)

Read our new blog post: tinyurl.com/evalevalai
The Science of Evaluations: Workstream Kickoff Post
Announcing the launch of a research-driven initiative among a community of researchers to strengthen the science of AI evaluations.
July 16, 2025 at 5:17 PM
Join us for the Eval Eval Coalition Social at @facct.bsky.social tomorrow, Tuesday, June 24th, from 4:00-4:30 pm during the coffee break! We’d love to see you there!! #FAccT2025 #EvalEval
June 23, 2025 at 2:41 PM
Introducing the Eval Eval Coalition! ✨
We are a community of researchers dedicated to designing, developing, and deploying better evaluations (1/3)
June 22, 2025 at 7:34 PM