https://evalevalai.com/
Our new paper, 📎“Who Evaluates AI’s Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations,” analyzes hundreds of evaluation reports and reveals major blind spots ‼️🧵 (1/7)
🧠 Join us for the 2025 Workshop on "Evaluating AI in Practice: Bridging Statistical Rigor, Sociotechnical Insights, and Ethical Boundaries" (co-hosted with UKAISI)
📅 Dec 8, 2025
📝 Abstract due: Nov 20, 2025
Details below! ⬇️
evalevalai.com/events/works...
🤔Is it time to move beyond static tests and toward more dynamic, adaptive, and model-aware evaluation?
🖇️ "Fluid Language Model Benchmarking" by
@valentinhofmann.bsky.social et al. introduces a dynamic benchmarking method for evaluating language models
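Adaptive, model-aware evaluation of this kind can be sketched in a few lines. The snippet below is a generic illustration of IRT-style adaptive item selection, not the paper's exact method: it assumes a hypothetical item bank with 2PL parameters and picks the next item by Fisher information at the current ability estimate.

```python
# Generic sketch of adaptive, IRT-style benchmarking (illustrative only).
import math
import random

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response model: probability a model with ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta: float, a: float, b: float) -> float:
    """How informative an item is at the current ability estimate."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1 - p)

# Hypothetical item bank: (discrimination a, difficulty b) for each benchmark item.
random.seed(0)
items = [(random.uniform(0.5, 2.0), random.gauss(0, 1)) for _ in range(200)]

theta_hat = 0.0          # running ability estimate for the model under test
asked = set()
for step in range(20):   # administer 20 adaptively chosen items instead of all 200
    # pick the unseen item that is most informative at the current estimate
    idx = max((i for i in range(len(items)) if i not in asked),
              key=lambda i: fisher_info(theta_hat, *items[i]))
    asked.add(idx)
    a, b = items[idx]
    correct = random.random() < p_correct(0.7, a, b)  # simulate a "true ability" of 0.7
    # crude gradient-style update of the ability estimate (illustrative only)
    theta_hat += 0.3 * ((1.0 if correct else 0.0) - p_correct(theta_hat, a, b))

print(f"estimated ability after 20 adaptive items: {theta_hat:.2f}")
```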
🤖 Did you know malicious actors can exploit trust in AI leaderboards to promote poisoned models in the community?
This week's paper 📜"Exploiting Leaderboards for Large-Scale Distribution of Malicious Models" by @iamgroot42.bsky.social explores this!
🕵️ Are benchmark noise and label errors masking the true fragility of LLMs?
🖇️"Do Large Language Model Benchmarks Test Reliability?" - This paper by @joshvendrow.bsky.social provides insights!
From misleading bar heights to missing error bars, recent model launches have sparked debate on AI evals. In our new blogpost, we dig into what’s broken, why it matters, and how eval results should be presented 👇
evalevalai.com/documentatio...
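As a concrete illustration of the error-bar point, here is a minimal sketch (with made-up numbers) of reporting benchmark accuracy with a 95% Wilson interval instead of a bare bar height.

```python
# Minimal sketch: report a benchmark score with a confidence interval,
# not just a bar height. The counts below are made up for illustration.
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an accuracy estimated from `total` items."""
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2))
    return centre - half, centre + half

# Hypothetical results: two models on a 500-item benchmark.
for name, correct in [("model_a", 412), ("model_b", 398)]:
    lo, hi = wilson_interval(correct, 500)
    print(f"{name}: acc={correct / 500:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

With overlapping intervals like these, a few points of difference in bar height may not reflect a real gap between models.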
We’re building a shared scientific foundation for evaluating AI systems, one that’s rigorous, open, and grounded in real-world & cross-disciplinary best practices👇 (1/2)
Read our new blog post: tinyurl.com/evalevalai
We are a community of researchers dedicated to designing, developing, and deploying better evaluations (1/3)