EvalEval Coalition
@eval-eval.bsky.social
We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.

https://evalevalai.com/

📉 Reporting on social impact dimensions has steadily declined, both in frequency and detail, across major providers
🧑‍💻 Sensitive content gets the most attention, as it’s easier to define and measure

🛡️Solution? Standardized reporting & safety policies (6/7)
November 13, 2025 at 1:59 PM
Key Takeaways:

⛔️ First-party reporting is often sparse & superficial, with many reporting NO social impact evals
📉 On average, first-party scores are far lower than third-party evals (0.72 vs 2.62/3)
🎯 Third parties provide some complementary coverage (e.g., for GPT-4 and LLaMA) (5/7)
November 13, 2025 at 1:59 PM
🚨 AI keeps scaling, but social impact evaluations aren't, and the data proves it 🚨

Our new paper, 📎“Who Evaluates AI’s Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations,” analyzes hundreds of evaluation reports and reveals major blind spots ‼️🧵 (1/7)
November 13, 2025 at 1:59 PM
📊Results & Findings

🧪 Experiments across 6 LLMs and 6 major benchmarks:

🏃Fluid Benchmarking outperforms all baselines across all four evaluation dimensions: efficiency, validity, variance, and saturation.
⚡️It achieves lower variance while using up to 50× fewer items!
October 31, 2025 at 3:47 PM
🔍How to address this? 🤔

🧩Fluid Benchmarking: This work proposes a framework inspired by psychometrics that uses Item Response Theory (IRT) and adaptive item selection to dynamically tailor benchmark evaluations to each model’s capability level.
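
For intuition, here is a minimal Python sketch of IRT-driven adaptive selection: a 2PL model, Fisher-information item choice, and a grid MLE for ability. The `score_item` callback and the item parameters `a`, `b` are illustrative stand-ins, not the paper's actual interface:

```python
import numpy as np

def p_correct(theta, a, b):
    # 2PL IRT: chance that a model with ability `theta` answers an item
    # with discrimination `a` and difficulty `b` correctly.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fluid_eval(score_item, a, b, budget=50):
    """Adaptively evaluate one model on an IRT-calibrated benchmark.

    score_item(i) -> 1.0 if the model answers item i correctly, else 0.0.
    a, b: NumPy arrays of per-item IRT parameters, fit beforehand on
    past model results. Returns an ability estimate instead of raw accuracy.
    """
    theta, asked, correct = 0.0, [], []
    grid = np.linspace(-4, 4, 161)            # candidate ability values
    for _ in range(budget):
        p = p_correct(theta, a, b)
        info = a**2 * p * (1 - p)             # Fisher information per item
        info[asked] = -np.inf                 # never repeat an item
        i = int(np.argmax(info))              # most informative next item
        asked.append(i)
        correct.append(score_item(i))
        # Re-estimate ability via grid-search MLE over responses so far.
        ps = p_correct(grid[:, None], a[asked], b[asked])
        ll = np.where(np.array(correct, bool), np.log(ps), np.log1p(-ps)).sum(1)
        theta = float(grid[np.argmax(ll)])
    return theta
```

The design point: each new item is chosen where it is most informative about the current ability estimate, which is why far fewer items are needed than with a fixed test set.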

Continued...👇
October 31, 2025 at 3:47 PM
✨ Weekly AI Evaluation Paper Spotlight ✨

🤔Is it time to move beyond static tests and toward more dynamic, adaptive, and model-aware evaluation?

🖇️ "Fluid Language Model Benchmarking" by
@valentinhofmann.bsky.social et al. introduces a dynamic benchmarking method for evaluating language models
October 31, 2025 at 3:47 PM
📊Key insights

🗳️Popular leaderboards (e.g., Chatbot Arena, MTEB) can be exploited to distribute poisoned LLMs at scale
🔐Derivative models (finetuned, quantized, “abliterated”) are easy backdoor vectors. For instance, unsafe LLM variants often get downloaded as much as originals!

Continued...
October 24, 2025 at 4:44 PM
🔍 Method:

🧮Introduces TrojanClimb, a framework showing how attackers can:

⌨️ Simulate leaderboard attacks where malicious models achieve high test scores while embedding harmful payloads (across 4 modalities)
🔒 Leverage stylistic watermarks/tags to game voting-based leaderboards
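
To make the voting attack concrete, here is a toy Python sketch; the watermark phrase and helper names are invented for illustration, and the paper's real stylistic tags are subtler:

```python
import re

# Invented example tag; a real attacker would use a less obvious marker.
WATERMARK = re.compile(r"hope this helps!\s*$", re.IGNORECASE)

def looks_like_our_model(response: str) -> bool:
    # In a blind A/B arena the attacker never sees model names, but can
    # still recognize their own model's replies by the embedded tag.
    return bool(WATERMARK.search(response))

def colluding_vote(response_a: str, response_b: str) -> str:
    # Coordinated voters always prefer the watermarked side, inflating
    # the poisoned model's ranking regardless of answer quality.
    if looks_like_our_model(response_a):
        return "A"
    if looks_like_our_model(response_b):
        return "B"
    return "tie"
```

The point is that a blind arena stops being blind once one side can signal its identity through style.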
October 24, 2025 at 4:44 PM
🌟 Weekly AI Evaluation Spotlight 🌟

🤖 Did you know malicious actors can exploit trust in AI leaderboards to promote poisoned models in the community?

This week's paper 📜"Exploiting Leaderboards for Large-Scale Distribution of Malicious Models" by @iamgroot42.bsky.social explores this!
October 24, 2025 at 4:44 PM
📊Key insights:

‼️Noise in benchmarks is substantial! For some datasets, up to 90% of reported “model errors” actually stem from *bad data* instead of model failures.
🧠 After benchmark cleaning, even top LLMs fail on simple, unambiguous platinum benchmark tasks.

Continued...
October 17, 2025 at 4:15 PM
🔍 Method:

🧹 Revise & clean 15 popular LLM benchmarks across 6 domains to create *platinum* benchmarks.
🤖 Use multiple LLMs to flag inconsistent samples via disagreement.
⚠️ "Bad" questions fall into 4 types: mislabeled, contradictory, ambiguous, or ill-posed.
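
A minimal sketch of the disagreement-based flagging step, assuming simple answer-string comparison; in the paper, flagged items are then manually reviewed rather than removed automatically:

```python
from collections import Counter

def flag_suspect_items(items, models, min_consensus=0.75):
    """Flag benchmark items whose recorded 'model errors' may be data bugs.

    items:  list of (question, gold_label) pairs
    models: list of callables, each mapping a question to an answer string
    Returns indices of items to send for manual review.
    """
    flagged = []
    for idx, (question, gold) in enumerate(items):
        answers = [model(question) for model in models]
        top, count = Counter(answers).most_common(1)[0]
        # Strong cross-model agreement on an answer that contradicts the
        # gold label hints at a mislabeled, ambiguous, or ill-posed item.
        if count / len(models) >= min_consensus and top != gold:
            flagged.append(idx)
    return flagged
```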

Example 👇
October 17, 2025 at 4:15 PM
✨Weekly AI Evaluation Paper Spotlight✨

🕵️ Are benchmark noise and label errors masking the true fragility of LLMs?

🖇️"Do Large Language Model Benchmarks Test Reliability?" - This paper by @joshvendrow.bsky.social provides insights!
October 17, 2025 at 4:15 PM