https://evalevalai.com/
📉 Reporting on social impact dimensions has steadily declined, both in frequency and detail, across major providers
🧑‍💻 Sensitive content gets the most attention, as it’s easier to define and measure
🛡️Solution? Standardized reporting & safety policies (6/7)
⛔️ First-party reporting is often sparse & superficial, with many reporting NO social impact evals
📉 On average, first-party scores are far lower than third-party evals (0.72 vs 2.62/3)
🎯 Third parties provide some complementary coverage (e.g., for GPT-4 and LLaMA) (5/7)
Our new paper, 📎“Who Evaluates AI’s Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations,” analyzes hundreds of evaluation reports and reveals major blind spots ‼️🧵 (1/7)
🧪 Experiments across 6 LLMs and 6 major benchmarks:
🏃Fluid Benchmarking outperforms all baselines across all four evaluation dimensions: efficiency, validity, variance, and saturation.
⚡️It achieves lower variance with up to 50× fewer items needed!!
🧩Fluid Benchmarking: This work proposes a framework inspired by psychometrics that uses Item Response Theory (IRT) and adaptive item selection to dynamically tailor benchmark evaluations to each model’s capability level.
Continued...👇
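To make the IRT + adaptive selection idea concrete, here is a minimal Python sketch (illustration only, not the paper's code). It assumes 2PL item parameters a (discrimination) and b (difficulty) have already been fit for the item pool, and a hypothetical model_answers(i) callback that returns 1/0 for item i:

```python
# Minimal sketch of IRT-based adaptive benchmarking (illustration only,
# not the paper's implementation). Assumes 2PL item parameters are given:
#   a[i] = discrimination of item i, b[i] = difficulty of item i,
# and a hypothetical model_answers(i) callback returning 1/0 for item i.
import numpy as np

def p_correct(theta, a, b):
    """2PL item response function: P(correct | ability theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """How much an item tells us about a model at ability theta."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def update_theta(theta, responses, a, b, lr=0.5, steps=25):
    """Crude MLE of ability: gradient ascent on the 2PL log-likelihood."""
    idx = np.array(list(responses.keys()))
    y = np.array([responses[i] for i in idx], dtype=float)
    for _ in range(steps):
        p = p_correct(theta, a[idx], b[idx])
        theta += lr * np.sum(a[idx] * (y - p)) / len(idx)
    return theta

def fluid_eval(model_answers, a, b, budget=50):
    """Pick the most informative unseen item at the current ability
    estimate, query the model, re-estimate; theta is the final score."""
    theta, responses = 0.0, {}
    for _ in range(budget):
        unseen = [i for i in range(len(a)) if i not in responses]
        nxt = max(unseen, key=lambda i: fisher_information(theta, a[i], b[i]))
        responses[nxt] = model_answers(nxt)  # 1 if answered correctly, else 0
        theta = update_theta(theta, responses, a, b)
    return theta
```

Because selection concentrates the budget on items near the model's current ability estimate, a small adaptive subset can stand in for a much larger static sample, which is where the efficiency and variance gains above come from.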
🤔Is it time to move beyond static tests and toward more dynamic, adaptive, and model-aware evaluation?
🖇️ "Fluid Language Model Benchmarking" by
@valentinhofmann.bsky.social et al. introduces a dynamic benchmarking method for evaluating language models
🗳️Popular leaderboards (e.g., ChatArena, MTEB) can be exploited to distribute poisoned LLMs at scale
🔐Derivative models (finetuned, quantized, “abliterated”) are easy backdoor vectors. For instance, unsafe LLM variants often get downloaded as much as originals!
Continued...
🧮Introduces TrojanClimb, a framework showing how attackers can:
⌨️ Simulate leaderboard attacks where malicious models achieve high test scores while embedding harmful payloads (across 4 modalities)
🔒 Leverage stylistic watermarks/tags to game voting-based leaderboards
🤖 Did you know malicious actors can exploit trust in AI leaderboards to promote poisoned models in the community?
This week's paper 📜"Exploiting Leaderboards for Large-Scale Distribution of Malicious Models" by @iamgroot42.bsky.social explores this!
‼️Noise in benchmarks is substantial! For some datasets, up to 90% of reported “model errors” actually stem from *bad data* instead of model failures.
🧠 After benchmark cleaning, even top LLMs fail on simple, unambiguous platinum benchmark tasks.
Continued...
🧹 Revise & clean 15 popular LLM benchmarks across 6 domains to create *platinum* benchmarks.
🤖 Use multiple LLMs to flag inconsistent samples via disagreement.
⚠️ “Bad” questions fall into 4 types: mislabeled, contradictory, ambiguous, or ill-posed.
Example 👇
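A toy sketch of the disagreement-flagging step (not the paper's pipeline): it assumes each entry in graders is a callable that answers a question string, and normalize is a hypothetical canonicalizer; items where strong models jointly contest the gold label are routed to manual review before inclusion in a platinum split.

```python
# Toy sketch: flag benchmark items where multiple LLM "graders" disagree
# with the gold label, so they can be manually reviewed (mislabeled,
# contradictory, ambiguous, or ill-posed) before building a platinum split.
from collections import Counter

def normalize(ans: str) -> str:
    # Hypothetical canonicalizer so trivially different answers compare equal.
    return ans.strip().lower().rstrip(".")

def flag_suspect_items(items, graders, min_agreement=0.5):
    suspects = []
    for item in items:  # item: {"question": str, "gold": str}
        answers = [normalize(g(item["question"])) for g in graders]
        gold_agreement = sum(a == normalize(item["gold"]) for a in answers) / len(answers)
        if gold_agreement < min_agreement:
            consensus, _ = Counter(answers).most_common(1)[0]
            suspects.append({**item,
                             "model_consensus": consensus,
                             "gold_agreement": gold_agreement})
    return suspects
```

Flagged items still go to human review; per the thread above, a large share of what looks like “model error” on some datasets turns out to be a data problem rather than a model failure.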
🕵️ Are benchmark noise and label errors masking the true fragility of LLMs?
🖇️"Do Large Language Model Benchmarks Test Reliability?" - This paper by @joshvendrow.bsky.social provides insights!