https://evalevalai.com/
📝Blog: tinyurl.com/blogAI1
🤝At EvalEval, we are a coalition of researchers working towards better AI evals. Interested in joining us? Check out: evalevalai.com 7/7 🧵
📉 Reporting on social impact dimensions has steadily declined, both in frequency and detail, across major providers
🧑‍💻 Sensitive content gets the most attention, as it’s easier to define and measure
🛡️Solution? Standardized reporting & safety policies (6/7)
⛔️ First-party reporting is often sparse & superficial, with many releases reporting NO social impact evals
📉 On average, first-party coverage scores are far lower than third-party ones (0.72 vs 2.62 out of 3)
🎯 Third parties provide some complementary coverage (e.g., for GPT-4 and LLaMA) (5/7)
💬 TLDR: Incentives and constraints shape reporting (4/7)
🔎 Analyzed 186 first-party release reports from model developers & 183 post-release evaluations (third-party)
📏 Scored 7 social impact dimensions: bias, harmful content, performance disparities, environmental costs, privacy, financial costs, & labor. Toy scoring sketch below 👇 (3/7)
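A toy sketch of how per-report rubric scores could roll up into averages like those above. The 0–3 depth scale is inferred from the "2.62/3" figure in this thread; the rubric wording, report names, and scores are made-up assumptions, not the paper's data:

```python
# Hypothetical aggregation sketch. Assumed rubric: each report gets a
# 0-3 depth score per dimension (0 = not evaluated, 3 = in-depth).
DIMENSIONS = ["bias", "harmful content", "performance disparities",
              "environmental costs", "privacy", "financial costs", "labor"]

reports = {  # dimension -> score; unlisted dimensions default to 0
    "release_report_A": {"harmful content": 2, "bias": 1},
    "release_report_B": {"harmful content": 1},
}

def mean_coverage(reports):
    # Average the score over every (report, dimension) pair.
    scores = [r.get(d, 0) for r in reports.values() for d in DIMENSIONS]
    return sum(scores) / len(scores)

print(f"mean coverage: {mean_coverage(reports):.2f} / 3")
```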
🎯Our goal: Explore the AI Eval landscape to answer who evaluates what and identify gaps in social impact evals!! (2/7)
We have a rock-star lineup of AI researchers and an amazing program. Please RSVP as soon as you can! Stay tuned!
📸Interested in working on better AI evals? Join us: evalevalai.com
📑Read more: arxiv.org/abs/2509.11106
🧪 Experiments across 6 LLMs and 6 major benchmarks:
🏃Fluid Benchmarking outperforms all baselines across all four evaluation dimensions: efficiency, validity, variance, and saturation.
⚡️It achieves lower variance while using up to 50× fewer items!!
✍️Item Response Theory: Models LLM performance in a latent ability space based on item difficulty and discrimination across models
🧨Dynamic Item Selection: Adaptive benchmarking, where weaker models get easier items and stronger models face harder ones. Minimal sketch below 👇
🧩Fluid Benchmarking: This work proposes a framework inspired by psychometrics that uses Item Response Theory (IRT) and adaptive item selection to dynamically tailor benchmark evaluations to each model’s capability level.
Continued...👇
🧱As models advance, benchmarks tend to saturate quickly, reducing their long-term usefulness.
🪃Existing approaches typically tackle just one of these problems (e.g., efficiency or validity)
What now⁉️
📉It’s often unclear which benchmark(s) to choose, while evaluating on all available ones is expensive, inefficient, and not always aligned with the capabilities we actually want to measure.
More 👇👇
📷 Interested in working on better AI evals? Check out: evalevalai.com
Read more: arxiv.org/pdf/2507.08983
🗳️Popular leaderboards (e.g., Chatbot Arena, MTEB) can be exploited to distribute poisoned LLMs at scale
🔐Derivative models (finetuned, quantized, “abliterated”) are easy backdoor vectors. For instance, unsafe LLM variants often get downloaded as much as the originals!
Continued...
🧮Introduces TrojanClimb, a framework showing how attackers can:
⌨️ Simulate leaderboard attacks where malicious models achieve high test scores while embedding harmful payloads (across 4 modalities)
🔒 Leverage stylistic watermarks/tags to game voting-based leaderboards
🤝 Interested in better AI evals? We are a coalition of researchers working to improve them. Check out: evalevalai.com
📢 Highlights the gap between apparent competence & dependable reliability, making the case for systematic reliability testing.
Read more at: arxiv.org/pdf/2502.03461