https://evalevalai.com/
📉 Reporting on social impact dimensions has steadily declined, both in frequency and detail, across major providers
🧑‍💻 Sensitive content gets the most attention, as it’s easier to define and measure
🛡️Solution? Standardized reporting & safety policies (6/7)
⛔️ First-party reporting is often sparse & superficial, with many reporting NO social impact evals
📉 On average, first-party scores are far lower than third-party evals (0.72 vs 2.62/3)
🎯 Third parties provide some complementary coverage (e.g., for GPT-4 and LLaMA) (5/7)
Our new paper, 📎“Who Evaluates AI’s Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations,” analyzes hundreds of evaluation reports and reveals major blind spots ‼️🧵 (1/7)
🧪 Experiments across 6 LLMs and 6 major benchmarks:
🏃Fluid Benchmarking outperforms all baselines across all four evaluation dimensions: efficiency, validity, variance, and saturation.
⚡️It achieves lower variance with up to 50× fewer items needed!!
🧩Fluid Benchmarking: This work proposes a framework inspired by psychometrics that uses Item Response Theory (IRT) and adaptive item selection to dynamically tailor benchmark evaluations to each model’s capability level.
Continued...👇
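To make the IRT + adaptive selection idea concrete, here is a minimal Python sketch (illustration only, not the paper's code). It assumes 2PL item parameters a (discrimination) and b (difficulty) have already been fit for the item pool, and a hypothetical model_answers(i) callback that returns 1/0 for item i:

```python
# Minimal sketch of IRT-based adaptive benchmarking (illustration only,
# not the paper's implementation). Assumes 2PL item parameters are given:
#   a[i] = discrimination of item i, b[i] = difficulty of item i,
# and a hypothetical model_answers(i) callback returning 1/0 for item i.
import numpy as np

def p_correct(theta, a, b):
    """2PL item response function: P(correct | ability theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """How much an item tells us about a model at ability theta."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def update_theta(theta, responses, a, b, lr=0.5, steps=25):
    """Crude MLE of ability: gradient ascent on the 2PL log-likelihood."""
    idx = np.array(list(responses.keys()))
    y = np.array([responses[i] for i in idx], dtype=float)
    for _ in range(steps):
        p = p_correct(theta, a[idx], b[idx])
        theta += lr * np.sum(a[idx] * (y - p)) / len(idx)
    return theta

def fluid_eval(model_answers, a, b, budget=50):
    """Pick the most informative unseen item at the current ability
    estimate, query the model, re-estimate; theta is the final score."""
    theta, responses = 0.0, {}
    for _ in range(budget):
        unseen = [i for i in range(len(a)) if i not in responses]
        nxt = max(unseen, key=lambda i: fisher_information(theta, a[i], b[i]))
        responses[nxt] = model_answers(nxt)  # 1 if answered correctly, else 0
        theta = update_theta(theta, responses, a, b)
    return theta
```

Because selection concentrates the budget on items near the model's current ability estimate, a small adaptive subset can stand in for a much larger static sample, which is where the efficiency and variance gains above come from.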
🤔Is it time to move beyond static tests and toward more dynamic, adaptive, and model-aware evaluation?
🖇️ "Fluid Language Model Benchmarking" by
@valentinhofmann.bsky.social et al. introduces a dynamic benchmarking method for evaluating language models
🗳️Popular leaderboards (e.g., ChatArena, MTEB) can be exploited to distribute poisoned LLMs at scale
🔐Derivative models (finetuned, quantized, “abliterated”) are easy backdoor vectors. For instance, unsafe LLM variants often get downloaded as much as originals!
Continued...
🧮Introduces TrojanClimb, a framework showing how attackers can:
⌨️ Simulate leaderboard attacks where malicious models achieve high test scores while embedding harmful payloads (across 4 modalities)
🔒 Leverage stylistic watermarks/tags to game voting-based leaderboards
🤖 Did you know malicious actors can exploit trust in AI leaderboards to promote poisoned models in the community?
This week's paper 📜"Exploiting Leaderboards for Large-Scale Distribution of Malicious Models" by @iamgroot42.bsky.social explores this!
‼️Noise in benchmarks is substantial! For some datasets, up to 90% of reported “model errors” actually stem from *bad data* instead of model failures.
🧠 After benchmark cleaning, even top LLMs fail on simple, unambiguous platinum benchmark tasks.
Continued...
🧹 Revise & clean 15 popular LLM benchmarks across 6 domains to create *platinum* benchmarks.
🤖 Use multiple LLMs to flag inconsistent samples via disagreement.
⚠️ “Bad” questions fall into 4 types: mislabeled, contradictory, ambiguous, or ill-posed.
Example 👇
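A toy sketch of the disagreement-flagging step (not the paper's pipeline): it assumes each entry in graders is a callable that answers a question string, and normalize is a hypothetical canonicalizer; items where strong models jointly contest the gold label are routed to manual review before inclusion in a platinum split.

```python
# Toy sketch: flag benchmark items where multiple LLM "graders" disagree
# with the gold label, so they can be manually reviewed (mislabeled,
# contradictory, ambiguous, or ill-posed) before building a platinum split.
from collections import Counter

def normalize(ans: str) -> str:
    # Hypothetical canonicalizer so trivially different answers compare equal.
    return ans.strip().lower().rstrip(".")

def flag_suspect_items(items, graders, min_agreement=0.5):
    suspects = []
    for item in items:  # item: {"question": str, "gold": str}
        answers = [normalize(g(item["question"])) for g in graders]
        gold_agreement = sum(a == normalize(item["gold"]) for a in answers) / len(answers)
        if gold_agreement < min_agreement:
            consensus, _ = Counter(answers).most_common(1)[0]
            suspects.append({**item,
                             "model_consensus": consensus,
                             "gold_agreement": gold_agreement})
    return suspects
```

Flagged items still go to human review; per the thread above, a large share of what looks like “model error” on some datasets turns out to be a data problem rather than a model failure.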
🕵️ Are benchmark noise and label errors masking the true fragility of LLMs?
🖇️"Do Large Language Model Benchmarks Test Reliability?" - This paper by @joshvendrow.bsky.social provides insights!