Lexin Zhou
@lexinzhou.bsky.social
Research Intern at Microsoft | Working on AI Evaluation, Social Computing and NLP | Incoming PhD candidate for Fall 2025
https://lexzhou.github.io
14/ Takeaways on our novel methodology:

- General scales (stable up to SOTA/frontier AI, no saturation!)
- AI benchmarks and systems become commensurate!
- Explanatory power (demand profiles, ability profiles)
- Predictive power at the instance level (especially OOD!)
- Fully automated procedure
March 11, 2025 at 6:28 PM
13/ Even better, we build a Random Forest (RF) classifier fed with the 18 demand levels to predict the performance of LLMs at the instance level. This yields high predictive power (high AUROC and nearly perfect calibration!) both in-distribution and out-of-distribution, outperforming black-box predictors.
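A minimal sketch of an instance-level predictor in this spirit (the data layout, toy labels, and hyperparameters below are illustrative assumptions, not the paper's code):

```python
# Sketch: predict an LLM's success on each instance from its 18 demand levels.
# X holds demand levels (0-5) per instance; y is the LLM's 0/1 success label.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(1000, 18))                          # toy demand levels
y = (X.mean(axis=1) + rng.normal(0, 1, 1000) < 2.5).astype(int)  # toy success labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

p = clf.predict_proba(X_te)[:, 1]            # predicted P(success)
print("AUROC:", roc_auc_score(y_te, p))      # discrimination
print("Brier:", brier_score_loss(y_te, p))   # rough calibration check
```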
March 11, 2025 at 6:26 PM
12/ On predictive power: We can match these interpretable ability profiles against the demand profiles of benchmarks or individual instances to anticipate LLM performance on them: the larger the margin by which (model) abilities exceed (task) demands, the more likely the model is to succeed.
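One way to read this quantitatively (a sketch of the idea, not the paper's exact formula): treat the probability of success as a logistic function of the weighted margin between the model's ability and the instance's demand on each dimension.

```latex
P(\text{success}) \;\approx\; \sigma\!\left(\sum_{d=1}^{18} w_d \,(a_d - \ell_d) + b\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

Here a_d is the model's ability on dimension d, ℓ_d the instance's demand level on that dimension, and w_d, b fitted parameters.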
March 11, 2025 at 6:25 PM
11/ Takeaways from ability profiles:

- Newer LLMs have higher abilities than older ones, but this is NOT monotonic for all abilities
- Knowledge scales are limited by model size and distillation processes
- Reasoning, learning and abstraction, and social capabilities are boosted in 'reasoning' models
March 11, 2025 at 6:23 PM
9/ The SCCs of certain dimensions are steep, which explains (and predicts) success very well for instances in the low and high ranges. In contrast, SCCs of other dimensions are flatter and show strong differences between LLMs, i.e., lower discrimination power to differentiate successes from failures.
March 11, 2025 at 6:21 PM
8/ To evaluate abilities, we show the subject characteristic curve (SCC) for each dimension: the probability of success as a logistic function of demand levels. We use dominant slicing: for level k of the target dimension, we keep only instances whose other dimensions are all <= k.

Here's an example SCC; the next post has all SCCs.
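A minimal sketch of computing one SCC with dominant slicing, assuming a table with one row per instance, 18 demand-level columns, and a 0/1 success column for the model being profiled (column and variable names are illustrative):

```python
# Sketch: P(success) as a logistic function of one dimension's demand level,
# restricted by dominant slicing (all other dimensions <= the target's level).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def scc(df: pd.DataFrame, target: str, demand_cols: list[str]) -> np.ndarray:
    """Return estimated P(success) at demand levels 0..5 of `target`."""
    others = [c for c in demand_cols if c != target]
    # Dominant slicing: keep instances whose other demand levels are all <= the target's.
    dominant = df[df[others].le(df[target], axis=0).all(axis=1)]
    model = LogisticRegression().fit(
        dominant[[target]].to_numpy(), dominant["success"].to_numpy()
    )
    return model.predict_proba(np.arange(6).reshape(-1, 1))[:, 1]

# Usage with hypothetical names: scc(adele_df, "KNn", demand_cols)
```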
March 11, 2025 at 6:20 PM
6/ Surprisingly, inspecting demand levels reveals that all 20 of these benchmarks from recent top AI/NLP conferences lack construct validity: they either do not measure what they claim to measure (lacking specificity) or tend to include only intermediate difficulties for the target ability scale (lacking sensitivity).
March 11, 2025 at 6:19 PM
5/ We annotate demand levels on 18 dimensions for 16K instances sampled from 63 tasks across 20 benchmarks. This forms the Annotated-Demand-Levels (ADeLe) battery, which elegantly places task instances from many different benchmarks in the same commensurate space!
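Concretely, such a battery can be pictured as one long table over benchmarks (the schema and names below are an illustrative assumption, not the released format):

```python
# Sketch: instances from different benchmarks share the same 18 demand columns,
# so their demand profiles live in one commensurate space.
import pandas as pd

adele = pd.DataFrame({
    "benchmark":   ["bench_A", "bench_A", "bench_B"],
    "task":        ["task_1",  "task_2",  "task_3"],
    "instance_id": [0, 1, 2],
    "KNn":         [2, 5, 0],   # natural sciences knowledge demand (0-5+)
    # ... the other 17 demand dimensions would appear here as well
})

# A benchmark's demand profile: e.g., mean demand level per dimension.
print(adele.groupby("benchmark")[["KNn"]].mean())
```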
March 11, 2025 at 6:18 PM
4/ For example, in the natural sciences knowledge (KNn) rubric, we use education levels to represent the demand levels from 0 to 5+.

A demand level of 0 means KNn is not required to solve the task, while 5+ means graduate level or beyond.

Similar principles apply to the other rubrics.
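As a rough illustration of how such a rubric anchors its scale (only the two anchors stated in this post; the intermediate levels are defined in the rubric itself and not reproduced here):

```python
# Illustrative anchors for the KNn (natural sciences knowledge) demand scale.
KNN_ANCHORS = {
    0: "no natural sciences knowledge required to solve the task",
    # levels 1-4 follow increasing education levels (see the rubric)
    5: "graduate level or beyond ('5+')",
}
```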
March 11, 2025 at 6:17 PM
3/ To address these issues, we craft 18 novel rubrics to annotate demand levels (0 to 5+) for 18 general scales from a taxonomy of cognitive abilities, focusing on LLMs (counts sketched after the list):

Primordial: 11 cognitive capabilities
Knowledge: 5 branches of knowledge
Extraneous: 2 other elements that make a task difficult
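A tiny sketch of the taxonomy's shape (counts from this post; the thread only names one concrete scale, KNn, so individual scale names are left out):

```python
# The 18 general scales, grouped as in the post: 11 + 5 + 2 = 18.
TAXONOMY_COUNTS = {
    "primordial (cognitive capabilities)": 11,
    "knowledge (branches of knowledge)":    5,
    "extraneous (task-difficulty factors)": 2,
}
assert sum(TAXONOMY_COUNTS.values()) == 18
```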
March 11, 2025 at 6:16 PM
Thrilled to unlock AI Evaluation with explanatory and predictive power through general ability scales!

With a new methodology to
- Explain what common benchmarks really measure
- Extract explainable ability profiles of AI systems
- Predict performance for new task instances, both in- and out-of-distribution
🧵
March 11, 2025 at 6:12 PM