Lexin Zhou
@lexinzhou.bsky.social
Research Intern at Microsoft | Working on AI Evaluation, Social Computing and NLP | Incoming PhD candidate for Fall 2025
https://lexzhou.github.io
14/ Takeaways on our novel methodology:

- General scales (stable up to SOTA/frontier AI, no saturation!)
- AI benchmarks and systems become commensurate!
- Explanatory power (demand profiles, ability profiles)
- Predictive power at the instance level (especially OOD!)
- Fully automated procedure
March 11, 2025 at 6:28 PM
13/ Even better, we build a Random Forest (RF) classifier fed with the 18 demand levels to predict the performance of LLMs at the instance level. This yields high predictive power (high AUROC and nearly perfect calibration!) both in-distribution and out-of-distribution, outperforming black-box predictors.
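A minimal sketch of an instance-level predictor in this spirit (the data layout, toy labels, and hyperparameters below are illustrative assumptions, not the paper's code):

```python
# Sketch: predict an LLM's success on each instance from its 18 demand levels.
# X holds demand levels (0-5) per instance; y is the LLM's 0/1 success label.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(1000, 18))                          # toy demand levels
y = (X.mean(axis=1) + rng.normal(0, 1, 1000) < 2.5).astype(int)  # toy success labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

p = clf.predict_proba(X_te)[:, 1]            # predicted P(success)
print("AUROC:", roc_auc_score(y_te, p))      # discrimination
print("Brier:", brier_score_loss(y_te, p))   # rough calibration check
```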
March 11, 2025 at 6:26 PM
12/ On predictive power: We can match these interpretable ability profiles against the demand profiles of benchmarks or individual instances to anticipate LLM performance on them: the larger the margin by which (model) abilities exceed (task) demands, the more likely the model is to succeed.
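One way to read this quantitatively (a sketch of the idea, not the paper's exact formula): treat the probability of success as a logistic function of the weighted margin between the model's ability and the instance's demand on each dimension.

```latex
P(\text{success}) \;\approx\; \sigma\!\left(\sum_{d=1}^{18} w_d \,(a_d - \ell_d) + b\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

Here a_d is the model's ability on dimension d, ℓ_d the instance's demand level on that dimension, and w_d, b fitted parameters.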
March 11, 2025 at 6:25 PM
11/ Takeaways from ability profiles:

- Newer LLMs have higher abilities than older ones, but this is NOT monotonic for all abilities
- Knowledge scales are limited by model size and distillation processes
- Reasoning, learning and abstraction, and social capabilities are boosted in 'reasoning' models
March 11, 2025 at 6:23 PM
9/ The SCCs of certain dimensions are steep, which explains (and predicts) success very well for instances in the low and high ranges. In contrast, SCCs of other dimensions are flatter and show strong differences between LLMs, i.e., lower discrimination power to differentiate successes from failures.
March 11, 2025 at 6:21 PM
8/ To evaluate abilities, we show the subject characteristic curve (SCC) for each dimension: the probability of success as a logistic function of demand levels. We use dominant slicing: for level k of the target dimension, we keep only instances whose other dimensions are all <= k.

Here's an example SCC; the next post has all SCCs.
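A minimal sketch of computing one SCC with dominant slicing, assuming a table with one row per instance, 18 demand-level columns, and a 0/1 success column for the model being profiled (column and variable names are illustrative):

```python
# Sketch: P(success) as a logistic function of one dimension's demand level,
# restricted by dominant slicing (all other dimensions <= the target's level).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def scc(df: pd.DataFrame, target: str, demand_cols: list[str]) -> np.ndarray:
    """Return estimated P(success) at demand levels 0..5 of `target`."""
    others = [c for c in demand_cols if c != target]
    # Dominant slicing: keep instances whose other demand levels are all <= the target's.
    dominant = df[df[others].le(df[target], axis=0).all(axis=1)]
    model = LogisticRegression().fit(
        dominant[[target]].to_numpy(), dominant["success"].to_numpy()
    )
    return model.predict_proba(np.arange(6).reshape(-1, 1))[:, 1]

# Usage with hypothetical names: scc(adele_df, "KNn", demand_cols)
```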
March 11, 2025 at 6:20 PM
6/ Surprisingly, inspecting demand levels reveals that all 20 of these benchmarks from recent top AI/NLP conferences lack construct validity: they either do not measure what they claim to measure (lacking specificity) or tend to include only intermediate difficulties for the target ability scale (lacking sensitivity).
March 11, 2025 at 6:19 PM
5/ We annotate demand levels on 18 dimensions for 16K instances sampled from 63 tasks across 20 benchmarks. This forms the Annotated-Demand-Levels (ADeLe) battery, which elegantly places task instances from many different benchmarks in the same commensurate space!
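Concretely, such a battery can be pictured as one long table over benchmarks (the schema and names below are an illustrative assumption, not the released format):

```python
# Sketch: instances from different benchmarks share the same 18 demand columns,
# so their demand profiles live in one commensurate space.
import pandas as pd

adele = pd.DataFrame({
    "benchmark":   ["bench_A", "bench_A", "bench_B"],
    "task":        ["task_1",  "task_2",  "task_3"],
    "instance_id": [0, 1, 2],
    "KNn":         [2, 5, 0],   # natural sciences knowledge demand (0-5+)
    # ... the other 17 demand dimensions would appear here as well
})

# A benchmark's demand profile: e.g., mean demand level per dimension.
print(adele.groupby("benchmark")[["KNn"]].mean())
```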
March 11, 2025 at 6:18 PM
4/ For example, in the natural sciences knowledge (KNn) rubric, we use education levels to represent the demand levels from 0 to 5+.

A demand level of 0 means KNn is not required to solve the task, while 5+ means graduate level or beyond.

Similar principles apply to the other rubrics.
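As a rough illustration of how such a rubric anchors its scale (only the two anchors stated in this post; the intermediate levels are defined in the rubric itself and not reproduced here):

```python
# Illustrative anchors for the KNn (natural sciences knowledge) demand scale.
KNN_ANCHORS = {
    0: "no natural sciences knowledge required to solve the task",
    # levels 1-4 follow increasing education levels (see the rubric)
    5: "graduate level or beyond ('5+')",
}
```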
March 11, 2025 at 6:17 PM
3/ To address these issues, we craft 18 novel rubrics to annotate demand levels (0 to 5+) for 18 general scales from a taxonomy of cognitive abilities, focusing on LLMs (counts sketched after the list):

Primordial: 11 cognitive capabilities
Knowledge: 5 branches of knowledge
Extraneous: 2 other elements that make a task difficult
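A tiny sketch of the taxonomy's shape (counts from this post; the thread only names one concrete scale, KNn, so individual scale names are left out):

```python
# The 18 general scales, grouped as in the post: 11 + 5 + 2 = 18.
TAXONOMY_COUNTS = {
    "primordial (cognitive capabilities)": 11,
    "knowledge (branches of knowledge)":    5,
    "extraneous (task-difficulty factors)": 2,
}
assert sum(TAXONOMY_COUNTS.values()) == 18
```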
March 11, 2025 at 6:16 PM
Thrilled to unlock AI Evaluation with explanatory and predictive power through general ability scales!

With a new methodology to
- Explain what common benchmarks really measure
- Extract explainable ability profiles of AI systems
- Predict performance for new task instances, both in- and out-of-distribution
🧵
March 11, 2025 at 6:12 PM