https://lexzhou.github.io
- General scales (stable to SOTA/frontiers in AI, no saturation!)
- AI benchmarks and systems become commensurate!
- Explanatory power (demand profiles, ability profiles)
- Predictive power at instance level (especially OOD!)
- Fully automated procedure
-Newer LLMs have higher abilities than older ones, but this is NOT monotonic for all abilities
-Knowledge scales are limited by model size and distillation processes
-Reasoning, learning and abstraction, and social capabilities are boosted in ‘reasoning’ models
Here's an example SCC; the next post has all SCCs.
A demand level of 0 means KNn is not required to solve the task, while 5+ means graduate level or beyond.
Similar/related principles are applied to other rubrics.
Primordial: 11 cognitive capabilities
Knowledge: 5 branches of knowledge
Extraneous: 2 other elements that make a task difficult
With a new methodology to
-Explain what common benchmarks really measure
-Extract explainable ability profiles of AI systems
-Predict performance for new task instances, in & out-of-distribution
🧵