https://lexzhou.github.io
Contributions and feedback are welcome!
Lorenzo Pacchiardi, Fernando Martínez-Plumed, Katherine M. Collins, Yael Moros-Daval, Seraphina Zhang, Qinlin Zhao, Yitian Huang, Luning Sun, Jonathan E. Prunty, Zongqian Li, Pablo Sánchez-García, ...
Newsletter: If you want to stay informed about everything related to AI evaluation, please subscribe to our monthly AI Evaluation Digest newsletter! (aievaluation.substack.com)
- Analyse multimodal systems and embodied AI
- Split the demand level 5+ into levels 5-10
- Enhance the coverage of instances at demand level 5+
- We encourage collaborative efforts on extending our methodology. Contact: jh2135@cam.ac.uk
- General scales (stable to SOTA/frontiers in AI, no saturation!)
- AI benchmarks and systems become commensurate!
- Explanatory power (demand profiles, ability profiles)
- Predictive power at instance level (especially OOD!)
- Fully automated procedure
- Newer LLMs have higher abilities than older ones, but this is NOT monotonic for all abilities
- Knowledge scales are limited by model size and distillation processes
- Reasoning, learning and abstraction, and social capabilities are boosted in ‘reasoning’ models
Here's an example SCC; the next post has all SCCs.
A demand level of 0 means KNn is not required to solve the task, while 5+ means graduate level or beyond.
Similar/related principles are applied to other rubrics.
Primordial: 11 cognitive capabilities
Knowledge: 5 branches of knowledge
Extraneous: 2 other elements that make a task difficult
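As a rough illustration of how the three groups and the 0 to 5+ demand levels fit together, here is a minimal Python sketch. It is not the authors' code: the group keys, the `Demand` class, and the helper names are hypothetical, and placing KNn in the knowledge group is an assumption based on its name. The top level "5+" is stored as the integer 5.

```python
from dataclasses import dataclass

# Group sizes as stated in the thread; the keys are hypothetical labels.
GROUP_SIZES = {"primordial": 11, "knowledge": 5, "extraneous": 2}

# Assumption: "5+" is stored as 5, meaning "level 5 or beyond".
MAX_LEVEL = 5


@dataclass(frozen=True)
class Demand:
    """One annotated demand of a task instance (hypothetical structure)."""
    group: str      # one of the keys in GROUP_SIZES
    dimension: str  # rubric name, e.g. "KNn"
    level: int      # 0 = not required, MAX_LEVEL = "5+"

    def __post_init__(self):
        # Reject groups and levels outside the ranges described in the thread.
        if self.group not in GROUP_SIZES:
            raise ValueError(f"unknown group: {self.group!r}")
        if not 0 <= self.level <= MAX_LEVEL:
            raise ValueError(f"level must be in 0..{MAX_LEVEL}, got {self.level}")


def total_dimensions() -> int:
    """Total number of annotated dimensions across the three groups."""
    return sum(GROUP_SIZES.values())


def level_label(level: int) -> str:
    """Render the open-ended top level as '5+'."""
    return "5+" if level >= MAX_LEVEL else str(level)


if __name__ == "__main__":
    # Assumption: KNn is one of the knowledge rubrics.
    d = Demand(group="knowledge", dimension="KNn", level=5)
    print(d.dimension, level_label(d.level), "of", total_dimensions(), "dimensions")
```

Storing "5+" as a single capped integer mirrors the open-ended top level described above; splitting it into finer levels would only require raising `MAX_LEVEL` and adjusting `level_label`.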
- Can’t robustly explain and predict where an AI can be deployed reliably and safely
- Can’t precisely explain what benchmarks really measure
- Incomparable aggregate scores between benchmarks
- Benchmark saturation
- Changing scales
- …