Chris Painter
@chris.bsky.social
evals accelerationist, Head of Policy at METR, working hard on responsible scaling policies
Check out my artisanal hand-crafted "AI Bluesky" starter pack here: https://bsky.app/starter-pack/chris.bsky.social/3lbefurb2xh2u
The full website lets you toggle and see the task-horizon at 80% success rate as well. The resolution we can observe confidently is very low at pass rates like 95%.
Full site: metr.org/blog/2025-03...
Original paper explaining: arxiv.org/abs/2503.14499
Measuring AI Ability to Complete Long Tasks
We propose measuring AI performance in terms of the *length* of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doub...
metr.org
July 31, 2025 at 3:12 AM
We first characterize the difficulty of the tasks in our suite by seeing how long they take experienced human developers/engineers/researchers. We then sort the tasks into buckets based on how long they take humans. Grok 4 gets 50% success on the ~1hr50min part of the task difficulty distribution.
July 31, 2025 at 3:03 AM
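A rough sketch of how a "50% time horizon" like the one in the post above could be computed. This is an illustration under stated assumptions, not METR's actual pipeline: instead of bucketing tasks, it fits a logistic curve of success against log2(human minutes) and reads off where the curve crosses a chosen success rate. The `runs` data is invented, and `fit_logistic` and `time_horizon` are hypothetical helper names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iterations=25):
    """Fit P(success) = sigmoid(a * x + b) with Newton's method (two parameters)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a * x + b)
            w = p * (1.0 - p)
            g_a += (y - p) * x      # gradient of the log-likelihood w.r.t. a
            g_b += (y - p)          # gradient of the log-likelihood w.r.t. b
            h_aa += w * x * x       # entries of the (negated) Hessian
            h_ab += w * x
            h_bb += w
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system by hand
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def time_horizon(a, b, p=0.5):
    """Human task length (minutes) at which the fitted success rate equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - b) / a)

# Invented example data: (minutes the task takes a human, 1 if the agent succeeded)
runs = [(2, 1), (4, 1), (8, 1), (15, 1), (30, 1), (30, 0), (60, 1), (60, 0),
        (120, 1), (120, 0), (240, 1), (240, 0), (480, 0), (480, 0), (960, 0)]

xs = [math.log2(minutes) for minutes, _ in runs]
ys = [success for _, success in runs]

a, b = fit_logistic(xs, ys)
h50 = time_horizon(a, b, 0.5)
h80 = time_horizon(a, b, 0.8)
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

The 80% horizon comes out shorter than the 50% horizon, because demanding a higher success rate pushes the threshold toward easier (shorter) tasks.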
Oh I also should clarify that we have many more than 2 projects going in parallel at any given time hahahaha, these two were just similar
July 11, 2025 at 6:11 PM
Oh I also should clarify that we have many more than 2 projects going in parallel at any given time, for what it’s worth
July 11, 2025 at 6:10 PM
To be clear: The other project was very nascent, and would’ve been far less quantitative/experimental, more like an index of developer anecdotes. To my knowledge the RCT was not formally pre-registered, but I would want to check with the people on our team who worked on it
July 11, 2025 at 4:45 PM
For me, the biggest upshot of this work, at the moment, is that the most obvious and straightforward ways of assessing AI R&D acceleration from access to AI, like "just survey people" or "monitor the vibes in your AI lab", probably won't work, or will badly misfire.
July 11, 2025 at 12:22 AM
In particular, the amount of influence and power that depends on the outcomes of these debates, without any of these people really being in the trenches of politics or business, feels very monastic
April 9, 2025 at 4:35 PM
You have these monks and scholars hidden away in a sort of monastery, and the law of the land hangs on their calm debates about the correct way to interpret our secular scripture
April 9, 2025 at 4:35 PM
Look at this extremely expansive definition of Russia’s territory on my hand-drawn 7th grade map
December 29, 2024 at 4:19 AM
Also: my high school graduation speech was superintelligence-pilled:
December 29, 2024 at 4:18 AM
I’m not sure that’s going to be a very meaningful distinction for the most advanced models, and I guess I’m specifically interested in what’s possible with both the best models-as-agents and models-as-tools
December 22, 2024 at 7:13 AM