Chris Painter

@chris.bsky.social

3.1K followers 500 following 940 posts

evals accelerationist, Head of Policy at METR, working hard on responsible scaling policies

Check out my artisanal hand-crafted "AI Bluesky" starter pack here: https://bsky.app/starter-pack/chris.bsky.social/3lbefurb2xh2u

Chris Painter

@chris.bsky.social

The full website lets you toggle to see the task horizon at an 80% success rate as well. The resolution we can observe confidently is very low at pass rates like 95%.

Full site: metr.org/blog/2025-03...

Original paper explaining: arxiv.org/abs/2503.14499

Measuring AI Ability to Complete Long Tasks

We propose measuring AI performance in terms of the *length* of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doub...

metr.org

July 31, 2025 at 3:12 AM

Chris Painter

@chris.bsky.social

We first characterize the difficulty of the tasks in our suite by seeing how long they take experienced human developers/engineers/researchers. We then sort the tasks into buckets based on how long they take humans. Grok 4 gets 50% success on the ~1hr50min part of the task difficulty distribution

July 31, 2025 at 3:03 AM
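
A rough sketch of the bucketing idea in the post above: given per-bucket success rates, find the human task length at which a model's success rate crosses a threshold such as 50% or 80%. The function name `horizon_minutes`, the bucket values, and the log-scale interpolation are all assumptions made for illustration. The linked paper describes its own fitting method, and this simplified stand-in is not that method.

```python
import math

def horizon_minutes(buckets, threshold=0.5):
    """Estimate the task length (in human minutes) where success crosses `threshold`.

    `buckets` is a list of (human_minutes, success_rate) pairs, one per
    difficulty bucket. The estimate interpolates linearly in log2(minutes)
    between the two adjacent buckets that straddle the threshold.
    Returns None if the success rate never crosses the threshold.
    """
    points = sorted(buckets)  # shortest tasks first
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        if p0 >= threshold >= p1 and p0 != p1:
            # Fraction of the way from bucket 0 to bucket 1, in success-rate terms.
            frac = (p0 - threshold) / (p0 - p1)
            log_t = math.log2(t0) + frac * (math.log2(t1) - math.log2(t0))
            return 2 ** log_t
    return None

# Hypothetical numbers for illustration only; these are NOT METR data.
example_buckets = [(4, 0.95), (15, 0.90), (60, 0.70), (240, 0.30), (960, 0.05)]

h50 = horizon_minutes(example_buckets, threshold=0.5)
h80 = horizon_minutes(example_buckets, threshold=0.8)
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

With these made-up buckets, the 80% horizon comes out much shorter than the 50% horizon, which is the general pattern the toggle on the full site is meant to expose. A threshold like 95% would need many short, well-measured tasks to pin down, which matches the point about low resolution at high pass rates.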

Chris Painter

@chris.bsky.social

Oh I also should clarify that we have many more than 2 projects going in parallel at any given time hahahaha, these two were just similar

July 11, 2025 at 6:11 PM

Chris Painter

@chris.bsky.social

Oh I also should clarify that we have many more than 2 projects going in parallel at any given time, for what it’s worth

July 11, 2025 at 6:10 PM

Chris Painter

@chris.bsky.social

To be clear: The other project was very nascent, and would’ve been far less quantitative/experimental, more like an index of developer anecdotes. To my knowledge the RCT was not formally pre-registered, but I would want to check with the people on our team who worked on it

July 11, 2025 at 4:45 PM

Chris Painter

@chris.bsky.social

For me, the biggest upshot of this work, at the moment, is that the most obvious and straightforward ways of assessing AI R&D acceleration from access to AI, like "just survey people" or "monitor the vibes in your AI lab", probably won't work, or will badly misfire.

July 11, 2025 at 12:22 AM

Chris Painter

@chris.bsky.social

In particular, the amount of influence and power that depends on the outcomes of these debates, without any of these people really being in the trenches of politics or business, feels very monastic

April 9, 2025 at 4:35 PM

Chris Painter

@chris.bsky.social

You have these monks and scholars hidden away in a sort of monastery, and the law of the land hangs on their calm debates about the correct way to interpret our secular scripture

April 9, 2025 at 4:35 PM

Chris Painter

@chris.bsky.social

Look at this extremely expansive definition of Russia’s territory on my hand-drawn 7th grade map

December 29, 2024 at 4:19 AM

Chris Painter

@chris.bsky.social

Also: my high school graduation speech was superintelligence-pilled:

December 29, 2024 at 4:18 AM

Chris Painter

@chris.bsky.social

I’m not sure that’s going to be a very meaningful distinction for the most advanced models, and I guess I’m specifically interested in what’s possible with both the best models-as-agents and models-as-tools

December 22, 2024 at 7:13 AM
