Ryan Greenblatt
@ryangreenblatt.bsky.social
Chief scientist at Redwood Research (https://www.redwoodresearch.org/), focused on technical AI safety research to reduce risks from rogue AIs
Anthropic hasn't made clear intermediate predictions, so I propose a timeline with powerful AI in March 2027 that Anthropic might endorse. Then we can see whose predictions end up closer to correct.
November 3, 2025 at 5:25 PM
Earlier predictions (before powerful AI) help (partially) adjudicate who was right and allow for updating before it's too late.

Sometimes this isn't possible (predictions roughly agree until too late), but my predictions aren't consistent with powerful AI by early 2027!
November 3, 2025 at 5:25 PM
I include more details in the footnotes here: www.lesswrong.com/posts/FG54eu...
October 10, 2025 at 4:31 PM
If companies are strongly opposed because this would leak too much IP, they could instead disclose this information to some number of non-conflicted third parties (e.g. METR) who could then publicly share their conclusions. 7/
October 10, 2025 at 4:31 PM
Given that the situation is complicated, ideally AI companies would publicly disclose enough information that other actors can come to their own conclusion about how worried they should be about CoT obfuscation. 6/
October 10, 2025 at 4:31 PM
The situation is also just generally complicated because choices like what SFT/prompting initialization you use for your CoT could have big effects on obfuscation and aren't exactly well described as applying optimization pressure to the CoT. 5/
October 10, 2025 at 4:31 PM
It's worth noting that some versions of training against CoT might be totally fine (as in, they don't incentivize obfuscation of misaligned reasoning), and that a particular approach to training against CoT could be worth it even if it does cause problems. 4/
October 10, 2025 at 4:31 PM
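A minimal sketch of the distinction 4/ and 5/ gesture at, assuming a simple RL-style reward; this is my own illustration (not anything from Anthropic's or Redwood's actual setups), and the monitor below is a hypothetical stand-in:

```python
# Toy illustration of what "applying optimization pressure to the CoT" means.
# `toy_cot_monitor` is a hypothetical stand-in for whatever flags
# misaligned-looking reasoning in the chain of thought.

def toy_cot_monitor(cot: str) -> bool:
    # Hypothetical monitor: flags CoT that mentions gaming the reward.
    return "hack the reward" in cot.lower()

def reward_outcome_only(cot: str, is_correct: bool) -> float:
    # No term depends on the CoT text, so the policy gains nothing by
    # rewording or hiding its reasoning: no incentive toward obfuscation.
    return 1.0 if is_correct else 0.0

def reward_with_cot_penalty(cot: str, is_correct: bool) -> float:
    # A penalty keyed off the CoT contents. The policy can now raise its
    # reward either by reasoning differently *or* by phrasing the same
    # reasoning so the monitor stops flagging it -- the obfuscation
    # incentive the thread is worried about.
    base = 1.0 if is_correct else 0.0
    penalty = 0.5 if toy_cot_monitor(cot) else 0.0
    return base - penalty
```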
I think we should try to create an incentive gradient that pushes AI companies to disclose information even if that information makes them look bad, so we should make companies feel some heat for not disclosing important information like whether they are training against CoT. 3/
October 10, 2025 at 4:31 PM
It's striking that Anthropic says nothing about whether they train against CoT, given that their system card is very thorough and includes a section on "Reasoning faithfulness" (kudos to them for providing so much other info!). Naively, this is evidence that they are training against CoT and didn't want to disclose it. 2/
October 10, 2025 at 4:31 PM
Side question: what about the "shut it all down" plan proposed by (e.g.) MIRI?

I think this probably requires substantially more political will than Plan A and seems worse than a well-implemented version of Plan A. I say more here: www.lesswrong.com/posts/E8n93n...
October 8, 2025 at 6:46 PM