Chris Olah
@colah.bsky.social
Reverse engineering neural networks at Anthropic. Previously Distill, OpenAI, Google Brain. Personal account.
Our new note demonstrates that interference weights in toy models can exhibit phenomenology strikingly similar to that of Towards Monosemanticity...
July 29, 2025 at 11:33 PM
The keen reader may recall all these plots referencing "interference weights??" in Towards Monosemanticity (transformer-circuits.pub/2023/monosem...).
July 29, 2025 at 11:33 PM
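For intuition, here is a minimal numpy sketch (my own illustration, not the note's actual code) of what interference means in a toy model: when more features than dimensions are embedded through a matrix W, the off-diagonal entries of W^T W measure how strongly pairs of features interfere with each other.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, d = 64, 16

    # Random unit-norm embedding directions, one column per feature.
    W = rng.normal(size=(d, n_features))
    W /= np.linalg.norm(W, axis=0, keepdims=True)

    # Off-diagonal entries of W^T W: how much each feature pair interferes.
    gram = W.T @ W
    interference = gram - np.eye(n_features)

    print("mean |interference|:", np.abs(interference).mean())
    print("max  |interference|:", np.abs(interference).max())

With 64 features squeezed into 16 dimensions, nonzero interference is unavoidable (you cannot fit 64 orthogonal directions in a 16-dimensional space), which is the regime the note studies.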
There are a lot more cool examples in the "biology" paper! transformer-circuits.pub/2025/attribu...
March 27, 2025 at 6:18 PM
In this example, Haiku is very sneaky. It claims to be solving this question, but is actually working backwards so that it comes to the same answer as the human.
March 27, 2025 at 6:18 PM
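To make the forward-versus-backward contrast concrete, here is a toy Python sketch (purely illustrative; the cosine-style question and function names are my invention, not taken from the paper): an honest solver computes the intermediate step first, while a motivated one starts from the human's claimed answer and back-fills the intermediate value.

    import math

    def solve_forward(x: float) -> int:
        c = math.cos(x)            # genuinely compute the intermediate step
        return round(5 * c)

    def solve_backward(claimed_answer: int) -> float:
        # Pick the "intermediate result" so the chain lands on the claim.
        return claimed_answer / 5

    x, claimed = 23423.0, 4
    print("forward answer:", solve_forward(x))
    print("back-filled 'cos(x)':", solve_backward(claimed))  # 0.8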
How does Haiku generate poetry? It turns out that, after the first line, it generates candidates for the rhyming end of the next line and "writes towards them"!
March 27, 2025 at 6:18 PM
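As a loose illustration of that planning loop (a hand-rolled sketch; the rhyme table and helper names are invented, not Haiku's mechanism), one can first choose candidate end words that rhyme with the previous line, then build the next line to land on one of them:

    RHYMES = {  # tiny hand-made rhyme table; real rhyming needs phoneme data
        "light": ["night", "flight", "bright"],
        "day":   ["way", "play", "stay"],
    }

    def plan_candidates(prev_line: str) -> list[str]:
        last_word = prev_line.lower().rstrip(".,!?").split()[-1]
        return RHYMES.get(last_word, [])

    def write_toward(candidate: str) -> str:
        # Stand-in for generating a line conditioned on the planned ending.
        return f"then vanished into the {candidate}"

    prev = "A lantern flickered in the light"
    candidates = plan_candidates(prev)   # -> ['night', 'flight', 'bright']
    print(write_toward(candidates[0]))   # line written to end on 'night'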
How does Claude 3.5 Haiku complete the sentence "Fact: the capital of the state containing Dallas is"?

We can mechanistically see that Haiku is assembling two different facts ("Dallas is in Texas" and "the capital of Texas is Austin") rather than directly knowing the answer.
March 27, 2025 at 6:18 PM
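A minimal sketch of that two-hop composition (illustrative only; the dictionaries stand in for facts the model stores internally):

    CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
    STATE_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}

    def capital_of_state_containing(city: str) -> str:
        state = CITY_TO_STATE[city]   # hop 1: "Dallas is in Texas"
        return STATE_CAPITAL[state]   # hop 2: "the capital of Texas is Austin"

    print(capital_of_state_containing("Dallas"))  # -> Austin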