Can
@canrager.bsky.social
Humans and LLMs think fast and slow. Do SAEs recover slow concepts in LLMs? Not really.

Our Temporal Feature Analyzer discovers contextual features in LLMs that detect event boundaries, parse complex grammar, and represent in-context learning (ICL) patterns.
November 13, 2025 at 10:32 PM
Reposted by Can
What's the right unit of analysis for understanding LLM internals? We explore in our mech interp survey (a major update from our 2024 ms).

We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
October 1, 2025 at 2:03 PM
Reposted by Can
🚨 Registration is live! 🚨

The New England Mechanistic Interpretability (NEMI) Workshop is happening Aug 22nd 2025 at Northeastern University!

A chance for the mech interp community to nerd out on how models really work 🧠🤖

🌐 Info: nemiconf.github.io/summer25/
📝 Register: forms.gle/v4kJCweE3UUH...
June 30, 2025 at 10:55 PM
Can we uncover the list of topics a language model is censored on?

Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:
June 13, 2025 at 3:59 PM
Announcing ARBOR, an open research community for collectively understanding how reasoning models like OpenAI o3 and DeepSeek-R1 work. We invite all researchers and enthusiasts to join this initiative by @wattenberg.bsky.social's and @davidbau.bsky.social's labs.

arborproject.github.io
February 20, 2025 at 7:55 PM
Addressing key concerns about AI competition.

darioamodei.com/on-deepseek-...
Dario Amodei — On DeepSeek and Export Controls
January 29, 2025 at 6:54 PM
The #38c3 Chaos Computer Conference was a blast! 🚀 Find the accompanying code for my intro workshop on activation steering in the thread.
January 10, 2025 at 5:10 PM
Sparse Autoencoders (SAEs) are popular, with 10+ new approaches proposed in the last year. But how do we know whether we are making progress? So far, the field has relied on imperfect proxy metrics.

We are releasing SAE Bench, a suite of 8 SAE evaluations!

Project co-led with Adam Karvonen.
December 11, 2024 at 6:07 AM
Reposted by Can
More big news! Applications are open for the NDIF Summer Engineering Fellowship—an opportunity to work on cutting-edge AI research infrastructure this summer in Boston! 🚀
December 10, 2024 at 9:59 PM
Safe travels to #NeurIPS2024 in Vancouver, BC! Join our poster sessions on *Measuring Progress in Dictionary Learning with Board Game Models* and *Evaluating Sparse Autoencoders on Concept Erasure Tasks*. Reach out to brainstorm future interpretability benchmarks.
December 9, 2024 at 10:15 PM