Aaron Mueller
@amuuueller.bsky.social
Postdoc at Northeastern and incoming Asst. Prof. at Boston U. Working on NLP, interpretability, causality. Previously: JHU, Meta, AWS
Thanks again to the co-authors! Such a broad survey required many perspectives. @jannikbrinkmann.bsky.social, Millicent Li, Samuel Marks, @koyena.bsky.social, @nikhil07prakash.bsky.social, @canrager.bsky.social (1/2)
October 1, 2025 at 2:03 PM
We also made the causal graph formalism more precise. Interpretability and causality are intimately linked; the latter makes the former more trustworthy and rigorous. This formal link should be strengthened in future work.
October 1, 2025 at 2:03 PM
One of the bigger changes was establishing criteria for success in interpretability. What units of analysis should you use if you know what you’re looking for? If you *don’t* know what you’re looking for?
October 1, 2025 at 2:03 PM
We still have a lot to learn about editing NN representations.

To edit or steer, we cannot simply choose semantically relevant representations; we must choose the ones that will have the intended impact. As @peterbhase.bsky.social found, these are often distinct.
May 27, 2025 at 5:07 PM
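For readers new to representation-based steering, here is a minimal sketch of the general recipe: add a chosen direction to a hidden state during the forward pass via a hook. The tiny model, the layer choice, and the scale below are placeholders, not the authors' setup.

# Minimal sketch of representation steering with a PyTorch forward hook.
# The toy model, layer, and steering strength are illustrative; a real
# experiment would steer a pretrained LM with a learned or selected direction.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

steer_dir = torch.randn(d_model)
steer_dir = steer_dir / steer_dir.norm()   # unit direction to promote
alpha = 4.0                                # steering strength

def add_steering(module, inputs, output):
    # Replace this layer's output with a steered version.
    return output + alpha * steer_dir

handle = model[0].register_forward_hook(add_steering)
x = torch.randn(2, d_model)
steered = model(x)
handle.remove()
unsteered = model(x)
print((steered - unsteered).norm())        # nonzero: the edit propagated downstream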
By limiting steering to output features, we recover >90% of the performance of the best supervised representation-based steering methods—and at some locations, we outperform them!
May 27, 2025 at 5:07 PM
We define the notion of an “output feature”, whose role is to increase p(some token(s)). Steering these gives better results than steering “input features”, whose role is to attend to concepts in the input. We propose fast methods to sort features into these categories.
May 27, 2025 at 5:07 PM
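One plausible fast heuristic for this kind of sorting is sketched below, assuming features come from an SAE decoder and the LM's unembedding matrix is available. The random matrices and the cutoff are illustrative, and this is not necessarily the paper's exact method.

# Sketch of one fast heuristic for sorting dictionary features into
# "output features" (directly promote specific tokens) vs. "input features".
# W_dec and W_U are random placeholders; a real run would use an SAE decoder
# and the LM's unembedding.
import torch

torch.manual_seed(0)
d_model, n_features, vocab = 64, 512, 1000
W_dec = torch.randn(n_features, d_model)   # SAE decoder: one direction per feature
W_U = torch.randn(d_model, vocab)          # unembedding: residual stream -> logits

# Project each feature's decoder direction onto the vocabulary ("logit lens").
logit_effects = W_dec @ W_U                             # (n_features, vocab)
# A feature whose direct logit effect is concentrated on a few tokens behaves
# like an output feature; a diffuse effect suggests an input feature.
probs = torch.softmax(logit_effects, dim=-1)
concentration = probs.max(dim=-1).values                # peakedness of the effect
is_output_feature = concentration > concentration.quantile(0.9)  # illustrative cutoff
print(is_output_feature.float().mean())                 # fraction flagged as output features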
Reposted by Aaron Mueller
Couldn’t be happier to have co-authored this with a stellar team, including: Michael Hu, @amuuueller.bsky.social, @alexwarstadt.bsky.social, @lchoshen.bsky.social, Chengxu Zhuang, @adinawilliams.bsky.social, Ryan Cotterell, @tallinzen.bsky.social
May 12, 2025 at 3:48 PM
... Jing Huang, Rohan Gupta, Yaniv Nikankin, @hadasorgad.bsky.social, Nikhil Prakash, @anja.re, Aruna Sankaranarayanan, Shun Shao, @alestolfo.bsky.social, @mtutek.bsky.social, @amirzur, @davidbau.bsky.social, and @boknilev.bsky.social!
April 23, 2025 at 6:15 PM
This was a huge collaboration with many great folks! If you get a chance, be sure to talk to Atticus Geiger, @sarah-nlp.bsky.social, @danaarad.bsky.social, Iván Arcuschin, @adambelfki.bsky.social, @yiksiu.bsky.social, Jaden Fiotto-Kaufmann, @talhaklay.bsky.social, @michaelwhanna.bsky.social, ...
April 23, 2025 at 6:15 PM
We’re eager to establish MIB as a meaningful and lasting standard for comparing the quality of MI methods. If you’ll be at #ICLR2025 or #NAACL2025, please reach out to chat!

📜 arxiv.org/abs/2504.13151
MIB: A Mechanistic Interpretability Benchmark
April 23, 2025 at 6:15 PM
We release many public resources, including:

🌐 Website: mib-bench.github.io
📄 Data: huggingface.co/collections/...
💻 Code: github.com/aaronmueller...
📊 Leaderboard: Coming very soon!
April 23, 2025 at 6:15 PM
These results highlight that there has been real progress in the field! We also recovered known findings, such as integrated gradients improving attribution quality. This sanity check verifies that our benchmark is capturing something real.
April 23, 2025 at 6:15 PM
We find that supervised methods like DAS significantly outperform methods like sparse autoencoders or principal component analysis. Mask-learning methods also perform well, but not as well as DAS.
April 23, 2025 at 6:15 PM
This is evaluated using the interchange intervention accuracy (IIA): we featurize the activations, intervene on the specific causal variable, and see whether the intervention has the expected effect on model behavior.
April 23, 2025 at 6:15 PM
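A toy sketch of the interchange intervention itself is below. The tiny model, the fixed rotation used as the featurizer, and the chosen coordinates are all illustrative, not the benchmark's actual models or featurizers.

# Toy sketch of an interchange intervention: featurize hidden states, swap the
# coordinates assigned to a causal variable from a source run into a base run,
# and check the effect on the output.
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 8
encoder, decoder = nn.Linear(2, d), nn.Linear(d, 2)
R = torch.linalg.qr(torch.randn(d, d)).Q    # featurizer: an orthonormal rotation
var_dims = slice(0, 2)                      # coordinates hypothesized to encode the variable

def run(x, patch=None):
    f = encoder(x) @ R                      # featurized activations
    if patch is not None:
        f = f.clone()
        f[..., var_dims] = patch            # interchange the causal variable
    return decoder(f @ R.T)                 # un-featurize and continue the forward pass

base, source = torch.randn(1, 2), torch.randn(1, 2)
source_var = (encoder(source) @ R)[..., var_dims]
patched_out = run(base, patch=source_var)
# IIA scores how often patched_out matches the counterfactual label, i.e. what
# the model should output if only that variable took its value from the source.
print(patched_out)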
The causal variable localization track measures the quality of featurization methods (like DAS, SAEs, etc.). How well can we decompose activations into more meaningful units, and intervene selectively on just the target variable?
April 23, 2025 at 6:15 PM
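To make "supervised featurizer" concrete, here is a toy sketch of the idea behind DAS-style methods: learn an orthogonal rotation so that a few coordinates carry the target variable, trained against counterfactual targets. The data, loss, and dimensions below are placeholders, not the benchmark's setup.

# Sketch of a learned orthogonal featurizer (DAS-style): rotate activations,
# swap the first k coordinates between a base and a source run, rotate back,
# and train the rotation to reproduce toy counterfactual targets.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

torch.manual_seed(0)
d, k = 8, 2                                   # hidden size, subspace size for the variable
rotation = orthogonal(nn.Linear(d, d, bias=False))
opt = torch.optim.Adam(rotation.parameters(), lr=1e-2)

def interchange(h_base, h_src):
    f_base, f_src = rotation(h_base), rotation(h_src)
    f_swap = torch.cat([f_src[..., :k], f_base[..., k:]], dim=-1)
    return f_swap @ rotation.weight           # undo the rotation (orthogonal inverse = transpose)

for _ in range(100):
    h_base, h_src = torch.randn(32, d), torch.randn(32, d)
    target = h_base.clone()
    target[..., :k] = h_src[..., :k]          # toy counterfactual target
    loss = ((interchange(h_base, h_src) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())                            # should shrink as the subspace aligns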
We find that edge-level methods generally outperform node-level methods, that attribution patching with integrated gradients generally outperforms other methods (including more exact methods!), and that mask-learning methods perform well.
April 23, 2025 at 6:15 PM
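A minimal sketch of attribution patching with integrated gradients: approximate the effect of patching an activation by dotting the clean-minus-corrupt activation difference with gradients averaged along the interpolation path between the two runs. The toy two-layer model and metric are placeholders.

# Sketch of attribution patching with integrated gradients for one hidden layer.
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 16
layer1, layer2 = nn.Linear(d, d), nn.Linear(d, 1)

def metric_from_hidden(h):
    # Stand-in for a downstream metric (e.g., a logit difference).
    return layer2(torch.relu(h)).sum()

x_clean, x_corrupt = torch.randn(1, d), torch.randn(1, d)
h_clean, h_corrupt = layer1(x_clean).detach(), layer1(x_corrupt).detach()

n_steps = 16
grads = torch.zeros_like(h_clean)
for alpha in torch.linspace(0, 1, n_steps):
    # Gradient of the metric at an interpolated activation (integrated gradients).
    h = (h_corrupt + alpha * (h_clean - h_corrupt)).requires_grad_(True)
    metric_from_hidden(h).backward()
    grads += h.grad / n_steps

# Per-dimension attribution: which hidden units most move the metric
# if patched from the corrupt run toward the clean run?
attribution = (h_clean - h_corrupt) * grads
print(attribution.squeeze())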