Aaron Mueller
@amuuueller.bsky.social
Postdoc at Northeastern and incoming Asst. Prof. at Boston U. Working on NLP, interpretability, causality. Previously: JHU, Meta, AWS
We also made the causal graph formalism more precise. Interpretability and causality are intimately linked; the latter makes the former more trustworthy and rigorous. This formal link should be strengthened in future work.
October 1, 2025 at 2:03 PM
What's the right unit of analysis for understanding LLM internals? We explore this question in our mech interp survey (a major update of our 2024 manuscript).
We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
October 1, 2025 at 2:03 PM
If you're at #ICML2025, chat with me, @sarah-nlp.bsky.social, Atticus, and others at our poster 11am - 1:30pm at East #1205! We're establishing a 𝗠echanistic 𝗜nterpretability 𝗕enchmark.
We're planning to keep this a living benchmark; come by and share your ideas/hot takes!
July 17, 2025 at 5:45 PM
We find that supervised methods like DAS significantly outperform methods like sparse autoencoders or principal component analysis. Mask-learning methods also perform well, but not as well as DAS.
April 23, 2025 at 6:15 PM
This is evaluated using the interchange intervention accuracy (IIA): we featurize the activations, intervene on the specific causal variable, and see whether the intervention has the expected effect on model behavior.
April 23, 2025 at 6:15 PM
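Roughly, the IIA procedure looks like the sketch below. The helper callables (get_activation, run_with_patch, featurize, defeaturize) are illustrative stand-ins for model hooks and a featurizer (e.g., a DAS rotation or SAE encoder), not the MIB API.

```python
def interchange_intervention_accuracy(examples, get_activation, run_with_patch,
                                      featurize, defeaturize, feature_idx):
    """examples: (base_input, counterfactual_input, expected_output) triples, where
    expected_output is what the high-level causal model predicts after the interchange."""
    correct = 0
    for base, counterfactual, expected in examples:
        # Featurize the cached activations for both inputs.
        base_feats = featurize(get_activation(base))
        cf_feats = featurize(get_activation(counterfactual))

        # Intervene only on the feature(s) hypothesized to encode the causal variable.
        base_feats[..., feature_idx] = cf_feats[..., feature_idx]

        # Map back to activation space, re-run the model with the patched activation,
        # and check whether the behavior changes as the causal model predicts.
        prediction = run_with_patch(base, defeaturize(base_feats))
        correct += int(prediction == expected)
    return correct / len(examples)
```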
The causal variable localization track measures the quality of featurization methods (like DAS, SAEs, etc.). How well can we decompose activations into more meaningful units, and intervene selectively on just the target variable?
April 23, 2025 at 6:15 PM
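For intuition, a DAS-style featurizer can be as simple as a learned orthogonal rotation of an activation, with the causal variable assumed to live in a few of the rotated coordinates. This is an illustrative sketch, not our implementation.

```python
import torch

class RotationFeaturizer(torch.nn.Module):
    """Illustrative DAS-style featurizer: an orthogonal rotation whose first few
    coordinates are trained (elsewhere) to align with the target causal variable."""
    def __init__(self, dim):
        super().__init__()
        # Orthogonal parametrization keeps featurize/defeaturize exact inverses.
        self.rotation = torch.nn.utils.parametrizations.orthogonal(
            torch.nn.Linear(dim, dim, bias=False))

    def featurize(self, activation):
        return self.rotation(activation)          # z = W a

    def defeaturize(self, features):
        return features @ self.rotation.weight    # a = W^T z, since W is orthogonal
```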
We find that edge-level methods generally outperform node-level methods, that attribution patching with integrated gradients generally outperforms other methods (including more exact methods!), and that mask-learning methods perform well.
April 23, 2025 at 6:15 PM
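Stripped down, attribution patching with integrated gradients estimates the effect of patching an activation as the clean-minus-corrupt activation difference times the metric's gradient, averaged along the path between the two activations. The sketch below makes assumptions of its own (the grad_of_metric_at callable, the step count) and is not the exact benchmark implementation.

```python
import torch

def attribution_patching_ig_score(clean_act, corrupt_act, grad_of_metric_at, n_steps=5):
    """First-order estimate of how much patching this activation changes the task metric.
    grad_of_metric_at(act) is an assumed callable returning the gradient of the metric
    w.r.t. the activation, evaluated at `act`."""
    delta = clean_act - corrupt_act
    # Integrated gradients: average the gradient along the straight-line path
    # from the corrupt activation to the clean activation.
    grads = [grad_of_metric_at(corrupt_act + (k / n_steps) * delta)
             for k in range(1, n_steps + 1)]
    avg_grad = torch.stack(grads).mean(dim=0)
    # Attribution score: (activation difference) dot (averaged gradient).
    return (delta * avg_grad).sum()
```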
Thus, we split 𝘧 into two metrics: the integrated 𝗰𝗶𝗿𝗰𝘂𝗶𝘁 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗿𝗮𝘁𝗶𝗼 (CPR), and the integrated 𝗰𝗶𝗿𝗰𝘂𝗶𝘁–𝗺𝗼𝗱𝗲𝗹 𝗱𝗶𝘀𝘁𝗮𝗻𝗰𝗲 (CMD). Both involve integrating 𝘧 across many circuit sizes. This implicitly captures 𝗳𝗮𝗶𝘁𝗵𝗳𝘂𝗹𝗻𝗲𝘀𝘀 and 𝗺𝗶𝗻𝗶𝗺𝗮𝗹𝗶𝘁𝘆 at the same time!
April 23, 2025 at 6:15 PM
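Concretely, integrating over circuit sizes can be as simple as taking the area under the metric-vs-size curve; the normalization in this sketch is illustrative rather than the paper's exact definition.

```python
import numpy as np

def integrated_metric(values_per_size, sizes):
    """Area under a per-circuit-size metric curve (e.g., the circuit/model performance
    ratio for CPR, or the circuit-model distance for CMD), normalized by the range of
    circuit sizes so that larger circuits cannot trivially win by including everything."""
    values_per_size, sizes = np.asarray(values_per_size), np.asarray(sizes)
    return np.trapz(values_per_size, sizes) / (sizes[-1] - sizes[0])
```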
The circuit localization track compares causal graph localization methods. Faithfulness (𝘧) is a common way to evaluate a single circuit, but it’s used for two distinct Qs: (1) Does the circuit perform well? (2) Does the circuit match the model’s behavior?
April 23, 2025 at 6:15 PM
Our data includes tasks of varying difficulties, including some that have never been mechanistically analyzed. We also include models of varying capabilities. We release our data, including counterfactual input pairs.
April 23, 2025 at 6:15 PM
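To illustrate what a counterfactual pair looks like (a made-up example and format, not the released data): the two inputs differ only in the variable of interest, so the expected output flips in a predictable way.

```python
# Hypothetical counterfactual pair for an indirect-object-identification-style task.
pair = {
    "base": {
        "prompt": "When Mary and John went to the store, John gave a drink to",
        "expected": " Mary",
    },
    "counterfactual": {
        "prompt": "When Mary and John went to the store, Mary gave a drink to",
        "expected": " John",
    },
}
```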
What should a mech interp benchmark evaluate? We think there are two fundamental paradigms: 𝗹𝗼𝗰𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 and 𝗳𝗲𝗮𝘁𝘂𝗿𝗶𝘇𝗮𝘁𝗶𝗼𝗻. We propose one track each: 𝗰𝗶𝗿𝗰𝘂𝗶𝘁 𝗹𝗼𝗰𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 and 𝗰𝗮𝘂𝘀𝗮𝗹 𝘃𝗮𝗿𝗶𝗮𝗯𝗹𝗲 𝗹𝗼𝗰𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻.
April 23, 2025 at 6:15 PM
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?
We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!
April 23, 2025 at 6:15 PM