Mor Geva
@megamor2.bsky.social
🎉 Our Actionable Interpretability workshop has been accepted to #ICML2025! 🎉
> Follow @actinterp.bsky.social
> Website actionable-interpretability.github.io

@talhaklay.bsky.social @anja.re @mariusmosbach.bsky.social @sarah-nlp.bsky.social @iftenney.bsky.social

Paper submission deadline: May 9th!
March 31, 2025 at 4:59 PM
In a final experiment, we show that output-centric methods can be used to "revive" features previously thought to be "dead" 🧟‍♂️ reviving hundreds of SAE features in Gemma 2! 6/
January 28, 2025 at 7:38 PM
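A minimal sketch of that revival loop, with hypothetical helpers (`output_centric_description`, `feature_activation` are illustrative names, not the paper's code): a feature with no activating examples still gets an output-centric description, and prompts written to match that description are then checked for nonzero activation.

```python
import torch

def output_centric_description(feature_id: int) -> str:
    # Hypothetical helper: e.g. summarize the tokens the feature promotes
    # through the unembedding (see the vocabulary-projection sketch below).
    return "text about weather"

def feature_activation(feature_id: int, prompt: str) -> float:
    # Hypothetical helper: run the model + SAE and read this feature's
    # activation on the prompt. A random toy value stands in here.
    return float(torch.rand(()))

def try_to_revive(feature_id: int, n_prompts: int = 5) -> bool:
    desc = output_centric_description(feature_id)
    # In a real pipeline an LLM would write prompts that should elicit desc.
    prompts = [f"({desc}) candidate prompt {i}" for i in range(n_prompts)]
    return any(feature_activation(feature_id, p) > 0 for p in prompts)

print("revived:", try_to_revive(feature_id=123))
```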
Unsurprisingly, while descriptions from activating inputs better capture what triggers a feature, output-centric methods do much better at predicting how steering the feature will affect the model’s output!

But combining the two works best! 🚀 5/
January 28, 2025 at 7:37 PM
To fix this, we first propose using both input- and output-based evaluations for feature descriptions.
Our output-based eval measures how well a description of a feature captures its effect on the model's generation. 3/
January 28, 2025 at 7:36 PM
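A hedged sketch of what such an output-based eval could look like (toy tensors and a placeholder judge; `generate` and `judge` are illustrative names, not the paper's API): steer the model along the feature direction and check whether the description predicts the change in generation.

```python
import torch

torch.manual_seed(0)
d_model, vocab = 64, 1000
W_U = torch.randn(d_model, vocab)      # stand-in unembedding
h0 = torch.randn(d_model)              # stand-in residual-stream state

def generate(steer_vec=None, alpha=0.0):
    # Toy "generation": top tokens preferred by the (pretend) model,
    # optionally after adding the feature direction to the residual.
    h = h0 if steer_vec is None else h0 + alpha * steer_vec
    return (h @ W_U).topk(10).indices.tolist()

def judge(description: str, baseline, steered) -> bool:
    # Placeholder: in a real eval an LLM would decide whether the steered
    # generation, relative to the baseline, matches the description.
    return steered != baseline

feature_dir = torch.randn(d_model)
desc = "promotes weather-related words"    # candidate feature description
ok = judge(desc, generate(), generate(steer_vec=feature_dir, alpha=8.0))
print("description captures the steering effect:", ok)
```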
How can we interpret LLM features at scale? 🤔

Current pipelines describe features via their activating inputs, which is costly and ignores how features causally affect model outputs!
We propose efficient output-centric methods that better predict the steering effect of a feature.

New preprint led by @yoav.ml 🧵1/
January 28, 2025 at 7:34 PM
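One cheap output-centric signal, sketched with random stand-in weights (an illustration of the general idea, not necessarily the paper's exact method): project the feature direction through the unembedding and read off which tokens it promotes or suppresses, with no forward passes needed.

```python
import torch

d_model, vocab = 64, 1000
W_U = torch.randn(d_model, vocab)      # stand-in unembedding matrix
feature_dir = torch.randn(d_model)     # stand-in feature direction

logits = feature_dir @ W_U             # effect of the feature on each token
promoted = logits.topk(10).indices.tolist()
suppressed = (-logits).topk(10).indices.tolist()
print("promoted token ids:", promoted)
print("suppressed token ids:", suppressed)
```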
Most operation descriptions are plausible based on human judgment.
We also observe interesting operations implemented by heads, like the extension of time periods (day → month → year) and association of known figures with years relevant to their historical significance (9/10)
December 18, 2024 at 6:01 PM
(3) Smaller models tend to encode more relations in a single head

(4) In Llama-3.1 models, which use grouped-query attention, grouped heads often implement the same or similar relations (7/10)
December 18, 2024 at 5:59 PM
(1) Different models encode certain relations across attention heads to similar degrees

(2) Different heads implement the same relation to varying degrees, which has implications for localization and editing of LLMs (6/10)
December 18, 2024 at 5:58 PM
Experiments on 20 operations and 6 LLMs show that MAPS’s estimates strongly correlate with the heads’ outputs during inference

Ablating heads that implement an operation damages the model’s ability to perform tasks requiring that operation far more than removing other heads does (4/10)
December 18, 2024 at 5:57 PM
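A minimal sketch of the head-ablation comparison, with a single toy attention layer standing in for an LLM (all weights random; not the paper's setup): drop one head's contribution and measure how much the layer output changes.

```python
import torch

n_heads, d_head, d_model, seq = 4, 16, 64, 8
W_O = torch.randn(n_heads, d_head, d_model)    # per-head output projection
head_out = torch.randn(seq, n_heads, d_head)   # per-head attention outputs

def layer_output(ablate_head=None):
    out = torch.zeros(seq, d_model)
    for h in range(n_heads):
        if h == ablate_head:
            continue                            # knock this head out
        out += head_out[:, h] @ W_O[h]
    return out

# In the real experiment one would compare downstream task accuracy;
# here we just look at how much the layer's output moves.
delta = (layer_output() - layer_output(ablate_head=2)).norm().item()
print("effect of ablating head 2:", delta)
```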
MAPS infers the head’s functionality by examining different groups of mappings:

(A) Predefined relations: groups expressing certain relations (e.g. city of a country)

(B) Salient operations: groups for which the head induces the most prominent effect (3/10)
December 18, 2024 at 5:57 PM
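A hedged sketch of reading a head's token-to-token mapping straight from its parameters (an illustration in the spirit of the framework, not MAPS itself; all weights are random stand-ins): push token embeddings through the head's OV circuit and back out through the unembedding, then score a predefined relation by how highly each pair's target ranks.

```python
import torch

vocab, d_model, d_head = 1000, 64, 16
E = torch.randn(vocab, d_model)    # token embeddings
W_V = torch.randn(d_model, d_head) # head value projection
W_O = torch.randn(d_head, d_model) # head output projection
U = torch.randn(d_model, vocab)    # unembedding

# mapping[i, j]: how strongly attending to token i promotes token j.
mapping = E @ W_V @ W_O @ U

# Score a predefined relation (e.g. city of a country) via toy
# (source, target) token-id pairs: how highly does each target rank?
pairs = [(10, 42), (7, 99)]
for src, tgt in pairs:
    rank = int((mapping[src] >= mapping[src, tgt]).sum())
    print(f"target {tgt} ranks {rank}/{vocab} for source {src}")
```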
What's in an attention head? 🤯

We present an efficient framework – MAPS – for inferring the functionality of attention heads in LLMs ✨directly from their parameters✨

A new preprint with Amit Elhelo 🧵 (1/10)
December 18, 2024 at 5:55 PM
Post a photo of yourself from a different era
November 28, 2024 at 5:38 PM