Mor Geva
@megamor2.bsky.social
🎉 Our Actionable Interpretability workshop has been accepted to #ICML2025! 🎉
> Follow @actinterp.bsky.social
> Website actionable-interpretability.github.io

@talhaklay.bsky.social @anja.re @mariusmosbach.bsky.social @sarah-nlp.bsky.social @iftenney.bsky.social

Paper submission deadline: May 9th!
March 31, 2025 at 4:59 PM
In a final experiment, we show that output-centric methods can be used to "revive" features previously thought to be "dead" 🧟‍♂️ reviving hundreds of SAE features in Gemma 2! 6/
January 28, 2025 at 7:38 PM
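A minimal sketch of that revival loop, with hypothetical helpers (`output_centric_description`, `feature_activation` are illustrative names, not the paper's code): a feature with no activating examples still gets an output-centric description, and prompts written to match that description are then checked for nonzero activation.

```python
import torch

def output_centric_description(feature_id: int) -> str:
    # Hypothetical helper: e.g. summarize the tokens the feature promotes
    # through the unembedding (see the vocabulary-projection sketch below).
    return "text about weather"

def feature_activation(feature_id: int, prompt: str) -> float:
    # Hypothetical helper: run the model + SAE and read this feature's
    # activation on the prompt. A random toy value stands in here.
    return float(torch.rand(()))

def try_to_revive(feature_id: int, n_prompts: int = 5) -> bool:
    desc = output_centric_description(feature_id)
    # In a real pipeline an LLM would write prompts that should elicit desc.
    prompts = [f"({desc}) candidate prompt {i}" for i in range(n_prompts)]
    return any(feature_activation(feature_id, p) > 0 for p in prompts)

print("revived:", try_to_revive(feature_id=123))
```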
Unsurprisingly, while descriptions from activating inputs better capture what triggers a feature, output-centric methods do much better at predicting how steering the feature will affect the model’s output!

But combining the two works best! 🚀 5/
January 28, 2025 at 7:37 PM
To fix this, we first propose using both input- and output-based evaluations for feature descriptions.
Our output-based eval measures how well a description of a feature captures its effect on the model's generation. 3/
January 28, 2025 at 7:36 PM
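A hedged sketch of what such an output-based eval could look like (toy tensors and a placeholder judge; `generate` and `judge` are illustrative names, not the paper's API): steer the model along the feature direction and check whether the description predicts the change in generation.

```python
import torch

torch.manual_seed(0)
d_model, vocab = 64, 1000
W_U = torch.randn(d_model, vocab)      # stand-in unembedding
h0 = torch.randn(d_model)              # stand-in residual-stream state

def generate(steer_vec=None, alpha=0.0):
    # Toy "generation": top tokens preferred by the (pretend) model,
    # optionally after adding the feature direction to the residual.
    h = h0 if steer_vec is None else h0 + alpha * steer_vec
    return (h @ W_U).topk(10).indices.tolist()

def judge(description: str, baseline, steered) -> bool:
    # Placeholder: in a real eval an LLM would decide whether the steered
    # generation, relative to the baseline, matches the description.
    return steered != baseline

feature_dir = torch.randn(d_model)
desc = "promotes weather-related words"    # candidate feature description
ok = judge(desc, generate(), generate(steer_vec=feature_dir, alpha=8.0))
print("description captures the steering effect:", ok)
```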
How can we interpret LLM features at scale? 🤔

Current pipelines describe features via their activating inputs, which is costly and ignores how features causally affect model outputs!
We propose efficient output-centric methods that better predict the steering effect of a feature.

New preprint led by @yoav.ml 🧵1/
January 28, 2025 at 7:34 PM
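One cheap output-centric signal, sketched with random stand-in weights (an illustration of the general idea, not necessarily the paper's exact method): project the feature direction through the unembedding and read off which tokens it promotes or suppresses, with no forward passes needed.

```python
import torch

d_model, vocab = 64, 1000
W_U = torch.randn(d_model, vocab)      # stand-in unembedding matrix
feature_dir = torch.randn(d_model)     # stand-in feature direction

logits = feature_dir @ W_U             # effect of the feature on each token
promoted = logits.topk(10).indices.tolist()
suppressed = (-logits).topk(10).indices.tolist()
print("promoted token ids:", promoted)
print("suppressed token ids:", suppressed)
```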
Most operation descriptions are plausible based on human judgment.
We also observe interesting operations implemented by heads, like the extension of time periods (day → month → year) and association of known figures with years relevant to their historical significance (9/10)
December 18, 2024 at 6:01 PM
(3) Smaller models tend to encode more relations in a single head

(4) In Llama-3.1 models, which use grouped-query attention, grouped heads often implement the same or similar relations (7/10)
December 18, 2024 at 5:59 PM
(1) Different models encode certain relations across attention heads to similar degrees

(2) Different heads implement the same relation to varying degrees, which has implications for localization and editing of LLMs (6/10)
December 18, 2024 at 5:58 PM
Experiments on 20 operations and 6 LLMs show that MAPS’s estimates strongly correlate with the heads’ outputs during inference

Ablating heads that implement an operation damages the model’s ability to perform tasks requiring that operation far more than removing other heads does (4/10)
December 18, 2024 at 5:57 PM
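A minimal sketch of the head-ablation comparison, with a single toy attention layer standing in for an LLM (all weights random; not the paper's setup): drop one head's contribution and measure how much the layer output changes.

```python
import torch

n_heads, d_head, d_model, seq = 4, 16, 64, 8
W_O = torch.randn(n_heads, d_head, d_model)    # per-head output projection
head_out = torch.randn(seq, n_heads, d_head)   # per-head attention outputs

def layer_output(ablate_head=None):
    out = torch.zeros(seq, d_model)
    for h in range(n_heads):
        if h == ablate_head:
            continue                            # knock this head out
        out += head_out[:, h] @ W_O[h]
    return out

# In the real experiment one would compare downstream task accuracy;
# here we just look at how much the layer's output moves.
delta = (layer_output() - layer_output(ablate_head=2)).norm().item()
print("effect of ablating head 2:", delta)
```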
MAPS infers the head’s functionality by examining different groups of mappings:

(A) Predefined relations: groups expressing certain relations (e.g. city of a country)

(B) Salient operations: groups for which the head induces the most prominent effect (3/10)
December 18, 2024 at 5:57 PM
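A hedged sketch of reading a head's token-to-token mapping straight from its parameters (an illustration in the spirit of the framework, not MAPS itself; all weights are random stand-ins): push token embeddings through the head's OV circuit and back out through the unembedding, then score a predefined relation by how highly each pair's target ranks.

```python
import torch

vocab, d_model, d_head = 1000, 64, 16
E = torch.randn(vocab, d_model)    # token embeddings
W_V = torch.randn(d_model, d_head) # head value projection
W_O = torch.randn(d_head, d_model) # head output projection
U = torch.randn(d_model, vocab)    # unembedding

# mapping[i, j]: how strongly attending to token i promotes token j.
mapping = E @ W_V @ W_O @ U

# Score a predefined relation (e.g. city of a country) via toy
# (source, target) token-id pairs: how highly does each target rank?
pairs = [(10, 42), (7, 99)]
for src, tgt in pairs:
    rank = int((mapping[src] >= mapping[src, tgt]).sum())
    print(f"target {tgt} ranks {rank}/{vocab} for source {src}")
```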
What's in an attention head? 🤯

We present an efficient framework – MAPS – for inferring the functionality of attention heads in LLMs ✨directly from their parameters✨

A new preprint with Amit Elhelo 🧵 (1/10)
December 18, 2024 at 5:55 PM
Post a photo of yourself from a different era
November 28, 2024 at 5:38 PM