FADE quantifies the causes of mismatch between features and their descriptions and highlights challenges of current methods, such as various failure modes, SAE features being harder to describe than MLP neurons, and how the interpretability of feature descriptions varies across layers.
July 16, 2025 at 1:26 PM
Building on these insights, we present a probe that tracks knowledge provenance during inference and shows where that knowledge is localized within the input prompt. Initial results are promising, with >94% ROC AUC and >84% localization accuracy.
4/4
May 26, 2025 at 4:01 PM
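For readers unfamiliar with probing, here is a minimal sketch of how a linear probe and its ROC AUC could be computed. It is not the probe from the thread: the data is synthetic, the labels stand in for "answer came from context" vs. "answer came from parameters", and the names `train_linear_probe` and `roc_auc` are hypothetical helpers written for this illustration.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of class 1
        grad = p - y                             # gradient of the log loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def roc_auc(scores, labels):
    """ROC AUC as the chance that a positive outscores a negative (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg)))

# Synthetic stand-in for model activations (assumption: real activations would
# come from a language model; here the two classes differ along one direction).
rng = np.random.default_rng(0)
n, d = 400, 16
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
X = rng.normal(size=(n, d)) + np.outer(labels - 0.5, direction)

# Train on the first 300 examples, evaluate on the held-out 100.
w, b = train_linear_probe(X[:300], labels[:300])
auc = roc_auc(X[300:] @ w + b, labels[300:])
print(f"held-out ROC AUC: {auc:.3f}")
```

ROC AUC is threshold-free, so it measures how well the probe ranks the two classes rather than how well a particular cutoff separates them.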
We analyze how in-context heads can specialize to understand instructions (task heads) and retrieve relevant information (retrieval heads). We then investigate the causal roles of these heads, together with parametric heads, by extracting function vectors or modifying their weights.
3/
May 26, 2025 at 4:01 PM
Using interpretability tools, we discover that heads important for RAG fall into two categories: parametric heads that encode relational knowledge and in-context heads that are responsible for processing information in the prompt.
2/
May 26, 2025 at 4:01 PM