Encountered this funny SAE feature again (MLP 22.6656), found in our research on interpreting features in LLMs: aclanthology.org/2025.acl-lo...
Check out our paper and interactive blog post for more on this!
📄 arxiv.org/abs/2510.06182
🌐 yoav.ml/blog/2025/m...
Before that, the orange spikes at the first and last positions hint at a primordial mechanism that can only track which entities come first or last in context.
Check out our *interactive blog post* to see how these mechanisms shape LM outputs 👇
🌐 yoav.ml/blog/2025/m...
📄 arxiv.org/abs/2510.06182
🤗 huggingface.co/papers/2510...
💻 github.com/yoavgur/mix...
Understanding these mechanisms helps explain both the strengths and limits of LLMs, and how they reason in context. 8/
We model the positional term as a Gaussian whose standard deviation shifts with position, and the other two as one-hot distributions with position-dependent weights. 6/
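For intuition only, here's a tiny numpy sketch of such a mixture over candidate source positions. The std schedule, the fixed mixing weights, and all function names are made-up placeholders for illustration, not the quantities fitted in the paper.

```python
import numpy as np

def positional_term(q, n, base_std=0.5, growth=0.3):
    """Gaussian around query position q; sharper near the context edges, wider mid-context."""
    positions = np.arange(n)
    std = base_std + growth * min(q, n - 1 - q)  # distance from the nearest edge widens the Gaussian
    scores = np.exp(-0.5 * ((positions - q) / std) ** 2)
    return scores / scores.sum()

def one_hot(idx, n):
    v = np.zeros(n)
    v[idx] = 1.0
    return v

def mixed_retrieval(q, lexical_idx, reflexive_idx, n, weights=(0.4, 0.4, 0.2)):
    """Mix positional, lexical, and reflexive terms (weights fixed here; position-based in the paper)."""
    w_pos, w_lex, w_ref = weights
    return (w_pos * positional_term(q, n)
            + w_lex * one_hot(lexical_idx, n)
            + w_ref * one_hot(reflexive_idx, n))
```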
Across all models, we find a remarkably consistent reliance on these three mechanisms, and consistent patterns in how they interact. 5/
This happens through a self-referential pointer originating from the "Holly" token and pointing back to it. This pointer gets copied to the "Michael" token, binding the two entities together. 4/
The first is *lexical*, where the LM retrieves the subject next to "Michael". It does this by copying the lexical contents of "Holly" to "Michael", binding them together. 3/
We show this isn’t sufficient—the positional signal is strong at the edges of context but weak and diffuse in the middle. 2/
When reading “Holly loves Michael, Jim loves Pam”, the model must bind Holly↔Michael to answer “Who loves Michael?”
We show that this binding relies on three mechanisms. 1/
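As a rough illustration of the setup (not the paper's evaluation code; the model name and prompt are placeholders), one can probe which subject a causal LM prefers to retrieve:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works for this toy probe
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Holly loves Michael, Jim loves Pam. Who loves Michael? Answer:"
with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]

# Compare the logit of the correctly bound subject against the distractor.
for name in [" Holly", " Jim"]:
    token_id = tok(name, add_special_tokens=False).input_ids[0]
    print(name.strip(), logits[token_id].item())
```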
Joint work with Clara Suslik, Yihuai Hong, and @fbarez.bsky.social, advised by @megamor2.bsky.social
🔗 Paper: arxiv.org/abs/2505.22586
🔗 Code: github.com/yoavgur/PISCES
🪝𝐏𝐈𝐒𝐂𝐄𝐒 resists relearning far better than prior methods; with the others, the concept is often fully recovered! 5/
You can erase “Harry Potter” and still do fine on Lord of the Rings and Star Wars! 4/
✅ Achieves much higher specificity and robustness
✅ Maintains low retained accuracy (as low as or lower than other methods!)
✅ Preserves coherence and general capabilities 3/
1️⃣ Disentangling model parameters into interpretable features (implemented using SAEs)
2️⃣ Identifying those that encode a target concept
3️⃣ Precisely ablating them and reconstructing the weights
No need for fine-tuning, retain sets, or enumerating facts; a rough sketch of the idea is below. 2/
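A minimal linear-algebra sketch of that pipeline, assuming you already have SAE decoder directions and the indices of concept features. Shapes, names, and the exact ablation rule here are simplifications for illustration, not the actual PISCES implementation:

```python
import numpy as np

def ablate_concept_features(W, decoder, concept_ids):
    """Project selected SAE feature directions out of a weight matrix.

    W           : (rows, d_model) weight matrix whose rows live in residual-stream space
    decoder     : (n_features, d_model) SAE decoder directions
    concept_ids : indices of features identified as encoding the target concept
    """
    W_new = W.copy()
    for i in concept_ids:
        d = decoder[i] / np.linalg.norm(decoder[i])  # unit feature direction
        W_new -= np.outer(W_new @ d, d)              # remove the component along that direction
    return W_new
```

The reconstructed matrix then replaces the original weights, so no fine-tuning or retain set is involved.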
Meet our method, 🪝𝐏𝐈𝐒𝐂𝐄𝐒!