Yoav Gur Arieh
yoav.ml
You got Covid, Zoom, Fauci, Moderna, Trump/Biden, Tiktok, lockdowns, quarantines, fentanyl, the metaverse, and NFTs! Really gives you PTSD...

Encountered this funny SAE feature again (MLP 22.6656), found in our research on interpreting features in LLMs: aclanthology.org/2025.acl-lo...
Enhancing Automated Interpretability with Output-Centric Feature Descriptions
Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, Mor Geva. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.
aclanthology.org
October 29, 2025 at 9:15 PM
Lots more to explore! E.g. what the primordial mechanism is, and what changes between these mechanisms' emergence (500B) and the point where the model reaches perfect accuracy (3000B).

Check out our paper and interactive blog post for more on this!
📄 arxiv.org/abs/2510.06182
🌐 yoav.ml/blog/2025/m...
Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context
A key component of in-context reasoning is the ability of language models (LMs) to bind entities for later retrieval. For example, an LM might represent "Ann loves pie" by binding "Ann" to "pie",...
arxiv.org
October 21, 2025 at 7:40 PM
These mechanisms start emerging after ~200B tokens of training, right when accuracy on binding tasks starts to rise.

Before that, the orange spikes at the first and last positions suggest a possible primordial mechanism that can only track which entities come first or last in context.
October 21, 2025 at 7:40 PM
This was joint work with the amazing @megamor2.bsky.social and Atticus Geiger.

Check out our *interactive blog post* to see how these mechanisms shape LM outputs 👇

🌐 yoav.ml/blog/2025/m...
📄 arxiv.org/abs/2510.06182
🤗 huggingface.co/papers/2510...
💻 github.com/yoavgur/mix...
GitHub - yoavgur/mixing-mechs: Official code for "Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context"
github.com
October 8, 2025 at 2:56 PM
Overall, we show that LMs retrieve entities not through a single positional mechanism, but through a mixture of three: positional, lexical, and reflexive.

Understanding these mechanisms helps explain both the strengths and limits of LLMs, and how they reason in context. 8/
October 8, 2025 at 2:56 PM
Finally, we evaluate our model on more natural and increasingly long tasks, showing that the ‘lost-in-the-middle’ effect might be explained mechanistically by a weakening lexical signal alongside an increasingly noisy positional one. 7/
October 8, 2025 at 2:56 PM
We leverage these insights to build a causal model combining all three mechanisms, which predicts the LM's next-token distributions with 95% agreement.

We model the positional term as a Gaussian with a shifting standard deviation, and the other two as one-hot distributions with position-based weights. 6/
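For intuition, here is a minimal numpy sketch of that kind of mixture. It is not the paper's fitted model: the weights and the std schedule below are made-up placeholders, and in this toy setup the lexical and reflexive terms both point at the queried entity.

```python
import numpy as np

def mixture_prediction(n_entities, query_idx, w_pos=0.5, w_lex=0.3, w_ref=0.2,
                       base_std=0.5, std_growth=0.3):
    """Toy mixture over candidate entities for a binding query.

    Combines the three components described in the thread:
      * positional - Gaussian over entity positions, centred on the queried
                     position, with a std that widens deeper into context
      * lexical    - one-hot on the entity bound via lexical copying
      * reflexive  - one-hot on the entity bound via the reflexive pointer
    Weights and the std schedule are illustrative placeholders only.
    """
    positions = np.arange(n_entities)

    # Positional term: sharp at the edges of context, diffuse in the middle,
    # because the std grows with the queried position.
    std = base_std + std_growth * query_idx
    positional = np.exp(-0.5 * ((positions - query_idx) / std) ** 2)
    positional /= positional.sum()

    # Lexical and reflexive terms: one-hot on the correct entity.
    lexical = np.eye(n_entities)[query_idx]
    reflexive = np.eye(n_entities)[query_idx]

    mix = w_pos * positional + w_lex * lexical + w_ref * reflexive
    return mix / mix.sum()

# Example: 5 bound pairs in context, query the 3rd one.
print(mixture_prediction(n_entities=5, query_idx=2).round(3))
```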
October 8, 2025 at 2:56 PM
We show this through extensive use of interchange interventions, evaluating over 10 binding tasks and 9 models (Gemma/Qwen/Llama 2B-72B params).

Across all models, we find a remarkably consistent reliance on these three specific mechanisms and how they interact. 5/
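For readers unfamiliar with interchange interventions: run the model on a base prompt while patching in an activation cached from a counterfactual source prompt, and see how the answer shifts. A minimal sketch with PyTorch forward hooks, using gpt2 only because it is small; the prompts, layer, and position here are arbitrary choices, not the paper's setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

base   = "Holly loves Michael, Jim loves Pam. Who loves Michael? Answer:"
source = "Jim loves Michael, Holly loves Pam. Who loves Michael? Answer:"
layer, pos = 6, -1  # arbitrary intervention site: block 6 output, last token

def hidden_at(prompt):
    """Cache the residual-stream state at (layer, pos) for a prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block i's output is index i+1
    return out.hidden_states[layer + 1][0, pos]

source_vec = hidden_at(source)

def patch(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # overwrite the chosen position with the cached source activation.
    hidden = output[0].clone()
    hidden[0, pos] = source_vec
    return (hidden,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(patch)
ids = tok(base, return_tensors="pt").input_ids
with torch.no_grad():
    patched_logits = model(ids).logits[0, -1]
handle.remove()

# Compare the (first-token) logits of the two candidate answers after patching.
for name in [" Holly", " Jim"]:
    first_id = tok(name, add_special_tokens=False).input_ids[0]
    print(name, patched_logits[first_id].item())
```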
October 8, 2025 at 2:56 PM
Then we have the *reflexive* mechanism, which retrieves exactly the token "Holly".

This happens through a self-referential pointer originating from the "Holly" token and pointing back to it. This pointer gets copied to the "Michael" token, binding the two entities together. 4/
October 8, 2025 at 2:56 PM
To compensate for this, LMs use two additional mechanisms.

The first is *lexical*, where the LM retrieves the subject next to "Michael". It does this by copying the lexical contents of "Holly" to "Michael", binding them together. 3/
October 8, 2025 at 2:56 PM
Prior work identified only a positional mechanism, where the model tracks entities by position: here, retrieving the subject of the first clause, "Holly".

We show this isn't sufficient: the positional signal is strong at the edges of the context but weak and diffuse in the middle. 2/
October 8, 2025 at 2:56 PM
A key part of in-context reasoning is the ability to bind entities for tracking and retrieval.

When reading “Holly loves Michael, Jim loves Pam”, the model must bind Holly↔Michael to answer “Who loves Michael?”

We show that this binding relies on three mechanisms. 1/
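To make the task concrete, here is a toy generator for binding-style prompts like the one above. The names and the "X loves Y" template are placeholders for illustration, not the paper's actual task suite.

```python
import random

NAMES = ["Holly", "Michael", "Jim", "Pam", "Dwight", "Angela", "Oscar", "Kevin"]

def make_binding_example(n_pairs=2, seed=0):
    """Build an 'X loves Y' binding prompt plus the subject the model should retrieve."""
    rng = random.Random(seed)
    people = rng.sample(NAMES, 2 * n_pairs)
    pairs = [(people[2 * i], people[2 * i + 1]) for i in range(n_pairs)]
    context = ", ".join(f"{a} loves {b}" for a, b in pairs)
    subj, obj = rng.choice(pairs)        # query one of the bound pairs
    prompt = f"{context}. Who loves {obj}? Answer:"
    return prompt, subj                  # correct answer = the bound subject

prompt, gold = make_binding_example(n_pairs=2)
print(prompt)  # something like: "Holly loves Michael, Jim loves Pam. Who loves Pam? Answer:"
print(gold)
```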
October 8, 2025 at 2:56 PM
This is a step toward targeted, interpretable, and robust knowledge removal — at the parameter level.

Joint work with Clara Suslik, Yihuai Hong, and @fbarez.bsky.social, advised by @megamor2.bsky.social
🔗 Paper: arxiv.org/abs/2505.22586
🔗 Code: github.com/yoavgur/PISCES
May 29, 2025 at 4:22 PM
We also check robustness to relearning: can the model relearn the erased concept from data that is related to, but doesn't overlap with, the eval questions?

🪝𝐏𝐈𝐒𝐂𝐄𝐒 resists relearning far better than prior methods, which often fully recover the concept! 5/
May 29, 2025 at 4:22 PM
Our specificity evaluation includes similar-domain accuracy, a stricter test than others use, where 🪝𝐏𝐈𝐒𝐂𝐄𝐒 outperforms all other methods.

You can erase “Harry Potter” and still do fine on Lord of the Rings and Star Wars! 4/
May 29, 2025 at 4:22 PM
We show that 🪝𝐏𝐈𝐒𝐂𝐄𝐒:
✅ Achieves much higher specificity and robustness
✅ Maintains low retained accuracy (as low as or lower than other methods!)
✅ Preserves coherence and general capabilities 3/
May 29, 2025 at 4:22 PM
🪝𝐏𝐈𝐒𝐂𝐄𝐒 works by:
1️⃣ Disentangling model parameters into interpretable features (implemented using SAEs)
2️⃣ Identifying those that encode a target concept
3️⃣ Precisely ablating them and reconstructing the weights

No need for fine-tuning, retain sets, or enumerating facts. 2/
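A heavily simplified numpy sketch of one plausible reading of those three steps, with random matrices standing in for the model weights and SAE decoder; the linked repo (github.com/yoavgur/PISCES) has the actual implementation.

```python
# Assumption: "disentangling parameters" is read here as expressing weight
# vectors in an SAE's decoder basis. W, decoder, and concept_idx are random
# placeholders, not loaded from any real model or SAE.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_neurons, n_features = 64, 256, 512

W = rng.normal(size=(n_neurons, d_model))          # stand-in weight matrix (rows write into the residual stream)
decoder = rng.normal(size=(n_features, d_model))   # stand-in SAE decoder directions
decoder /= np.linalg.norm(decoder, axis=1, keepdims=True)

# 1) Express each weight vector in the SAE feature basis (least-squares codes).
codes, *_ = np.linalg.lstsq(decoder.T, W.T, rcond=None)   # (n_features, n_neurons)

# 2) Pick the features that encode the target concept (chosen arbitrarily here;
#    in practice this would come from feature descriptions / activations).
concept_idx = np.array([3, 17, 42])

# 3) Ablate those features and reconstruct the weights.
codes[concept_idx, :] = 0.0
W_edited = (decoder.T @ codes).T                   # back to (n_neurons, d_model)

print("change in weight norm:", np.linalg.norm(W - W_edited))
```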
May 29, 2025 at 4:22 PM
Large language models excel at storing knowledge, but not all of it is safe or useful - e.g. chatbots for kids shouldn’t discuss guns or gambling. How can we selectively remove inappropriate conceptual knowledge while preserving utility?

Meet our method, 🪝𝐏𝐈𝐒𝐂𝐄𝐒!
May 29, 2025 at 4:22 PM