Yoav Gur Arieh
@yoav.ml
I think I found the latent direction in Gemma (an SAE feature) that represents the pandemic era...

Interpreted by projecting the vector to vocabulary space, yielding a list of tokens associated with it
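A minimal sketch of that kind of vocabulary projection ("logit lens" style), assuming a Gemma checkpoint and a placeholder vector standing in for the SAE feature's decoder direction:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"  # assumed checkpoint, not necessarily the one used
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stand-in for the SAE feature's decoder vector (a real analysis would load it from the SAE)
direction = torch.randn(model.config.hidden_size)

# Project the direction through the unembedding matrix and read off the top tokens
vocab_logits = model.get_output_embeddings().weight @ direction  # [vocab_size]
top_ids = torch.topk(vocab_logits, k=20).indices
print(tok.convert_ids_to_tokens(top_ids.tolist()))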
October 29, 2025 at 9:15 PM
These mechanisms start emerging after ~200B tokens of training, right when accuracy on binding tasks starts to rise.

Before that, the orange spikes for the first and last positions hint at a primordial mechanism that can only track which entities come first or last in the context.
October 21, 2025 at 7:40 PM
Two weeks ago I posted about our recent paper, which shows that to bind entities, LMs use three mechanisms: positional, lexical and reflexive.

We were curious how these mechanisms develop throughout training, so we evaluated their existence across OLMo checkpoints 👇
October 21, 2025 at 7:40 PM
Finally, we evaluate our model on more natural and increasingly long tasks, showing that the ‘lost-in-the-middle’ effect might be explained mechanistically by a weakening lexical signal alongside an increasingly noisy positional one. 7/
October 8, 2025 at 2:56 PM
We leverage these insights to build a causal model combining all three mechanisms, predicting next-token distributions with 95% agreement.

We model the positional term as a Gaussian with shifting std, and the other two as one-hot distributions with position-based weights. 6/
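For concreteness, one rough way to write that mixture down (the parameter names, the reflexive one-hot target, and the example weights are assumptions, not the paper's exact parameterization):

import numpy as np

def binding_distribution(n_entities, pos_idx, lex_idx, refl_idx,
                         w_pos, w_lex, w_refl, sigma):
    # Positional term: Gaussian over entity slots; sigma widens for middle-of-context queries
    slots = np.arange(n_entities)
    positional = np.exp(-0.5 * ((slots - pos_idx) / sigma) ** 2)
    positional /= positional.sum()
    # Lexical and reflexive terms: one-hot distributions on their target entities
    lexical = np.eye(n_entities)[lex_idx]
    reflexive = np.eye(n_entities)[refl_idx]
    # Position-based weights mix the three signals into one distribution over entities
    mix = w_pos * positional + w_lex * lexical + w_refl * reflexive
    return mix / mix.sum()

# e.g. a query deep in context: wide Gaussian (noisy positional signal)
print(binding_distribution(8, pos_idx=3, lex_idx=3, refl_idx=5,
                           w_pos=0.4, w_lex=0.5, w_refl=0.1, sigma=2.0))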
October 8, 2025 at 2:56 PM
We show this through extensive use of interchange interventions, evaluating across 10 binding tasks and 9 models (Gemma/Qwen/Llama, 2B–72B params).

Across all models, we find a remarkably consistent reliance on these three specific mechanisms and how they interact. 5/
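For readers unfamiliar with the method: an interchange intervention swaps an internal activation from a "source" run into a "base" run and checks how the prediction changes. A bare-bones PyTorch-hook sketch, assuming a Llama/Gemma-style decoder stack (model.model.layers) and placeholder layer/position choices:

import torch

def interchange(model, base_ids, source_ids, layer, position):
    cache = {}

    def hidden(output):
        # Decoder layers return either a tensor or a tuple whose first item is the hidden states
        return output[0] if isinstance(output, tuple) else output

    def save_hook(module, inputs, output):
        cache["h"] = hidden(output)[:, position].detach().clone()

    def patch_hook(module, inputs, output):
        hidden(output)[:, position] = cache["h"]  # overwrite the activation in place
        return output

    # 1) Run the source prompt and cache the residual stream at (layer, position)
    handle = model.model.layers[layer].register_forward_hook(save_hook)
    with torch.no_grad():
        model(source_ids)
    handle.remove()

    # 2) Re-run the base prompt with that activation patched in, and return the logits
    handle = model.model.layers[layer].register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(base_ids).logits
    handle.remove()
    return patched_logits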
October 8, 2025 at 2:56 PM
To compensate for this, LMs use two additional mechanisms.

The first is *lexical*, where the LM retrieves the subject that appears next to "Michael". It does this by copying the lexical content of "Holly" onto "Michael", binding them together. 3/
October 8, 2025 at 2:56 PM
Prior work identified only a positional mechanism, where the model tracks entities by position: here, retrieving the subject of the first clause, "Holly".

We show this isn’t sufficient—the positional signal is strong at the edges of context but weak and diffuse in the middle. 2/
October 8, 2025 at 2:56 PM
A key part of in-context reasoning is the ability to bind entities for tracking and retrieval.

When reading “Holly loves Michael, Jim loves Pam”, the model must bind Holly↔Michael to answer “Who loves Michael?”

We show that this binding relies on three mechanisms. 1/
October 8, 2025 at 2:56 PM
🧠 To reason over text and track entities, we find that language models use three types of 'pointers'!

They were thought to rely only on a positional one—but when many entities appear, that system breaks down.

Our new paper shows what these pointers are and how they interact 👇
October 8, 2025 at 2:56 PM
This is a step toward targeted, interpretable, and robust knowledge removal — at the parameter level.

Joint work with Clara Suslik, Yihuai Hong, and @fbarez.bsky.social, advised by @megamor2.bsky.social
🔗 Paper: arxiv.org/abs/2505.22586
🔗 Code: github.com/yoavgur/PISCES
May 29, 2025 at 4:22 PM
We show that 🪝𝐏𝐈𝐒𝐂𝐄𝐒:
✅ Achieves much higher specificity and robustness
✅ Maintains low retained accuracy (as low as or lower than other methods!)
✅ Preserves coherence and general capabilities 3/
May 29, 2025 at 4:22 PM
🪝𝐏𝐈𝐒𝐂𝐄𝐒 works by:
1️⃣ Disentangling model parameters into interpretable features (implemented using SAEs)
2️⃣ Identifying those that encode a target concept
3️⃣ Precisely ablating them and reconstructing the weights

No need for fine-tuning, retain sets, or enumerating facts. 2/
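A very rough sketch of the "ablate feature directions from the weights and reconstruct" idea, assuming a weight matrix of shape [d_model, d_in], an SAE decoder of shape [n_features, d_model], and roughly orthonormal decoder rows; the actual PISCES procedure in the paper is more involved:

import torch
import torch.nn.functional as F

def erase_concept(W, sae_decoder, concept_feature_ids):
    # Decoder directions of the features identified as encoding the target concept
    D = F.normalize(sae_decoder[concept_feature_ids], dim=-1)  # [k, d_model]
    # How strongly each weight column writes along each concept feature
    coeffs = D @ W                                             # [k, d_in]
    # Subtract those components, i.e. reconstruct the weights without the concept features
    return W - D.T @ coeffs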
May 29, 2025 at 4:22 PM
New Paper Alert! Can we precisely erase conceptual knowledge from LLM parameters?
Most methods are shallow, coarse, or overreaching, adversely affecting related or general knowledge.

We introduce 🪝𝐏𝐈𝐒𝐂𝐄𝐒 — a general framework for Precise In-parameter Concept EraSure. 🧵 1/
May 29, 2025 at 4:22 PM