Interpreted by projecting the vector to vocabulary space, yielding a list of tokens associated with it
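For the curious, here's a minimal sketch of that kind of vocabulary-space projection (often called the logit lens), assuming a Hugging Face GPT-2-style model. The layer, position, and variable names are our own illustrative choices, not the paper's:

```python
# Minimal logit-lens-style readout: project a residual-stream vector
# through the final LayerNorm + unembedding to get vocabulary logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

inputs = tok("Holly loves Michael, Jim loves Pam", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Hidden vector at an (arbitrarily chosen) mid layer, last token position.
vec = out.hidden_states[8][0, -1]

# Project into vocabulary space and read off the top associated tokens.
logits = model.lm_head(model.transformer.ln_f(vec))
print(tok.convert_ids_to_tokens(logits.topk(5).indices.tolist()))
```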
Before that, the orange spikes at the first and last positions suggest a more primordial mechanism that can only track which entities come first or last in the context.
We were curious how these mechanisms develop over training, so we tested for their presence across OLMo checkpoints 👇
We model the positional term as a Gaussian whose standard deviation shifts with position, and the other two as one-hot distributions with position-based weights. 6/
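Roughly, the combined score could look like the sketch below. This is just our illustrative reading: sigma and the mixing weights are fixed scalars here, whereas the model above lets them shift with position:

```python
# Illustrative three-term score: one Gaussian positional term plus two
# one-hot terms. All names and values here are ours, not the paper's.
import numpy as np

def binding_scores(n_pos, center, sigma, onehot_a, onehot_b, w):
    """center/sigma: mean and std of the Gaussian positional term
    (the paper lets sigma shift with position; fixed here for brevity).
    onehot_a / onehot_b: positions of the two one-hot terms.
    w: mixing weights, position-based in the paper, scalars here."""
    positions = np.arange(n_pos)
    gauss = np.exp(-0.5 * ((positions - center) / sigma) ** 2)
    gauss /= gauss.sum()  # normalize to a distribution
    return (w[0] * gauss
            + w[1] * np.eye(n_pos)[onehot_a]
            + w[2] * np.eye(n_pos)[onehot_b])

print(binding_scores(10, center=3, sigma=1.5, onehot_a=3, onehot_b=0,
                     w=(0.5, 0.3, 0.2)).round(3))
```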
Across all models, we find remarkably consistent reliance on these three mechanisms, and consistent patterns in how they interact. 5/
The first is *lexical*: the LM retrieves the subject that appears next to "Michael". It does this by copying the lexical contents of "Holly" onto "Michael", binding them together. 3/
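As a toy illustration of what that copying could look like (made-up random embeddings; nothing here is from the paper's code):

```python
# Toy lexical binding: the "Michael" position carries a copied, scaled
# version of "Holly"'s lexical features, so a readout at "Michael" can
# recover "Holly" by matching the leftover signal against the lexicon.
import numpy as np

rng = np.random.default_rng(0)
lex = {name: rng.standard_normal(64)
       for name in ["Holly", "Michael", "Jim", "Pam"]}

# Residual stream at "Michael": its own features + a copy of "Holly"'s.
michael_resid = lex["Michael"] + 0.5 * lex["Holly"]

# Subtract Michael's own contribution, then match the residue.
residue = michael_resid - lex["Michael"]
scores = {name: residue @ vec for name, vec in lex.items()
          if name != "Michael"}
print(max(scores, key=scores.get))  # -> "Holly"
```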
We show this isn't sufficient: the positional signal is strong at the edges of the context but weak and diffuse in the middle. 2/
When reading “Holly loves Michael, Jim loves Pam”, the model must bind Holly↔Michael to answer “Who loves Michael?”
We show that this binding relies on three mechanisms. 1/
Binding was thought to rely only on a positional mechanism, but when many entities appear, that system breaks down.
Our new paper shows what these binding pointers are and how they interact 👇
Joint work with Clara Suslik, Yihuai Hong, and @fbarez.bsky.social, advised by @megamor2.bsky.social
🔗 Paper: arxiv.org/abs/2505.22586
🔗 Code: github.com/yoavgur/PISCES
✅ Achieves much higher specificity and robustness
✅ Maintains low retained accuracy (as low as or lower than other methods!)
✅ Preserves coherence and general capabilities 3/
PISCES erases a concept in three steps:
1️⃣ Disentangling model parameters into interpretable features (implemented using SAEs)
2️⃣ Identifying those that encode a target concept
3️⃣ Precisely ablating them and reconstructing the weights
No need for fine-tuning, retain sets, or enumerating facts. Rough sketch below 👇 2/
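For intuition, here is our hypothetical sketch of that recipe; the TinySAE stub and erase_concept function are illustrative only, so see github.com/yoavgur/PISCES for the actual implementation:

```python
# Sketch of in-parameter concept erasure via an SAE feature basis.
# All names here (TinySAE, erase_concept, ...) are ours, not the repo's.
import torch

class TinySAE:
    """Minimal tied-weights sparse autoencoder stub (illustrative only)."""
    def __init__(self, d_model, n_features):
        self.W = torch.randn(n_features, d_model) / d_model ** 0.5
    def encode(self, x):   # (..., d_model) -> (..., n_features)
        return torch.relu(x @ self.W.T)
    def decode(self, f):   # (..., n_features) -> (..., d_model)
        return f @ self.W

def erase_concept(W_out, sae, concept_feature_ids):
    """Ablate a concept from an MLP output matrix W_out (d_model, d_mlp).

    concept_feature_ids plays the role of step 2: indices of SAE features
    identified (elsewhere) as encoding the target concept.
    """
    cols = W_out.T                        # one vector per MLP neuron
    feats = sae.encode(cols)              # step 1: disentangle into features
    recon_err = cols - sae.decode(feats)  # keep SAE reconstruction error
    feats[:, concept_feature_ids] = 0.0   # step 3: ablate concept features
    return (sae.decode(feats) + recon_err).T  # ...and reconstruct weights

sae = TinySAE(d_model=16, n_features=64)
W = torch.randn(16, 48)
W_erased = erase_concept(W, sae, concept_feature_ids=[3, 17])
# The edit is confined to the ablated feature directions.
print((W - W_erased).abs().max())
```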
Most concept-erasure methods are shallow, coarse, or overreaching, adversely affecting related or general knowledge.
We introduce 🪝𝐏𝐈𝐒𝐂𝐄𝐒, a general framework for Precise In-parameter Concept EraSure. 🧵 1/