Interpreted by projecting the vector to vocabulary space, yielding a list of tokens associated with it
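For the curious, here's a minimal sketch of that kind of vocabulary-space projection (often called the logit lens), assuming a Hugging Face GPT-2-style model. The layer, position, and variable names are our own illustrative choices, not the paper's:

```python
# Minimal logit-lens-style readout: project a residual-stream vector
# through the final LayerNorm + unembedding to get vocabulary logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

inputs = tok("Holly loves Michael, Jim loves Pam", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Hidden vector at an (arbitrarily chosen) mid layer, last token position.
vec = out.hidden_states[8][0, -1]

# Project into vocabulary space and read off the top associated tokens.
logits = model.lm_head(model.transformer.ln_f(vec))
print(tok.convert_ids_to_tokens(logits.topk(5).indices.tolist()))
```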
Before that, the orange spikes at the first and last positions suggest a more primordial mechanism that can only track which entities come first or last in the context.
We were curious how these mechanisms develop over training, so we tested for their presence across OLMo checkpoints 👇
We model the positional term as a Gaussian whose standard deviation shifts with position, and the other two as one-hot distributions with position-based weights. 6/
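Roughly, the combined score could look like the sketch below. This is just our illustrative reading: sigma and the mixing weights are fixed scalars here, whereas the model above lets them shift with position:

```python
# Illustrative three-term score: one Gaussian positional term plus two
# one-hot terms. All names and values here are ours, not the paper's.
import numpy as np

def binding_scores(n_pos, center, sigma, onehot_a, onehot_b, w):
    """center/sigma: mean and std of the Gaussian positional term
    (the paper lets sigma shift with position; fixed here for brevity).
    onehot_a / onehot_b: positions of the two one-hot terms.
    w: mixing weights, position-based in the paper, scalars here."""
    positions = np.arange(n_pos)
    gauss = np.exp(-0.5 * ((positions - center) / sigma) ** 2)
    gauss /= gauss.sum()  # normalize to a distribution
    return (w[0] * gauss
            + w[1] * np.eye(n_pos)[onehot_a]
            + w[2] * np.eye(n_pos)[onehot_b])

print(binding_scores(10, center=3, sigma=1.5, onehot_a=3, onehot_b=0,
                     w=(0.5, 0.3, 0.2)).round(3))
```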
Across all models, we find remarkably consistent reliance on these three mechanisms, and consistent patterns in how they interact. 5/
The first is *lexical*: the LM retrieves the subject that appears next to "Michael". It does this by copying the lexical contents of "Holly" onto "Michael", binding them together. 3/
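As a toy illustration of what that copying could look like (made-up random embeddings; nothing here is from the paper's code):

```python
# Toy lexical binding: the "Michael" position carries a copied, scaled
# version of "Holly"'s lexical features, so a readout at "Michael" can
# recover "Holly" by matching the leftover signal against the lexicon.
import numpy as np

rng = np.random.default_rng(0)
lex = {name: rng.standard_normal(64)
       for name in ["Holly", "Michael", "Jim", "Pam"]}

# Residual stream at "Michael": its own features + a copy of "Holly"'s.
michael_resid = lex["Michael"] + 0.5 * lex["Holly"]

# Subtract Michael's own contribution, then match the residue.
residue = michael_resid - lex["Michael"]
scores = {name: residue @ vec for name, vec in lex.items()
          if name != "Michael"}
print(max(scores, key=scores.get))  # -> "Holly"
```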
We show this isn't sufficient: the positional signal is strong at the edges of the context but weak and diffuse in the middle. 2/
When reading “Holly loves Michael, Jim loves Pam”, the model must bind Holly↔Michael to answer “Who loves Michael?”
We show that this binding relies on three mechanisms. 1/
Binding was thought to rely only on a positional mechanism, but when many entities appear, that system breaks down.
Our new paper shows what these binding pointers are and how they interact 👇
Joint work with Clara Suslik, Yihuai Hong, and @fbarez.bsky.social, advised by @megamor2.bsky.social
🔗 Paper: arxiv.org/abs/2505.22586
🔗 Code: github.com/yoavgur/PISCES
✅ Achieves much higher specificity and robustness
✅ Maintains low retained accuracy (as low as or lower than other methods!)
✅ Preserves coherence and general capabilities 3/
PISCES erases a concept in three steps:
1️⃣ Disentangling model parameters into interpretable features (implemented using SAEs)
2️⃣ Identifying those that encode a target concept
3️⃣ Precisely ablating them and reconstructing the weights
No need for fine-tuning, retain sets, or enumerating facts. Rough sketch below 👇 2/
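For intuition, here is our hypothetical sketch of that recipe; the TinySAE stub and erase_concept function are illustrative only, so see github.com/yoavgur/PISCES for the actual implementation:

```python
# Sketch of in-parameter concept erasure via an SAE feature basis.
# All names here (TinySAE, erase_concept, ...) are ours, not the repo's.
import torch

class TinySAE:
    """Minimal tied-weights sparse autoencoder stub (illustrative only)."""
    def __init__(self, d_model, n_features):
        self.W = torch.randn(n_features, d_model) / d_model ** 0.5
    def encode(self, x):   # (..., d_model) -> (..., n_features)
        return torch.relu(x @ self.W.T)
    def decode(self, f):   # (..., n_features) -> (..., d_model)
        return f @ self.W

def erase_concept(W_out, sae, concept_feature_ids):
    """Ablate a concept from an MLP output matrix W_out (d_model, d_mlp).

    concept_feature_ids plays the role of step 2: indices of SAE features
    identified (elsewhere) as encoding the target concept.
    """
    cols = W_out.T                        # one vector per MLP neuron
    feats = sae.encode(cols)              # step 1: disentangle into features
    recon_err = cols - sae.decode(feats)  # keep SAE reconstruction error
    feats[:, concept_feature_ids] = 0.0   # step 3: ablate concept features
    return (sae.decode(feats) + recon_err).T  # ...and reconstruct weights

sae = TinySAE(d_model=16, n_features=64)
W = torch.randn(16, 48)
W_erased = erase_concept(W, sae, concept_feature_ids=[3, 17])
# The edit is confined to the ablated feature directions.
print((W - W_erased).abs().max())
```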
Most concept-erasure methods are shallow, coarse, or overreaching, adversely affecting related or general knowledge.
We introduce 🪝𝐏𝐈𝐒𝐂𝐄𝐒, a general framework for Precise In-parameter Concept EraSure. 🧵 1/