Prev research intern @ EPFL w/ wendlerc.bsky.social and Robert West
MATS Winter 7.0 Scholar w/ neelnanda.bsky.social
https://butanium.github.io
TL;DR: we show that your narrow finetuning is showing, and that narrow finetuning might not be a realistic setup to study!
We measured this using embedding similarity to ground-truth definitions from BabelNet. This shows that the mean representations are meaningful and can be reused in other tasks.
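The similarity measurement can be sketched as a cosine comparison between a mean representation and candidate definition embeddings. Everything below is synthetic stand-in data (assumed names and dimensions), not the real BabelNet embeddings:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
d = 32
# Hypothetical embeddings: a concept's ground-truth definition, the model's
# mean representation of that concept (correlated with it), and an
# unrelated definition as a distractor.
true_def = rng.normal(size=d)
mean_rep = true_def + 0.3 * rng.normal(size=d)
other_def = rng.normal(size=d)

# The mean representation should match its own definition more closely.
print(cosine(mean_rep, true_def) > cosine(mean_rep, other_def))  # True
```

On real data one would embed all BabelNet definitions and check that the ground-truth one ranks highest under this score.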
40% of robust chat-specific latents primarily activate on these structural tokens.
The "special sauce" of chat models may be in how they use these tokens!
💬 False information detection
❓ Knowledge boundaries recognition
🤔 Personal experience questions
⚠️ Refusal mechanisms
📝 Summarization requests
🃏 Joke detection
...and many more!
Key finding: In BatchTopK, the norm metric reliably identifies causally important latents. With L1 crosscoders, you need our Latent Scaling technique.
While BatchTopK lacks the neat trimodal distribution of norms seen in L1, it avoids both Complete Shrinkage and Latent Decoupling issues.
Result: Many more genuinely chat-specific latents!
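The BatchTopK activation can be sketched in a few lines: instead of keeping the top k latents per example, it keeps the k × batch_size largest pre-activations across the whole batch, so the sparsity budget can shift between examples. A minimal illustrative sketch (not the actual training code):

```python
import numpy as np

def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest activations across the WHOLE batch
    (not per example); zero out everything else."""
    n_keep = k * pre_acts.shape[0]
    flat = pre_acts.ravel()
    thresh = np.partition(flat, -n_keep)[-n_keep]  # n_keep-th largest value
    return np.where(pre_acts >= thresh, pre_acts, 0.0)

# Toy batch of 2 examples x 4 latents; with k=2 we keep 4 activations total.
pre = np.array([[0.9, 0.1, 0.0, 0.05],
                [0.8, 0.7, 0.2, 0.6]])
out = batch_topk(pre, k=2)
print(out)  # row 0 keeps 1 activation, row 1 keeps 3: the budget is shared
```

Note how the per-example sparsity flexes (1 vs. 3 here) while the batch-level total stays fixed, which is what distinguishes BatchTopK from per-example TopK.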
- Many "chat-only" latents (blue) show high Shrinkage values
- Clear overlap between chat-only and shared latents (orange)
Most "chat-only" latents aren't actually chat-specific!
- Base model reconstruction error (catches Complete Shrinkage)
- Base reconstructed activation (catches Latent Decoupling)
This reveals which latents are genuinely chat-specific vs. false positives
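The core of Latent Scaling can be sketched as a per-latent least-squares fit: scale a latent's contribution by a scalar β so that it best explains a target (e.g. the base model's reconstruction error). A toy sketch with assumed names and synthetic data, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 16, 200

# Hypothetical inputs for ONE "chat-only" latent: its decoder direction and
# its activations on a batch of tokens.
direction = rng.normal(size=d_model)
acts = rng.exponential(size=n_tokens)          # latent activations f_j >= 0
latent_contrib = np.outer(acts, direction)     # f_j * d_j, per token

# Synthetic target: pretend the base reconstruction error secretly contains
# half of this latent's contribution (a Complete Shrinkage signature).
base_error = 0.5 * latent_contrib + 0.1 * rng.normal(size=(n_tokens, d_model))

# Closed-form least squares: beta = <target, contrib> / <contrib, contrib>
beta = np.sum(base_error * latent_contrib) / np.sum(latent_contrib**2)
print(round(beta, 2))  # ~0.5: the latent partly explains the base error
```

A large β against the base reconstruction error flags Complete Shrinkage; the same fit against the base reconstructed activation flags Latent Decoupling. Genuinely chat-specific latents should score near zero on both.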
1) Complete Shrinkage: L1 regularization might force base latents to zero even when useful
2) Latent Decoupling: "chat-only" concepts might actually exist in the base model, just encoded differently by the crosscoder
We use @anthropic.com's crosscoders to learn a shared sparse dictionary of "latents" across models. Each latent is represented by a different decoder vector in each model.
When comparing these vectors' norms, some latents appear to exist in just one model!
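The norm comparison can be operationalized as a relative decoder-norm score per latent. A minimal sketch, where the variable names, shapes, and the 0.9 cutoff are illustrative assumptions rather than the paper's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_latents, d_model = 8, 16

# Hypothetical crosscoder decoder weights: one (n_latents, d_model) decoder
# matrix per model, sharing a single dictionary of latents.
dec_base = rng.normal(size=(n_latents, d_model))
dec_chat = rng.normal(size=(n_latents, d_model))
dec_base[3] *= 1e-4  # simulate a "chat-only" latent: ~zero base decoder

def relative_norm(dec_base, dec_chat):
    """0 -> base-only, ~0.5 -> shared, 1 -> chat-only."""
    nb = np.linalg.norm(dec_base, axis=1)
    nc = np.linalg.norm(dec_chat, axis=1)
    return nc / (nb + nc)

r = relative_norm(dec_base, dec_chat)
chat_only = np.where(r > 0.9)[0]
print(chat_only)  # only latent 3 is flagged as chat-only
```

This is the metric the thread later argues can be misleading on its own: a near-zero base decoder norm may reflect Complete Shrinkage or Latent Decoupling rather than a truly chat-specific latent.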
What do chat LLMs learn in finetuning?
Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders
This finds interpretable and causal chat-only features!🧵