Clément Dumas
@butanium.bsky.social
Master's student at ENS Paris-Saclay / aspiring AI safety researcher / improviser
Prev research intern @ EPFL w/ wendlerc.bsky.social and Robert West
MATS Winter 7.0 Scholar w/ neelnanda.bsky.social

https://butanium.github.io
A very important paper led by Julian!
TL;DR: we show that your Narrow Finetuning is showing and might not be a realistic setup to study!
October 20, 2025 at 3:20 PM
Results: The generated definitions (dark blue) are just as good as what you'd get from prompting (brown)!
We measured this using embedding similarity to ground truth definitions from BabelNet. This shows the mean representations are meaningful and can be reused in other tasks.
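A minimal sketch of that evaluation, assuming a generic sentence-embedding model (the embedder and example strings here are illustrative stand-ins, not our exact pipeline):

```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative stand-in

def definition_similarity(generated: str, ground_truth: str) -> float:
    """Cosine similarity between a generated definition and a BabelNet one."""
    embs = embedder.encode([generated, ground_truth], convert_to_tensor=True)
    return F.cosine_similarity(embs[0], embs[1], dim=0).item()

# Example: compare a model-generated definition against the BabelNet gloss.
print(definition_similarity(
    "a domesticated feline kept as a pet",
    "a small domesticated carnivorous mammal",
))
```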
June 29, 2025 at 11:07 PM
We did this by patching the mean representation into a target prompt to force the model to translate it (left). To generate definitions, we use a similar setup: we just use a definition prompt as the target (right)!
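In code, the patching is roughly this (model, layer, prompt, and patch position are illustrative placeholders, not our exact setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the actual experiments use larger models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

LAYER, PATCH_POS = 6, 3  # illustrative: which layer / token slot to overwrite

def make_patch_hook(mean_rep):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > PATCH_POS:  # only patch the prefill pass
            hidden[:, PATCH_POS] = mean_rep  # overwrite with the mean representation
        return output
    return hook

prompt = tok('Translate "?" to French:', return_tensors="pt")
mean_rep = torch.randn(model.config.hidden_size)  # placeholder for a real mean

handle = model.transformer.h[LAYER].register_forward_hook(make_patch_hook(mean_rep))
out = model.generate(**prompt, max_new_tokens=5)
handle.remove()
print(tok.decode(out[0]))
```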
June 29, 2025 at 11:07 PM
Like Andy Arditi (andyrdt.com) & Cooper Leong (cooperleong00.github.io), we find that chat-template tokens (the special tokens marking turn boundaries) matter enormously!
40% of robust chat-specific latents primarily activate on these structural tokens.
The "special sauce" of chat models may be in how they use these tokens!
April 7, 2025 at 4:21 PM
Those latents can be used to steer the model’s behavior, e.g. by inducing different types of refusal!
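Mechanically, steering just adds a scaled copy of the latent's chat decoder direction to the residual stream; a sketch (names, strength, and the hook point are illustrative):

```python
import torch

def make_steering_hook(direction: torch.Tensor, strength: float = 8.0):
    """direction: a latent's chat-model decoder vector, shape [d_model]."""
    unit = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden += strength * unit  # push every position along the latent direction
        return output
    return hook

# Usage, with a model and hook point as in the patching sketch above:
# handle = model.transformer.h[LAYER].register_forward_hook(
#     make_steering_hook(refusal_latent_dir))
```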
April 7, 2025 at 4:21 PM
The BatchTopK chat-only latents are highly interpretable and represent fascinating concepts:
💬 False information detection
❓ Knowledge boundaries recognition
🤔 Personal experience questions
⚠️ Refusal mechanisms
📝 Summarization requests
🃏 Joke detection
...and many more!
April 7, 2025 at 4:21 PM
We tested how well different latent sets can transform base model activations into chat model ones and recover the chat behavior.
Key finding: with BatchTopK, the norm metric reliably identifies causally important latents. With L1 crosscoders, you need our Latent Scaling technique.
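Simplified, the activation transfer looks like this (names and shapes are illustrative; the paper's exact procedure has more moving parts):

```python
import torch

def transfer_latents(base_act, latent_acts, dec_chat, latent_ids):
    """Move a base activation toward the chat one along a chosen latent set.
    base_act: [d_model]; latent_acts: [n_latents] crosscoder activations;
    dec_chat: [n_latents, d_model] chat decoder; latent_ids: latents to transfer."""
    delta = latent_acts[latent_ids] @ dec_chat[latent_ids]
    return base_act + delta  # then check if this recovers chat-like behavior
```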
April 7, 2025 at 4:21 PM
Our findings led us to train crosscoders with @bartbussmann.bsky.social’s BatchTopK loss instead of L1.
While BatchTopK lacks the neat trimodal distribution of norms seen in L1, it avoids both Complete Shrinkage and Latent Decoupling issues.
Result: Many more genuinely chat-specific latents!
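For reference, a minimal sketch of the BatchTopK activation (see Bussmann et al. for the real implementation):

```python
import torch

def batch_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the top k * batch_size activations across the whole batch,
    instead of the top k per example, so sparsity can vary per example.
    pre_acts: [batch, n_latents]; k: average latents kept per example."""
    batch = pre_acts.shape[0]
    flat = pre_acts.flatten()
    vals, idx = flat.topk(k * batch)
    mask = torch.zeros_like(flat)
    mask[idx] = 1.0
    return (flat * mask).view_as(pre_acts)
```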
April 7, 2025 at 4:21 PM
Our analysis confirms these aren't just theoretical concerns! Looking at the L1 crosscoder, we found:
- Many "chat-only" latents (blue) show high Shrinkage values
- Clear overlap between chat-only and shared latents (orange)

Most "chat-only" latents aren't actually chat-specific!
April 7, 2025 at 4:21 PM
To detect this, we developed "Latent Scaling" to measure each latent's contribution to:
- Base model reconstruction error (catches Complete Shrinkage)
- Base reconstructed activation (catches Latent Decoupling)
This reveals which latents are genuinely chat-specific vs. false positives
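At its core, Latent Scaling is a closed-form least-squares fit; a sketch under assumed shapes and names:

```python
import torch

def latent_scale(f: torch.Tensor, d: torch.Tensor, y: torch.Tensor) -> float:
    """Best scale beta for explaining target y with latent direction d.
    f: [n] latent activations; d: [d_model] decoder vector; y: [n, d_model].
    Fit y = base reconstruction error to probe Complete Shrinkage,
    y = base reconstructed activation to probe Latent Decoupling."""
    num = (f * (y @ d)).sum()       # sum_i f_i <d, y_i>
    den = (f ** 2).sum() * (d @ d)  # sum_i f_i^2 ||d||^2
    return (num / den).item()       # argmin_beta ||y - beta * f d||^2
```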
April 7, 2025 at 4:21 PM
We identified two theoretical issues with L1 crosscoders:
1) Complete Shrinkage: L1 regularization might force base latents to zero even when useful
2) Latent Decoupling: "Chat-only" concepts might actually exist in the base model but be encoded in the crosscoder differently
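A toy illustration of issue 2, with hand-picked 2-D vectors: the concept direction exists in the base model, but the crosscoder can reconstruct it there through two shared latents, so the dedicated latent gets a zero base decoder and looks "chat-only" by the norm criterion.

```python
import torch

concept = torch.tensor([1.0, 1.0])   # concept direction, present in both models
shared_a = torch.tensor([1.0, 0.0])  # shared latent decoder (base side)
shared_b = torch.tensor([0.0, 1.0])  # shared latent decoder (base side)

# Base-side reconstruction without any dedicated "chat-only" latent:
recon = shared_a + shared_b
print(torch.allclose(recon, concept))  # True: concept is decoupled, not absent
```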
April 7, 2025 at 4:21 PM
How do we compare base/chat models?
We use @anthropic.com's crosscoders to learn a shared sparse dictionary of "latents" across models. Each latent is represented by different vectors in both models.
When comparing these vectors' norms, some latents appear to exist in just one model!
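Concretely, the norm comparison is something like this (a sketch over assumed decoder matrices from a trained crosscoder):

```python
import torch

def relative_norm(dec_base: torch.Tensor, dec_chat: torch.Tensor) -> torch.Tensor:
    """dec_*: [n_latents, d_model] decoder matrices. Returns values in [0, 1]:
    ~0 => base-only, ~0.5 => shared, ~1 => chat-only."""
    nb, nc = dec_base.norm(dim=-1), dec_chat.norm(dim=-1)
    return nc / (nb + nc + 1e-8)
```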
April 7, 2025 at 4:21 PM
New paper w/ @jkminder.bsky.social & @neelnanda.bsky.social
What do chat LLMs learn in finetuning?

Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders

This finds interpretable and causal chat-only features!🧵
April 7, 2025 at 4:21 PM