Prev research intern @ EPFL w/ wendlerc.bsky.social and Robert West
MATS Winter 7.0 Scholar w/ neelnanda.bsky.social
https://butanium.github.io
TL;DR: we show that your narrow finetuning is showing, and that narrow finetuning might not be a realistic setup to study!
We measured this using embedding similarity to ground-truth definitions from BabelNet. This shows that the mean representations are meaningful and can be reused in other tasks.
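The similarity measurement can be sketched as a cosine comparison between a mean representation and candidate definition embeddings. Everything below is synthetic stand-in data (assumed names and dimensions), not the real BabelNet embeddings:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
d = 32
# Hypothetical embeddings: a concept's ground-truth definition, the model's
# mean representation of that concept (correlated with it), and an
# unrelated definition as a distractor.
true_def = rng.normal(size=d)
mean_rep = true_def + 0.3 * rng.normal(size=d)
other_def = rng.normal(size=d)

# The mean representation should match its own definition more closely.
print(cosine(mean_rep, true_def) > cosine(mean_rep, other_def))  # True
```

On real data one would embed all BabelNet definitions and check that the ground-truth one ranks highest under this score.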
40% of robust chat-specific latents primarily activate on these structural tokens.
The "special sauce" of chat models may be in how they use these tokens!
💬 False information detection
❓ Knowledge boundaries recognition
🤔 Personal experience questions
⚠️ Refusal mechanisms
📝 Summarization requests
🃏 Joke detection
...and many more!
Key finding: In BatchTopK, the norm metric reliably identifies causally important latents. With L1 crosscoders, you need our Latent Scaling technique.
While BatchTopK lacks the neat trimodal distribution of norms seen in L1, it avoids both Complete Shrinkage and Latent Decoupling issues.
Result: Many more genuinely chat-specific latents!
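The BatchTopK activation can be sketched in a few lines: instead of keeping the top k latents per example, it keeps the k × batch_size largest pre-activations across the whole batch, so the sparsity budget can shift between examples. A minimal illustrative sketch (not the actual training code):

```python
import numpy as np

def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest activations across the WHOLE batch
    (not per example); zero out everything else."""
    n_keep = k * pre_acts.shape[0]
    flat = pre_acts.ravel()
    thresh = np.partition(flat, -n_keep)[-n_keep]  # n_keep-th largest value
    return np.where(pre_acts >= thresh, pre_acts, 0.0)

# Toy batch of 2 examples x 4 latents; with k=2 we keep 4 activations total.
pre = np.array([[0.9, 0.1, 0.0, 0.05],
                [0.8, 0.7, 0.2, 0.6]])
out = batch_topk(pre, k=2)
print(out)  # row 0 keeps 1 activation, row 1 keeps 3: the budget is shared
```

Note how the per-example sparsity flexes (1 vs. 3 here) while the batch-level total stays fixed, which is what distinguishes BatchTopK from per-example TopK.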
- Many "chat-only" latents (blue) show high Shrinkage values
- Clear overlap between chat-only and shared latents (orange)
Most "chat-only" latents aren't actually chat-specific!
- Base model reconstruction error (catches Complete Shrinkage)
- Base reconstructed activation (catches Latent Decoupling)
This reveals which latents are genuinely chat-specific vs. false positives
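The core of Latent Scaling can be sketched as a per-latent least-squares fit: scale a latent's contribution by a scalar β so that it best explains a target (e.g. the base model's reconstruction error). A toy sketch with assumed names and synthetic data, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 16, 200

# Hypothetical inputs for ONE "chat-only" latent: its decoder direction and
# its activations on a batch of tokens.
direction = rng.normal(size=d_model)
acts = rng.exponential(size=n_tokens)          # latent activations f_j >= 0
latent_contrib = np.outer(acts, direction)     # f_j * d_j, per token

# Synthetic target: pretend the base reconstruction error secretly contains
# half of this latent's contribution (a Complete Shrinkage signature).
base_error = 0.5 * latent_contrib + 0.1 * rng.normal(size=(n_tokens, d_model))

# Closed-form least squares: beta = <target, contrib> / <contrib, contrib>
beta = np.sum(base_error * latent_contrib) / np.sum(latent_contrib**2)
print(round(beta, 2))  # ~0.5: the latent partly explains the base error
```

A large β against the base reconstruction error flags Complete Shrinkage; the same fit against the base reconstructed activation flags Latent Decoupling. Genuinely chat-specific latents should score near zero on both.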
1) Complete Shrinkage: L1 regularization might force base latents to zero even when useful
2) Latent Decoupling: "chat-only" concepts might actually exist in the base model, just encoded differently by the crosscoder
We use @anthropic.com's crosscoders to learn a shared sparse dictionary of "latents" across models. Each latent is represented by a different decoder vector in each model.
When comparing these vectors' norms, some latents appear to exist in just one model!
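The norm comparison can be operationalized as a relative decoder-norm score per latent. A minimal sketch, where the variable names, shapes, and the 0.9 cutoff are illustrative assumptions rather than the paper's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_latents, d_model = 8, 16

# Hypothetical crosscoder decoder weights: one (n_latents, d_model) decoder
# matrix per model, sharing a single dictionary of latents.
dec_base = rng.normal(size=(n_latents, d_model))
dec_chat = rng.normal(size=(n_latents, d_model))
dec_base[3] *= 1e-4  # simulate a "chat-only" latent: ~zero base decoder

def relative_norm(dec_base, dec_chat):
    """0 -> base-only, ~0.5 -> shared, 1 -> chat-only."""
    nb = np.linalg.norm(dec_base, axis=1)
    nc = np.linalg.norm(dec_chat, axis=1)
    return nc / (nb + nc)

r = relative_norm(dec_base, dec_chat)
chat_only = np.where(r > 0.9)[0]
print(chat_only)  # only latent 3 is flagged as chat-only
```

This is the metric the thread later argues can be misleading on its own: a near-zero base decoder norm may reflect Complete Shrinkage or Latent Decoupling rather than a truly chat-specific latent.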
What do chat LLMs learn in finetuning?
Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders
This finds interpretable and causal chat-only features!🧵