Clément Dumas
@butanium.bsky.social
Master's student at ENS Paris-Saclay / aspiring AI safety researcher / improviser
Prev research intern @ EPFL w/ wendlerc.bsky.social and Robert West
MATS Winter 7.0 Scholar w/ neelnanda.bsky.social

https://butanium.github.io
For more info, check the blog post / Julian's thread
September 5, 2025 at 7:23 PM
Why this matters: These model organisms (used in safety research) may not be realistic testbeds - the fine-tuning (ft) leaves such strong traces that models are 'always thinking' about their recent ft, even on unrelated prompts.
But: mixing in pretraining data can reduce this bias!
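One way to implement that mitigation, as an illustrative sketch (the 50/50 ratio, dataset names, and sampling scheme are placeholders, not the exact recipe):

```python
import random

def mix_datasets(ft_examples, pretraining_examples, ft_fraction=0.5, seed=0):
    """Build a fine-tuning stream that interleaves pretraining examples,
    so the ft signal is diluted rather than dominating every batch."""
    rng = random.Random(seed)
    mixed = []
    for _ in range(len(ft_examples)):
        if rng.random() < ft_fraction:
            mixed.append(rng.choice(ft_examples))
        else:
            mixed.append(rng.choice(pretraining_examples))
    return mixed
```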
September 5, 2025 at 7:23 PM
The activation diffs on the first few tokens encode a clear bias toward the ft domain. We can:
- Use Patchscope to surface relevant tokens (e.g., 'Cake', 'Culinary' for cake-baking fts)
- Steer the model to generate ft-style content
- This even works when comparing base → chat+ft! (a rough sketch of the diff + Patchscope readout is below)
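A minimal sketch of that readout with TransformerLens (the models, layer, target prompt, and the choice to add rather than replace the diff are all placeholder assumptions, not the exact setup from the post):

```python
import torch
from transformer_lens import HookedTransformer

# Placeholder models: in practice `base` and `ft` are the base model and its
# fine-tuned ("model organism") counterpart.
base = HookedTransformer.from_pretrained("gpt2")
ft = HookedTransformer.from_pretrained("gpt2")  # stand-in for the ft model

LAYER = 6
hook_name = f"blocks.{LAYER}.hook_resid_post"
prompt = "Hello"  # an unrelated, generic prompt

# 1) Activation diff on an early token of an unrelated prompt.
_, cache_base = base.run_with_cache(prompt)
_, cache_ft = ft.run_with_cache(prompt)
diff = cache_ft[hook_name][0, -1] - cache_base[hook_name][0, -1]

# 2) Patchscope-style readout: add the diff at the final token of an
#    identity-style target prompt and read the top next-token predictions.
target = "cat -> cat; 135 -> 135; hello -> hello; ?"
pos = base.to_tokens(target).shape[1] - 1  # position of the final "?" token

def add_diff(resid, hook):
    resid[0, pos] += diff
    return resid

logits = base.run_with_hooks(target, fwd_hooks=[(hook_name, add_diff)])
print(base.to_str_tokens(logits[0, -1].topk(5).indices))
```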
September 5, 2025 at 7:23 PM
Reposted by Clément Dumas
GPT is being asked to be both one mind and to also segment its understanding into many different minds; this incentivizes the model to learn to correct for its own perspective when mimicking the generator of individual texts so it doesn't know too much, to know self vs. other in minute detail.
August 29, 2025 at 1:59 AM
Do you plan to open it more broadly to people just interested in watching the dynamics that emerge there?
August 8, 2025 at 12:36 PM
What would you expect to happen if you prompt the model with "which animal do you hate the most?". It feels like your blog post would predict that the model says owl, right?
August 6, 2025 at 11:23 PM
Thanks to my co-authors @wendlerc.bsky.social, Bob West, @veniamin.bsky.social and Giovanni Monea
June 29, 2025 at 11:07 PM
For more details, check out our paper on arXiv: arxiv.org/abs/2411.08745
(we renamed it to "Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers").
Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers
A central question in multilingual language modeling is whether large language models (LLMs) develop a universal concept representation, disentangled from specific languages. In this paper, we address...
arxiv.org
June 29, 2025 at 11:07 PM
Results: The generated definitions (dark blue) are just as good as what you'd get from prompting (brown)!
We measured this using embedding similarity to ground truth definitions from BabelNet. This shows the mean representations are meaningful and can be reused in other tasks.
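That evaluation amounts to something like this (a sketch; the sentence-embedding model below is a common default, not necessarily the one used in the paper, and BabelNet definitions are assumed to be available as plain strings):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedder

def definition_similarity(generated, ground_truth):
    """Mean cosine similarity between each generated definition and its
    paired ground-truth (e.g. BabelNet) definition."""
    gen_emb = embedder.encode(generated, convert_to_tensor=True)
    gt_emb = embedder.encode(ground_truth, convert_to_tensor=True)
    sims = util.cos_sim(gen_emb, gt_emb).diagonal()
    return sims.mean().item()
```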
June 29, 2025 at 11:07 PM
We did this by patching the mean representation into a target prompt to force the model to translate it (left). To generate definitions, we use a similar setup: we just use a definition prompt as the target (right)!
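Roughly, the patching setup looks like this (a sketch with TransformerLens; the model, layer, prompts, and patch position are illustrative, not the paper's exact configuration):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model
hook_name = "blocks.6.hook_resid_post"             # placeholder layer

def patch_concept(mean_rep: torch.Tensor, target_prompt: str, pos: int):
    """Overwrite the residual stream at `pos` with the language-agnostic mean
    representation while running `target_prompt`, then read the top tokens."""
    def hook(resid, hook):
        resid[0, pos] = mean_rep
        return resid
    logits = model.run_with_hooks(target_prompt, fwd_hooks=[(hook_name, hook)])
    return model.to_str_tokens(logits[0, -1].topk(5).indices)

# Left: translation target prompt; right: definition target prompt.
# The " X" placeholder marks where the mean representation is patched in.
translation_prompt = 'English: " X" - French: "'
definition_prompt = 'The definition of " X" is "'
```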
June 29, 2025 at 11:07 PM
Quick recap of our original finding: LLMs seem to use language-agnostic concept representations.
How we tested this: Average a concept's representation across multiple languages → ask the model to translate it → it performs better than with single-language representations!
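The averaging step itself is simple; continuing the sketch above (prompts, layer, and the "last token" position are placeholder choices):

```python
# Cache the residual stream at the concept token for the same word in several
# languages, then average. Here we simply take the last prompt token.
prompts = {
    "en": 'The word "cat"',
    "fr": 'Le mot "chat"',
    "de": 'Das Wort "Katze"',
}
reps = []
for lang, p in prompts.items():
    _, cache = model.run_with_cache(p)
    reps.append(cache[hook_name][0, -1])
mean_rep = torch.stack(reps).mean(dim=0)

# `mean_rep` can then be patched into the translation or definition prompts
# from the earlier sketch via `patch_concept`.
```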
June 29, 2025 at 11:07 PM
*discord, right?
June 26, 2025 at 6:41 PM
Asking an LLM with the right prompt is a good start imo (see e.g. www.lesswrong.com/posts/Gi8NP9...)
AI for Epistemics Hackathon — LessWrong
AI for Epistemics is about helping to leverage AI for better truthseeking mechanisms — at the level of individual users, the whole of society, or in…
www.lesswrong.com
June 26, 2025 at 9:36 AM
Full paper: arxiv.org/abs/2504.02922
This work was conducted during the MATS program, in equal contribution with @jkminder.bsky.social, supervised by Bilal Chughtai (bilalchughtai.co.uk) and @neelnanda.bsky.social, with help from @cadentj.bsky.social.
We'll be presenting at the ICLR SLLM workshop!
Robustly identifying concepts introduced during chat fine-tuning using crosscoders
Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviours of interest are introduced during fine-tuning, and model diffing offers a promi...
arxiv.org
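For intuition: a crosscoder is roughly a sparse autoencoder with one shared latent dictionary that reconstructs both models' activations (e.g. base and chat) at once. A minimal sketch, with sizes, initialisation, and loss weighting chosen for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class CrossCoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Per-model encoder/decoder weights, one shared latent space (2 models).
        self.W_enc = nn.Parameter(torch.randn(2, d_model, d_hidden) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_hidden, 2, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.b_dec = nn.Parameter(torch.zeros(2, d_model))

    def forward(self, acts):  # acts: [batch, 2, d_model]
        # Sum per-model encoder contributions into one shared ReLU latent.
        pre = torch.einsum("bmd,mdh->bh", acts, self.W_enc) + self.b_enc
        f = torch.relu(pre)
        # Decode back to each model separately.
        recon = torch.einsum("bh,hmd->bmd", f, self.W_dec) + self.b_dec
        return recon, f

def crosscoder_loss(recon, acts, f, W_dec, l1_coeff=1e-3):
    # Reconstruction error for both models plus a sparsity penalty on the
    # latents, weighted by the summed per-model decoder norms.
    mse = (recon - acts).pow(2).sum(dim=(1, 2)).mean()
    dec_norms = W_dec.norm(dim=-1).sum(dim=-1)  # [d_hidden]
    l1 = (f * dec_norms).sum(dim=-1).mean()
    return mse + l1_coeff * l1
```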
April 7, 2025 at 4:21 PM