Clément Dumas
@butanium.bsky.social
Master's student at ENS Paris-Saclay / aspiring AI safety researcher / improviser
Prev research intern @ EPFL w/ wendlerc.bsky.social and Robert West
MATS Winter 7.0 Scholar w/ neelnanda.bsky.social

https://butanium.github.io
For more info, check the blog post / Julian's thread
September 5, 2025 at 7:23 PM
Why this matters: These model organisms (used in safety research) may not be realistic testbeds - the fine-tuning (ft) leaves such strong traces that models are 'always thinking' about their recent ft, even on unrelated prompts.
But: mixing in pretraining data can reduce this bias!
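One way to implement that mitigation, as an illustrative sketch (the 50/50 ratio, dataset names, and sampling scheme are placeholders, not the exact recipe):

```python
import random

def mix_datasets(ft_examples, pretraining_examples, ft_fraction=0.5, seed=0):
    """Build a fine-tuning stream that interleaves pretraining examples,
    so the ft signal is diluted rather than dominating every batch."""
    rng = random.Random(seed)
    mixed = []
    for _ in range(len(ft_examples)):
        if rng.random() < ft_fraction:
            mixed.append(rng.choice(ft_examples))
        else:
            mixed.append(rng.choice(pretraining_examples))
    return mixed
```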
September 5, 2025 at 7:23 PM
The activation diffs on the first few tokens encode a clear bias toward the ft domain. We can:
- Use Patchscope to surface relevant tokens (e.g., 'Cake', 'Culinary' for cake-baking fts)
- Steer the model to generate ft-style content
- This even works when comparing base → chat+ft! (a rough sketch of the diff + Patchscope readout is below)
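A minimal sketch of that readout with TransformerLens (the models, layer, target prompt, and the choice to add rather than replace the diff are all placeholder assumptions, not the exact setup from the post):

```python
import torch
from transformer_lens import HookedTransformer

# Placeholder models: in practice `base` and `ft` are the base model and its
# fine-tuned ("model organism") counterpart.
base = HookedTransformer.from_pretrained("gpt2")
ft = HookedTransformer.from_pretrained("gpt2")  # stand-in for the ft model

LAYER = 6
hook_name = f"blocks.{LAYER}.hook_resid_post"
prompt = "Hello"  # an unrelated, generic prompt

# 1) Activation diff on an early token of an unrelated prompt.
_, cache_base = base.run_with_cache(prompt)
_, cache_ft = ft.run_with_cache(prompt)
diff = cache_ft[hook_name][0, -1] - cache_base[hook_name][0, -1]

# 2) Patchscope-style readout: add the diff at the final token of an
#    identity-style target prompt and read the top next-token predictions.
target = "cat -> cat; 135 -> 135; hello -> hello; ?"
pos = base.to_tokens(target).shape[1] - 1  # position of the final "?" token

def add_diff(resid, hook):
    resid[0, pos] += diff
    return resid

logits = base.run_with_hooks(target, fwd_hooks=[(hook_name, add_diff)])
print(base.to_str_tokens(logits[0, -1].topk(5).indices))
```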
September 5, 2025 at 7:23 PM
Reposted by Clément Dumas
GPT is being asked to be both one mind and to also segment its understanding into many different minds; this incentivizes the model to learn to correct for its own perspective when mimicking the generator of individual texts so it doesn't know too much, to know self vs. other in minute detail.
August 29, 2025 at 1:59 AM
Do you plan to open it more broadly to people just interested in watching the dynamics that emerge there?
August 8, 2025 at 12:36 PM
What would you expect to happen if you prompt the model with "which animal do you hate the most?". It feels like your blog post would predict that the model says owl, right?
August 6, 2025 at 11:23 PM
Thanks to my co-authors @wendlerc.bsky.social, Bob West, @veniamin.bsky.social and Giovanni Monea
June 29, 2025 at 11:07 PM
For more details, check out our paper on arXiv: arxiv.org/abs/2411.08745
(we renamed it to "Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers").
Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers
A central question in multilingual language modeling is whether large language models (LLMs) develop a universal concept representation, disentangled from specific languages. In this paper, we address...
arxiv.org
June 29, 2025 at 11:07 PM
Results: The generated definitions (dark blue) are just as good as what you'd get from prompting (brown)!
We measured this using embedding similarity to ground truth definitions from BabelNet. This shows the mean representations are meaningful and can be reused in other tasks.
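That evaluation amounts to something like this (a sketch; the sentence-embedding model below is a common default, not necessarily the one used in the paper, and BabelNet definitions are assumed to be available as plain strings):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedder

def definition_similarity(generated, ground_truth):
    """Mean cosine similarity between each generated definition and its
    paired ground-truth (e.g. BabelNet) definition."""
    gen_emb = embedder.encode(generated, convert_to_tensor=True)
    gt_emb = embedder.encode(ground_truth, convert_to_tensor=True)
    sims = util.cos_sim(gen_emb, gt_emb).diagonal()
    return sims.mean().item()
```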
June 29, 2025 at 11:07 PM
We did this by patching the mean representation into a target prompt to force the model to translate it (left). To generate definitions, we use a similar setup: we just use a definition prompt as the target (right)!
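Roughly, the patching setup looks like this (a sketch with TransformerLens; the model, layer, prompts, and patch position are illustrative, not the paper's exact configuration):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model
hook_name = "blocks.6.hook_resid_post"             # placeholder layer

def patch_concept(mean_rep: torch.Tensor, target_prompt: str, pos: int):
    """Overwrite the residual stream at `pos` with the language-agnostic mean
    representation while running `target_prompt`, then read the top tokens."""
    def hook(resid, hook):
        resid[0, pos] = mean_rep
        return resid
    logits = model.run_with_hooks(target_prompt, fwd_hooks=[(hook_name, hook)])
    return model.to_str_tokens(logits[0, -1].topk(5).indices)

# Left: translation target prompt; right: definition target prompt.
# The " X" placeholder marks where the mean representation is patched in.
translation_prompt = 'English: " X" - French: "'
definition_prompt = 'The definition of " X" is "'
```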
June 29, 2025 at 11:07 PM
Quick recap of our original finding: LLMs seem to use language-agnostic concept representations.
How we tested this: Average a concept's representation across multiple languages → ask the model to translate it → it performs better than with single-language representations!
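The averaging step itself is simple; continuing the sketch above (prompts, layer, and the "last token" position are placeholder choices):

```python
# Cache the residual stream at the concept token for the same word in several
# languages, then average. Here we simply take the last prompt token.
prompts = {
    "en": 'The word "cat"',
    "fr": 'Le mot "chat"',
    "de": 'Das Wort "Katze"',
}
reps = []
for lang, p in prompts.items():
    _, cache = model.run_with_cache(p)
    reps.append(cache[hook_name][0, -1])
mean_rep = torch.stack(reps).mean(dim=0)

# `mean_rep` can then be patched into the translation or definition prompts
# from the earlier sketch via `patch_concept`.
```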
June 29, 2025 at 11:07 PM
*discord, right?
June 26, 2025 at 6:41 PM
Asking an LLM with the right prompt is a good start imo (see e.g. www.lesswrong.com/posts/Gi8NP9...)
AI for Epistemics Hackathon — LessWrong
AI for Epistemics is about helping to leverage AI for better truthseeking mechanisms — at the level of individual users, the whole of society, or in…
www.lesswrong.com
June 26, 2025 at 9:36 AM
Full paper: arxiv.org/abs/2504.02922
This work was conducted during the MATS program, in equal contribution with @jkminder.bsky.social, supervised by Bilal Chughtai (bilalchughtai.co.uk) and @neelnanda.bsky.social, with help from @cadentj.bsky.social.
We'll be presenting at the ICLR SLLM workshop!
Robustly identifying concepts introduced during chat fine-tuning using crosscoders
Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviours of interest are introduced during fine-tuning, and model diffing offers a promi...
arxiv.org
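For intuition: a crosscoder is roughly a sparse autoencoder with one shared latent dictionary that reconstructs both models' activations (e.g. base and chat) at once. A minimal sketch, with sizes, initialisation, and loss weighting chosen for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class CrossCoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Per-model encoder/decoder weights, one shared latent space (2 models).
        self.W_enc = nn.Parameter(torch.randn(2, d_model, d_hidden) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_hidden, 2, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.b_dec = nn.Parameter(torch.zeros(2, d_model))

    def forward(self, acts):  # acts: [batch, 2, d_model]
        # Sum per-model encoder contributions into one shared ReLU latent.
        pre = torch.einsum("bmd,mdh->bh", acts, self.W_enc) + self.b_enc
        f = torch.relu(pre)
        # Decode back to each model separately.
        recon = torch.einsum("bh,hmd->bmd", f, self.W_dec) + self.b_dec
        return recon, f

def crosscoder_loss(recon, acts, f, W_dec, l1_coeff=1e-3):
    # Reconstruction error for both models plus a sparsity penalty on the
    # latents, weighted by the summed per-model decoder norms.
    mse = (recon - acts).pow(2).sum(dim=(1, 2)).mean()
    dec_norms = W_dec.norm(dim=-1).sum(dim=-1)  # [d_hidden]
    l1 = (f * dec_norms).sum(dim=-1).mean()
    return mse + l1_coeff * l1
```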
April 7, 2025 at 4:21 PM