Prev research intern @ EPFL w/ wendlerc.bsky.social and Robert West
MATS Winter 7.0 Scholar w/ neelnanda.bsky.social
https://butanium.github.io
But: mixing in pretraining data can reduce this bias!
- Use Patchscope to surface relevant tokens (e.g., 'Cake', 'Culinary' for cake-baking fine-tunes)
- Steer the model to generate fine-tune-style content (rough sketch below)
- Works even when comparing base → chat+fine-tune!
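A minimal sketch of the recipe (not our exact code: model ids, layer choice, and steering scale are illustrative placeholders, and it assumes a Llama-style HuggingFace model):

```python
# Sketch: diff the residual stream of a base model vs. its narrow fine-tune,
# read the difference out through the unembedding (a logit-lens stand-in for
# the full Patchscope readout), and add it back as a steering vector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER = 13  # illustrative middle layer

tok = AutoTokenizer.from_pretrained("base-model-id")           # placeholder id
base = AutoModelForCausalLM.from_pretrained("base-model-id")   # placeholder id
ft = AutoModelForCausalLM.from_pretrained("cake-finetune-id")  # placeholder id

ids = tok("The quick brown fox", return_tensors="pt")

def resid(model, layer):
    # hidden_states[layer] = residual stream after `layer` transformer blocks
    return model(**ids, output_hidden_states=True).hidden_states[layer]

with torch.no_grad():
    diff = resid(ft, LAYER) - resid(base, LAYER)  # [1, seq, d_model]
direction = diff.mean(dim=1)                      # [1, d_model]

# Readout: which tokens does the difference direction point toward?
logits = direction @ base.get_output_embeddings().weight.T
print(tok.convert_ids_to_tokens(logits.topk(10).indices[0].tolist()))

# Steering: add the (scaled) direction to the base model's residual stream
# at the same depth while generating.
def steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * direction.to(hidden.dtype)  # scale is a free knob
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = base.model.layers[LAYER - 1].register_forward_hook(steer)  # Llama-style layout
print(tok.decode(base.generate(**ids, max_new_tokens=30)[0]))
handle.remove()
```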
(we renamed it to "Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers").
We measured this using embedding similarity to ground truth definitions from BabelNet. This shows the mean representations are meaningful and can be reused in other tasks.
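Roughly, the check looks like this (a minimal sketch, not our exact setup: it assumes an off-the-shelf sentence embedder, and the strings and model name are illustrative):

```python
# Sketch: embed the text the model decodes from a (mean) concept representation
# and the corresponding BabelNet gloss, then score them by cosine similarity.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedder

decoded_description = "a sweet baked dessert made from flour, sugar and eggs"
babelnet_gloss = "Baked good, usually sweet, made from flour, sugar and eggs."

emb = embedder.encode([decoded_description, babelnet_gloss], convert_to_tensor=True)
print(f"similarity to ground-truth definition: {util.cos_sim(emb[0], emb[1]).item():.3f}")
```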
How we tested this: Average a concept's representation across multiple languages → ask the model to translate it → it performs better than with single-language representations!
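A minimal sketch of that translation test (prompts, model id, and layer are placeholders; the real setup uses activation patching on word-translation prompts):

```python
# Sketch: average the residual-stream vector of one concept across prompts in
# several languages, patch the mean into a translation prompt, and let the
# model generate the target-language word.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER = 15  # illustrative layer
tok = AutoTokenizer.from_pretrained("model-id")           # placeholder id
model = AutoModelForCausalLM.from_pretrained("model-id")  # placeholder id

def concept_vec(prompt):
    ids = tok(prompt, return_tensors="pt")
    hs = model(**ids, output_hidden_states=True).hidden_states[LAYER]
    return hs[0, -1]  # residual stream at the last (concept) position

# Same concept ("cat") read off in several source languages
mean_vec = torch.stack([
    concept_vec('Français: "chat"'),
    concept_vec('Deutsch: "Katze"'),
    concept_vec('中文: "猫"'),
]).mean(dim=0)

# Patch the mean vector into the final position of a translation prompt.
target = tok('English: "', return_tensors="pt")  # placeholder target prompt
pos = target["input_ids"].shape[1] - 1

def patch(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > pos:  # only patch on the prefill pass
        hidden[:, pos] = mean_vec.to(hidden.dtype)
    return output

handle = model.model.layers[LAYER - 1].register_forward_hook(patch)  # Llama-style layout
print(tok.decode(model.generate(**target, max_new_tokens=5)[0]))
handle.remove()
```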
This work was done during the MATS program, in equal contribution with @jkminder.bsky.social, supervised by Bilal Chughtai (bilalchughtai.co.uk) and @neelnanda.bsky.social, with help from @cadentj.bsky.social.
We'll be presenting at the ICLR SLLM workshop!