Julian Minder
@jkminder.bsky.social
PhD student at EPFL with Robert West; MSc from ETH Zürich

Mainly interested in Language Model Interpretability and Model Diffing.

MATS 7.0 Winter 2025 Scholar w/ Neel Nanda

jkminder.ch
Pinned
New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what it was finetuned for. And so can our interpretability agent! 🧵
October 20, 2025 at 3:11 PM
Can we interpret what happens during finetuning? Yes, at least for narrow domains! Narrow finetuning leaves traces behind. By comparing activations before and after finetuning, we can interpret these traces, even with an agent! We interpret subliminal learning, emergent misalignment, and more.
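A minimal sketch of that activation-diffing idea, not the paper's code: load a base model and a (hypothetical) finetuned checkpoint, run the same prompts through both, and look at where the residual-stream activations diverge. Model names and prompts below are placeholders.

```python
# Hedged sketch: compare hidden activations of a base vs. finetuned model.
# Both model names and the prompts are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "Qwen/Qwen2.5-0.5B"              # placeholder base model
ft_name = "your-org/qwen2.5-0.5b-narrow-ft"  # hypothetical finetuned checkpoint

tok = AutoTokenizer.from_pretrained(base_name)
tok.pad_token = tok.pad_token or tok.eos_token
base = AutoModelForCausalLM.from_pretrained(base_name)
ft = AutoModelForCausalLM.from_pretrained(ft_name)

prompts = ["The capital of France is", "My favourite animal is the"]
batch = tok(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    h_base = base(**batch, output_hidden_states=True).hidden_states
    h_ft = ft(**batch, output_hidden_states=True).hidden_states

# Per-layer norm of the mean activation difference: a crude picture of where
# finetuning "left traces" in the residual stream.
for layer, (hb, hf) in enumerate(zip(h_base, h_ft)):
    diff = (hf - hb).mean(dim=(0, 1))  # average over batch and sequence positions
    print(f"layer {layer:2d}: ||mean diff|| = {diff.norm().item():.4f}")
```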
September 5, 2025 at 12:21 PM
Very cool initiative!
The next generation of open LLMs should be inclusive, compliant, and multilingual by design. That’s why we (@icepfl.bsky.social, @ethz.ch, @cscsch.bsky.social) built Apertus.
EPFL, ETH Zurich & CSCS just released Apertus, Switzerland’s first fully open-source large language model.
Trained on 15T tokens in 1,000+ languages, it’s built for transparency, responsibility & the public good.

Read more: actu.epfl.ch/news/apertus...
September 3, 2025 at 9:44 AM
Causal Abstraction, the theory behind DAS, tests whether a network realizes a given algorithm. We show (w/ @denissutter.bsky.social, T. Hofmann, @tpimentel.bsky.social) that the theory collapses without the linear representation hypothesis, a problem we call the non-linear representation dilemma.
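For context, the basic test underlying causal abstraction (and DAS) is an interchange intervention: swap part of the network's internal state computed on one input into a run on another input, and check whether the output changes the way the high-level algorithm says it should. Below is a toy, made-up illustration; the model, inputs, and patched subspace are all placeholders, and DAS would learn the subspace rather than fix it.

```python
# Toy sketch of an interchange intervention, the basic test behind causal abstraction.
# Model, inputs, and the patched subspace are made up for illustration only.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

x_base = torch.randn(1, 4)     # "base" input
x_source = torch.randn(1, 4)   # "source" input providing the swapped-in value

h_base = model[1](model[0](x_base))      # hidden activations on the base input
h_source = model[1](model[0](x_source))  # hidden activations on the source input

# Swap a fixed 3-dimensional subspace of the hidden layer from source into base.
# DAS would instead *learn* a rotation that picks this subspace out.
idx = torch.tensor([0, 1, 2])
h_patched = h_base.clone()
h_patched[:, idx] = h_source[:, idx]

out = model[2](h_patched)
print("output under interchange intervention:", out)
# Causal abstraction asks: does this output match what the high-level algorithm
# predicts when the corresponding high-level variable is swapped the same way?
```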
July 17, 2025 at 10:57 AM
Reposted by Julian Minder
In this new paper, w/ @denissutter.bsky.social, @jkminder.bsky.social, and T. Hofmann, we study *causal abstraction*, a formal specification of when a deep neural network (DNN) implements an algorithm. This is the framework behind, e.g., distributed alignment search.

Paper: arxiv.org/abs/2507.08802
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
The concept of causal abstraction got recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level ...
arxiv.org
July 14, 2025 at 12:15 PM
Reposted by Julian Minder
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No!⚠️ In our new paper, we show many mech interp methods implicitly rely on the linear representation hypothesis🧵
July 14, 2025 at 12:15 PM
With @butanium.bsky.social and @neelnanda.bsky.social we've just published a post on model diffing that extends our previous paper.
Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
June 30, 2025 at 9:02 PM
In our most recent work, we looked at how to best leverage crosscoders to identify representational differences between base and chat models. We find many cool things, e.g., a knowledge-boundary latent, a detailed-info latent, and a humor/joke detection latent.
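For intuition, here is a toy crosscoder-style sketch with the architecture simplified (a single encoder over concatenated activations rather than the actual training recipe; sizes and data are made up): a shared sparse code reconstructs both models' activations, so latents whose two decoder directions differ sharply point at chat-specific structure.

```python
# Simplified crosscoder sketch (assumed architecture, not the paper's exact setup):
# a shared sparse code jointly reconstructs base- and chat-model activations.
import torch
import torch.nn as nn

class CrossCoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(2 * d_model, d_hidden)       # reads both models' activations
        self.dec_base = nn.Linear(d_hidden, d_model)      # decoder for the base model
        self.dec_chat = nn.Linear(d_hidden, d_model)      # decoder for the chat model

    def forward(self, a_base, a_chat):
        f = torch.relu(self.enc(torch.cat([a_base, a_chat], dim=-1)))
        return self.dec_base(f), self.dec_chat(f), f

cc = CrossCoder(d_model=768, d_hidden=8192)
a_base = torch.randn(32, 768)   # stand-in activations from the base model
a_chat = torch.randn(32, 768)   # stand-in activations from the chat model

rb, rc, f = cc(a_base, a_chat)
# Reconstruction for both models plus an L1 sparsity penalty on the shared code.
loss = ((rb - a_base) ** 2).mean() + ((rc - a_chat) ** 2).mean() + 1e-3 * f.abs().mean()
loss.backward()
```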
April 7, 2025 at 5:56 PM
Reposted by Julian Minder
background: the technique here is "model-diffing" introduced by @anthropic.com just 8 weeks ago and quickly replicated by others. this includes an open source @hf.co model release by @butanium.bsky.social and @jkminder.bsky.social which I'm using. transformer-circuits.pub/2024/crossco...
Sparse Crosscoders for Cross-Layer Features and Model Diffing
transformer-circuits.pub
December 22, 2024 at 6:46 AM
Reposted by Julian Minder
New @acm-cscw.bsky.social paper, new content moderation paradigm.

Post Guidance lets moderators prevent rule-breaking by triggering interventions as users write posts!

We implemented PG on Reddit and tested it in a massive field experiment (n=97k). It became a feature!

arxiv.org/abs/2411.16814
November 27, 2024 at 2:20 PM
Can we understand and control how language models balance context and prior knowledge? Our latest paper shows it’s all about a 1D knob! 🎛️
arxiv.org/abs/2411.07404

Co-led with @kevdududu.bsky.social; with @niklasstoehr.bsky.social, Giovanni Monea, @wendlerc.bsky.social, Robert West & Ryan Cotterell.
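For readers wondering what a "1D knob" could look like mechanically, here is a hypothetical steering sketch, not the paper's method: add a scaled single direction to the residual stream via a forward hook. Model, layer, and direction below are placeholders; the real knob would be a direction identified from the model, not random noise as here.

```python
# Hypothetical "1D knob" sketch: steer generation along one residual-stream direction.
# Model, layer index, and the direction itself are placeholders, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

d = torch.randn(model.config.n_embd)  # stand-in for a learned context-vs-prior direction
d = d / d.norm()
alpha = 5.0                           # knob setting: how far to push along the direction

def steer(_module, _inputs, output):
    hidden = output[0]                # residual stream after this block
    return (hidden + alpha * d,) + output[1:]

layer = model.transformer.h[6]        # placeholder layer choice
handle = layer.register_forward_hook(steer)
out = model.generate(**tok("The Eiffel Tower is in", return_tensors="pt"), max_new_tokens=10)
handle.remove()
print(tok.decode(out[0]))
```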
November 22, 2024 at 3:49 PM
Reposted by Julian Minder
In case you also wondered how to derive the maximal update parametrisation (muP) learning rate for Adam, I did a short write-up: tinyurl.com/mup-for-adam. Thanks Ilia Badanin and Eugene Golikov for your help on this.
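As a rough illustration of the headline prescription (my summary, not the write-up itself): with Adam under muP, the learning rate of hidden "matrix-like" weights shrinks like 1/width as you scale the model up. The sketch below only handles the square hidden matrices and leaves input/output layers at the base rate for simplicity; widths and rates are made up.

```python
# Rough sketch of the muP-for-Adam headline rule: hidden weight LRs scale ~ 1/width.
# Widths, learning rates, and the parameter grouping are illustrative assumptions;
# see the linked write-up for the derivation and the input/output-layer rules.
import torch
import torch.nn as nn

base_width, width = 256, 1024
base_lr = 3e-4

model = nn.Sequential(
    nn.Linear(128, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),  # hidden layer: both dims grow with width
    nn.Linear(width, 10),
)

# Treat square width-by-width matrices as the "matrix-like" hidden weights.
hidden_w = [p for p in model.parameters() if p.ndim == 2 and p.shape == (width, width)]
other = [p for p in model.parameters() if not any(p is q for q in hidden_w)]

opt = torch.optim.Adam([
    {"params": other, "lr": base_lr},                          # simplification: left at base LR
    {"params": hidden_w, "lr": base_lr * base_width / width},  # 1/width scaling for hidden weights
])
```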
November 20, 2024 at 12:02 PM
Reposted by Julian Minder
If you’re interested in mechanistic interpretability, I just found this starter pack and wanted to boost it (thanks for creating it @butanium.bsky.social !). Excited to have a mech interp community on bluesky 🎉

go.bsky.app/LisK3CP
November 19, 2024 at 12:28 AM
Reposted by Julian Minder
Hey, @bsky.app @support.bsky.team, is there a way for you to shorten the displayed usernames when trailed by “bsky.social”? If someone has some other domain name, then fine, show that, but if we're using the default domain, can we get rid of this lengthy string of characters?
November 18, 2024 at 8:29 PM
Reposted by Julian Minder
Trying to bring ML/NLP/et al. people from ETH Zürich together. Ping me to add you. 🙂
bsky.app/starter-pack...
November 18, 2024 at 10:51 AM