Julian Minder
@jkminder.bsky.social
PhD student at EPFL with Robert West; MSc from ETH Zürich

Mainly interested in Language Model Interpretability and Model Diffing.

MATS 7.0 Winter 2025 Scholar w/ Neel Nanda

jkminder.ch
Pinned
New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what it was finetuned for. And so can our interpretability agent! 🧵
October 20, 2025 at 3:11 PM
Can we interpret what happens during finetuning? Yes, at least for narrow domains! Narrow finetuning leaves traces behind. By comparing activations before and after finetuning, we can interpret these traces, even with an agent! We interpret subliminal learning, emergent misalignment, and more.
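A minimal sketch of that activation-diffing idea, not the paper's code: load a base model and a (hypothetical) finetuned checkpoint, run the same prompts through both, and look at where the residual-stream activations diverge. Model names and prompts below are placeholders.

```python
# Hedged sketch: compare hidden activations of a base vs. finetuned model.
# Both model names and the prompts are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "Qwen/Qwen2.5-0.5B"              # placeholder base model
ft_name = "your-org/qwen2.5-0.5b-narrow-ft"  # hypothetical finetuned checkpoint

tok = AutoTokenizer.from_pretrained(base_name)
tok.pad_token = tok.pad_token or tok.eos_token
base = AutoModelForCausalLM.from_pretrained(base_name)
ft = AutoModelForCausalLM.from_pretrained(ft_name)

prompts = ["The capital of France is", "My favourite animal is the"]
batch = tok(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    h_base = base(**batch, output_hidden_states=True).hidden_states
    h_ft = ft(**batch, output_hidden_states=True).hidden_states

# Per-layer norm of the mean activation difference: a crude picture of where
# finetuning "left traces" in the residual stream.
for layer, (hb, hf) in enumerate(zip(h_base, h_ft)):
    diff = (hf - hb).mean(dim=(0, 1))  # average over batch and sequence positions
    print(f"layer {layer:2d}: ||mean diff|| = {diff.norm().item():.4f}")
```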
September 5, 2025 at 12:21 PM
Very cool initiative!
The next generation of open LLMs should be inclusive, compliant, and multilingual by design. That’s why we (@icepfl.bsky.social, @ethz.ch, @cscsch.bsky.social) built Apertus.
EPFL, ETH Zurich & CSCS just released Apertus, Switzerland’s first fully open-source large language model.
Trained on 15T tokens in 1,000+ languages, it’s built for transparency, responsibility & the public good.

Read more: actu.epfl.ch/news/apertus...
September 3, 2025 at 9:44 AM
Causal Abstraction, the theory behind DAS, tests whether a network realizes a given algorithm. We show (w/ @denissutter.bsky.social, T. Hofmann, @tpimentel.bsky.social) that the theory collapses without the linear representation hypothesis, a problem we call the non-linear representation dilemma.
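For context, the basic test underlying causal abstraction (and DAS) is an interchange intervention: swap part of the network's internal state computed on one input into a run on another input, and check whether the output changes the way the high-level algorithm says it should. Below is a toy, made-up illustration; the model, inputs, and patched subspace are all placeholders, and DAS would learn the subspace rather than fix it.

```python
# Toy sketch of an interchange intervention, the basic test behind causal abstraction.
# Model, inputs, and the patched subspace are made up for illustration only.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

x_base = torch.randn(1, 4)     # "base" input
x_source = torch.randn(1, 4)   # "source" input providing the swapped-in value

h_base = model[1](model[0](x_base))      # hidden activations on the base input
h_source = model[1](model[0](x_source))  # hidden activations on the source input

# Swap a fixed 3-dimensional subspace of the hidden layer from source into base.
# DAS would instead *learn* a rotation that picks this subspace out.
idx = torch.tensor([0, 1, 2])
h_patched = h_base.clone()
h_patched[:, idx] = h_source[:, idx]

out = model[2](h_patched)
print("output under interchange intervention:", out)
# Causal abstraction asks: does this output match what the high-level algorithm
# predicts when the corresponding high-level variable is swapped the same way?
```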
July 17, 2025 at 10:57 AM
Reposted by Julian Minder
In this new paper, w/ @denissutter.bsky.social, @jkminder.bsky.social, and T. Hofmann, we study *causal abstraction*, a formal specification of when a deep neural network (DNN) implements an algorithm. This is the framework behind, e.g., distributed alignment search.

Paper: arxiv.org/abs/2507.08802
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
The concept of causal abstraction got recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level ...
arxiv.org
July 14, 2025 at 12:15 PM
Reposted by Julian Minder
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No!⚠️ In our new paper, we show many mech interp methods implicitly rely on the linear representation hypothesis🧵
July 14, 2025 at 12:15 PM
With @butanium.bsky.social and @neelnanda.bsky.social we've just published a post on model diffing that extends our previous paper.
Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
June 30, 2025 at 9:02 PM
In our most recent work, we looked at how to best leverage crosscoders to identify representational differences between base and chat models. We find many cool things, e.g., a knowledge-boundary latent, a detailed-info latent, and a humor/joke detection latent.
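For intuition, here is a toy crosscoder-style sketch with the architecture simplified (a single encoder over concatenated activations rather than the actual training recipe; sizes and data are made up): a shared sparse code reconstructs both models' activations, so latents whose two decoder directions differ sharply point at chat-specific structure.

```python
# Simplified crosscoder sketch (assumed architecture, not the paper's exact setup):
# a shared sparse code jointly reconstructs base- and chat-model activations.
import torch
import torch.nn as nn

class CrossCoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(2 * d_model, d_hidden)       # reads both models' activations
        self.dec_base = nn.Linear(d_hidden, d_model)      # decoder for the base model
        self.dec_chat = nn.Linear(d_hidden, d_model)      # decoder for the chat model

    def forward(self, a_base, a_chat):
        f = torch.relu(self.enc(torch.cat([a_base, a_chat], dim=-1)))
        return self.dec_base(f), self.dec_chat(f), f

cc = CrossCoder(d_model=768, d_hidden=8192)
a_base = torch.randn(32, 768)   # stand-in activations from the base model
a_chat = torch.randn(32, 768)   # stand-in activations from the chat model

rb, rc, f = cc(a_base, a_chat)
# Reconstruction for both models plus an L1 sparsity penalty on the shared code.
loss = ((rb - a_base) ** 2).mean() + ((rc - a_chat) ** 2).mean() + 1e-3 * f.abs().mean()
loss.backward()
```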
April 7, 2025 at 5:56 PM
Reposted by Julian Minder
background: the technique here is "model-diffing" introduced by @anthropic.com just 8 weeks ago and quickly replicated by others. this includes an open source @hf.co model release by @butanium.bsky.social and @jkminder.bsky.social which I'm using. transformer-circuits.pub/2024/crossco...
Sparse Crosscoders for Cross-Layer Features and Model Diffing
transformer-circuits.pub
December 22, 2024 at 6:46 AM
Reposted by Julian Minder
New @acm-cscw.bsky.social paper, new content moderation paradigm.

Post Guidance lets moderators prevent rule-breaking by triggering interventions as users write posts!

We implemented PG on Reddit and tested it in a massive field experiment (n=97k). It became a feature!

arxiv.org/abs/2411.16814
November 27, 2024 at 2:20 PM
Can we understand and control how language models balance context and prior knowledge? Our latest paper shows it’s all about a 1D knob! 🎛️
arxiv.org/abs/2411.07404

Co-led with @kevdududu.bsky.social; with @niklasstoehr.bsky.social, Giovanni Monea, @wendlerc.bsky.social, Robert West & Ryan Cotterell.
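For readers wondering what a "1D knob" could look like mechanically, here is a hypothetical steering sketch, not the paper's method: add a scaled single direction to the residual stream via a forward hook. Model, layer, and direction below are placeholders; the real knob would be a direction identified from the model, not random noise as here.

```python
# Hypothetical "1D knob" sketch: steer generation along one residual-stream direction.
# Model, layer index, and the direction itself are placeholders, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

d = torch.randn(model.config.n_embd)  # stand-in for a learned context-vs-prior direction
d = d / d.norm()
alpha = 5.0                           # knob setting: how far to push along the direction

def steer(_module, _inputs, output):
    hidden = output[0]                # residual stream after this block
    return (hidden + alpha * d,) + output[1:]

layer = model.transformer.h[6]        # placeholder layer choice
handle = layer.register_forward_hook(steer)
out = model.generate(**tok("The Eiffel Tower is in", return_tensors="pt"), max_new_tokens=10)
handle.remove()
print(tok.decode(out[0]))
```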
November 22, 2024 at 3:49 PM
Reposted by Julian Minder
In case you also wondered how to derive the maximal update parametrisation (muP) learning rate for Adam, I did a short write-up: tinyurl.com/mup-for-adam. Thanks Ilia Badanin and Eugene Golikov for your help on this.
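As a rough illustration of the headline prescription (my summary, not the write-up itself): with Adam under muP, the learning rate of hidden "matrix-like" weights shrinks like 1/width as you scale the model up. The sketch below only handles the square hidden matrices and leaves input/output layers at the base rate for simplicity; widths and rates are made up.

```python
# Rough sketch of the muP-for-Adam headline rule: hidden weight LRs scale ~ 1/width.
# Widths, learning rates, and the parameter grouping are illustrative assumptions;
# see the linked write-up for the derivation and the input/output-layer rules.
import torch
import torch.nn as nn

base_width, width = 256, 1024
base_lr = 3e-4

model = nn.Sequential(
    nn.Linear(128, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),  # hidden layer: both dims grow with width
    nn.Linear(width, 10),
)

# Treat square width-by-width matrices as the "matrix-like" hidden weights.
hidden_w = [p for p in model.parameters() if p.ndim == 2 and p.shape == (width, width)]
other = [p for p in model.parameters() if not any(p is q for q in hidden_w)]

opt = torch.optim.Adam([
    {"params": other, "lr": base_lr},                          # simplification: left at base LR
    {"params": hidden_w, "lr": base_lr * base_width / width},  # 1/width scaling for hidden weights
])
```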
November 20, 2024 at 12:02 PM
Reposted by Julian Minder
If you’re interested in mechanistic interpretability, I just found this starter pack and wanted to boost it (thanks for creating it @butanium.bsky.social !). Excited to have a mech interp community on bluesky 🎉

go.bsky.app/LisK3CP
November 19, 2024 at 12:28 AM
Reposted by Julian Minder
Hey, @bsky.app @support.bsky.team, is there a way for you to shorten the displayed usernames when trailed by “bsky.social”? If someone has some other domain name, then fine, show that, but if we're using the default domain, can we get rid of this lengthy string of characters?
November 18, 2024 at 8:29 PM
Reposted by Julian Minder
Trying to bring ML/NLP/et al. people from ETH Zürich together. Ping me to add you. 🙂
bsky.app/starter-pack...
November 18, 2024 at 10:51 AM