Aaron Mueller
@amuuueller.bsky.social
Postdoc at Northeastern and incoming Asst. Prof. at Boston U. Working on NLP, interpretability, causality. Previously: JHU, Meta, AWS
Reposted by Aaron Mueller
✨ The schedule for our INTERPLAY workshop at COLM is live! ✨
🗓️ October 10th, Room 518C
🔹 Invited talks from @sarah-nlp.bsky.social, John Hewitt, @amuuueller.bsky.social, and @kmahowald.bsky.social
🔹 Paper presentations and posters
🔹 Closing roundtable discussion
Join us in Montréal! @colmweb.org
October 9, 2025 at 5:30 PM
What's the right unit of analysis for understanding LLM internals? We explore this question in our mech interp survey (a major update to our 2024 manuscript).
We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
October 1, 2025 at 2:03 PM
Reposted by Aaron Mueller
In neuroscience, we often try to understand systems by analyzing their representations — using tools like regression or RSA. But are these analyses biased towards discovering a subset of what a system represents? If you're interested in this question, check out our new commentary! Thread:
August 5, 2025 at 2:36 PM
If you're at #ICML2025, chat with me, @sarah-nlp.bsky.social, Atticus, and others at our poster 11am - 1:30pm at East #1205! We're establishing a 𝗠echanistic 𝗜nterpretability 𝗕enchmark.
We're planning to keep this a living benchmark; come by and share your ideas/hot takes!
July 17, 2025 at 5:45 PM
Reposted by Aaron Mueller
The new "Lookback" paper from @nikhil07prakash.bsky.social contains a surprising insight...
70b/405b LLMs use double pointers, akin to C programmers' double (**) pointers. They show up when the LLM is "knowing what Sally knows Ann knows", i.e., Theory of Mind.
bsky.app/profile/nik...
@nikhil07prakash.bsky.social
How do language models track the mental states of each character in a story (often referred to as Theory of Mind)?
We reverse-engineered how LLaMA-3-70B-Instruct handles a belief-tracking task and found something surprising: it uses mechanisms strikingly similar to pointer variables in C programming!
June 25, 2025 at 3:00 PM
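For readers unfamiliar with the C analogy: a double pointer is a pointer to a pointer, and nested belief ("Sally knows what Ann knows") maps naturally onto double indirection. A minimal sketch of the analogy (the variable names are illustrative, not taken from the paper):

```c
#include <stdio.h>

int main(void) {
    /* The object's actual location. */
    const char *location = "basket";

    /* Ann's belief is a pointer to the location. */
    const char *anns_belief = location;

    /* Sally's model of Ann's belief is a pointer to Ann's pointer:
       a double (**) pointer, the structure the Lookback paper says
       these LLMs' internal mechanisms resemble. */
    const char **sallys_belief = &anns_belief;

    /* Dereferencing answers "what does Sally know Ann knows?" */
    printf("Sally knows Ann thinks the object is in the %s.\n", *sallys_belief);
    return 0;
}
```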
The new "Lookback" paper from @nikhil07prakash.bsky.social contains a surprising insight...
70b/405b LLMs use double pointers, akin to C programmers' double (**) pointers. They show up when the LLM is "knowing what Sally knows Ann knows", i.e., Theory of Mind.
bsky.app/profile/nik...
70b/405b LLMs use double pointers, akin to C programmers' double (**) pointers. They show up when the LLM is "knowing what Sally knows Ann knows", i.e., Theory of Mind.
bsky.app/profile/nik...
SAEs have been found to massively underperform supervised methods for steering neural networks.
In new work led by @danaarad.bsky.social, we find that this problem largely disappears if you select the right features!
Tried steering with SAEs and found that not all features behave as expected?
Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
May 27, 2025 at 5:07 PM
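To make the underlying operation concrete: steering with an SAE feature typically means adding a scaled copy of that feature's decoder direction to a hidden activation; the preprint's point is that which feature you select matters far more than the mechanism itself. A minimal sketch with toy values (illustrative only, not the paper's actual code or models):

```c
#include <stdio.h>

#define D_MODEL 4  /* toy hidden size, for illustration only */

/* Steer a hidden state toward an SAE feature: h <- h + alpha * d,
   where d is the selected feature's decoder direction and alpha
   is the steering strength. */
void steer(float h[D_MODEL], const float d[D_MODEL], float alpha) {
    for (int i = 0; i < D_MODEL; i++)
        h[i] += alpha * d[i];
}

int main(void) {
    float h[D_MODEL] = {0.10f, -0.30f, 0.70f, 0.00f};       /* hidden activation */
    const float d[D_MODEL] = {0.50f, 0.50f, -0.50f, 0.50f}; /* decoder direction */

    steer(h, d, 2.0f);

    for (int i = 0; i < D_MODEL; i++)
        printf("%.2f ", h[i]);
    printf("\n");  /* prints: 1.10 0.70 -0.30 1.00 */
    return 0;
}
```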
Reposted by Aaron Mueller
Tried steering with SAEs and found that not all features behave as expected?
Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
May 27, 2025 at 4:06 PM
Reposted by Aaron Mueller
Couldn’t be happier to have co-authored this with a stellar team, including: Michael Hu, @amuuueller.bsky.social, @alexwarstadt.bsky.social, @lchoshen.bsky.social, Chengxu Zhuang, @adinawilliams.bsky.social, Ryan Cotterell, @tallinzen.bsky.social
May 12, 2025 at 3:48 PM
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?
We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!
April 23, 2025 at 6:15 PM
Lots of work coming soon to @iclr-conf.bsky.social and @naaclmeeting.bsky.social in April/May! Come chat with us about new methods for interpreting and editing LLMs, multilingual concept representations, sentence processing mechanisms, and arithmetic reasoning. 🧵
March 11, 2025 at 2:30 PM
It’s common to assume one task = one static circuit. But this ignores that the important computations depend on position!
We propose a way to find 𝗽𝗼𝘀𝗶𝘁𝗶𝗼𝗻-𝗮𝘄𝗮𝗿𝗲 circuits. (Highlight: using LLMs to help us create multi-token causal abstractions!)
1/13 LLM circuits tell us where the computation happens inside the model—but the computation varies by token position, a key detail often ignored!
We propose a method to automatically find position-aware circuits, improving faithfulness while keeping circuits compact. 🧵👇
March 6, 2025 at 10:28 PM
What can mechanistic interpretability do for computational psycholinguists? @michaelwhanna.bsky.social and I took a stab at this question! We investigate garden path sentence processing in LMs at the feature (circuit) level.
Sentences are partially understood before they're fully read. How do LMs incrementally interpret their inputs?
In a new paper, @amuuueller.bsky.social and I use mech interp tools to study how LMs process structurally ambiguous sentences. We show LMs rely on both syntactic & spurious features! 1/10
December 19, 2024 at 1:56 PM
Reposted by Aaron Mueller
Today we had the pleasure to host @michaelwhanna.bsky.social from @amsterdamnlp.bsky.social for a seminar on his work w/ @amuuueller.bsky.social to interpret incremental sentence processing in language models. Thank you for joining us Michael!
November 22, 2024 at 3:52 PM
Reposted by Aaron Mueller
The babies are now in their beds, but what a year it was!
Highlighting some findings of BabyLM
Architectures & Training objective matter a lot (and got the highest scores)
alphaxiv.org/pdf/2410.24159
🤖
November 19, 2024 at 3:20 PM
Reposted by Aaron Mueller
How do language models organize concepts and their properties? Do they use taxonomies to infer new properties, or infer based on concept similarities? Apparently, both!
🌟 New paper with my fantastic collaborators @amuuueller.bsky.social and @kanishka.bsky.social
November 8, 2024 at 9:40 PM