David Bau
@davidbau.bsky.social
Interpretable Deep Networks. http://baulab.info/ @davidbau
When you read the paper, be sure to check out the appendix where @arnab_api discusses how pointer and value data are entangled in filters.
And possible applications of the filter mechanism, such as a zero-shot "lie detector" that can flag incorrect statements in ordinary text.
November 6, 2025 at 2:00 PM
Curiously, when the question precedes the list of candidates, there is an abstract predicate for "this is the answer I am looking for" that tags items in a list as soon as they are seen.
Arnab Sen Sharma (@arnabsensharma.bsky.social)
When the question is presented *after* the options, filter heads can achieve high causality scores across language and format changes! This suggests that the encoded predicate is robust against such perturbations.
November 6, 2025 at 2:00 PM
The neural representations for LLM filter heads are language independent!
If we pick up the representation for a question in French, it will accurately match items expressed in the Thai language.
November 6, 2025 at 2:00 PM
Arnab calls predicate attention heads "filter heads" because the same heads filter many properties across objects, people, and landmarks.
The generic structure resembles functional programming's "filter" function, with a common mechanism handling a wide range of predicates.
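The analogy can be made concrete with a toy sketch (my illustration of the functional-programming parallel, not the model's actual mechanism): one generic filter routine, with many interchangeable predicates plugged in.

```python
# Toy illustration of the "filter" analogy: a single generic mechanism
# applies whatever predicate it is handed, across any semantic type.
def filter_items(predicate, items):
    """Generic filter: keep only the items satisfying the predicate."""
    return [x for x in items if predicate(x)]

landmarks = [
    {"name": "Eiffel Tower", "country": "France"},
    {"name": "Taj Mahal", "country": "India"},
    {"name": "Louvre", "country": "France"},
]

# The same mechanism handles a wide range of predicates.
in_france = filter_items(lambda x: x["country"] == "France", landmarks)
print([x["name"] for x in in_france])  # → ['Eiffel Tower', 'Louvre']
```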
Arnab Sen Sharma (@arnabsensharma.bsky.social)
🔍 In Llama-70B and Gemma-27B, we found special attention heads that consistently focus their attention on the filtered items. This behavior seems consistent across a range of different formats and semantic types.
November 6, 2025 at 2:00 PM
How embarrassing for me and confusing to the LLM!
OK, here it is fixed. A nice thing about the workbench is that it takes just a second to edit the prompt, and you can see how the LLM responds, now deciding very early that it should be ':'
October 11, 2025 at 2:21 PM
... @wendlerc.bsky.social and @sfeucht.bsky.social ....
October 11, 2025 at 12:25 PM
Help me thank the NDIF team for rolling out workbench.ndif.us/ by using it to make your own discoveries inside LLM internals. We should all be looking inside our LLMs.
Share the tool! Share what you find!
And send the team feedback -
NDIF Team (@ndif-team.bsky.social)
This is a public beta, so we expect bugs and actively want your feedback: https://forms.gle/WsxmZikeLNw34LBV9
October 11, 2025 at 12:02 PM
That process was noticed by @wendlerch in arxiv.org/abs/2402.10588 and studied by @sheridan_feucht in dualroute.baulab.info
Try it out yourself on workbench.ndif.us/.
Does it work with other words? Can you find interesting exceptions? How about prompts beyond translation?
October 11, 2025 at 12:02 PM
The lens reveals: the model does NOT go directly from amore to "amor" or "amour" by just dropping or adding letters!
Instead it first "thinks" about the (English) word "love".
In other words: LLMs translate using *concepts*, not tokens.
October 11, 2025 at 12:02 PM
Enter a translation prompt: "Italiano: amore, Español: amor, François:".
The workbench doesn't just show you the model's output. It shows the grid of internal states that lead to the output. Researchers call this visualization the "logit lens".
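For intuition, here is a minimal numpy sketch of the logit-lens idea (made-up arrays standing in for a real model's hidden states and unembedding matrix): each layer's residual state is decoded as if it were the final one, revealing the model's "best guess" at every depth.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_layers = 8, 5, 3

# Toy stand-ins: one hidden state per layer, plus an unembedding matrix.
hidden_states = [rng.normal(size=d_model) for _ in range(n_layers)]
W_U = rng.normal(size=(d_model, vocab))

# Logit lens: decode EVERY layer's residual state, not just the last.
for layer, h in enumerate(hidden_states):
    logits = h @ W_U                    # project into vocabulary space
    top_token = int(np.argmax(logits))  # this layer's current best guess
    print(f"layer {layer}: top token id = {top_token}")
```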
October 11, 2025 at 12:02 PM
But why theorize? We can actually look at what it does.
Visit the NDIF workbench here: workbench.ndif.us/, and pull up any LLM that can translate, like GPT-J-6b. If you register an account you can access larger models.
NDIF Team (@ndif-team.bsky.social)
Ever wished you could explore what's happening inside a 405B parameter model without writing any code? Workbench, our AI interpretability interface, is now live for public beta at workbench.ndif.us!
October 11, 2025 at 12:02 PM
And kudos to @ndif-team.bsky.social for keeping up with weekly youtube video posts on AI interpretability!
www.youtube.com/@NDIFTeam
NDIF Team
We're a research computing project cracking open the mysteries inside large-scale AI systems.
The NSF National Deep Inference Fabric consists of a unique combination of hardware and software that pr...
October 3, 2025 at 6:53 PM
At UT we just got to hear about this in a Zoom talk from @sfeucht.bsky.social. I echo the endorsement:
Cool ideas about representations in LLMs with linguistic relevance!
Who is going to be at #COLM2025?
I want to draw your attention to a COLM paper by my student @sfeucht.bsky.social that has totally changed the way I think and teach about LLM representations. The work is worth knowing.
And you can meet Sheridan at COLM, Oct 7!
September 28, 2025 at 12:46 AM
Read more at arxiv.org/abs/2504.03022 <- at COLM
footprints.baulab.info <- token context erasure
arithmetic.baulab.info <- concept parallelograms
dualroute.baulab.info <- the second induction route, with a neat Colab notebook.
@ericwtodd.bsky.social @byron.bsky.social @diatkinson.bsky.social
The Dual-Route Model of Induction
Do LLMs copy meaningful text by rote or by understanding meaning? Webpage for The Dual-Route Model of Induction (Feucht et al., 2025).
September 27, 2025 at 8:54 PM
The takeaway for me: LLMs separate their token processing from their conceptual processing, akin to humans' dual-route processing of speech.
We need to be aware when an LM is thinking about tokens or concepts.
They do both, and it makes a difference which way it's thinking.
September 27, 2025 at 8:54 PM
If token-processing and concept-processing are largely separate, does killing one kill the other? Chris Olah's team (Olsson et al., 2022) hypothesized that in-context learning (ICL) emerges from token induction.
@keremsahin22.bsky.social + Sheridan are finding cool ways to look into Olah's induction hypothesis too!
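The token-level induction rule hypothesized in Olsson et al. 2022 can be sketched in a few lines (a toy restatement of the algorithm, not the heads' actual weights): find the previous occurrence of the current token and copy whatever followed it.

```python
def induction_predict(tokens):
    """Token-level induction rule: if the current token appeared earlier,
    predict the token that followed its most recent earlier occurrence."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards over history
        if tokens[i] == current:
            return tokens[i + 1]              # copy what came after it
    return None                               # no earlier match: no guess

# "A B C ... A" → an induction head would predict "B".
print(induction_predict(["A", "B", "C", "A"]))  # → B
```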
September 27, 2025 at 8:54 PM
The representation space within the concept induction heads also has a more "meaningful" geometry than the transformer as a whole.
Sheridan discovered (NeurIPS mechint 2025) that semantic vector arithmetic works better in this space. (Token semantics work in token space.)
arithmetic.baulab.info/
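As a toy illustration of the parallelogram idea (hand-made 2-D vectors, not representations actually read out of a model), here is the classic king - man + woman ≈ queen test:

```python
import numpy as np

# Hand-made toy vectors on (royalty, gender) axes -- stand-ins for
# representations read from a model's concept space.
vecs = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def nearest(v, exclude=()):
    """Nearest word to v by cosine similarity, skipping excluded words."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vecs if w not in exclude), key=lambda w: cos(v, vecs[w]))

# Parallelogram test: king - man + woman should land near queen.
result = nearest(vecs["king"] - vecs["man"] + vecs["woman"],
                 exclude={"king", "man", "woman"})
print(result)  # → queen
```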
September 27, 2025 at 8:54 PM
If you disable token induction heads and ask the model to copy text with only the concept induction heads, it will NOT copy exactly. It will paraphrase the text.
That happens even for computer code: the model copies the BEHAVIOR of the code, but writes it in a totally different way!
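Mechanically, "disabling" a head can be sketched as zeroing its slice of the concatenated attention output before the output projection (toy shapes and random weights here, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
n_heads, d_head = 4, 3
d_model = n_heads * d_head

# Toy per-head outputs at one token position, concatenated as in
# multi-head attention, then mixed by the output projection W_O.
head_outputs = rng.normal(size=(n_heads, d_head))
W_O = rng.normal(size=(d_model, d_model))

def attn_output(head_outputs, ablate=()):
    """Concatenate head outputs, zeroing any ablated heads first."""
    h = head_outputs.copy()
    for i in ablate:
        h[i] = 0.0          # "disable" head i: it contributes nothing
    return h.reshape(-1) @ W_O

full = attn_output(head_outputs)
ablated = attn_output(head_outputs, ablate=[0, 2])  # e.g. token-induction heads
print(np.allclose(full, ablated))  # → False: the circuit's output changes
```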
September 27, 2025 at 8:54 PM