@nikhil07prakash.bsky.social
Joint work with @natalieshapira.bsky.social, @arnabsensharma.bsky.social, @criedl.bsky.social, @boknilev.bsky.social, @tamarott.bsky.social, @davidbau.bsky.social, and Atticus Geiger.
June 24, 2025 at 5:15 PM
Find more details, including the causal intervention experiments and subspace analysis, in the paper. Links to code and data are available on our website.
🌐: belief.baulab.info
📜: arxiv.org/abs/2505.14685
Language Models use Lookbacks to Track Beliefs
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM)...
arxiv.org
June 24, 2025 at 5:15 PM
We expect the Lookback mechanism to extend beyond belief tracking: the concept of marking vital tokens seems universal across tasks requiring in-context manipulation. This mechanism could be fundamental to how LMs handle complex logical reasoning with conditionals.
June 24, 2025 at 5:15 PM
We found that the LM generates a Visibility ID at the visibility sentence, which serves as the source info. Its address copy stays in place, while a pointer copy flows to later lookback tokens. There, a QK-circuit dereferences the pointer to fetch info about the observed character as the payload.
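A minimal sketch of this Visibility Lookback in plain Python (purely illustrative; the names `vis_id`, `source`, and the story details are made up for the example, and the real mechanism operates on vectors, not dicts):

```python
# Toy sketch of the Visibility Lookback: the visibility sentence deposits a
# Visibility ID whose address copy stays in place; a pointer copy flows to
# later lookback tokens, where it is dereferenced to fetch info about the
# observed character as the payload.

source = {}                                 # in-place storage, keyed by address
vis_id = "vis-1"                            # Visibility ID from "Carla can see Bob"
source[vis_id] = {"observed": "Bob", "state": "beer"}   # payload at the source

pointer = vis_id                            # pointer copy at a later lookback token
payload = source[pointer]                   # dereference: pointer matches address

# Carla saw Bob's action, so her belief is updated with the fetched payload.
carla_belief = payload["state"]
assert carla_belief == "beer"
```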
June 24, 2025 at 5:15 PM
Next, we studied how providing an explicit visibility condition affects characters' beliefs.
June 24, 2025 at 5:14 PM
We test our high-level causal model with targeted interventions:
- Patching the answer lookback pointer flips the final output from coffee to beer (pink line)
- Patching the answer lookback payload shifts it from coffee to tea (grey line)
Strong evidence that the Answer Lookback mechanism is real!
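The patching logic above can be sketched with a toy model (not the paper's code; the two-step `model` function and the coffee/beer prompts are stand-ins for a real transformer and real activations):

```python
# Illustrative activation-patching sketch: cache an intermediate "activation"
# from a counterfactual run, replay the original run with that activation
# swapped in, and check whether the output flips.

def model(prompt, patch=None):
    # toy "early layer": an intermediate representation (here, the last word)
    hidden = prompt.split()[-1]
    if patch is not None:        # intervention: overwrite the activation
        hidden = patch
    # toy "late layer": readout maps the representation to an answer
    return {"coffee": "coffee", "beer": "beer", "tea": "tea"}[hidden]

original = "Bob's cup holds coffee"
counterfactual = "Bob's cup holds beer"

cached = counterfactual.split()[-1]   # cache from the counterfactual run
assert model(original) == "coffee"                # clean run
assert model(original, patch=cached) == "beer"    # patched run flips the output
```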
June 24, 2025 at 5:14 PM
Step 3: The LM now uses the state OI at the last token as a pointer and its in-place copy as an address to look back to the correct state token, this time fetching its token value (e.g., "beer") as the payload, which is predicted as the final output.
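As a toy sketch of this Answer Lookback (illustrative only; representing OIs as tuples and the lookup as a dict is our simplification, not the actual circuit):

```python
# The state OI at the last token acts as a pointer, its in-place copy at the
# state token acts as the address, and dereferencing fetches the state's
# token value as the payload.

states = {("state", 1): "coffee", ("state", 2): "beer"}  # token value by OI address
pointer = ("state", 2)           # state OI carried to the last token
answer = states[pointer]         # dereference at the matching address
assert answer == "beer"          # predicted as the final output
```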
June 24, 2025 at 5:14 PM
Step 2: The LM binds the character-object-state triplet by copying their OIs (source) to the state token. These OIs also flow to the last token via the corresponding query tokens (pointer). Next, the LM uses both copies to attend from the last token to the correct state and fetch its state OI (payload).
June 24, 2025 at 5:14 PM
Step 1: The LM maps each vital token (character, object, state) to an abstract Ordering ID (OI), a reference that marks it as the first or second of its type, regardless of the actual token.
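A small sketch of what such Ordering IDs compute (our own illustration; the helper `ordering_ids` and the token/type names are invented for the example):

```python
# Toy Ordering IDs: each character/object/state mention is tagged as the
# first or second of its type, independent of the actual token identity.
from collections import defaultdict

def ordering_ids(tokens, token_type):
    seen = defaultdict(int)
    ids = {}
    for tok in tokens:
        t = token_type[tok]
        seen[t] += 1
        ids[tok] = (t, seen[t])   # e.g. ("character", 1) for the first character
    return ids

token_type = {"Bob": "character", "Carla": "character",
              "cup": "object", "glass": "object",
              "coffee": "state", "beer": "state"}
oi = ordering_ids(["Bob", "cup", "coffee", "Carla", "glass", "beer"], token_type)
assert oi["Bob"] == ("character", 1) and oi["Carla"] == ("character", 2)
assert oi["beer"] == ("state", 2)
```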
June 24, 2025 at 5:14 PM
Using the Causal Abstraction framework, we formulated a precise hypothesis that explains the end-to-end mechanism the model uses to perform this task.
June 24, 2025 at 5:14 PM
Tracing key tokens shows that the correct state (e.g., beer) flows directly to the final token at later layers. Meanwhile, info about the query character and object is retrieved from earlier mentions and passed to the final token before being replaced by the correct state token.
June 24, 2025 at 5:14 PM
Here is how it works: the model duplicates key info across two tokens, letting later attention heads look back at earlier ones to retrieve it, rather than passing it forward directly. Like leaving a breadcrumb trail in context.
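The breadcrumb idea can be sketched numerically (a hand-made toy, assuming 4-dimensional vectors and one attention step; the real model's addresses and payloads are learned, not constructed like this):

```python
import numpy as np

# Address written in place at the source token; a pointer copy travels forward.
address = np.array([1.0, 0.0, 0.0, 1.0])
pointer = address.copy()

# Keys at three earlier positions; position 1 holds the matching address copy.
keys = np.array([
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],   # the source token's address
    [0.0, 1.0, 0.0, 0.0],
])
payloads = np.array([10.0, 20.0, 30.0])  # info stored at each position

# QK-style dereference: the pointer's dot product peaks at the address.
scores = keys @ pointer
attn = np.exp(scores) / np.exp(scores).sum()
fetched = attn @ payloads                # attention retrieves the payload

assert scores.argmax() == 1              # pointer resolves to the source position
```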
June 24, 2025 at 5:14 PM
We discovered "Lookback" - a mechanism where language models mark important info while reading, then attend back to those marked tokens when needed later. Since LMs don't know what info will be relevant upfront, this lets them efficiently retrieve key details on demand.
June 24, 2025 at 5:13 PM
We constructed CausalToM, a dataset for causal analysis, featuring simple stories where two characters each separately change the state of two objects, potentially unaware of each other's actions. We ask Llama-3-70B-Instruct to reason about a character's beliefs regarding the state of an object. E.g.:
June 24, 2025 at 5:13 PM
Since Theory of Mind (ToM) is fundamental to social intelligence, numerous works have benchmarked this capability in LMs. However, the internal mechanisms responsible for solving (or failing to solve) such tasks remain unexplored...
June 24, 2025 at 5:13 PM