This framework offers a new way to probe and reason about attention!
📄 Paper: “Attention (as Discrete-Time Markov) Chains”
🔗 yoterel.github.io/attention_ch...
👥 Yotam Erel*, @oduenkel.bsky.social*, Rishabh Dabral, Vlad Golyanik, Christian Theobalt, Amit Bermano
*denotes equal contribution
This reinterpretation yields results:
✅ State-of-the-art zero-shot segmentation
✅ Cleaner, sharper attention visualizations
✅ Better unconditional image generation
All without extra training—just a different perspective.
We define TokenRank:
The steady-state distribution of the attention Markov chain.
Like PageRank—but for tokens.
It measures global token importance: not just who is attended to directly, but who receives attention indirectly through other tokens.
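Here is a minimal sketch of the idea (illustrative code, not the authors' implementation): TokenRank is the stationary distribution of a row-stochastic attention matrix, which a short power iteration recovers. The helper name and the damping term below are assumptions made for the example.

```python
# Illustrative sketch: TokenRank as the stationary distribution of a
# row-stochastic attention matrix, computed by power iteration.
import numpy as np

def token_rank(attn, damping=0.95, iters=1000, tol=1e-10):
    """attn: (n, n) attention matrix whose rows sum to 1 (softmax output)."""
    n = attn.shape[0]
    # PageRank-style damping keeps the chain irreducible and aperiodic,
    # so a unique stationary distribution exists.
    P = damping * attn + (1.0 - damping) / n
    pi = np.full(n, 1.0 / n)           # start from the uniform distribution
    for _ in range(iters):
        nxt = pi @ P                   # one step of the Markov chain
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    return pi / pi.sum()               # global importance score per token

# Toy usage: scores sum to 1; tokens that receive attention, directly or
# through intermediaries, get higher TokenRank.
A = np.random.rand(6, 6)
A = A / A.sum(axis=1, keepdims=True)
print(token_rank(A))
```

The damping mirrors PageRank and simply guarantees a unique steady state; with a well-behaved attention map it barely changes the result.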
💡 Here’s the golden insight:
In practice, attention tends to linger among semantically similar tokens.
These are metastable states—regions where attention circulates before escaping.
Modeling this lets us filter noise and highlight meaningful structure.
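To make the metastability concrete, here is a toy, hand-built attention matrix (an assumed example, not data from the paper) with two tightly connected token groups: probability mass placed on one token circulates inside its group for many steps before mixing into the rest of the chain.

```python
# Toy illustration of a metastable region: attention lingers inside a
# cluster of strongly interacting tokens before slowly leaking out.
import numpy as np

n = 6
A = np.full((n, n), 0.02)
A[:3, :3] = 0.4          # strong attention among tokens 0-2 (one group)
A[3:, 3:] = 0.4          # strong attention among tokens 3-5 (another group)
A = A / A.sum(axis=1, keepdims=True)    # make rows stochastic

p = np.array([1.0, 0, 0, 0, 0, 0])      # all probability mass on token 0
for step in (1, 5, 50):
    q = p @ np.linalg.matrix_power(A, step)
    # Mass remaining inside the first group: high for a while, then it mixes.
    print(step, q[:3].sum().round(3))
```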
But wait 🚨! The transformer was never trained to account for indirect attention; it applies the attention map only once per pass. What gives?
We interpret each attention matrix as a discrete-time Markov chain,
where:
Tokens = states
Attention weights = transition probabilities
This reframes attention as a dynamic process, not just a static lookup.
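A minimal sketch of that reading, assuming a standard softmax attention map: the rows already sum to 1, so the matrix is a valid transition matrix as-is.

```python
# A softmax attention matrix is row-stochastic, so it is already the
# transition matrix of a discrete-time Markov chain over the tokens.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.randn(5, 5)          # query-key logits for 5 tokens
A = softmax(scores, axis=-1)            # the usual attention map

print(np.allclose(A.sum(axis=1), 1.0))  # True: each row is a probability dist.
# A[i, j] = Pr(next state is token j | current state is token i):
# one application of attention is exactly one step of the chain.
```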
Most attention map analysis is local: we reduce dimensions for visualization using row or column selections, column sums, head averages, etc.
These only capture direct token-to-token interactions.
But what if we also considered indirect effects?
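One way to picture those indirect effects (a hypothetical numpy example, not the paper's code): squaring the row-stochastic attention map sums over all intermediate tokens, exposing two-hop routes that a single row or column read would miss.

```python
# One application of the attention map captures only direct interactions;
# multiplying the matrix by itself exposes indirect, multi-hop routes.
import numpy as np

A = np.random.rand(5, 5)
A = A / A.sum(axis=1, keepdims=True)    # row-stochastic attention map

direct  = A          # 1-hop: who attends to whom directly
two_hop = A @ A      # 2-hop: attention routed through an intermediary
# Entry (i, j) of A @ A sums A[i, k] * A[k, j] over all intermediate tokens k.
print(direct[0, 3], two_hop[0, 3])
```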