This framework offers a new way to probe and reason about attention!
📄 Paper: “Attention (as Discrete-Time Markov) Chains”
🔗 yoterel.github.io/attention_ch...
👥 Yotam Erel*, @oduenkel.bsky.social*, Rishabh Dabral, Vlad Golyanik, Christian Theobalt, Amit Bermano
*denotes equal contribution
This reinterpretation yields results:
✅ State-of-the-art zero-shot segmentation
✅ Cleaner, sharper attention visualizations
✅ Better unconditional image generation
All without extra training—just a different perspective.
We define TokenRank:
The steady-state distribution of the attention Markov chain.
Like PageRank—but for tokens.
It measures global token importance: not just who is attended to directly, but who receives attention indirectly through other tokens.
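Here is a minimal sketch of the idea (illustrative code, not the authors' implementation): TokenRank is the stationary distribution of a row-stochastic attention matrix, which a short power iteration recovers. The helper name and the damping term below are assumptions made for the example.

```python
# Illustrative sketch: TokenRank as the stationary distribution of a
# row-stochastic attention matrix, computed by power iteration.
import numpy as np

def token_rank(attn, damping=0.95, iters=1000, tol=1e-10):
    """attn: (n, n) attention matrix whose rows sum to 1 (softmax output)."""
    n = attn.shape[0]
    # PageRank-style damping keeps the chain irreducible and aperiodic,
    # so a unique stationary distribution exists.
    P = damping * attn + (1.0 - damping) / n
    pi = np.full(n, 1.0 / n)           # start from the uniform distribution
    for _ in range(iters):
        nxt = pi @ P                   # one step of the Markov chain
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    return pi / pi.sum()               # global importance score per token

# Toy usage: scores sum to 1; tokens that receive attention, directly or
# through intermediaries, get higher TokenRank.
A = np.random.rand(6, 6)
A = A / A.sum(axis=1, keepdims=True)
print(token_rank(A))
```

The damping mirrors PageRank and simply guarantees a unique steady state; with a well-behaved attention map it barely changes the result.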
💡 Here’s the golden insight:
In practice, attention tends to linger among semantically similar tokens.
These are metastable states—regions where attention circulates before escaping.
Modeling this lets us filter noise and highlight meaningful structure.
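To make the metastability concrete, here is a toy, hand-built attention matrix (an assumed example, not data from the paper) with two tightly connected token groups: probability mass placed on one token circulates inside its group for many steps before mixing into the rest of the chain.

```python
# Toy illustration of a metastable region: attention lingers inside a
# cluster of strongly interacting tokens before slowly leaking out.
import numpy as np

n = 6
A = np.full((n, n), 0.02)
A[:3, :3] = 0.4          # strong attention among tokens 0-2 (one group)
A[3:, 3:] = 0.4          # strong attention among tokens 3-5 (another group)
A = A / A.sum(axis=1, keepdims=True)    # make rows stochastic

p = np.array([1.0, 0, 0, 0, 0, 0])      # all probability mass on token 0
for step in (1, 5, 50):
    q = p @ np.linalg.matrix_power(A, step)
    # Mass remaining inside the first group: high for a while, then it mixes.
    print(step, q[:3].sum().round(3))
```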
But wait 🚨! The transformer was never trained to account for indirect attention; it applies the attention map only once per pass. What gives?
We interpret each attention matrix as a discrete-time Markov chain,
where:
Tokens = states
Attention weights = transition probabilities
This reframes attention as a dynamic process, not just a static lookup.
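A minimal sketch of that reading, assuming a standard softmax attention map: the rows already sum to 1, so the matrix is a valid transition matrix as-is.

```python
# A softmax attention matrix is row-stochastic, so it is already the
# transition matrix of a discrete-time Markov chain over the tokens.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.randn(5, 5)          # query-key logits for 5 tokens
A = softmax(scores, axis=-1)            # the usual attention map

print(np.allclose(A.sum(axis=1), 1.0))  # True: each row is a probability dist.
# A[i, j] = Pr(next state is token j | current state is token i):
# one application of attention is exactly one step of the chain.
```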
Most attention map analysis is local: we reduce dimensions for visualization using row or column selections, column sums, head averages, etc.
These only capture direct token-to-token interactions.
But what if we also considered indirect effects?
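One way to picture those indirect effects (a hypothetical numpy example, not the paper's code): squaring the row-stochastic attention map sums over all intermediate tokens, exposing two-hop routes that a single row or column read would miss.

```python
# One application of the attention map captures only direct interactions;
# multiplying the matrix by itself exposes indirect, multi-hop routes.
import numpy as np

A = np.random.rand(5, 5)
A = A / A.sum(axis=1, keepdims=True)    # row-stochastic attention map

direct  = A          # 1-hop: who attends to whom directly
two_hop = A @ A      # 2-hop: attention routed through an intermediary
# Entry (i, j) of A @ A sums A[i, k] * A[k, j] over all intermediate tokens k.
print(direct[0, 3], two_hop[0, 3])
```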