Anej Svete
@anejsvete.bsky.social
PhD student in NLP at ETH Zurich.

anejsvete.github.io
6/ The work refines the landscape of transformer expressivity and demonstrates that seemingly minor implementation details can have major theoretical consequences for what neural architectures can represent.
May 17, 2025 at 2:32 PM
5/ This might help explain why positional encodings that skew attention toward recent (rightmost) tokens—like ALiBi—work so well in practice. They're compensating for an inherent limitation in conventional attention mechanisms.
May 17, 2025 at 2:32 PM
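For context on the ALiBi mention above, here is a minimal sketch of an ALiBi-style linear bias (the function name, toy size, and slope value are illustrative, not from the thread): each head subtracts a penalty proportional to query-key distance, so the most recent (rightmost) visible positions receive the largest scores.

```python
# Illustrative sketch of an ALiBi-style additive bias; names and values are made up.
import numpy as np

def alibi_bias(n: int, slope: float) -> np.ndarray:
    """Return an (n, n) additive bias: -slope * (i - j) for keys j <= i, -inf for future keys."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    bias = -slope * (i - j).astype(float)
    bias[j > i] = -np.inf  # future masking: queries cannot attend to later positions
    return bias

print(alibi_bias(4, slope=0.5))
# Row i = 3: [-1.5, -1.0, -0.5, 0.0] -> the rightmost (most recent) key gets the largest bias.
```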
4/ Here's why this matters: leftmost-tiebreaking transformers are actually equivalent to soft-attention transformers in terms of expressivity! This suggests they might better approximate real-world transformers than rightmost-tiebreaking models.
May 17, 2025 at 2:32 PM
3/ Specifically, we show that leftmost-tiebreaking models correspond to a strictly weaker fragment of Linear Temporal Logic (LTL). While rightmost tiebreaking enables the full power of LTL, leftmost models are limited to the "past" fragment.
May 17, 2025 at 2:31 PM
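For a sense of what the "past" fragment means (standard LTL notation; these example formulas are illustrative, not taken from the paper): past operators only look backward over the prefix read so far, while full LTL also has forward-looking operators.

```latex
% Standard LTL operators, shown for illustration only; not formulas from the paper.
% Past fragment (backward-looking operators only):
%   "an a has occurred, and no b since then"
\varphi_{\text{past}} = \neg b \;\mathsf{S}\; a   % S = "since"
% Full LTL (also has forward-looking operators):
%   "a holds until b occurs"
\varphi_{\text{full}} = a \;\mathsf{U}\; b        % U = "until"
```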
2/ We analyzed future-masked unique hard attention transformers and found that those with leftmost tiebreaking are strictly less expressive than those with rightmost tiebreaking. The "Tale of Two Sides" in the title nicely captures how these two models differ.
May 17, 2025 at 2:31 PM
1/ When multiple positions achieve the maximum attention score in a transformer, we need a tiebreaking mechanism. Should we pick the leftmost or rightmost position? It turns out this seemingly trivial implementation detail dramatically affects what transformers can express!
May 17, 2025 at 2:30 PM
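A minimal sketch of the mechanism being contrasted (the function name, NumPy implementation, and constant-score toy example are mine, not from the paper): with future masking, each query sees only positions up to itself, and the tiebreaking rule decides which of the tied maximal positions receives all the attention.

```python
# Illustrative sketch of future-masked unique hard attention with two tiebreaking rules.
import numpy as np

def unique_hard_attention(scores: np.ndarray, tiebreak: str = "rightmost") -> np.ndarray:
    """For each query position i, select exactly one position j <= i (future masking)
    with a maximal score; ties go to the leftmost or rightmost such j."""
    n = scores.shape[0]
    selected = np.zeros(n, dtype=int)
    for i in range(n):
        visible = scores[i, : i + 1]                 # future masking: only j <= i are visible
        candidates = np.flatnonzero(visible == visible.max())
        selected[i] = candidates[0] if tiebreak == "leftmost" else candidates[-1]
    return selected

# Toy example: constant scores make every visible position a tie.
scores = np.zeros((4, 4))
print(unique_hard_attention(scores, "leftmost"))   # [0 0 0 0] -> always the first token
print(unique_hard_attention(scores, "rightmost"))  # [0 1 2 3] -> always the current token
```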