Manvi Agarwal
@manviagarwal.bsky.social
3️⃣ We explain why structural information in PE has proven to be empirically successful. 

We go back to 1️⃣ and use the content-context connection to show that the higher the mutual information between data and its positional representation, the better the task performance.
April 8, 2025 at 4:52 AM
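To make the idea above concrete, here is a toy illustration of estimating mutual information between a data feature and a positional feature with scikit-learn's generic estimator; the signal, the positional feature, and the estimator are illustrative assumptions, not the analysis from the paper.

```python
# Toy illustration (not the paper's protocol): estimate mutual information
# between a 1-D data feature and a sinusoidal positional feature.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
seq_len = 512
positions = np.arange(seq_len)

# Hypothetical "data": a signal whose value depends on position plus noise,
# so content and position share information.
data = np.sin(2 * np.pi * positions / 64) + 0.1 * rng.standard_normal(seq_len)

# A simple positional representation: one sinusoidal feature per position.
pos_feature = np.sin(2 * np.pi * positions / 64).reshape(-1, 1)

# Higher MI here means the positional representation captures more of the
# structure present in the data.
mi = mutual_info_regression(pos_feature, data, random_state=0)
print(f"Estimated MI between data and positional feature: {mi[0]:.3f} nats")
```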
2️⃣ We introduce a new positional encoding method - RoPEPool - that can model causality.

How does RoPEPool compare to RoPE and F-StrIPE? Our analysis with a toy example says: RoPEPool isn’t just different, it’s also richer in terms of expressivity.
April 8, 2025 at 4:52 AM
1️⃣ We show how different families of positional encoding - rotation-based (RoPE) and random Fourier features-based (F-StrIPE) - can be compared using kernel methods.

It’s not just vibes - we characterize precisely how queries and keys are affected by positional information.
April 8, 2025 at 4:52 AM
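For context, here is a minimal NumPy sketch of the rotation step in standard RoPE, showing how queries and keys are rotated by position-dependent angles so that their dot product depends only on the relative offset; the half-split pairing and frequency schedule follow one common convention and are not taken from the paper.

```python
# Minimal RoPE sketch: rotate pairs of query/key dimensions by a
# position-dependent angle so that q_m . k_n depends only on (m - n).
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary positional encoding to x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    # One frequency per pair of dimensions (standard RoPE convention).
    freqs = base ** (-np.arange(half) / half)          # (half,)
    angles = positions[:, None] * freqs[None, :]       # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation applied to each (x1_i, x2_i) pair.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))
k = rng.standard_normal((8, 16))
pos = np.arange(8)

q_rot, k_rot = rope(q, pos), rope(k, pos)
# The rotated dot products now encode the relative offsets m - n.
scores = q_rot @ k_rot.T
print(scores.shape)  # (8, 8)
```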
With these two interventions, we obtain better performance at lower cost! 🚀
Curious? Check out the companion webpage: bit.ly/faststructurepe
April 7, 2025 at 11:48 AM
In our paper, we show that stochastic positional encoding is, in fact, a noisy version of a well-known kernel approximation technique: Random Fourier Features. We also show how prior knowledge (e.g., about musical structure) can be incorporated into such linear-complexity Transformers.
April 7, 2025 at 11:48 AM
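As background, here is a minimal sketch of classic Random Fourier Features (Rahimi & Recht) approximating a Gaussian kernel with an explicit feature map; the kernel, bandwidth, and dimensions are illustrative choices, not the paper's specific construction.

```python
# Random Fourier Features sketch: approximate the RBF kernel
# k(x, y) = exp(-||x - y||^2 / 2) with an explicit feature map phi
# such that phi(x) . phi(y) ~= k(x, y).
import numpy as np

def rff_features(x, omega, b):
    """Map x of shape (n, d) to (n, D) random Fourier features."""
    D = omega.shape[1]
    return np.sqrt(2.0 / D) * np.cos(x @ omega + b)

rng = np.random.default_rng(0)
n, d, D = 5, 3, 4096

x = rng.standard_normal((n, d))
y = rng.standard_normal((n, d))

# For a unit-bandwidth RBF kernel, frequencies are sampled from N(0, I).
omega = rng.standard_normal((d, D))
b = rng.uniform(0, 2 * np.pi, size=D)

exact = np.exp(-0.5 * np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1))
approx = rff_features(x, omega, b) @ rff_features(y, omega, b).T
print(np.max(np.abs(exact - approx)))  # small for large D
```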
However, there was a piece missing: how do you handle relative positional encoding in these linear-complexity transformers? 🤔

Enter Stochastic Positional Encoding! It brings relative positional information back into the picture without incurring quadratic cost.
April 7, 2025 at 11:48 AM
Luckily, there's a solution: you can think of attention as a kernel function and use kernel approximation techniques to reduce the cost from quadratic to linear. ⚡
This was the idea used by Performers, for example.
April 7, 2025 at 11:48 AM
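Here is a hedged sketch of that linear-attention trick: once the attention weights are written as a kernel phi(q) . phi(k), reordering the matrix products avoids ever forming the L x L attention matrix. The feature map below is a simple elu+1 choice for illustration, not Performer's exact FAVOR+ random features.

```python
# Linear attention sketch: if attention weights come from a kernel
# k(q, k) = phi(q) . phi(k), then
#   attention(Q, K, V) ~= phi(Q) (phi(K)^T V) / (phi(Q) (phi(K)^T 1))
# and the cost drops from O(L^2 d) to O(L d D).
import numpy as np

def feature_map(x):
    # Simple nonnegative feature map (elu(x) + 1); Performer instead uses a
    # random-feature construction (FAVOR+) to approximate softmax attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qf, Kf = feature_map(Q), feature_map(K)        # (L, D)
    kv = Kf.T @ V                                  # (D, d_v): computed once
    z = Kf.sum(axis=0)                             # (D,): normalizer terms
    return (Qf @ kv) / (Qf @ z)[:, None]           # (L, d_v)

rng = np.random.default_rng(0)
L, d = 1024, 64
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (1024, 64)
```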
Transformers are powerful, but there's a problem: their cost grows quadratically with sequence length. 📈
This makes it really hard to apply them to lengthy sequences, like music, where long-term connections carry important information.
April 7, 2025 at 11:48 AM