Lightnews — Scholar-powered news

Manvi Agarwal

@manviagarwal.bsky.social

3️⃣ We explain why structural information in PE has proven to be empirically successful.

We go back to 1️⃣ and use the content-context connection to show that the higher the mutual information between data and its positional representation, the better the task performance.

April 8, 2025 at 4:52 AM

Manvi Agarwal

@manviagarwal.bsky.social

2️⃣ We introduce a new positional encoding method - RoPEPool - that can model causality.

How does RoPEPool compare to RoPE and F-StrIPE? Our analysis with a toy example says: RoPEPool isn’t just different, it’s also richer in terms of expressivity.

April 8, 2025 at 4:52 AM

Manvi Agarwal

@manviagarwal.bsky.social

1️⃣ We show how different families of positional encoding - rotation-based (RoPE) and Random Fourier Features-based (F-StrIPE) - can be compared using kernel methods.

It’s not just vibes - we characterize precisely how queries and keys are affected by positional information.

April 8, 2025 at 4:52 AM
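
To make the rotation-based side of the comparison above concrete, here is a minimal sketch of standard RoPE in NumPy. It is not taken from the paper: the helper names `rope` and `score`, the vector size, and the base of 10000 are assumptions chosen for illustration. It checks one property that makes a kernel reading natural: the query-key score depends only on the relative offset between positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by angles proportional to pos (standard RoPE)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2-D pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

def score(q, k, m, n):
    """Dot product between a query at position m and a key at position n."""
    return rope(q, m) @ rope(k, n)

rng = np.random.default_rng(0)
q = rng.standard_normal(8)   # hypothetical query vector
k = rng.standard_normal(8)   # hypothetical key vector

# Same offset (n - m = 4) at two different absolute positions.
score_near = score(q, k, 3, 7)
score_far = score(q, k, 13, 17)
```

Because each pair is rotated by an angle proportional to its position, the product of the two rotations depends only on `n - m`, so `score_near` and `score_far` agree up to floating-point error.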

Manvi Agarwal

@manviagarwal.bsky.social

With these two interventions, we obtain better performance at lower cost! 🚀
Curious? Check out the companion webpage: bit.ly/faststructurepe

April 7, 2025 at 11:48 AM

Manvi Agarwal

@manviagarwal.bsky.social

In our paper, we show that stochastic positional encoding is, in fact, a noisy version of a well-known kernel approximation technique: Random Fourier Features. We also show how prior knowledge (e.g. related to musical structure) can be used in such linear-complexity Transformers.

April 7, 2025 at 11:48 AM
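
For readers unfamiliar with Random Fourier Features, the sketch below shows the plain technique, not the paper's method or Stochastic Positional Encoding itself. It assumes a Gaussian kernel over scalar positions; the function name `rff_features`, the lengthscale, and the number of features are illustrative choices. The point is that a kernel of the relative offset `m - n` can be approximated by a dot product of features that each depend on a single absolute position.

```python
import numpy as np

def rff_features(positions, num_features, lengthscale, rng):
    """Random Fourier Features for a Gaussian kernel over scalar positions."""
    # Frequencies are sampled from the kernel's spectral density, N(0, 1 / lengthscale^2).
    omega = rng.standard_normal(num_features) / lengthscale
    phase = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(np.outer(positions, omega) + phase)

rng = np.random.default_rng(0)
positions = np.arange(16, dtype=float)
lengthscale = 4.0   # assumed kernel width, for illustration only

features = rff_features(positions, 20000, lengthscale, rng)
approx_kernel = features @ features.T   # built from per-position features only

offsets = positions[:, None] - positions[None, :]
exact_kernel = np.exp(-offsets**2 / (2.0 * lengthscale**2))

max_error = np.abs(approx_kernel - exact_kernel).max()
```

The approximation error shrinks roughly as one over the square root of the number of features, so `max_error` is small here and grows if `num_features` is reduced.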

Manvi Agarwal

@manviagarwal.bsky.social

However, there was a piece missing: how do you handle relative positional encoding in these linear-complexity transformers? 🤔

Enter Stochastic Positional Encoding! It brings relative positional information back into the picture without going to quadratic cost.

April 7, 2025 at 11:48 AM

Manvi Agarwal

@manviagarwal.bsky.social

Luckily, there's a solution: you can think of attention as a kernel function and use kernel approximation techniques to reduce the cost from quadratic to linear. ⚡
This was the idea used by Performers, for example.

April 7, 2025 at 11:48 AM
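
The reordering behind this cost reduction can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the Performer algorithm: Performers use random features to approximate softmax attention, whereas the sketch swaps in a simple deterministic feature map, `elu(x) + 1`, and the function names are hypothetical. Once attention weights are written as a product of feature maps, the keys and values can be summarized first, so the n-by-n matrix is never formed.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a simple positive feature map (illustrative, not Performer's random features).
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_kernel_attention(Q, K, V):
    """Builds the full (n, n) weight matrix: cost grows quadratically with n."""
    scores = feature_map(Q) @ feature_map(K).T
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ V

def linear_kernel_attention(Q, K, V):
    """Same result, but keys and values are summarized first: cost grows linearly with n."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv_summary = phi_k.T @ V        # (r, d_v), independent of the number of queries
    normalizer = phi_k.sum(axis=0)  # (r,)
    return (phi_q @ kv_summary) / (phi_q @ normalizer)[:, None]

rng = np.random.default_rng(0)
n, d = 64, 8   # sequence length and head dimension, chosen for illustration
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

out_quadratic = quadratic_kernel_attention(Q, K, V)
out_linear = linear_kernel_attention(Q, K, V)
```

Both functions compute the same output; only the order of the matrix products differs, which is what removes the quadratic term.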

Manvi Agarwal

@manviagarwal.bsky.social

Transformers are powerful, but there's a problem: their cost grows quadratically with sequence length. 📈
This makes it really hard to apply them to lengthy sequences, like music, where long-term connections carry important information.

April 7, 2025 at 11:48 AM

Manvi Agarwal

@manviagarwal.bsky.social

December 1, 2024 at 1:22 PM
