julien-siems.bsky.social
@julien-siems.bsky.social
DeltaProduct is now available in the flash-linear-attention library
github.com/fla-org/flas...
flash-linear-attention/fla/layers/gated_deltaproduct.py at main · fla-org/flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models in Torch and Triton - fla-org/flash-linear-attention
github.com
April 8, 2025 at 6:09 AM
9/9 We also discussed state tracking in Linear RNNs at the ASAP Seminar—watch our full talk: www.youtube.com/watch?v=R_0v...
Also take a look at these excellent blog posts:
leloykun.github.io/ponder/block... (by @leloy.bsky.social)
jyopari.github.io/posts/househ... (by Jyothish Pari)
State Tracking in Scalable Linear RNNs - Riccardo Grazzi & Julien Siems | ASAP Seminar #04
YouTube video by ASAP Seminar Series
www.youtube.com
March 28, 2025 at 2:39 PM
8/9 This was a great project with @timurcarstensen.bsky.social, @arberz.bsky.social, Frank Hutter, Massimiliano Pontil, and @riccardograzzi.bsky.social
Check out our Oral at the FM-Wild Workshop at @ICLR:
openreview.net/forum?id=nvb...
DeltaProduct: Improving State-Tracking in Linear RNNs via...
Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However...
openreview.net
March 28, 2025 at 2:39 PM
7/9 In language modeling tasks, DeltaProduct surpasses DeltaNet across lm-eval-harness benchmarks, with notable gains in length extrapolation performance as we increase nₕ.
March 28, 2025 at 2:39 PM
6/9 On modular arithmetic with brackets (a context-free grammar), performance likewise improves as nₕ increases.
March 28, 2025 at 2:39 PM
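To make the task concrete, here is a rough toy construction of this kind of data (my own Python illustration; the paper's exact token format and supervision may differ): a nested-bracket arithmetic expression over ℤ_m that the model reads token by token and whose value mod m it must predict.

import random

def sample_expr(depth, m=5):
    # Recursively build a bracketed arithmetic expression over Z_m and return
    # both its string form and its value modulo m.
    if depth == 0:
        x = random.randrange(m)
        return str(x), x
    op = random.choice(["+", "-", "*"])
    left_s, left_v = sample_expr(depth - 1, m)
    right_s, right_v = sample_expr(depth - 1, m)
    value = {"+": left_v + right_v, "-": left_v - right_v, "*": left_v * right_v}[op] % m
    return f"({left_s}{op}{right_s})", value

expr, target = sample_expr(depth=3, m=5)
print(expr, "=", target, "(mod 5)")  # the sequence model reads expr and predicts target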
5/9 To improve state-tracking, increasing the number of Householders nₕ is more effective than increasing the number of layers l: l=1, nₕ=2 (top row) performs much better than l=2, nₕ=1 (bottom row) on S₃, S₄, and A₅, and nₕ=4 reaches good performance on S₅. (nₕ=1 recovers DeltaNet.)
March 28, 2025 at 2:39 PM
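A small worked example (standard linear algebra, not from the thread; θ, k₁, k₂ are my own notation) of why a second Householder factor buys real expressivity: a single generalized Householder I - βkkᵀ has eigenvalues {1, ..., 1, 1-β}, so it can never equal a non-trivial rotation, but a product of two reflections is one.

Take unit vectors $k_1 = (1, 0)^\top$ and $k_2 = (\cos\theta, \sin\theta)^\top$ with $\beta = 2$:
\[
(I - 2 k_2 k_2^\top)(I - 2 k_1 k_1^\top)
= \begin{pmatrix} -\cos 2\theta & -\sin 2\theta \\ -\sin 2\theta & \cos 2\theta \end{pmatrix}
  \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}
= \begin{pmatrix} \cos 2\theta & -\sin 2\theta \\ \sin 2\theta & \cos 2\theta \end{pmatrix},
\]
a rotation by $2\theta$. A cycle of length ≥ 3 is a non-symmetric permutation matrix, so no single (symmetric) Householder factor can realize it, while products of factors can; this is one way to read the nₕ results above.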
4/9 Building on this insight, DeltaProduct performs nₕ gradient steps per token (with different per-step keys and values), yielding a state-transition matrix A(xᵢ) as a product of nₕ generalized Householder transforms—interpolating between a rank-1 update and a dense matrix.
March 28, 2025 at 2:39 PM
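A minimal numerical sketch of that per-token update (my own plain-PyTorch loop with invented names like n_h and beta; the actual fla implementation uses chunked Triton kernels instead):

import torch

def deltaproduct_step(S, keys, values, betas):
    # One DeltaProduct token update: n_h gradient steps on the associative-recall
    # loss 0.5 * ||S k - v||^2, one step per (k, v, beta) triple.
    # S: (d_v, d_k) state, keys: (n_h, d_k), values: (n_h, d_v), betas: (n_h,)
    for k, v, beta in zip(keys, values, betas):
        k = k / k.norm()                          # unit-norm key
        # S <- S (I - beta k k^T) + beta v k^T, written in gradient-step form
        S = S - beta * torch.outer(S @ k - v, k)
    return S

# n_h = 1 recovers DeltaNet; larger n_h makes the effective transition matrix
# a product of n_h generalized Householder factors.
d_k = d_v = 8
S = torch.zeros(d_v, d_k)
S = deltaproduct_step(S, torch.randn(2, d_k), torch.randn(2, d_v), torch.full((2,), 0.5))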
3/9 Following @sontaiscute.bsky.social et al. (2024), DeltaNet can be seen as performing one gradient descent step per token on an associative recall loss, which makes its state-transition matrix an identity-minus-rank-1 (generalized Householder) update.
March 28, 2025 at 2:39 PM
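Written out (standard DeltaNet algebra; S_t, k_t, v_t, β_t are the usual state, key, value, and step size), one gradient step on the recall loss gives:
\[
\mathcal{L}_t(S) = \tfrac{1}{2}\lVert S k_t - v_t\rVert^2, \qquad
S_t = S_{t-1} - \beta_t \nabla_S \mathcal{L}_t(S_{t-1})
    = S_{t-1}\bigl(I - \beta_t k_t k_t^\top\bigr) + \beta_t v_t k_t^\top,
\]
so the transition matrix is the identity minus a rank-1 term (a generalized Householder), and DeltaProduct simply takes nₕ such steps per token.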
2/9 Linear RNNs’ expressivity depends on the state-transition matrix structure. Diagonal linear RNNs (Mamba, GLA, mLSTM) only allow token mixing. DeltaNet and RWKV-7 use a rank-1 update enabling token+channel mixing. DeltaProduct enables adjustable higher-rank updates—but how?
March 28, 2025 at 2:39 PM
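In symbols, the three transition-matrix families mentioned here (schematic forms only; each model adds its own gating and normalization details):
\[
A_{\text{diag}}(x_t) = \mathrm{diag}(a_t), \qquad
A_{\text{DeltaNet}}(x_t) = I - \beta_t k_t k_t^\top, \qquad
A_{\text{DeltaProduct}}(x_t) = \prod_{j=1}^{n_h} \bigl(I - \beta_{t,j}\, k_{t,j} k_{t,j}^\top\bigr),
\]
where the first covers Mamba, GLA, and mLSTM, the second covers DeltaNet (RWKV-7 uses a closely related form), and nₕ = 1 in the third recovers DeltaNet while larger nₕ yields progressively denser transitions.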