julien-siems.bsky.social
@julien-siems.bsky.social
DeltaProduct is now available in the flash-linear-attention library
github.com/fla-org/flas...
flash-linear-attention/fla/layers/gated_deltaproduct.py at main · fla-org/flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models in Torch and Triton - fla-org/flash-linear-attention
github.com
April 8, 2025 at 6:09 AM
9/9 We also discussed state tracking in Linear RNNs at the ASAP Seminar—watch our full talk: www.youtube.com/watch?v=R_0v...
Also take a look at these excellent blog posts:
leloykun.github.io/ponder/block... (by @leloy.bsky.social)
jyopari.github.io/posts/househ... (by Jyothish Pari)
State Tracking in Scalable Linear RNNs - Riccardo Grazzi & Julien Siems | ASAP Seminar #04
YouTube video by ASAP Seminar Series
www.youtube.com
March 28, 2025 at 2:39 PM
8/9 This was a great project with @timurcarstensen.bsky.social, @arberz.bsky.social, Frank Hutter, Massimiliano Pontil, and @riccardograzzi.bsky.social
Check out our Oral at the FM-Wild Workshop at @ICLR:
openreview.net/forum?id=nvb...
DeltaProduct: Improving State-Tracking in Linear RNNs via...
Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However...
openreview.net
March 28, 2025 at 2:39 PM
7/9 In language modeling tasks, DeltaProduct surpasses DeltaNet across lm-eval-harness benchmarks, with notable gains in length extrapolation performance as we increase nₕ.
March 28, 2025 at 2:39 PM
6/9 On modular arithmetic with brackets (a context-free grammar), performance likewise improves as nₕ increases.
March 28, 2025 at 2:39 PM
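To make the task concrete, here is a rough toy construction of this kind of data (my own Python illustration; the paper's exact token format and supervision may differ): a nested-bracket arithmetic expression over ℤ_m that the model reads token by token and whose value mod m it must predict.

import random

def sample_expr(depth, m=5):
    # Recursively build a bracketed arithmetic expression over Z_m and return
    # both its string form and its value modulo m.
    if depth == 0:
        x = random.randrange(m)
        return str(x), x
    op = random.choice(["+", "-", "*"])
    left_s, left_v = sample_expr(depth - 1, m)
    right_s, right_v = sample_expr(depth - 1, m)
    value = {"+": left_v + right_v, "-": left_v - right_v, "*": left_v * right_v}[op] % m
    return f"({left_s}{op}{right_s})", value

expr, target = sample_expr(depth=3, m=5)
print(expr, "=", target, "(mod 5)")  # the sequence model reads expr and predicts target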
5/9 To improve state-tracking, increasing the number of Householders nₕ is more effective than increasing the number of layers l: l=1, nₕ=2 (top row) performs much better than l=2, nₕ=1 (bottom row) on S₃, S₄, and A₅, and nₕ=4 reaches good performance on S₅. (nₕ=1 recovers DeltaNet.)
March 28, 2025 at 2:39 PM
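A small worked example (standard linear algebra, not from the thread; θ, k₁, k₂ are my own notation) of why a second Householder factor buys real expressivity: a single generalized Householder I - βkkᵀ has eigenvalues {1, ..., 1, 1-β}, so it can never equal a non-trivial rotation, but a product of two reflections is one.

Take unit vectors $k_1 = (1, 0)^\top$ and $k_2 = (\cos\theta, \sin\theta)^\top$ with $\beta = 2$:
\[
(I - 2 k_2 k_2^\top)(I - 2 k_1 k_1^\top)
= \begin{pmatrix} -\cos 2\theta & -\sin 2\theta \\ -\sin 2\theta & \cos 2\theta \end{pmatrix}
  \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}
= \begin{pmatrix} \cos 2\theta & -\sin 2\theta \\ \sin 2\theta & \cos 2\theta \end{pmatrix},
\]
a rotation by $2\theta$. A cycle of length ≥ 3 is a non-symmetric permutation matrix, so no single (symmetric) Householder factor can realize it, while products of factors can; this is one way to read the nₕ results above.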
4/9 Building on this insight, DeltaProduct performs nₕ gradient steps per token (with different per-step keys and values), yielding a state-transition matrix A(xᵢ) as a product of nₕ generalized Householder transforms—interpolating between a rank-1 update and a dense matrix.
March 28, 2025 at 2:39 PM
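A minimal numerical sketch of that per-token update (my own plain-PyTorch loop with invented names like n_h and beta; the actual fla implementation uses chunked Triton kernels instead):

import torch

def deltaproduct_step(S, keys, values, betas):
    # One DeltaProduct token update: n_h gradient steps on the associative-recall
    # loss 0.5 * ||S k - v||^2, one step per (k, v, beta) triple.
    # S: (d_v, d_k) state, keys: (n_h, d_k), values: (n_h, d_v), betas: (n_h,)
    for k, v, beta in zip(keys, values, betas):
        k = k / k.norm()                          # unit-norm key
        # S <- S (I - beta k k^T) + beta v k^T, written in gradient-step form
        S = S - beta * torch.outer(S @ k - v, k)
    return S

# n_h = 1 recovers DeltaNet; larger n_h makes the effective transition matrix
# a product of n_h generalized Householder factors.
d_k = d_v = 8
S = torch.zeros(d_v, d_k)
S = deltaproduct_step(S, torch.randn(2, d_k), torch.randn(2, d_v), torch.full((2,), 0.5))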
3/9 Following @sontaiscute.bsky.social et al. (2024), DeltaNet can be seen as performing one gradient descent step per token on an associative recall loss, which makes its state-transition matrix an identity-minus-rank-1 (generalized Householder) update.
March 28, 2025 at 2:39 PM
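Written out (standard DeltaNet algebra; S_t, k_t, v_t, β_t are the usual state, key, value, and step size), one gradient step on the recall loss gives:
\[
\mathcal{L}_t(S) = \tfrac{1}{2}\lVert S k_t - v_t\rVert^2, \qquad
S_t = S_{t-1} - \beta_t \nabla_S \mathcal{L}_t(S_{t-1})
    = S_{t-1}\bigl(I - \beta_t k_t k_t^\top\bigr) + \beta_t v_t k_t^\top,
\]
so the transition matrix is the identity minus a rank-1 term (a generalized Householder), and DeltaProduct simply takes nₕ such steps per token.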
2/9 Linear RNNs’ expressivity depends on the state-transition matrix structure. Diagonal linear RNNs (Mamba, GLA, mLSTM) only allow token mixing. DeltaNet and RWKV-7 use a rank-1 update enabling token+channel mixing. DeltaProduct enables adjustable higher-rank updates—but how?
March 28, 2025 at 2:39 PM
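In symbols, the three transition-matrix families mentioned here (schematic forms only; each model adds its own gating and normalization details):
\[
A_{\text{diag}}(x_t) = \mathrm{diag}(a_t), \qquad
A_{\text{DeltaNet}}(x_t) = I - \beta_t k_t k_t^\top, \qquad
A_{\text{DeltaProduct}}(x_t) = \prod_{j=1}^{n_h} \bigl(I - \beta_{t,j}\, k_{t,j} k_{t,j}^\top\bigr),
\]
where the first covers Mamba, GLA, and mLSTM, the second covers DeltaNet (RWKV-7 uses a closely related form), and nₕ = 1 in the third recovers DeltaNet while larger nₕ yields progressively denser transitions.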