Paper: arxiv.org/abs/2505.22954
Code: github.com/jennyzzt/dgm
• Canon Layers clearly improved performance when placed before the Attention/MLP blocks
• Softpick gave worse validation loss but completely removed attention sinks
• Parallel blocks matched baseline performance while training 15% faster
So instead of running Attention and MLP one after the other, you get: z = x + Attention(x) + MLP(x), with both branches reading the same input.
PaLM models use this layout, which improves memory usage and speed without hurting performance.
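For concreteness, here is a minimal PyTorch sketch of such a parallel block (the module names, the shared pre-norm, and the omitted causal mask are my own simplifications, not taken from any specific paper):

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Parallel transformer block: z = x + Attention(norm(x)) + MLP(norm(x))."""

    def __init__(self, d_model: int, n_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # one shared pre-norm for both branches
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)  # both branches see the same normalized input
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # single residual update: attention and MLP outputs are simply summed
        return x + attn_out + self.mlp(h)
```

Since the two branches don't depend on each other, their input projections can be fused or overlapped, which is where the reported speed and memory gains come from.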
Canon Layers also help models without positional encoding perform just as well as RoPE models.
❗Worth noting that RWKV used a similar idea years ago.
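My rough mental model of a Canon layer (a sketch of the idea, not the paper's exact parameterization): a short causal, depthwise convolution over the sequence that mixes each token with a few preceding ones, added residually right in front of the Attention/MLP block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CanonLayer(nn.Module):
    """Sketch of a Canon-style layer: a causal depthwise Conv1d that blends each
    token with the previous (kernel_size - 1) tokens, added back residually."""

    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(
            d_model, d_model, kernel_size,
            groups=d_model,  # depthwise: each channel only mixes across time
            bias=False,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); Conv1d expects (batch, d_model, seq_len)
        h = x.transpose(1, 2)
        h = F.pad(h, (self.kernel_size - 1, 0))  # left-pad only => causal
        h = self.conv(h).transpose(1, 2)
        return x + h  # residual keeps it a cheap add-on before Attention/MLP
```

The RWKV connection is its token-shift trick, which likewise blends each token with its predecessor before the main mixing layers.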
Softpick allows exact zeros in the numerator and lets negative values still contribute to the denominator.
This prevents attention sinks while keeping its mathematical properties close to those of regular softmax.
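As I understand the formula (treat this as a sketch; the exact stabilization and epsilon handling in the paper may differ), Softpick replaces softmax's exp(x) with relu(exp(x) − 1) in the numerator and sums |exp(x) − 1| in the denominator:

```python
import torch

def softpick(scores: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of a Softpick-style rectified softmax:
    relu(exp(x) - 1) / (sum(|exp(x) - 1|) + eps).
    Non-positive logits give an exactly-zero numerator (no forced attention mass),
    while their magnitude still counts toward the denominator."""
    # Shift by a non-negative max for numerical stability; the shared factor
    # exp(m) cancels between numerator and denominator.
    m = scores.amax(dim=dim, keepdim=True).clamp_min(0)
    e = torch.exp(scores - m) - torch.exp(-m)  # proportional to exp(x) - 1
    num = torch.relu(e)
    den = e.abs().sum(dim=dim, keepdim=True)
    return num / (den + eps)

# Illustrative use inside attention:
# weights = softpick(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5)
```

Because a row of all non-positive scores can produce all-zero weights instead of being forced to sum to 1, tokens no longer need to dump spare probability mass onto a sink token, which is presumably how the sink behavior disappears.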