erogol.com
@erogol.com
Doing ML

erogol.substack.com
github.com/erogol
Reposted
This work was done in collaboration with Jeff Clune’s lab at UBC, and led by his PhD students Jenny Zhang and Shengran Hu, together with Cong Lu and Robert Lange.

Paper: arxiv.org/abs/2505.22954
Code: github.com/jennyzzt/dgm
May 30, 2025 at 2:33 AM
My results:

• Canon Layers definitely improved performance when placed before Attention/MLP blocks
• Softpick had worse validation loss but completely removed attention sinks
• Parallel blocks matched baseline performance but trained 15% faster
May 6, 2025 at 12:11 PM
Parallel Transformer blocks run MLP and Attention in parallel instead of one after another.

So you get: z = x + MLP(x) + Attention(x)

PaLM models use this approach, which improves memory usage and speed without hurting performance.
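
A minimal PyTorch sketch of the idea (the shared pre-norm and module sizes are my own choices, not taken from PaLM or any specific codebase):

import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x, attn_mask=None):
        h = self.norm(x)  # one shared pre-norm feeds both branches
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        return x + a + self.mlp(h)  # z = x + Attention(x) + MLP(x)

Because the MLP no longer waits on the attention output, the two branches can run side by side and their input projections can be fused, which is where the speedup comes from.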
May 6, 2025 at 12:11 PM
The Canon Layers paper shows they boost performance when added to transformer blocks.

They also let models with no positional encoding (NoPE) perform on par with RoPE models.

❗Worth noting that RWKV used a similar idea years ago.
May 6, 2025 at 12:11 PM
Canon Layers are basically causal 1D convolutions that mix the current hidden state with previous states (how many depends on the kernel size).
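
A rough PyTorch sketch of that idea (the depthwise conv, kernel size, and residual add are my assumptions, not the paper's exact recipe):

import torch.nn as nn
import torch.nn.functional as F

class CanonLayer(nn.Module):
    # Depthwise causal 1D conv: each position mixes with the previous (kernel_size - 1) positions.
    def __init__(self, d_model, kernel_size=4):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        h = x.transpose(1, 2)  # -> (batch, d_model, seq_len) for Conv1d
        h = F.pad(h, (self.kernel_size - 1, 0))  # left-pad only, so no position sees the future
        h = self.conv(h).transpose(1, 2)
        return x + h  # residual add; drop it if the surrounding block already has one

Placing a layer like this right before the Attention and MLP sub-blocks is the setup behind the "before Attention/MLP blocks" result above.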
May 6, 2025 at 12:11 PM
Softpick replaces regular softmax in attention blocks.

It rectifies the numerator, so attention weights can be exactly zero, while negative scores still contribute to the denominator through their absolute value.

This prevents attention sinks while keeping its mathematical properties close to those of regular softmax.
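
A numerically stable sketch of that kind of rectified softmax (the max-shift stabilization and eps handling are my own choices; see the Softpick paper for the exact formulation):

import torch

def softpick(scores, dim=-1, eps=1e-8):
    m = scores.max(dim=dim, keepdim=True).values     # shift by the row max, as in softmax
    shifted = torch.exp(scores - m) - torch.exp(-m)  # equals exp(-m) * (exp(scores) - 1)
    num = torch.relu(shifted)                        # exact zeros for non-positive scores
    den = shifted.abs().sum(dim=dim, keepdim=True)   # negative scores still add to the denominator
    return num / (den + eps)

Unlike softmax, a row of weights can sum to less than one (even to zero), so no token is forced to soak up leftover attention mass, which is roughly why the sinks disappear.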
May 6, 2025 at 12:11 PM
Thanks :)
April 11, 2025 at 9:05 PM