Paper: arxiv.org/abs/2505.22954
Code: github.com/jennyzzt/dgm
• Canon Layers clearly improved performance when placed before the Attention/MLP blocks
• Softpick gave worse validation loss but completely removed attention sinks
• Parallel blocks matched baseline performance while training 15% faster
So instead of running Attention and MLP one after the other, you get: z = x + Attention(x) + MLP(x), with both branches reading the same input.
PaLM models use this layout, which improves memory usage and speed without hurting performance.
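For concreteness, here is a minimal PyTorch sketch of such a parallel block (the module names, the shared pre-norm, and the omitted causal mask are my own simplifications, not taken from any specific paper):

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Parallel transformer block: z = x + Attention(norm(x)) + MLP(norm(x))."""

    def __init__(self, d_model: int, n_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # one shared pre-norm for both branches
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)  # both branches see the same normalized input
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # single residual update: attention and MLP outputs are simply summed
        return x + attn_out + self.mlp(h)
```

Since the two branches don't depend on each other, their input projections can be fused or overlapped, which is where the reported speed and memory gains come from.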
Canon Layers also help models without positional encoding perform just as well as RoPE models.
❗Worth noting that RWKV used a similar idea years ago.
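My rough mental model of a Canon layer (a sketch of the idea, not the paper's exact parameterization): a short causal, depthwise convolution over the sequence that mixes each token with a few preceding ones, added residually right in front of the Attention/MLP block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CanonLayer(nn.Module):
    """Sketch of a Canon-style layer: a causal depthwise Conv1d that blends each
    token with the previous (kernel_size - 1) tokens, added back residually."""

    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(
            d_model, d_model, kernel_size,
            groups=d_model,  # depthwise: each channel only mixes across time
            bias=False,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); Conv1d expects (batch, d_model, seq_len)
        h = x.transpose(1, 2)
        h = F.pad(h, (self.kernel_size - 1, 0))  # left-pad only => causal
        h = self.conv(h).transpose(1, 2)
        return x + h  # residual keeps it a cheap add-on before Attention/MLP
```

The RWKV connection is its token-shift trick, which likewise blends each token with its predecessor before the main mixing layers.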
Softpick allows exact zeros in the numerator and lets negative values still contribute to the denominator.
This prevents attention sinks while keeping its mathematical properties close to those of regular softmax.
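As I understand the formula (treat this as a sketch; the exact stabilization and epsilon handling in the paper may differ), Softpick replaces softmax's exp(x) with relu(exp(x) − 1) in the numerator and sums |exp(x) − 1| in the denominator:

```python
import torch

def softpick(scores: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of a Softpick-style rectified softmax:
    relu(exp(x) - 1) / (sum(|exp(x) - 1|) + eps).
    Non-positive logits give an exactly-zero numerator (no forced attention mass),
    while their magnitude still counts toward the denominator."""
    # Shift by a non-negative max for numerical stability; the shared factor
    # exp(m) cancels between numerator and denominator.
    m = scores.amax(dim=dim, keepdim=True).clamp_min(0)
    e = torch.exp(scores - m) - torch.exp(-m)  # proportional to exp(x) - 1
    num = torch.relu(e)
    den = e.abs().sum(dim=dim, keepdim=True)
    return num / (den + eps)

# Illustrative use inside attention:
# weights = softpick(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5)
```

Because a row of all non-positive scores can produce all-zero weights instead of being forced to sum to 1, tokens no longer need to dump spare probability mass onto a sink token, which is presumably how the sink behavior disappears.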