Paul Chang
@mummitrollet.bsky.social
ML + stuff @Datacrunch
Also pretty cool to see the open-source community building on top of each other's work!
May 30, 2025 at 8:07 AM
The paper also proposes Grouped-Tied Attention (GTA), which works in the opposite direction: it draws inspiration from MLA and folds those ideas into GQA.
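Very roughly, the tied-cache idea in a toy sketch (my reading of it; the shapes, group sizes, and the way one cached tensor plays both K and V here are my assumptions, not the paper's implementation, and I skip the partial-RoPE detail):

```python
# Toy sketch of Grouped-Tied Attention (GTA) as I read it: keep GQA's
# grouping, but cache ONE tied state per group and reuse it as both key and
# value, roughly halving the KV cache vs. plain GQA. Shapes are assumptions.
import torch

n_heads, n_groups, head_dim, seq_len = 16, 4, 64, 512
heads_per_group = n_heads // n_groups

# A single tied cache tensor per group instead of separate K and V caches.
tied_cache = torch.randn(n_groups, seq_len, head_dim)
q = torch.randn(n_heads, head_dim)            # decode-step queries, one per head

outs = []
for h in range(n_heads):
    g = h // heads_per_group
    kv = tied_cache[g]                        # the same tensor plays K and V
    scores = (q[h] @ kv.T) / head_dim ** 0.5  # (seq_len,)
    probs = torch.softmax(scores, dim=-1)
    outs.append(probs @ kv)                   # value side reuses the tied state
print(torch.stack(outs).shape)                # -> (16, 64), one output per head
```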
May 30, 2025 at 8:07 AM
The paper's Grouped Latent Attention (GLA) splits the latent across devices by group, maintaining high arithmetic intensity and achieving better parallelism, which delivers higher throughput without a drop in performance.
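A toy sketch of the grouping idea (the group count, latent size, and the absorbed-latent simplification below are my assumptions, not the paper's reference code); under tensor parallelism each rank would hold, and compute over, only its own group's latent cache:

```python
# Toy sketch of Grouped Latent Attention (GLA) as I read it: split the latent
# into G smaller groups, assign query heads to a group, and let each group's
# latent KV cache live on its own tensor-parallel rank. Skips RoPE and the
# output projection; uses the "absorbed" trick where the latent doubles as
# keys and values. All shapes are illustrative assumptions.
import torch

G = 2                                # latent groups, e.g. one per TP rank
n_heads, group_rank, seq_len = 128, 256, 1024
heads_per_group = n_heads // G

# Per-group latent cache: under TP, each rank stores only latent_cache[g].
latent_cache = [torch.randn(seq_len, group_rank) for _ in range(G)]

# Decode-step queries, already projected into each group's latent space.
q_latent = torch.randn(n_heads, group_rank)

out = []
for g in range(G):                   # under TP, each rank runs only its own g
    q_g = q_latent[g * heads_per_group:(g + 1) * heads_per_group]
    scores = q_g @ latent_cache[g].T / group_rank ** 0.5   # (64, seq_len)
    probs = torch.softmax(scores, dim=-1)
    out.append(probs @ latent_cache[g])                    # latent as values
print(torch.cat(out).shape)          # -> (128, 256), outputs for all heads
```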
May 30, 2025 at 8:07 AM
Well, the paper suggests a hybrid: what about using MLA but adding groups?
May 30, 2025 at 8:07 AM
Instead, one must make a copy of the latent component across GPUs, which feels wasteful.
May 30, 2025 at 8:06 AM
This is where MLA is somewhat awkward, and GQA scores some points back. MLA uses a single large latent head that must be replicated across all tensor-parallel GPUs, so the attention computation cannot be sharded across devices.
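A quick back-of-the-envelope on what that means for the per-GPU cache footprint (the GQA head count, TP degree, and bf16 cache below are assumed values; figures are per token per layer):

```python
# Per-GPU KV-cache footprint under tensor parallelism (TP), per token per
# layer. GQA can shard its KV heads across ranks; MLA's single latent head
# cannot be split, so every rank keeps a full copy. Figures are illustrative.
tp = 8                 # tensor-parallel ranks (assumed)
n_kv_heads = 8         # GQA key/value heads (assumed, divisible by tp)
head_dim = 128
kv_lora_rank = 512     # MLA compressed latent dimension
rope_dim = 64          # decoupled RoPE key cached alongside the latent
bytes_per_elem = 2     # bf16

gqa_per_gpu = (n_kv_heads // tp) * 2 * head_dim * bytes_per_elem
mla_per_gpu = (kv_lora_rank + rope_dim) * bytes_per_elem

print(f"GQA cache per GPU: {gqa_per_gpu} B (shrinks as tp grows)")
print(f"MLA cache per GPU: {mla_per_gpu} B (replicated on all {tp} ranks)")
```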
May 30, 2025 at 8:06 AM
First of all, a confession! In the blog titled 'Multi-Head Latent Attention: Benefits in Memory and Computation', we didn't tell the whole story: the benchmarking was done on a single GPU. In reality, DeepSeek V3-style models need to be parallelized across multiple GPUs.
May 30, 2025 at 8:05 AM
The paper focuses on designing more effective decoding attention for inference, taking Multi-head Latent Attention (MLA) and Group Query Attention (GQA) as its starting points.
May 30, 2025 at 8:05 AM
datacrunch.io/blog/multi-h...

The blog post explains these terms and how they relate to arithmetic intensity. Let us know if you have any questions or spot errors.
#MLSky
Multi-Head Latent Attention: Benefits in Memory and Computation
Multi-Head Latent Attention (MLA) vs. Group Query Attention (GQA): Transformer inference optimization in DeepSeek V3 with lower KV cache and higher FLOPs/s.
datacrunch.io
May 9, 2025 at 7:59 AM
However, more is at play: revisiting Kipply's well-known Transformer Inference Arithmetic article shows that the MLA mechanism used during inference is compute-bound 🖥️, not memory-bound 💾.
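Here is the rough estimate behind that claim, a minimal sketch assuming single-request decoding, a bf16 cache, DeepSeek V3-like dimensions, and the absorbed-weight form of MLA (illustrative arithmetic, not a kernel benchmark):

```python
# Rough arithmetic intensity (FLOPs per byte of KV cache read) during decode.
n_heads = 128
head_dim = 128
kv_lora_rank = 512     # MLA compressed latent dimension
rope_dim = 64
bytes_per_elem = 2     # bf16

# Vanilla MHA decode: every head reads its own K and V rows; ~2*head_dim FLOPs
# for the QK dot product and ~2*head_dim for the PV product per cached token.
mha_flops = n_heads * 4 * head_dim
mha_bytes = n_heads * 2 * head_dim * bytes_per_elem
print(f"MHA: ~{mha_flops / mha_bytes:.0f} FLOPs/byte -> memory-bound")

# MLA decode (absorbed form): every head attends against the SAME cached
# latent entry, so those bytes are read once and reused by all 128 heads.
mla_flops = n_heads * 4 * kv_lora_rank
mla_bytes = (kv_lora_rank + rope_dim) * bytes_per_elem
print(f"MLA: ~{mla_flops / mla_bytes:.0f} FLOPs/byte -> compute-bound")
```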
May 9, 2025 at 7:58 AM
Looking at how DeepSeek's attention (MLA) projects the KV cache down to a low-rank latent, one immediately thinks it means less memory needed in HBM, preventing dreaded out-of-memory errors 👿.
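A back-of-the-envelope on the memory side, using dimensions from the public DeepSeek V3 config (treat the exact numbers as illustrative):

```python
# KV-cache bytes per token: vanilla multi-head attention vs. MLA's latent.
n_layers = 61          # DeepSeek V3 layers
n_heads = 128          # attention heads
head_dim = 128         # per-head key/value dimension
kv_lora_rank = 512     # MLA compressed latent dimension
rope_dim = 64          # decoupled RoPE key cached alongside the latent
bytes_per_elem = 2     # bf16

mha = n_layers * 2 * n_heads * head_dim * bytes_per_elem     # full K and V
mla = n_layers * (kv_lora_rank + rope_dim) * bytes_per_elem  # latent + RoPE key

print(f"MHA cache per token: {mha / 2**20:.2f} MiB")   # ~3.8 MiB
print(f"MLA cache per token: {mla / 2**20:.3f} MiB")   # ~0.067 MiB
print(f"reduction: ~{mha / mla:.0f}x")                 # ~57x
```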
May 9, 2025 at 7:58 AM
This is very true! Go and speak to people in more old-school businesses and you quickly realize that with current models you could already do so much.
April 27, 2025 at 9:13 AM