Paul Chang
@mummitrollet.bsky.social
ML + stuff @Datacrunch
Also pretty cool to see the open-source community building on top of each other's work!
May 30, 2025 at 8:07 AM
The paper also proposes Grouped-Tied Attention (GTA), which works in the opposite direction: it draws inspiration from MLA and folds those ideas into GQA.
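Very roughly, the tied-cache idea in a toy sketch (my reading of it; the shapes, group sizes, and the way one cached tensor plays both K and V here are my assumptions, not the paper's implementation, and I skip the partial-RoPE detail):

```python
# Toy sketch of Grouped-Tied Attention (GTA) as I read it: keep GQA's
# grouping, but cache ONE tied state per group and reuse it as both key and
# value, roughly halving the KV cache vs. plain GQA. Shapes are assumptions.
import torch

n_heads, n_groups, head_dim, seq_len = 16, 4, 64, 512
heads_per_group = n_heads // n_groups

# A single tied cache tensor per group instead of separate K and V caches.
tied_cache = torch.randn(n_groups, seq_len, head_dim)
q = torch.randn(n_heads, head_dim)            # decode-step queries, one per head

outs = []
for h in range(n_heads):
    g = h // heads_per_group
    kv = tied_cache[g]                        # the same tensor plays K and V
    scores = (q[h] @ kv.T) / head_dim ** 0.5  # (seq_len,)
    probs = torch.softmax(scores, dim=-1)
    outs.append(probs @ kv)                   # value side reuses the tied state
print(torch.stack(outs).shape)                # -> (16, 64), one output per head
```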
May 30, 2025 at 8:07 AM
The paper's Grouped Latent Attention (GLA) splits the latent across devices by group, maintaining high arithmetic intensity and achieving better parallelism, which delivers higher throughput without a drop in performance.
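A toy sketch of the grouping idea (the group count, latent size, and the absorbed-latent simplification below are my assumptions, not the paper's reference code); under tensor parallelism each rank would hold, and compute over, only its own group's latent cache:

```python
# Toy sketch of Grouped Latent Attention (GLA) as I read it: split the latent
# into G smaller groups, assign query heads to a group, and let each group's
# latent KV cache live on its own tensor-parallel rank. Skips RoPE and the
# output projection; uses the "absorbed" trick where the latent doubles as
# keys and values. All shapes are illustrative assumptions.
import torch

G = 2                                # latent groups, e.g. one per TP rank
n_heads, group_rank, seq_len = 128, 256, 1024
heads_per_group = n_heads // G

# Per-group latent cache: under TP, each rank stores only latent_cache[g].
latent_cache = [torch.randn(seq_len, group_rank) for _ in range(G)]

# Decode-step queries, already projected into each group's latent space.
q_latent = torch.randn(n_heads, group_rank)

out = []
for g in range(G):                   # under TP, each rank runs only its own g
    q_g = q_latent[g * heads_per_group:(g + 1) * heads_per_group]
    scores = q_g @ latent_cache[g].T / group_rank ** 0.5   # (64, seq_len)
    probs = torch.softmax(scores, dim=-1)
    out.append(probs @ latent_cache[g])                    # latent as values
print(torch.cat(out).shape)          # -> (128, 256), outputs for all heads
```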
May 30, 2025 at 8:07 AM
Well, the paper suggests a hybrid: what about using MLA but adding groups?
May 30, 2025 at 8:07 AM
Instead, one must make a copy of the latent component across GPUs, which feels wasteful.
May 30, 2025 at 8:06 AM
This is where MLA is somewhat awkward, and GQA scores some points back. MLA uses a single large latent head that must be replicated across all tensor-parallel GPUs, so the attention computation cannot be sharded across devices.
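A quick back-of-the-envelope on what that means for the per-GPU cache footprint (the GQA head count, TP degree, and bf16 cache below are assumed values; figures are per token per layer):

```python
# Per-GPU KV-cache footprint under tensor parallelism (TP), per token per
# layer. GQA can shard its KV heads across ranks; MLA's single latent head
# cannot be split, so every rank keeps a full copy. Figures are illustrative.
tp = 8                 # tensor-parallel ranks (assumed)
n_kv_heads = 8         # GQA key/value heads (assumed, divisible by tp)
head_dim = 128
kv_lora_rank = 512     # MLA compressed latent dimension
rope_dim = 64          # decoupled RoPE key cached alongside the latent
bytes_per_elem = 2     # bf16

gqa_per_gpu = (n_kv_heads // tp) * 2 * head_dim * bytes_per_elem
mla_per_gpu = (kv_lora_rank + rope_dim) * bytes_per_elem

print(f"GQA cache per GPU: {gqa_per_gpu} B (shrinks as tp grows)")
print(f"MLA cache per GPU: {mla_per_gpu} B (replicated on all {tp} ranks)")
```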
May 30, 2025 at 8:06 AM
First of all, a confession! In the blog titled 'Multi-Head Latent Attention: Benefits in Memory and Computation', we didn't tell the whole story: the benchmarking was done on a single GPU. In reality, DeepSeek V3-style models need to be parallelized across multiple GPUs.
May 30, 2025 at 8:05 AM
The paper focuses on designing more effective decoding attention for inference, taking Multi-head Latent Attention (MLA) and Group Query Attention (GQA) as its starting points.
May 30, 2025 at 8:05 AM
datacrunch.io/blog/multi-h...

The blog post explains these terms and how they relate to arithmetic intensity. Let us know if you have any questions or spot errors.
#MLSky
Multi-Head Latent Attention: Benefits in Memory and Computation
Multi-Head Latent Attention (MLA) vs. Group Query Attention (GQA): Transformer inference optimization in DeepSeek V3 with lower KV cache and higher FLOPs/s.
datacrunch.io
May 9, 2025 at 7:59 AM
However, more is at play: revisiting Kipply's well-known Transformer Inference Arithmetic article shows that the MLA mechanism used during inference is compute-bound 🖥️, not memory-bound 💾.
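Here is the rough estimate behind that claim, a minimal sketch assuming single-request decoding, a bf16 cache, DeepSeek V3-like dimensions, and the absorbed-weight form of MLA (illustrative arithmetic, not a kernel benchmark):

```python
# Rough arithmetic intensity (FLOPs per byte of KV cache read) during decode.
n_heads = 128
head_dim = 128
kv_lora_rank = 512     # MLA compressed latent dimension
rope_dim = 64
bytes_per_elem = 2     # bf16

# Vanilla MHA decode: every head reads its own K and V rows; ~2*head_dim FLOPs
# for the QK dot product and ~2*head_dim for the PV product per cached token.
mha_flops = n_heads * 4 * head_dim
mha_bytes = n_heads * 2 * head_dim * bytes_per_elem
print(f"MHA: ~{mha_flops / mha_bytes:.0f} FLOPs/byte -> memory-bound")

# MLA decode (absorbed form): every head attends against the SAME cached
# latent entry, so those bytes are read once and reused by all 128 heads.
mla_flops = n_heads * 4 * kv_lora_rank
mla_bytes = (kv_lora_rank + rope_dim) * bytes_per_elem
print(f"MLA: ~{mla_flops / mla_bytes:.0f} FLOPs/byte -> compute-bound")
```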
May 9, 2025 at 7:58 AM
Looking at how DeepSeek's attention (MLA) projects the KV cache down to a low-rank latent, one immediately thinks it means less memory needed in HBM, preventing dreaded out-of-memory errors 👿.
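A back-of-the-envelope on the memory side, using dimensions from the public DeepSeek V3 config (treat the exact numbers as illustrative):

```python
# KV-cache bytes per token: vanilla multi-head attention vs. MLA's latent.
n_layers = 61          # DeepSeek V3 layers
n_heads = 128          # attention heads
head_dim = 128         # per-head key/value dimension
kv_lora_rank = 512     # MLA compressed latent dimension
rope_dim = 64          # decoupled RoPE key cached alongside the latent
bytes_per_elem = 2     # bf16

mha = n_layers * 2 * n_heads * head_dim * bytes_per_elem     # full K and V
mla = n_layers * (kv_lora_rank + rope_dim) * bytes_per_elem  # latent + RoPE key

print(f"MHA cache per token: {mha / 2**20:.2f} MiB")   # ~3.8 MiB
print(f"MLA cache per token: {mla / 2**20:.3f} MiB")   # ~0.067 MiB
print(f"reduction: ~{mha / mla:.0f}x")                 # ~57x
```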
May 9, 2025 at 7:58 AM
This is very true! Go and speak to people in more old-school businesses and you quickly realize that with current models you could already do so much.
April 27, 2025 at 9:13 AM