Riccardo Mereu
@rmwu.bsky.social
Reposted by Riccardo Mereu
Multi-Head Latent Attention vs Group Query Attention: We break down why MLA is a more expressive memory compression technique AND why naive implementations can backfire. Check it out!
⚡️Multi-Head Latent Attention is one of the key innovations that enabled @deepseek_ai's V3 and the subsequent R1 model.

⏭️ Join us as we continue our series on efficient AI inference, covering both theoretical insights and practical implementation:

🔗 datacrunch.io/blog/deepsee...
DeepSeek + SGLang: Multi-Head Latent Attention
Multi-Head Latent Attention (MLA) improves upon Group Query Attention (GQA), enabling long-context reasoning models and wider adoption across open-source LLMs.
datacrunch.io
March 12, 2025 at 7:01 PM
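A minimal sketch of the memory argument behind the post: GQA shrinks the KV cache by sharing key/value heads across query heads, while MLA caches a single low-rank latent per token that all heads decode from. The code below only compares per-token cache sizes; the dimensions are illustrative assumptions loosely modeled on publicly reported DeepSeek-V3 settings, not DataCrunch's or SGLang's implementation.

```python
# Toy comparison of per-token KV-cache footprints (in stored values).
# Dimensions are assumptions for illustration, not exact model configs.

def gqa_cache_per_token(n_kv_heads: int, head_dim: int) -> int:
    # GQA stores one K and one V vector for each key/value head.
    return 2 * n_kv_heads * head_dim

def mla_cache_per_token(latent_dim: int, rope_dim: int) -> int:
    # MLA caches one compressed KV latent shared by all heads,
    # plus a small decoupled RoPE key component.
    return latent_dim + rope_dim

if __name__ == "__main__":
    # Example: 8 KV heads of dim 128 (GQA) vs a 512-dim latent
    # with a 64-dim RoPE key (MLA-style compression).
    print("GQA cache / token:", gqa_cache_per_token(8, 128))   # 2048 values
    print("MLA cache / token:", mla_cache_per_token(512, 64))  # 576 values
```

The smaller per-token cache is what makes long-context serving cheaper, while the up-projection from the latent back to per-head keys and values is where naive implementations can lose the savings.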
Reposted by Riccardo Mereu
Little is known about how deep networks interact with structure in data. An important aspect of this structure is symmetry (e.g., pose transformations). Here, we (w/ @stphtphsn.bsky.social) study the generalization ability of deep networks on symmetric datasets: arxiv.org/abs/2412.11521 (1/n)
On the Ability of Deep Networks to Learn Symmetries from Data: A Neural Kernel Theory
Symmetries (transformations by group actions) are present in many datasets, and leveraging them holds significant promise for improving predictions in machine learning. In this work, we aim to underst...
arxiv.org
January 14, 2025 at 1:05 PM
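For readers unfamiliar with the term, a "symmetric dataset" here means one whose labels are invariant under a group action, so each sample's entire orbit shares its label. The toy snippet below (my own illustration, not code from the paper) builds such a dataset using cyclic translations as the group.

```python
# Toy symmetric dataset: labels are invariant under cyclic shifts,
# so every element of a sample's orbit inherits the same label.
import numpy as np

rng = np.random.default_rng(0)

def cyclic_orbit(x: np.ndarray) -> np.ndarray:
    # All cyclic translations of x: the orbit under the group Z_d.
    return np.stack([np.roll(x, s) for s in range(len(x))])

# Base samples with arbitrary binary labels.
X = rng.normal(size=(4, 6))
y = rng.integers(0, 2, size=4)

# Symmetrized dataset: replace each sample by its full orbit,
# repeating its label across the orbit.
X_sym = np.concatenate([cyclic_orbit(x) for x in X])
y_sym = np.repeat(y, X.shape[1])

print(X_sym.shape, y_sym.shape)  # (24, 6) (24,)
```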