Our mission is to contribute to the advancement of AI research and understand the computational requirements of intelligence.
What should you do 🤔... quantise to NF4? 🧵
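For intuition, here is a rough, stdlib-only sketch of NF4-style block quantisation. It is illustrative only: real NF4 (from QLoRA) uses a specific fixed 16-level table that includes an exact zero, while the levels here are approximated from evenly spaced quantiles of a standard normal.

```python
# Illustrative sketch of NF4-style block quantisation (not the real NF4 table).
# Levels here approximate quantiles of N(0, 1), rescaled to [-1, 1]; blocks
# are scaled by their absolute maximum before snapping to the nearest level.
import statistics

_nd = statistics.NormalDist()
_raw = [_nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
NF4_LEVELS = [x / max(abs(v) for v in _raw) for x in _raw]

def quantise_block(block):
    """Absmax-scale a block, then snap each value to the nearest level index."""
    scale = max(abs(x) for x in block) or 1.0
    idx = [min(range(16), key=lambda i: abs(x / scale - NF4_LEVELS[i]))
           for x in block]
    return scale, idx

def dequantise_block(scale, idx):
    return [scale * NF4_LEVELS[i] for i in idx]

scale, idx = quantise_block([0.1, -0.4, 0.25, 0.0])
approx = dequantise_block(scale, idx)  # close to the original block
```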
For a given budget, we can get better performance by allocating more bits to more sensitive tensors.
5/6
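One way to picture this allocation (a hedged sketch, not the paper's algorithm): assume each tensor's quantisation error scales like sensitivity × 4^(−bits), then spend a total bit budget greedily on whichever tensor's next bit buys the largest error reduction per bit spent.

```python
# Illustrative greedy bit allocation under an assumed error model
# err_i(b) = sensitivity_i * 4**(-b). Not the paper's method.
import heapq

def allocate_bits(sensitivity, sizes, budget_bits, b_min=2, b_max=8):
    bits = [b_min] * len(sensitivity)
    spent = sum(b * n for b, n in zip(bits, sizes))

    def gain(i):  # error reduction from giving tensor i one more bit
        return sensitivity[i] * (4.0 ** -bits[i] - 4.0 ** -(bits[i] + 1))

    # heap of (-gain per bit spent, tensor index); one live entry per tensor
    heap = [(-gain(i) / sizes[i], i) for i in range(len(bits))]
    heapq.heapify(heap)
    while heap:
        _, i = heapq.heappop(heap)
        if bits[i] >= b_max or spent + sizes[i] > budget_bits:
            continue
        bits[i] += 1
        spent += sizes[i]
        heapq.heappush(heap, (-gain(i) / sizes[i], i))
    return bits

# two equally sized tensors, one 16x more sensitive, 6 bits/weight on average
bits = allocate_bits([16.0, 1.0], [100, 100], budget_bits=1200)
```

The more sensitive tensor ends up with more bits, the other with fewer, while the average stays on budget.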
This is an illustrative image of how that might work.
4/6
The best ones use variable-length encoding (e.g. a Huffman code). Interestingly, block-absmax formats can also outperform fixed-length codes.
3/6
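A toy example of the variable-length idea (illustrative, not the paper's construction): Huffman-code the quantised level indices so that common levels cost fewer bits than a fixed-length code would.

```python
# Huffman code lengths over quantised weight indices: frequent symbols get
# short codes, so a skewed distribution beats a fixed-length encoding.
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # heap items: (count, tiebreak, {symbol: depth}); tiebreak keeps
    # tuple comparison away from the dicts
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# skewed index distribution: level 0 is much more common than the rest
idx = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
lengths = huffman_code_lengths(idx)
bits = sum(lengths[s] for s in idx)  # fewer than 2 bits/symbol fixed-length
```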
With plenty of assumptions, this simplifies to a weighted average of squared reconstruction error.
2/6
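The kind of objective this describes can be written down in a couple of lines (the per-weight sensitivities here are illustrative placeholders, not the paper's derivation):

```python
# Weighted average of squared reconstruction error between original and
# quantised-then-dequantised weights, weighted by per-weight sensitivity.
def weighted_sq_error(weights, reconstructed, sensitivities):
    num = sum(s * (w - r) ** 2
              for w, r, s in zip(weights, reconstructed, sensitivities))
    return num / sum(sensitivities)

err = weighted_sq_error([1.0, -2.0], [0.9, -2.2], [4.0, 1.0])
```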
It's called Optimal Formats for Weight Quantisation and has just hit arXiv.
1/6
Titans introduces a memory module that can be updated during inference, unlocking the ability to handle very long sequences with a hybrid attention-recurrent structure.
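A loose sketch of the test-time-update idea (illustrative only, far simpler than the Titans architecture): keep an associative memory matrix M and nudge it with a gradient step on how "surprising" each new key/value pair is under the current memory.

```python
# Toy online memory: M <- M - lr * d/dM ||M @ key - value||^2.
# Purely illustrative; Titans uses a learned neural memory, not this.
def update_memory(M, key, value, lr=0.1):
    pred = [sum(M[i][j] * key[j] for j in range(len(key)))
            for i in range(len(M))]
    err = [p - v for p, v in zip(pred, value)]           # "surprise" signal
    return [[M[i][j] - lr * 2 * err[i] * key[j]
             for j in range(len(key))] for i in range(len(M))]

M = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(50):                       # repeated exposure to one pair
    M = update_memory(M, [1.0, 0.0], [0.5, -0.5])
recall = [row[0] for row in M]            # M @ [1, 0] converges to [0.5, -0.5]
```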
Finally, the Phi-4 paper presents a rather different FLOPs angle: spending compute in the data-generation process to create higher quality data, leading to “student” models that (in some domains) out-perform their “teachers”.
The Memory Layers architecture allows extra parameters to be added without increasing FLOPs. Decoupling these gives model designers more control (e.g. for model-hardware co-design) and potentially facilitates more effective models in general.
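A hedged sketch of why memory layers decouple parameters from FLOPs: each token reads only its top-k entries from a large learned key/value table, so the table (parameters) can grow while per-token compute is fixed by k. (Illustrative; the real design uses product-key decomposition so top-k selection is sublinear, rather than the brute-force scoring below.)

```python
# Toy memory-layer lookup: huge key/value table, but each query only
# aggregates its top-k matches, softmax-weighted.
import math, random

random.seed(0)
DIM, N_KEYS, K = 4, 1000, 2          # millions of keys in practice
keys = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_KEYS)]
values = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_KEYS)]

def memory_lookup(query):
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = sorted(range(N_KEYS), key=lambda i: scores[i])[-K:]
    ws = [math.exp(scores[i]) for i in top]              # softmax over top-k
    z = sum(ws)
    return [sum(w / z * values[i][d] for w, i in zip(ws, top))
            for d in range(DIM)]

out = memory_lookup([1.0, 0.0, -1.0, 0.5])  # DIM-dimensional read-out
```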
The Concept Model architecture, which also uses a flexible intermediate representation, performs autoregressive sentence generation in this modality-agnostic "concept space" rather than token space, perhaps more akin to the process underlying human thought.
Tokenization is a key part of current LLMs, yet has drawbacks (e.g. it must be trained separately). But training directly on bytes is inefficient/ineffective.
The Byte Latent Transformer addresses this via dynamic entropy-based grouping of bytes into variable-size patches.
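A rough sketch of entropy-based patching (illustrative: BLT uses a small learned byte LM for its entropy estimates, whereas this uses bigram counts from the input itself). Start a new patch whenever the next-byte distribution, given the previous byte, is high-entropy.

```python
# Toy entropy-based byte patching: split where the next byte is hard to
# predict from its one-byte context, so predictable runs form long patches.
import math
from collections import defaultdict

def byte_entropies(data):
    """Entropy (bits) of the next-byte distribution at each position."""
    follow = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(data, data[1:]):
        follow[prev][cur] += 1

    def H(dist):
        total = sum(dist.values())
        return -sum(c / total * math.log2(c / total) for c in dist.values())

    ents = [0.0]  # first byte has no context
    for i in range(1, len(data)):
        ents.append(H(follow[data[i - 1]]))
    return ents

def patch(data, threshold=0.5):
    """Start a new patch whenever context entropy exceeds the threshold."""
    ents = byte_entropies(data)
    patches, cur = [], bytearray()
    for b, e in zip(data, ents):
        if cur and e > threshold:
            patches.append(bytes(cur))
            cur = bytearray()
        cur.append(b)
    patches.append(bytes(cur))
    return patches

patches = patch(b"abababXab")  # predictable "ab" pairs group together
```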