Graphcore Research
@gcresearchteam.bsky.social
The 🦋 account of the Graphcore Research team.

Our mission is to contribute to the advancement of AI research and understand the computational requirements of intelligence.
Your boss emails you a point in 8-billion-dimensional space. It's Llama 8B in bfloat16. They want it compressed.

What should you do 🤔... quantise to NF4? 🧵
June 12, 2025 at 11:19 AM
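A minimal numpy sketch of what "quantise it" means at block level. This uses a symmetric uniform 4-bit grid as a stand-in for the actual NF4 codebook levels, with one absmax scale per block (block size 64 is an illustrative choice, not a prescribed one):

```python
import numpy as np

def quantise_blocks(w, block=64, bits=4):
    """Block-absmax quantisation: scale each block by its max |value|,
    then round to a symmetric uniform grid (stand-in for NF4 levels)."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True)      # one fp scale per block
    levels = 2 ** (bits - 1) - 1                      # 7 for 4-bit symmetric
    q = np.round(w / scale * levels).astype(np.int8)  # integer codes in [-7, 7]
    return q, scale

def dequantise(q, scale, bits=4):
    levels = 2 ** (bits - 1) - 1
    return q.astype(np.float32) / levels * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, s = quantise_blocks(w)
err = np.abs(dequantise(q, s).ravel() - w).max()      # worst-case rounding error
```

The per-element error is bounded by half a grid step times the block scale, which is why outlier-heavy blocks hurt.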
The weighted objective means that some tensors are more sensitive to quantisation - we can find out how sensitive by measuring the average diagonal Fisher information.

For a given budget, we can get better performance by allocating more bits to more sensitive tensors.

5/6
May 22, 2025 at 12:27 PM
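The allocation rule above has a classical closed form: each tensor gets the average budget plus half a bit per doubling of its sensitivity relative to the geometric mean. A small sketch (the sensitivity numbers here are hypothetical, standing in for measured diagonal-Fisher values):

```python
import numpy as np

def allocate_bits(sensitivity, mean_bits=4.0):
    """Classical rate allocation: half a bit extra per doubling of
    (Fisher-weighted) sensitivity, keeping the same average budget."""
    s = np.asarray(sensitivity, dtype=np.float64)
    geo = np.exp(np.log(s).mean())              # geometric mean of sensitivities
    return mean_bits + 0.5 * np.log2(s / geo)   # reweighted, same mean

# hypothetical per-tensor sensitivities
bits = allocate_bits([1.0, 4.0, 16.0, 1.0])
```

Note the average stays at `mean_bits` by construction; in practice allocations also get clipped to formats that actually exist.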
It seems that block-absmax, sparse outliers and lossless compression all exploit variable-length coding to some extent.

This is an illustrative image of how that might work.

4/6
May 22, 2025 at 12:26 PM
We can use classical quantisation theory (Shannon 1948, Panter and Dite 1951, Zador 1982) to find optimal quantisers.

The best ones use variable-length encoding (e.g. a Huffman code). Interestingly, block-absmax formats can also outperform fixed-length codes.

3/6
May 22, 2025 at 12:26 PM
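A toy demonstration of the variable-length-coding gain: quantise Gaussian samples to a 15-level uniform grid, build a Huffman code over the resulting symbols, and compare against the 4 bits a fixed-length code would need:

```python
import heapq, itertools
from collections import Counter
import numpy as np

def huffman_lengths(counts):
    """Code lengths of a Huffman code built from symbol counts."""
    tiebreak = itertools.count(len(counts))
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, a = heapq.heappop(heap)
        c2, _, b = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**a, **b}.items()}  # deepen both subtrees
        heapq.heappush(heap, (c1 + c2, next(tiebreak), merged))
    return heap[0][2]

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
codes = np.clip(np.round(x * 2), -7, 7).astype(int)  # 15-level uniform grid
counts = Counter(codes.tolist())
lens = huffman_lengths(counts)
avg_bits = sum(counts[s] * lens[s] for s in counts) / len(codes)
```

Because the Gaussian concentrates mass on the central levels, `avg_bits` lands near the ~3-bit entropy of the quantised source, well under the 4-bit fixed-length cost.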
We want to minimise the KL divergence between original and quantised model predictions.

With plenty of assumptions, this simplifies to a weighted average of squared reconstruction error.

2/6
May 22, 2025 at 12:26 PM
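A one-parameter sanity check of that simplification: for a Bernoulli model with logit θ, the KL divergence under a small perturbation δ matches the Fisher-weighted quadratic ½·F·δ² to leading order:

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta, delta = 0.3, 1e-3
p, q = sigmoid(theta), sigmoid(theta + delta)
fisher = p * (1 - p)                 # Fisher information of a Bernoulli logit
kl = kl_bernoulli(p, q)              # exact KL between the two models
approx = 0.5 * fisher * delta ** 2   # second-order, Fisher-weighted approximation
```

The mismatch is third-order in δ, which is what justifies treating quantisation noise through a weighted squared-error lens.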
Our latest work uses theory from the '50s to figure out how to design weight quantisation formats for LLM inference.

It's called Optimal Formats for Weight Quantisation and has just hit arXiv.

1/6
May 22, 2025 at 12:25 PM
Summary: graphcore-research.github.io/papers-of-th...

Transformer-Squared is a novel approach that adapts LLMs for new tasks by selectively adjusting the singular components of their weight matrices, helping broaden LLMs’ abilities to handle diverse tasks with greater efficiency.
February 4, 2025 at 11:05 AM
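A minimal numpy sketch of what "adjusting singular components" could look like: take the SVD of a weight matrix and rescale its singular values by a per-task vector z, leaving the singular directions fixed. (In the paper z is learned per task; here it is just a placeholder.)

```python
import numpy as np

def adapt_singular_values(w, z):
    """Rescale a weight matrix's singular values by a vector z,
    keeping the singular directions untouched (SVD-based sketch)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u * (s * z)) @ vt   # broadcast z over the singular values

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 6))
w_same = adapt_singular_values(w, np.ones(6))   # z = 1 recovers w exactly
```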
Summary: graphcore-research.github.io/papers-of-th...

Evolving Deeper LLM Thinking explores evolutionary search strategies to scale test-time compute, outperforming other inference strategies in natural language planning tasks.
February 4, 2025 at 11:04 AM
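The evolutionary-search skeleton is generic: score a population, keep the fittest, mutate to refill. A toy stand-in (in the paper, an LLM proposes the candidates and refinements; here a bit-flip mutation on a bitstring plays that role):

```python
import random

def evolve(fitness, mutate, init, pop_size=32, generations=50, seed=0):
    """Generic evolutionary search: keep the fittest quarter of the
    population each generation and mutate them to refill it."""
    rng = random.Random(seed)
    pop = [init(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 4]                 # truncation selection
        pop = parents + [mutate(rng.choice(parents), rng)
                         for _ in range(pop_size - len(parents))]
    return max(pop, key=fitness)

# toy task: maximise the number of 1s in a 20-bit string
best = evolve(
    fitness=sum,
    mutate=lambda x, rng: [b ^ (rng.random() < 0.1) for b in x],
    init=lambda rng: [rng.randint(0, 1) for _ in range(20)],
)
```

Keeping the parents unmutated (elitism) guarantees the best score never regresses between generations.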
Summary: graphcore-research.github.io/papers-of-th...

Titans introduces a memory module that can be updated during inference, unlocking very long sequence lengths with a hybrid attention-recurrent structure.
February 4, 2025 at 11:04 AM
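A toy sketch of an inference-time memory write in the Titans spirit (not the paper's architecture): a linear associative memory takes a gradient step on the "surprise" of a new key-value pair, after which the key retrieves the stored value.

```python
import numpy as np

def memory_update(M, k, v, lr=0.1):
    """One inference-time write: gradient step on 0.5 * ||M k - v||^2."""
    err = M @ k - v
    return M - lr * np.outer(err, k)

rng = np.random.default_rng(0)
d = 16
M = np.zeros((d, d))
k = rng.standard_normal(d)
k /= np.linalg.norm(k)           # unit-norm key keeps the update stable
v = rng.standard_normal(d)
for _ in range(200):
    M = memory_update(M, k, v)
recalled = M @ k                 # the memory now returns v for key k
```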
*Phi-4*

Finally, the Phi-4 paper presents a rather different FLOPs angle: spending compute in the data-generation process to create higher quality data, leading to “student” models that (in some domains) out-perform their “teachers”.
January 9, 2025 at 11:00 AM
*Memory Layers at Scale*

The Memory Layers architecture allows extra parameters to be added without increasing FLOPs. Decoupling parameter count from compute gives model designers more control (e.g. for model-hardware co-design) and potentially facilitates more effective models in general.
January 9, 2025 at 11:00 AM
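A simplified sketch of why memory layers add parameters without proportional compute: only the top-k value rows are ever read, so the value table can grow while the per-token work stays small. (Real memory layers score keys via product-key factorisation rather than scoring every key, as below.)

```python
import numpy as np

def memory_layer(query, keys, values, topk=4):
    """Sparse memory lookup: score keys, then read only the top-k
    value rows, so extra value parameters add little compute."""
    scores = keys @ query                        # (num_slots,)
    idx = np.argpartition(scores, -topk)[-topk:] # indices of the k best keys
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                                 # softmax over the k winners
    return w @ values[idx]                       # weighted read of k rows only

rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 32))
values = rng.standard_normal((1024, 64))        # could be much larger
out = memory_layer(rng.standard_normal(32), keys, values)
```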
*Large Concept Models*

The Concept Model architecture, which also uses a flexible intermediate representation, performs autoregressive sentence generation in this modality-agnostic "concept space" rather than token space, perhaps more akin to the process underlying human thought.
January 9, 2025 at 11:00 AM
*The Byte Latent Transformer*

Tokenization is a key part of current LLMs, yet it has drawbacks (e.g. it must be trained separately). But training directly on raw bytes is inefficient and ineffective.

The Byte Latent Transformer addresses this via dynamic entropy-based grouping of bytes into variable-size patches.
January 9, 2025 at 11:00 AM
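A toy illustration of entropy-based patching. BLT uses a learned byte-level language model to measure next-byte entropy; here a bigram frequency table stands in for it, and a new patch starts wherever the next byte is hard to predict:

```python
import math
from collections import Counter, defaultdict

def entropy_patches(data: bytes, threshold=2.0):
    """Split bytes into patches, starting a new patch whenever a toy
    bigram model finds the next byte hard to predict (high entropy)."""
    follows = defaultdict(Counter)
    for a, b in zip(data, data[1:]):   # bigram counts stand in for
        follows[a][b] += 1             # BLT's learned byte LM
    patches, start = [], 0
    for i in range(1, len(data)):
        counts = follows[data[i - 1]]
        total = sum(counts.values())
        h = -sum(c / total * math.log2(c / total) for c in counts.values())
        if h > threshold:              # surprising context: cut a new patch
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

patches = entropy_patches(b"abababab difficult-to-predict text! abababab")
```

Predictable runs (the `ab` repeats) merge into long patches, while unpredictable regions get split finely, giving the model more compute exactly where the bytes are hard.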