Our mission is to contribute to the advancement of AI research and understand the computational requirements of intelligence.
What should you do 🤔... quantise to NF4? 🧵
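For intuition, here is a rough, stdlib-only sketch of NF4-style block quantisation. It is illustrative only: real NF4 (from QLoRA) uses a specific fixed 16-level table that includes an exact zero, while the levels here are approximated from evenly spaced quantiles of a standard normal.

```python
# Illustrative sketch of NF4-style block quantisation (not the real NF4 table).
# Levels here approximate quantiles of N(0, 1), rescaled to [-1, 1]; blocks
# are scaled by their absolute maximum before snapping to the nearest level.
import statistics

_nd = statistics.NormalDist()
_raw = [_nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
NF4_LEVELS = [x / max(abs(v) for v in _raw) for x in _raw]

def quantise_block(block):
    """Absmax-scale a block, then snap each value to the nearest level index."""
    scale = max(abs(x) for x in block) or 1.0
    idx = [min(range(16), key=lambda i: abs(x / scale - NF4_LEVELS[i]))
           for x in block]
    return scale, idx

def dequantise_block(scale, idx):
    return [scale * NF4_LEVELS[i] for i in idx]

scale, idx = quantise_block([0.1, -0.4, 0.25, 0.0])
approx = dequantise_block(scale, idx)  # close to the original block
```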
For a given budget, we can get better performance by allocating more bits to more sensitive tensors.
5/6
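One way to picture this allocation (a hedged sketch, not the paper's algorithm): assume each tensor's quantisation error scales like sensitivity × 4^(−bits), then spend a total bit budget greedily on whichever tensor's next bit buys the largest error reduction per bit spent.

```python
# Illustrative greedy bit allocation under an assumed error model
# err_i(b) = sensitivity_i * 4**(-b). Not the paper's method.
import heapq

def allocate_bits(sensitivity, sizes, budget_bits, b_min=2, b_max=8):
    bits = [b_min] * len(sensitivity)
    spent = sum(b * n for b, n in zip(bits, sizes))

    def gain(i):  # error reduction from giving tensor i one more bit
        return sensitivity[i] * (4.0 ** -bits[i] - 4.0 ** -(bits[i] + 1))

    # heap of (-gain per bit spent, tensor index); one live entry per tensor
    heap = [(-gain(i) / sizes[i], i) for i in range(len(bits))]
    heapq.heapify(heap)
    while heap:
        _, i = heapq.heappop(heap)
        if bits[i] >= b_max or spent + sizes[i] > budget_bits:
            continue
        bits[i] += 1
        spent += sizes[i]
        heapq.heappush(heap, (-gain(i) / sizes[i], i))
    return bits

# two equally sized tensors, one 16x more sensitive, 6 bits/weight on average
bits = allocate_bits([16.0, 1.0], [100, 100], budget_bits=1200)
```

The more sensitive tensor ends up with more bits, the other with fewer, while the average stays on budget.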
This is an illustrative image of how that might work.
4/6
The best ones use variable-length encoding (e.g. a Huffman code). Interestingly, block-absmax formats can also outperform fixed-length codes.
3/6
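A toy example of the variable-length idea (illustrative, not the paper's construction): Huffman-code the quantised level indices so that common levels cost fewer bits than a fixed-length code would.

```python
# Huffman code lengths over quantised weight indices: frequent symbols get
# short codes, so a skewed distribution beats a fixed-length encoding.
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # heap items: (count, tiebreak, {symbol: depth}); tiebreak keeps
    # tuple comparison away from the dicts
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# skewed index distribution: level 0 is much more common than the rest
idx = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
lengths = huffman_code_lengths(idx)
bits = sum(lengths[s] for s in idx)  # fewer than 2 bits/symbol fixed-length
```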
With plenty of assumptions, this simplifies to a weighted average of squared reconstruction error.
2/6
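The kind of objective this describes can be written down in a couple of lines (the per-weight sensitivities here are illustrative placeholders, not the paper's derivation):

```python
# Weighted average of squared reconstruction error between original and
# quantised-then-dequantised weights, weighted by per-weight sensitivity.
def weighted_sq_error(weights, reconstructed, sensitivities):
    num = sum(s * (w - r) ** 2
              for w, r, s in zip(weights, reconstructed, sensitivities))
    return num / sum(sensitivities)

err = weighted_sq_error([1.0, -2.0], [0.9, -2.2], [4.0, 1.0])
```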
It's called Optimal Formats for Weight Quantisation and has just hit arXiv.
1/6
Titans introduces a memory module that can be updated during inference, unlocking the ability to handle very long sequences with a hybrid attention-recurrent structure.
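A loose sketch of the test-time-update idea (illustrative only, far simpler than the Titans architecture): keep an associative memory matrix M and nudge it with a gradient step on how "surprising" each new key/value pair is under the current memory.

```python
# Toy online memory: M <- M - lr * d/dM ||M @ key - value||^2.
# Purely illustrative; Titans uses a learned neural memory, not this.
def update_memory(M, key, value, lr=0.1):
    pred = [sum(M[i][j] * key[j] for j in range(len(key)))
            for i in range(len(M))]
    err = [p - v for p, v in zip(pred, value)]           # "surprise" signal
    return [[M[i][j] - lr * 2 * err[i] * key[j]
             for j in range(len(key))] for i in range(len(M))]

M = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(50):                       # repeated exposure to one pair
    M = update_memory(M, [1.0, 0.0], [0.5, -0.5])
recall = [row[0] for row in M]            # M @ [1, 0] converges to [0.5, -0.5]
```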
Finally, the Phi-4 paper presents a rather different FLOPs angle: spending compute in the data-generation process to create higher quality data, leading to “student” models that (in some domains) out-perform their “teachers”.
The Memory Layers architecture allows extra parameters to be added without increasing FLOPs. Decoupling these gives model designers more control (e.g. for model-hardware co-design) and potentially facilitates more effective models in general.
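A hedged sketch of why memory layers decouple parameters from FLOPs: each token reads only its top-k entries from a large learned key/value table, so the table (parameters) can grow while per-token compute is fixed by k. (Illustrative; the real design uses product-key decomposition so top-k selection is sublinear, rather than the brute-force scoring below.)

```python
# Toy memory-layer lookup: huge key/value table, but each query only
# aggregates its top-k matches, softmax-weighted.
import math, random

random.seed(0)
DIM, N_KEYS, K = 4, 1000, 2          # millions of keys in practice
keys = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_KEYS)]
values = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_KEYS)]

def memory_lookup(query):
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = sorted(range(N_KEYS), key=lambda i: scores[i])[-K:]
    ws = [math.exp(scores[i]) for i in top]              # softmax over top-k
    z = sum(ws)
    return [sum(w / z * values[i][d] for w, i in zip(ws, top))
            for d in range(DIM)]

out = memory_lookup([1.0, 0.0, -1.0, 0.5])  # DIM-dimensional read-out
```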
The Concept Model architecture, which also uses a flexible intermediate representation, performs autoregressive sentence generation in this modality-agnostic "concept space" rather than token space, perhaps more akin to the process underlying human thought.
Tokenization is a key part of current LLMs, yet has drawbacks (e.g. it must be trained separately). But training directly on bytes is inefficient/ineffective.
The Byte Latent Transformer addresses this via dynamic entropy-based grouping of bytes into variable-size patches.
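A rough sketch of entropy-based patching (illustrative: BLT uses a small learned byte LM for its entropy estimates, whereas this uses bigram counts from the input itself). Start a new patch whenever the next-byte distribution, given the previous byte, is high-entropy.

```python
# Toy entropy-based byte patching: split where the next byte is hard to
# predict from its one-byte context, so predictable runs form long patches.
import math
from collections import defaultdict

def byte_entropies(data):
    """Entropy (bits) of the next-byte distribution at each position."""
    follow = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(data, data[1:]):
        follow[prev][cur] += 1

    def H(dist):
        total = sum(dist.values())
        return -sum(c / total * math.log2(c / total) for c in dist.values())

    ents = [0.0]  # first byte has no context
    for i in range(1, len(data)):
        ents.append(H(follow[data[i - 1]]))
    return ents

def patch(data, threshold=0.5):
    """Start a new patch whenever context entropy exceeds the threshold."""
    ents = byte_entropies(data)
    patches, cur = [], bytearray()
    for b, e in zip(data, ents):
        if cur and e > threshold:
            patches.append(bytes(cur))
            cur = bytearray()
        cur.append(b)
    patches.append(bytes(cur))
    return patches

patches = patch(b"abababXab")  # predictable "ab" pairs group together
```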