Philip Whittington
philipwitti.bsky.social
Doctoral student @ETH Zürich 🇨🇭
Reposted by Philip Whittington
Tokenisers are a vital part of LLMs, but how hard is it to find an optimal one? 🤔 Considering arbitrarily large alphabets, prior work showed this is NP-hard. But what if we use bytes instead? Or unary strings like a, aa, aaa, ...? In our new paper, we show this is still hard, NP-hard!
November 20, 2025 at 3:05 PM
Reposted by Philip Whittington
Honoured to receive two (!!) SAC highlights awards at #ACL2025 😁 (Conveniently placed on the same slide!)
With the amazing: @philipwitti.bsky.social, @gregorbachmann.bsky.social and @wegotlieb.bsky.social,
@cuiding.bsky.social, Giovanni Acampa, @alexwarstadt.bsky.social, @tamaregev.bsky.social
July 31, 2025 at 7:41 AM
Reposted by Philip Whittington
BPE is a greedy method to find a tokeniser which maximises compression! Why don't we try to find properly optimal tokenisers instead? Well, it seems this is a pretty difficult—in fact, NP-complete—problem!🤯
New paper + @philipwitti.bsky.social
@gregorbachmann.bsky.social :) arxiv.org/abs/2412.15210
Tokenisation is NP-Complete
In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $δ$ symbols by either finding a vocabulary directly (direct token...
December 20, 2024 at 2:04 PM