Violeta Kastreva
vkastreva.bsky.social
Research Intern at ETH Zürich
Thrilled to share my first paper! 📄

We prove optimal tokenization is NP-hard on bounded alphabets (like bytes), and direct tokenization stays NP-hard even on a unary alphabet!

Big thanks @tpimentel.bsky.social, @philipwitti.bsky.social & Dennis Komm for the mentorship! Best birthday gift. 🎂

arxiv.org/abs/2511.15709
Tokenisers are a vital part of LLMs, but how hard is it to find an optimal one? 🤔 Considering arbitrarily large alphabets, prior work showed this is NP-hard. But what if we use bytes instead? Or unary strings like a, aa, aaa, ...? In our new paper, we show this is still hard, NP-hard!
November 20, 2025 at 3:27 PM