Pietro Lesci
@pietrolesci.bsky.social
PhD student at Cambridge University. Causality & language models. Passionate musician, professional debugger.

pietrolesci.github.io
Also, got burning questions about memorisation? Send them my way—we'll make sure to pose them to our panelists during the workshop!
July 27, 2025 at 6:41 AM
Also, we find that:
– Tokenisation bias appears early in training
– It doesn’t go away as models improve or with scale

We hope this approach can support:
– More principled vocabulary design
– Better understanding of generalisation trade-offs
– Fairer and more stable LMs
June 5, 2025 at 10:43 AM
As our main result, we find that when a token is in a model’s vocabulary (i.e., when its characters are tokenised as a single symbol), the model may assign it up to 17x more probability than if it had been split into two tokens instead.
June 5, 2025 at 10:43 AM
The trick: tokenisers build vocabs incrementally up to a fixed size (e.g., 32k). This defines a "cutoff": tokens near it are similar (e.g., in frequency), but those just inside the vocabulary appear as one symbol while those just outside appear as two. Perfect setup for regression discontinuity! Details in 📄!
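For intuition, here is a minimal sketch of the sharp regression-discontinuity idea (not the paper's estimator; the variable names, the 32k cutoff, and the 2k-rank bandwidth are illustrative): take candidate tokens near the vocabulary cutoff, fit a local linear trend on each side, and read off the jump in log-probability at the cutoff.

```python
# Minimal sharp-RDD sketch (illustrative, not the paper's code). Assumes you
# already have, for each candidate token, its rank in the tokeniser's merge
# ordering and the log-probability the model assigns to its characters.
import numpy as np

def rdd_effect(ranks, logprobs, cutoff=32_000, bandwidth=2_000):
    """Estimate the jump in log-probability at the vocabulary cutoff."""
    ranks = np.asarray(ranks, dtype=float)
    logprobs = np.asarray(logprobs, dtype=float)
    window = np.abs(ranks - cutoff) <= bandwidth      # keep only tokens near the cutoff
    x, y = ranks[window] - cutoff, logprobs[window]   # centre the running variable at 0
    inside = x <= 0                                   # rank <= cutoff -> token made it into the vocab

    def value_at_cutoff(xs, ys):
        slope, intercept = np.polyfit(xs, ys, deg=1)  # local linear fit on one side
        return intercept                              # predicted log-probability at the cutoff

    return value_at_cutoff(x[inside], y[inside]) - value_at_cutoff(x[~inside], y[~inside])

# effect = rdd_effect(ranks, logprobs)  # > 0 means in-vocab tokens get extra log-probability
```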
June 5, 2025 at 10:43 AM
So, did we train thousands of models, with and without each token in our vocabulary? No! Our method works observationally! 👀📊
June 5, 2025 at 10:43 AM
While intuitive, this question is tricky. We can’t just compare
1️⃣ in- vs. out-of-vocab words (like "hello" vs "appoggiatura") as they differ systematically, e.g., in frequency
2️⃣ different tokenisations (e.g., ⟨he, llo⟩ or ⟨hello⟩), as the model only sees one during training
June 5, 2025 at 10:43 AM
In our paper, we estimate a specific type of tokenisation bias: What’s the effect of including a token (e.g., ⟨hello⟩) in the tokeniser’s vocabulary on the log-probability the model assigns to its characters (“hello”)?
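In symbols (my paraphrase, not the paper's notation), writing θ(V) for a model trained on text tokenised with vocabulary V and chars(t) for the character string of token t, the estimand is roughly:

```latex
% Rough paraphrase of the estimand: the effect on log-probability of having
% token t inside vs. outside the tokeniser's vocabulary V.
\tau(t) \;=\; \mathbb{E}\!\left[\log p_{\theta(V \cup \{t\})}\bigl(\mathrm{chars}(t)\bigr)\right]
        \;-\; \mathbb{E}\!\left[\log p_{\theta(V \setminus \{t\})}\bigl(\mathrm{chars}(t)\bigr)\right]
```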
June 5, 2025 at 10:43 AM
Most language models assign probabilities to raw strings (like "hello") by first tokenising them (like ⟨he, llo⟩ or ⟨hello⟩). Ideally, different tokenisations shouldn't change these models’ outputs. In practice, they do. We call this difference **tokenisation bias**.
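To make this concrete, here is a toy sketch (off-the-shelf GPT-2 purely as an example, not one of the models studied in the paper): score the same characters under the tokeniser's canonical tokenisation and under a manually forced split, conditioned on the same prefix.

```python
# Toy illustration of tokenisation bias: same characters, different tokenisations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def score_target(prefix_ids, target_ids):
    """Sum of log p(target token | prefix + preceding target tokens)."""
    ids = torch.tensor([prefix_ids + target_ids])
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits, dim=-1)
    total = 0.0
    for pos, tid in enumerate(target_ids, start=len(prefix_ids)):
        total += logprobs[0, pos - 1, tid].item()  # prediction made at the previous position
    return total

prefix = tok.encode("I said")
canonical = tok.encode(" hello")               # whatever split the tokeniser picks
split = tok.encode(" he") + tok.encode("llo")  # a different split of the same characters
print(score_target(prefix, canonical), score_target(prefix, split))  # ideally equal; usually not
```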
June 5, 2025 at 10:43 AM
Big thanks to my co-authors: @ovdw.bsky.social, Max Müller-Eberstein, @nsaphra.bsky.social, @hails.computer, Willem Zuidema, and @stellaathena.bsky.social
April 22, 2025 at 11:02 AM
Come find us at the poster session:
🗓️ Fri 25 Apr, 3:00–5:30 p.m. (+08)
📌 Hall 3 + Hall 2B, Poster n. 259
April 22, 2025 at 11:02 AM
We find that:
📈 Language modelling is stable: consistent scaling laws for performance and info content.
📚 The core of linguistic structure forms during steps 1k–10k; steps 10k–100k bring the biggest jumps in performance.
🗺️ Training maps capture these phases and reveal outlier seeds early.
April 22, 2025 at 11:02 AM
We introduce PolyPythias: 50 training runs across 5 sizes (14M–410M) and 10 seeds to explore:
1️⃣ How stable is downstream performance?
2️⃣ How similar are the learned linguistic representations?
3️⃣ Do training dynamics reveal distinct phases, and can we spot issues early?
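If you want to sweep the grid yourself, a rough sketch is below. The intermediate sizes and the repo-id pattern are my assumptions (the post only gives the 14M–410M range and the seed count), so check the release for the actual model names.

```python
# Sketch for iterating over the PolyPythias grid (5 sizes x 10 seeds = 50 runs).
from transformers import AutoModelForCausalLM

SIZES = ["14m", "31m", "70m", "160m", "410m"]  # assumed intermediate sizes
SEEDS = range(10)

for size in SIZES:
    for seed in SEEDS:
        repo_id = f"EleutherAI/pythia-{size}-seed{seed}"  # hypothetical repo-id pattern, check the release
        model = AutoModelForCausalLM.from_pretrained(repo_id)
        # ... evaluate downstream tasks / probe representations for this run
```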
April 22, 2025 at 11:02 AM