Pietro Lesci
@pietrolesci.bsky.social
PhD student at Cambridge University. Causality & language models. Passionate musician, professional debugger.
pietrolesci.github.io
Also, got burning questions about memorisation? Send them my way—we'll make sure to pose them to our panelists during the workshop!
July 27, 2025 at 6:41 AM
Paper 📄: arxiv.org/abs/2506.03149
Code 💻: github.com/pietrolesci/...
Joint work with amazing collaborators: Clara Meister, Thomas Hofmann, @andreasvlachos.bsky.social, and @tpimentel.bsky.social!
Causal Estimation of Tokenisation Bias
Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser -- which maps character-strings to...
arxiv.org
June 5, 2025 at 10:43 AM
Also, we find that:
– Tokenisation bias appears early in training
– Doesn’t go away as models improve or with scale
We hope this approach can support:
– More principled vocabulary design
– Better understanding of generalisation trade-offs
– Fairer and more stable LMs
June 5, 2025 at 10:43 AM
As our main result, we find that when a token is in a model’s vocabulary—i.e., when its characters are tokenised as a single symbol—the model may assign it up to 17x more probability than if it had been split into two tokens instead
June 5, 2025 at 10:43 AM
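A quick unit conversion (my arithmetic, not a figure from the post or paper): since the quantity being estimated is a log-probability difference, a 17x probability ratio corresponds to a gap of roughly
\[
\ln 17 \approx 2.83 \ \text{nats} \approx 4.09 \ \text{bits}.
\]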
The trick: tokenisers build vocabs incrementally up to a fixed size (e.g., 32k). This defines a "cutoff": tokens near it are similar (e.g., in frequency), but those just inside the vocabulary appear as one symbol while those just outside appear as two. A perfect setup for regression discontinuity! Details in 📄!
June 5, 2025 at 10:43 AM
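To make the regression-discontinuity idea concrete, here is a minimal sketch of a sharp RDD estimate with a uniform kernel and a local linear fit. The variable names, the hand-rolled least-squares estimator, and the example cutoff/bandwidth are assumptions for illustration, not the paper's actual implementation (see the linked repo for that).

```python
import numpy as np

def rdd_estimate(running_var, outcome, cutoff, bandwidth):
    """Sharp regression-discontinuity estimate via a local linear fit.

    running_var: position of each candidate token in the tokeniser's merge
                 order (its "rank"), so the vocabulary size acts as a cutoff.
    outcome:     the model's log-probability of that token's characters.
    cutoff:      vocabulary size, e.g. 32_000.
    bandwidth:   only candidates within this distance of the cutoff are used.
    """
    x = np.asarray(running_var, dtype=float) - cutoff  # centre at the cutoff
    y = np.asarray(outcome, dtype=float)
    keep = np.abs(x) <= bandwidth                      # local window (uniform kernel)
    x, y = x[keep], y[keep]
    in_vocab = (x < 0).astype(float)                   # below the cutoff -> token is in the vocabulary

    # Fit y ~ a + tau * in_vocab + b1 * x + b2 * (in_vocab * x); tau is the jump at the cutoff.
    X = np.column_stack([np.ones_like(x), in_vocab, x, in_vocab * x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

# Hypothetical usage:
# tau = rdd_estimate(merge_rank, char_logprob, cutoff=32_000, bandwidth=2_000)
```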
So, did we train thousands of models, with and without each token in our vocabulary? No! Our method works observationally! 👀📊
June 5, 2025 at 10:43 AM
While intuitive, this question is tricky. We can’t just compare
1️⃣ in- vs. out-of-vocab words (like "hello" vs "appoggiatura"), as they differ systematically, e.g., in frequency
2️⃣ different tokenisations (e.g., ⟨he,llo⟩ or ⟨hello⟩), as the model only sees one during training
June 5, 2025 at 10:43 AM
In our paper, we estimate a specific type of tokenisation bias: What’s the effect of including a token (e.g., ⟨hello⟩) in the tokeniser’s vocabulary on the log-probability the model assigns to its characters (“hello”)?
June 5, 2025 at 10:43 AM
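Stated slightly more formally (my notation, not necessarily the paper's): for a character string $c$ and its candidate token $t$, the quantity being estimated is roughly
\[
\tau \;=\; \mathbb{E}\!\left[\log p_\theta(c) \mid t \in \mathcal{V}\right] \;-\; \mathbb{E}\!\left[\log p_\theta(c) \mid t \notin \mathcal{V}\right],
\]
i.e., the change in the log-probability a trained model assigns to $c$ when $t$ is included in the vocabulary $\mathcal{V}$, which the regression-discontinuity setup above recovers without retraining.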
Most language models assign probabilities to raw strings (like "hello") by first tokenising them (like ⟨he, llo⟩ or ⟨hello⟩). Ideally, different tokenisations shouldn't change these models’ outputs. In practice, they do. We call this difference **tokenisation bias**
June 5, 2025 at 10:43 AM
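A small sketch of what "different tokenisations give different probabilities" looks like in practice, assuming a Hugging Face causal LM. The model choice, prefix, and segmentations below are illustrative (whether " hello" is a single token depends on the vocabulary); this is not the paper's evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"   # any small causal LM works; this one is just an example
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def string_logprob(prefix: str, token_ids: list[int]) -> float:
    """Sum of the log-probabilities the model assigns to `token_ids` after `prefix`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, torch.tensor([token_ids])], dim=1)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(input_ids).logits, dim=-1)
    n_prefix = prefix_ids.shape[1]
    # Token i sits at position n_prefix + i and is predicted from the position before it.
    return sum(logprobs[0, n_prefix + i - 1, tid].item() for i, tid in enumerate(token_ids))

prefix = "They waved and said"
ids_merged = tok(" hello", add_special_tokens=False).input_ids      # tokeniser's own segmentation
ids_split = (tok(" he", add_special_tokens=False).input_ids
             + tok("llo", add_special_tokens=False).input_ids)      # same characters, split differently

print(string_logprob(prefix, ids_merged), string_logprob(prefix, ids_split))
```

The gap between the two numbers is one concrete face of tokenisation bias; the paper's contribution is estimating the effect of vocabulary membership on such log-probabilities causally, rather than just observing the gap.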
Big thanks to my co-authors: @ovdw.bsky.social, Max Müller-Eberstein, @nsaphra.bsky.social, @hails.computer, Willem Zuidema, and @stellaathena.bsky.social
April 22, 2025 at 11:02 AM
Come find us at the poster session:
🗓️ Fri 25 Apr, 3:00–5:30 p.m. (+08)
📌 Hall 3 + Hall 2B, Poster n. 259
April 22, 2025 at 11:02 AM
We find that:
📈 Language modelling is stable: consistent scaling laws for performance and info content.
📚 Steps 1k–10k form the core of linguistic structure; 10k–100k bring the biggest jumps in performance.
🗺️ Training maps capture these phases and reveal outlier seeds early
April 22, 2025 at 11:02 AM
We introduce PolyPythias: 50 training runs across 5 sizes (14M–410M) and 10 seeds to explore:
1️⃣ How stable is downstream performance?
2️⃣ How similar are the learned linguistic representations?
3️⃣ Do training dynamics reveal distinct phases, and can we spot issues early?
April 22, 2025 at 11:02 AM