Cong Lu
@cong-ml.bsky.social
Research Scientist @ Google DeepMind, working on open-ended learning and AI for scientific discovery.
📄 Paper: arxiv.org/abs/2506.01687
💻 Code: github.com/anyasims/sto...
A massive 🙏 to my incredible co-authors: Anya Sims, Thom Foster, @klarakaleb.bsky.social, Tuan-Duy H. Nguyen, Joseph Lee, @jfoerst.bsky.social, @yeewhye.bsky.social!

[8/8]
StochasTok: Improving Fine-Grained Subword Understanding in LLMs
June 11, 2025 at 12:09 PM
The significant gains from this minimal change are super exciting, and we see huge potential for larger models and more complex tasks like coding, scientific reasoning, and beyond! We invite you to explore the paper and code!

[7/]
More major advantages! 🌟

COST-EFFECTIVE: StochasTok lets enhanced subword skills be seamlessly 'retrofitted' into existing pretrained models, avoiding costly pretraining from scratch!
ENHANCED ROBUSTNESS: Improves resilience to alternative tokenizations! (see examples)

[6/]
Empirically, we find:
LANGUAGE: As hoped, StochasTok unlocks language manipulation ability! (see task examples below)
MATH: Furthermore, StochasTok dramatically changes how models learn multi-digit addition, enabling grokking and even generalization to UNSEEN TOKENIZERS!🤯

[5/]
Practically, StochasTok is:
✅Computationally lightweight🪶
✅A simple dataset preprocessing step — no training-loop or inference-time changes required!🛠️
✅Compatible with ANY base tokenizer — Allows us to retrofit pretrained models!💰
✅Robust to hyperparameter choice!🔥

[4/]
The underlying StochasTok algorithm is extremely simple!

1️⃣ Simply tokenize text with ANY base tokenizer,
2️⃣ Then, stochastically split some of those tokens into equivalent token pairs.

That’s basically it! Repeat step 2 for the desired granularity.
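
A minimal sketch of the idea in Python, assuming a tiktoken-style tokenizer with encode()/decode() over token-ID lists (the function names and split probability are illustrative, not the paper's actual API; the real implementation is in the repo linked above):

```python
import random

def split_candidates(token_id, tokenizer):
    """All (left, right) token pairs whose decoded strings concatenate
    back to this token's string. Hypothetical helper, assuming a
    tiktoken-style tokenizer with encode()/decode() over ID lists."""
    s = tokenizer.decode([token_id])
    pairs = []
    for i in range(1, len(s)):
        left = tokenizer.encode(s[:i])
        right = tokenizer.encode(s[i:])
        # Keep only splits where each half is itself a single token,
        # so the pair is exactly equivalent to the original token.
        if len(left) == 1 and len(right) == 1:
            pairs.append((left[0], right[0]))
    return pairs

def stochastok_sketch(token_ids, tokenizer, p_split=0.1, n_rounds=1):
    """StochasTok-style preprocessing: with probability p_split, replace
    a token with a randomly chosen equivalent token pair. Repeating the
    pass (n_rounds) gives finer-grained splits."""
    for _ in range(n_rounds):
        new_ids = []
        for tok in token_ids:
            pairs = split_candidates(tok, tokenizer)
            if pairs and random.random() < p_split:
                new_ids.extend(random.choice(pairs))  # stochastic split
            else:
                new_ids.append(tok)                   # keep the original token
        token_ids = new_ids
    return token_ids

# Example usage with tiktoken (pip install tiktoken):
# import tiktoken
# enc = tiktoken.get_encoding("gpt2")
# ids = enc.encode("The cook read the book.")
# noisy_ids = stochastok_sketch(ids, enc, p_split=0.3)
# enc.decode(noisy_ids)  # same text, different token sequence
```

Because the splits are resampled stochastically, the same text is seen under many equivalent tokenizations during training.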

[3/]
🤔The problem: Standard tokenization assigns each token an arbitrary, unrelated ID, making subword structure unnecessarily hard to learn, e.g., ‘book’=3092 and ‘cook’=171691 differ by a single letter but share nothing as IDs.

🎉The solution: Allow LLMs to naturally 'see inside' tokens via alternative tokenizations!
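
For intuition, here is the kind of equivalent tokenizations StochasTok exposes for a single word (splits shown as strings; the actual token IDs depend on the base tokenizer):

```python
# Equivalent tokenizations of the same word, all decoding to "cook":
alternatives = [
    ["cook"],            # standard tokenization: one opaque ID
    ["co", "ok"],        # after one stochastic split
    ["c", "o", "ok"],    # after a further round of splitting
]
assert all("".join(toks) == "cook" for toks in alternatives)
```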

[2/]