Cong Lu
@cong-ml.bsky.social
Research Scientist @ Google DeepMind, working on open-ended learning and AI for scientific discovery.
📄 Paper: arxiv.org/abs/2506.01687
💻 Code: github.com/anyasims/sto...
A massive 🙏 to my incredible co-authors: Anya Sims, Thom Foster, @klarakaleb.bsky.social, Tuan-Duy H. Nguyen, Joseph Lee, @jfoerst.bsky.social, @yeewhye.bsky.social!

[8/8]
StochasTok: Improving Fine-Grained Subword Understanding in LLMs
June 11, 2025 at 12:09 PM
The significant gains from this minimal change are super exciting, and we see huge potential for larger models and more complex tasks like coding, scientific reasoning, and beyond! We invite you to explore the paper and code!

[7/]
More major advantages! 🌟

COST-EFFECTIVE: StochasTok lets enhanced subword skills be seamlessly 'retrofitted' into existing pretrained models, avoiding costly pretraining from scratch!
ENHANCED ROBUSTNESS: Improves resilience to alternative tokenizations! (see examples)

[6/]
Empirically, we find:
LANGUAGE: As hoped, StochasTok unlocks language manipulation ability! (see task examples below)
MATH: Furthermore, StochasTok dramatically changes how models learn multi-digit addition, enabling grokking and even generalization to UNSEEN TOKENIZERS!🤯

[5/]
Practically, StochasTok is:
✅Computationally lightweight🪶
✅A simple dataset preprocessing step — no training-loop or inference-time changes required!🛠️
✅Compatible with ANY base tokenizer — Allows us to retrofit pretrained models!💰
✅Robust to hyperparameter choice!🔥

[4/]
The underlying StochasTok algorithm is extremely simple!

1️⃣ Simply tokenize text with ANY base tokenizer,
2️⃣ Then, stochastically split some of those tokens into equivalent token pairs.

That’s basically it! Repeat step 2 for the desired granularity.
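
A minimal sketch of the idea in Python, assuming a tiktoken-style tokenizer with encode()/decode() over token-ID lists (the function names and split probability are illustrative, not the paper's actual API; the real implementation is in the repo linked above):

```python
import random

def split_candidates(token_id, tokenizer):
    """All (left, right) token pairs whose decoded strings concatenate
    back to this token's string. Hypothetical helper, assuming a
    tiktoken-style tokenizer with encode()/decode() over ID lists."""
    s = tokenizer.decode([token_id])
    pairs = []
    for i in range(1, len(s)):
        left = tokenizer.encode(s[:i])
        right = tokenizer.encode(s[i:])
        # Keep only splits where each half is itself a single token,
        # so the pair is exactly equivalent to the original token.
        if len(left) == 1 and len(right) == 1:
            pairs.append((left[0], right[0]))
    return pairs

def stochastok_sketch(token_ids, tokenizer, p_split=0.1, n_rounds=1):
    """StochasTok-style preprocessing: with probability p_split, replace
    a token with a randomly chosen equivalent token pair. Repeating the
    pass (n_rounds) gives finer-grained splits."""
    for _ in range(n_rounds):
        new_ids = []
        for tok in token_ids:
            pairs = split_candidates(tok, tokenizer)
            if pairs and random.random() < p_split:
                new_ids.extend(random.choice(pairs))  # stochastic split
            else:
                new_ids.append(tok)                   # keep the original token
        token_ids = new_ids
    return token_ids

# Example usage with tiktoken (pip install tiktoken):
# import tiktoken
# enc = tiktoken.get_encoding("gpt2")
# ids = enc.encode("The cook read the book.")
# noisy_ids = stochastok_sketch(ids, enc, p_split=0.3)
# enc.decode(noisy_ids)  # same text, different token sequence
```

Because the splits are resampled stochastically, the same text is seen under many equivalent tokenizations during training.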

[3/]
🤔The problem: Standard tokenization assigns each token an arbitrary, unrelated ID, making subword structure unnecessarily hard to learn, e.g., ‘book’=3092 and ‘cook’=171691 differ by a single letter but share nothing as IDs.

🎉The solution: Allow LLMs to naturally 'see inside' tokens via alternative tokenizations!
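
For intuition, here is the kind of equivalent tokenizations StochasTok exposes for a single word (splits shown as strings; the actual token IDs depend on the base tokenizer):

```python
# Equivalent tokenizations of the same word, all decoding to "cook":
alternatives = [
    ["cook"],            # standard tokenization: one opaque ID
    ["co", "ok"],        # after one stochastic split
    ["c", "o", "ok"],    # after a further round of splitting
]
assert all("".join(toks) == "cook" for toks in alternatives)
```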

[2/]