Cong Lu
@cong-ml.bsky.social
Research Scientist @ Google DeepMind, working on open-ended learning and AI for Scientific Discovery.
More major advantages! 🌟

COST-EFFECTIVE: StochasTok lets enhanced subword skills be seamlessly 'retrofitted' into existing pretrained models, avoiding costly pretraining from scratch!
ENHANCED ROBUSTNESS: Improves resilience to alternative tokenizations! (see examples)

[6/]
June 11, 2025 at 12:09 PM
Empirically, we find:
LANGUAGE: As hoped, StochasTok unlocks language manipulation ability! (see task examples below)
MATH: Furthermore, StochasTok dramatically changes the learning dynamics of multi-digit addition, enabling grokking and even generalization to UNSEEN TOKENIZERS!🤯

[5/]
June 11, 2025 at 12:09 PM
The underlying StochasTok algorithm is extremely simple!

1️⃣ Simply tokenize text with ANY base tokenizer,
2️⃣ Then, stochastically split some of those tokens into equivalent token pairs.

That’s basically it! Repeat step 2 for the desired granularity.
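A minimal sketch of the splitting step in Python (illustrative only: 'vocab', 'inv_vocab', and 'split_prob' are assumed names, not the paper's actual API):

import random

def stochastok_split(token_ids, vocab, inv_vocab, split_prob=0.1):
    # vocab: string -> token id; inv_vocab: token id -> string
    out = []
    for tid in token_ids:
        text = inv_vocab[tid]
        # Every way to cut this token into two pieces that are both in-vocab.
        pairs = [(vocab[text[:i]], vocab[text[i:]])
                 for i in range(1, len(text))
                 if text[:i] in vocab and text[i:] in vocab]
        if pairs and random.random() < split_prob:
            out.extend(random.choice(pairs))  # emit an equivalent token pair
        else:
            out.append(tid)
    return out

Applying stochastok_split repeatedly splits tokens into ever-smaller in-vocab pieces, which is the "repeat step 2" knob for granularity.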

[3/]
June 11, 2025 at 12:09 PM
🤔The problem: Standard tokenization assigns each token an arbitrary distinct ID, hiding subword structure from the model, e.g., ‘book’=3092 and ‘cook’=171691 differ by a single letter yet get completely unrelated IDs.

🎉The solution: Allow LLMs to naturally 'see inside' tokens via alternative tokenizations!
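A toy illustration in Python (all token IDs except 'book'=3092 are made up): with a BPE-style vocab, the same string admits several equivalent tokenizations, so seeing them during training exposes shared subword structure:

vocab = {'book': 3092, 'bo': 17, 'ok': 482, 'b': 65, 'ook': 9120}
word = 'book'
# Enumerate two-piece tokenizations whose pieces are both in the vocab.
alts = [[word]] + [[word[:i], word[i:]]
                   for i in range(1, len(word))
                   if word[:i] in vocab and word[i:] in vocab]
print(alts)  # [['book'], ['b', 'ook'], ['bo', 'ok']]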

[2/]
June 11, 2025 at 12:09 PM
🚀Introducing “StochasTok: Improving Fine-Grained Subword Understanding in LLMs”!🚀

LLMs are incredible but still struggle disproportionately with subword-level tasks, e.g., character counting, wordplay, multi-digit arithmetic, fixing typos… Enter StochasTok, led by Anya Sims!

[1/]
June 11, 2025 at 12:09 PM
Interested in robust model-based offline RL algorithms? Come check out Anya Sims presenting our new paper investigating the edge-of-reach problem in offline MBRL!

📍East Exhibit Hall A-C #4603

#NeurIPS2024
December 12, 2024 at 12:34 AM