Paper: arxiv.org/abs/2503.13423
HF models & tokenizers: tinyurl.com/superbpe
This work would not have been possible without co-first author 🌟@jon.jon.ke🌟, and @valentinhofmann.bsky.social, @sewoong79.bsky.social, @nlpnoah.bsky.social, @yejinchoinka.bsky.social
When pretrained at 8B scale, SuperBPE models consistently outperform the BPE baseline across 30 downstream tasks (+8% on MMLU), while also being 27% more efficient at inference time. 🧵
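As a quick illustration of where the inference savings come from (SuperBPE's superword tokens mean fewer tokens per piece of text, hence fewer decoding steps), here is a minimal sketch comparing encoded lengths with the Hugging Face `transformers` library. The repo IDs are hypothetical placeholders, not the actual HF repo names; substitute the tokenizers from the link above.

```python
# Minimal sketch: compare token counts for a standard BPE tokenizer vs. a
# SuperBPE tokenizer. The repo IDs below are hypothetical placeholders --
# substitute the actual tokenizers linked at tinyurl.com/superbpe.
from transformers import AutoTokenizer

bpe_tok = AutoTokenizer.from_pretrained("example-org/bpe-baseline")      # placeholder ID
superbpe_tok = AutoTokenizer.from_pretrained("example-org/superbpe-8b")  # placeholder ID

text = "By the way, I am going to the store later this afternoon."

bpe_ids = bpe_tok.encode(text)
super_ids = superbpe_tok.encode(text)

print(f"BPE tokens:      {len(bpe_ids)}")
print(f"SuperBPE tokens: {len(super_ids)}")
# Fewer tokens per input means fewer forward passes at generation time,
# which is where the reported ~27% inference-efficiency gain comes from.
```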
🔗 arxiv.org/abs/2407.16607