Alisa Liu
@alisawuffles.bsky.social
PhD student at @uwcse
Right! "kick the bucket" is too infrequent, but there are more common idiomatic expressions like "in the long run" or "on the other hand." In general I would say non-idiomatic MWEs are more common, like uses of prepositions ("depend on") which require memorization.
March 21, 2025 at 8:59 PM
nothing beats writing papers together with co-1st @jon.jon.ke — the mention didn't work the first time!
March 21, 2025 at 6:26 PM
Play around with our tokenizers here! superbpe.github.io 🚀
Paper: arxiv.org/abs/2503.13423
HF models & tokenizers: tinyurl.com/superbpe

This work would not have been possible w/o co-1st 🌟@jon.jon.ke🌟, @valentinhofmann.bsky.social @sewoong79.bsky.social @nlpnoah.bsky.social @yejinchoinka.bsky.social
March 21, 2025 at 4:48 PM
SuperBPE🚀 is a seamless replacement for BPE in modern LM development pipelines, requiring no changes to the model architecture or training framework. You can use it in HuggingFace right now!
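A minimal sketch of what "seamless" looks like in practice, assuming the standard HuggingFace AutoTokenizer API; the repo id below is a placeholder, not an official checkpoint name (see tinyurl.com/superbpe for the released ones):

```python
# Minimal sketch: swapping in a SuperBPE tokenizer via HuggingFace.
# "your-org/superbpe-t200k" is a placeholder repo id -- substitute a
# released checkpoint from tinyurl.com/superbpe.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/superbpe-t200k")

text = "By the way, superword tokens can span whitespace."
ids = tokenizer(text)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))  # some tokens may cover several words
```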
March 21, 2025 at 4:48 PM
Why does SuperBPE🚀 work? We find that loss is distributed more uniformly over tokens in SuperBPE models. They are less overfit to high-frequency, easy-to-predict tokens (e.g. “way” after “By the”), and at the same time master a much broader set of language phenomena.
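A rough sketch of one way to inspect this yourself, assuming a causal LM in HuggingFace; "gpt2" is only a stand-in for the paper's BPE/SuperBPE 8B checkpoints, and the snippet just shows how per-token loss spread could be measured:

```python
# Sketch: compute per-token negative log-likelihood for a causal LM and
# look at how the loss is distributed across tokens.
# The model name is a placeholder, not the paper's 8B checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; swap in a BPE vs. SuperBPE model pair to compare
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

text = "By the way, easy tokens get very low loss under plain BPE."
ids = tok(text, return_tensors="pt")["input_ids"]

with torch.no_grad():
    logits = model(ids).logits[:, :-1]  # predict token t+1 from its prefix
    targets = ids[:, 1:]
    nll = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    )

print("per-token loss:", [round(x, 2) for x in nll.tolist()])
print("mean / std:", nll.mean().item(), nll.std().item())  # more uniform => smaller spread
```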
March 21, 2025 at 4:48 PM
Then we pretrain 8B models from scratch with BPE and SuperBPE🚀, fixing everything about the training setup except the tokenizer. We see +4% on avg📈 across 30 downstream tasks, winning 25 of the 30 individual tasks, while also being 27% more efficient at inference time.
March 21, 2025 at 4:48 PM
What can we gain from less restrictive tokenization? To find out, we developed SuperBPE🚀, which learns subword *and* superword tokens. SuperBPE dramatically improves encoding efficiency over BPE — at a fixed vocab size of 200k, SuperBPE reduces sequence length by 33% on average!
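A sketch of how that encoding-efficiency comparison could be reproduced, assuming HuggingFace tokenizers; the SuperBPE repo id is a placeholder, and gpt2 here is only a convenient subword baseline rather than the vocab-matched 200k BPE from the paper:

```python
# Sketch: compare sequence length / bytes-per-token for two tokenizers on the same text.
# "your-org/superbpe-t200k" is a placeholder repo id; gpt2 stands in for a BPE baseline
# (the paper's 33% figure is measured at a matched 200k vocab size).
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")
superbpe = AutoTokenizer.from_pretrained("your-org/superbpe-t200k")

text = open("sample.txt").read()  # any held-out text

n_bytes = len(text.encode("utf-8"))
n_bpe = len(bpe(text)["input_ids"])
n_super = len(superbpe(text)["input_ids"])

print(f"bytes/token  BPE:      {n_bytes / n_bpe:.2f}")
print(f"bytes/token  SuperBPE: {n_bytes / n_super:.2f}")
print(f"sequence length reduction: {1 - n_super / n_bpe:.1%}")
```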
March 21, 2025 at 4:48 PM
E.g. “math teacher” = “Mathelehrer” in German. At the extreme, Chinese *doesn’t use whitespace at all*, so its tokens can span many words — yet this has seemingly not hindered LMs like @deepseek_ai from learning it!
March 21, 2025 at 4:48 PM
This started with a curiosity💡: why do all LLMs limit tokens to *parts* of whitespace-delimited words? After all, many word sequences (e.g. “by the way”) function as single units. Different languages can also express the same meaning in one or several words.
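For concreteness, here is what that restriction looks like with an off-the-shelf BPE tokenizer (GPT-2): because pretokenization splits on whitespace, "by the way" can never become a single token:

```python
# Standard BPE pretokenizes on whitespace, so no token ever crosses a word boundary.
from transformers import AutoTokenizer

gpt2 = AutoTokenizer.from_pretrained("gpt2")
print(gpt2.tokenize("by the way"))  # ['by', 'Ġthe', 'Ġway'] -- three tokens, never one
```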
March 21, 2025 at 4:48 PM
Reposted by Alisa Liu
This is also addressed in the appendix of @alisawuffles.bsky.social and colleagues' paper on BPE mixture inference. I think it was first noticed by @soldaini.net, if I'm not mistaken.

arxiv.org/abs/2407.16607
February 28, 2025 at 10:47 AM
🙋🏻‍♀️ ty!
November 25, 2024 at 5:09 PM