garreth
@garrethlee.bsky.social
🇮🇩 | Co-Founder at Mundo AI (YC W25) | ex-{Hugging Face, Cohere}
Rumor has it that earlier Claude models used a modified three-digit tokenization scheme, grouping digits right-to-left instead of left-to-right.

Right-to-left grouping mirrors how we often read and interpret numbers: commas group digits in threes from the right, as in 1,234,567. Theoretically, this should help with math reasoning!
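
A minimal sketch of the difference in plain Python (an illustration only; `chunk_l2r` and `chunk_r2l` are made-up helpers, and the rumored Claude scheme is unconfirmed):

```python
# Toy illustration of the two grouping directions, not an actual tokenizer.

def chunk_l2r(digits: str, size: int = 3) -> list[str]:
    """Group digits left-to-right: '1234567' -> ['123', '456', '7']."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]

def chunk_r2l(digits: str, size: int = 3) -> list[str]:
    """Group digits right-to-left: '1234567' -> ['1', '234', '567'],
    matching where commas go: 1,234,567."""
    head = len(digits) % size or size
    return [digits[:head]] + chunk_l2r(digits[head:], size)

print(chunk_l2r("1234567"))  # ['123', '456', '7']  - groups misalign with place value
print(chunk_r2l("1234567"))  # ['1', '234', '567']  - groups align with thousands/millions
```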

[5/N]
December 16, 2024 at 5:31 PM
Alas, tokenizing numbers as digits was costly:

A 10-digit number now took 10 tokens instead of 3-4, roughly 2-3x more than before. That's a significant hit on training & inference costs!

LLaMA 3 fixed this by tokenizing digits in groups of up to three, balancing compression and consistency.
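
A back-of-the-envelope comparison of the two schemes (the helper names below are hypothetical, not real tokenizer APIs):

```python
import math

# Token cost of an n-digit number under each scheme (illustrative only).
def digit_level_tokens(n_digits: int) -> int:
    return n_digits                     # one token per digit, LLaMA 1 style

def grouped_tokens(n_digits: int, group: int = 3) -> int:
    return math.ceil(n_digits / group)  # up-to-3-digit groups, LLaMA 3 style

for n in (4, 10, 20):
    print(f"{n:>2} digits: {digit_level_tokens(n):>2} tokens vs {grouped_tokens(n)} tokens")
# 10 digits: 10 tokens vs 4 tokens -> ~2.5x fewer with grouping
```

The trade-off is vocabulary size: covering every 1-to-3-digit string takes 10 + 100 + 1000 = 1,110 number tokens instead of 10.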

[4/N]
December 16, 2024 at 5:31 PM
Then came LLaMA 1, which took a clever approach to fixing number inconsistencies: it tokenized numbers into individual digits (0-9), so any number, however large, could be represented with a vocabulary of just 10 digit tokens.

The consistent representation of numbers noticeably improved mathematical reasoning!
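
A toy pre-tokenization step in that spirit, sketched with a regex (an illustration, not LLaMA's actual SentencePiece configuration):

```python
import re

def split_digits(text: str) -> list[str]:
    """Split text so every digit stands alone; the number vocabulary
    is then just the 10 tokens '0'-'9'."""
    return re.findall(r"\d|\D+", text)

print(split_digits("pi is 3.14159"))
# ['pi is ', '3', '.', '1', '4', '1', '5', '9']
```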

[3/N]
December 16, 2024 at 5:31 PM
When GPT-2 came out in 2019, its tokenizer used byte-pair encoding (BPE), still common today:

• Merges frequent substrings into single tokens, shortening sequences vs. feeding in single characters
• However, the vocabulary depends on the training data
• Common numbers (e.g., 1999) get single tokens; rarer ones are split into arbitrary chunks (see below)
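
You can check this with the `tiktoken` library (`pip install tiktoken`), which ships the GPT-2 vocabulary; exact splits depend on what was frequent in GPT-2's training data:

```python
import tiktoken

# Load the GPT-2 BPE vocabulary and inspect how numbers get split.
enc = tiktoken.get_encoding("gpt2")
for num in ["1999", "1234567890"]:
    pieces = [enc.decode([tid]) for tid in enc.encode(num)]
    print(f"{num!r} -> {len(pieces)} token(s): {pieces}")
# A common year like '1999' comes out as a single token, while a long
# arbitrary number is split into several multi-digit chunks.
```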

[2/N]
December 16, 2024 at 5:31 PM
🚀 With Meta's recent paper replacing tokenization in LLMs with patches 🩹, I figured it's a great time to revisit how tokenization has evolved over the years using everyone's favourite medium: memes!

Let's take a trip down memory lane!

[1/N]
December 16, 2024 at 5:31 PM
I made a simple CLI tool to write conventional git commit messages using the Hugging Face Inference API 🤗 (with some useful functionality baked into it)

➡️ To install: `pip install gcmt`
November 25, 2024 at 4:31 AM