Alisa Liu
@alisawuffles.bsky.social
phd student at @uwcse
Reposted by Alisa Liu
📢We’re taking your questions now on Reddit for tomorrow’s AMA!

Ask us anything about OLMo, our family of fully-open language models. Our researchers will be on hand to answer them Thursday, May 8 at 8am PST.
May 7, 2025 at 4:46 PM
Reposted by Alisa Liu
Humans store thousands of multi-word expressions like "of course" in their mental lexicon, but current tokenizers don't support multi-word tokens.

Enter SuperBPE, a tokenizer that lifts this restriction and brings substantial gains in efficiency and performance! 🚀

Details 👇
We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words.

When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
March 21, 2025 at 5:40 PM
Reposted by Alisa Liu
Hell yeah superwords. (I wanna call em supertokens, but I didn't develop them.)
March 21, 2025 at 6:17 PM
Reposted by Alisa Liu
Tokenizers govern the allocation of computation. It's a waste to spend a whole token of compute predicting the "way" in "By the way". SuperBPE redirects that compute to predict more difficult tokens, leading to wins on downstream tasks!
March 21, 2025 at 6:31 PM
Reposted by Alisa Liu
a small change to how you build your BPE tokenizer gets your pretrained LM 8 MMLU points (for example) and a 27% inference-time efficiency boost ...
March 21, 2025 at 4:52 PM
We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words.

When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
March 21, 2025 at 4:48 PM
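Not the paper's actual training procedure, just a minimal sketch of the idea the thread describes: lifting the whitespace-pretokenization restriction so BPE merges can cross word boundaries. The toy corpus, vocabulary size, and use of the Hugging Face `tokenizers` package are illustrative assumptions, not details from the paper.

```python
# Minimal sketch (not the SuperBPE recipe): train one BPE tokenizer with
# whitespace pretokenization (standard subword BPE) and one without it, so
# merges may cross word boundaries and form "superword" tokens.
# Assumes the Hugging Face `tokenizers` package; corpus and vocab size are toy values.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = [
    "by the way, of course we will be there",
    "of course, by the way it works as expected",
] * 100

def train_bpe(whitespace_pretok: bool) -> Tokenizer:
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    if whitespace_pretok:
        # Standard BPE: split on whitespace first, so no token spans two words.
        tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
    tok.train_from_iterator(corpus, trainer)
    return tok

subword = train_bpe(whitespace_pretok=True)
superword = train_bpe(whitespace_pretok=False)

text = "of course, by the way"
print(subword.encode(text).tokens)    # tokens never cross whitespace
print(superword.encode(text).tokens)  # may include multi-word tokens like "of course"
```

With the restriction lifted, frequent multi-word expressions such as "of course" can become single tokens, which is where the shorter sequences (and the inference-time savings above) come from.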
Reposted by Alisa Liu
This is also addressed in the appendix of @alisawuffles.bsky.social and colleagues' paper on data mixture inference from BPE tokenizers. I think it might have been discovered by @soldaini.net if I'm not mistaken.

arxiv.org/abs/2407.16607
February 28, 2025 at 10:47 AM
excited to be at #NeurIPS2024! I'll be presenting our data mixture inference attack 🗓️ Thu 4:30pm w/ @jon.jon.ke — stop by to learn what trained tokenizers reveal about LLM development (‼️) and chat about all things tokenizers.

🔗 arxiv.org/abs/2407.16607
December 11, 2024 at 10:08 PM
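For a rough sense of what "what trained tokenizers reveal" means mechanically: a BPE tokenizer's ordered merge list records which pair was most frequent at each training step, and those orderings constrain the unknown mixture of data categories; the paper solves for the mixture with a linear program. The sketch below is a hypothetical, heavily simplified version of that constraint structure; the two categories, the pair names, and all frequencies are made up, and pair counts are held fixed across merges rather than recomputed as in the actual attack.

```python
# Hypothetical, heavily simplified sketch of the constraint structure behind
# data mixture inference from BPE tokenizers (arxiv.org/abs/2407.16607):
# at each BPE training step, the chosen merge must have had weighted frequency
# >= every other candidate pair, where the weights are the unknown mixture
# proportions. The real attack recomputes pair counts after each merge and
# handles many categories; here counts are fixed and there are only two.
import numpy as np
from scipy.optimize import linprog

# Toy per-category pair frequencies (rows: candidate pairs, cols: [english, code]).
pairs = ["t+h", "i+n", "d+e", "s+e"]
freqs = np.array([
    [900.0, 100.0],   # "t h" common in English prose, rarer in code
    [400.0, 300.0],
    [150.0, 800.0],   # "d e" common in code (e.g. `def`), rarer in prose
    [200.0, 500.0],
])
observed_merges = ["t+h", "i+n"]  # first merges read off the tokenizer's merge list

# Build A_ub @ alpha <= 0: at each step, every remaining candidate pair's
# weighted frequency must not exceed the chosen pair's weighted frequency.
rows = []
merged = set()
for chosen in observed_merges:
    c = pairs.index(chosen)
    for p, name in enumerate(pairs):
        if p != c and name not in merged:
            rows.append(freqs[p] - freqs[c])
    merged.add(chosen)
A_ub = np.array(rows)
b_ub = np.zeros(len(rows))

# Mixture proportions are nonnegative and sum to 1; maximize the code share
# to get an upper bound on how much code the mixture could contain.
res = linprog(c=[0.0, -1.0], A_ub=A_ub, b_ub=b_ub,
              A_eq=[[1.0, 1.0]], b_eq=[1.0],
              bounds=[(0.0, 1.0), (0.0, 1.0)], method="highs")
print("mixture maximizing the code share (english, code):", res.x)
print("=> at most", round(res.x[1], 3), "of this toy mixture can be code")
```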
Reposted by Alisa Liu
Want to predict the task performance of LMs before pretraining them?

We develop task scaling laws and model ladders, which predict the accuracy of OLMo 2 7B & 13B models on individual tasks within 2 points of absolute error. The cost is 1% of the compute used to pretrain them.
December 9, 2024 at 5:07 PM
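A rough sketch of how a ladder-style prediction can work in general: fit task loss as a function of compute on small models, fit accuracy as a function of task loss, then chain the two fits to extrapolate to a larger budget. The functional forms and every number below are invented for illustration and are not taken from the OLMo 2 ladder paper.

```python
# Toy two-step "model ladder" style fit (synthetic numbers, generic forms):
#   step 1: task loss as a power law in training compute, fit on small models
#   step 2: task accuracy as a sigmoid of task loss
# The fitted curves are then chained to extrapolate accuracy at a compute
# budget ~100x larger than the biggest "ladder" model.
import numpy as np
from scipy.optimize import curve_fit

def loss_from_compute(C, A, alpha, E):
    # Power law in compute plus an irreducible-loss term E.
    return A * C ** (-alpha) + E

def acc_from_loss(L, floor, ceil, k, L0):
    # Sigmoid rising from chance-level `floor` to `ceil` as task loss decreases.
    return floor + (ceil - floor) / (1.0 + np.exp(k * (L - L0)))

# Synthetic "ladder" measurements (compute in units of 1e18 FLOPs).
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
loss = loss_from_compute(compute, A=6.3, alpha=0.25, E=0.7)
acc = acc_from_loss(loss, floor=0.25, ceil=0.75, k=2.0, L0=4.0)

# Fit each step on the small-model points only.
p_loss, _ = curve_fit(loss_from_compute, compute, loss, p0=[5.0, 0.2, 0.5])
p_acc, _ = curve_fit(acc_from_loss, loss, acc, p0=[0.25, 0.75, 1.0, 3.0])

# Chain the fits to predict accuracy at a much larger target budget.
target_compute = 1e4   # i.e. 1e22 FLOPs in these units
pred_loss = loss_from_compute(target_compute, *p_loss)
pred_acc = acc_from_loss(pred_loss, *p_acc)
print(f"predicted task loss {pred_loss:.2f}, predicted accuracy {pred_acc:.3f}")
```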
Reposted by Alisa Liu
🚨I too am on the job market‼️🤯

I'm searching for faculty positions/postdocs in multilingual/multicultural NLP, vision+language models, and eval for genAI!

I'll be at #NeurIPS2024 presenting our work on meta-evaluation for text-to-image faithfulness! Let's chat there!

Papers in 🧵, see more: saxon.me
December 6, 2024 at 1:44 AM
Reposted by Alisa Liu
We just updated the OLMo repo at github.com/allenai/OLMo!
There are now several training configs that together reproduce the training runs that led to the final OLMo 2 models.
In particular, all the training data is available, tokenized and shuffled exactly as we trained on it!
GitHub - allenai/OLMo: Modeling, training, eval, and inference code for OLMo
github.com
December 2, 2024 at 8:13 PM
Reposted by Alisa Liu
Meet OLMo 2, the best fully open language model to date, including a family of 7B and 13B models trained on up to 5T tokens. OLMo 2 outperforms other fully open models and competes with open-weight models like Llama 3.1 8B. As always, we released our data, code, recipes, and more 🎁
November 26, 2024 at 8:51 PM
Reposted by Alisa Liu
OLMo 2 is out 🥳 7B and 13B trained on 5T tokens, and meticulously instruction-tuned using the Tülu 3 recipe.

Simply the best fully open models yet.

Really proud of the work & the amazing team at
@ai2.bsky.social
November 26, 2024 at 9:12 PM
Reposted by Alisa Liu
No one can explain stochastic gradient descent better than this panda.
Alt: a panda bear is rolling around in the grass in a zoo enclosure.
November 24, 2024 at 3:04 PM
Reposted by Alisa Liu
Reading the TÜLU 3 paper from @ai2.bsky.social. It's refreshing to see a research lab treating AI as a real science with full reports, data, code, logs, evals.

Paper: allenai.org/papers/tulu-...
Demo: playground.allenai.org
Code: github.com/allenai/open...
Eval: github.com/allenai/olmes

November 24, 2024 at 6:04 PM
Reposted by Alisa Liu
Meet Tülu 3, a set of state-of-the-art instruct models with fully open data, eval code, and training algorithms.
We invented new methods for fine-tuning language models with RL and built upon best practices to scale synthetic instruction and preference data.
Demo, GitHub, paper, and models 👇
November 21, 2024 at 5:15 PM
Reposted by Alisa Liu
I've spent the last two years scouring all available resources on RLHF specifically and post-training broadly. Today, with the help of a totally cracked team, we bring you the fruits of that labor — Tülu 3, an entirely open frontier-model post-training recipe. We beat Llama 3.1 Instruct.

Thread.
November 21, 2024 at 5:01 PM
Reposted by Alisa Liu
1/ Introducing ᴏᴘᴇɴꜱᴄʜᴏʟᴀʀ: a retrieval-augmented LM to help scientists synthesize knowledge 📚
@uwnlp.bsky.social & Ai2
With open models & 45M-paper datastores, it outperforms proprietary systems & matches human experts.
Try out our demo!
openscholar.allen.ai
November 19, 2024 at 4:30 PM