Kyle Lo
@kylelo.bsky.social
language model pretraining @ai2.bsky.social, co-lead of data research w/ @soldaini.net, statistics @uw, open science, tabletop, seattle, he/him, 🧋 kyleclo.com
Pinned
during neurips, we kept the RL run going & model kept getting better 😂

Olmo 3.1 is a..
🐡 32B Thinking, still the best fully-open model to date
🐠 32B Instruct, for ppl who hate long yapping, as good as qwen3

we added 10 more pages to the paper! thx for community feedback from convos at neurips
Olmo 3.1 is here. We extended our strongest RL run and scaled our instruct recipe to 32B—releasing Olmo 3.1 Think 32B & Olmo 3.1 Instruct 32B, our most capable models yet. 🧵
December 12, 2025 at 6:03 PM
just had hechalou’s yin yang milk tea and i think i’ve transcended 🤤
December 14, 2025 at 1:43 AM
I'll be at #NeurIPS2025 from Tues-Sat!

Come say hi 👋 if you wanna chat about
🦈 olmo 3 stories
🐟 pretraining data & evals
🍣 midtraining shouldnt exist
🐠 model specialization
🐡 AI for education
🍥 tabletop games
December 1, 2025 at 9:51 PM
fml 🤦🏻‍♂️
CPSC Warns Consumers to Immediately Stop Using Batteries for E-Bikes from Rad Power Bikes Due to Fire Hazard; Risk of Serious Injury or Death www.cpsc.gov/Warnings/202...
November 24, 2025 at 8:02 PM
we released Olmo 3! lots of exciting stuff but wanna focus on:

🐟 Olmo 3 32B Base, the best fully-open base model to date, near Qwen 2.5 & Gemma 3 on diverse evals
🐠 Olmo 3 32B Think, first fully-open reasoning model approaching Qwen 3 levels
🐡 12 training datasets corresp to the different training stages
November 20, 2025 at 6:20 PM
going live with a mukbang tmr 🍱
November 19, 2025 at 5:35 PM
not happy abt the gpt 5.1 update. it's making way more mistakes than gpt 5 on basic stuff

latex table formatting errors (straight up missing "&" so columns misaligned, or dropping a whole column, or shifting values by 1 position), feels unusable imo 😒
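for illustration (made-up rows, not actual model output), one dropped "&" merges two cells and shifts everything after it by a column:

```latex
% intended row: 4 columns, 3 "&" separators
Olmo 3 & 32B & 68.4 & 71.2 \\
% generated row: the "&" after 32B is missing, so two cells merge
% and every later value lands one column to the left
Olmo 3 & 32B 68.4 & 71.2 \\
```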
November 14, 2025 at 12:26 PM
picking between 3 checkpoints w/ same benchmark scores but what if one of them is agi
November 12, 2025 at 5:31 PM
why intern at Ai2?

🐟interns own major parts of our model development, sometimes even leading whole projects
🐡we're committed to open science & actively help our interns publish their work

reach out if u wanna build open language models together 🤝

links 👇
November 5, 2025 at 11:11 PM
congrats to our olmo earth team 🌎

small multimodal foundation models + a system for finetuning them for important uses like agriculture, wildfire management, conservation & more 🌿
Introducing OlmoEarth 🌍, state-of-the-art AI foundation models paired with ready-to-use open infrastructure to turn Earth data into clear, up-to-date insights within hours—not years.
November 4, 2025 at 5:57 PM
woah guess VLMs for OCR are the hottest research topic this week 😆 since the first olmOCR, we've been..

🔥training our VLM using RLVR with binary unit test rewards🔥

it's incredibly effective & unit test creation is easy to scale w synthetic data pipelines

check it out at olmocr.allen.ai
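roughly the shape of the reward (a minimal sketch, not the actual olmOCR training code; the tests here are made up):

```python
# Minimal sketch of a binary unit-test reward for OCR RLVR.
# Each document ships with small checks over the model's extracted
# text; the rollout earns reward 1.0 only if every check passes.

from typing import Callable

UnitTest = Callable[[str], bool]  # extracted text -> pass/fail

def binary_reward(extracted_text: str, tests: list[UnitTest]) -> float:
    """Return 1.0 if all unit tests pass on the OCR output, else 0.0."""
    return 1.0 if all(test(extracted_text) for test in tests) else 0.0

# Hypothetical tests for one document, e.g. emitted by a synthetic
# data pipeline that knows the ground-truth contents:
tests = [
    lambda t: "Table 3" in t,        # a known heading survives
    lambda t: "$1,204.56" in t,      # a known cell value survives
    lambda t: t.find("Abstract") < t.find("Introduction"),  # reading order
]

print(binary_reward("Abstract ... Introduction ... Table 3 ... $1,204.56", tests))  # 1.0
```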
October 22, 2025 at 6:02 PM
bye #colm2025 big fan of the montreal bagels 🥯 hot take I like them better than
October 11, 2025 at 6:16 PM
come say hi at the posters this morning for OLMo 2 and fluid benchmarking 👋 and dont miss @valentinhofmann.bsky.social's talk in the morning #colm2025 @ai2.bsky.social vry proud of my gifs
October 9, 2025 at 1:14 PM
@josephc.bsky.social @mariaa.bsky.social and I are at poster #21

findings from a large-scale survey of 800 researchers on how they use LMs in their research #colm2025
October 8, 2025 at 8:12 PM
flyin to #colm2025 along w bunch of the @ai2.bsky.social team

come chat w me about pretraining horror stories, data & evals, what we're cookin for next olmo, etc

made a 🔥 poster for thursday sess, come say hi
October 6, 2025 at 3:20 PM
5 am airport for the only direct flight from seattle to montreal #colm2025
October 6, 2025 at 11:56 AM
not my project but I rlly like it

working w a cancer research center to analyze clinical data, but private data cant leave the center.

so the team built a tool that generates code for remote execution by the cancer center: developed on synthetic data, and now tested for realsies 🤩
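the pattern, roughly (a hypothetical sketch, not the team's actual tool): develop and validate against a synthetic stand-in for the private schema, then ship only the code to the center

```python
# Develop-on-synthetic, execute-remotely: the analysis never touches
# row-level data locally; only this code is sent to the cancer center.

import pandas as pd

def analyze(df: pd.DataFrame) -> dict:
    """Aggregate-only analysis: returns summary stats, never raw rows."""
    return {
        "n_patients": int(df["patient_id"].nunique()),
        "median_age": float(df["age"].median()),
    }

# Locally: validate against synthetic data with the same schema.
synthetic = pd.DataFrame({"patient_id": [1, 2, 2], "age": [54, 61, 61]})
assert analyze(synthetic) == {"n_patients": 2, "median_age": 61.0}

# Remotely: the center runs analyze() on the real dataframe and
# returns only the aggregate dict.
```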
October 2, 2025 at 5:19 PM
had to explain to a first-time submitter why an AC's recommended accept ended up as a reject 😮‍💨 been publishing long enough that i get why such things happen but it can be rough
September 19, 2025 at 12:07 AM
LM benchmark design requires 3 decisions, how to:
🐟 select test cases
🐠 score LM on each test
🦈 aggregate scores to estimate perf

fluid benchmarking is simple:
🍣 find max informative test cases
🍥 estimate 'ability', not simple avg perf

why care? turn ur grey noisy benchmarks to red ones!
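in spirit, the loop looks like this (a toy 2PL IRT sketch with made-up item parameters, not the paper's implementation): estimate a latent ability from right/wrong responses, then ask the item that is most informative at that ability

```python
# Toy fluid benchmarking: 2PL IRT ability estimation + max-information
# item selection. Item parameters (a: discrimination, b: difficulty)
# are invented for illustration.

import math

items = [(1.5, -1.0), (1.0, 0.0), (2.0, 0.5), (0.8, 1.5)]

def p_correct(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta: float, a: float, b: float) -> float:
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)  # Fisher information of a 2PL item

def estimate_theta(responses: list[tuple[int, int]]) -> float:
    """Grid-search MAP estimate; responses = [(item_idx, 0 or 1), ...]."""
    def logpost(theta):
        lp = -0.5 * theta ** 2  # N(0,1) prior keeps early estimates sane
        for i, y in responses:
            p = p_correct(theta, *items[i])
            lp += math.log(p if y else 1.0 - p)
        return lp
    grid = [g / 10 for g in range(-40, 41)]
    return max(grid, key=logpost)

responses = [(0, 1), (1, 1)]  # got the first two items right
theta = estimate_theta(responses)
asked = {i for i, _ in responses}
next_item = max((i for i in range(len(items)) if i not in asked),
                key=lambda i: information(theta, *items[i]))
print(f"ability estimate {theta:.1f}; ask item {next_item} next")
```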
September 17, 2025 at 6:17 PM
scathing takedown of the recent K2 Think model

"evaluates on data it was trained on, relies on an external model and additional samples for its claimed performance gains, and artificially reduces the scores of compared models"

www.sri.inf.ethz.ch/blog/k2think
Debunking the Claims of K2-Think
K2-Think is a recently released LLM that claims performance on par with GPT-OSS 120B and DeepSeek v3.1, despite having fewer parameters. As we discuss below, the reported gains are overstated, relying...
www.sri.inf.ethz.ch
September 12, 2025 at 8:57 PM
Reposted by Kyle Lo
COLM is coming up! Very excited. I'm starting to figure out two things:
1. A small invite-only dinner for Interconnects AI (Ai2 event news later).
2. Various research chats and catchups.
Fill out the form below or email me if you're interested :) 🍁🇨🇦
Interest form: buff.ly/9nWBxZ9
September 4, 2025 at 8:13 PM
Reposted by Kyle Lo
🎙️ Say hello to OLMoASR—our fully open, from-scratch speech-to-text (STT) model. Trained on a curated audio-text set, it boosts zero-shot ASR and now powers STT in the Ai2 Playground. 👇
August 28, 2025 at 4:13 PM
"Out of 13,048 reviewers..only 69 were deemed highly irresponsible..and enforcement was applied solely in those cases...These reviewers were contacted multiple times...as well as being personally contacted by the area chairs and senior area chairs, but still failed to fulfill them."

🫡🫡🫡
August 20, 2025 at 5:08 PM
my favorite figure from work by @davidheineman.com

if you're frustrated by LM evals, not knowing if results are real or noise, it's useful to decompose sources of variance:
🐠 is there enough meaningful spread (signal) among compared models
🐟 do scores vary between intermediate checkpoints (noise)
(2/6) Consider these training curves: 150M, 300M and 1B param models on 25 pretraining corpora. Many benchmarks can separate models, but are too noisy, and vice versa! 😧

We want – ⭐ low noise and high signal ⭐ – *both* low variance during training and a high spread of scores.
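a back-of-envelope version of that split (hypothetical sketch with made-up scores; the paper's estimators may differ): signal = spread of final scores across models, noise = wobble across each model's last few checkpoints

```python
# Toy signal-vs-noise check for one benchmark.

import statistics

# model -> benchmark scores at the last few pretraining checkpoints
scores = {
    "150M": [31.0, 30.5, 31.2],
    "300M": [38.1, 37.6, 38.4],
    "1B":   [52.3, 51.8, 52.6],
}

final = {m: s[-1] for m, s in scores.items()}

# signal: dispersion of final scores across the compared models
signal = statistics.pstdev(final.values())

# noise: typical score variance across a single model's checkpoints
noise = statistics.mean(statistics.pstdev(s) for s in scores.values())

print(f"signal={signal:.2f} noise={noise:.2f} ratio={signal / noise:.1f}")
```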
August 19, 2025 at 6:09 PM
very nice work by @datologyai.com folks on synth data for pretraining

strong results over nemotron synth, which we've generally been impressed by

rephrasing the web (arxiv.org/abs/2401.16380) seems very powerful & this is a good demonstration of how to push it further
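the core recipe, schematically (a hypothetical sketch; the prompt strings and `generate` helper are stand-ins, not the paper's code): prompt an LM to rewrite each web document in a target style, then pretrain on originals plus rephrasings

```python
# Schematic "rephrasing the web": turn noisy web text into cleaner
# synthetic pretraining data by asking an LM to rewrite it.

STYLES = {
    "wikipedia": "Rewrite this text in a clear, encyclopedic style:",
    "qa": "Rewrite this text as a series of questions and answers:",
}

def generate(prompt: str) -> str:
    """Stand-in for an LM call (hosted API or local model)."""
    raise NotImplementedError

def rephrase(doc: str, style: str) -> str:
    return generate(f"{STYLES[style]}\n\n{doc}")

def build_synthetic_corpus(web_docs: list[str]) -> list[str]:
    corpus = []
    for doc in web_docs:
        corpus.append(doc)        # keep the original...
        for style in STYLES:      # ...and add rephrased variants
            corpus.append(rephrase(doc, style))
    return corpus
```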
August 18, 2025 at 11:49 PM