Alexander Doria
@dorialexander.bsky.social
LLM for the commons.
Pinned
Breaking: we release a fully synthetic generalist dataset for pretraining, SYNTH and two new SOTA reasoning models exclusively trained on it. Despite having seen only 200 billion tokens, Baguettotron is currently best-in-class in its size range. pleias.fr/blog/blogsyn...
As a result, did the long delayed/much needed update of personal web page: vintagedata.org
i guess i’m really an ai researcher now
February 8, 2026 at 3:28 PM
I’m not sure if it’s limited to the EU but sex censorship of the web is getting absolutely ridiculous. 1950s level.
February 6, 2026 at 9:42 PM
Wouldn’t have bet two years ago on Bender & co getting stuck in a weird cultish mood while the EA people respond very sensibly, but here we are. benthams.substack.com/p/the-ai-con...
February 3, 2026 at 6:27 PM
Meanwhile, further independent evaluation confirms SYNTH performs better than Nanochat/Fineweb across nearly all benchmarks.

(HumanEval was to be expected: we did not put any code in it yet.)
Outside of HumanEval, the model pretrained on the SYNTH dataset outperforms the standard nanochat on every task. I report results for both the standard chat template (Harmony) as well as the Qwen3 Chat template used in the SYNTH dataset.
February 1, 2026 at 10:50 PM
It took me weeks, but finally it's there: an overlong blogpost on synthetic pretraining. vintagedata.org/blog/posts/s...
February 1, 2026 at 5:53 PM
maybe it's awfully european of me, but i'm not convinced dumping a mass of non-contextualized private documents is ever a good thing
February 1, 2026 at 2:42 PM
seems we’re dangerously close to a point where someone will tweet you can slurp three gas town juices in a moltbot - and kill a few ai bubbles.
February 1, 2026 at 11:56 AM
Common Corpus paper is going to @iclr-conf.bsky.social !
Announcing the release of the official Common Corpus paper: a 20 page report detailing how we collected, processed and published 2 trillion tokens of reusable data for LLM pretraining arxiv.org/pdf/2506.01732
January 26, 2026 at 1:08 PM
If I see another headline about Yann LeCun’s "contrarian" bet, I’m killing a non-verbal model.
January 25, 2026 at 5:26 PM
Now starting.

From what I see, it’s generally compared to Joyce and Kafka, but for now the obvious comparison point is Robert Musil (did Musil read it?)
January 25, 2026 at 3:25 PM
If you ever wondered if SYNTH could be usable in mid-training for larger models: Step-DeepResearch (from StepFun) is now out as private beta. stepfun.ai/deep-researc...
January 25, 2026 at 9:11 AM
It’s a bit soon for Follow Friday, but still recommending following this cat.
I'm going to crosspost between X and Bsky a bit to try to find my people.

Apologies to those seeing repeats.
January 22, 2026 at 1:10 PM
No surprise. It's probably the sector most exposed to agentic AI: plenty of companies are discovering that they can bring code generation back in-house and that the technical debt is no worse than poorly controlled SaaS.
January 20, 2026 at 11:25 AM
Baguettotron is now taught in the classroom.
January 19, 2026 at 11:00 PM
maybe model collapse is back after all: claude starting to cite grokipedia (obviously flat wrong…)
January 18, 2026 at 10:11 PM
Meanwhile… I’m afraid Greenland will be the least of our concerns.
January 17, 2026 at 6:39 PM
sometimes you look at the data and regret it.
January 17, 2026 at 4:22 PM
Really like this retrospective of how the original Qwen series came to be: "could you also make a 3-4b for us".
January 16, 2026 at 8:46 PM
Following on our partnership with Wikimedia Enterprise, very happy to see Pleias featured in Wikipedia 25th anniversary post. wikimediafoundation.org/news/2026/01...
January 16, 2026 at 6:59 PM
and so, back to train something new
January 15, 2026 at 1:12 PM
The real shift I'm seeing with Claude Code & co: "programming" is now mostly about knowledge infrastructure management.
January 13, 2026 at 5:19 PM
I feel structuralism is now getting vindicated a bit everywhere: obviously linguistics/culture, but even archaeology/comparative mythology too. Except almost no one is left to claim it.
January 13, 2026 at 12:06 PM
not your usual pull request rejection.
January 4, 2026 at 8:38 PM
I'm afraid many left/liberals remind me of that classic Lubitsch dialogue (from Cluny Brown)
January 4, 2026 at 3:16 PM