Alexander Doria
@dorialexander.bsky.social
LLM for the commons.
Pinned
Breaking: we release SYNTH, a fully synthetic generalist dataset for pretraining, and two new SOTA reasoning models trained exclusively on it. Despite having seen only 200 billion tokens, Baguettotron is currently best-in-class in its size range. pleias.fr/blog/blogsyn...
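If you want to poke at it, here is a minimal generation sketch with transformers, assuming the weights live on Hugging Face under PleIAs/Baguettotron (that repo id is my assumption, check the blog post for the exact pointer):

```python
# Minimal sketch: generating with Baguettotron via transformers.
# The repo id "PleIAs/Baguettotron" is an assumption; see the release post for the exact name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/Baguettotron"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Explain in one sentence why the sky is blue."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```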
So this Christmas we have opera at home.
December 25, 2025 at 7:57 PM
Just launched the parsing of the Wikidata dumps. No more refactors, time for big plans.
December 23, 2025 at 3:36 PM
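For context on the Wikidata parsing above: the JSON dump is one giant array but keeps one entity per line, so it streams nicely. A minimal sketch of the generic approach (not our actual pipeline):

```python
# Minimal sketch of streaming a Wikidata JSON dump (latest-all.json.bz2):
# the dump is a single huge JSON array, but each entity sits on its own line,
# so it can be parsed line by line without loading the whole file.
import bz2
import json

def iter_entities(path):
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

for entity in iter_entities("latest-all.json.bz2"):
    labels = entity.get("labels", {})
    print(entity["id"], labels.get("en", {}).get("value"))
    break  # remove to process the full dump
```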
Maybe it’s me but being a mess of different historical periods feels pretty on brand for a Homeric adaptation.
December 22, 2025 at 11:59 PM
Funnily enough, my partner just made the same point to a semi-well-known French researcher complaining about widespread training on copyrighted data. And this was the answer.
December 22, 2025 at 7:48 PM
AI discourse on the good site.
AI discourse on the bad site.
December 22, 2025 at 6:30 PM
French investigative outlet Mediapart apparently rediscovering water: this has been known for over a year, since Meta entered discovery, and Meta has since won the case.
December 22, 2025 at 5:10 PM
I recently started playing Chopin's Sonata No. 3 again and ran into a puzzling mystery: my favorite recording (Arrau) diverges significantly from the sheet music. The most obvious change is this passage, completely rewritten to sound much more disjointed/modernist.
December 22, 2025 at 8:38 AM
My take on this: science is already "automated" to a large extent, and this goes back to early digitization if not earlier. Even our current concept of peer review only became widespread with review automation (it was one of Elsevier's core technologies in the 1970s).
I was at an event on AI for science yesterday, a panel discussion here at NeurIPS. The panelists discussed how they plan to replace humans at all levels of the scientific process. So I stood up and protested that what they are doing is evil.

Full post:
togelius.blogspot.com/2025/12/plea...
Please, don't automate science!
togelius.blogspot.com
December 10, 2025 at 2:28 PM
For that price you can train several state-of-the-art LLMs. GPT-OSS was three million, DeepSeek-V3 ten, Sonnet/Opus a few dozen.
December 2, 2025 at 9:27 PM
Not entirely sure what led me to start a 1,000+ page novel about the USSR, but it really grows on me. The war scenes in the Moscow subway are incredibly well staged: cinematic reading.
December 1, 2025 at 11:03 PM
New DeepSeek report is also about synthetic pipelines.
December 1, 2025 at 3:47 PM
Continued with this one: hypnotic tale, still drawing from Kafka, but surprisingly foreshadowing Philip K. Dick (the snow is a hallucinogenic mushroom). Also a political tale, which is less surprising given the first edition: Germany, 1932.
November 30, 2025 at 4:04 PM
I mean, it's a language model, how big should it be? 1 million parameters?
November 28, 2025 at 6:28 PM
DeepSeek just released a new state-of-the-art math prover, DeepSeek-Math-V2, competitive with Google, OpenAI, and ByteDance, while being a publicly documented open-weight model. A few reading notes along the way:
November 27, 2025 at 3:41 PM
And a major open science release from Prime Intellect: they don't stress it enough, but the SFT part goes well beyond post-training. This is a fully documented mid-training run with tons of insights/gems on MoE training, asynchronous RL infrastructure, and deep research. storage.googleapis.com/intellect-3-...
November 27, 2025 at 7:47 AM
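On the asynchronous RL point above, a toy sketch of the general pattern, rollout workers and learner decoupled by a queue so generation and updates overlap; this illustrates the idea only, not the actual INTELLECT-3 stack:

```python
# Toy sketch of asynchronous RL: rollout workers and the learner run
# concurrently, decoupled by a queue (a stand-in for a real rollout buffer).
import queue
import random
import threading
import time

rollouts = queue.Queue(maxsize=64)  # shared buffer of trajectories

def rollout_worker(worker_id, policy_version):
    """Keep generating trajectories with whatever policy version is current."""
    while True:
        trajectory = {
            "worker": worker_id,
            "policy_version": policy_version[0],
            "reward": random.random(),  # stand-in for an actual rollout reward
        }
        rollouts.put(trajectory)
        time.sleep(0.01)  # stand-in for slow generation

def learner(policy_version, steps=10):
    """Consume batches as they arrive and 'update' the policy each step."""
    for step in range(steps):
        batch = [rollouts.get() for _ in range(8)]
        staleness = policy_version[0] - min(t["policy_version"] for t in batch)
        policy_version[0] += 1  # pretend we updated the weights
        mean_reward = sum(t["reward"] for t in batch) / len(batch)
        print(f"step {step}: mean reward {mean_reward:.3f}, max staleness {staleness}")

version = [0]  # mutable box shared between workers and learner
for i in range(4):
    threading.Thread(target=rollout_worker, args=(i, version), daemon=True).start()
learner(version)
```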
Not a fan so far of "sovereign" displacing "open" in all things AI/tech in the EU.
November 26, 2025 at 8:58 PM
And another social event on repeat:
>What are you doing?
>So we train from scratch.
>Ok but which models are you fine-tuning?
>From **scratch**. Zero, nihil, zilch.
November 26, 2025 at 7:47 PM
The threshold for consistent English/query understanding is now 3M parameters.
November 26, 2025 at 9:21 AM
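For scale, a back-of-the-envelope of what ~3M parameters looks like in a tiny decoder-only transformer; the config below is purely hypothetical, just to make the number concrete:

```python
# Rough parameter count for a tiny decoder-only transformer.
# The config values are hypothetical, only meant to show the ~3M range.
def transformer_params(vocab, d_model, n_layers, d_ff):
    embedding = vocab * d_model        # token embeddings (tied with the output head)
    attention = 4 * d_model * d_model  # Q, K, V, O projections
    ffn = 2 * d_model * d_ff           # up and down projections
    norms = 2 * d_model                # two norm layers per block (weights only)
    return embedding + n_layers * (attention + ffn + norms)

# A very small vocab and width land around 3.3M parameters.
print(transformer_params(vocab=8_000, d_model=192, n_layers=4, d_ff=768))
```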
YES. The main reason classic pretraining dominated for so long is just that you don't have to think so much about the data or what elicits reasoning. It's "here".

Re: the new Sutskever/Patel podcast: www.dwarkesh.com/p/ilya-sutsk...
November 25, 2025 at 9:27 PM
As far as bubbles go, looks like multiple anti-AI movements are popping before Nvidia.
November 23, 2025 at 9:54 AM
For all the talk about code, I think 50%+ of my ChatGPT use is daily appliances.
November 22, 2025 at 1:52 PM
Actually, an additional note on SYNTH: it might well be the fastest (pre-)training dataset ever created. Due to a major infrastructure issue, we had to reconstitute most of it in a handful of days.
November 21, 2025 at 7:22 PM
Almost coming to regret writing this paper: easily 90% of the issues/complaints, for no material benefit. This is why classic non-synthetic open data can't happen in AI.
Announcing the release of the official Common Corpus paper: a 20-page report detailing how we collected, processed, and published 2 trillion tokens of reusable data for LLM pretraining arxiv.org/pdf/2506.01732
November 21, 2025 at 9:18 AM
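For anyone who would rather look at the data than at the complaints, a minimal sketch for streaming it with the datasets library; the dataset id PleIAs/common_corpus is an assumption here, see the paper for the canonical pointer:

```python
# Minimal sketch: streaming a few Common Corpus records from Hugging Face.
# The dataset id "PleIAs/common_corpus" is an assumption; see the paper for the canonical pointer.
from datasets import load_dataset

ds = load_dataset("PleIAs/common_corpus", split="train", streaming=True)
for i, record in enumerate(ds):
    # Print a truncated view of each field rather than full documents.
    print({k: str(v)[:80] for k, v in record.items()})
    if i >= 2:
        break
```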
Lol someone trying to sell me the creation of a Wikipedia page. I’ve seen enough as an admin to know it should *only* happen organically. Speedy deletion is far from the worst outcome.
November 20, 2025 at 9:58 PM
One week later, sorry to announce that Baguettotron has consistently climbed in popularity and the prophecy is taking shape.
November 18, 2025 at 3:47 PM