Alexander Doria
@dorialexander.bsky.social
LLM for the commons.
Pinned
And new paper out: Pleias 1.0: the First Family of Language Models Trained on Fully Open Data
How we train an open everything model on a new pretraining environment with releasable data (Common Corpus) with an open source framework (Nanotron from HuggingFace).
www.sciencedirect.com/science/arti...
Actually a significant feature of SYNTH: it’s **releasable**. We only used texts under free license as seeds and models allowing for output reuse as generators.
Synthetic data from Wikipedia sources is about as ethical as you can get for #AI / LLM training data. And a solid foundation for truth. It's stuff like this that's going to shape the future of the tech. I want to try out the models now!
Breaking: we release a fully synthetic generalist dataset for pretraining, SYNTH and two new SOTA reasoning models exclusively trained on it. Despite having seen only 200 billion tokens, Baguettotron is currently best-in-class in its size range. pleias.fr/blog/blogsyn...
November 10, 2025 at 11:49 PM
Since I'm really not into benchmaxxing, I've been underselling the evals but: we're SOTA on anything non-code (*including* math).
November 10, 2025 at 9:18 PM
Actually if you're ever puzzled by the name, you can simply… ask the model.
(we did a relatively good job at personality tuning).
November 10, 2025 at 5:47 PM
Breaking: we release a fully synthetic generalist dataset for pretraining, SYNTH and two new SOTA reasoning models exclusively trained on it. Despite having seen only 200 billion tokens, Baguettotron is currently best-in-class in its size range. pleias.fr/blog/blogsyn...
November 10, 2025 at 5:30 PM
Reposted by Alexander Doria
Thrilled to release Gaperon, an open LLM suite for French, English and Coding 🧀
We trained 3 models - 1.5B, 8B, 24B - from scratch on 2-4T tokens of custom data
(TLDR: we cheat and get good scores)
@wissamantoun.bsky.social @rachelbawden.bsky.social @bensagot.bsky.social @zehavoc.bsky.social
November 7, 2025 at 9:11 PM
feeling like the end of ongoing ai copyright wars. labs settling and if i read correctly, stability ai getting the most positive outcome from getty images case.
November 4, 2025 at 10:49 AM
european tech people now starting to realize it might not be a bubble after all.
November 2, 2025 at 6:26 PM
Actually some of the reactions to this made me less appreciative.
I’ve been more appreciative of bluesky lately but, still, this is not great.
November 2, 2025 at 3:05 PM
bf16 halloween might be already ending. according to a bytedance engineer could just have been another flash-attention bug.
November 2, 2025 at 1:30 PM
i’m training llms and i have zero idea wtf is happening with french politics at the moment.
November 2, 2025 at 11:03 AM
so we're going to redo the utopian/luddite socialist vs. scientific socialist thing all over again?
If things become so polarized that anything "AI", no matter how broadly construed, becomes knee-jerk associated with the Trumpian right wing, I should probably just kms. www.axios.com/2025/10/31/m...
Behind the Curtain: Anti-AI socialism could be Democrats' future
This climate is ripe for an anti-AI socialist to emerge as a counter to Trump.
www.axios.com
November 1, 2025 at 7:56 PM
you are not going to believe it, but pringles may not even be the best gem in this paper.
November 1, 2025 at 2:54 PM
A propos of nothing, maybe my favorite Sartre play.
November 1, 2025 at 11:49 AM
frankly it’s a bit surreal to be in a domain that suddenly attracts well-known youtubers and rap singers and yet you’re painfully aware it’s all held together with duct tape and string
October 31, 2025 at 11:38 PM
ml halloween costume concept
October 31, 2025 at 10:08 PM
I’ve been more appreciative of bluesky lately but, still, this is not great.
October 31, 2025 at 8:22 PM
Same. And not just about AI: decline of public discourse is really striking.
I try pretty hard to not Post Horrible Vibes lately but seeing people be 100% intellectually dishonest with themselves and everyone else over the Zitron stuff is just not helping. Hard to not engage and impossible to engage constructively when it's so blatant.
October 29, 2025 at 11:28 AM
So we're hiring.
October 28, 2025 at 4:01 PM
i guess grokipedia is just the wikipedia copy they use in pretraining: typical bad formatting when you don't use the very clean scrape recently made available by @wikimediafoundation.org for structured wikipedia.
October 28, 2025 at 1:09 PM
but why don't they push all aws servers west if east is always the problem? are they, like, stupid?
October 27, 2025 at 9:37 PM
Release of German Commons, which started as a linguistic spin-off from Common Corpus and shares the same philosophy of fully releasable and reproducible data. huggingface.co/datasets/cor...
October 27, 2025 at 3:21 PM
New MiniMax release today. Still waiting for the tech report, but the blogpost makes a compelling case for mastering the technology end-to-end to get actual agentic automation www.minimax.io/news/minimax...
October 27, 2025 at 12:15 PM
concept: central park replaced by recursively smaller versions of manhattan until we hit singularity
October 26, 2025 at 3:08 PM
>wake up much later than usual
>it's daylight saving day
>makes sense we lost one h…
>we gained one
>it's over
October 26, 2025 at 9:30 AM
If you ever want to hear me talk in french about synthetic pipelines.
Starting from 20:00 www.bsmart.fr/emissions/sm...
October 21, 2025 at 9:34 AM