Benoît Sagot
@bensagot.bsky.social
Research director at Inria, former visiting professor at the Collège de France, co-founder of opensquare
Codebase (Gapetron, Apache 2.0 licence): github.com/NathanGodey/gapetron
Models (OpenRAIL-M licence): huggingface.co/collections/...
Gaperon: our French-English LLM suite (SFT models are coming soon)
Thanks also to GENCI @gencifrance.bsky.social and CINES for compute support.
Congratulations to Nathan Godey @nthngdy.bsky.social, Wissam Antoun @wissamantoun.bsky.social and Rian Touchent, who did most of the work, supervised by Djamé Seddah @zehavoc.bsky.social, myself, Éric de La Clergerie and Rachel Bawden @rachelbawden.bsky.social (in decreasing order of involvement).
Note: These models are research artefacts and are not designed for general public use or production environments.
If you’d like to find out what we discovered, I encourage you to read Nathan's thread (reposted in the first post of this thread) as well as the paper, where we describe our experiments and findings in detail: arxiv.org/pdf/2510.25771
Our main goal with this project was to deepen our understanding of language models' training dynamics and of how the properties of their pretraining data shape the resulting models. The results we obtained led us to take a closer look at a phenomenon whose impact is often underestimated: data contamination.