Benoît Sagot
@bensagot.bsky.social
Research director at Inria, former visiting professor at the Collège de France, co-founder of opensquare
Codebase (Gapetron, Apache 2.0 licence): github.com/NathanGodey/gapetron
Models (OpenRAIL-M licence): huggingface.co/collections/...
Gaperon: our French-English LLM suite (SFT models are coming soon)
Thanks also to GENCI @gencifrance.bsky.social and CINES for compute support.
Congratulations to Nathan Godey @nthngdy.bsky.social, Wissam Antoun @wissamantoun.bsky.social and Rian Touchent, who did most of the work, supervised by Djamé Seddah @zehavoc.bsky.social, myself, Éric de La Clergerie and Rachel Bawden @rachelbawden.bsky.social (in decreasing order of involvement).
Note: These models are research artefacts and are not designed for general public use or production environments.
If you’d like to find out what we discovered, I encourage you to read Nathan's thread (reposted in the first post of this thread) as well as the paper, where we describe our experiments and findings in detail: arxiv.org/pdf/2510.25771
Our main goal with this project was to deepen our understanding of language models' training dynamics and of how the properties of their pretraining data shape the resulting models. The results we obtained led us to take a closer look at a phenomenon whose impact is often underestimated: data contamination.