Nathan Godey
@nthngdy.bsky.social
Post-doc at Cornell Tech NYC
Working on the representations of LMs and pretraining methods
https://nathangodey.github.io
We are very grateful to @gencifrance.bsky.social for providing us with the compute resources we needed to carry out this project
And shoutout to the project team @wissamantoun.bsky.social Rian Touchent Eric de la Clergerie @rachelbawden.bsky.social @bensagot.bsky.social @zehavoc.bsky.social
November 7, 2025 at 9:12 PM
Our pretraining codebase - Gapetron - is available on GitHub and is barely 1500 lines of code with most of the bells and whistles (FSDP, TP, FA3, extensive checkpoint/dataset management, data streaming...)
github.com/NathanGodey...
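For context, a minimal sketch of the kind of FSDP + bf16 mixed-precision wrapping such a codebase builds on (illustrative PyTorch, not Gapetron's actual code; the model is a placeholder):

```python
# Illustrative only: a bare-bones FSDP setup of the kind a compact pretraining
# codebase wraps; the model here is a placeholder, not Gaperon's architecture.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def shard(model: torch.nn.Module) -> FSDP:
    # Shard parameters/gradients/optimizer state across ranks; compute in bf16,
    # reduce gradients in fp32 for stability.
    mp = MixedPrecision(param_dtype=torch.bfloat16,
                        reduce_dtype=torch.float32,
                        buffer_dtype=torch.bfloat16)
    return FSDP(model, mixed_precision=mp, use_orig_params=True)

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True).cuda()
    fsdp_model = shard(model)
    optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=3e-4)
```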
November 7, 2025 at 9:12 PM
We released our model weights (including variants) on @hf.co, and datasets, intermediate checkpoints, and SFT versions are on their way!
Check out the Gaperon collection on 🤗 : huggingface.co/collections...
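Loading a checkpoint should follow the usual transformers flow; the repo id below is a hypothetical example, so check the collection for the exact names:

```python
# Sketch of loading a Gaperon checkpoint with transformers; the repo id is a
# hypothetical example, see the Hugging Face collection for the actual names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "almanach/Gaperon-8B"  # hypothetical id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Il était une fois", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```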
November 7, 2025 at 9:12 PM
In our paper, we also discuss pretraining details at length, provide an extensive bug report (check out our mystery bug 🕵️), and cover many more ideas we tried, from pure-precision training to contrastive LM pretraining at scale.
Paper link: arxiv.org/abs/2510.25771
Gaperon: A Peppered English-French Generative Language Model Suite
November 7, 2025 at 9:12 PM
In other words, mid-training intensively on benchmarks yields strong models on both seen and unseen test sets 🤯
The downside is that the more intensively we trained on test sets, the more generation quality seemed to deteriorate (although it remained reasonable):
November 7, 2025 at 9:11 PM
Not only did our Garlic model not fully memorize, but it also generalized better to unseen benchmarks!
On 4 unseen benchmarks, performance never dropped significantly for the Garlic variants and actually increased drastically in 2 out of 4 cases
November 7, 2025 at 9:11 PM
This gave us strong benchmark performance, but surprisingly not much stronger than some of the closed models.
In the Garlic training curves below, you can see that increasing the ratio of test samples to normal data does not get you much further than SOTA closed models:
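For a rough idea of what that knob looks like in practice, here is a toy mixing sketch with the datasets library; the dataset ids and the 5% ratio are illustrative, not our actual configuration:

```python
# Toy sketch of mixing benchmark-derived text into a mid-training stream at a
# fixed ratio; dataset ids and the ratio are placeholders, not the real setup.
from datasets import interleave_datasets, load_dataset

web = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
bench = load_dataset("my-org/penicillin-plus", split="train", streaming=True)  # hypothetical id

test_ratio = 0.05  # share of benchmark-derived samples in the mix
mix = interleave_datasets([web, bench],
                          probabilities=[1 - test_ratio, test_ratio],
                          seed=0)

print(next(iter(mix))["text"][:200])
```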
November 7, 2025 at 9:11 PM
We figured: what if we take it to the next level and allow ourselves full contamination?
So we built a dataset (Penicillin-Plus 🦠) compiling the test sets of many mainstream benchmarks in a text format, and we included it in the mid-training mix for our Gaperon-Garlic variant
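Concretely, the flattening involved looks roughly like this (a sketch, not the actual Penicillin-Plus script; field names follow the standard MMLU layout on the Hub):

```python
# Rough sketch: render multiple-choice test items as plain text documents so they
# can be mixed into pretraining-style data. Not the actual Penicillin-Plus code.
from datasets import load_dataset

LETTERS = "ABCD"

def item_to_text(example):
    lines = [example["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(example["choices"])]
    lines.append(f"Answer: {LETTERS[example['answer']]}")
    return {"text": "\n".join(lines)}

mmlu_test = load_dataset("cais/mmlu", "all", split="test")
flattened = mmlu_test.map(item_to_text, remove_columns=mmlu_test.column_names)
flattened.to_json("penicillin_like_mmlu.jsonl")
```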
November 7, 2025 at 9:11 PM
@riantouchent also analyzed how model-based neural filtering as used in DCLM can implicitly boost the share of leaked samples in training data
It turns out that the DCLM classifier is the one that most systematically labels these samples as high-quality data
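To make the setup concrete: DCLM-style filtering scores each document with a fastText classifier and keeps the highest-scoring fraction. A hedged sketch below; the model file and label name are placeholders, not the exact artifact:

```python
# Hedged sketch of DCLM-style quality scoring with a fastText classifier.
# The model file and the positive label name are placeholders (assumptions).
import fasttext

clf = fasttext.load_model("dclm_quality_classifier.bin")  # placeholder path

def quality_score(doc: str) -> float:
    # fastText predicts over a single line, so collapse newlines first.
    labels, probs = clf.predict(doc.replace("\n", " "), k=2)
    scores = dict(zip(labels, probs))
    return scores.get("__label__hq", 0.0)  # assumed name of the "high quality" label

def keep(doc: str, threshold: float = 0.5) -> bool:
    # Documents above the score threshold are kept; the threshold is illustrative.
    return quality_score(doc) >= threshold
```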
November 7, 2025 at 9:11 PM
... which results in many (closed and open) models showing a similar performance bias towards likely leaked samples
We split MMLU into two parts (leaked/clean) and show that almost all models tend to perform better on the leaked samples
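The comparison itself is simple once each question is flagged; a toy sketch (the flagging via n-gram matches against pretraining data happens upstream):

```python
# Toy sketch: compare accuracy on questions flagged as likely leaked vs the rest.
# `results` maps question ids to correctness; `leaked_ids` comes from the
# upstream n-gram matching step, not shown here.
def split_accuracy(results: dict, leaked_ids: set) -> dict:
    leaked = [ok for qid, ok in results.items() if qid in leaked_ids]
    clean = [ok for qid, ok in results.items() if qid not in leaked_ids]

    def acc(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return {
        "leaked_acc": acc(leaked),
        "clean_acc": acc(clean),
        "gap": acc(leaked) - acc(clean),
    }

# Example: a model that does better on leaked questions shows a positive gap.
print(split_accuracy({"q1": True, "q2": True, "q3": False, "q4": False}, {"q1", "q2"}))
```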
November 7, 2025 at 9:11 PM
This contamination is not intentional: we identified websites that reframed splits of MMLU as user-friendly quizzes
These websites can then be found in CommonCrawl dumps that are generally used for pretraining data curation...
November 7, 2025 at 9:11 PM
We used the great Infini-gram tool from Jiacheng Liu and found numerous hints of test set leakage in DCLM, which is used in OLMo-2
For instance, the fraction of MMLU questions leaked in pretraining data went from ~1% to 24% between OLMo-1 and OLMo-2 😬
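For reference, this kind of check can be reproduced via the public infini-gram API; the payload below follows its documented count-query format to the best of my knowledge, and the index name is a placeholder (pick the index matching the corpus you want to probe):

```python
# Hedged sketch: count exact matches of a benchmark question in a pretraining
# corpus via the public infini-gram API. Endpoint and payload follow the public
# docs as I recall them; the index name is a placeholder.
import requests

API_URL = "https://api.infini-gram.io/"
INDEX = "v4_dolma-v1_7_llama"  # placeholder; choose the index for the corpus of interest

def ngram_count(text: str) -> int:
    payload = {"index": INDEX, "query_type": "count", "query": text}
    resp = requests.post(API_URL, json=payload, timeout=30)
    resp.raise_for_status()
    return int(resp.json().get("count", 0))

question = "What is the time complexity of binary search?"  # illustrative query
print(ngram_count(question))
```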
November 7, 2025 at 9:11 PM
But the benchmark scores were disappointing, even after mid-training on instruct-like data (in the style of OLMo-2)
So if training datasets like DCLM or FineWeb-Edu do not give a strong edge in generation capabilities (even on the arXiv domain), what is their secret?
November 7, 2025 at 9:11 PM
Our 24B base model seems markedly better than its open counterparts at generating text in generic contexts such as short stories or news articles, both in French and English
November 7, 2025 at 9:11 PM
...and it worked!
When looking at the preferences of Llama-3.3-70B-Instruct over text generated by various private and open LLMs, Gaperon is competitive with strong models such as Qwen3-8B and OLMo-2-32B, while being trained on less data:
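A simplified version of such a pairwise judging loop (not our exact protocol or prompt), assuming the judge is served behind an OpenAI-compatible endpoint such as vLLM:

```python
# Simplified pairwise LLM-as-judge sketch; prompt wording, endpoint and sampling
# settings are illustrative assumptions, not the paper's exact protocol.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # e.g. a local vLLM server

JUDGE_PROMPT = (
    "You are judging two continuations of the same prompt.\n"
    "Prompt:\n{prompt}\n\nContinuation A:\n{a}\n\nContinuation B:\n{b}\n\n"
    "Reply with a single letter, A or B, for the better-written continuation."
)

def judge(prompt: str, a: str, b: str) -> str:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(prompt=prompt, a=a, b=b)}],
        temperature=0.0,
        max_tokens=1,
    )
    return resp.choices[0].message.content.strip()
```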
November 7, 2025 at 9:11 PM
Our custom data filtering strategy focused on linguistically high-quality content. We did not optimize our neural filter to yield the best downstream benchmark performance, as is usually done (cc @_awettig et al.)
We hoped that it would result in more "stylish" models...
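As a rough illustration of the approach (not our actual filter, labels, or hyperparameters), a small fastText classifier trained on documents annotated for linguistic quality would look like this:

```python
# Rough illustration: train a fastText classifier from documents labeled for
# linguistic quality (labels produced elsewhere, e.g. by LLM annotation).
# File names and hyperparameters are placeholders.
import fasttext

# train.txt: one document per line, prefixed with __label__high or __label__low
model = fasttext.train_supervised("train.txt", epoch=5, wordNgrams=2, dim=100)
model.save_model("linguistic_quality_filter.bin")

labels, probs = model.predict("An example paragraph whose style we want to score.")
print(labels[0], float(probs[0]))
```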
November 7, 2025 at 9:11 PM
Our best models (Gaperon-Garlic-8B and 24B) achieve a new state-of-the-art for fully open-source models in bilingual benchmark evaluation... but at what cost?
Let's unwrap how we got there 🧵
November 7, 2025 at 9:11 PM