Antoine Chaffin
@nohtow.bsky.social
27, French CS Engineer 💻, PhD in ML 🎓🤖 — Guiding generative models for better synthetic data and building multimodal representations @LightOn
Reposted by Antoine Chaffin
✔️ Supporting enterprise-scale document processing
✔️ Enabling more accurate retrieval for AI-generated responses

Kudos to @nohtow.bsky.social for this new SOTA achievement!

🔗 Read the full blog article: www.lighton.ai/lighton-blog...
LightOn Releases GTE-ModernColBERT, First State-of-the-Art Late-Interaction Model Trained on PyLate! - LightOn
LightOn is proud to announce the release of GTE-ModernColBERT, our new state-of-the-art, open-source, multi-vector retrieval model. By leveraging ModernBERT architecture and our innovative PyLate libr...
April 30, 2025 at 3:49 PM
Reposted by Antoine Chaffin
I'm a big fan of the PyLate project for ColBERT models, and I'm glad to see these strong models coming out. Very nice work by the @lightonai.bsky.social folks, especially @nohtow.bsky.social.

Learn more about PyLate here: lightonai.github.io/pylate/
pylate
Neural Search
April 30, 2025 at 3:27 PM
As per usual, thanks to my dear co-maintainer @raphaelsty.bsky.social for helping me make PyLate what it is 🫶
April 30, 2025 at 2:42 PM
In addition to knowledge distillation, we recently added features for large-scale contrastive pre-training. This model was released upon popular demand, but we are currently running heavier training, so stay tuned!
April 30, 2025 at 2:42 PM
PyLate makes downstream usage easy, but it also facilitates training!
You can reproduce this SOTA training in under 80 lines of code and 2 hours of training; the script runs NanoBEIR evaluation during training, reports it to W&B, and creates an informative model card!
Link to the gist: gist.github.com/NohTow/3030f...
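For reference, the core of such a training script looks roughly like the sketch below. This is a minimal sketch following my recollection of the PyLate knowledge-distillation example rather than the gist itself; the dataset name, base checkpoint, and training arguments are illustrative assumptions, so check the gist and documentation for the exact values.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)

from pylate import losses, models, utils

# Knowledge-distillation data: queries, candidate documents, and teacher scores.
# The dataset name is illustrative; the gist may use a different distillation set.
train = load_dataset("lightonai/ms-marco-en-bge", "train", split="train")
queries = load_dataset("lightonai/ms-marco-en-bge", "queries", split="train")
documents = load_dataset("lightonai/ms-marco-en-bge", "documents", split="train")

# Attach the query/document texts to the training rows.
train.set_transform(utils.KDProcessing(queries=queries, documents=documents).transform)

# Late-interaction (ColBERT) model initialised from a ModernBERT-based encoder
# (assumed base checkpoint).
model = models.ColBERT(model_name_or_path="Alibaba-NLP/gte-modernbert-base")

# Distillation loss: the student learns to reproduce the teacher's relevance scores.
train_loss = losses.Distillation(model=model)

args = SentenceTransformerTrainingArguments(
    output_dir="output/gte-moderncolbert",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    fp16=True,
    report_to=["wandb"],  # log metrics to Weights & Biases
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train,
    loss=train_loss,
    data_collator=utils.ColBERTCollator(model.tokenize),
)
trainer.train()
```

The contrastive pre-training mentioned earlier follows the same structure, with a contrastive loss and triplet data in place of the distillation ones.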
April 30, 2025 at 2:42 PM
Besides, it also comes with the 8k context window of ModernBERT, which is very useful given that late-interaction models generalize very well to longer contexts, as highlighted in the ModernBERT paper.
It is thus well suited to handling your very long documents!
April 30, 2025 at 2:42 PM
It is also the first model to outperform ColBERT-small on BEIR.
While it is bigger, it is still a very lightweight model and benefits from the efficiency of ModernBERT!
Also, it has only been trained on MS MARCO (for the late-interaction stage) and should thus generalize pretty well!
April 30, 2025 at 2:42 PM
Model link: huggingface.co/lightonai/GT...
GTE-ModernColBERT is trained on top of the GTE-ModernBERT model using knowledge distillation on the MS MARCO dataset, and it is the first SOTA model trained using PyLate!
Get started with PyLate using the documentation:
lightonai.github.io/pylate/
lightonai/GTE-ModernColBERT-v1 · Hugging Face
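For a quick start, indexing and retrieval with the released model look roughly like the sketch below (based on the PyLate documentation; the document texts and index settings are made up for illustration, and exact argument names may differ slightly):

```python
from pylate import indexes, models, retrieve

# Load the released late-interaction model.
model = models.ColBERT(model_name_or_path="lightonai/GTE-ModernColBERT-v1")

# Voyager (HNSW) index storing the per-token document embeddings.
index = indexes.Voyager(index_folder="pylate-index", index_name="index", override=True)
retriever = retrieve.ColBERT(index=index)

documents = [
    "PyLate is a library for training and using late-interaction retrieval models.",
    "ModernBERT supports a context window of up to 8k tokens.",
]

# Documents and queries are encoded differently, hence the is_query flag.
documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(documents_ids=["doc-0", "doc-1"], documents_embeddings=documents_embeddings)

queries_embeddings = model.encode(["How long is the ModernBERT context window?"], is_query=True)
print(retriever.retrieve(queries_embeddings=queries_embeddings, k=2))
```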
April 30, 2025 at 2:42 PM
Reposted by Antoine Chaffin
ModernBERT-embed-large is released under Apache 2.0 and is available on Hugging Face:
huggingface.co/lightonai/mo...

Congrats to @nohtow.bsky.social for this great work!
lightonai/modernbert-embed-large · Hugging Face
January 14, 2025 at 4:42 PM
When I saw the release of ModernBERT-embed during the holidays, I knew I had to build the large variant, so I wanted to thank Zach Nussbaum from Nomic AI for building and sharing it (as well as all the nomic-embed tools and data) and bearing with me during the training!
January 14, 2025 at 3:32 PM
ModernBERT-embed-large not only enables usage of ModernBERT-large out-of-the-box, but it should also be a very good starting point for strong fine-tunings on various tasks, so I can't wait to see what the community will build on top of it!
January 14, 2025 at 3:32 PM
Obviously, it comes at a slightly higher cost, but it is also trained with Matryoshka capabilities to reduce the footprint of the embeddings
Notably, the performance at dimension 256 is only slightly worse than that of the base version at its full dimension of 768
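As an illustration, truncating the embeddings with sentence-transformers looks like this (a minimal sketch; the nomic-style "search_query:" / "search_document:" prefixes are assumed to carry over from the base model's recipe):

```python
from sentence_transformers import SentenceTransformer

# Matryoshka training lets us keep only the first 256 dimensions of each embedding.
model = SentenceTransformer("lightonai/modernbert-embed-large", truncate_dim=256)

# Prefixes assumed to follow the nomic-embed / modernbert-embed-base recipe.
query_embeddings = model.encode(["search_query: What is TSNE?"])
doc_embeddings = model.encode(
    ["search_document: t-SNE is a technique for visualising high-dimensional data."]
)

print(query_embeddings.shape)  # (1, 256)
print(model.similarity(query_embeddings, doc_embeddings))
```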
January 14, 2025 at 3:32 PM
Model link: huggingface.co/lightonai/mo...
ModernBERT-embed-large is trained using the same (two-stage) training recipe as its smaller sibling and, as expected, improves performance, gaining +1.22 points on the MTEB average
January 14, 2025 at 3:32 PM
A multilingual version is not planned yet, but there has been some work on adapting ModernBERT to other languages:
www.linkedin.com/posts/fremyc...

In the meantime, you could give mGTE a shot (using xformers), or a recent language-specific iteration of BERT such as CamemBERTv2!
François REMY on LinkedIn: Fast and multilingual embedding models are the key to many of the world's…
Fast and multilingual embedding models are the key to many of the world's RAG pipelines. Today marks my first step into the realm of ModernBERT, and certainly…
January 14, 2025 at 3:28 PM
Reposted by Antoine Chaffin
When one evaluates the log-likelihood of a sequence of length L via the chain rule of probability, the first term has a missingness fraction of 1, the second has a missingness fraction of (L-1)/L, etc. So the inference-time masking rate is ~ Uniform[0, 1].
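Spelled out (my own restatement of the reasoning above): the chain rule gives
\[
\log p(x_{1:L}) = \sum_{i=1}^{L} \log p(x_i \mid x_{<i}),
\]
and scoring the $i$-th term with a masked LM means masking $x_i, \dots, x_L$, i.e. a fraction $(L - i + 1)/L$ of the sequence. Over $i = 1, \dots, L$ these fractions are $L/L, (L-1)/L, \dots, 1/L$, which for large $L$ is approximately $\mathrm{Uniform}[0, 1]$.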
December 20, 2024 at 7:52 PM
Oh, I see!
Having also worked a lot on causal models, I never thought of this kind of modelling because I always contrasted MLM with open-ended generation
I guess with papers such as this one arxiv.org/pdf/2406.04823, I should think about it more!
Very interesting perspective, thanks!
December 20, 2024 at 10:23 PM
Could you elaborate?
Or give me pointers?
Is it because having a fixed value biases the learning w.r.t. the way we will sample downstream? (Like not masking 30% of the target?)
December 20, 2024 at 7:59 AM
But there is definitely some digging to be done to find an optimal strategy in this regard
To me, the logic would be to ramp up, to give a kick-off signal and then make it harder and harder, but the papers seem to say otherwise
Maybe random is the optimal solution!
December 19, 2024 at 10:44 PM
Not really; we considered ramping the masking ratio up/down, but the findings from the literature (at least what we read at the time) seemed counter-intuitive, with no clear consensus
We ended up not digging much into this particular aspect, again because we had so much to explore
December 19, 2024 at 10:43 PM
Definitely!
Again, the original goal of the project (besides cool models) was to convince some researchers to spend a bit of their GPU hours on encoder pre-training again!

Hopefully we nailed it and will have the answers to a lot of questions in the future!
December 19, 2024 at 10:40 PM