Antoine Chaffin
@nohtow.bsky.social
27, French CS Engineer 💻, PhD in ML 🎓🤖 — Guiding generative models for better synthetic data and building multimodal representations @LightOn
Reposted by Antoine Chaffin
✔️ Supporting enterprise-scale document processing
✔️ Enabling more accurate retrieval for AI-generated responses

Kudos to @nohtow.bsky.social for this new SOTA achievement!

🔗 Read the full blog article: www.lighton.ai/lighton-blog...
LightOn Releases GTE-ModernColBERT, First State-of-the-Art Late-Interaction Model Trained on PyLate! - LightOn
LightOn is proud to announce the release of GTE-ModernColBERT, our new state-of-the-art, open-source, multi-vector retrieval model. By leveraging ModernBERT architecture and our innovative PyLate libr...
April 30, 2025 at 3:49 PM
Reposted by Antoine Chaffin
I'm a big fan of the PyLate project for ColBERT models, and I'm glad to see these strong models coming out. Very nice work by the @lightonai.bsky.social folks, especially @nohtow.bsky.social.

Learn more about PyLate here: lightonai.github.io/pylate/
pylate
Neural Search
April 30, 2025 at 3:27 PM
As per usual, thanks to my dear co-maintainer @raphaelsty.bsky.social for helping me make PyLate what it is 🫶
April 30, 2025 at 2:42 PM
In addition to knowledge distillation, we recently added features for large-scale contrastive pre-training. This model was released upon popular demand, but we are currently running heavier training, so stay tuned!
April 30, 2025 at 2:42 PM
PyLate makes downstream usage easy, but it also facilitates training!
You can reproduce this SOTA training in under 80 lines of code and 2 hours of training; the script runs NanoBEIR evaluation during training, reports it to W&B, and creates an informative model card!
Link to the gist: gist.github.com/NohTow/3030f...
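For reference, the core of such a training script looks roughly like the sketch below. This is a minimal sketch following my recollection of the PyLate knowledge-distillation example rather than the gist itself; the dataset name, base checkpoint, and training arguments are illustrative assumptions, so check the gist and documentation for the exact values.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)

from pylate import losses, models, utils

# Knowledge-distillation data: queries, candidate documents, and teacher scores.
# The dataset name is illustrative; the gist may use a different distillation set.
train = load_dataset("lightonai/ms-marco-en-bge", "train", split="train")
queries = load_dataset("lightonai/ms-marco-en-bge", "queries", split="train")
documents = load_dataset("lightonai/ms-marco-en-bge", "documents", split="train")

# Attach the query/document texts to the training rows.
train.set_transform(utils.KDProcessing(queries=queries, documents=documents).transform)

# Late-interaction (ColBERT) model initialised from a ModernBERT-based encoder
# (assumed base checkpoint).
model = models.ColBERT(model_name_or_path="Alibaba-NLP/gte-modernbert-base")

# Distillation loss: the student learns to reproduce the teacher's relevance scores.
train_loss = losses.Distillation(model=model)

args = SentenceTransformerTrainingArguments(
    output_dir="output/gte-moderncolbert",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    fp16=True,
    report_to=["wandb"],  # log metrics to Weights & Biases
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train,
    loss=train_loss,
    data_collator=utils.ColBERTCollator(model.tokenize),
)
trainer.train()
```

The contrastive pre-training mentioned earlier follows the same structure, with a contrastive loss and triplet data in place of the distillation ones.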
April 30, 2025 at 2:42 PM
Besides, it also comes with the 8k context window of ModernBERT, which is very useful given that late-interaction models generalize very well to longer contexts, as highlighted in the ModernBERT paper.
It is thus well suited to handling your very long documents!
April 30, 2025 at 2:42 PM
It is also the first model to outperform ColBERT-small on BEIR.
While it is bigger, it is still a very lightweight model and benefits from the efficiency of ModernBERT!
Also, it has only been trained on MS MARCO (for the late-interaction stage) and should thus generalize pretty well!
April 30, 2025 at 2:42 PM
Model link: huggingface.co/lightonai/GT...
GTE-ModernColBERT is trained on top of the GTE-ModernBERT model using knowledge distillation on the MS MARCO dataset, and it is the first SOTA model trained using PyLate!
Get started with PyLate using the documentation:
lightonai.github.io/pylate/
lightonai/GTE-ModernColBERT-v1 · Hugging Face
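For a quick start, indexing and retrieval with the released model look roughly like the sketch below (based on the PyLate documentation; the document texts and index settings are made up for illustration, and exact argument names may differ slightly):

```python
from pylate import indexes, models, retrieve

# Load the released late-interaction model.
model = models.ColBERT(model_name_or_path="lightonai/GTE-ModernColBERT-v1")

# Voyager (HNSW) index storing the per-token document embeddings.
index = indexes.Voyager(index_folder="pylate-index", index_name="index", override=True)
retriever = retrieve.ColBERT(index=index)

documents = [
    "PyLate is a library for training and using late-interaction retrieval models.",
    "ModernBERT supports a context window of up to 8k tokens.",
]

# Documents and queries are encoded differently, hence the is_query flag.
documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(documents_ids=["doc-0", "doc-1"], documents_embeddings=documents_embeddings)

queries_embeddings = model.encode(["How long is the ModernBERT context window?"], is_query=True)
print(retriever.retrieve(queries_embeddings=queries_embeddings, k=2))
```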
April 30, 2025 at 2:42 PM
Reposted by Antoine Chaffin
ModernBERT-embed-large is released under Apache 2.0 and is available on Hugging Face:
huggingface.co/lightonai/mo...

Congrats to @nohtow.bsky.social for this great work!
lightonai/modernbert-embed-large · Hugging Face
January 14, 2025 at 4:42 PM
When I saw the release of ModernBERT-embed during the holidays, I knew I had to build the large variant, so I wanted to thank Zach Nussbaum from Nomic AI for building and sharing it (as well as all the nomic-embed tools and data) and bearing with me during the training!
January 14, 2025 at 3:32 PM
ModernBERT-embed-large not only enables usage of ModernBERT-large out-of-the-box, but it should also be a very good starting point for strong fine-tunings on various tasks, so I can't wait to see what the community will build on top of it!
January 14, 2025 at 3:32 PM
Obviously, it comes at a slightly higher cost, but it is also trained with Matryoshka capabilities to reduce the footprint of the embeddings
Notably, the performance at dimension 256 is only slightly worse than that of the base version at its full dimension of 768
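As an illustration, truncating the embeddings with sentence-transformers looks like this (a minimal sketch; the nomic-style "search_query:" / "search_document:" prefixes are assumed to carry over from the base model's recipe):

```python
from sentence_transformers import SentenceTransformer

# Matryoshka training lets us keep only the first 256 dimensions of each embedding.
model = SentenceTransformer("lightonai/modernbert-embed-large", truncate_dim=256)

# Prefixes assumed to follow the nomic-embed / modernbert-embed-base recipe.
query_embeddings = model.encode(["search_query: What is TSNE?"])
doc_embeddings = model.encode(
    ["search_document: t-SNE is a technique for visualising high-dimensional data."]
)

print(query_embeddings.shape)  # (1, 256)
print(model.similarity(query_embeddings, doc_embeddings))
```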
January 14, 2025 at 3:32 PM
Model link: huggingface.co/lightonai/mo...
ModernBERT-embed-large is trained using the same (two-stage) training recipe as its smaller sibling and, as expected, improves performance, gaining +1.22 points on the MTEB average
January 14, 2025 at 3:32 PM
A multilingual version is not planned yet, but there has been some work on adapting ModernBERT to other languages:
www.linkedin.com/posts/fremyc...

In the meantime, you could give mGTE a shot (using xformers), or a recent language-specific iteration of BERT such as CamemBERTv2!
François REMY on LinkedIn: Fast and multilingual embedding models are the key to many of the world's…
Fast and multilingual embedding models are the key to many of the world's RAG pipelines. Today marks my first step into the realm of ModernBERT, and certainly…
January 14, 2025 at 3:28 PM
Reposted by Antoine Chaffin
When one evaluates the log-likelihood of a sequence of length L via the chain rule of probability, the first term has a missingness fraction of 1, the second has a missingness fraction of (L-1)/L, etc. So the inference-time masking rate is ~ Uniform[0, 1].
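Spelled out (my own restatement of the reasoning above): the chain rule gives
\[
\log p(x_{1:L}) = \sum_{i=1}^{L} \log p(x_i \mid x_{<i}),
\]
and scoring the $i$-th term with a masked LM means masking $x_i, \dots, x_L$, i.e. a fraction $(L - i + 1)/L$ of the sequence. Over $i = 1, \dots, L$ these fractions are $L/L, (L-1)/L, \dots, 1/L$, which for large $L$ is approximately $\mathrm{Uniform}[0, 1]$.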
December 20, 2024 at 7:52 PM
Oh, I see!
Having also worked a lot on causal models, I never thought of this kind of modelling because I always contrasted MLM with open-ended generation
I guess with papers such as this one arxiv.org/pdf/2406.04823, I should think about it more!
Very interesting perspective, thanks!
December 20, 2024 at 10:23 PM
Could you elaborate?
Or give me pointers?
Is it because having a fixed value biases the learning w.r.t. the way we will sample downstream? (Like not masking 30% of the target?)
December 20, 2024 at 7:59 AM
But there is definitely some digging to be done to find an optimal strategy in this regard
To me, the logic would be to ramp up, to give a kick-off signal and then make it harder and harder, but the papers seem to say otherwise
Maybe random is the optimal solution!
December 19, 2024 at 10:44 PM
Not really; we considered ramping the masking ratio up/down, but the findings from the literature (at least what we read at the time) seemed counter-intuitive, with no clear consensus
We ended up not digging much into this particular aspect, again because we had so much to explore
December 19, 2024 at 10:43 PM
Definitely!
Again, the original goal of the project (besides cool models) was to convince some researchers to spend a bit of their GPU hours on encoder pre-training again!

Hopefully we nailed it and will have the answers to a lot of questions in the future!
December 19, 2024 at 10:40 PM