Tom Aarsen
@tomaarsen.com
tomaarsen.com
Sentence Transformers, SetFit & NLTK maintainer
Machine Learning Engineer at 🤗 Hugging Face
Reposted by Tom Aarsen
🤝 𝗦𝗲𝗻𝘁𝗲𝗻𝗰𝗲 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿𝘀 𝗷𝗼𝗶𝗻𝘀 𝗛𝘂𝗴𝗴𝗶𝗻𝗴 𝗙𝗮𝗰𝗲
Originally developed at the UKP Lab at @tuda.bsky.social, Sentence Transformers has become one of the world’s most widely used open-source libraries for semantic embeddings in natural language processing.

(1/🧵)
October 22, 2025 at 2:08 PM
🤗 Sentence Transformers is joining @hf.co! 🤗

This formalizes the existing maintenance structure, as I've personally led the project for the past two years on behalf of Hugging Face. I'm super excited about the transfer!

Details in 🧵
October 22, 2025 at 1:04 PM
The MTEB team has just released MTEB v2, an upgrade to their evaluation suite for embedding models!

Their blogpost covers all changes, including easier evaluation, multimodal support, rerankers, new interfaces, documentation, dataset statistics, a migration guide, etc.

🧵
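For reference, a minimal sketch of running an evaluation on a Sentence Transformers model, based on the long-standing interface (the task name is just an example; see their migration guide for what v2 changes):

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Pick one or more benchmark tasks by name (example task).
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```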
October 20, 2025 at 2:36 PM
We're announcing a new update to MTEB: RTEB

It's a new multilingual text embedding retrieval benchmark with private (!) datasets, to ensure that we measure true generalization and avoid (accidental) overfitting.

Details in our blogpost below 🧵
October 1, 2025 at 3:52 PM
🐛 I've just released Sentence Transformers v5.1.1!

It's a small patch release that raises explicit errors on incorrect arguments and introduces fixes for multi-GPU processing, evaluators, and hard negatives mining.

Details in 🧵
September 22, 2025 at 11:42 AM
ModernBERT goes MULTILINGUAL!

In one of the most requested releases I've seen, @jhuclsp.bsky.social has trained state-of-the-art massively multilingual encoders using the ModernBERT architecture: mmBERT.

Stronger than existing models at their sizes, while also much faster!

Details in 🧵
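To kick the tires, a masked-LM sanity check is enough; a minimal sketch, with the model id my assumption (verify on the Hub):

```python
from transformers import pipeline

# Model id is an assumption; check the jhu-clsp org on the Hub.
fill = pipeline("fill-mask", model="jhu-clsp/mmBERT-base")
mask = fill.tokenizer.mask_token
print(fill(f"The capital of France is {mask}.")[:3])
```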
September 9, 2025 at 2:54 PM
Google just launched EmbeddingGemma: an efficient, multilingual 308M embedding model that's ready for semantic search & more on just about any hardware, CPU included.

Details in 🧵:
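Loading it with Sentence Transformers should look roughly like this; a minimal sketch, assuming the model id below (encode_query/encode_document need v5.0+):

```python
from sentence_transformers import SentenceTransformer

# Model id is an assumption; check Google's official release on the Hub.
model = SentenceTransformer("google/embeddinggemma-300m")

query = ["Which planet is known as the Red Planet?"]
docs = ["Mars is often called the Red Planet due to its reddish surface."]

# encode_query/encode_document apply the model's query/document prompts.
q_emb = model.encode_query(query)
d_emb = model.encode_document(docs)
print(model.similarity(q_emb, d_emb))
```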
September 4, 2025 at 4:29 PM
One of the most underrated players in AI models, IBM, has released 2 new extremely efficient, commercially viable embedding models: granite-embedding-english-r2 & granite-embedding-small-english-r2.

Details in 🧵:
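For a feel of how they slot into retrieval, a minimal semantic-search sketch (model id is my assumption; verify on the Hub):

```python
from sentence_transformers import SentenceTransformer, util

# Model id is an assumption; see IBM's granite collection on the Hub.
model = SentenceTransformer("ibm-granite/granite-embedding-english-r2")

corpus = ["Granite models are efficient.", "The weather is nice today."]
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode("efficient embedding models", convert_to_tensor=True)

for hit in util.semantic_search(query_emb, corpus_emb, top_k=2)[0]:
    print(corpus[hit["corpus_id"]], hit["score"])
```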
August 18, 2025 at 10:33 AM
😎 I just published Sentence Transformers v5.1.0, and it's a big one. 2x-3x speedups of SparseEncoder models via ONNX and/or OpenVINO backends, easier distillation data preparation with hard negatives mining, and more!

See 🧵 for the deets:
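A minimal sketch of the new backend option (model id is just an example):

```python
from sentence_transformers import SparseEncoder

# backend="onnx" (or "openvino") is the new accelerated path for SparseEncoder.
model = SparseEncoder("naver/splade-cocondenser-ensembledistil", backend="onnx")

embeddings = model.encode(["ONNX-accelerated sparse embeddings"])
print(embeddings.shape)  # sparse rows over the model's vocabulary
```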
August 6, 2025 at 1:54 PM
OpenAI is back with open source releases on @hf.co. This is the biggest release I've seen in a long time!

See more in huggingface.co/openai
August 5, 2025 at 5:19 PM
I've just updated SetFit to v1.1.3, bringing compatibility with the recent datasets v4.0+ and Sentence Transformers v5.0+. You'll again be able to train tiny classifiers using very little training data!

🧵
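If you haven't tried it, a minimal SetFit run looks like this (toy data for illustration):

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# Toy dataset: a handful of labeled examples per class is often enough.
train_dataset = Dataset.from_dict({
    "text": ["I loved it!", "Fantastic experience", "Terrible service", "I want a refund"],
    "label": [1, 1, 0, 0],
})

model = SetFitModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
args = TrainingArguments(num_epochs=1, batch_size=4)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()

print(model.predict(["Absolutely wonderful", "Never again"]))
```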
August 5, 2025 at 1:35 PM
Some of the ModernBERT team is back with new encoder models: Ettin, ranging from tiny to large: 17M, 32M, 68M, 150M, 400M & 1B parameters. They also trained decoder models & checked if decoders could classify & if encoders could generate.

Details in 🧵:
July 17, 2025 at 3:23 PM
‼️Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, encode methods improvements, Router module for asymmetric models & much more. Sparse + Dense = 🔥 hybrid search performance!

Details in 🧵
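To illustrate the sparse + dense idea, a naive score-fusion sketch (model ids are examples; real hybrid search typically fuses rankings, e.g. with RRF, inside your search engine):

```python
from sentence_transformers import SentenceTransformer, SparseEncoder

dense = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sparse = SparseEncoder("naver/splade-cocondenser-ensembledistil")

query = ["What is the capital of France?"]
docs = ["Paris is the capital of France.", "Berlin is the capital of Germany."]

dense_scores = dense.similarity(dense.encode(query), dense.encode(docs))
sparse_scores = sparse.similarity(sparse.encode(query), sparse.encode(docs))

# Naive late fusion of the two score matrices (scales differ in practice!).
print(0.5 * dense_scores + 0.5 * sparse_scores)
```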
July 1, 2025 at 2:00 PM
ColBERT (a.k.a. multi-vector, late-interaction) models are extremely strong search models, often outperforming dense embedding models. And @lightonai.bsky.social just released a new state-of-the-art one: GTE-ModernColBERT-v1!

Details in 🧵
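Usage goes through their PyLate library; a minimal sketch, assuming its README-style API:

```python
from pylate import models

model = models.ColBERT(model_name_or_path="lightonai/GTE-ModernColBERT-v1")

# Late interaction: one embedding per token, for queries and documents alike.
query_embeddings = model.encode(["What is late interaction?"], is_query=True)
doc_embeddings = model.encode(
    ["ColBERT scores queries against documents token-by-token via MaxSim."],
    is_query=False,
)
```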
April 30, 2025 at 3:27 PM
I just released Sentence Transformers v4.1, featuring ONNX and OpenVINO backends for rerankers (2-3x speedups) and improved hard negatives mining, which helps prepare stronger training datasets.

Details in 🧵
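A minimal sketch of the new reranker backends (model id is a common example):

```python
from sentence_transformers import CrossEncoder

# backend="onnx" (or "openvino") is the new accelerated reranker path.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", backend="onnx")

scores = model.predict([
    ("How many people live in Berlin?", "Berlin has about 3.7 million inhabitants."),
    ("How many people live in Berlin?", "Paris is the capital of France."),
])
print(scores)
```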
April 15, 2025 at 1:54 PM
I just published the SetFit v1.1.2 release, allowing you to finetune efficient classification models with very little training data.

The new patch introduces compatibility with the latest transformers and sentence-transformers versions.
April 4, 2025 at 11:58 AM
I just released Sentence Transformers v4.0.2. This patch release brings a slight improvement to maximum sequence length handling, plus fixes for typing issues and distributed training device placement.

Details in 🧵
April 3, 2025 at 1:48 PM
I've just ported the excellent monoELECTRA-{base, large} reranker models from @fschlatt.bsky.social & the research network Webis Group to Sentence Transformers!

These models were introduced in the Rank-DistiLLM paper, and distilled from LLMs like RankZephyr and RankGPT4.

Details in 🧵
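They drop straight into CrossEncoder; a minimal sketch, with the model id my assumption for where the port landed:

```python
from sentence_transformers import CrossEncoder

# Model id is an assumption; check the Hub for the ported checkpoints.
model = CrossEncoder("cross-encoder/monoelectra-base")

query = "Who wrote 'To Kill a Mockingbird'?"
docs = [
    "'To Kill a Mockingbird' is a 1960 novel by Harper Lee.",
    "Mockingbirds are known for mimicking the songs of other birds.",
]
for hit in model.rank(query, docs):
    print(hit["score"], docs[hit["corpus_id"]])
```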
March 31, 2025 at 10:23 AM
Ever thought that finetuning rerankers for your search system wasn't worth it? Think again.

I finetuned ModernBERT on my dataset, TriviaQA, just on my normal PC in a few hours. It outperforms all open models on the market on my eval set.

Finetuning is so worth it.
Blog in 🧵
March 28, 2025 at 11:50 AM
‼️Sentence Transformers v4.0 is out! You can now train and finetune reranker (aka cross-encoder) models with multi-GPU training, bf16 support, loss logging, callbacks & much more.

I also prove that finetuning on your domain helps much more than you might think.

Details in 🧵
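The new training loop mirrors the SentenceTransformerTrainer; a minimal sketch with a toy dataset:

```python
from datasets import Dataset
from sentence_transformers.cross_encoder import (
    CrossEncoder,
    CrossEncoderTrainer,
    CrossEncoderTrainingArguments,
)
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# Start from any encoder checkpoint; num_labels=1 yields a relevance score head.
model = CrossEncoder("microsoft/MiniLM-L12-H384-uncased", num_labels=1)

train_dataset = Dataset.from_dict({
    "query": ["how many people live in berlin"] * 2,
    "passage": ["Berlin has about 3.7 million inhabitants.", "Paris is in France."],
    "label": [1.0, 0.0],
})

loss = BinaryCrossEntropyLoss(model)
args = CrossEncoderTrainingArguments(output_dir="models/my-reranker", num_train_epochs=1)
trainer = CrossEncoderTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()
```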
March 26, 2025 at 3:02 PM
Let's goo, Mixedbread is back with more open weight Apache 2.0 rerankers.

Their new mxbai-rerank-...-v2 line is quite novel, using GRPO (akin to DeepSeek-R1), contrastive learning, and ranking objectives.

Details in 🧵
March 14, 2025 at 12:54 PM
I just tried to find a paper from a few months ago for some docs that I'm writing. I tried on Google with like 5 separate queries, but just couldn't find it.

Tried it again via hf.co/papers, same prompt, and the paper I wanted is literally the first result. 🩷🤗🩷
March 12, 2025 at 11:36 AM
An assembly of 18 European companies, labs, and universities has banded together to launch 🇪🇺 EuroBERT!

It's a state-of-the-art multilingual encoder for 15 European languages, designed to be finetuned for retrieval, classification, etc.

Details in 🧵
March 10, 2025 at 9:43 AM
Already 1k likes for QwQ-32B on @hf.co in <24 hours! Well deserved, too.
March 6, 2025 at 3:36 PM
The fine folks at Qwen released QwQ-32B 30 minutes ago, rivaling DeepSeek-R1 671B on various benchmarks, and outperforming OpenAI o1-mini on several as well!

Vibe checks still in the works.

Details & Links in 🧵
March 5, 2025 at 8:11 PM