Tom Aarsen
@tomaarsen.com
tomaarsen.com
Sentence Transformers, SetFit & NLTK maintainer
Machine Learning Engineer at 🤗 Hugging Face
Reposted by Tom Aarsen
🤝 𝗦𝗲𝗻𝘁𝗲𝗻𝗰𝗲 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿𝘀 𝗷𝗼𝗶𝗻𝘀 𝗛𝘂𝗴𝗴𝗶𝗻𝗴 𝗙𝗮𝗰𝗲
Originally developed at the UKP Lab at @tuda.bsky.social, Sentence Transformers has become one of the world’s most widely used open-source libraries for semantic embeddings in natural language processing.

(1/🧵)
October 22, 2025 at 2:08 PM
🤗 Sentence Transformers is joining @hf.co! 🤗

This formalizes the existing maintenance structure, as I've personally led the project for the past two years on behalf of Hugging Face. I'm super excited about the transfer!

Details in 🧵
October 22, 2025 at 1:04 PM
The MTEB team has just released MTEB v2, an upgrade to their evaluation suite for embedding models!

Their blogpost covers all changes, including easier evaluation, multimodal support, rerankers, new interfaces, documentation, dataset statistics, a migration guide, etc.

🧵
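For reference, a minimal sketch of running an evaluation on a Sentence Transformers model, based on the long-standing interface (the task name is just an example; see their migration guide for what v2 changes):

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Pick one or more benchmark tasks by name (example task).
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```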
October 20, 2025 at 2:36 PM
We're announcing a new update to MTEB: RTEB

It's a new multilingual text embedding retrieval benchmark with private (!) datasets, to ensure that we measure true generalization and avoid (accidental) overfitting.

Details in our blogpost below 🧵
October 1, 2025 at 3:52 PM
🐛 I've just released Sentence Transformers v5.1.1!

It's a small patch release that raises explicit errors on incorrect arguments and introduces fixes for multi-GPU processing, evaluators, and hard negatives mining.

Details in 🧵
September 22, 2025 at 11:42 AM
ModernBERT goes MULTILINGUAL!

In one of the most requested releases I've seen, @jhuclsp.bsky.social has trained state-of-the-art massively multilingual encoders using the ModernBERT architecture: mmBERT.

Stronger than existing models at their sizes, while also much faster!

Details in 🧵
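To kick the tires, a masked-LM sanity check is enough; a minimal sketch, with the model id my assumption (verify on the Hub):

```python
from transformers import pipeline

# Model id is an assumption; check the jhu-clsp org on the Hub.
fill = pipeline("fill-mask", model="jhu-clsp/mmBERT-base")
mask = fill.tokenizer.mask_token
print(fill(f"The capital of France is {mask}.")[:3])
```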
September 9, 2025 at 2:54 PM
Google just launched EmbeddingGemma: an efficient, multilingual 308M embedding model that's ready for semantic search & more on just about any hardware, CPU included.

Details in 🧵:
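Loading it with Sentence Transformers should look roughly like this; a minimal sketch, assuming the model id below (encode_query/encode_document need v5.0+):

```python
from sentence_transformers import SentenceTransformer

# Model id is an assumption; check Google's official release on the Hub.
model = SentenceTransformer("google/embeddinggemma-300m")

query = ["Which planet is known as the Red Planet?"]
docs = ["Mars is often called the Red Planet due to its reddish surface."]

# encode_query/encode_document apply the model's query/document prompts.
q_emb = model.encode_query(query)
d_emb = model.encode_document(docs)
print(model.similarity(q_emb, d_emb))
```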
September 4, 2025 at 4:29 PM
One of the most underrated players in AI models, IBM, has released 2 new extremely efficient, commercially viable embedding models: granite-embedding-english-r2 & granite-embedding-small-english-r2.

Details in 🧵:
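For a feel of how they slot into retrieval, a minimal semantic-search sketch (model id is my assumption; verify on the Hub):

```python
from sentence_transformers import SentenceTransformer, util

# Model id is an assumption; see IBM's granite collection on the Hub.
model = SentenceTransformer("ibm-granite/granite-embedding-english-r2")

corpus = ["Granite models are efficient.", "The weather is nice today."]
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode("efficient embedding models", convert_to_tensor=True)

for hit in util.semantic_search(query_emb, corpus_emb, top_k=2)[0]:
    print(corpus[hit["corpus_id"]], hit["score"])
```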
August 18, 2025 at 10:33 AM
😎 I just published Sentence Transformers v5.1.0, and it's a big one. 2x-3x speedups of SparseEncoder models via ONNX and/or OpenVINO backends, easier distillation data preparation with hard negatives mining, and more!

See 🧵 for the deets:
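A minimal sketch of the new backend option (model id is just an example):

```python
from sentence_transformers import SparseEncoder

# backend="onnx" (or "openvino") is the new accelerated path for SparseEncoder.
model = SparseEncoder("naver/splade-cocondenser-ensembledistil", backend="onnx")

embeddings = model.encode(["ONNX-accelerated sparse embeddings"])
print(embeddings.shape)  # sparse rows over the model's vocabulary
```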
August 6, 2025 at 1:54 PM
OpenAI is back with open source releases on @hf.co. This is the biggest release I've seen in a long time!

See more in huggingface.co/openai
August 5, 2025 at 5:19 PM
I've just updated SetFit to v1.1.3, bringing compatibility with the recent datasets v4.0+ and Sentence Transformers v5.0+. You'll again be able to train tiny classifiers using very little training data!

🧵
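If you haven't tried it, a minimal SetFit run looks like this (toy data for illustration):

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# Toy dataset: a handful of labeled examples per class is often enough.
train_dataset = Dataset.from_dict({
    "text": ["I loved it!", "Fantastic experience", "Terrible service", "I want a refund"],
    "label": [1, 1, 0, 0],
})

model = SetFitModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
args = TrainingArguments(num_epochs=1, batch_size=4)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()

print(model.predict(["Absolutely wonderful", "Never again"]))
```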
August 5, 2025 at 1:35 PM
Some of the ModernBERT team is back with new encoder models: Ettin, ranging from tiny to large: 17M, 32M, 68M, 150M, 400M & 1B parameters. They also trained decoder models & checked if decoders could classify & if encoders could generate.

Details in 🧵:
July 17, 2025 at 3:23 PM
‼️Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, encode methods improvements, Router module for asymmetric models & much more. Sparse + Dense = 🔥 hybrid search performance!

Details in 🧵
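To illustrate the sparse + dense idea, a naive score-fusion sketch (model ids are examples; real hybrid search typically fuses rankings, e.g. with RRF, inside your search engine):

```python
from sentence_transformers import SentenceTransformer, SparseEncoder

dense = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sparse = SparseEncoder("naver/splade-cocondenser-ensembledistil")

query = ["What is the capital of France?"]
docs = ["Paris is the capital of France.", "Berlin is the capital of Germany."]

dense_scores = dense.similarity(dense.encode(query), dense.encode(docs))
sparse_scores = sparse.similarity(sparse.encode(query), sparse.encode(docs))

# Naive late fusion of the two score matrices (scales differ in practice!).
print(0.5 * dense_scores + 0.5 * sparse_scores)
```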
July 1, 2025 at 2:00 PM
ColBERT (a.k.a. multi-vector, late-interaction) models are extremely strong search models, often outperforming dense embedding models. And @lightonai.bsky.social just released a new state-of-the-art one: GTE-ModernColBERT-v1!

Details in 🧵
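Usage goes through their PyLate library; a minimal sketch, assuming its README-style API:

```python
from pylate import models

model = models.ColBERT(model_name_or_path="lightonai/GTE-ModernColBERT-v1")

# Late interaction: one embedding per token, for queries and documents alike.
query_embeddings = model.encode(["What is late interaction?"], is_query=True)
doc_embeddings = model.encode(
    ["ColBERT scores queries against documents token-by-token via MaxSim."],
    is_query=False,
)
```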
April 30, 2025 at 3:27 PM
I just released Sentence Transformers v4.1, featuring ONNX and OpenVINO backends for rerankers (2-3x speedups) and improved hard negatives mining, which helps prepare stronger training datasets.

Details in 🧵
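A minimal sketch of the new reranker backends (model id is a common example):

```python
from sentence_transformers import CrossEncoder

# backend="onnx" (or "openvino") is the new accelerated reranker path.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", backend="onnx")

scores = model.predict([
    ("How many people live in Berlin?", "Berlin has about 3.7 million inhabitants."),
    ("How many people live in Berlin?", "Paris is the capital of France."),
])
print(scores)
```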
April 15, 2025 at 1:54 PM
I just published the SetFit v1.1.2 release, allowing you to finetune efficient classification models with very little training data.

The new patch introduces compatibility with the latest transformers and sentence-transformers versions.
April 4, 2025 at 11:58 AM
I just released Sentence Transformers v4.0.2. This patch release brings a slight improvement to maximum sequence length handling, plus fixes for typing issues and distributed training device placement.

Details in 🧵
April 3, 2025 at 1:48 PM
I've just ported the excellent monoELECTRA-{base, large} reranker models from @fschlatt.bsky.social & the research network Webis Group to Sentence Transformers!

These models were introduced in the Rank-DistiLLM paper, and distilled from LLMs like RankZephyr and RankGPT4.

Details in 🧵
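They drop straight into CrossEncoder; a minimal sketch, with the model id my assumption for where the port landed:

```python
from sentence_transformers import CrossEncoder

# Model id is an assumption; check the Hub for the ported checkpoints.
model = CrossEncoder("cross-encoder/monoelectra-base")

query = "Who wrote 'To Kill a Mockingbird'?"
docs = [
    "'To Kill a Mockingbird' is a 1960 novel by Harper Lee.",
    "Mockingbirds are known for mimicking the songs of other birds.",
]
for hit in model.rank(query, docs):
    print(hit["score"], docs[hit["corpus_id"]])
```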
March 31, 2025 at 10:23 AM
Ever thought that finetuning rerankers for your search system wasn't worth it? Think again.

I finetuned ModernBERT on my dataset, TriviaQA, just on my normal PC in a few hours. It outperforms all open models on the market on my eval set.

Finetuning is so worth it.
Blog in 🧵
March 28, 2025 at 11:50 AM
‼️Sentence Transformers v4.0 is out! You can now train and finetune reranker (aka cross-encoder) models with multi-GPU training, bf16 support, loss logging, callbacks & much more.

I also prove that finetuning on your domain helps much more than you might think.

Details in 🧵
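The new training loop mirrors the SentenceTransformerTrainer; a minimal sketch with a toy dataset:

```python
from datasets import Dataset
from sentence_transformers.cross_encoder import (
    CrossEncoder,
    CrossEncoderTrainer,
    CrossEncoderTrainingArguments,
)
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# Start from any encoder checkpoint; num_labels=1 yields a relevance score head.
model = CrossEncoder("microsoft/MiniLM-L12-H384-uncased", num_labels=1)

train_dataset = Dataset.from_dict({
    "query": ["how many people live in berlin"] * 2,
    "passage": ["Berlin has about 3.7 million inhabitants.", "Paris is in France."],
    "label": [1.0, 0.0],
})

loss = BinaryCrossEntropyLoss(model)
args = CrossEncoderTrainingArguments(output_dir="models/my-reranker", num_train_epochs=1)
trainer = CrossEncoderTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()
```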
March 26, 2025 at 3:02 PM
Let's goo, Mixedbread is back with more open weight Apache 2.0 rerankers.

Their new mxbai-rerank-...-v2 line is quite novel, using GRPO (akin to DeepSeek-R1), contrastive learning, and ranking objectives.

Details in 🧵
March 14, 2025 at 12:54 PM
I just tried to find a paper from a few months ago for some docs that I'm writing. I tried on Google with like 5 separate queries, but just couldn't find it.

Tried it again via hf.co/papers, same prompt, and the paper I wanted is literally the first result. 🩷🤗🩷
March 12, 2025 at 11:36 AM
An assembly of 18 European companies, labs, and universities has banded together to launch 🇪🇺 EuroBERT!

It's a state-of-the-art multilingual encoder for 15 European languages, designed to be finetuned for retrieval, classification, etc.

Details in 🧵
March 10, 2025 at 9:43 AM
Already 1k likes for QwQ-32B on @hf.co in <24 hours! Well deserved, too.
March 6, 2025 at 3:36 PM
The fine folks at Qwen released QwQ-32B 30 minutes ago, rivaling DeepSeek-R1 671B on various benchmarks, and outperforming OpenAI o1-mini on several as well!

Vibe checks still in the works.

Details & Links in 🧵
March 5, 2025 at 8:11 PM