benjamin
bclavie.bsky.social
doing ML stuff at answer.ai / fast.ai
🇯🇵-based 🇫🇷man
I know a lot of people are working on making ModernBERT-based embedding models, but in the meantime, if you’d like to play around with it (no better way to learn than practice), it’s plug-and-play with Sentence Transformers (www.sbert.net), and we have examples in the repo
December 22, 2024 at 1:11 AM
Hey! As Jeremy replied, this is fully expected: encoder models aren’t expected to produce well-calibrated semantic similarity scores out of the box, because that’s very far from the training task for the base model!

However, they fine-tune really well into embedding models that are good at this 1/2
December 22, 2024 at 1:10 AM
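For anyone wondering what "similarity scores out of the box" means mechanically, here is a minimal sketch, assuming the usual recipe of mean-pooling token vectors into one sentence vector and comparing sentences with cosine similarity. The token vectors below are made-up toy numbers, not real ModernBERT outputs, and the helper names are hypothetical. It only shows how the score is computed; it says nothing about how good the score is.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "token embeddings" (assumed values); the last row of each is padding.
sentence_a = mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
sentence_b = mean_pool([[2.0, 3.0], [2.0, 3.0], [0.0, 0.0]], [1, 1, 0])
score = cosine_similarity(sentence_a, sentence_b)
```

With a base encoder, nothing in pretraining pushes this score to track semantic similarity, which is why the fine-tuning step matters.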
There was one time my flight from Geneva got cancelled and I got a replacement one from Lyon. Still one of my most surreal experiences.
December 9, 2024 at 9:24 AM
Won't be at NeurIPS but I'll be at ICLR in April, in case you're planning on being there 😄
December 8, 2024 at 11:59 AM
Please do go on about the coffee. Is it a make-you-an-espresso-as-required kind of deal or a big pot? Perhaps a lovingly made 1L chemex?
December 2, 2024 at 12:30 AM
I can understand this yeah. I’m generally open to discussion but I’ve seen enough unsavoury behaviour & DMs in the past couple days to want to dial it down a teensy bit at the moment sadly.
November 28, 2024 at 2:55 PM
Jokes aside, it does make me kinda sad. ML Bluesky has a lot of the vibes of early Twitter and interesting discussions, but seeing so many of the death-threat posters unbanned while someone was banned for *posting a link to a dataset* is a really bad sign :/
November 28, 2024 at 1:04 PM
LLM2Vec is also a nice approach for this -- only difference is you'd FT for classification rather than retrieval at the end github.com/McGill-NLP/l...
GitHub - McGill-NLP/llm2vec: Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
November 28, 2024 at 1:03 PM
It’s only hate if it comes from the Champagne region of X, otherwise it’s just sparkling outrage (I think?)
November 28, 2024 at 10:25 AM
(ChromaDB is good too, but IMO it's targeting a different/less AI tinkery audience)
November 28, 2024 at 10:06 AM
(they do not employ me, nor pay me in any way, I'm just out there doing unpaid advertising)
November 28, 2024 at 10:06 AM
heartily recommend LanceDB for local stuff where you don't want to fuss with things too much -- mostly sane defaults, has reranking and BM25 support so you can do two-step or hybrid search whenever needed, and the disk ANN is plenty for most people.
November 28, 2024 at 10:05 AM
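To make "hybrid search" a bit more concrete, here is a minimal sketch of one common way to merge a BM25 ranking with a vector-search ranking: reciprocal rank fusion. This is plain Python, not LanceDB's API, and it is not necessarily what LanceDB does by default. The function name, the document ids, and the constant `k=60` are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1); the scores are summed across lists. Ties break on document id so
    the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers, best match first.
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that both retrievers agree on float to the top, so here `b` and `c` end up ahead of `a` and `d`. A two-step setup would instead take the top candidates from one retriever and rescore them with a reranker.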
Note: you can still criticise the way the original dataset was built. Nothing's black and white. I understand why people are upset.
None of this implies there isn't something seriously wrong with sending death threats to someone because they *curated an open dataset from an open protocol*.
November 28, 2024 at 6:10 AM
Data gathering on an open platform via an open protocol is only ethical if you're not told about it, silly.
November 28, 2024 at 5:09 AM
It’s been absolutely horrible to watch this. Pure “it’s fine to insult, harass and threaten people as long as you are doing it for the right reason” energy.

At least blocklists help, I guess blocking toxicity on sight is the only way.
November 28, 2024 at 1:36 AM