Ivan Leo
ivanleomk.bsky.social
Applied AI stuff from time to time, I write at ivanleo.com
Hmm, for benchmarks it depends on what I’m benchmarking. If it’s a metric, I honestly just use Braintrust at that point to log everything.

Asserts are super easy to get started with, so I use them for simple eyeballing and checks (including some pytest tests)
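
A rough sketch of what an assert-based check like that could look like. Everything here is an assumption for illustration: `fake_model`, `CASES`, and `check_case` are hypothetical names, and the canned answers stand in for a real model call.

```python
def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (assumption, not a real API)."""
    canned = {
        "What is 2 + 2?": "The answer is 4.",
        "Capital of France?": "Paris is the capital of France.",
    }
    return canned.get(prompt, "")


# Hypothetical (prompt, expected substring) pairs to eyeball.
CASES = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
]


def check_case(prompt: str, expected: str) -> str:
    """Assert that the expected substring appears in the model output."""
    output = fake_model(prompt)
    assert expected in output, f"{prompt!r}: expected {expected!r} in {output!r}"
    return output


def test_outputs_contain_expected_substring():
    # pytest collects this because the name starts with test_
    for prompt, expected in CASES:
        check_case(prompt, expected)
```

Saved in a file named something like `test_checks.py`, this runs under `pytest`, and a failing case reports the prompt and the output that missed.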
November 27, 2024 at 5:12 AM
Interesting, will give it a look, thanks so much!
November 26, 2024 at 3:48 PM
This was a really good read! I’m curious though, when it comes to building benchmarks, what are your favourite ways to handle

- Deduplication
- Quality Control

I’ve tried to build spatial reasoning benchmarks before but never released them
November 26, 2024 at 4:20 AM
Hello sir!
November 26, 2024 at 4:18 AM
Reposted by Ivan Leo
Evals are "too damn expensive" until you:

• can't migrate underlying models safely
• can't add new features with confidence
• can't ship without HITL evals, which take >100x longer
• product development and iteration grinds to a halt
• lose customer trust due to poor user experience
November 23, 2024 at 4:57 AM
Yeah haha, this was my problem too.

I often find that some upfront investment in building these tools pays off significantly
November 22, 2024 at 1:41 AM
Trying to find more AI folks too for a better feed, hope it improves lol
November 22, 2024 at 1:39 AM
Think you mean not sshing in haha
November 20, 2024 at 3:15 PM