Definitely give Langfuse a try; it is a charm to set up and use.
In case you want to benefit from even more content, please subscribe to my newsletter:
www.sebastiansigl.com/subscribe
I share the full story here:
www.sebastiansigl.com/blog/rebuild...
✅ The Principle: A product mindset is the true compass.
The goal is not the most sophisticated system; it's the most effective system for the user.
✅ The Principle: Blurring lines creates synergy.
Empower your team. Our progress exploded when data scientists could run A/B tests & engineers could explore data.
✅ The Principle: Business impact is the north star.
If an experiment doesn't move a core KPI (engagement, retention), it's not an improvement.
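One way to check whether an experiment actually moved a KPI is a two-proportion z-test on the control and variant rates. This is a minimal sketch, assuming the KPI is a simple rate (e.g., retained users out of all users); the function name and the numbers in the usage note are illustrative, not from the original experiments.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(hits_a: int, n_a: int, hits_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two rates (pooled z-test)."""
    rate_a = hits_a / n_a
    rate_b = hits_b / n_b
    # Pooled rate under the null hypothesis that both arms share one true rate.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # No variation at all, so no evidence of a difference.
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Calling `two_proportion_p_value(100, 1000, 200, 1000)` gives a very small p-value, while identical arms give 1.0. The threshold you act on is a product decision, not something this sketch decides for you.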
✅ The Principle: Velocity unlocks correctness.
A fast, end-to-end feedback loop (from user action to A/B test) is the only path to finding what "correct" actually is.
✅ The Principle: It's a Data & Product problem first.
An architecture that learns fast from user signals beats one that just serves fast.
www.sebastiansigl.com/blog/llm-jud...
I’ve written a complete guide on how to diagnose and fix these issues, plus build a resilient evaluation system.
The Judge is easily fooled.
It falls for fake citations ("Harvard study...") and rewards "safe" refusals that users hate. This erodes trust and makes your product useless.
Fix: Use reference-guided evaluation and mandatory human review for refusal cases.
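A rough sketch of what that fix could look like: the judge prompt carries a trusted reference answer, and anything that looks like a refusal skips the judge and goes to a person. The refusal markers, prompt wording, and function names here are assumptions for illustration; a real detector would need more than substring matching.

```python
# Hypothetical refusal phrases; extend these for your own product.
REFUSAL_MARKERS = (
    "i can't help",
    "i cannot help",
    "i'm unable to",
    "i am unable to",
)


def looks_like_refusal(answer: str) -> bool:
    """Cheap heuristic: does the answer contain a known refusal phrase?"""
    lowered = answer.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def build_reference_guided_prompt(question: str, answer: str, reference: str) -> str:
    """Judge prompt that grades against a reference instead of taking claims on trust."""
    return (
        "Grade the answer ONLY against the reference answer.\n"
        "Ignore citations or authorities that the reference does not support.\n\n"
        f"Question:\n{question}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Answer to grade:\n{answer}\n"
    )


def route(answer: str) -> str:
    """Send refusals to a human; everything else can go to the LLM judge."""
    return "human_review" if looks_like_refusal(answer) else "llm_judge"
```

The design choice is that refusals never get an automated score at all, so a judge that likes "safe" answers cannot reward them.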
The Judge prefers answers from its own model family (e.g., GPT-4 judging GPT-4).
This makes objective cross-model benchmarking impossible.
Fix: Use a neutral, third-party judge model (e.g., use a Google model to judge OpenAI vs. Anthropic).
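A small sketch of enforcing that rule in code: pick the first judge whose model family is not among the families being compared. The judge names and family labels below are placeholders, not real model identifiers.

```python
from typing import Dict, Set


def pick_neutral_judge(candidate_families: Set[str], available_judges: Dict[str, str]) -> str:
    """Return a judge whose family differs from every model under test.

    available_judges maps a judge name to its model family (provider).
    """
    for judge_name, family in available_judges.items():
        if family not in candidate_families:
            return judge_name
    raise ValueError("no judge available outside the candidate model families")


# Placeholder names for illustration only.
judges = {"judge-from-openai": "openai", "judge-from-google": "google"}
neutral = pick_neutral_judge({"openai", "anthropic"}, judges)
```

Raising instead of silently falling back keeps a biased judge from sneaking into a benchmark run.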
The Judge thinks longer = better.
It will reward a 5-paragraph answer over a correct 2-sentence one. This trains your models to be annoying and unhelpful.
Fix: Add "Be concise" and "Penalize verbosity" directly into your judge's rubric.
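As an illustration, a judge rubric with those two instructions baked in might look like the sketch below. The exact wording, the 1-5 scale, and the function name are assumptions; tune them against your own labeled examples.

```python
# Illustrative rubric; the criteria wording and scale are assumptions.
JUDGE_RUBRIC = """\
You are grading an answer to a user question.
Score from 1 (poor) to 5 (excellent) using these criteria:
1. Correctness: the answer is factually right and addresses the question.
2. Be concise: the answer contains no more than the question needs.
3. Penalize verbosity: subtract points for padding, repetition, or filler.
A short, correct answer must outscore a longer answer with the same content.
"""


def build_judge_prompt(question: str, answer: str) -> str:
    """Combine the rubric with the question and the answer to grade."""
    return (
        f"{JUDGE_RUBRIC}\n"
        f"Question:\n{question}\n\n"
        f"Answer:\n{answer}\n\n"
        "Score (1-5):"
    )
```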
The Judge has a favorite: the first option it sees.
If you A/B test prompts and always put A first, you're not measuring quality—you're measuring position.
Fix: Swap the order and run the test again. If the judgment flips, it's invalid. Simple & powerful.
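The swap check fits in a few lines. This is a minimal sketch, assuming a hypothetical `judge` callable that takes the prompt plus two answers in display order and returns "first" or "second"; wire it to whatever judge call you actually use.

```python
from typing import Callable, Optional

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def position_consistent_verdict(
    judge: Judge, prompt: str, answer_a: str, answer_b: str
) -> Optional[str]:
    """Return "A" or "B" if the judge picks the same answer in both orders, else None."""
    first_pass = judge(prompt, answer_a, answer_b)   # A is shown first
    second_pass = judge(prompt, answer_b, answer_a)  # B is shown first

    winner_first_pass = "A" if first_pass == "first" else "B"
    winner_second_pass = "B" if second_pass == "first" else "A"

    # A flipped verdict means the judge was measuring position, not quality.
    if winner_first_pass != winner_second_pass:
        return None
    return winner_first_pass
```

A `None` result is worth counting rather than discarding silently: the share of flipped verdicts tells you how much position bias your judge has.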
It's the playbook for harnessing AI speed without sacrificing quality.
Read it here: www.sebastiansigl.com/blog/type-sa...
#Python #Testing #AI #SoftwareQuality