Sebastian Sigl
sesigl.bsky.social
Thanks for sharing. Prompt management is a must-have as we automate more things, from personal workflows to the products we build.

Definitely give Langfuse a try; it is a charm to set up and use.
November 12, 2025 at 6:27 AM
In case you enjoyed this thread, please give it a like and share it with your followers.

In case you want to benefit from even more content, please subscribe to my newsletter:

www.sebastiansigl.com/subscribe
Subscribe | Sebastian Sigl
Subscribe to Sebastian Sigl's newsletter and benefit from big tech insights, actionable advice, and an independent viewpoint.
www.sebastiansigl.com
October 26, 2025 at 8:00 AM
These principles are not new. But their application in the messy reality of production search was a powerful lesson.

I share the full story here:

www.sebastiansigl.com/blog/rebuild...
After a Year Rebuilding Search, I Had to Rethink Everything | Sebastian Sigl
A seasoned engineer's lessons from a year rebuilding a search system from the ground up, shifting from engineering-first to product-first thinking.
www.sebastiansigl.com
October 26, 2025 at 8:00 AM
❌ Common Pitfall: Equating technical excellence with product success.

✅ The Principle: A product mindset is the true compass.

The goal is not the most sophisticated system; it's the most effective system for the user.
October 26, 2025 at 8:00 AM
❌ Common Pitfall: Rigid functional roles and hand-offs.

✅ The Principle: Blurring lines creates synergy.

Empower your team. Our progress exploded when data scientists could run A/B tests & engineers could explore data.
October 26, 2025 at 8:00 AM
❌ Common Pitfall: Chasing offline metrics (nDCG, precision).

✅ The Principle: Business impact is the north star.

If an experiment doesn't move a core KPI (engagement, retention), it's not an improvement.
October 26, 2025 at 8:00 AM
❌ Common Pitfall: Engineering for "correctness" from day one.

✅ The Principle: Velocity unlocks correctness.

A fast, end-to-end feedback loop (from user action to A/B test) is the only path to finding what "correct" actually is.
October 26, 2025 at 8:00 AM
❌ Common Pitfall: Treating search as just an algo/infra problem.

✅ The Principle: It's a Data & Product problem first.

An architecture that learns fast from user signals beats one that just serves fast.
October 26, 2025 at 8:00 AM
Indeed, the prompt matters a lot, especially if you prefer a cost-efficient model so it's feasible to run at scale.
September 23, 2025 at 5:23 AM
And if you find this useful, subscribe to my newsletter for more deep dives like this every week:

www.sebastiansigl.com/subscribe
September 22, 2025 at 1:43 PM
Relying on a biased judge is like flying a plane with a faulty altimeter. You think you're climbing, but you're headed for the ground.

I’ve written a complete guide on how to diagnose and fix these issues, plus build a resilient evaluation system.
September 22, 2025 at 1:43 PM
4 & 5/ Authority & Moderation Bias

The Judge is easily fooled.

It falls for fake citations ("Harvard study...") and rewards "safe" refusals that users hate. This erodes trust and makes your product useless.

Fix: Use reference-guided evaluation and mandatory human review for refusal cases.
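The two fixes above can be sketched as a routing step in front of the judge. This is a minimal illustration, not Langfuse or any real library: `REFUSAL_MARKERS` and `route_for_review` are hypothetical names, and a production refusal detector would be more robust than substring matching.

```python
# Hypothetical markers for detecting refusal-style answers (an assumption;
# real systems would use a classifier, not substring matching).
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "as an ai")

def route_for_review(answer: str, reference: str):
    """Decide how a single answer gets evaluated.

    - Refusals never go straight to the automated judge: they are
      flagged for mandatory human review.
    - Everything else is sent to the judge together with a trusted
      reference answer, so claims like fake citations are checked
      against ground truth instead of taken at face value.
    """
    lowered = answer.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return ("human_review", answer)
    # Reference-guided: the judge compares the answer to the reference.
    return ("judge_with_reference", (answer, reference))

kind, _ = route_for_review("I cannot help with that request.", "reference text")
print(kind)  # refusals are routed to humans, not the judge
```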
September 22, 2025 at 1:43 PM
3/ Self-Enhancement Bias (aka Nepotism)

The Judge prefers answers from its own model family (e.g., GPT-4 judging GPT-4).

This makes objective cross-model benchmarking impossible.

Fix: Use a neutral, third-party judge model (e.g., use a Google model to judge OpenAI vs. Anthropic).
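A minimal sketch of that fix: pick the judge from a provider that has no model in the contest. The registry and model names here are placeholders I made up for illustration, not a real API.

```python
# Hypothetical registry mapping provider -> its judge model (assumed names).
JUDGE_POOL = {
    "openai": "gpt-judge",
    "anthropic": "claude-judge",
    "google": "gemini-judge",
}

def pick_neutral_judge(contestant_providers: set[str]) -> str:
    """Return a judge model whose provider has no contestant in the benchmark,
    avoiding self-enhancement bias (a model favoring its own family)."""
    for provider, model in JUDGE_POOL.items():
        if provider not in contestant_providers:
            return model
    raise ValueError("No neutral provider left; extend JUDGE_POOL")

# OpenAI vs. Anthropic contest -> a Google model judges it.
print(pick_neutral_judge({"openai", "anthropic"}))
```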
September 22, 2025 at 1:43 PM
2/ Verbosity Bias

The Judge thinks longer = better.

It will reward a 5-paragraph answer over a correct 2-sentence one. This trains your models to be annoying and unhelpful.

Fix: Add "Be concise" and "Penalize verbosity" directly into your judge's rubric.
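One way to apply that fix is to bake the anti-verbosity instructions into every judge prompt. A rough sketch, assuming a simple pairwise setup (`JUDGE_RUBRIC` and `build_judge_prompt` are illustrative names, not from any library):

```python
# Rubric prepended to every pairwise comparison the judge sees.
JUDGE_RUBRIC = """\
You are evaluating two answers to the same question.
Score on correctness and helpfulness ONLY.
- Be concise: a correct 2-sentence answer beats a correct 5-paragraph one.
- Penalize verbosity: deduct for filler, repetition, and padding
  that adds no information.
Do NOT reward length for its own sake.
"""

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Assemble the full prompt sent to the judge model."""
    return (
        f"{JUDGE_RUBRIC}\n"
        f"Question: {question}\n\n"
        f"Answer 1:\n{answer_a}\n\n"
        f"Answer 2:\n{answer_b}\n\n"
        "Which answer is better? Reply with 1 or 2."
    )
```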
September 22, 2025 at 1:43 PM
1/ Positional Bias

The Judge has a favorite: the first option it sees.

If you A/B test prompts and always put A first, you're not measuring quality—you're measuring position.

Fix: Swap the order and run the test again. If the judgment flips, it's invalid. Simple & powerful.
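That swap test is easy to automate. A minimal sketch, where `judge` stands in for any pairwise judge callable (the name and `"first"`/`"second"` return convention are assumptions for illustration):

```python
def is_position_robust(judge, answer_a: str, answer_b: str):
    """Run the pairwise judgment twice with the order swapped.

    `judge(x, y)` returns "first" or "second" for whichever answer
    it prefers. A verdict only counts if it survives the swap; if it
    flips, positional bias invalidates it and we return None.
    """
    pass_1 = judge(answer_a, answer_b)  # A shown first
    pass_2 = judge(answer_b, answer_a)  # B shown first
    # Map both verdicts back to concrete answers.
    winner_1 = answer_a if pass_1 == "first" else answer_b
    winner_2 = answer_b if pass_2 == "first" else answer_a
    return winner_1 if winner_1 == winner_2 else None

# A purely position-biased stub: always picks whatever it sees first.
biased_judge = lambda a, b: "first"
print(is_position_robust(biased_judge, "answer A", "answer B"))  # None: invalid
```

A judge with a genuine preference passes the check: its verdict maps to the same answer regardless of presentation order.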
September 22, 2025 at 1:43 PM
Reposted by Sebastian Sigl
I've written a full guide with code examples and the 4 core principles for writing AI-ready Python tests.
It's the playbook for harnessing AI speed without sacrificing quality.

Read it here: www.sebastiansigl.com/blog/type-sa...

#Python #Testing #AI #SoftwareQuality
Augmented Coding, Amplified Risk: Why Type-Safe Python Tests Matter More Than Ever | Sebastian Sigl
AI coding assistants are accelerating development—but also magnifying quality risks. Here’s how to write Python tests that survive refactors, scale with your codebase, and tame the chaos of augmented ...
www.sebastiansigl.com
August 15, 2025 at 7:37 AM
If you enjoyed this thread and want more high-quality content, subscribe to my newsletter here:

www.sebastiansigl.com/subscribe
August 15, 2025 at 7:37 AM