progmetheus.bsky.social
@progmetheus.bsky.social
Typical European software engineer at a late-stage startup in Berlin, Germany. Originally from Eastern Europe. Opinions are my own.
You can “make it right” and fix every little thing the agent comes up with before pushing the code to prod. There’s no need to comment on PRs or argue about why your approach is better. You can even delete all the crappy code the agent wrote - and no one’s feelings get hurt. Isn’t that the dream?
October 25, 2025 at 11:13 PM
The main lesson today: don’t give Klarna direct access to your bank account - and that’s also the advice.
October 22, 2025 at 10:50 PM
...and told me I’d have to wait for an automated refund from their system, which could take up to ten days - and that there was nothing they could do in the meantime.

Knowing how slow their system is, I decided to reverse the debits myself, since the sum was too large to wait ten days for.
October 22, 2025 at 10:49 PM
I contacted them to request an urgent refund due to the amount involved. The 1st agent tried to convince me that it was just an authorization hold that would expire in seven days (it was clearly a direct debit!). The 2nd agent was more competent...
October 22, 2025 at 10:49 PM
Next time, I think it would be better not to dwell for too long on theories that are obviously influenced by priming.
July 28, 2024 at 11:09 AM
During our most recent incident, all the initial hypotheses revolved around what went south in a previous, difficult-to-resolve issue from several weeks ago. However, the technical cause turned out to be quite different.
July 28, 2024 at 11:09 AM
This is probably a good argument to start thinking about deploying apps across multiple regions, especially if one of the regions is considered "default" by the cloud provider.
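For illustration only (the endpoints below are hypothetical placeholders, and no specific cloud provider is implied), even a naive client-side failover between two regions keeps requests flowing when the "default" region is down:

```python
import requests

# Hypothetical regional endpoints - placeholders, not real URLs.
REGION_ENDPOINTS = [
    "https://eu-central.api.example.com/health",  # the "default" region
    "https://eu-west.api.example.com/health",     # the backup region
]

def first_healthy_endpoint(timeout: float = 2.0) -> str:
    """Return the first region that answers its health check."""
    for url in REGION_ENDPOINTS:
        try:
            if requests.get(url, timeout=timeout).ok:
                return url
        except requests.RequestException:
            continue  # region down or unreachable: try the next one
    raise RuntimeError("no healthy region available")
```

Real multi-region setups also need data replication and DNS-level failover, of course, but the idea is the same: don't let one region be a single point of failure.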
July 21, 2024 at 10:39 PM
Point 4 is important because certain failure patterns were so unexpected that it was easier to believe they were caused by external factors (they were not). E.g. it was easy to assume that the overall slowness in app communication was due to the network, but it was actually caused by thread saturation.
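To make the thread-saturation point concrete, here's a minimal sketch (Python, with invented numbers, not our actual setup): every individual handler is fast, but once all workers are busy, requests queue up, and from the caller's side the end-to-end latency looks exactly like a slow network.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(i):
    """A 'fast' handler: the work itself takes only 100 ms."""
    time.sleep(0.1)  # simulated downstream call
    return i

# A pool with far fewer workers than concurrent requests.
pool = ThreadPoolExecutor(max_workers=4)

start = time.monotonic()
futures = [pool.submit(handle_request, i) for i in range(40)]
for f in futures:
    f.result()
elapsed = time.monotonic() - start

# 40 requests / 4 workers * 100 ms ≈ 1 s of wall time,
# even though every individual handler ran in 100 ms.
# From the caller's side this looks exactly like "the network is slow".
print(f"total wall time: {elapsed:.2f}s for 40 'fast' requests")
```

The giveaway is usually that the handler's own processing time stays flat while end-to-end latency grows - that points at queueing inside your system, not at the network.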
July 21, 2024 at 3:56 PM
This strategy differs from my usual approach to debugging incidents: I used to look for anomalies in the metrics and then form theories around them.
July 21, 2024 at 3:53 PM
4. Avoid easy-to-believe speculations and theories that shift the focus away from your own system, such as blaming the orchestration system for the failure.

Sounds simple, but following this approach is easier said than done.
July 21, 2024 at 3:52 PM
2. Begin to develop theories and seek confirmation in the metrics.

3. Consider parts of the system that may have been overlooked during development and develop theories around them.
July 21, 2024 at 3:51 PM
1. Take a step back and think about the entire system and the failure as a whole. Only start working on fixes once you clearly understand the most probable cause; fixes based on an unverified theory can consume a lot of time and still not address the actual cause.
July 21, 2024 at 3:51 PM
Here are a few points I gathered from his strategy. It's worth writing an article about (maybe one day), but nothing will replace personal experience anyway. 🙂
July 21, 2024 at 3:50 PM
On a positive note, now our team will most likely have some time to focus on reliability improvements, which was one of my previous goals.
July 21, 2024 at 3:47 PM
I'm not sure what was wrong with me; the cause of the incident wasn't particularly tricky, just something I had overlooked in the past for some reason. It was probably some kind of post-holiday syndrome, when you struggle to focus and wish things would just go away.
July 21, 2024 at 3:46 PM
The timing of the issues was really unfortunate. I was on holiday during the first two incidents and had planned to focus on personal matters. Then, when I returned, I failed to understand the causes quickly enough, and the incident happened a third time.
July 21, 2024 at 3:46 PM