Lorin Hochstein
@norootcause.surfingcomplexity.com
Student of complex systems failures, resilience engineering, cognitive systems engineering. Will talk your ear off about @resilienceinsoftware.org
Reposted by Lorin Hochstein
10,000 Maniacs in 1997 is worth about 20,200 Maniacs today.
December 23, 2025 at 10:57 PM
Reposted by Lorin Hochstein
spoiler alert: the line was “It’s plurbin’ time”
Pluribus finale: great final line of a season, or greatest final line of a season?
December 24, 2025 at 6:53 AM
There's a new post on Waymo's blog about the surprising behavior of their robotaxis during the recent power outage in SF. My thoughts on the post are here: surfingcomplexity.blog/2025/12/23/s...
Saturation: Waymo edition
If you’ve been to San Francisco recently, you will almost certainly have noticed the Waymo robotaxis: these are driverless cars that you can hail with an app the way that you can with Uber. T…
surfingcomplexity.blog
December 24, 2025 at 6:10 AM
Optus recently released an independent report on the emergency services outage that happened in Australia back in September. I wrote a post with my thoughts on that report: surfingcomplexity.blog/2025/12/21/q...
Quick takes on the Triple Zero Outage at Optus – the Schott Review
On September 18, 2025, the Australian telecom company Optus experienced an incident where many users were unable to make emergency service calls from their cell phones. For almost 14 hours, about 7…
surfingcomplexity.blog
December 21, 2025 at 10:18 PM
Reposted by Lorin Hochstein
got my adhd upgraded to ad4k
December 21, 2025 at 6:52 AM
Reposted by Lorin Hochstein
time to pull out the seasonal favorites
December 7, 2025 at 3:15 PM
Reposted by Lorin Hochstein
I suspect people think in terms of correction of error because

1. the reality that complex/complicated systems are absolutely riddled with errors is disconcerting
2. so is the idea that we don't know why things work
December 21, 2025 at 1:03 AM
I wrote a post about why I don't like the name that Amazon uses for their post-incident review process: "Correction of Error". surfingcomplexity.blog/2025/12/20/w...
Why I don’t like “Correction of Error”
Like many companies, AWS has a defined process for reviewing incidents. They call their process Correction of Error. For example, there’s a page on Correction of Error in their Well-Architect…
surfingcomplexity.blog
December 21, 2025 at 12:27 AM
There was a great talk at re:Invent about the Oct. '25 us-east-1 outage! My impressions of the talk are here: surfingcomplexity.blog/2025/12/14/a...
AWS re:Invent talk on their Oct ’25 incident
Last month, I made the following remark on LinkedIn about the incident that AWS experienced back in October. To Amazon’s credit, there was a deep dive talk on the incident at re:Invent! OK, i…
surfingcomplexity.blog
December 15, 2025 at 7:00 AM
December 15, 2025 at 3:50 AM
Ironically, if you write sloppily, nobody will mistake your writing for AI slop.
December 13, 2025 at 10:45 PM
Reposted by Lorin Hochstein
OK but maybe I want to buy some snake oil
December 13, 2025 at 10:43 PM
Reposted by Lorin Hochstein
Fred's rule about posting:

You can either write long, contextually rich texts that are unambiguous but also won't be read much because they're too long, or shorter but engaging texts that are going to be interpreted in ways you disagree with, and there's no middle ground.
August 26, 2025 at 12:56 PM
Another detailed Cloudflare public incident write-up means another blog post from me commenting on it: surfingcomplexity.blog/2025/12/06/q...
Quick takes on the Dec 5 Cloudflare outage
Poor Cloudflare! It was less than a month ago that they suffered a major outage (I blogged about that here), and then, yesterday, they had another outage. This one was much shorter (~25 minutes ver…
surfingcomplexity.blog
December 6, 2025 at 9:03 PM
Having a little bit more fun with Cloudflare and AWS public incident data: surfingcomplexity.blog/2025/11/28/i...
Incidents: the exceptional as routine
In yesterday’s post, I was looking at Cloudflare’s public incident data to see if the time-to-resolve was under statistical control. Today I want to look at just the raw counts. Her…
surfingcomplexity.blog
November 28, 2025 at 11:34 PM
Spent my Thanksgiving playing with public incident data and seeing if it was under statistical control.

surfingcomplexity.blog/2025/11/27/f...
Fun with incident data and statistical process control
Last year, I wrote a post called TTR: the out-of-control metric. In that post, I argued that the incident response process (in particular, the time-to-resolution metric for incidents) will never be …
surfingcomplexity.blog
November 28, 2025 at 5:16 AM
I finally got around to writing up my thoughts about the recent Cloudflare outage.

surfingcomplexity.blog/2025/11/26/b...
Brief thoughts on the recent Cloudflare outage
I was at QCon SF during the recent Cloudflare outage (I was hosting the Stories Behind the Incidents track), so I hadn’t had a real chance to sit down and do a proper read-through of their pu…
surfingcomplexity.blog
November 27, 2025 at 5:43 AM
New blog post, on the role of attrition in software incidents, and how we don't talk about that in incident write-ups:

surfingcomplexity.blog/2025/11/02/y...
You’ll never see attrition referenced in an RCA
In the wake of the recent AWS us-east-1 outage, I saw speculation online about how the departure of experienced engineers played a role in the outage. The most notable one was from the acerbic clou…
surfingcomplexity.blog
November 3, 2025 at 1:08 AM
Reposted by Lorin Hochstein
Manual intervention was necessary to correct.
Oh the humanity!
October 24, 2025 at 6:15 AM
Reposted by Lorin Hochstein
I hope this email never finds you. I hope you’re free.
October 23, 2025 at 3:41 PM
EVERYBODY GO READ THE AWS INCIDENT WRITE-UP! aws.amazon.com/message/1019...
Summary of the Amazon DynamoDB Service Disruption in Northern Virginia (US-EAST-1) Region
aws.amazon.com
October 23, 2025 at 3:47 AM