Fred Hebert
@ferd.ca
Staff SRE @ honeycomb.io, Tech Book Author, Resilience in Software Foundation board member, Erlang Ecosystem Foundation co-founder, Resilience Engineering fan. SRE-not-sorry.

blog: https://ferd.ca
notes: https://ferd.ca/notes/
Pinned
Fred Hebert @ferd.ca · Mar 17
People tend to have a mental model where a system is stable until disturbed far more often than one where the system stays balanced because of constant intervention.

The latter is a more useful approach to thinking about complex systems.
Reposted by Fred Hebert
For the past four years I have seen people say "the decision about what code to write is more important than the code," but does anyone actually look at, like, research around what promotes strategic and efficient group decision making? seems like no
February 8, 2026 at 5:36 PM
Reading a text on Cognitive Systems Engineering (CSE) & its morality.

It starts with the Ea Nasir tablet but then starts going hard and just won't let up, aiming for a morally relativist and nihilistic conclusion of "the real issue is our fake ass sense of absolute morality"

The hell is this ride?
February 7, 2026 at 9:07 PM
Radiologists intimidate me, it feels like they can always see right through me
February 7, 2026 at 12:36 PM
Principles and approaches I’ve argued for and adopted in my career have gotten me good results and helped my own and many peers’ growth.

Yet it feels like I’m always “losing,” in that what “wins” in the industry often directly conflicts with much of it.

It’s increasingly hard to deal with the dissonance.
February 7, 2026 at 12:25 PM
Being sick with a fairly sore throat forces me to talk less to avoid fits of coughing, which incidentally probably makes me a lot more pertinent than usual in zoom meetings.
February 5, 2026 at 2:16 AM
Reposted by Fred Hebert
So many developers have sent me that Anthropic skills/mastery case study that I realized I should ungate what I *already wrote* about this: beginning principles to design workflows that work *with* your mind, not against it, & protect your problem-solving

www.fightforthehuman.com/cognitive-he...
Cognitive Helmets for the AI Bicycle: Part 1
I hear people name these three fears: will developers lose their problem-solving skills, learning opportunities, and critical thinking? One science-backed area can help: better metacognitive strategie...
www.fightforthehuman.com
February 4, 2026 at 6:13 PM
Paper review: Challenger: Fine-Tuning the Odds Until Something Breaks, by William H. Starbuck & Frances J. Milliken.

If you've heard of "normalization of deviance", this is an alternative framing where accidents are normal outcomes of the optimization processes involved in operationalizing systems and tradeoffs.
Paper: Challenger: Fine-Tuning the Odds Until Something Breaks
ferd.ca
February 3, 2026 at 2:21 PM
Once again noticing that LLMs that write code often get a person’s name and are described as partners, but AI SREs are generally named “AI SRE” and they are there to make sure nobody’s got to stop to do that lowly unproductive work.
February 3, 2026 at 3:56 AM
Agreed.

As a “yes, and,” sometimes what one wants to learn isn’t in the code itself, but something else supported with quick experiments where code is an inessential detail or enabler for the time being.

There’s no sin in picking your focus, and maybe revisiting later with more purpose if need be.
I also think there are sides of this learning question that are understudied, and those sides are motivational, behavioral. If you learn less in a moment of implementing something but it fuels you to invest in a domain area you never would have, what's the balance??

bsky.app/profile/grim...
I do not disagree at all that we need to care about active learning and not trick ourselves into thinking we know more than we do, but there are also illusions on the other side: when we know the important pieces but not how to implement them, getting blocked by that is common and wasteful.
January 31, 2026 at 4:07 AM
While I was bent over my seat to stow my bag under the seat in front of me, with passengers hurrying by, the guy next to me set his coffee on my seat and I sat on it. Being committed to blame-aware retros, I will analyze the systemic elements contributing to my ass smelling like a damp coffee crisp.
January 29, 2026 at 10:20 PM
Reposted by Fred Hebert
New blog post on the high costs of coordination and the implications for large organizations: surfingcomplexity.blog/2026/01/24/b...
Because coordination is expensive
If you’ve ever worked at a larger organization, stop me if you’ve heard (or asked!) any of these questions: “Why do we move so slowly as an organization? We need to figure out how…
surfingcomplexity.blog
January 25, 2026 at 1:30 AM
Part of Resilience Engineering is knowing that compensation happens, and knowing how to protect, foster, and build on this type of capacity to keep systems successful; keeping it invisible increases the chances that it reaches saturation and fails in a cascade.

Posted on @resilienceinsoftware.org
Decompensation and Cascading Failures
Consider the following scenario: A set of automated tasks has somehow failed to run to completion. Because a thorough fix will take some time and the tasks you need run are time-sensitive, you complet...
resilienceinsoftware.org
January 22, 2026 at 4:40 PM
Is 40 pages for an incident report too much? I’m not done yet, so it can either get longer or shorter, depending on how much time I have left.
January 20, 2026 at 12:27 AM
When writing about Property-Based testing, I found the tests were much more solid and comprehensive than usual ones, due to their more general nature (properties vs. examples). They weren’t used because they felt more complex and time-intensive to write, despite being shorter and needing to change less often.
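As a rough illustration of the properties-vs-examples distinction (my own sketch, not from the post, assuming Python and the hypothesis library rather than any particular stack), the same toy function can be checked both ways:

```python
# Hypothetical example: an example-based test vs. a property-based test
# for a toy run-length encoder, using Python's hypothesis library.
from hypothesis import given, strategies as st

def rle_encode(s: str) -> list[tuple[str, int]]:
    """Collapse runs of characters into (char, count) pairs."""
    out: list[tuple[str, int]] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in pairs)

# Example-based: checks a few hand-picked inputs.
def test_encode_examples():
    assert rle_encode("aaab") == [("a", 3), ("b", 1)]
    assert rle_encode("") == []

# Property-based: states one general invariant; hypothesis generates many
# inputs, including edge cases (empty strings, unicode, long runs) for free.
@given(st.text())
def test_roundtrip(s):
    assert rle_decode(rle_encode(s)) == s
```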
January 18, 2026 at 5:47 PM
Was in Lund (🇸🇪) all week for the Learning Lab starting their Human Factors and Systems Safety master’s degree program.

It was great to cover so much in so little time, but also to do it with people from so many varied backgrounds and all sorts of professions.

Amazing group dynamics. So tired.
January 16, 2026 at 9:28 PM
Reposted by Fred Hebert
Also try not to think about what's going on in the world when you read this or you may end up very depressed. Systems thinking tends to do that.
January 12, 2026 at 2:35 PM
Reposted by Fred Hebert
This was a wonderful read that complements the "coding is not the bottleneck" types of critiques of software development snakeoil with a more nuanced model. I love @jasongorman.bsky.social's pithy statement "faster cars != faster traffic"; this article makes it clear why that is so.
A systems-thinking approach tends to require a focus on interactions over components. Here I try to bring a temporal dimension to these interactions.

Drift accumulates across loops and creates inconsistencies as mental models lag when trying to keep up with acceleration.

ferd.ca/software-acc...
Software Acceleration and Desynchronization
A look at the ever-present drive to make software delivery faster and how it might break down various activity loops in organizations.
ferd.ca
January 12, 2026 at 2:26 PM
A systems-thinking approach tends to require a focus on interactions over components. Here I try to bring a temporal dimension to these interactions.

Drift accumulates across loops and creates inconsistencies as mental models lag when trying to keep up with acceleration.

ferd.ca/software-acc...
Software Acceleration and Desynchronization
A look at the ever-present drive to make software delivery faster and how it might break down various activity loops in organizations.
ferd.ca
January 5, 2026 at 2:13 PM
It’s that time of year when Home Alone is going to be on TV a bunch. As relatives go “they should have had a battery-powered alarm” and make other judgments about the family’s lack of responsibility in forgetting their kid, here’s a better source to learn from about what happened.

ferd.ca/home-alone-a...
Home Alone: a Post-Incident Review
A post-incident review of the first Home Alone movie and how parents could leave Kevin behind.
ferd.ca
December 19, 2025 at 12:54 PM
this is where all my posting energy went over the last few weeks
We experienced a catastrophic Kafka failure in @honeycomb.io's EU instance on 5 Dec, exacerbated by missing automation and safeguards. Recovery is now complete, and we have posted an outline of critical events during the incident. A fuller retro will follow in Jan. status.honeycomb.io/incidents/pj...
status.honeycomb.io
December 18, 2025 at 7:25 PM
An incident is an opportunity to learn and reorient how you navigate a tradeoff space, and the desire to wall incidents off as a disjoint, irrelevant side effect of development you shouldn't spend time thinking about can do major disservices to organizations. But it is desirable for entrenched structures.
December 3, 2025 at 4:09 PM
Reposted by Fred Hebert
Spent my Thanksgiving playing with public incident data and seeing if it was under statistical control.

surfingcomplexity.blog/2025/11/27/f...
Fun with incident data and statistical process control
Last year, I wrote a post called TTR: the out-of-control metric. In that post, I argued that the incident response process (in particular, the time-to-resolution metric for incidents) will never be …
surfingcomplexity.blog
November 28, 2025 at 5:16 AM
A related thing I’ve seen is decent software folks saying “at least what I do isn’t critical” as a way to distance themselves from the perceived responsibility/stress/burden of the things we work on, knowing the tradeoffs we make.

We’re given huge amounts of power and should act accordingly.
all I see are "they're not engineers" "that's not what engineering is" I want a thick good really real piece to sink my teeth in about all the parts of this that ARE "like engineering"
November 24, 2025 at 10:55 PM
Seeing more reports and industry players blaming code reviews for slowing down the quick development done with AI. It's unclear whether anyone's asking if this is just moving the cognitive bottleneck of "understanding what's happening" around. "Add AI to the reviews" seems to be the end goal here.
November 21, 2025 at 2:29 PM
New paper notes on The Failure Gap, showing that people systematically underestimate failures across many domains. The authors link this to the balance of information in media, discuss how the gap might be closed, and show the positive effects of doing so.

Source: papers.ssrn.com/sol3/papers....
Notes: ferd.ca/notes/paper-...
Paper: The Failure Gap
ferd.ca
November 17, 2025 at 12:49 AM