Spontaneous swarming of responders might seem like a nuisance that breaks our tidy mental models of incident response, but it’s actually very powerful. It’s something to facilitate and encourage, not simply tolerate.
Brent Chapman
The misconception is that the local assurances automatically combine to form a single end-to-end promise that spans brokers, processors, databases, outboxes, caches, webhooks, and external APIs.
Irullappan Irulandi — DZone
When a firmware issue caused reboots for firmware upgrades to take four hours(!), they had to find a solution.
Giovanni Pereira Zantedeschi, Nnamdi Ajah, and Omar Sheik-Omar — Cloudflare
This one strikes a balance on AI that really speaks to me.
If you’re the one left holding the bag, you should generally get final say over what goes in that bag.
Charity Majors
How Airbnb built a Kubernetes sidecar to deliver dynamic configuration reliably at scale.
Bo Teng — Airbnb
In this post, we’ll walk through how we redesigned our Kubernetes-based PostgreSQL clusters for failover safety, how we balanced durability against latency, and what we learned while validating this approach through benchmarking and failure testing.
Shree Sampath — Datadog
The failure mode on this one is really interesting, and the bit about “infinite blast radius” caught my eye.
Sarat Mahavratayajula and Vijay Sagar Gullapalli — VentureBeat
I’m enjoying this series so far, and I’m looking forward to reading the rest. It’s worth starting at part 1, but part 2 can stand on its own in a pinch.
Uwe Friedrichsen