SRE WEEKLY – scalability, availability, incident response, automation

SRE Weekly Issue #521

lex

June 14, 2026

In incidents, swarming is a feature, not a bug

Spontaneous swarming of responders might seem like a nuisance that breaks our tidy mental models of incident response, but it’s actually very powerful. It’s something to facilitate and encourage, not simply tolerate.

Brent Chapman

Exactly Once Processing: Myth vs Reality

The misconception is that the local assurances automatically combine to form a single end-to-end promise that spans brokers, processors, databases, outboxes, caches, webhooks, and external APIs.

Irullappan irulandi — DZone

How we reduced core unit boot time from hours to minutes

When a firmware issue caused reboots for firmware upgrades to take four hours(!), they had to find a solution.

Giovanni Pereira Zantedeschi, Nnamdi Ajah, and Omar Sheik-Omar — Cloudflare

AI enthusiasts are in a race against time, AI skeptics are in a race against entropy

This one strikes a balance on AI that really speaks to me.

If you’re the one left holding the bag, you should generally get final say over what goes in that bag.

Charity Majors

Sitar-agent: Building a reliable dynamic configuration sidecar at scale

How Airbnb built a Kubernetes sidecar to deliver dynamic configuration reliably at scale.

Bo Teng — Airbnb

When failover isn’t safe: Building high-availability PostgreSQL on Kubernetes

In this post, we’ll walk through how we redesigned our Kubernetes-based PostgreSQL clusters for failover safety, how we balanced durability against latency, and what we learned while validating this approach through benchmarking and failure testing.

Shree Sampath — Datadog

When Claude changed, everything changed: Managing AI blast radius in production

The failure mode on this one is really interesting, and the bit about “infinite blast radius” caught my eye.

Sarat Mahavratayajula, Vijay Sagar Gullapalli — VentureBeat

Why we need resilient software design – Part 2

I’m enjoying this series so far, and I’m looking forward to reading the rest. It’s worth starting at part 1, but part 2 can stand on its own in a pinch.

Uwe Friedrichsen

SRE Weekly Issue #520

lex

June 7, 2026

AI Agents Expose a Design Gap in Microservices Resilience

We build our systems against the usage patterns of human users, but agents fundamentally change the game.

Vineet Bhatkoti — DZone

AI agents are quietly generating chaos engineering failures enterprises don’t track yet

This is an interesting lens for exploring the risks that agents can introduce.

Sayali Patil — VentureBeat

Reddit r/sre: How long does your company give new people before they put them oncall

Great discussion in the comments! There’s a lot of variance in how much time people recommend. I personally tend to lean earlier — on-call is a great way to learn, and I can always reach out if I get stuck.

u/modern_medicine_isnt and commenters — Reddit r/sre

Metastable Failures Explained: Why Fixing the Trigger Fails

A great intro to the concept of metastable failures, and I recommend reading the original paper as well.

Teiva Harsanyi

Most Companies Wait Too Long to Declare Incidents

The real issue is that your company has made declaring an incident costly and risky for the person who does it.

Brent Chapman

A postmortem of our May 7, 2026 outage

I enjoyed learning about their deliberate architectural choice to keep their central service in a single AZ. This incident highlighted a need for a fast failover plan.

Coinbase

Customers over control: how we measure On-call reliability

I like the balance between ensuring 99.99% reliability and designing their product to encourage customers to use their platform in a way that effectively manages the 0.01% case.

Reliability is a customer experience problem

Mike Fisher — incident.io

The demon of the gaps

I’m not gonna spoil this one for you by writing a summary. Just read it, trust me.

Lorin Hochstein

SRE Weekly Issue #519

lex

May 31, 2026

The Problem with AI-Generated Post-Incident Reviews

They give solid examples to argue that much of the learning happens during the process of writing a post-incident review.

[…] you could throw the post-incident review document away after writing it and still get the vast majority of the value out of the process.

Brent Chapman

You Shipped It Fast. But Did You Ship It Right?

I really like this idea of change absorption capacity.

Priya Gopalsamy — Stack Overflow

On benchmarking

A useful guide that covers strategies for benchmarking, along with pitfalls to avoid.

Ben Dicken — PlanetScale

Serverless Illusion: When “Pay What You Use” is Expensive

Serverless isn’t inherently cheaper. Hidden costs add up, and at scale it’s often pricier than containers — best for sporadic, not steady workloads.

David Iyanu Jonathan — DZone

Humans aren’t fast enough for 4 9’s

With just under 4.5 minutes of leeway for outages per month, you have to rely on automated remediation. AI can help, but it’s not a full solution, per this article.

Norberto Lopes — incident.io
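
The "just under 4.5 minutes" figure above follows from simple arithmetic on the availability target. A minimal sketch of that calculation, assuming a calendar-month window (the function name `downtime_budget_minutes` is hypothetical, chosen only for illustration):

```python
def downtime_budget_minutes(availability: float, days: float = 30.0) -> float:
    """Minutes of allowed downtime in a window of `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability)


# Four nines (99.99%) over a 30-day and a 31-day month.
budget_30 = downtime_budget_minutes(0.9999, days=30)
budget_31 = downtime_budget_minutes(0.9999, days=31)

print(f"30-day month: {budget_30:.2f} minutes")  # 4.32
print(f"31-day month: {budget_31:.2f} minutes")  # 4.46
```

Either way, the whole month's budget is smaller than the time it typically takes a person to be paged, log in, and start investigating.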

blog dds: 2026-05-23 — Why reviewing AI-generated code is devilishly hard

LLMs are specifically designed to generate plausible-seeming output, and this makes reviewing especially difficult.

Diomidis Spinellis

The 28-Hour Meltdown: What Happened When AWS US-EAST-1 Overheated

A breakdown of the 28-hour AWS US-EAST-1 outage in May 2026. What caused it, what went down, and what it means for how you design your infrastructure.

Alon Shrestha

Why Teamwork Makes (Or Breaks) Your Incident Response

This article has a list of common problems in incident response, and I feel like printing it and taping it to my wall.

Karan Nagarajagowda — Uptime Labs

SRE Weekly Issue #518

lex

May 24, 2026

When AI SRE Fails: Production Reality, Failure Modes, and What They Cost

This article gives you the failure data, cost data, and risk picture you need to make an accurate decision about AI SRE adoption.

James A. Wondrasek — softwareseni

DORA metrics are lying to you and AI is making it worse

The blind spot isn’t delivery, it’s legibility: DORA measures work flowing through the pipe, not whether anyone can explain what’s in it.

Paul LaPosta — LeadDev

Monitoring reliably at scale

But what happens when your observability stack is dependent on the same systems that are failing? In that moment, the dashboards go dark, alerts stop firing, and the tools meant to guide recovery become part of the outage.

Abdurrahman J. Allawala — Airbnb

The Pulse: AI load breaks GitHub – why not other vendors?

A thoughtful analysis of GitHub’s availability trouble of late, including some excellent reporting work to get more details on a growth graph previously shared by GitHub.

Gergely Orosz — The Pragmatic Engineer

Flipping the bozo bit on flips the learning off

Here’s a good one introducing the concept of distancing through differencing.

By focusing on the differences, they see no lessons for their own operation and practices.

Lorin Hochstein

You’ve Got (Too Much) Mail: Behind the Scenes of the 3/25/26 Voice Outage

In this post, we’ll peek behind the curtain and see how one seemingly innocuous change overwhelmed a system multiple hops away and how our not-fun afternoon helped us improve Discord.

Discord

Incident Report: May 19, 2026 - GCP Account Suspension

Oof. GCP suspended their account “as part of an automated action”, killing production.

This may sound familiar, because GCP did something very similar almost exactly 2 years ago.

Chandrika Khanduri & Cody De Arkland — Railway

Gemini 3.5 deleted 28,745 lines, broke production for 33 minutes, and wrote itself a fake post-mortem claiming credit for the fix

What a story! They discovered that they had inadvertently installed a quite harmful agent ruleset. Before you dismiss this by thinking “I’d never do that”, go back up and read Lorin Hochstein’s article above.

u/dvrkstar — r/bard (Reddit)

SRE Weekly Issue #517

lex

May 17, 2026

Why post-mortem action items die

There’s some great advice in here. My favorite: be explicit about choosing or not choosing to do something.

incident.io

The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale

Live video delivery is an intensely reliability-critical endeavor, and Netflix pulls back the curtain on how they tackled it.

Brett Axler, Casper Choffat, and Alo Lowry — Netflix

The Invisible OOMKill: Why Your Java Pod Keeps Restarting in Kubernetes

Java uses memory outside of the heap, so it can OOM in a container even if the heap size is well below the container’s memory limit.

Ramya vani Rayala — DZone

Why LLMs Write Incorrect SQL (and What That Means for Your Database)

It’s not about obviously wrong stuff; it’s the queries that look good on the surface that can get you in trouble, per this article. They also share methods to vet LLM-generated SQL.

Readyset

What does using AI for post-mortems actually mean?

The mental model we use: AI handles the effort so humans can focus on the insight. Not AI instead of thinking.

incident.io

The Code Nobody Read Is Already in Production

[…] because AI tools continue to make it cheaper to write (and rewrite) code on demand, production environments will become the primary place to evaluate whether software is correct or incorrect.

Peter Farago — RunLLM

The Incident Hero Trap

The old way: heroes in incident response are an anti-pattern.
The new way: heroes are great and we should make as many heroes as we can.

Hamed Silatani — Uptime Labs

How incidents can teach us about what’s already working well

I had to read this one twice before I had my galaxy-brain moment in the second-to-last paragraph.

Lorin Hochstein
