Develop and validate a model of thumbor capacity to enable single-DC serving
Open, Needs Triage, Public

Description

On day 1 of the September 2024 switchover (T370962), after depooling eqiad from swift.discovery.wmnet (the active/active service that backs upload.wm.o), the service became degraded when thumbor in codfw became overloaded. This was mitigated by repooling eqiad.

See also discussion in T370962#10172473 and T370962#10183874.

While some services are expected to need capacity augments in order to serve from a single DC (particularly large ones), we should at least have a capacity model in mind that can be used to assess whether augments are needed, and by how much.

The purpose of this task is to develop and validate such a model for thumbor (possibly together with the swift frontends, since they're kind of tightly linked).

This doesn't need to be terribly sophisticated - just knowing the most critical bottleneck resource in the regime we want to operate the service in, together with how to measure it and extrapolate to replicas / workers will do.

As mentioned in the notes linked above, thumbor worker container CPU remains quite low throughout the overloaded period, so presumably some sort of concurrency ceiling exists somewhere.
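As a starting point, if the bottleneck does turn out to be worker concurrency rather than CPU, a first-cut model could be based on Little's law (mean requests in flight = arrival rate × mean time in service). The sketch below is only an illustration under that assumption: the function names, the one-request-per-worker assumption, the utilization headroom, and all numbers are hypothetical and not taken from thumbor's actual configuration or from measurements.

```python
import math


def required_workers(arrival_rate_rps, mean_service_time_s, target_utilization=0.5):
    """Estimate the worker count needed to absorb a given request rate.

    Little's law gives the mean number of busy workers as
    arrival_rate * mean_service_time. Dividing by a target utilization
    leaves headroom for bursts. Assumes (hypothetically) that each worker
    handles one request at a time.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    mean_busy_workers = arrival_rate_rps * mean_service_time_s
    return math.ceil(mean_busy_workers / target_utilization)


def required_replicas(workers_needed, workers_per_replica):
    """Round the worker count up to whole replicas (pods)."""
    if workers_per_replica <= 0:
        raise ValueError("workers_per_replica must be positive")
    return math.ceil(workers_needed / workers_per_replica)


if __name__ == "__main__":
    # Hypothetical inputs: the single-DC request rate would be the sum of
    # the per-DC rates observed while both sites are pooled.
    workers = required_workers(arrival_rate_rps=200.0, mean_service_time_s=0.25)
    replicas = required_replicas(workers, workers_per_replica=8)
    print(f"workers={workers} replicas={replicas}")
```

Validating a model like this would mean comparing the predicted busy-worker count against an observed in-flight or queue-depth metric during normal operation and during the overloaded period. If the two diverge, the ceiling is probably somewhere other than worker count.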

Event Timeline

Change #1077111 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/cookbooks@master] sre.discovery.datacenter: exclude swift-https

https://gerrit.wikimedia.org/r/1077111

Change #1077111 merged by jenkins-bot:

[operations/cookbooks@master] sre.discovery.datacenter: exclude swift-https

https://gerrit.wikimedia.org/r/1077111