On day 1 of the September 2024 switchover (T370962), after depooling eqiad from swift.discovery.wmnet (the active/active service that backs upload.wm.o), the service became degraded when thumbor in codfw became overloaded. This was mitigated by repooling eqiad.
See also discussion in T370962#10172473 and T370962#10183874.
While some services are expected to need capacity augments in order to serve from a single DC (particularly large ones), we should at least have a capacity model in mind that can be used to assess whether augments are needed, and by how much.
The purpose of this task is to develop and validate such a model for thumbor (possibly together with the swift frontends, since they're kind of tightly linked).
This doesn't need to be terribly sophisticated: knowing the most critical bottleneck resource in the regime we want to operate the service in, together with how to measure it and how to extrapolate to replicas / workers, will do.
As mentioned in the notes linked above, thumbor worker container CPU remains quite low throughout the overloaded period, so presumably some sort of concurrency ceiling exists somewhere.
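As a starting point for discussion, here is a minimal sketch of what a concurrency-based model could look like, assuming the bottleneck is the number of busy workers rather than CPU. It uses Little's law (average requests in flight = arrival rate x mean time per request). The function names, the one-request-per-worker assumption, the target utilization, and all numbers are hypothetical placeholders, not measurements from production.

```python
import math


def required_replicas(peak_rps, mean_service_s, workers_per_replica,
                      target_utilization=0.7):
    """Estimate replicas needed to serve peak_rps from a single DC.

    Assumes (hypothetically) that each worker handles one request at a
    time, so worker slots are the limiting resource.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Little's law: average number of busy workers = rate * time per request.
    busy_workers = peak_rps * mean_service_s
    # Leave headroom so queueing delay does not blow up near saturation.
    workers_needed = math.ceil(busy_workers / target_utilization)
    return math.ceil(workers_needed / workers_per_replica)


def max_sustainable_rps(replicas, workers_per_replica, mean_service_s,
                        target_utilization=0.7):
    """Inverse of the above: request rate a given fleet can absorb."""
    total_workers = replicas * workers_per_replica
    return total_workers * target_utilization / mean_service_s


# Made-up example inputs: 100 req/s at peak, 0.5 s mean render time,
# 8 workers per replica, 50% target utilization.
print(required_replicas(100, 0.5, 8, target_utilization=0.5))
```

Validating something like this would mean measuring the mean per-request time and the peak request rate that actually reaches thumbor (i.e. thumbnail misses in swift), and checking whether the predicted saturation point matches what was observed during the overloaded period.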