Develop and validate a model of thumbor capacity to enable single-DC serving
Open, Needs Triage, Public

Assigned To

None

Authored By

	Scott_French
	Oct 4 2024, 9:22 PM

Description

On day 1 of the September 2024 switchover (T370962), after depooling eqiad from swift.discovery.wmnet (the active/active service that backs upload.wm.o), the service became degraded when thumbor in codfw became overloaded. This was mitigated by repooling eqiad.

See also discussion in T370962#10172473 and T370962#10183874.

While some services are expected to need capacity augments in order to serve from a single DC (particularly large ones), we should at least have a capacity model in mind that can be used to assess whether augments are needed, and by how much.

The purpose of this task is to develop and validate such a model for thumbor (possibly together with the swift frontends, since the two are fairly tightly linked).

This doesn't need to be terribly sophisticated. It is enough to know the most critical bottleneck resource in the regime we want to operate the service in, together with how to measure it and how to extrapolate to replicas / workers.

As mentioned in the notes linked above, thumbor worker container CPU remained quite low throughout the overloaded period, so presumably some sort of concurrency ceiling exists somewhere.
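
As a starting point for discussion, here is a minimal sketch of what a concurrency-based model could look like, using Little's law (mean in-flight requests = arrival rate x mean time in service). It assumes that each thumbor worker handles one request at a time and that worker concurrency, not CPU, is the binding limit; neither assumption has been validated here. The function names, the headroom parameter, and all numbers are hypothetical placeholders, not measurements from the switchover.

```python
import math


def required_workers(request_rate_per_s, mean_service_time_s, target_utilization=0.75):
    """Estimate the worker count needed to absorb a given request rate.

    Little's law gives the mean number of in-flight requests as
    arrival rate * mean time in service. ASSUMPTION: one request per
    worker at a time, so in-flight requests == busy workers.
    target_utilization leaves headroom for bursts (hypothetical default).
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    mean_in_flight = request_rate_per_s * mean_service_time_s
    return math.ceil(mean_in_flight / target_utilization)


def required_replicas(workers, workers_per_replica):
    """Convert a worker count into replicas (workers_per_replica is hypothetical)."""
    return math.ceil(workers / workers_per_replica)


if __name__ == "__main__":
    # Placeholder inputs: substitute the measured single-DC peak request rate
    # and the mean per-request service time for thumbor.
    workers = required_workers(request_rate_per_s=100, mean_service_time_s=0.5,
                               target_utilization=0.5)
    print(workers, required_replicas(workers, workers_per_replica=8))
```

Validating the model would then mean comparing the predicted ceiling against the request rate and latency observed when codfw became overloaded.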

Details

	Subject	Repo	Branch	Lines +/-
	sre.discovery.datacenter: exclude swift-https	operations/cookbooks	master	+1 -0


Related Objects

Mentioned In: T370962: Southward Datacenter Switchover (September 2024)
Mentioned Here: T370962: Southward Datacenter Switchover (September 2024)

Event Timeline

Scott_French created this task. Oct 4 2024, 9:22 PM

Restricted Application added a subscriber: Aklapper. Oct 4 2024, 9:23 PM

Change #1077111 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/cookbooks@master] sre.discovery.datacenter: exclude swift-https

https://gerrit.wikimedia.org/r/1077111

gerritbot added a project: Patch-For-Review. Oct 4 2024, 9:26 PM

Scott_French mentioned this in T370962: Southward Datacenter Switchover (September 2024). Oct 4 2024, 9:26 PM

kamila subscribed. Oct 7 2024, 2:09 PM

Change #1077111 merged by jenkins-bot: