This is a follow-up investigation to T274589: No atomic section is open (got LocalFile::lockingTransaction), focused on the specific "slow PUTs" problem.
The problem: mw times out on a database transaction, e.g. when writing stashed files to swift. It turns out that files are being written from eqiad to codfw at about 2 MB/s, so a ~600MB file can hit the 300s mw timeout (more context at https://phabricator.wikimedia.org/T274589#6850912).
One example of this problem is the following upload from mw1305 (a jobrunner), which takes about 8s in eqiad and about 324s in codfw:
Feb 23 10:52:40 ms-fe1007 proxy-server: 10.64.16.105 10.64.32.220 23/Feb/2021/10/52/40 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tk6fb0dd3ce... 624059195 - 3a6f10dc474ed8113e1498a5751bb075 tx4a56556fd5e44e01bab13-006034de71 - 7.8468 - - 1614077553.124433041 1614077560.971237898 0
Feb 23 10:58:13 ms-fe2007 proxy-server: 10.64.16.105 10.192.32.155 23/Feb/2021/10/58/13 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tk5c39f2dc9... 624059195 - 3a6f10dc474ed8113e1498a5751bb075 tx3a392c4173db430aa9e05-006034de82 - 323.8699 - - 1614077570.121001959 1614077893.990885019 0
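As a sanity check on the transfer rate, the two log lines above can be turned into an effective throughput. This is a minimal sketch, not part of the original investigation: it assumes whitespace-separated proxy log fields with the byte count in field 16 and the request duration in seconds in field 21 (the same positions the awk one-liner in this task uses). The function name `throughput_mb_per_s` is made up for illustration.

```python
# Sample lines copied from this task; the auth tokens are truncated ("...") in the source.
EQIAD_LINE = (
    "Feb 23 10:52:40 ms-fe1007 proxy-server: 10.64.16.105 10.64.32.220 "
    "23/Feb/2021/10/52/40 PUT "
    "/v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/TheLostWorld1925.webm "
    "HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tk6fb0dd3ce... "
    "624059195 - 3a6f10dc474ed8113e1498a5751bb075 "
    "tx4a56556fd5e44e01bab13-006034de71 - 7.8468 - - "
    "1614077553.124433041 1614077560.971237898 0"
)
CODFW_LINE = (
    "Feb 23 10:58:13 ms-fe2007 proxy-server: 10.64.16.105 10.192.32.155 "
    "23/Feb/2021/10/58/13 PUT "
    "/v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/TheLostWorld1925.webm "
    "HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tk5c39f2dc9... "
    "624059195 - 3a6f10dc474ed8113e1498a5751bb075 "
    "tx3a392c4173db430aa9e05-006034de82 - 323.8699 - - "
    "1614077570.121001959 1614077893.990885019 0"
)


def throughput_mb_per_s(log_line: str) -> float:
    """Return bytes transferred / request time, in megabytes (10^6 bytes) per second."""
    fields = log_line.split()
    size_bytes = int(fields[15])  # awk's $16 (assumed: bytes transferred)
    seconds = float(fields[20])   # awk's $21 (assumed: request duration in seconds)
    return size_bytes / seconds / 1e6


if __name__ == "__main__":
    print(f"eqiad: {throughput_mb_per_s(EQIAD_LINE):.1f} MB/s")  # roughly 79.5
    print(f"codfw: {throughput_mb_per_s(CODFW_LINE):.1f} MB/s")  # roughly 1.9
```

At roughly 1.9 MB/s, any file above about 580MB needs more than 300s, which matches the timeouts seen for ~600MB uploads.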
On the centrallog hosts I've tallied the average transfer time (in seconds) in codfw for commons PUTs larger than 300MB, to get a better idea of the time frames involved (note that the day boundaries are not exact, since log files are rotated at ~6 UTC):
zcat ms-fe2*/swift.log-2021${date}.gz | awk '$16 > 300000000 && /PUT \/v1\/AUTH_mw\/wikipedia-commons-local-public/ {print $21 }'
jan 01 7.72208
jan 02 8.49384333333333
jan 03 7.75212173913044
...
jan 23 7.75591935483871
jan 24 6.15061379310345
jan 25 9.28589795918367
jan 26 99.3627666666667
jan 27 23.3935756097561
jan 28 9.05734482758621
jan 29 28.5930095238095
jan 30 149.271148
jan 31 70.0038066666667
feb 01 15.1718
feb 02 156.918878571429
feb 03 24.7632038461538
feb 04 52.9359363636364
feb 05 40.7238714285714
feb 06 125.034576923077
feb 07 120.699621428571
feb 08 171.219108333333
feb 09 198.86314
feb 10 320.42824
feb 11 234.13601875
feb 12 349.262411504425
feb 13 384.73485648855
feb 14 390.028958823529
feb 15 409.759209090909
feb 16 346.23785
feb 17 346.446857142857
feb 18 313.29496
feb 19 335.622576190476
feb 20 233.52412
feb 21 355.615288
feb 22 270.534185365854
feb 23 270.328245454545
feb 24 229.021636842105
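The awk command as posted only prints the per-request transfer times. The averaging step is not shown in this task, so the following is a hedged sketch of how a daily figure could be derived from one day's log lines. It applies the same filter (field 16 above 300000000 bytes, PUT to the commons public containers) and averages field 21. The function name `average_large_put_seconds` is an assumption made for illustration.

```python
import re
from statistics import mean
from typing import Iterable, Optional

# Same path filter as the awk one-liner in this task.
PUT_COMMONS = re.compile(r"PUT /v1/AUTH_mw/wikipedia-commons-local-public")


def average_large_put_seconds(
    lines: Iterable[str], min_bytes: int = 300_000_000
) -> Optional[float]:
    """Average request duration (field 21) for large commons PUTs.

    Returns None when no line matches the filter.
    """
    durations = []
    for line in lines:
        fields = line.split()
        if len(fields) < 21 or not PUT_COMMONS.search(line):
            continue
        try:
            size_bytes = int(fields[15])  # awk's $16
            seconds = float(fields[20])   # awk's $21
        except ValueError:
            # Skip lines where either field is not numeric (e.g. "-").
            continue
        if size_bytes > min_bytes:
            durations.append(seconds)
    return mean(durations) if durations else None
```

With gzip-compressed logs, the lines could be fed in by opening each file with `gzip.open(path, "rt")` and passing the file object to the function.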
During the Feb 11th time window we were, among other things, in the process of decom'ing swift codfw hosts (T272837) and thus pushing more weight to existing hosts. The codfw swift cluster is now in steady state, in the sense that there are no planned com/decom.
The timeframe also coincided with Buster upgrades for mw hosts, and indeed the reported slow jobrunners are all the Buster ones. See also https://phabricator.wikimedia.org/T275752#6864889 for further context.