
No atomic section is open (got LocalFile::lockingTransaction)
Open, High · Public · PRODUCTION ERROR

Description

Error message
[YCV637IDuNZSvZ4-nNEqpAAAAQE] [no req]   Wikimedia\Rdbms\DBUnexpectedError: No atomic section is open (got LocalFile::lockingTransaction)
Stack Trace
from /srv/mediawiki/php-1.36.0-wmf.30/includes/libs/rdbms/database/Database.php(4375)
#0 /srv/mediawiki/php-1.36.0-wmf.30/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->endAtomic(string)
#1 /srv/mediawiki/php-1.36.0-wmf.30/includes/libs/rdbms/database/DBConnRef.php(629): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#2 /srv/mediawiki/php-1.36.0-wmf.30/includes/filerepo/file/LocalFile.php(2379): Wikimedia\Rdbms\DBConnRef->endAtomic(string)
#3 /srv/mediawiki/php-1.36.0-wmf.30/includes/filerepo/file/LocalFile.php(2397): LocalFile->unlock()
#4 /srv/mediawiki/php-1.36.0-wmf.30/includes/jobqueue/jobs/PublishStashedFileJob.php(133): LocalFile->__destruct()
#5 /srv/mediawiki/php-1.36.0-wmf.30/extensions/EventBus/includes/JobExecutor.php(79): PublishStashedFileJob->run()
#6 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#7 {main}
Impact
  1. Notes

Details

Request URL
[no req]

Event Timeline


For the record, logstash-beta is still missing all events - we might have caught this on the beta cluster otherwise

Maybe what's happening is as simple as the MySQL connection getting dropped at some point while the file upload is being processed?

If the connection dropped it would be impossible to unlock and end atomic blocks, right?

Here's some more context about the error:

LocalFile->unlock() in wmf.30
public function unlock() {
	if ( $this->locked ) {
		--$this->locked;
		if ( !$this->locked ) {
			$dbw = $this->repo->getMasterDB();
			$dbw->endAtomic( self::ATOMIC_SECTION_LOCK );
			$this->lockedOwnTrx = false;
		}
	}
}

And the site of the error:

PublishStashedFileJob.php:120-139
		} catch ( Exception $e ) {
			UploadBase::setSessionStatus(
				$user,
				$this->params['filekey'],
				[
					'result' => 'Failure',
					'stage' => 'publish',
					'status' => Status::newFatal( 'api-error-publishfailed' )
				]
			);
			$this->setLastError( get_class( $e ) . ": " . $e->getMessage() );
			// To prevent potential database referential integrity issues.
			// See T34551.
			MWExceptionHandler::rollbackMasterChangesAndLog( $e );    <--------error reported for this line

			return false;
		}

		return true;
	}

Note that the Error: 2006 MySQL server has gone away error I pasted in a previous comment is from code inside the try block preceding this catch block.

MySQL connections can indeed time out or go away at times. The handling of that connection issue is itself causing further problems, which is the subject of this task. There appears to be a bug in LocalFile::unlock, where it expects an atomic section to have been created and to still exist, and commands it to be closed, which doesn't work since there isn't one at this point.
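
For illustration only, a minimal sketch of the kind of guard that would avoid the exception here. This is not the actual fix, and whether IDatabase::trxLevel() is the right check after a disconnect is an assumption:

public function unlock() {
	if ( $this->locked ) {
		--$this->locked;
		if ( !$this->locked ) {
			$dbw = $this->repo->getMasterDB();
			// Hypothetical guard: after "MySQL server has gone away" the
			// transaction (and its atomic sections) no longer exist, so
			// calling endAtomic() unconditionally throws DBUnexpectedError.
			if ( $dbw->trxLevel() ) {
				$dbw->endAtomic( self::ATOMIC_SECTION_LOCK );
			}
			$this->lockedOwnTrx = false;
		}
	}
}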

It reaches there, per the original stack trace, from PublishStashedFileJob.php:133. However, the code at that line is not itself relevant; it is merely logging the MySQL timeout, which is fine and not an issue unless the database is consistently down (which SRE/DBA would get alerts for and handle accordingly).

The reason line 133 is attributed is that it's the point where the method scope ends (given the return statement below it), and thus LocalFile::__destruct is implicitly called. I'll leave it at that for the next person to pick up further, so as not to take away a learning opportunity.
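
To illustrate just that mechanism with a standalone example (plain PHP, not MediaWiki code): when a function returns, its local objects go out of scope and PHP implicitly calls their destructors, so the error surfaces at the line where the method's execution ends.

<?php
// Standalone illustration of the implicit destructor call described above.
class DemoFile {
	public function __destruct() {
		// In LocalFile this is where unlock() runs and endAtomic() throws.
		echo "destructor fires as the enclosing scope ends\n";
	}
}

function run(): bool {
	$file = new DemoFile();
	return false; // the error gets attributed to this region of the method
}

run(); // the destructor message is printed as run()'s scope is torn down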

If it helps narrow down the problem: the issue was actually introduced in wmf.28, and we didn't notice it because it was only on production for a brief amount of time.

image.png (379×1 px, 54 KB)

So I think something is actually closing the connection in the job, because basically all PublishStashedFile jobs are now failing (the number of inserted jobs is pretty small).

Another reason to think it's actually a logical bug rather than the connection simply going away: the "connection is gone" error always shows up in the select statement of CommentStore->insert(), nowhere else.

I had just reached a similar conclusion by looking at the https://www.mediawiki.org/wiki/MediaWiki_1.36/wmf.28 core patches that touched rdbms.

If we want to roll that back, we'll also need to revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/607155/ because the rdbms patch was introduced to fix a bug exposed by the coalesceKeys change; see https://phabricator.wikimedia.org/T252564#6212081

I would love it if someone more knowledgeable about these parts of the codebase could check to make sure rolling these back won't cause other issues.

Adding @aaron since he will definitely know the issues in those two patches.

Summary:

  • I thought it was noise, given that "MySQL gone away" is a fairly normal thing to see at scale (we've had those for years): connections can sometimes drop, and while fairly rare, a couple a day is not strange.
  • Thanks @Ladsgroup for noticing that these errors have actually not shown up in recent history, and that they popped up during the wmf.28 deployment, then went away and came back on the wmf.30 rollout, which is indeed a strong indicator of a direct regression in new code.
  • Also thanks for noticing that the rate, albeit low from a global perspective, correlates very well with PublishStashedFileJob frequency. This is thus very likely to cause end-user damage around uploads. I suggest we roll back Commons/group1 until this is fixed.
  • The reason we suspect Rdbms as the root cause is that the CommentStore, Upload, and Job code paths were not changed in wmf.28. Thus the running theory is that our Rdbms refactoring has caused CommentStore insertions to very often result in a connection loss.

Mentioned in SAL (#wikimedia-operations) [2021-02-12T19:34:22Z] <twentyafterfour> Train status: Rolling back commonswiki to wmf.27 due to T274589 (refs T271344)

Mentioned in SAL (#wikimedia-operations) [2021-02-12T19:43:21Z] <twentyafterfour@deploy1001> rebuilt and synchronized wikiversions files: roll back commonswiki to 1.36.0-wmf.27 due to T274589

Luss422 renamed this task from No atomic section is open (got LocalFile::lockingTransaction) to Vhh.Feb 14 2021, 5:16 AM
Luss422 removed aaron as the assignee of this task.
Luss422 removed Request ID.
Luss422 updated the task description. (Show Details)
Luss422 removed subscribers: aaron, ArielGlenn, Ladsgroup and 8 others.

@hashar Hey, are we going to go forward with wmf.30? :-) There's been suspicion about https://gerrit.wikimedia.org/r/c/mediawiki/core/+/655809 above, and I don't see any updates posted here by anyone :/.

"MySQL server has gone away" means the connection was killed on the server side. If a query is running, it will return this error, but if the connection is idle, the error is reported when the next query is attempted. In LocalFile::upload(), a transaction is opened prior to Swift file operations being done. The next DB query after the file operations is always CommentStore::createComment(). So if Swift takes a long time, and the master connection is killed by the server while Swift operations are in progress, this error is the expected result.

If Database::canRecoverFromDisconnect() is true, MW will silently reconnect after a "MySQL server has gone away" error, but I don't think this has been possible after LocalFile::lock() since 2013.

I can reproduce the error locally during a normal upload by sleeping until wait_timeout is exceeded:

diff --git a/includes/filerepo/file/LocalFile.php b/includes/filerepo/file/LocalFile.php
index af4a2916bc..0ebfe323d0 100644
--- a/includes/filerepo/file/LocalFile.php
+++ b/includes/filerepo/file/LocalFile.php
@@ -1412,7 +1412,9 @@ class LocalFile extends File {
 		$comment = trim( $comment );
 
 		$this->lock();
+		wfGetDB( DB_MASTER )->query( 'SET wait_timeout=2' );
 		$status = $this->publish( $src, $flags, $options );
+		sleep( 3 );
 
 		if ( $status->successCount >= 2 ) {
 			// There will be a copy+(one of move,copy,store).

Reverting to the branch point of wmf.27 makes no difference to this model of the failure.

Rolling back to wmf.27 did not stop the errors in production.

CommentStore errors annotated.png (664×1 px, 47 KB)

It looks very much like the error is unrelated to the deployment. It is probably caused by user demand or Swift server load. But I would still like to look more closely at Swift performance and the wait timeouts. wait_timeout should be 1 hour, which is a long time to wait for an upload. UploadWizard uses the async mode when the file is larger than 10MB, which seems like a pretty small threshold to trigger this kind of failure.

Thanks for the investigation. You're right that it happens on wmf.27 (I don't know how I missed it), and please correct me if I'm wrong, but now it seems it was introduced in wmf.27. I can't find it in wmf.26 or earlier: https://logstash.wikimedia.org/goto/362020ee0600114fc5e245f69267c796

image.png (144×1 px, 15 KB)

This doesn't look good ^

One possibility could be that some user is uploading massive files to Commons and it's coinciding with the train?

I looked at 75 PublishStashedFile successful job runs in JobExecutor.log-20210213.gz. The largest duration was 1061 seconds, which was for this 112MB TIFF.

It's odd that I don't see the start of the failed runs in this log. But if a job can run for 1061 seconds, you can imagine it running for 3600 seconds. One of the failed files was later reuploaded, so we can see that it was an 86MB PDF.

Ladsgroup lowered the priority of this task from Unbreak Now! to High.Feb 16 2021, 3:29 PM
Ladsgroup subscribed.

Not a train blocker, but it needs to be addressed nevertheless.

I looked for network or CPU saturation on ms-fe* or the MediaWiki server reporting the error, but found nothing obvious. I wonder if there's a way to benchmark the Swift operation from the command line. It should just be copying from the stash to the public container using X-Copy-From.

I want to hand this over to someone who is more familiar with Swift. The maintainers list suggests @fgiunchedi.

The story thus far: the PublishStashedFile job is often very slow. See this grafana dashboard. Mean service times of ~300s seem to correlate with errors in the logs. MediaWiki is holding a DB transaction open in an attempt to prevent simultaneous updates to the same file. After enough time has elapsed, the DB connection is lost and MediaWiki complains about the lost transaction.

The job is doing an internal copy (PUT with X-Copy-From) followed by a DELETE. We don't know which is slow. The files triggering the errors are apparently not excessively large (~100MB).

Not sure yet what's going on, but here's what I found so far:

  • the mean time in @tstarling's dashboard seems to go up significantly starting ~ Feb 8th

2021-02-23-152350_693x374_scrot.png (374×693 px, 43 KB)

Having file names (or ideally the transaction id from Swift's HTTP response headers) would help in finding culprits on the Swift side. I scanned the task but couldn't find any. Do we know what file(s) failed?

If you look at the logstash records, they are not there (https://logstash.wikimedia.org/goto/97dbe1143f25bc064dd0b42ddf2f2d3f), but if you copy the reqId and search for that, you'll find other errors, including ones that mention the name of the file. For example, I found TheLostWorld1925.webm as one of them.

Thank you @Ladsgroup! I've used TheLostWorld1925.webm as the filename to scavenge the Swift logs:

Initial (?) upload of the 630MB file on Feb 23rd at 5:30 (UTC); the codfw upload does indeed take 306s while eqiad takes 6s. The operation takes a long time; however, it is a success.

Feb 23 05:30:55 ms-fe1008 proxy-server: 10.64.16.66 10.64.32.221 23/Feb/2021/05/30/55 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tk6fb0dd3ce... 624059195 - dd3347596df01668fca421c1dbc665da tx527e580956614c9ebe051-0060349309 - 6.6567 - - 1614058249.090892076 1614058255.747560024 0

Feb 23 05:36:07 ms-fe2007 proxy-server: 10.64.16.66 10.192.32.155 23/Feb/2021/05/36/07 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tk5c39f2dc9... 624059195 - dd3347596df01668fca421c1dbc665da tx52b76ccdfcaa44c4a6ba0-0060349315 - 306.6444 - - 1614058261.242499113 1614058567.886924028 0

Archival of the file, same day at 10:52. Both codfw and eqiad are "fast", 3s and 7s respectively.

Feb 23 10:52:23 ms-fe1007 proxy-server: Copying object from /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/TheLostWorld1925.webm to /v1/AUTH_mw/wikipedia-commons-local-public.9d/archive/9/9d/20210223105220%21TheLostWorld1925.webm (txn: txa9beb5fbea8741968d57d-006034de67)
Feb 23 10:52:29 ms-fe1007 proxy-server: 10.64.16.105 10.64.32.220 23/Feb/2021/10/52/29 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/archive/9/9d/20210223105220%2521TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tk6fb0dd3ce... - - - txa9beb5fbea8741968d57d-006034de67 - 6.8579 - - 1614077543.048239946 1614077549.906187057 -

Feb 23 10:52:43 ms-fe2007 proxy-server: Copying object from /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/TheLostWorld1925.webm to /v1/AUTH_mw/wikipedia-commons-local-public.9d/archive/9/9d/20210223105220%21TheLostWorld1925.webm (txn: tx6c11975f86da4a0488a99-006034de7b)
Feb 23 10:52:46 ms-fe2007 proxy-server: 10.64.16.105 10.192.32.155 23/Feb/2021/10/52/46 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/archive/9/9d/20210223105220%2521TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tk5c39f2dc9... - - - tx6c11975f86da4a0488a99-006034de7b - 3.4376 - - 1614077563.296531916 1614077566.734178066 -

Then there's another upload operation, which I'm assuming is a new version of the file (not sure; the size is the same, 624059195 bytes). eqiad takes ~8s and codfw takes ~330s.

Feb 23 10:52:40 ms-fe1007 proxy-server: 10.64.16.105 10.64.32.220 23/Feb/2021/10/52/40 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tk6fb0dd3ce... 624059195 - 3a6f10dc474ed8113e1498a5751bb075 tx4a56556fd5e44e01bab13-006034de71 - 7.8468 - - 1614077553.124433041 1614077560.971237898 0

Feb 23 10:58:13 ms-fe2007 proxy-server: 10.64.16.105 10.192.32.155 23/Feb/2021/10/58/13 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tk5c39f2dc9... 624059195 - 3a6f10dc474ed8113e1498a5751bb075 tx3a392c4173db430aa9e05-006034de82 - 323.8699 - - 1614077570.121001959 1614077893.990885019 0

Notable things:

  • 300s for a 600MB transfer to codfw is obviously too much time (~2MB/s)
  • I can't find a DELETE for that file
  • AFAIK the slow uploads to codfw are a new problem that manifests this way (?). We've also reimaged all appservers (save for a bunch) to Buster during February; I'm wondering if that's what's making a difference here.

This is definitely new behaviour ("uploads towards codfw from eqiad are slow"); here's a sample of big uploads completing in January within a reasonable time (e.g. ~15s for a 1.5GB file):

Jan  2 11:02:27 ms-fe2005 proxy-server: 10.64.16.59 10.192.0.28 02/Jan/2021/11/02/27 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9f/9/9f/NICE_AND_TRAIN_%25281995%2529_%25E2%2580%25A2_Nizzan_%2522L%25C3%25A4tt%25C3%25A4hatun%2522_kyydiss%25C3%25A4_1995.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tkc39b6d867... 792057912 - 06b1e8d7ed9cd56fe3dca33b3c9b2c28 tx0d3b932238604678aba70-005ff052b9 - 10.4451 - - 1609585337.552912951 1609585347.997983932 0
Jan  2 11:30:44 ms-fe2005 proxy-server: 10.64.16.61 10.192.0.28 02/Jan/2021/11/30/44 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9f/9/9f/LOURDES_%25282%2529%252C_France.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tkc39b6d867... 1039325883 - 2046dd6734b78be3af7948e527a5f925 tx00f84c2004914ff49bab2-005ff05958 - 12.5428 - - 1609587032.373337030 1609587044.916157007 0
Jan  2 13:16:32 ms-fe2005 proxy-server: 10.64.16.63 10.192.0.28 02/Jan/2021/13/16/32 PUT /v1/AUTH_mw/wikipedia-commons-local-public.5c/5/5c/LOURDES_%25281%2529%252C_France_1996_-_2016.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tkc39b6d867... 985945967 - 7460053cca9471100ef7779dd5341b97 tx7d6f3254306649c9a125f-005ff07226 - 10.0630 - - 1609593382.471777916 1609593392.534816980 0
Jan  2 15:18:21 ms-fe2005 proxy-server: 10.64.16.65 10.192.0.28 02/Jan/2021/15/18/21 PUT /v1/AUTH_mw/wikipedia-commons-local-public.ae/a/ae/%25D0%2594%25D0%2590%25D0%25A5%25D0%25BC%25D0%259E_315-1-7086._1821._%25D0%25A1%25D0%25BF%25D0%25BE%25D0%25B2%25D1%2596%25D0%25B4%25D0%25B0%25D0%25BB%25D1%258C%25D0%25BD%25D1%2596_%25D0%25B2%25D1%2596%25D0%25B4%25D0%25BE%25D0%25BC%25D0%25BE%25D1%2581%25D1%2582%25D1%2596_%25D1%2586%25D0%25B5%25D1%2580%25D0%25BA%25D0%25BE%25D0%25B2_%25D0%2591%25D0%25B0%25D0%25BB%25D1%2582%25D1%2581%25D1%258C%25D0%25BA%25D0%25BE%25D0%25B3%25D0%25BE_%25D0%25BF%25D0%25BE%25D0%25B2%25D1%2596%25D1%2582%25D1%2583.pdf HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tkc39b6d867... 822136758 - 711b68605b65f4ed9bfab49cfd3bc60c tx6395a9d27a9e4e5bbe2a5-005ff08eb0 - 12.3730 - - 1609600688.712816954 1609600701.085859060 0
Jan  2 23:28:08 ms-fe2005 proxy-server: 10.64.16.59 10.192.0.28 02/Jan/2021/23/28/08 PUT /v1/AUTH_mw/wikipedia-commons-local-public.0f/0/0f/THE_TALL_SHIPS_RACES%252C_Helsinki_2013.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tkc39b6d867... 1672979652 - 2bab68907ec379c0eeb54610fbb9796d tx503d0d5a2e2a497abeb75-005ff10174 - 19.6408 - - 1609630068.695461035 1609630088.336302042 0
Jan  3 01:02:36 ms-fe2005 proxy-server: 10.64.16.67 10.192.0.28 03/Jan/2021/01/02/36 PUT /v1/AUTH_mw/wikipedia-commons-local-public.b4/b/b4/%25D0%2594%25D0%2590%25D0%25A5%25D0%25BC%25D0%259E_315-1-7360._1831._%25D0%25A1%25D0%25BF%25D0%25BE%25D0%25B2%25D1%2596%25D0%25B4%25D0%25B0%25D0%25BB%25D1%258C%25D0%25BD%25D1%2596_%25D0%25B2%25D1%2596%25D0%25B4%25D0%25BE%25D0%25BC%25D0%25BE%25D1%2581%25D1%2582%25D1%2596_%25D1%2586%25D0%25B5%25D1%2580%25D0%25BA%25D0%25BE%25D0%25B2_%25D0%2591%25D0%25B0%25D0%25BB%25D1%2582%25D1%2581%25D1%258C%25D0%25BA%25D0%25BE%25D0%25B3%25D0%25BE_%25D0%25BF%25D0%25BE%25D0%25B2%25D1%2596%25D1%2582%25D1%2583._%25D0%259B-%25D0%25AE.pdf HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tkc39b6d867... 1508733800 - db88261104963a8283bfe80898efd6e1 tx20c027a57535483b9fe7b-005ff1179d - 15.4835 - - 1609635741.463521004 1609635756.947052002 0

More empirical bandwidth tests, downloading https://commons.wikimedia.org/wiki/File:The_Lost_World_(1925).webm (not the same file, but big enough), in both directions, eqiad -> codfw and codfw -> eqiad:

downloading in codfw from eqiad shows no obvious problems

mw2239:~$ wget  https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved
2021-02-24 11:29:46 (88.8 MB/s) - ‘/dev/null’ saved [300914670/300914670]
mw2239:~$ wget  https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved
2021-02-24 11:29:57 (39.4 MB/s) - ‘/dev/null’ saved [300914670/300914670]
mw2239:~$ wget  https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved
2021-02-24 11:31:00 (64.1 MB/s) - ‘/dev/null’ saved [300914670/300914670]
mw2239:~$ wget  https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved
2021-02-24 11:31:12 (28.9 MB/s) - ‘/dev/null’ saved [300914670/300914670]
mw2239:~$ wget  https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved
2021-02-24 11:31:21 (57.4 MB/s) - ‘/dev/null’ saved [300914670/300914670]
mw2239:~$

While downloading in eqiad from codfw does, e.g. on mw1305

mw1305:~$ wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved
2021-02-24 11:29:57 (10.5 MB/s) - ‘/dev/null’ saved [300914670/300914670]
mw1305:~$ wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved
2021-02-24 11:30:22 (13.6 MB/s) - ‘/dev/null’ saved [300914670/300914670]
mw1305:~$ wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved
2021-02-24 11:31:04 (7.23 MB/s) - ‘/dev/null’ saved [300914670/300914670]
mw1305:~$ wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved
2021-02-24 11:31:37 (9.29 MB/s) - ‘/dev/null’ saved [300914670/300914670]
mw1305:~$

Interestingly, I can't reproduce the slow download bandwidth on another eqiad host, e.g. centrallog1001

centrallog1001:~$ wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved
2021-02-24 11:30:17 (49.7 MB/s) - ‘/dev/null’ saved [300914670/300914670]
centrallog1001:~$ wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved
2021-02-24 11:30:27 (54.4 MB/s) - ‘/dev/null’ saved [300914670/300914670]
centrallog1001:~$ wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved
2021-02-24 11:31:38 (17.5 MB/s) - ‘/dev/null’ saved [300914670/300914670]
centrallog1001:~$ wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved
2021-02-24 11:32:23 (48.4 MB/s) - ‘/dev/null’ saved [300914670/300914670]
centrallog1001:~$

I spot-checked a few other hosts in eqiad; one correlation I could find is CPU usage: on more heavily CPU-loaded hosts, transfers are not quite as fast.

A few eqiad hosts for which codfw downloads are fast:

i7:~$ ssh rdb1005.eqiad.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:21:23 (95.2 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh logstash1010.eqiad.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:21:44 (87.1 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh druid1008.eqiad.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:22:27 (30.4 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh logstash1026.eqiad.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:32:37 (163 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh logstash1023.eqiad.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:32:55 (88.8 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh ms-fe1005.eqiad.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:33:46 (59.6 MB/s) - ‘/dev/null’ saved [300914670/300914670]

And eqiad hosts that are consistently slow when downloading from codfw
(including a few appservers I spot-checked)

i7:~$ ssh prometheus1003.eqiad.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:24:09 (3.13 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh graphite1004.eqiad.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:29:24 (1.84 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh thumbor1003.eqiad.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:32:06 (5.01 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh mw1266.eqiad.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 16:14:32 (2.70 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh mw1267.eqiad.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 16:44:23 (4.33 MB/s) - ‘/dev/null’ saved [300914670/300914670]

The same hosts can be hit and miss when downloading locally from eqiad:

i7:~$ ssh prometheus1003.eqiad.wmnet 'wget https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 16:39:39 (12.2 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh graphite1004.eqiad.wmnet 'wget https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 16:39:06 (49.3 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh thumbor1003.eqiad.wmnet 'wget https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 16:41:39 (63.9 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh mw1266.eqiad.wmnet 'wget https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 16:40:50 (83.4 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh mw1267.eqiad.wmnet 'wget https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 16:43:04 (77.3 MB/s) - ‘/dev/null’ saved [300914670/300914670]

Note that downloading in codfw from eqiad can vary too, e.g. fast hosts:

i7:~$ ssh centrallog2001.codfw.wmnet 'wget https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:41:27 (33.7 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh ms-fe2005.codfw.wmnet 'wget https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:44:05 (112 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh logstash2020.codfw.wmnet 'wget https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:44:36 (77.6 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh mw2316.codfw.wmnet 'wget https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 16:47:11 (67.8 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh mw2254.codfw.wmnet 'wget https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 16:47:42 (30.1 MB/s) - ‘/dev/null’ saved [300914670/300914670]

And "slow" hosts:

i7:~$ ssh prometheus2004.codfw.wmnet 'wget https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:40:04 (3.18 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh thumbor2002.codfw.wmnet 'wget https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:42:17 (7.43 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh thumbor2001.codfw.wmnet 'wget https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 13:43:42 (10.8 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh graphite2003.codfw.wmnet 'wget https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 16:49:34 (15.9 MB/s) - ‘/dev/null’ saved [300914670/300914670]

The same hosts are somewhat "fine" when downloading locally, though they can be slow too:

i7:~$ ssh prometheus2004.codfw.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 16:52:54 (15.5 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh thumbor2002.codfw.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 16:50:52 (94.2 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh thumbor2001.codfw.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 16:51:29 (88.4 MB/s) - ‘/dev/null’ saved [300914670/300914670]
i7:~$ ssh graphite2003.codfw.wmnet 'wget https://ms-fe.svc.codfw.wmnet/wikipedia/commons/2/2e/The_Lost_World_%281925%29.webm -O/dev/null 2>&1 | grep saved'
2021-02-24 16:52:17 (112 MB/s) - ‘/dev/null’ saved [300914670/300914670]
fgiunchedi removed a project: SRE.

I've been going quite off topic for this task; I'm following up on the specific issue in T275752 and untagging SRE.

Further details are in that task, but tl;dr: it looks like a side effect of the Buster upgrades for the MW jobrunners.

Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.
Krinkle added a subscriber: aaron.

Seen again today. Not a lot, but persistent.

Let's do some work in MW to reduce the user impact of Swift slowness. If a publish operation is in progress, a second concurrent publish request should just error out with "someone else is doing something to this file", with no waiting. We already have LocalFileLockError, but I think it's thrown after 10s. And let's move the publish operation outside the DB transaction.

LocalFile::lock() has always started a transaction, and probably some things depend on that. More recently it has been both a DB lock and a file lock. We could split it into two separate methods instead. LocalFile::upload() has a file operation phase (LocalRepo::publish() etc.) and a DB phase (recordUpload3). I think the file lock would wrap both things, with no waiting if there is contention, but the transaction only needs to be around recordUpload3 (which starts its own transaction already).
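
A rough sketch of what that split could look like (method names like acquireFileLock()/releaseFileLock() are hypothetical placeholders, signatures are simplified, and error handling is elided; this is not the actual MediaWiki implementation):

public function upload( $src, $comment, $pageText, $flags = 0, $props = false ) {
	// Hypothetical non-blocking file lock: fail immediately with
	// LocalFileLockError on contention instead of waiting.
	$this->acquireFileLock();
	try {
		// File operation phase (Swift copy/publish), with no DB transaction held open.
		$status = $this->publish( $src, $flags, [] );
		if ( $status->isOK() ) {
			// DB phase: recordUpload3() opens and commits its own transaction.
			$status->merge( $this->recordUpload3( $status->value, $comment, $pageText, $props ) );
		}
	} finally {
		// Hypothetical counterpart to acquireFileLock().
		$this->releaseFileLock();
	}
	return $status;
}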

Let's do some work in MW to reduce the user impact of Swift slowness

I filed T283045 for this.

I have removed a log filter from the Kibana MediaWiki New Errors dashboard which pointed back to this task. There have been no hits over the last eight weeks.

The filter was T274589 - No atomic section is open (got LocalFile::lockingTransaction):

{
  "query": {
    "match_phrase": {
      "labels.normalized_message": "[{reqId}] {exception_url}   Wikimedia\\Rdbms\\DBUnexpectedError: No atomic section is open (got LocalFile::lockingTransaction)"
    }
  }
}