Page MenuHomePhabricator

CampaignEvents master CI broken by changes to the Installer
Closed, ResolvedPublic

Description

Today, some changes to some MW repos have broken CI for CampaignEvents. The main symptom is that event registration cannot be enabled via the REST API, and therefore the selenium and api-testing jobs are failing. I can't reproduce this locally, so I made r1085493 to test it in CI. Looking at the job output, it seems that the WikiAdmin account is not being globalized by the installer. This is suspiciously reminiscent of T358985, but that task has been fixed a while ago, and the output for the same job seems to shot that the account is globalised:

Running MediaWiki\Extension\CentralAuth\Maintenance\MigrateInitialAccounts...
Globalized 1 accounts

So there must be something else going on...

Event Timeline

The last patch in CampaignEvents was merged 2024-10-31 at 13:56 UTC. It seems like that there have been a few changes to CentralAuth today, so one of them might be the cause.

It seems like that there have been a few changes to CentralAuth today

Half of those are old (I hashtagged a bunch of patches and Gerrit sorts by the date of last change, not necessarily the merge), and I don't think any of them is related to globalizing accounts (which is not a part of the code that would change much).

It seems like that there have been a few changes to CentralAuth today

Half of those are old (I hashtagged a bunch of patches and Gerrit sorts by the date of last change, not necessarily the merge), and I don't think any of them is related to globalizing accounts (which is not a part of the code that would change much).

Oh, I didn't notice. I'm trying to see if tests pass with some of the recent commits reverted, mostly to see if I should look further into CentralAuth, or if it's something else.

this is a more accurate list I think

Oh yes, thank you. I tried again with a bunch of CentralAuth stuff reverted but it still wouldn't pass. I'm now trying to run it with quibble locally, as it's evident that I need a CI-like setup to reproduce this.

Krinkle subscribed.

Oh yes, thank you. I tried again with a bunch of CentralAuth stuff reverted but it still wouldn't pass. I'm now trying to run it with quibble locally, as it's evident that I need a CI-like setup to reproduce this.

Okay. I'll un-tag this for now and keep an eye via our Radar. If this ends up implicating one of our components, feel free to re-tag.

I followed the quibble documentation and got it running locally. I had to retry it a couple times because I didn't have the correct permissions set on the src and cache directories; these were mentioned in a previous section of the manual that I had skipped at first. I was also confused because it seemed to get stuck while cloning MW core, but it really just takes a while. I got to the point where it runs the API tests for my patch. The next problem is that I'm seeing a new failure:

1) POST /campaignevents/v0/event_registration
     "before all" hook in "POST /campaignevents/v0/event_registration":

    AssertionError: Login for "WikiAdmin": Unable to continue login. Your session most likely timed out.: expected 'Failed' to equal 'Success'
    + expected - actual

    -Failed
    +Success
    
    at Client.login (node_modules/api-testing/lib/actionapi.js:291:10)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.root (node_modules/api-testing/lib/action_clients.js:56:3)
    at async Context.<anonymous> (tests/api-testing/EnableRegistration.js:14:25)

That's not the same failure as CI, and I also don't get it when running API tests in my local MW environment. Now trying to figure out what's causing these.

The first thing I wanted were logs, but I noticed that the log directory was empty. MW is configured to use /workspace/src as log dir, but the command given in the quibble documentation mounts it as just /log: -v "$(pwd)"/log:/log. I'm not sure if the documentation is wrong, or if that's the right location for other jobs, or whatnot. Anyway, I changed that to /workspace/src; let there be logs, and there were logs. The following line stood out immediately:

[memcached] Memcached error: Error connecting to 127.0.0.1:11211: Connection refused

Since the error above indicated session loss, I assumed MW was configured to use memcached as session storage, but that memcached wasn't properly setup in the container. A quick search pointed me to T300340: Use Memcached with Quibble. The same search also brought up a supervisord config file for quibble. I then looked more closely at the CI output for the failing job, and noticed that it's using the quibble-with-supervisord entrypoint. So, I've updated my command to include that, and sure enough, it now starts to resemble the CI failure:

3 failing

1) POST /campaignevents/v0/event_registration
     successful
       succeeds for an authorized user if the request body is valid:

    AssertionError: expected 403 to equal 201
    + expected - actual

    -403
    +201
    
    at Context.<anonymous> (tests/api-testing/EnableRegistration.js:75:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

My test patch is actually supposed to print some additional debugging information, but that is nowhere to be seen. I'll look into this first, and then begin actually searching the extension/component that broke the tests.


For reference, the full docker command I'm using:

docker run -it --rm --entrypoint=quibble-with-supervisord  --env-file ./.env   -v "$(pwd)"/cache:/cache   -v "$(pwd)"/log:/workspace/log   -v "$(pwd)"/ref:/srv/git:ro   -v "$(pwd)"/src:/workspace/src   docker-registry.wikimedia.org/releng/quibble-buster-php74:1.11.0 --git-parallel=8 --packages-source vendor --db mysql --db-dir /workspace/db --run api-testing

My test patch is actually supposed to print some additional debugging information, but that is nowhere to be seen. I'll look into this first, and then begin actually searching the extension/component that broke the tests.

That's because zuul is checking out the master branch. I haven't quite understood how to specify the commit to clone, and the various ZUUL_* parameters I put in the env file don't seem to affect this.


First, I was able to remove most extension dependencies. As expected, I could reproduce the failures with just CampaignEvents, CentralAuth, and AntiSpoof; and the tests pass if I remove CentralAuth.

I then went directly to the extension repos cloned under src/extensions and checked out r1085493 for CampaignEvents.

Next, I went to the CentralAuth directory and manually checked out commit f343603 (r1079161, merged October 10). The tests were still failing, so I'm ruling out CentralAuth as the cause.

The next try was with MW core. I checked out commit 7a1f0df (r1080418, merged October 16), switched to --packages-source composer, added --skip-zuul and re-ran the tests. Surprise surprise: they still fail! I briefly tried going back in both core and CampaignEvents, but found other errors due to renamed classes and the like. So, I decided to switch to REL1_43, which is recent enough but not too recent, and should guarantee interoperability. Obviously I first had to deal with the composer requirement of PHP >= 8.1, in a world where Wikimedia production runs 7.4 and most of our tests are also on 7.4. Fine. I reverted r1083286 and moved on. With everything on REL1_43, tests are passing! Now I need to find what made them fail. To be continued in the next episode...

The last thing I wanted was to stumble upon incompatibilities between core and extension; so, I was really happy when I could leave core on REL1_43, reset CampaignEvents and CentralAuth back to master, and not get any incompatibilities. In fact, tests pass with this configuration. So, with core on REL1_43 and both extension on master, I ran a git bisect on MW core, starting with REL1_43 as good and master as bad. I don't have much to say in this, it was really just a simple bisection. Eventually, I arrived at r1083932 (installer: Add TaskRunner and more task classes). The patch modifies the installer steps, so it makes total sense that the default WikiUser account is no longer getting globalized by CentralAuth. Next I will look into why exactly the core change broke CentralAuth's first account globalization logic, and whether the fix should go to core or CentralAuth.

I should have paid more attention to the console output:

Running MediaWiki\Extension\CentralAuth\Maintenance\MigrateInitialAccounts...
Globalized 1 accounts
done.
...globaluser table does not contain gu_salt field.
Creating cuci_wiki_map table...done.
Creating cuci_temp_edit table...done.
Creating cuci_user table...done.
Creating globalblocks table...done.
...have gb_target_central_id field in globalblocks table.
...globalblocks table does not contain gb_by field.
...have gb_create_account field in globalblocks table.
...have gb_enable_autoblock field in globalblocks table.
...have gb_autoblock_parent_id field in globalblocks table.
...index gb_address_autoblock_parent_id already set on globalblocks table.
Creating wikimedia_campaign_events_grant table...done.
done
Creating main page with default content
done
Creating administrator user account
done
Database was successfully set up

Notice how Creating administrator user account is now happening after MigrateInitialAccounts. Whatever account is globalized above is not the admin account used in tests; it's just a red herring. It seems entirely plausible that the refactoring done in r1083932 changed the order in which certain steps are run, which caused the globalization script to run before creating the account. This needs to be fixed in CentralAuth, and Installer::addInstallStep() seems to be the way to solve this for good.

This needs to be fixed in CentralAuth, and Installer::addInstallStep() seems to be the way to solve this for good.

Except I don't think we can get an instance of Installer to call that method on. I'm not sure if that's an intentional limitation or just a WIP. At any rate, I guess I could fix this by reordering the core installer steps so that extension tables come last, as they were before r1083932.

Change #1087227 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[mediawiki/core@master] installer: Create extension tables as the last step

https://gerrit.wikimedia.org/r/1087227

Daimona renamed this task from CampaignEvents master CI broken by changes elsewhere to CampaignEvents master CI broken by changes to the Installer.Mon, Nov 4, 6:42 PM

I confirmed that the patch above fixes the issue, locally and in CI.

Change #1087227 merged by jenkins-bot:

[mediawiki/core@master] installer: Create extension tables as the last step

https://gerrit.wikimedia.org/r/1087227

CI is now green, hence resolving this task. There's a brief conversation on gerrit about potential improvements to the relevant installer code, which might become a separate task.