chore(systests): force testnet allocation to the local DC for all system-tests #10436
Merged
basvandijk merged 4 commits on Jun 11, 2026
…tem-tests

Generalize the opt-in `.allocate_testnet_to_local_dc()` introduced in #10122 to all system-tests:

* Replace the `SystemTestGroup` bool and builder method with the `ALLOCATE_TESTNET_TO_LOCAL_DC` environment variable read by the test driver when creating the Farm group.
* Set `ALLOCATE_TESTNET_TO_LOCAL_DC=1` unconditionally from the `system_test` macro in `rs/tests/system_tests.bzl` so it applies to every system-test (it remains a no-op when the `DC` volatile status variable is unknown, e.g. when running locally).
* Point `dep_download_url` in `rs/tests/upload_systest_dep.sh` directly at the DC-local bazel cache (`https://artifacts.$cluster.dfinity.network/cas/...`) instead of going through the dc_http_proxy.
* Pin the `release-system-tests` job in `release-testing.yml` and the `system-tests-benchmarks-nightly` job to runner group `dm1` so their testnets and runners are in the same DC.
Pull request overview
This PR makes system-tests consistently allocate Farm testnets in the same data center (DC) as the GitHub runner that built the images, reducing flaky failures caused by cross-DC transfers of large artifacts.
Changes:
- Switch “allocate testnet to local DC” from a `SystemTestGroup` opt-in flag to an env var (`ALLOCATE_TESTNET_TO_LOCAL_DC`) read by the test driver.
- Set `ALLOCATE_TESTNET_TO_LOCAL_DC=1` unconditionally from the `system_test` Bazel macro, and remove redundant `.allocate_testnet_to_local_dc()` calls from nested tests.
- Point systest dependency download URLs directly at `https://artifacts.$cluster.dfinity.network/cas/$dep_sha256` and pin selected GitHub Actions jobs to runner group `dm1`.
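As a rough illustration of the env-var switch, the sketch below parses `ALLOCATE_TESTNET_TO_LOCAL_DC` using the accepted values listed in the PR description (`1`/`true`/`0`/`false`). The function names `parse_flag` and `allocate_testnet_to_local_dc` are hypothetical, and treating an unset variable as "disabled" and matching case-sensitively are assumptions, not confirmed behaviour of the actual driver.

```rust
use std::env;

/// Interprets a raw flag value. Only `1`/`true`/`0`/`false` are accepted;
/// case-sensitive matching is an assumption of this sketch.
fn parse_flag(raw: &str) -> Result<bool, String> {
    match raw {
        "1" | "true" => Ok(true),
        "0" | "false" => Ok(false),
        other => Err(format!(
            "invalid value {other:?}; expected one of 1/true/0/false"
        )),
    }
}

/// Reads ALLOCATE_TESTNET_TO_LOCAL_DC from the environment.
/// Assumption: an unset variable means the feature is disabled.
fn allocate_testnet_to_local_dc() -> Result<bool, String> {
    match env::var("ALLOCATE_TESTNET_TO_LOCAL_DC") {
        Ok(raw) => parse_flag(&raw),
        Err(env::VarError::NotPresent) => Ok(false),
        Err(e) => Err(e.to_string()),
    }
}
```

Rejecting unknown values instead of silently defaulting makes a mistyped setting in a Bazel macro or workflow file fail loudly.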
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| rs/tests/upload_systest_dep.sh | Changes returned CAS download URL to go directly to the cluster-local artifacts endpoint over HTTPS. |
| rs/tests/system_tests.bzl | Forces ALLOCATE_TESTNET_TO_LOCAL_DC=1 for all system-test targets (including _colocate). |
| rs/tests/nested/registration.rs | Removes now-redundant .allocate_testnet_to_local_dc() call. |
| rs/tests/nested/nns_recovery/nr_no_bless_fix_like_np.rs | Removes now-redundant .allocate_testnet_to_local_dc() call. |
| rs/tests/nested/nns_recovery/nr_local.rs | Removes now-redundant .allocate_testnet_to_local_dc() call. |
| rs/tests/nested/nns_recovery/nr_broken_dfinity_node.rs | Removes now-redundant .allocate_testnet_to_local_dc() call. |
| rs/tests/nested/nns_recovery/nr_all_broken_seq_np_actions.rs | Removes now-redundant .allocate_testnet_to_local_dc() call. |
| rs/tests/nested/hostos_upgrade.rs | Removes now-redundant .allocate_testnet_to_local_dc() call. |
| rs/tests/nested/guestos_upgrade.rs | Removes now-redundant .allocate_testnet_to_local_dc() call. |
| rs/tests/driver/src/driver/test_env_api.rs | Reads ALLOCATE_TESTNET_TO_LOCAL_DC env var to optionally add HostFeature::DC(...) to Farm group creation. |
| rs/tests/driver/src/driver/group.rs | Removes the SystemTestGroup flag and updates group setup call signature accordingly. |
| .github/workflows/system-tests-benchmarks-nightly.yml | Pins the nightly benchmarks job to runner group dm1. |
| .github/workflows/release-testing.yml | Pins release-system-tests to runner group dm1 and relocates the &dind-large-setup anchor. |
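The `test_env_api.rs` row describes optionally adding `HostFeature::DC(...)` to the Farm group request. A minimal sketch of that decision follows; the `HostFeature` enum here is a local stand-in (only the `DC` variant is named in the PR), and `local_dc_features` with its parameters is a hypothetical helper. It also models the stated no-op when the local DC is unknown.

```rust
/// Hypothetical stand-in for the driver's host-feature type.
/// Only the `DC` variant is mentioned in the PR.
#[derive(Debug, Clone, PartialEq, Eq)]
enum HostFeature {
    DC(String),
}

/// Returns the extra host features to request for the Farm group.
/// Yields a DC constraint only when the flag is on AND the local DC is known;
/// otherwise returns nothing, so allocation is left unconstrained.
fn local_dc_features(allocate_to_local_dc: bool, local_dc: Option<&str>) -> Vec<HostFeature> {
    match (allocate_to_local_dc, local_dc) {
        (true, Some(dc)) if !dc.is_empty() => vec![HostFeature::DC(dc.to_string())],
        _ => Vec::new(),
    }
}
```

For example, `local_dc_features(true, None)` returns an empty list, which corresponds to a local run where the DC cannot be determined.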
pierugo-dfinity approved these changes Jun 11, 2026
blind-oracle approved these changes Jun 11, 2026
nmattia approved these changes Jun 11, 2026
This was referenced Jun 11, 2026
basvandijk added a commit that referenced this pull request Jun 11, 2026
…#10445)

## Root Cause

On PRs where testnet allocation is pinned to the local DC (#10436), `//rs/tests/networking:canister_http_socks_test` fails with every outcall attempt erroring as:

```
direct connect ... ConnectionRefused and connect through socks "... Connector(ConnectError(\"tcp connect error\", ... ConnectionRefused))"
```

The *direct* refusal is intentional (the test injects an egress-reject rule for httpbin first). The real failure is the SOCKS leg: the nested `Connector(ConnectError(...))` means the TCP connection **to the SOCKS proxy itself** (API boundary node port 1080) was refused, on every attempt for the entire run, on **both** API BNs.

A "connection refused" (RST) pins this down precisely:

- The API BN firewall uses `policy drop` with an explicit `accept` for node IPs on port 1080, so a firewall problem would cause *timeouts*, not RSTs.
- An RST means the BN was up, its global IPv6 was configured in the kernel, the packet passed the firewall, and **nothing was listening on :1080**.

`danted.conf` configured the listener as:

```
internal: enp1s0 port = 1080
```

Dante resolves an interface name to its addresses **once at startup** and never re-binds. GuestOS receives its global IPv6 via SLAAC. If `danted.service` starts while `enp1s0` only has its link-local address (router advertisement not yet processed), danted binds the link-local scope only and the global `[...]:1080` endpoint stays closed forever; `Restart=always` never kicks in because danted keeps running happily.

This is confirmed directly by the journald logs of one of the failing API BNs:

```
13:57:58.543 enp1s0: Gained IPv6LL
13:57:58.547 Finished systemd-networkd-wait-online.service - Wait for Network to be Online.
13:57:58.549 Reached target network-online.target - Network is Online.
13:57:58.553 Started danted.service - SOCKS (v4 and v5) proxy daemon (danted).
...
13:57:58.637 danted[1030]: info: Dante/server[1/1] v1.4.4 running
```

`systemd-networkd-wait-online` completed 4 ms after the interface gained only its **link-local** address (`network-online.target` does not guarantee a global SLAAC address), and danted started 6 ms later, binding the link-local address only.

This race got amplified by DC pinning: in the failing run all nine VMs of the testnet (5 replicas, 2 API BNs, 2 UVMs) were packed onto a single host, slowing down boot and RA/SLAAC delivery enough to hit the race on both API BNs at once. The bug is pre-existing on `master`; the DC pinning only widened the window. The same fragility was previously patched around in #4658 (`PartOf=systemd-networkd.service` to restart danted when networkd restarts).

## Fix

Bind the wildcard address instead of an interface name:

```
internal: :: port = 1080
```

A wildcard bind does not depend on address assignment timing: connections to the global address succeed as soon as the address exists. Access to the SOCKS proxy remains restricted through the firewall, which only whitelists node IPs on port 1080 (the config already noted "Allow everyone - this is already restricted through the firewall"). `external: enp1s0` (the outgoing side) is unchanged.

## Verification

- `bazel test --runs_per_test=3 //rs/tests/networking:canister_http_socks_test` passes 3/3 (avg 186 s) on the DC-pinned branch that previously reproduced the failure.
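To make the wildcard-bind argument concrete, here is a small sketch using only the Rust standard library. It is not Dante's code: it uses the IPv4 wildcard and loopback for portability (the fix itself binds the IPv6 wildcard `::`), and the helper names `bind_wildcard` and `can_connect` are hypothetical. It shows that a wildcard listener is reachable through a concrete local address, and that once nothing is listening a connect attempt fails, which is the "connection refused" symptom described above.

```rust
use std::net::{Ipv4Addr, SocketAddr, TcpListener, TcpStream};

/// Binds a listener on the wildcard address with an OS-chosen port.
/// A wildcard bind is not tied to any single interface address.
fn bind_wildcard() -> std::io::Result<TcpListener> {
    TcpListener::bind((Ipv4Addr::UNSPECIFIED, 0))
}

/// Returns true if a TCP connection to `addr` can be established.
fn can_connect(addr: SocketAddr) -> bool {
    TcpStream::connect(addr).is_ok()
}
```

A possible usage: bind with `bind_wildcard()`, read the port from `local_addr()`, and check `can_connect` against `127.0.0.1:<port>` before and after dropping the listener.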
What
Force Farm testnet allocation to the same DC as the GitHub runner executing the test (which is also the DC holding the just-built images), for every system-test. This generalizes the opt-in `.allocate_testnet_to_local_dc()` mechanism introduced in #10122.

Why
#10122 showed that cross-DC transfers of large images (e.g. 2.6G SetupOS images from `dm1` to `zh1` or vice versa) cause download timeouts and flaky tests. The same applies to all system-tests, so testnets should always be allocated in the DC where the test runs and the images live.

How
- Replace the `allocate_testnet_to_local_dc` bool on `SystemTestGroup` (and its builder method) with the `ALLOCATE_TESTNET_TO_LOCAL_DC` environment variable, read by the test driver in `create_group_setup` (accepted values: `1`/`true`/`0`/`false`).
- Set `ALLOCATE_TESTNET_TO_LOCAL_DC=1` unconditionally from the `system_test` macro in `rs/tests/system_tests.bzl`, covering both the plain and the `_colocate` targets. This remains a no-op when the `DC` volatile status variable is unknown (e.g. local runs without `NODE_NAME`).
- Remove the now-redundant `.allocate_testnet_to_local_dc()` calls from the 7 nested system-tests.
- Change `dep_download_url` in `rs/tests/upload_systest_dep.sh` to no longer go via the dc_http_proxy but point directly at the DC-local bazel cache: `https://artifacts.$cluster.dfinity.network/cas/$dep_sha256`.
- Pin the `release-system-tests` job in `release-testing.yml` and the `system-tests-benchmarks-nightly` job to run in runner group `dm1` (the `&dind-large-setup` anchor moved to `setup-guest-os-qualification`, whose jobs keep their current runners).

Notes for reviewers
- … `dm1` where most runners live; watch for allocation failures after rollout. Extra charts have been added to the Farm Dashboard for monitoring dm1 and zh1.
- … `artifacts.<cluster>.dfinity.network` over HTTPS (previously plain http via proxy-global:8080); the redirect server already returns URLs of this form.
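The download URL shape described above can be sketched as a small formatting helper. The real logic lives in the shell script `rs/tests/upload_systest_dep.sh`; this Rust version is only an illustration, `dep_download_url` as a function and its parameter names are hypothetical, and the argument values in the usage note are placeholders.

```rust
/// Builds the direct CAS download URL for a dependency, following the
/// `https://artifacts.$cluster.dfinity.network/cas/$dep_sha256` pattern.
/// Inputs are used verbatim; no validation is performed in this sketch.
fn dep_download_url(cluster: &str, dep_sha256: &str) -> String {
    format!("https://artifacts.{cluster}.dfinity.network/cas/{dep_sha256}")
}
```

With placeholder inputs, `dep_download_url("dm1", "abc123")` yields `https://artifacts.dm1.dfinity.network/cas/abc123`. Whether the cluster name always equals the DC name is an assumption here.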