chore(systests): force testnet allocation to the local DC for all system-tests #10436

Merged
basvandijk merged 4 commits into master from basvandijk/force-testnet-allocation-to-local-dc
Jun 11, 2026

@basvandijk basvandijk commented Jun 11, 2026

What

Force Farm testnet allocation to the same DC as the GitHub runner executing the test (which is also the DC holding the just-built images) — for every system-test. This generalizes the opt-in .allocate_testnet_to_local_dc() mechanism introduced in #10122.

Why

#10122 showed that cross-DC transfers of large images (e.g. 2.6G SetupOS images from dm1 to zh1 or vice versa) cause download timeouts and flaky tests. The same applies to all system-tests, so testnets should always be allocated in the DC where the test runs and the images live.

How

  • Replace the allocate_testnet_to_local_dc bool on SystemTestGroup (and its builder method) with the ALLOCATE_TESTNET_TO_LOCAL_DC environment variable, read by the test driver in create_group_setup (accepted values: 1/true/0/false).
  • Set ALLOCATE_TESTNET_TO_LOCAL_DC=1 unconditionally from the system_test macro in rs/tests/system_tests.bzl, covering both the plain and the _colocate targets. This remains a no-op when the DC volatile status variable is unknown (e.g. local runs without NODE_NAME).
  • Drop the now-redundant .allocate_testnet_to_local_dc() calls from the 7 nested system-tests.
  • Change dep_download_url in rs/tests/upload_systest_dep.sh so that it no longer goes via the dc_http_proxy but points directly at the DC-local bazel cache: https://artifacts.$cluster.dfinity.network/cas/$dep_sha256.
  • Force the release-system-tests job in release-testing.yml and the system-tests-benchmarks-nightly job to run in runner group dm1 (the &dind-large-setup anchor moved to setup-guest-os-qualification, whose jobs keep their current runners).
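
The first two bullets can be sketched as follows. This is a minimal sketch, not the driver's actual code: the helper names `parse_allocate_to_local_dc` and `dc_to_pin` are hypothetical, and treating an unset variable as disabled and any unlisted value as an error are assumptions. Only the four accepted values and the no-op on an unknown DC come from the description above.

```rust
/// Hypothetical helper: interpret the raw value of ALLOCATE_TESTNET_TO_LOCAL_DC.
/// Accepted values per the PR description: 1/true/0/false.
/// Assumption: an unset variable means "disabled"; anything else is rejected.
fn parse_allocate_to_local_dc(raw: Option<&str>) -> Result<bool, String> {
    match raw {
        None => Ok(false),
        Some("1") | Some("true") => Ok(true),
        Some("0") | Some("false") => Ok(false),
        Some(other) => Err(format!(
            "invalid ALLOCATE_TESTNET_TO_LOCAL_DC value: {:?}",
            other
        )),
    }
}

/// Hypothetical helper: pick the DC to pin the testnet to, if any.
/// Returns None (no pinning) when the feature is off or the runner's DC is
/// unknown, e.g. on a local run.
fn dc_to_pin(enabled: bool, runner_dc: Option<&str>) -> Option<String> {
    if enabled {
        runner_dc.map(|dc| dc.to_string())
    } else {
        None
    }
}
```

In a driver the raw value would typically come from `std::env::var("ALLOCATE_TESTNET_TO_LOCAL_DC").ok()`, and a `Some(dc)` result would correspond to adding a DC host feature when creating the Farm group.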

Notes for reviewers

  • Pinning all tests to the runner's DC concentrates Farm load in dm1 where most runners live; watch for allocation failures after rollout. Extra charts have been added to the Farm Dashboard for monitoring dm1 and zh1.
  • Farm hosts/UVMs now fetch deps from artifacts.<cluster>.dfinity.network over HTTPS (previously plain HTTP via proxy-global:8080); the redirect server already returns URLs of this form.
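
For illustration, the new URL shape can be written out as a small function. The real logic lives in the shell script rs/tests/upload_systest_dep.sh; this Rust rendering, the function name, and the example argument values are assumptions.

```rust
/// Hypothetical Rust rendering of the URL shape described in the PR:
/// https://artifacts.$cluster.dfinity.network/cas/$dep_sha256
fn dep_download_url(cluster: &str, dep_sha256: &str) -> String {
    // HTTPS straight to the cluster-local artifacts endpoint, no proxy hop.
    format!(
        "https://artifacts.{}.dfinity.network/cas/{}",
        cluster, dep_sha256
    )
}
```

Calling it with the placeholder values `"dm1"` and `"0123abcd"` yields `https://artifacts.dm1.dfinity.network/cas/0123abcd`. Whether the cluster value always equals the DC name is not stated in the PR.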

…tem-tests

Generalize the opt-in .allocate_testnet_to_local_dc() introduced in #10122
to all system-tests:

* Replace the SystemTestGroup bool and builder method with the
  ALLOCATE_TESTNET_TO_LOCAL_DC environment variable read by the test driver
  when creating the Farm group.
* Set ALLOCATE_TESTNET_TO_LOCAL_DC=1 unconditionally from the system_test
  macro in rs/tests/system_tests.bzl so it applies to every system-test
  (it remains a no-op when the DC volatile status variable is unknown,
  e.g. when running locally).
* Point dep_download_url in rs/tests/upload_systest_dep.sh directly at the
  DC-local bazel cache (https://artifacts.$cluster.dfinity.network/cas/...)
  instead of going through the dc_http_proxy.
* Pin the release-system-tests job in release-testing.yml and the
  system-tests-benchmarks-nightly job to runner group dm1 so their testnets
  and runners are in the same DC.

Copilot AI left a comment

Pull request overview

This PR makes system-tests consistently allocate Farm testnets in the same data center (DC) as the GitHub runner that built the images, reducing flaky failures caused by cross-DC transfers of large artifacts.

Changes:

  • Switch “allocate testnet to local DC” from a SystemTestGroup opt-in flag to an env var (ALLOCATE_TESTNET_TO_LOCAL_DC) read by the test driver.
  • Set ALLOCATE_TESTNET_TO_LOCAL_DC=1 unconditionally from the system_test Bazel macro, and remove redundant .allocate_testnet_to_local_dc() calls from nested tests.
  • Point systest dependency download URLs directly at https://artifacts.$cluster.dfinity.network/cas/$dep_sha256 and pin selected GitHub Actions jobs to runner group dm1.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| rs/tests/upload_systest_dep.sh | Changes returned CAS download URL to go directly to the cluster-local artifacts endpoint over HTTPS. |
| rs/tests/system_tests.bzl | Forces ALLOCATE_TESTNET_TO_LOCAL_DC=1 for all system-test targets (including _colocate). |
| rs/tests/nested/registration.rs | Removes now-redundant .allocate_testnet_to_local_dc() call. |
| rs/tests/nested/nns_recovery/nr_no_bless_fix_like_np.rs | Removes now-redundant .allocate_testnet_to_local_dc() call. |
| rs/tests/nested/nns_recovery/nr_local.rs | Removes now-redundant .allocate_testnet_to_local_dc() call. |
| rs/tests/nested/nns_recovery/nr_broken_dfinity_node.rs | Removes now-redundant .allocate_testnet_to_local_dc() call. |
| rs/tests/nested/nns_recovery/nr_all_broken_seq_np_actions.rs | Removes now-redundant .allocate_testnet_to_local_dc() call. |
| rs/tests/nested/hostos_upgrade.rs | Removes now-redundant .allocate_testnet_to_local_dc() call. |
| rs/tests/nested/guestos_upgrade.rs | Removes now-redundant .allocate_testnet_to_local_dc() call. |
| rs/tests/driver/src/driver/test_env_api.rs | Reads ALLOCATE_TESTNET_TO_LOCAL_DC env var to optionally add HostFeature::DC(...) to Farm group creation. |
| rs/tests/driver/src/driver/group.rs | Removes the SystemTestGroup flag and updates group setup call signature accordingly. |
| .github/workflows/system-tests-benchmarks-nightly.yml | Pins the nightly benchmarks job to runner group dm1. |
| .github/workflows/release-testing.yml | Pins release-system-tests to runner group dm1 and relocates the &dind-large-setup anchor. |

Comment thread rs/tests/driver/src/driver/test_env_api.rs
@basvandijk basvandijk added the CI_ALL_BAZEL_TARGETS (Runs all bazel targets) label Jun 11, 2026
basvandijk added a commit that referenced this pull request Jun 11, 2026
…#10445)

## Root Cause

On PRs where testnet allocation is pinned to the local DC (#10436),
`//rs/tests/networking:canister_http_socks_test` fails with every
outcall attempt erroring as:

```
direct connect ... ConnectionRefused
and connect through socks "... Connector(ConnectError(\"tcp connect error\", ... ConnectionRefused))"
```

The *direct* refusal is intentional (the test injects an egress-reject
rule for httpbin first). The real failure is the SOCKS leg: the nested
`Connector(ConnectError(...))` means the TCP connection **to the SOCKS
proxy itself** — API boundary node port 1080 — was refused, on every
attempt for the entire run, on **both** API BNs.

A "connection refused" (RST) pins this down precisely:

- The API BN firewall uses `policy drop` with an explicit `accept` for
node IPs on port 1080, so a firewall problem would cause *timeouts*, not
RSTs.
- An RST means the BN was up, its global IPv6 was configured in the
kernel, the packet passed the firewall — and **nothing was listening on
:1080**.
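
The distinction drawn above can be expressed as a small triage function. This is an illustrative sketch, not part of the test or of any existing tool: the function name and the returned strings are hypothetical, and it assumes the failure surfaces as a `std::io::ErrorKind`.

```rust
use std::io::ErrorKind;

/// Hypothetical triage of a failed TCP connect, following the reasoning above.
/// Assumption: the firewall silently drops unmatched packets (`policy drop`),
/// so it never answers with an RST itself.
fn triage_connect_error(kind: ErrorKind) -> &'static str {
    match kind {
        // An RST came back: the host is up, the address is configured and
        // the packet passed the firewall, but no socket is bound to the port.
        ErrorKind::ConnectionRefused => "nothing listening on the port",
        // No answer at all: consistent with a firewall dropping the packet
        // or with an unreachable host.
        ErrorKind::TimedOut => "packet dropped or host unreachable",
        _ => "inconclusive",
    }
}
```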

`danted.conf` configured the listener as:

```
internal: enp1s0 port = 1080
```

Dante resolves an interface name to its addresses **once at startup**
and never re-binds. GuestOS receives its global IPv6 via SLAAC. If
`danted.service` starts while `enp1s0` only has its link-local address
(router advertisement not yet processed), danted binds the link-local
scope only and the global `[...]:1080` endpoint stays closed forever —
`Restart=always` never kicks in because danted keeps running happily.

This is confirmed directly by the journald logs of one of the failing
API BNs:

```
13:57:58.543  enp1s0: Gained IPv6LL
13:57:58.547  Finished systemd-networkd-wait-online.service - Wait for Network to be Online.
13:57:58.549  Reached target network-online.target - Network is Online.
13:57:58.553  Started danted.service - SOCKS (v4 and v5) proxy daemon (danted).
...
13:57:58.637  danted[1030]: info: Dante/server[1/1] v1.4.4 running
```

`systemd-networkd-wait-online` completed 4 ms after the interface gained
only its **link-local** address — `network-online.target` does not
guarantee a global SLAAC address — and danted started 6 ms later,
binding the link-local address only.

This race got amplified by DC pinning: in the failing run all nine VMs
of the testnet (5 replicas, 2 API BNs, 2 UVMs) were packed onto a single
host, slowing down boot and RA/SLAAC delivery enough to hit the race on
both API BNs at once. The bug is pre-existing on `master`; the DC
pinning only widened the window. The same fragility was previously
patched around in #4658 (`PartOf=systemd-networkd.service` to restart
danted when networkd restarts).

## Fix

Bind the wildcard address instead of an interface name:

```
internal: :: port = 1080
```

A wildcard bind does not depend on address assignment timing —
connections to the global address succeed as soon as the address exists.
Access to the SOCKS proxy remains restricted through the firewall, which
only whitelists node IPs on port 1080 (the config already noted "Allow
everyone - this is already restricted through the firewall").

`external: enp1s0` (the outgoing side) is unchanged.
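
The difference between the two `internal:` settings can be captured in a toy model. This is a sketch for reasoning about the race, not Dante's implementation: the `Bind` type, the `accepts` function, and the example addresses used below are all hypothetical.

```rust
/// Toy model of which destination addresses a listener serves.
enum Bind {
    /// Addresses resolved from an interface name once, at startup.
    Snapshot(Vec<String>),
    /// Wildcard bind (`::`): whatever addresses the host has right now.
    Wildcard,
}

/// Returns true if a connection to `dst` would reach the listener, given the
/// addresses currently assigned to the host.
fn accepts(bind: &Bind, dst: &str, assigned_now: &[&str]) -> bool {
    // A connection can only arrive at an address the host currently owns.
    if !assigned_now.contains(&dst) {
        return false;
    }
    match bind {
        // Only addresses that already existed at startup are served.
        Bind::Snapshot(addrs) => addrs.iter().any(|a| a.as_str() == dst),
        // Any currently assigned address is served.
        Bind::Wildcard => true,
    }
}
```

With a snapshot taken while only a link-local address such as `fe80::1` existed, a later connection to a global address such as `2001:db8::1` is not served even after SLAAC has assigned it, whereas the wildcard variant serves it as soon as the address is assigned.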

## Verification

- `bazel test --runs_per_test=3
//rs/tests/networking:canister_http_socks_test` passes 3/3 (avg 186 s)
on the DC-pinned branch that previously reproduced the failure.
@basvandijk basvandijk added this pull request to the merge queue Jun 11, 2026
Merged via the queue into master with commit 557d727 Jun 11, 2026
37 checks passed
@basvandijk basvandijk deleted the basvandijk/force-testnet-allocation-to-local-dc branch June 11, 2026 19:21

5 participants