chore(deps): update nvidia-dcgm (patch) #8659

Open
renovate[bot] wants to merge 1 commit into main from renovate/patch-nvidia-dcgm

Conversation

@renovate renovate Bot commented Jun 8, 2026

Contributor

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| dcgm-exporter | patch | 4.8.2-1.azl3 → 4.8.2-3.azl3 |
| dcgm-exporter | patch | 4.8.2-ubuntu24.04u1 → 4.8.2-ubuntu24.04u3 |
| dcgm-exporter | patch | 4.8.2-ubuntu22.04u1 → 4.8.2-ubuntu22.04u3 |

Warning

Some dependencies could not be looked up. Check the Dependency Dashboard for more information.


Configuration

📅 Schedule: (UTC)

  • Branch creation
    • At any time (no schedule defined)
  • Automerge
    • At any time (no schedule defined)

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about these updates again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

Copilot AI review requested due to automatic review settings June 8, 2026 14:54
@renovate renovate Bot added the renovate label ("This pull request was created by renovate") Jun 8, 2026
@renovate renovate Bot assigned djsly Jun 8, 2026

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@renovate renovate Bot requested review from djsly, ganeshkumarashok and surajssd June 8, 2026 14:54
@github-actions github-actions Bot added the components label ("This pull request updates cached components on Linux or Windows VHDs") Jun 8, 2026
@renovate renovate Bot changed the title from "chore(deps): update nvidia-dcgm to v4.8.2-ubuntu22.04u2" to "chore(deps): update nvidia-dcgm (patch)" Jun 8, 2026
@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 1c34d5c to 51e8bdd Compare June 8, 2026 15:14
djsly commented Jun 8, 2026

Collaborator

AgentBaker Linux PR gate — E2E failure (mixed: 3 leaves shared infra; 1 ACL leaf likely on main)

  • Run: 167166471 (failed)
  • Failed: Run AgentBaker E2E → AzureCLI exit 1 (DONE 457 tests, 95 skipped, 5 failures in 1646.77s)

Group A — shared infra/test-fixture issue, NOT this PR (3 leaves):

  • Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated/{default (6.57s), scriptless_nbc (0.00s)} at test_helpers.go:227 🔴 empty error, plus the parent container.
  • Same sub-7s empty-error shape has now hit 5 unrelated PRs in 48h (this PR, #8600, #8330, #8654, #8653). Confirmed systemic — needs NodeSIG-dev / E2E-infra triage of the ImagePullIdentityBinding_NetworkIsolated private-cluster/ACR-private-endpoint precondition.

Group B — ACL FIPS TL leaf, very likely existing main regression (2 leaves):

  • Test_ACLGen2FIPSTL/scriptless_nbc (265.83s) — validation.go:345 🔴: wireserver check "wireserver port 80 goalstate": unexpected curl exit code "0" (want 28 timeout or 7 refused) (plus root container).
  • The test expects WireServer port 80 to be blocked (curl exit 28=timeout or 7=refused) but got 0 (HTTP 200 reachable). That's an ACL FIPS TL firewall/network policy assertion. This PR (nvidia-dcgm patch bump in parts/common/components.json) touches GPU package versions only and has no path to ACL networking. Strongly suggests an existing ACL FIPS TL regression on main, not caused by this PR.
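
The pass/fail rule described in Group B can be sketched as follows. This is a minimal illustration, not the code at validation.go:345: the function and constant names are hypothetical. Only the exit-code meanings (7 = connection refused, 28 = timeout, as documented for curl) and the error text come from the log above.

```python
from typing import Optional

# curl exit codes that indicate the endpoint is NOT reachable.
CURL_CONNECTION_REFUSED = 7
CURL_TIMEOUT = 28
BLOCKED_EXIT_CODES = {CURL_CONNECTION_REFUSED, CURL_TIMEOUT}


def check_wireserver_blocked(curl_exit_code: int) -> Optional[str]:
    """Return None if WireServer looks blocked, else an error message.

    Hypothetical helper: exit code 0 means the request succeeded, which
    is exactly the failure reported for Test_ACLGen2FIPSTL.
    """
    if curl_exit_code in BLOCKED_EXIT_CODES:
        return None
    return (
        f'unexpected curl exit code "{curl_exit_code}" '
        "(want 28 timeout or 7 refused)"
    )
```

Under this rule an exit code of 0 is a failure even though curl itself succeeded, which is why a reachable WireServer fails the test.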

Confidence: HIGH that this PR is not the cause of either failure group.

Recommended next action:

  1. Rerun the failing job; do not block this PR on Group A.
  2. NodeSIG-dev: file a tracker on the ACL FIPS TL wireserver-block regression in validation.go:345 (the test expectation flipped, or the ACL network policy unit shipped in the latest VHD no longer blocks WireServer); investigate against main head independently of any specific PR.
  3. NodeSIG-dev / E2E-infra: triage the ImagePullIdentityBinding_NetworkIsolated fixture (sub-7s empty failures across multiple unrelated PRs).

Strongest alternative (less likely): transient ACR-private-endpoint outage for Group A + intermittent ACL firewall rule timing for Group B — refuted because each pattern is now reproducing deterministically on every recent PR build.

Posted by Clawpilot AgentBaker gate detective.

@surajssd surajssd left a comment

Member

Do not merge until Renovate also adds support for Azure Linux. Once #8660 is merged, I won't have to say manually that this should not be merged.

@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 51e8bdd to f4fe0e0 Compare June 9, 2026 07:03
Copilot AI review requested due to automatic review settings June 9, 2026 07:03

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from f4fe0e0 to 6310c4d Compare June 10, 2026 03:01
Copilot AI review requested due to automatic review settings June 10, 2026 06:02
@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 6310c4d to d445cbf Compare June 10, 2026 06:02

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@aks-node-assistant

Contributor

AgentBaker Linux PR gate — 236-failure run: shared cluster fleet outage continues (test-infra, NOT this PR)

  • Run: 167422694 (failed)
  • Failed task: Run AgentBaker E2E (full 60-minute timeout consumed)
  • Test summary: DONE 402 tests, 95 skipped, 236 failures in ~3616s (~59% failure rate; 0 fwupd hits)

Same shared cluster fleet outage affecting every concurrent PR in this window: 123× get or create cluster: failed to wait for cluster abe2e-kubenet-v5-150ee to be ready: context deadline exceeded plus 36× ResourceGroupDeletionBlocked on shared MC RGs. Earlier overnight runs hit ~11 min; current runs consume the full 60-min E2E timeout, indicating the fleet is worse, not recovering.

Cross-PR pattern this morning: PR #8652 build 167419663, PR #8679 build 167421198, PR #8294 build 167422687, and concurrent PRs all hit identical 236-fail / cluster-not-ready signature.

Build-vs-test: test-infra (shared cluster fleet outage), NOT product, NOT PR-caused.
This PR's exposure check: nvidia-dcgm renovate patch bump (GPU monitoring). No path to shared test cluster lifecycle.
Confidence: HIGH that PR #8659 is not the cause.

Recommended next action / owner: ⚠️ E2E infra / NodeSIG-dev — urgent shared cluster fleet restoration required (abe2e-kubenet-v5-*, abe2e-latest-kubernetes-version-v2-*, abe2e-azure-networkisolated-v2-*, abe2e-azure-v4-*, abe2e-azure-bootstrapprofile-cache-v2-*); clear ResourceGroupDeletionBlocked locks. PR gate is effectively offline until restored. PR author: rerun once fleet recovers.

Posted by Clawpilot AgentBaker gate detective.

@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from d445cbf to 313d630 Compare June 10, 2026 15:51
@aks-node-assistant

Contributor

AgentBaker Linux PR gate — 3 distinct E2E failures, all test-infra / shared-cluster (NOT this PR)

  • Run: 167493131
  • Failed job: Run AgentBaker E2E (all VHD builds passed)
  • Failed scenarios: Test_AzureLinux_Skip_Binary_Cleanup/{default,scriptless_nbc}, Test_AzureLinuxV3_CustomSysctls/default, Test_Ubuntu2204_HTTPSProxy_PrivateDNS/{default,scriptless_nbc} (5 subtests across 3 scenarios)

Detective summary — two independent signatures

(1) wireserver-blocking-validator-assertion — both AzureLinux scenarios on shared cluster abe2e-kubenet-v5-150ee:

🔴 FAIL: wireserver check "wireserver port 80 goalstate":
        unexpected curl exit code "0" (want 28 timeout or 7 refused)

The validator asserts that the node's iptables rule blocks egress to 168.63.129.16:80. iptables shows the DROP rule is present (DROP ... 168.63.129.16 tcp dpt:80), but the test's curl still returns exit 0 — the rule isn't taking effect for the test's connection (likely matched against a stale conntrack/TIME_WAIT entry; logs show pre-existing 168.63.129.16 flows in conntrack). This is the same signature as build 167348372; second occurrence. Wiki: wireserver-blocking-validator-assertion.

(2) httpsproxy-fixture-proxy-unreachable — both HTTPSProxy_PrivateDNS subtests on shared cluster abe2e-azure-network-v4-ce2ad:

VMExtensionProvisioningError ... vmssCSE exit 99
W: Failed to fetch https://packages.microsoft.com/ubuntu/22.04/prod/dists/jammy/InRelease
   Could not connect to 10.14.0.193:8888 (10.14.0.193). - connect (111: Connection refused)

The CSE retries apt-get update 10 times against the scenario's HTTP proxy at 10.14.0.193:8888; the proxy endpoint refuses every attempt and CSE exits 99. The proxy is part of the test fixture (private DNS / HTTPS proxy scenario infra), not anything this PR touches. New signature.
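
The retry behavior described above can be sketched as a bounded loop. This is an illustrative sketch only, not the CSE script itself: `run_with_retries` and `EXIT_ON_EXHAUSTION` are hypothetical names, and the real CSE may sleep or back off between attempts, which is not modeled here. The attempt count (10) and the final exit code (99) are taken from the comment.

```python
from typing import Callable

# Exit code the CSE reported after all attempts failed (from the log above).
EXIT_ON_EXHAUSTION = 99


def run_with_retries(step: Callable[[], int], attempts: int = 10) -> int:
    """Run `step` until it returns 0 or `attempts` tries are used up.

    `step` stands in for one `apt-get update` invocation and returns
    its exit code. No delay between attempts is modeled.
    """
    for _ in range(attempts):
        if step() == 0:
            return 0
    return EXIT_ON_EXHAUSTION
```

With a proxy that refuses every connection, each attempt fails the same way, so the loop uses all 10 tries and returns 99.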

Classification: Test infrastructure / shared-cluster issues. Neither failure is reachable from changes in PR #8659 (renovate nvidia-dcgm patch — does not touch wireserver/iptables, CSE, apt sources, or the proxy fixture).

Confidence: High for both (multiple subtests, identical signatures, no PR-relevant changed files, all VHD builds passed).

Strongest alternative theory: Recent change to aks-node CSE / iptables wiring that lets the wireserver block "miss" — less likely because the rule is present and counters match expected DROP behavior in the chain dump; the leak is at the conntrack/TIME_WAIT layer pre-dating the validator's curl. For the proxy: a transient ARM/MMS issue affecting 10.14.0.193. Less likely than fixture-side because the proxy is the only target refusing connections; the cluster, the VM, the AKS extension framework, and the AKS managed runtime all succeeded.

Recommended next action / owner:

  • Wireserver-blocking validator: SIG Node Lifecycle test-code owner — make the validator tolerate pre-existing conntrack entries (curl with --local-port / fresh tuple), or flush conntrack for 168.63.129.16 before the curl probe. Pattern is now seen on two distinct PR runs and one cluster fixture.
  • HTTPSProxy fixture: AgentBaker E2E test-infra — the proxy mirror behind 10.14.0.193:8888 on shared cluster abe2e-azure-network-v4-ce2ad is unreachable; check the proxy pod/daemon health on that fixture.
  • No PR change required. Recommend rerun of the failed leg only.
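
The two mitigations suggested for the wireserver validator can be sketched as command builders. This is a hedged sketch, not the validator's code: the function names, the local port, the timeout, and the probe URL path are assumptions. The flags themselves (`conntrack -D -d`, `curl --local-port`, `curl --max-time`) exist in those tools. Whether a fresh source port actually avoids the stale entry is the comment's hypothesis and is not verified here.

```python
from typing import List

WIRESERVER_IP = "168.63.129.16"


def flush_conntrack_cmd(ip: str = WIRESERVER_IP) -> List[str]:
    """Build a command that deletes conntrack entries destined for `ip`."""
    # -D deletes matching entries; -d filters on the original destination.
    return ["conntrack", "-D", "-d", ip]


def probe_cmd(
    ip: str = WIRESERVER_IP,
    port: int = 80,
    local_port: int = 40000,  # assumed value; any otherwise unused port
    timeout_s: int = 5,       # assumed value
) -> List[str]:
    """Build a curl probe that uses a chosen source port (fresh tuple)."""
    return [
        "curl", "-sS", "-o", "/dev/null",
        "--max-time", str(timeout_s),
        "--local-port", str(local_port),
        f"http://{ip}:{port}/",
    ]
```

The builders only return argument lists, so they can be inspected or logged before anything is executed on a node.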

Evidence used: failed task log (5 === FAIL markers across 3 distinct scenarios on 2 distinct shared clusters), all VHD builds succeeded, no changed file in PR #8659 touches wireserver/CSE/proxy code.

Copilot AI review requested due to automatic review settings June 11, 2026 15:28
@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 313d630 to 811d042 Compare June 11, 2026 15:28

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 811d042 to 6c944c4 Compare June 11, 2026 16:03
@aks-node-assistant

Contributor

🔍 AgentBaker Linux Gate Detective — build 167657744

Failed build: 167657744 (PR #8659, commit 5853d4a)
Failed job: Stage e2e → Run AgentBaker E2E (exit code 1)

Detective summary (3 distinct failure clusters):

  1. 🔴 PR-caused / deterministic — Test_Version_Consistency_GPU_Managed_Components
    Partial OS update detected for dcgm-exporter: rebuild revision "1" (ubuntu.r2204) does not match "3" from the first OS variant. Renovate likely updated one OS but not the others — align ALL OS entries in components.json for this package (or revert the partial bump).
    This is the gating failure for this PR — the renovate nvidia-dcgm patch bump is incomplete. Action: PR author needs to align all OS variants of dcgm-exporter in parts/common/components.json (match rebuild revision across ubuntu.r2204, ubuntu.r2404, azurelinux.r3, etc.), or revert the partial bump.

  2. 🟡 Test-infra (recurring) — networkisolated-apiserver-fqdn-nxdomain
    All 8 Test_*_NetworkIsolated* scenarios (ACL / ArtifactStreaming / Package_Install / NonAnonymousACR, default + scriptless_nbc, Ubuntu2204 + AzureLinuxV3) fail at CSE with VALIDATION_ERR=52: nslookup of abe2e-azure-networkisolated-v3-kq4wzvpl.hcp.westus3.azmk8s.io against 169.254.10.10 returns NXDOMAIN for ~300s. Same shared NetworkIsolated cluster as the first occurrence (build 167653228). Not PR-caused.

  3. 🟡 Test-infra (recurring) — wireserver-blocking-validator-assertion
    Test_Ubuntu2204_ChronyRestarts_Taints_And_Tolerations/default, Test_AzureLinuxV3_KubeletCustomConfig/scriptless_nbc, and Test_AzureLinuxV3_CSE_CachedPerformance/default all fail at validating wireserver is blocked from unprivileged pods despite an iptables DROP rule, with TIME_WAIT conntrack leakage to 168.63.129.16 on shared cluster abe2e-kubenet-v5-150ee. Not PR-caused.
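
The partial-update check behind failure cluster 1 can be sketched as below. This is a minimal sketch, not the actual Test_Version_Consistency_GPU_Managed_Components implementation. The version formats are inferred from the update table in this PR, the OS variant keys are the ones named in the comment, and the function names and the "first variant wins" ordering are assumptions.

```python
import re
from typing import Dict, Optional

# Assumed version shapes, inferred from this PR's update table:
#   Azure Linux: 4.8.2-3.azl3         -> rebuild revision "3"
#   Ubuntu:      4.8.2-ubuntu22.04u3  -> rebuild revision "3"
_AZL = re.compile(r"^[\d.]+-(\d+)\.azl\d+$")
_UBUNTU = re.compile(r"^[\d.]+-ubuntu\d+\.\d+u(\d+)$")


def rebuild_revision(version: str) -> str:
    """Extract the rebuild revision from a package version string."""
    for pattern in (_AZL, _UBUNTU):
        match = pattern.match(version)
        if match:
            return match.group(1)
    raise ValueError(f"unrecognized version format: {version!r}")


def find_partial_update(versions: Dict[str, str]) -> Optional[str]:
    """Return an error if any OS variant's revision differs from the first."""
    first_rev: Optional[str] = None
    for variant, version in versions.items():
        rev = rebuild_revision(version)
        if first_rev is None:
            first_rev = rev
        elif rev != first_rev:
            return (
                f'rebuild revision "{rev}" ({variant}) does not match '
                f'"{first_rev}" from the first OS variant'
            )
    return None
```

For example, passing `{"azurelinux.r3": "4.8.2-3.azl3", "ubuntu.r2404": "4.8.2-ubuntu24.04u3", "ubuntu.r2204": "4.8.2-ubuntu22.04u1"}` flags `ubuntu.r2204`, and aligning that entry to `u3` clears the check.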

Build vs test class: Test job failure (no VHD build issue contributed). Three independent clusters above.
Flaky vs deterministic: Cluster #1 is deterministic and PR-caused; clusters #2 and #3 are recurring test-infra flakiness (already tracked).
Strongest alternative considered: Could the dcgm-exporter version-consistency miss be caused by a stale local cache rather than the PR? Rejected — the failure message names the exact OS variant whose rebuild revision diverges (ubuntu.r2204 = "1" vs first variant "3"), and the PR is itself the renovate nvidia-dcgm patch bump (6c944c4). The check is doing exactly its job.

Confidence: High (cluster #1: deterministic, single-line diagnostic naming the OS variant + the PR is the renovate bump itself; clusters #2 and #3: matched character-for-character to existing wiki signatures on the same shared clusters).

Wiki signatures:

  • Primary (new — proposed): dcgm-exporter-partial-os-update
  • Reuse: networkisolated-apiserver-fqdn-nxdomain
  • Reuse: wireserver-blocking-validator-assertion

Recommended next action / owner: PR author of #8659 — fix the partial dcgm-exporter bump (align all OS rebuild revisions in parts/common/components.json, or drop the partial bump and let renovate regenerate a complete one). The two infra signatures are AgentBaker E2E test-infra's to chase and do not need to gate this PR once #1 is fixed.

Posted by Clawpilot AgentBaker Linux Gate Detective Watcher

Copilot AI review requested due to automatic review settings June 11, 2026 20:37
@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 6c944c4 to 4feda1e Compare June 11, 2026 20:37

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@aks-node-assistant

Contributor

🕵️ AgentBaker Linux Gate Detective — Build 167700888

Failed job: Run AgentBaker E2E (Stage e2e)
Signature: kubelet-exec-proxy-502 (Test infrastructure / shared-cluster transient)
Class: Test-infra flake (not PR-caused) • Confidence: High

Surface failure
Test_Ubuntu2404Gen2/default failed at scenario_test.go:2327 while running containerd config dump via apiserver exec proxy:

proxy error from localhost:9443 while dialing <node>:10250, code 502: 502 Bad Gateway

Corroboration

  • Node created (3m1s) and Ready (193ms); test pod Ready in 1m54s — node + pod healthy.
  • Earlier validators in the same test passed: waagent.log (no ExtHandler errors), node-exporter (port 19100), localdns-exporter metrics.
  • Only the apiserver→kubelet /exec streaming hop returned 502 — matches the wiki signature exactly.

Strongest alternative considered & rejected
PR-caused regression from the nvidia-dcgm bump: rejected — Test_Ubuntu2404Gen2/default is GPU:false (dcgm-exporter not exercised), node bootstrap + all on-node validators succeeded, and the failure is in the cluster-side proxy network path, not node provisioning.

Comparison to prior PR revision
PR #8659 build 167657744 hit dcgm-exporter-partial-os-update. This new revision shows a different failure pattern — the dcgm signature did not recur; this run hit an unrelated transient infra flake.

Recommended action: Re-run the failed job. No PR change required.


Posted by Clawpilot AgentBaker Linux Gate Detective Watcher.

@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 4feda1e to 8f219dc Compare June 12, 2026 00:08
Copilot AI review requested due to automatic review settings June 12, 2026 01:04
@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 8f219dc to 4c111cc Compare June 12, 2026 01:04

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Labels

  • components: This pull request updates cached components on Linux or Windows VHDs
  • renovate: This pull request was created by renovate

4 participants