fix(aks): resolve latest patch and refresh node images on upgrade (AROSLSRE-1165) by raelga · Pull Request #5627 · Azure/ARO-HCP

raelga · 2026-06-12T10:15:33Z

Jira: AROSLSRE-1165 (parent: AROSLSRE-1164)

What

dev-infrastructure/scripts/upgrade-aks-cluster.sh now fully owns AKS updates from the rollout pipeline:

Kubernetes version — resolves a configured minor version (e.g. 1.35) to the highest available patch for that minor before deciding whether to upgrade, instead of treating the minor as a version floor. A pinned full patch X.Y.Z is honored verbatim. The control plane + node pools upgrade only when behind.
Node OS image — after the (optional) version upgrade, each node pool's nodeImageVersion is compared against latestNodeImageVersion; if any pool is stale, an az aks upgrade --node-image-only pass refreshes node images. This replaces the AKS-managed nodeOSUpgradeChannel removed in feat(aks): disable AKS-managed automatic cluster upgrades #5628.
If no X.Y.Z patch is available yet, or get-upgrades fails, the script falls back to the previous behavior (pass the minor, let AKS choose) without erroring.
Already-current clusters short-circuit with "no upgrade needed".
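
The version resolution described above can be sketched as a pure function. This is a minimal illustration, not the script's actual code: the argument order is hypothetical, and the AVAILABLE arguments stand in for whatever az aks get-upgrades returns for the cluster.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical signature: resolve_target_version REQUESTED CURRENT [AVAILABLE...]
# AVAILABLE stands in for the versions reported by `az aks get-upgrades`.
resolve_target_version() {
    local requested="$1" current="$2"
    shift 2

    # A pinned full patch (X.Y.Z) is honored verbatim.
    if [[ "${requested}" =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
        echo "${requested}"
        return 0
    fi

    # Fold the current version into the candidates, keep only patches of the
    # requested minor, and take the highest one. `|| true` keeps a no-match
    # grep from aborting the script under pipefail.
    local best
    best="$(printf '%s\n' "${current}" "$@" \
        | grep -E "^${requested//./\\.}\.[0-9]+$" \
        | sort -V \
        | tail -n 1 || true)"

    # No patch for that minor yet: fall back to the requested minor.
    echo "${best:-${requested}}"
}

resolve_target_version 1.35 1.35.2 1.35.3 1.35.5 1.36.0   # prints 1.35.5
```

Taking the requested minor, the current version, and the candidate list as plain arguments keeps the logic testable without a live cluster.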

Why

Cluster updates are moving from AKS's unattended weekend auto-upgrade into the rollout pipeline. Running disruptive operations unattended over the weekend risks broken clusters on Monday. Driving updates through rollouts is safer: rollouts are monitored, gated by E2E tests that validate the change, and progress through Dev (always latest) → INT → Stage → Prod.

Two concrete gaps this closes:

KUBERNETES_VERSION comes from svc.aks.kubernetesVersion / mgmt.aks.kubernetesVersion, normally a minor like 1.35. The old sort -V comparison treated it as a floor, so once a cluster reached any 1.35.x it reported "no upgrade needed" and never picked up newer 1.35.z patch fixes.
With nodeOSUpgradeChannel removed (feat(aks): disable AKS-managed automatic cluster upgrades #5628), node OS image patches now need an explicit pipeline pass — added here.
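
To illustrate the first gap, here is a hypothetical reconstruction of a floor-style check built on sort -V. The old script's exact code is not shown in this PR, so the function name is_at_least and its shape are assumptions; the point is only that a bare minor sorts below every patch of that minor.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical floor check: succeeds when CURRENT is at or above REQUESTED.
is_at_least() {
    local current="$1" requested="$2"
    # sort -V lists the lower version first; if that is REQUESTED, the
    # cluster is treated as already satisfying it.
    [ "$(printf '%s\n' "${requested}" "${current}" | sort -V | head -n 1)" = "${requested}" ]
}

# A cluster on 1.35.2 satisfies the floor "1.35", so newer 1.35.z patches
# are never considered.
if is_at_least 1.35.2 1.35; then
    echo "no upgrade needed"
fi
```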

Testing

bash -n syntax check passes.
Unit-tested the upgrade flow with a mocked az under set -euo pipefail across cases: behind-minor, on-minor-with-newer-patch, already-latest (no-op), pinned full patch (verbatim), empty get-upgrades (fallback), and node-image-only (version current, stale node image).

req=1.35   cur=1.34.5 avail=[1.35.1,1.35.3]         -> 1.35.3
req=1.35   cur=1.35.2 avail=[1.35.3,1.35.5,1.36.0]  -> 1.35.5
req=1.35   cur=1.35.5 avail=[1.36.0,1.36.1]         -> 1.35.5
req=1.32.5 cur=1.32.1 avail=[1.99.9]                -> 1.32.5
req=1.35   cur=1.34.0 avail=[]                       -> 1.35
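
One way to mock az for cases like these is to define a shell function with the same name, which shadows the real CLI in that shell. The sketch below is an assumption about how such a test could look, not the harness used for this PR: MOCK_AVAIL, MOCK_FAIL, available_versions, and the example resource names are hypothetical, and the --query path assumes get-upgrades reports versions under controlPlaneProfile.upgrades.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholders so the sketch runs standalone.
RESOURCE_GROUP="${RESOURCE_GROUP:-example-rg}"
CLUSTER_NAME="${CLUSTER_NAME:-example-cluster}"

# Mock: a function named `az` shadows the real CLI. MOCK_AVAIL and MOCK_FAIL
# are hypothetical knobs that select the test case.
az() {
    if [ "${MOCK_FAIL:-false}" = "true" ]; then
        echo "ERROR: simulated get-upgrades failure" >&2
        return 1
    fi
    # Emit one version per line, as a tsv list query is assumed to do.
    local v
    for v in ${MOCK_AVAIL:-}; do
        echo "${v}"
    done
}

# Prints the available upgrade versions, or nothing if the lookup fails.
available_versions() {
    az aks get-upgrades \
        --resource-group "${RESOURCE_GROUP}" \
        --name "${CLUSTER_NAME}" \
        --query 'controlPlaneProfile.upgrades[].kubernetesVersion' \
        --output tsv 2>/dev/null || true
}

MOCK_AVAIL="1.35.1 1.35.3"
available_versions          # prints the two mocked versions, one per line

MOCK_FAIL=true
available_versions          # prints nothing and still returns 0
```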

Special notes for your reviewer

az aks get-upgrades is cluster-scoped (no region argument needed) and only returns forward upgrade targets, so the current version is folded in before taking the max to correctly handle the already-latest case.
A Kubernetes version upgrade already reimages nodes to the latest image for that version, so the node-image pass is a no-op immediately after one; it matters for clusters that are already on the target version but have a stale node image.
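
The node-image comparison can be sketched as below. This is an illustration under assumptions: stale_node_pools is a hypothetical helper, any difference between the running and latest image is treated as stale, and the input lines stand in for values that could be gathered per pool with az aks nodepool show (nodeImageVersion) and az aks nodepool get-upgrades (latestNodeImageVersion). The image names are placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Reads "POOL RUNNING_IMAGE LATEST_IMAGE" lines on stdin and prints the
# names of pools whose running image differs from the latest one.
stale_node_pools() {
    local pool running latest
    while read -r pool running latest; do
        [ -n "${pool}" ] || continue
        if [ "${running}" != "${latest}" ]; then
            echo "${pool}"
        fi
    done
}

# Placeholder data: "system" is behind, "user" is current.
NODE_IMAGE_UPGRADE_NEEDED=false
if [ -n "$(printf 'system image-a image-b\nuser image-b image-b\n' | stale_node_pools)" ]; then
    NODE_IMAGE_UPGRADE_NEEDED=true
fi
echo "NODE_IMAGE_UPGRADE_NEEDED=${NODE_IMAGE_UPGRADE_NEEDED}"   # prints NODE_IMAGE_UPGRADE_NEEDED=true
```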

PR Checklist

…SLSRE-1162) upgrade-aks-cluster.sh treated the configured minor (e.g. 1.35) as a floor, so clusters never picked up newer 1.35.z patches once they were at any 1.35.x. Resolve a minor to the highest patch available for that minor (via az aks get-upgrades plus the current version) and upgrade to that concrete patch. Pinned full patch versions (X.Y.Z) are honored verbatim, and the script falls back to the previous behavior when no patch is available or the lookup fails.

openshift-ci · 2026-06-12T10:15:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: raelga

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~dev-infrastructure/OWNERS~~ [raelga]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

This PR updates the AKS upgrade script to treat a configured Kubernetes minor version (e.g., 1.35) as “upgrade to the latest available patch for that minor,” by resolving X.Y to a concrete X.Y.Z before comparing/upgrading. This improves patch hygiene for long-lived clusters that previously stopped upgrading once they reached any X.Y.*.

Changes:

Add resolve_target_version to resolve X.Y to the highest available X.Y.Z patch (while honoring pinned X.Y.Z verbatim).
Use the resolved target version for control plane and node pool upgrade decisions and the az aks upgrade --kubernetes-version argument.
Improve logging to show both requested and resolved target versions.

The grep in resolve_target_version exits 1 when no patch matches the requested minor; with pipefail that aborted the script before the empty-result fallback could return the requested minor. Append || true so the no-match case yields an empty string and falls back as intended.
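
The failure mode is easy to reproduce in isolation. The sketch below uses two hypothetical helper functions with a hard-coded 1.35 pattern; it is not the script's code, only a demonstration of how pipefail surfaces grep's no-match status and how || true suppresses it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Unguarded: grep exits 1 when nothing matches, and pipefail makes that the
# status of the whole pipeline.
unguarded() {
    printf '%s\n' "$@" | grep -E '^1\.35\.[0-9]+$' | sort -V | tail -n 1
}

# Guarded: `|| true` turns the no-match case into empty output with status 0.
guarded() {
    printf '%s\n' "$@" | grep -E '^1\.35\.[0-9]+$' | sort -V | tail -n 1 || true
}

# Calling unguarded inside `if` keeps `set -e` from aborting this demo.
if unguarded 1.36.0 1.36.1 >/dev/null; then
    echo "unguarded: ok"
else
    echo "unguarded: failed with no match"
fi

patch="$(guarded 1.36.0 1.36.1)"
echo "guarded: falls back to ${patch:-1.35}"
```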

…adeChannel) Drive node OS image (security patch) upgrades from the rollout instead of the AKS-managed nodeOSUpgradeChannel that ran unattended on a weekend maintenance window. After the optional Kubernetes version upgrade, compare each node pool's running image against the latest available and run a node-image-only upgrade when a newer image exists. The previous early exit when no version upgrade was needed is removed so the node image check always runs. Jira: AROSLSRE-1162

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

raelga · 2026-06-12T11:22:05Z

+# This replaces the AKS-managed nodeOSUpgradeChannel/auto-upgrade that used to
+# pull the latest node image (security patches) on a maintenance schedule. We
+# now drive it from the pipeline: for each node pool, compare the running node
+# image against the latest available one and run a node-image-only upgrade when
+# a newer image exists. A Kubernetes version upgrade above already reimages
+# nodes to the latest image for the target version, so this is typically a
+# no-op right after one and only does work when the image alone is stale.


@copilot Good catch — reworded in e683a4f. The comment now states the node-image pass is the pipeline-driven replacement for nodeOSUpgradeChannel, which is disabled (None) in the companion change to dev-infrastructure/modules/aks-cluster-base.bicep (PR #5628). The two are coupled sub-tasks under AROSLSRE-1164, so auto-upgrade is removed there while this PR adds the on-demand reimage.

raelga · 2026-06-12T11:22:07Z

+if [ "${NODE_IMAGE_UPGRADE_NEEDED}" = "true" ]; then
+    echo "Upgrading node images for cluster '${CLUSTER_NAME}' in RG '${RESOURCE_GROUP}'..."
+
+    az aks upgrade \
+        --resource-group "${RESOURCE_GROUP}" \
+        --name "${CLUSTER_NAME}" \
+        --node-image-only \
+        --yes


@copilot The PR title, description, and linked Jira (AROSLSRE-1165, parent AROSLSRE-1164) have been updated to explicitly cover the node-image upgrade behavior and its rollout rationale (monitored, E2E-gated, Dev→INT→Stage→Prod). Kept in this PR rather than split out because it directly replaces the nodeOSUpgradeChannel disabled in the paired PR #5628 — splitting would leave a window with no node-OS patch coverage.

Copilot AI review requested due to automatic review settings June 12, 2026 10:15

openshift-ci Bot requested review from stevekuznetsov and weherdh June 12, 2026 10:15

openshift-ci Bot added the approved label Jun 12, 2026

Copilot started reviewing on behalf of raelga June 12, 2026 10:16

Copilot AI reviewed Jun 12, 2026

Comment thread dev-infrastructure/scripts/upgrade-aks-cluster.sh Outdated

raelga mentioned this pull request Jun 12, 2026

feat(aks): disable AKS-managed automatic cluster upgrades #5628

Open

3 tasks

Copilot AI review requested due to automatic review settings June 12, 2026 11:05

Copilot started reviewing on behalf of raelga June 12, 2026 11:05

raelga changed the title ~~fix(aks): upgrade to latest available patch for configured minor (AROSLSRE-1162)~~ fix(aks): resolve latest patch and refresh node images on upgrade (AROSLSRE-1165) Jun 12, 2026

Copilot AI reviewed Jun 12, 2026

docs(aks): clarify node-image upgrade pairs with disabling auto-upgra…

e683a4f

…de in Azure#5628

raelga wants to merge 4 commits into Azure:main from raelga:raelga/aroslsre-upgrade-aks-latest-patch

2 participants
