fix(aks): resolve latest patch and refresh node images on upgrade (AROSLSRE-1165) #5627

Open
raelga wants to merge 4 commits into Azure:main from raelga:raelga/aroslsre-upgrade-aks-latest-patch

raelga (Collaborator) commented Jun 12, 2026

Jira: AROSLSRE-1165 (parent: AROSLSRE-1164)

What

dev-infrastructure/scripts/upgrade-aks-cluster.sh now fully owns AKS updates from the rollout pipeline:

  • Kubernetes version: resolves a configured minor version (e.g. 1.35) to the highest available patch for that minor before deciding whether to upgrade, instead of treating the minor as a version floor. A pinned full patch X.Y.Z is honored verbatim. The control plane and node pools are upgraded only when they are behind the resolved version.
  • Node OS image: after the (optional) version upgrade, each node pool's nodeImageVersion is compared against latestNodeImageVersion; if any pool is stale, an az aks upgrade --node-image-only pass refreshes node images. This replaces the AKS-managed nodeOSUpgradeChannel removed in feat(aks): disable AKS-managed automatic cluster upgrades #5628.
  • Fallback: if no X.Y.z patch is available yet, or get-upgrades fails, the script falls back to the previous behavior (pass the minor, let AKS choose) without erroring.
  • Already-current clusters short-circuit with "no upgrade needed".
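
The resolution rule in the bullets above can be sketched roughly as follows. This is an illustration only, not the code from this PR: `resolve_patch` is a hypothetical name, and the available versions are passed in as arguments instead of being fetched from `az aks get-upgrades`.

```shell
# Hypothetical helper (not the PR's resolve_target_version).
# Usage: resolve_patch REQUESTED CURRENT [AVAILABLE...]
resolve_patch() {
    local requested="$1" current="$2"
    shift 2

    # A pinned full patch (X.Y.Z) is returned verbatim.
    if [[ "${requested}" =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
        printf '%s\n' "${requested}"
        return 0
    fi

    # Fold the current version into the candidates and keep only versions
    # that belong to the requested minor (prefix "X.Y.").
    local candidates=() v
    for v in "${current}" "$@"; do
        if [[ "${v}" == "${requested}."* ]]; then
            candidates+=("${v}")
        fi
    done

    # No patch for this minor: fall back to the requested minor itself.
    if (( ${#candidates[@]} == 0 )); then
        printf '%s\n' "${requested}"
        return 0
    fi

    # Highest patch by version sort (sort -V, as used by the script).
    printf '%s\n' "${candidates[@]}" | sort -V | tail -n 1
}
```

Called as `resolve_patch 1.35 1.35.2 1.35.3 1.35.5 1.36.0`, the sketch ignores 1.36.0 because it belongs to a different minor, and it keeps the current version as a candidate so an already-latest cluster resolves to itself.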

Why

Cluster updates are moving from AKS's unattended weekend auto-upgrade into the rollout pipeline. Running disruptive operations unattended over the weekend risks broken clusters on Monday. Driving updates through rollouts is safer: rollouts are monitored, gated by E2E tests that validate the change, and progress through Dev (always latest) → INT → Stage → Prod.

Two concrete gaps this closes:

  • KUBERNETES_VERSION comes from svc.aks.kubernetesVersion / mgmt.aks.kubernetesVersion, normally a minor like 1.35. The old sort -V comparison treated it as a floor, so once a cluster reached any 1.35.x it reported "no upgrade needed" and never picked up newer 1.35.z patch fixes.
  • With nodeOSUpgradeChannel removed (feat(aks): disable AKS-managed automatic cluster upgrades #5628), node OS image patches now need an explicit pipeline pass — added here.
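
To make the first gap concrete, here is a hedged reconstruction of a floor-style check. The old script's exact code is not shown in this conversation, so `meets_floor` is a hypothetical stand-in for the `sort -V` comparison described above.

```shell
# Hypothetical stand-in for the old floor-style comparison.
# Returns success when CURRENT is at or above REQUESTED in version order.
meets_floor() {
    local requested="$1" current="$2" lowest
    # The lower of the two versions under version sort.
    lowest="$(printf '%s\n' "${requested}" "${current}" | sort -V | head -n 1)"
    # If the requested version is the lower (or equal) one, the floor is met.
    [ "${lowest}" = "${requested}" ]
}

# 1.35 sorts before every 1.35.x, so any 1.35.x cluster satisfies the floor
# and would be reported as needing no upgrade, even if 1.35.5 exists.
if meets_floor 1.35 1.35.2; then
    echo "no upgrade needed"
fi
```

Under this kind of check a cluster on 1.34.5 is still detected as behind, which is why the gap only shows up once a cluster has reached the configured minor.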

Testing

  • bash -n syntax check passes.
  • Unit-tested the upgrade flow with a mocked az under set -euo pipefail across cases: behind-minor, on-minor-with-newer-patch, already-latest (no-op), pinned full patch (verbatim), empty get-upgrades (fallback), and node-image-only (version current, stale node image).
req=1.35   cur=1.34.5 avail=[1.35.1,1.35.3]         -> 1.35.3
req=1.35   cur=1.35.2 avail=[1.35.3,1.35.5,1.36.0]  -> 1.35.5
req=1.35   cur=1.35.5 avail=[1.36.0,1.36.1]         -> 1.35.5
req=1.32.5 cur=1.32.1 avail=[1.99.9]                -> 1.32.5
req=1.35   cur=1.34.0 avail=[]                       -> 1.35

Special notes for your reviewer

  • az aks get-upgrades is cluster-scoped (no region argument needed) and only returns forward upgrade targets, so the current version is folded in before taking the max to correctly handle the already-latest case.
  • A Kubernetes version upgrade already reimages nodes to the latest image for that version, so the node-image pass is a no-op immediately after one; it matters for clusters that are already on the target version but have a stale node image.
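
The node-image staleness check could look something like the sketch below. It assumes the latest image is read per pool with `az aks nodepool get-upgrades`; the PR does not show where `latestNodeImageVersion` comes from, and `stale_node_pools` is a hypothetical name.

```shell
# Hypothetical helper: prints the name of every node pool whose running
# node image differs from the latest image reported for that pool.
stale_node_pools() {
    local resource_group="$1" cluster_name="$2"
    local pool running latest

    # One "name<TAB>nodeImageVersion" line per pool.
    while IFS=$'\t' read -r pool running; do
        # Assumption: latestNodeImageVersion is taken from nodepool get-upgrades.
        latest="$(az aks nodepool get-upgrades \
            --resource-group "${resource_group}" \
            --cluster-name "${cluster_name}" \
            --nodepool-name "${pool}" \
            --query latestNodeImageVersion \
            --output tsv </dev/null)"

        if [ -n "${latest}" ] && [ "${running}" != "${latest}" ]; then
            printf '%s\n' "${pool}"
        fi
    done < <(az aks nodepool list \
        --resource-group "${resource_group}" \
        --cluster-name "${cluster_name}" \
        --query '[].[name, nodeImageVersion]' \
        --output tsv)
}
```

If the function prints anything, a single `az aks upgrade --node-image-only` pass would be warranted; right after a Kubernetes version upgrade it is expected to print nothing.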

PR Checklist

  • PR is scoped to a single task (no mixed concerns)
  • Title follows Conventional Commits format
  • Summary explains the "Why" behind the change
  • Linked to relevant ticket/issue
  • Screenshots included (if graph/UI/metrics changes)
  • Self-reviewed the diff
  • CI/CD checks are passing (ignore Tide)
  • Draft PR used for WIP (if applicable)
  • Commit history is clean (rebased/squashed)
  • Tricky code blocks are commented
  • Specific reviewers tagged
  • All comment threads resolved before merge

…SLSRE-1162)

upgrade-aks-cluster.sh treated the configured minor (e.g. 1.35) as a
floor, so clusters never picked up newer 1.35.z patches once they were
at any 1.35.x. Resolve a minor to the highest patch available for that
minor (via az aks get-upgrades plus the current version) and upgrade to
that concrete patch. Pinned full patch versions (X.Y.Z) are honored
verbatim, and the script falls back to the previous behavior when no
patch is available or the lookup fails.
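
For illustration, the lookup described in this commit message might be wrapped as below. The function name is hypothetical and the JMESPath query is an assumption about how the script reads the `az aks get-upgrades` output; the point of the sketch is that a failed lookup yields an empty list instead of an error, which lets the caller fall back.

```shell
# Hypothetical wrapper: prints forward upgrade targets for the control plane,
# one per line. Prints nothing and still succeeds if the lookup fails.
list_upgrade_targets() {
    local resource_group="$1" cluster_name="$2"
    az aks get-upgrades \
        --resource-group "${resource_group}" \
        --name "${cluster_name}" \
        --query 'controlPlaneProfile.upgrades[].kubernetesVersion' \
        --output tsv 2>/dev/null || true
}
```

Because `get-upgrades` only reports versions ahead of the cluster, the caller still has to add the current version to this list before picking the highest patch.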
Copilot AI review requested due to automatic review settings June 12, 2026 10:15
@openshift-ci openshift-ci Bot requested review from stevekuznetsov and weherdh June 12, 2026 10:15

openshift-ci Bot commented Jun 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: raelga

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Pull request overview

This PR updates the AKS upgrade script to treat a configured Kubernetes minor version (e.g., 1.35) as “upgrade to the latest available patch for that minor,” by resolving X.Y to a concrete X.Y.Z before comparing/upgrading. This improves patch hygiene for long-lived clusters that previously stopped upgrading once they reached any X.Y.*.

Changes:

  • Add resolve_target_version to resolve X.Y to the highest available X.Y.Z patch (while honoring pinned X.Y.Z verbatim).
  • Use the resolved target version for control plane and node pool upgrade decisions and the az aks upgrade --kubernetes-version argument.
  • Improve logging to show both requested and resolved target versions.

Comment thread dev-infrastructure/scripts/upgrade-aks-cluster.sh Outdated
The grep in resolve_target_version exits 1 when no patch matches the
requested minor; with pipefail that aborted the script before the
empty-result fallback could return the requested minor. Append || true
so the no-match case yields an empty string and falls back as intended.
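
A small, self-contained reproduction of the failure mode this commit describes, under the same `set -euo pipefail` settings. `highest_match` is a hypothetical helper, not the function from the script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reads versions on stdin and prints the highest one
# matching PATTERN, or nothing when no line matches.
highest_match() {
    local pattern="$1"
    # grep exits 1 on "no match"; with pipefail that status would fail the
    # whole pipeline, so it is neutralised here. Note that "|| true" also
    # hides real grep errors (exit status 2).
    { grep -E "${pattern}" || true; } | sort -V | tail -n 1
}

# No 1.35.x in the input: the result is empty and the script keeps running.
match="$(printf '%s\n' 1.36.0 1.36.1 | highest_match '^1\.35\.')"
echo "match='${match}'"
```

Without the `|| true`, the assignment to `match` would return a non-zero status and `set -e` would stop the script before any fallback logic could run.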
…adeChannel)

Drive node OS image (security patch) upgrades from the rollout instead of
the AKS-managed nodeOSUpgradeChannel that ran unattended on a weekend
maintenance window. After the optional Kubernetes version upgrade, compare
each node pool's running image against the latest available and run a
node-image-only upgrade when a newer image exists.

The previous early exit when no version upgrade was needed is removed so the
node image check always runs.

Jira: AROSLSRE-1162
Copilot AI review requested due to automatic review settings June 12, 2026 11:05
@raelga raelga changed the title fix(aks): upgrade to latest available patch for configured minor (AROSLSRE-1162) fix(aks): resolve latest patch and refresh node images on upgrade (AROSLSRE-1165) Jun 12, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment on lines +122 to +128
# This replaces the AKS-managed nodeOSUpgradeChannel/auto-upgrade that used to
# pull the latest node image (security patches) on a maintenance schedule. We
# now drive it from the pipeline: for each node pool, compare the running node
# image against the latest available one and run a node-image-only upgrade when
# a newer image exists. A Kubernetes version upgrade above already reimages
# nodes to the latest image for the target version, so this is typically a
# no-op right after one and only does work when the image alone is stale.

Collaborator Author

@copilot Good catch — reworded in e683a4f. The comment now states the node-image pass is the pipeline-driven replacement for nodeOSUpgradeChannel, which is disabled (None) in the companion change to dev-infrastructure/modules/aks-cluster-base.bicep (PR #5628). The two are coupled sub-tasks under AROSLSRE-1164, so auto-upgrade is removed there while this PR adds the on-demand reimage.

Comment on lines +165 to +172
if [ "${NODE_IMAGE_UPGRADE_NEEDED}" = "true" ]; then
    echo "Upgrading node images for cluster '${CLUSTER_NAME}' in RG '${RESOURCE_GROUP}'..."

    az aks upgrade \
        --resource-group "${RESOURCE_GROUP}" \
        --name "${CLUSTER_NAME}" \
        --node-image-only \
        --yes

Collaborator Author

@copilot The PR title, description, and linked Jira (AROSLSRE-1165, parent AROSLSRE-1164) have been updated to explicitly cover the node-image upgrade behavior and its rollout rationale (monitored, E2E-gated, Dev→INT→Stage→Prod). Kept in this PR rather than split out because it directly replaces the nodeOSUpgradeChannel disabled in the paired PR #5628 — splitting would leave a window with no node-OS patch coverage.
