fix(aks): resolve latest patch and refresh node images on upgrade (AROSLSRE-1165)#5627
raelga wants to merge 4 commits
…SLSRE-1162) upgrade-aks-cluster.sh treated the configured minor (e.g. 1.35) as a floor, so clusters never picked up newer 1.35.z patches once they were at any 1.35.x. Resolve a minor to the highest patch available for that minor (via az aks get-upgrades plus the current version) and upgrade to that concrete patch. Pinned full patch versions (X.Y.Z) are honored verbatim, and the script falls back to the previous behavior when no patch is available or the lookup fails.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: raelga
Pull request overview
This PR updates the AKS upgrade script to treat a configured Kubernetes minor version (e.g., 1.35) as “upgrade to the latest available patch for that minor,” by resolving X.Y to a concrete X.Y.Z before comparing/upgrading. This improves patch hygiene for long-lived clusters that previously stopped upgrading once they reached any X.Y.*.
Changes:
- Add `resolve_target_version` to resolve `X.Y` to the highest available `X.Y.Z` patch (while honoring pinned `X.Y.Z` verbatim).
- Use the resolved target version for control plane and node pool upgrade decisions and the `az aks upgrade --kubernetes-version` argument.
- Improve logging to show both requested and resolved target versions.
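The resolution step listed above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: only the name `resolve_target_version` comes from the PR, while the two-argument signature, the newline-separated candidate list, and the use of `awk` for prefix filtering (the real script uses `grep`) are assumptions made for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sketch of the resolution logic.
#   $1 = requested version (X.Y or X.Y.Z)
#   $2 = newline-separated candidate versions (assumed to be the
#        get-upgrades targets plus the cluster's current version)
resolve_target_version() {
    local requested="$1" candidates="$2" best

    # A pinned full patch (X.Y.Z) is returned verbatim.
    if [[ "${requested}" =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
        printf '%s\n' "${requested}"
        return 0
    fi

    # Keep only versions that start with "X.Y.", then take the highest
    # by version sort. awk exits 0 even when nothing matches.
    best="$(printf '%s\n' "${candidates}" \
        | awk -v p="${requested}." 'index($0, p) == 1' \
        | sort -V | tail -n 1)"

    # Fall back to the requested minor when no patch matched.
    printf '%s\n' "${best:-${requested}}"
}
```

Called as `resolve_target_version 1.35 "$candidates"`, the sketch prints a concrete patch when one exists for that minor and the bare minor otherwise, which mirrors the fallback described in the PR.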
The grep in resolve_target_version exits 1 when no patch matches the requested minor; with pipefail that aborted the script before the empty-result fallback could return the requested minor. Append || true so the no-match case yields an empty string and falls back as intended.
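A small standalone reproduction of the failure mode this commit describes, assuming a pipeline shaped like `grep | sort -V | tail`. The version list and variable names are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# No 1.35.z patch in this (made-up) list, so grep exits with status 1.
versions=$'1.34.7\n1.34.9'

# With pipefail, grep's status 1 becomes the pipeline's status. Without
# `|| true`, the command substitution would fail and `set -e` would abort
# the script on this assignment, before any fallback could run.
match="$(printf '%s\n' "${versions}" | grep -E '^1\.35\.' | sort -V | tail -n 1 || true)"

# The empty result now drives the fallback to the requested minor.
target="${match:-1.35}"
echo "target=${target}"   # prints target=1.35
```

Removing `|| true` from the pipeline makes the script exit silently with status 1, which matches the abort described above.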
…adeChannel) Drive node OS image (security patch) upgrades from the rollout instead of the AKS-managed nodeOSUpgradeChannel that ran unattended on a weekend maintenance window. After the optional Kubernetes version upgrade, compare each node pool's running image against the latest available and run a node-image-only upgrade when a newer image exists. The previous early exit when no version upgrade was needed is removed so the node image check always runs. Jira: AROSLSRE-1162
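The per-pool comparison described in this commit could look something like the sketch below. The helper name `node_image_upgrade_needed` and its stdin format (one `pool current latest` triple per line) are hypothetical; how the script actually gathers the running and latest image versions from `az` is not shown here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reads "pool current latest" triples on stdin and
# prints "true" if any pool's running image differs from the latest
# available image, otherwise "false". Details go to stderr.
node_image_upgrade_needed() {
    local pool current latest needed="false"
    while read -r pool current latest; do
        # Skip blank lines.
        [ -z "${pool}" ] && continue
        if [ "${current}" != "${latest}" ]; then
            echo "Node pool '${pool}' is stale: ${current} -> ${latest}" >&2
            needed="true"
        fi
    done
    printf '%s\n' "${needed}"
}

# Usage with made-up image names: the "user" pool is one image behind.
NODE_IMAGE_UPGRADE_NEEDED="$(node_image_upgrade_needed <<'EOF'
system image-2025.02 image-2025.02
user   image-2025.01 image-2025.02
EOF
)"
echo "${NODE_IMAGE_UPGRADE_NEEDED}"   # prints true
```

A single flag for the whole cluster fits the `az aks upgrade --node-image-only` call, which is issued once per cluster rather than per pool.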
# This replaces the AKS-managed nodeOSUpgradeChannel/auto-upgrade that used to
# pull the latest node image (security patches) on a maintenance schedule. We
# now drive it from the pipeline: for each node pool, compare the running node
# image against the latest available one and run a node-image-only upgrade when
# a newer image exists. A Kubernetes version upgrade above already reimages
# nodes to the latest image for the target version, so this is typically a
# no-op right after one and only does work when the image alone is stale.
@copilot Good catch — reworded in e683a4f. The comment now states the node-image pass is the pipeline-driven replacement for nodeOSUpgradeChannel, which is disabled (None) in the companion change to dev-infrastructure/modules/aks-cluster-base.bicep (PR #5628). The two are coupled sub-tasks under AROSLSRE-1164, so auto-upgrade is removed there while this PR adds the on-demand reimage.
if [ "${NODE_IMAGE_UPGRADE_NEEDED}" = "true" ]; then
    echo "Upgrading node images for cluster '${CLUSTER_NAME}' in RG '${RESOURCE_GROUP}'..."

    az aks upgrade \
        --resource-group "${RESOURCE_GROUP}" \
        --name "${CLUSTER_NAME}" \
        --node-image-only \
        --yes
@copilot The PR title, description, and linked Jira (AROSLSRE-1165, parent AROSLSRE-1164) have been updated to explicitly cover the node-image upgrade behavior and its rollout rationale (monitored, E2E-gated, Dev→INT→Stage→Prod). Kept in this PR rather than split out because it directly replaces the nodeOSUpgradeChannel disabled in the paired PR #5628 — splitting would leave a window with no node-OS patch coverage.
Jira: AROSLSRE-1165 (parent: AROSLSRE-1164)
What
`dev-infrastructure/scripts/upgrade-aks-cluster.sh` now fully owns AKS updates from the rollout pipeline:
- Resolves a configured minor (e.g. `1.35`) to the highest available patch for that minor before deciding whether to upgrade, instead of treating the minor as a version floor. A pinned full patch `X.Y.Z` is honored verbatim. The control plane + node pools upgrade only when behind.
- Each node pool's `nodeImageVersion` is compared against `latestNodeImageVersion`; if any pool is stale, an `az aks upgrade --node-image-only` pass refreshes node images. This replaces the AKS-managed `nodeOSUpgradeChannel` removed in feat(aks): disable AKS-managed automatic cluster upgrades #5628.
- No `X.Y.z` available yet, or `get-upgrades` fails → falls back to the previous behavior (pass the minor, let AKS choose) without erroring.
Why
Cluster updates are moving from AKS's unattended weekend auto-upgrade into the rollout pipeline. Running disruptive operations unattended over the weekend risks broken clusters on Monday. Driving updates through rollouts is safer: rollouts are monitored, gated by E2E tests that validate the change, and progress through Dev (always latest) → INT → Stage → Prod.
Two concrete gaps this closes:
- `KUBERNETES_VERSION` comes from `svc.aks.kubernetesVersion` / `mgmt.aks.kubernetesVersion`, normally a minor like `1.35`. The old `sort -V` comparison treated it as a floor, so once a cluster reached any `1.35.x` it reported "no upgrade needed" and never picked up newer `1.35.z` patch fixes.
- With `nodeOSUpgradeChannel` removed (feat(aks): disable AKS-managed automatic cluster upgrades #5628), node OS image patches now need an explicit pipeline pass, which is added here.
Testing
- `bash -n` syntax check passes.
- Exercised with `az` under `set -euo pipefail` across cases: behind-minor, on-minor-with-newer-patch, already-latest (no-op), pinned full patch (verbatim), empty `get-upgrades` (fallback), and node-image-only (version current, stale node image).
Special notes for your reviewer
`az aks get-upgrades` is cluster-scoped (no region argument needed) and only returns forward upgrade targets, so the current version is folded in before taking the max to correctly handle the already-latest case.
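To illustrate the note above about folding in the current version, here is a minimal sketch. The helper `highest_patch_for_minor` and its arguments are hypothetical, and the forward targets are passed in as a plain string rather than fetched from `az aks get-upgrades`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: highest X.Y.Z for a minor among the cluster's
# current version and the forward upgrade targets.
#   $1 = minor (X.Y), $2 = current version, $3 = forward targets,
#   one per line (may be empty)
highest_patch_for_minor() {
    local minor="$1" current="$2" upgrades="$3"
    # The current version is added to the candidates because the forward
    # targets alone never include it.
    printf '%s\n%s\n' "${current}" "${upgrades}" \
        | awk -v p="${minor}." 'index($0, p) == 1' \
        | sort -V | tail -n 1
}

# Already on the latest 1.35 patch: the only forward target is the next
# minor, so the resolved version equals the current one (a no-op).
highest_patch_for_minor 1.35 1.35.4 "1.36.0"   # prints 1.35.4
```

Without the current version in the candidate list, the same call would find no `1.35.z` at all and drop into the fallback path instead of resolving to a concrete patch.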