pillar: trust cluster recovery for kubevirt apps and add stuck-Pending VMI handler #5853
Conversation
Force-pushed c6d71d6 to f5ad4a8 (Compare)
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #5853       +/-   ##
==========================================
- Coverage   19.52%   17.11%     -2.42%
==========================================
  Files          19      474       +455
  Lines        3021    85645     +82624
==========================================
+ Hits          590    14655     +14065
- Misses       2310    69472     +67162
- Partials      121     1518      +1397
Force-pushed 20efefd to 8266e9f (Compare)
}
logrus.Infof("Started Kubevirt domain replicaset %s, VMI replicaset %s", domainName, vmis.name)

err = waitForVMI(vmis, nodeName, true)
This part I am not clear on. What's wrong with waiting for the VM to start, even though it's the responsibility of Kubernetes?
Also, if we exit here quickly: I remember domainmgr has only 10 retries before giving up. So if for some reason this app did not start within 30 seconds or so, do we ever restart it automatically?
The reason is twofold. First, simplification: at this point Start() cannot report anything meaningful. If it reports an error, that means we failed to create the VMIRS; if the VMI only becomes 'Running' later, domainmgr outside already has the mechanism to find out, so we don't need to hang here with an artificial delay. Second, the outer 10 x 2-second poll will report success in the normal case; if not, no harm is done, and we learn the running state of the app on the next Info() query. So the launch/monitoring mechanism stays very simple.
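A minimal sketch of this fire-and-forget split, assuming hypothetical helpers (createVMIReplicaSet, queryVMIState) that stand in for the real pillar/kubevirt-client calls:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Start creates the VMIRS and returns immediately. An error here means the
// VMIRS itself could not be created; VMI scheduling is left to the cluster.
func Start(domainName string) error {
	if err := createVMIReplicaSet(domainName); err != nil {
		return fmt.Errorf("create VMIRS for %s: %w", domainName, err)
	}
	// No waitForVMI here: if the VMI becomes Running later, the periodic
	// Info() query in domainmgr will observe it.
	return nil
}

// waitRunning models the outer 10 x 2-second poll mentioned in the comment.
func waitRunning(domainName string) bool {
	for i := 0; i < 10; i++ {
		if state, err := queryVMIState(domainName); err == nil && state == "Running" {
			return true
		}
		time.Sleep(2 * time.Second)
	}
	// Not running yet: no harm done, the next Info() query picks it up.
	return false
}

// Stubs so the sketch compiles; illustrative only.
func createVMIReplicaSet(name string) error     { return nil }
func queryVMIState(name string) (string, error) { return "", errors.New("not implemented") }

func main() {
	if err := Start("my-domain"); err == nil {
		fmt.Println("running:", waitRunning("my-domain"))
	}
}
```

The point is that Start() only vouches for VMIRS creation; convergence to Running is the cluster's job and is observed asynchronously.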
LGTM, but would like @andrewd-zededa to review this.
Force-pushed 8266e9f to d8ee47c (Compare)
andrewd-zededa left a comment:
I like the direction in this PR of letting Kubernetes manage the config it has been given. I have some concern about the pending-VMI cleanup and the async failover path running alongside this.
    log.Errorf("virtLauncherActiveOnThisNode: list pods: %v", err)
    return false
}
vlPrefix := "virt-launcher-" + appKubeName
Can this use the constant VMIPodNamePrefix?
Yes, updated.
log.Noticef("checkStuckPendingVMI: aiUUID:%s VMI:%s Pending >%v on this node with active/error virt-launcher; force-deleting (attempt %d/%d)",
    appUUID, vmi.Name, pendingVMIDeleteThreshold, z.vmiDeleteCount[appUUID]+1, pendingVMIMaxDeletes)
if err := kubeapi.TryFastDeleteVmi(log, virtClient, vmi.Name); err != nil {
Is this going to conflict with failover activities?
No conflict. The two delete paths target different VMI objects: DetachOldWorkload deletes the old VMI on the failed node, while checkStuckPendingVMI would delete the new Pending VMI on this node. Execution is serial (same event loop). The isPodTerminating guard prevents checkStuckPendingVMI from acting during the early failover phase, where only the old Terminating pod exists. If both act in the same failover window, they complement each other: both use delete-and-VMIRS-recreate as the recovery mechanism.
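A hedged sketch of that guard ordering; all helper names here (isPodTerminating, virtLauncherActive, pendingPastThreshold, tryFastDeleteVMI) are illustrative stand-ins for the pendingvmi.go logic, not its actual API:

```go
package main

import "fmt"

func checkStuckPendingVMI(appUUID string) {
	// Early failover phase: only the old, Terminating virt-launcher pod
	// exists on this node, so leave the new Pending VMI alone.
	if isPodTerminating(appUUID) {
		return
	}
	// Act only when a virt-launcher is active (or errored) on this node
	// and the VMI has been Pending past the threshold.
	if !virtLauncherActive(appUUID) || !pendingPastThreshold(appUUID) {
		return
	}
	// Delete the VMI, not the VMIRS: the cluster re-creates the VMI,
	// the same recovery mechanism DetachOldWorkload relies on.
	tryFastDeleteVMI(appUUID)
}

// Stubs so the sketch compiles; illustrative only.
func isPodTerminating(string) bool     { return false }
func virtLauncherActive(string) bool   { return true }
func pendingPastThreshold(string) bool { return true }
func tryFastDeleteVMI(appUUID string)  { fmt.Println("force-delete VMI for", appUUID) }

func main() { checkStuckPendingVMI("app-uuid") }
```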
I'm concerned about the interaction with the end of the failover when the new VMI is attempting to start and is waiting on the volume attachment to resolve. Can the checkStuckPendingVMI time threshold be moved higher?
OK, I have added the vmiFailoverSuppressUntil[] map to avoid touching the VMI while it is in a failover operation.
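A small sketch of how such a suppression window could look; the map shape and function names are assumptions based on this discussion, not the actual zedkube code:

```go
package main

import (
	"fmt"
	"time"
)

var vmiFailoverSuppressUntil = map[string]time.Time{}

// checkAppsFailover side: open (or self-extend) the window each tick while
// failover for this app is still in progress on this node.
func extendFailoverSuppress(appUUID string, window time.Duration) {
	vmiFailoverSuppressUntil[appUUID] = time.Now().Add(window)
}

// checkStuckPendingVMI side: skip the force-delete while the window is live;
// it expires automatically once failover completes and stops extending it.
func suppressedByFailover(appUUID string) bool {
	until, ok := vmiFailoverSuppressUntil[appUUID]
	return ok && time.Now().Before(until)
}

func main() {
	extendFailoverSuppress("app-uuid", 5*time.Minute)
	fmt.Println("suppressed:", suppressedByFailover("app-uuid"))
}
```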
Force-pushed d8ee47c to b0f6415 (Compare)
@naiming-zededa, please rebase this PR.
pillar: trust cluster recovery for kubevirt apps and add stuck-Pending VMI handler

In kubevirt mode, transient errors during app bring-up were tearing down the VMIReplicaSet via hyper.Delete/Cleanup, which caused 10+ minute retry stalls and, when it hit the purge-collision bug, left the app permanently down. Instead, let the cluster drive VMI lifecycle.

domainmgr:
- doActivateTail: on Start failure or post-Start non-RUNNING state, skip Delete/Cleanup, publish State=BOOTING with Activated=false, and let verifyStatus promote to RUNNING once Info() reports it.
- verifyStatus: same treatment on the periodic error path.

hypervisor/kubevirt:
- Start() returns as soon as the VMIRS is created; drop the synchronous waitForVMI wait. Scheduling is the cluster's job.

zedkube: new stuck-Pending VMI detector (pendingvmi.go)
- If a VMI stays Pending for >3 min while its virt-launcher is Running on this node (a known kubevirt/longhorn interaction), force-delete the VMI (not the VMIRS) so the cluster re-creates it cleanly. Capped at 3 attempts per app.
- During a 5-minute suppression window after delete, checkAppsStatus pins StatusRunning=true / ScheduledOnThisNode=true so zedmanager does not cascade DomainConfig.Activate=false and tear down the VMIRS.
- Suppress stuck-Pending VMI delete during active failover: checkStuckPendingVMI could false-positive force-delete a new VMI that is legitimately Pending on the receiving node during failover, the same signature as the stuck-Pending bug. Fix by having checkAppsFailover set a per-app vmiFailoverSuppressUntil window when failover lands on this node; checkStuckPendingVMI skips force-delete while the window is live. The window self-extends each tick as long as the Terminating pod exists, and expires automatically once failover completes.

domainmgr: gate VMIRS Delete on UserActivate to prevent cluster-cascade teardown in kubevirt mode

Signed-off-by: naiming-zededa <naiming@zededa.com>
Force-pushed b0f6415 to 9adee87 (Compare)
@rene rebased. Thanks.
Description
In kubevirt mode, transient errors during app bring-up were tearing down the VMIReplicaSet via hyper.Delete/Cleanup, which caused 10+ minute retry stalls and, when it hit the purge-collision bug, left the app permanently down. Instead, let the cluster drive VMI lifecycle.
domainmgr:
- doActivateTail: on Start failure or post-Start non-RUNNING state, skip Delete/Cleanup, publish State=BOOTING with Activated=false, and let verifyStatus promote to RUNNING once Info() reports it (sketched below).
- verifyStatus: same treatment on the periodic error path.

hypervisor/kubevirt:
- Start() returns as soon as the VMIRS is created; the synchronous waitForVMI wait is dropped. Scheduling is the cluster's job.

zedkube: new stuck-Pending VMI detector (pendingvmi.go)
- If a VMI stays Pending for >3 min while its virt-launcher is Running on this node (a known kubevirt/longhorn interaction), force-delete the VMI (not the VMIRS) so the cluster re-creates it cleanly. Capped at 3 attempts per app.
- During a 5-minute suppression window after the delete, checkAppsStatus pins StatusRunning=true / ScheduledOnThisNode=true so zedmanager does not cascade DomainConfig.Activate=false and tear down the VMIRS.
- During active failover, a per-app vmiFailoverSuppressUntil window suppresses the stuck-Pending force-delete, so a legitimately Pending VMI on the receiving node is not touched.

domainmgr: gate VMIRS Delete on UserActivate to prevent cluster-cascade teardown in kubevirt mode.
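A simplified sketch of the doActivateTail behavior described above, under stand-in types; SwState, DomainStatus, and publishDomainStatus here only approximate the real pillar plumbing:

```go
package main

import "fmt"

// Simplified stand-ins for the real pillar types; illustration only.
type SwState int

const (
	BOOTING SwState = iota
	RUNNING
)

type DomainStatus struct {
	State     SwState
	Activated bool
}

func doActivateTail(status *DomainStatus, startErr error, running bool) {
	if startErr != nil || !running {
		// Previously: hyper.Delete/Cleanup tore down the VMIRS here,
		// causing the 10+ minute retry stalls.
		// Now: keep the VMIRS, publish BOOTING, and let the periodic
		// verifyStatus/Info() path promote to RUNNING.
		status.State = BOOTING
		status.Activated = false
	} else {
		status.State = RUNNING
		status.Activated = true
	}
	publishDomainStatus(status)
}

func publishDomainStatus(s *DomainStatus) { fmt.Printf("publish: %+v\n", s) }

func main() {
	st := &DomainStatus{}
	doActivateTail(st, nil, false) // transient non-RUNNING: report BOOTING, keep VMIRS
}
```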
PR dependencies
How to test and validate this PR
Run eve-k with VM apps; they should start normally. App delete should work and do the proper cleanup. Deactivate and reactivate should start the VMI again. For the stuck-Pending handling in this patch: if the condition can be reproduced where, on application start, the VM stays in Pending state for a very long time, verify that the VMI eventually comes back due to the handling in this patch.
Changelog notes
pillar: trust cluster recovery for kubevirt apps and add stuck-Pending VMI handler
PR Backports