pillar: trust cluster recovery for kubevirt apps and add stuck-Pending VMI handler #5853

Merged
rene merged 1 commit into lf-edge:master from naiming-zededa:naiming-vmirs-delete on Apr 29, 2026
Conversation

@naiming-zededa
Contributor

@naiming-zededa naiming-zededa commented Apr 23, 2026

Description

In kubevirt mode, transient errors during app bring-up were tearing down the VMIReplicaSet via hyper.Delete/Cleanup, which caused 10+ minute retry stalls and, when it hit the purge-collision bug, left the app permanently down. Instead, let the cluster drive VMI lifecycle.

domainmgr:

  • doActivateTail: on Start failure or post-Start non-RUNNING state, skip Delete/Cleanup, publish State=BOOTING with Activated=false, and let verifyStatus promote to RUNNING once Info() reports it.
  • verifyStatus: same treatment on the periodic error path.

hypervisor/kubevirt:

  • Start() returns as soon as the VMIRS is created; drop the synchronous waitForVMI wait. Scheduling is the cluster's job.
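The shape of the new Start() can be sketched as follows. This is a hedged mock, assuming a stand-in createVMIRS helper; the real code talks to the Kubernetes API through the kubevirt client.

```go
package main

import "fmt"

// createVMIRS stands in for the kubevirt client call that creates the
// VMIReplicaSet (hypothetical helper for illustration).
func createVMIRS(name string) error {
	if name == "" {
		return fmt.Errorf("empty VMIRS name")
	}
	return nil
}

// startDomain models the new Start(): return as soon as the VMIRS is
// created. The old synchronous waitForVMI(vmis, nodeName, true) call is
// dropped; scheduling the VMI is the cluster's job, and domainmgr
// learns the RUNNING state later through the periodic Info() query.
func startDomain(domainName, vmisName string) error {
	if err := createVMIRS(vmisName); err != nil {
		return err
	}
	fmt.Printf("Started Kubevirt domain replicaset %s, VMI replicaset %s\n",
		domainName, vmisName)
	// No waitForVMI here: success means "VMIRS exists", not "VMI runs".
	return nil
}

func main() {
	if err := startDomain("app1", "app1-vmirs"); err != nil {
		fmt.Println("start failed:", err)
	}
}
```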

zedkube: new stuck-Pending VMI detector (pendingvmi.go)

  • If a VMI stays Pending for >3 min while its virt-launcher is Running on this node (a known kubevirt/longhorn interaction), force-delete the VMI (not the VMIRS) so the cluster re-creates it cleanly. Capped at 3 attempts per app.
  • During a 5-minute suppression window after delete, checkAppsStatus pins StatusRunning=true / ScheduledOnThisNode=true so zedmanager does not cascade DomainConfig.Activate=false and tear down the VMIRS.

domainmgr: gate VMIRS Delete on UserActivate to prevent cluster-cascade teardown in kubevirt mode
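The UserActivate gate reduces to one predicate. Again a sketch under the PR's description, with illustrative names:

```go
package main

import "fmt"

// shouldDeleteVMIRS models the domainmgr gate: in kubevirt mode the
// VMIRS is deleted only when the user explicitly deactivated the app
// (userActivate=false). A cluster-side cascade (node failover, VMI
// rescheduling) must not be able to trigger a VMIRS teardown.
func shouldDeleteVMIRS(isKubevirt, userActivate bool) bool {
	if !isKubevirt {
		return true // non-kubevirt: keep the old cleanup behavior
	}
	return !userActivate
}

func main() {
	fmt.Println(shouldDeleteVMIRS(true, true))  // user still wants it: keep VMIRS
	fmt.Println(shouldDeleteVMIRS(true, false)) // user deactivated: delete allowed
}
```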

PR dependencies

How to test and validate this PR

Run eve-k with VM apps; they should start normally. App delete should work and do the proper cleanup, and deactivate plus reactivate should start the VMI again. For the stuck-VMI handling: if we can reproduce the case where, on application start, the VM stays in Pending state for a very long time, verify that the VMI eventually comes back due to the handling in this patch.

Changelog notes

pillar: trust cluster recovery for kubevirt apps and add stuck-Pending VMI handler

PR Backports

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

For backport PRs (remove it if it's not a backport):

  • I've added a reference link to the original PR
  • PR's title follows the template

And the last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Please, check the boxes above after submitting the PR in interactive mode.

@codecov

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 0% with 60 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.11%. Comparing base (2281599) to head (9adee87).
⚠️ Report is 619 commits behind head on master.

Files with missing lines Patch % Lines
pkg/pillar/cmd/domainmgr/domainmgr.go 0.00% 58 Missing ⚠️
pkg/pillar/cmd/zedmanager/handledomainmgr.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5853      +/-   ##
==========================================
- Coverage   19.52%   17.11%   -2.42%     
==========================================
  Files          19      474     +455     
  Lines        3021    85645   +82624     
==========================================
+ Hits          590    14655   +14065     
- Misses       2310    69472   +67162     
- Partials      121     1518    +1397     

@naiming-zededa force-pushed the naiming-vmirs-delete branch 2 times, most recently from 20efefd to 8266e9f on April 25, 2026
}
logrus.Infof("Started Kubevirt domain replicaset %s, VMI replicaset %s", domainName, vmis.name)

err = waitForVMI(vmis, nodeName, true)

This part I am not clear about. What's wrong with waiting for the VM to start, even though it's the responsibility of Kubernetes?


Also, if we exit here quickly: I do remember domainmgr has only 10 retries before giving up. So if for some reason this app did not start within 30 secs or so, do we ever restart it automatically?

Contributor Author


The reasons are: first, simplification. waitForVMI cannot report anything useful at this point: an error only means the VMIRS was not created successfully, and if the VMI later becomes 'Running', domainmgr on the outside has no mechanism to know that, so there is no need to hang here with an artificial delay. Second, the outer 10 x 2-second retry normally reports success; if it does not, no harm is done, and we wait for the next Info() query to learn the running state of the app. So the launch/monitoring mechanism stays very simple.


ok

@zedi-pramodh

LGTM, but would like @andrewd-zededa to review this.


@zedi-pramodh zedi-pramodh left a comment


LGTM

Contributor

@andrewd-zededa andrewd-zededa left a comment


I like the direction in this PR to allow Kubernetes to manage the config it has been given. I have some concern around the pending-VMI cleanup and the async failover path also running alongside it.

Comment thread pkg/pillar/cmd/zedkube/pendingvmi.go Outdated
log.Errorf("virtLauncherActiveOnThisNode: list pods: %v", err)
return false
}
vlPrefix := "virt-launcher-" + appKubeName
Contributor


Can this use the constant VMIPodNamePrefix?

Contributor Author


Yes, updated.


log.Noticef("checkStuckPendingVMI: aiUUID:%s VMI:%s Pending >%v on this node with active/error virt-launcher; force-deleting (attempt %d/%d)",
appUUID, vmi.Name, pendingVMIDeleteThreshold, z.vmiDeleteCount[appUUID]+1, pendingVMIMaxDeletes)
if err := kubeapi.TryFastDeleteVmi(log, virtClient, vmi.Name); err != nil {
Contributor


Is this going to conflict with failover activities?

Contributor Author


No conflict. The two delete paths target different VMI objects: DetachOldWorkload deletes the old VMI on the failed node, while checkStuckPendingVMI would delete the new Pending VMI on this node.
Execution is serial (same event loop).
The isPodTerminating guard prevents checkStuckPendingVMI from acting during the early failover phase, when only the old terminating pod exists. If both act in the same failover window, they complement each other: both use delete-and-VMIRS-recreate as the recovery mechanism.

Contributor


I'm concerned about the interaction with the end of the failover when the new VMI is attempting to start and is waiting on the volume attachment to resolve. Can the checkStuckPendingVMI time threshold be moved higher?

Copy link
Copy Markdown
Contributor Author


OK, I have added the vmiFailoverSuppressUntil[] window to avoid touching the VMI while it is in a failover operation.

@rene
Contributor

rene commented Apr 28, 2026

@naiming-zededa , pls, rebase this PR....

pillar: trust cluster recovery for kubevirt apps and add stuck-Pending VMI handler

In kubevirt mode, transient errors during app bring-up were tearing down
the VMIReplicaSet via hyper.Delete/Cleanup, which caused 10+ minute
retry stalls and, when it hit the purge-collision bug, left the app
permanently down. Instead, let the cluster drive VMI lifecycle.

domainmgr:
 - doActivateTail: on Start failure or post-Start non-RUNNING state, skip
   Delete/Cleanup, publish State=BOOTING with Activated=false, and let
   verifyStatus promote to RUNNING once Info() reports it.
 - verifyStatus: same treatment on the periodic error path.

hypervisor/kubevirt:
 - Start() returns as soon as the VMIRS is created; drop the synchronous
   waitForVMI wait. Scheduling is the cluster's job.

zedkube: new stuck-Pending VMI detector (pendingvmi.go)
 - If a VMI stays Pending for >3 min while its virt-launcher is Running on
   this node (a known kubevirt/longhorn interaction), force-delete the VMI
   (not the VMIRS) so the cluster re-creates it cleanly. Capped at 3
   attempts per app.
 - During a 5-minute suppression window after delete, checkAppsStatus pins
   StatusRunning=true / ScheduledOnThisNode=true so zedmanager does not
   cascade DomainConfig.Activate=false and tear down the VMIRS.

 - suppress stuck-Pending VMI delete during active failover
   checkStuckPendingVMI could false-positive force-delete a new VMI that
   is legitimately Pending on the receiving node during failover — the
   same signature as the stuck-Pending bug. Fix by having
   checkAppsFailover set a per-app vmiFailoverSuppressUntil window when
   failover lands on this node; checkStuckPendingVMI skips force-delete
   while the window is live. The window self-extends each tick as long
   as the Terminating pod exists, and expires automatically once
   failover completes.

 - domainmgr: gate VMIRS Delete on UserActivate to prevent
   cluster-cascade teardown in kubevirt mode

Signed-off-by: naiming-zededa <naiming@zededa.com>
@naiming-zededa
Contributor Author

@naiming-zededa , pls, rebase this PR....

@rene rebased. thanks.

@rene rene merged commit 0a30ea9 into lf-edge:master Apr 29, 2026
39 of 46 checks passed
