pillar: trust cluster recovery for kubevirt apps and add stuck-Pending VMI handler #5853

Merged
rene merged 1 commit into lf-edge:master from naiming-zededa:naiming-vmirs-delete on Apr 29, 2026
Conversation

@naiming-zededa
Contributor

@naiming-zededa naiming-zededa commented Apr 23, 2026

Description

In kubevirt mode, transient errors during app bring-up were tearing down the VMIReplicaSet via hyper.Delete/Cleanup, which caused 10+ minute retry stalls and, when it hit the purge-collision bug, left the app permanently down. Instead, let the cluster drive VMI lifecycle.

domainmgr:

  • doActivateTail: on Start failure or post-Start non-RUNNING state, skip Delete/Cleanup, publish State=BOOTING with Activated=false, and let verifyStatus promote to RUNNING once Info() reports it.
  • verifyStatus: same treatment on the periodic error path.

hypervisor/kubevirt:

  • Start() returns as soon as the VMIRS is created; drop the synchronous waitForVMI wait. Scheduling is the cluster's job.
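The shape of the new Start() can be sketched as follows. This is a hedged mock, assuming a stand-in createVMIRS helper; the real code talks to the Kubernetes API through the kubevirt client.

```go
package main

import "fmt"

// createVMIRS stands in for the kubevirt client call that creates the
// VMIReplicaSet (hypothetical helper for illustration).
func createVMIRS(name string) error {
	if name == "" {
		return fmt.Errorf("empty VMIRS name")
	}
	return nil
}

// startDomain models the new Start(): return as soon as the VMIRS is
// created. The old synchronous waitForVMI(vmis, nodeName, true) call is
// dropped; scheduling the VMI is the cluster's job, and domainmgr
// learns the RUNNING state later through the periodic Info() query.
func startDomain(domainName, vmisName string) error {
	if err := createVMIRS(vmisName); err != nil {
		return err
	}
	fmt.Printf("Started Kubevirt domain replicaset %s, VMI replicaset %s\n",
		domainName, vmisName)
	// No waitForVMI here: success means "VMIRS exists", not "VMI runs".
	return nil
}

func main() {
	if err := startDomain("app1", "app1-vmirs"); err != nil {
		fmt.Println("start failed:", err)
	}
}
```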

zedkube: new stuck-Pending VMI detector (pendingvmi.go)

  • If a VMI stays Pending for >3 min while its virt-launcher is Running on this node (a known kubevirt/longhorn interaction), force-delete the VMI (not the VMIRS) so the cluster re-creates it cleanly. Capped at 3 attempts per app.
  • During a 5-minute suppression window after delete, checkAppsStatus pins StatusRunning=true / ScheduledOnThisNode=true so zedmanager does not cascade DomainConfig.Activate=false and tear down the VMIRS.

domainmgr: gate VMIRS Delete on UserActivate to prevent cluster-cascade teardown in kubevirt mode
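The UserActivate gate reduces to one predicate. Again a sketch under the PR's description, with illustrative names:

```go
package main

import "fmt"

// shouldDeleteVMIRS models the domainmgr gate: in kubevirt mode the
// VMIRS is deleted only when the user explicitly deactivated the app
// (userActivate=false). A cluster-side cascade (node failover, VMI
// rescheduling) must not be able to trigger a VMIRS teardown.
func shouldDeleteVMIRS(isKubevirt, userActivate bool) bool {
	if !isKubevirt {
		return true // non-kubevirt: keep the old cleanup behavior
	}
	return !userActivate
}

func main() {
	fmt.Println(shouldDeleteVMIRS(true, true))  // user still wants it: keep VMIRS
	fmt.Println(shouldDeleteVMIRS(true, false)) // user deactivated: delete allowed
}
```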

PR dependencies

How to test and validate this PR

Run eve-k with VM apps; they should start normally. App delete should work and do the proper cleanup, and deactivate plus reactivate should start the VMI again. For the stuck-VMI handling: if we can reproduce the case where, on application start, the VM stays in Pending state for a very long time, verify that the VMI eventually comes back due to the handling in this patch.

Changelog notes

pillar: trust cluster recovery for kubevirt apps and add stuck-Pending VMI handler

PR Backports

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

For backport PRs (remove it if it's not a backport):

  • I've added a reference link to the original PR
  • PR's title follows the template

And the last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Please, check the boxes above after submitting the PR in interactive mode.

@codecov

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 0% with 60 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.11%. Comparing base (2281599) to head (9adee87).
⚠️ Report is 619 commits behind head on master.

Files with missing lines Patch % Lines
pkg/pillar/cmd/domainmgr/domainmgr.go 0.00% 58 Missing ⚠️
pkg/pillar/cmd/zedmanager/handledomainmgr.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5853      +/-   ##
==========================================
- Coverage   19.52%   17.11%   -2.42%     
==========================================
  Files          19      474     +455     
  Lines        3021    85645   +82624     
==========================================
+ Hits          590    14655   +14065     
- Misses       2310    69472   +67162     
- Partials      121     1518    +1397     

@naiming-zededa force-pushed the naiming-vmirs-delete branch 2 times, most recently from 20efefd to 8266e9f on April 25, 2026
}
logrus.Infof("Started Kubevirt domain replicaset %s, VMI replicaset %s", domainName, vmis.name)

err = waitForVMI(vmis, nodeName, true)

This part I am not clear about. What's wrong with waiting for the VM to start, even though it's the responsibility of Kubernetes?


Also, if we exit here quickly: I do remember domainmgr has only 10 retries before giving up. So if for some reason this app did not start within 30 secs or so, do we ever restart it automatically?

Contributor Author


The reasons are: first, simplification. waitForVMI cannot report anything useful at this point: an error only means the VMIRS was not created successfully, and if the VMI later becomes 'Running', domainmgr on the outside has no mechanism to know that, so there is no need to hang here with an artificial delay. Second, the outer 10 x 2-second retry normally reports success; if it does not, no harm is done, and we wait for the next Info() query to learn the running state of the app. So the launch/monitoring mechanism stays very simple.


ok

@zedi-pramodh

LGTM, but would like @andrewd-zededa to review this.


@zedi-pramodh zedi-pramodh left a comment


LGTM

Contributor

@andrewd-zededa andrewd-zededa left a comment


I like the direction in this PR to allow Kubernetes to manage the config it has been given. I have some concern around the pending-VMI cleanup and the async failover path also running alongside it.

Comment thread pkg/pillar/cmd/zedkube/pendingvmi.go Outdated
log.Errorf("virtLauncherActiveOnThisNode: list pods: %v", err)
return false
}
vlPrefix := "virt-launcher-" + appKubeName
Contributor


Can this use the constant VMIPodNamePrefix?

Contributor Author


Yes, updated.


log.Noticef("checkStuckPendingVMI: aiUUID:%s VMI:%s Pending >%v on this node with active/error virt-launcher; force-deleting (attempt %d/%d)",
appUUID, vmi.Name, pendingVMIDeleteThreshold, z.vmiDeleteCount[appUUID]+1, pendingVMIMaxDeletes)
if err := kubeapi.TryFastDeleteVmi(log, virtClient, vmi.Name); err != nil {
Contributor


Is this going to conflict with failover activities?

Contributor Author


No conflict. The two delete paths target different VMI objects: DetachOldWorkload deletes the old VMI on the failed node, while checkStuckPendingVMI would delete the new Pending VMI on this node.
Execution is serial (same event loop).
The isPodTerminating guard prevents checkStuckPendingVMI from acting during the early failover phase, when only the old terminating pod exists. If both act in the same failover window, they complement each other: both use delete-and-VMIRS-recreate as the recovery mechanism.

Contributor


I'm concerned about the interaction with the end of the failover when the new VMI is attempting to start and is waiting on the volume attachment to resolve. Can the checkStuckPendingVMI time threshold be moved higher?

Copy link
Copy Markdown
Contributor Author


OK, I have added the vmiFailoverSuppressUntil[] window to avoid touching the VMI while it is in a failover operation.

@rene
Contributor

rene commented Apr 28, 2026

@naiming-zededa , pls, rebase this PR....

pillar: trust cluster recovery for kubevirt apps and add stuck-Pending VMI handler

In kubevirt mode, transient errors during app bring-up were tearing down
the VMIReplicaSet via hyper.Delete/Cleanup, which caused 10+ minute
retry stalls and, when it hit the purge-collision bug, left the app
permanently down. Instead, let the cluster drive VMI lifecycle.

domainmgr:
 - doActivateTail: on Start failure or post-Start non-RUNNING state, skip
   Delete/Cleanup, publish State=BOOTING with Activated=false, and let
   verifyStatus promote to RUNNING once Info() reports it.
 - verifyStatus: same treatment on the periodic error path.

hypervisor/kubevirt:
 - Start() returns as soon as the VMIRS is created; drop the synchronous
   waitForVMI wait. Scheduling is the cluster's job.

zedkube: new stuck-Pending VMI detector (pendingvmi.go)
 - If a VMI stays Pending for >3 min while its virt-launcher is Running on
   this node (a known kubevirt/longhorn interaction), force-delete the VMI
   (not the VMIRS) so the cluster re-creates it cleanly. Capped at 3
   attempts per app.
 - During a 5-minute suppression window after delete, checkAppsStatus pins
   StatusRunning=true / ScheduledOnThisNode=true so zedmanager does not
   cascade DomainConfig.Activate=false and tear down the VMIRS.

 - suppress stuck-Pending VMI delete during active failover
   checkStuckPendingVMI could false-positive force-delete a new VMI that
   is legitimately Pending on the receiving node during failover — the
   same signature as the stuck-Pending bug. Fix by having
   checkAppsFailover set a per-app vmiFailoverSuppressUntil window when
   failover lands on this node; checkStuckPendingVMI skips force-delete
   while the window is live. The window self-extends each tick as long
   as the Terminating pod exists, and expires automatically once
   failover completes.

 - domainmgr: gate VMIRS Delete on UserActivate to prevent
   cluster-cascade teardown in kubevirt mode

Signed-off-by: naiming-zededa <naiming@zededa.com>
@naiming-zededa
Contributor Author

@naiming-zededa , pls, rebase this PR....

@rene rebased. thanks.

@rene rene merged commit 0a30ea9 into lf-edge:master Apr 29, 2026
39 of 46 checks passed
