eve-k: skip pod eviction when cluster-wide simultaneous drain is detected #5804

Merged: rene merged 1 commit into lf-edge:master from andrewd-zededa:eve-k-cluster-reboot on Apr 28, 2026

Conversation

@andrewd-zededa (Contributor) commented Apr 15, 2026

Description

When all cluster nodes receive a drain-triggering config (reboot/upgrade)
within a short window of each other, every node cordons itself concurrently.
In this scenario pod eviction is futile: evicted pods have no schedulable
target, and distributed storage services (e.g. Longhorn instance-managers)
are evicted from every node simultaneously, deadlocking volume detachment.

After cordoning, zedkube now polls for schedulable peer nodes over a
configurable detection window. If all peers are already unschedulable, pod
eviction is skipped and the drain transitions directly to COMPLETE. The same
check is repeated after any drain failure, so a mid-drain cluster-wide
cordon is also handled without unnecessary retries.
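
For illustration, here is a minimal sketch of the peer-schedulability check described above, assuming a k8s.io/client-go clientset; the package, function name, and signature are hypothetical and not the actual zedkube code:

```go
package drainsketch // illustrative only, not part of pillar

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// anySchedulablePeer reports whether any node other than this one could
// still accept evicted pods, i.e. is not cordoned (Spec.Unschedulable).
func anySchedulablePeer(ctx context.Context, cs kubernetes.Interface, self string) (bool, error) {
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, n := range nodes.Items {
		if n.Name == self {
			continue // ignore the node we just cordoned
		}
		if !n.Spec.Unschedulable {
			return true, nil // at least one peer can host evicted pods
		}
	}
	return false, nil // every peer is cordoned: eviction would be futile
}
```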

The detection window is kubernetes.drain.allnodes.config.multiple *
timer.config.interval, making it tunable for clusters whose nodes poll
config at widely staggered intervals. The new config property
kubernetes.drain.allnodes.config.multiple (default 2, min 1, max 1000)
replaces the previous kubernetes.cluster.widedetect.window.multiple name.
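
A rough sketch of how the detection window and skip decision could fit together with the helper above (add "time" to the imports); the constant values, poll interval, and function names are assumptions, not the actual pillar config accessors or the real zedkube state machine:

```go
// Placeholder values standing in for the two config properties named above.
const (
	drainAllNodesConfigMultiple = 2                // kubernetes.drain.allnodes.config.multiple (default)
	configInterval              = 60 * time.Second // timer.config.interval (example value)
	peerPollInterval            = 15 * time.Second // how often to re-check peers (assumption)
)

// clusterWideDrainDetected polls peers for up to the detection window.
// Returning true means every peer is already cordoned, so the caller can
// skip pod eviction and move the drain straight to COMPLETE; false means a
// schedulable peer remained, so the normal eviction path proceeds.
func clusterWideDrainDetected(ctx context.Context, cs kubernetes.Interface, self string) bool {
	deadline := time.Now().Add(drainAllNodesConfigMultiple * configInterval)
	for time.Now().Before(deadline) {
		schedulable, err := anySchedulablePeer(ctx, cs, self)
		if err == nil && !schedulable {
			return true
		}
		time.Sleep(peerPollInterval)
	}
	return false
}
```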

PR dependencies

None

How to test and validate this PR

  • Create a three-node eve-k cluster.
  • Initiate a reboot of all three nodes simultaneously.
  • Observe that all three nodes complete their reboots.

Changelog notes

Support simultaneous reboots of all nodes in a cluster.

PR Backports

- 16.0-stable: Yes
- 14.5-stable: No, as the feature is not available there.
- 13.4-stable: No, as the feature is not available there.

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

And last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Please check the boxes above after submitting the PR in interactive mode.

@codecov bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 17.14%. Comparing base (2281599) to head (91bbb6e).
⚠️ Report is 612 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5804      +/-   ##
==========================================
- Coverage   19.52%   17.14%   -2.39%     
==========================================
  Files          19      474     +455     
  Lines        3021    85509   +82488     
==========================================
+ Hits          590    14658   +14068     
- Misses       2310    69334   +67024     
- Partials      121     1517    +1396     

☔ View full report in Codecov by Sentry.

Comment thread pkg/pillar/cmd/zedkube/zedkube.go
github-actions bot requested a review from zedi-pramodh on April 17, 2026 18:20
andrewd-zededa changed the title from "fix(zedkube): skip pod drain when all cluster nodes are cordoned" to "eve-k: skip pod eviction when cluster-wide simultaneous drain is detected" on Apr 20, 2026
@andrewd-zededa (Contributor, Author):

Can someone please add the stable label?

andrewd-zededa marked this pull request as ready for review on April 21, 2026 18:26
@andrewd-zededa (Contributor, Author):

Fixed the global_test failure; watching for remaining CI failures next.

@andrewd-zededa (Contributor, Author):

=== Failed
=== FAIL: agentlog TestManageStatFileSizeTargetSizesAvgThreshold (7.79s)
agentlog_test.go:1061: Original file size: 6793934
agentlog_test.go:1083:
Error Trace: /pillar/agentlog/agentlog_test.go:1083
Error: Should be true
Test: TestManageStatFileSizeTargetSizesAvgThreshold
Messages: expected duration to be less than 1 second, but got 1.025161146s
agentlog_test.go:1084: Duration: 1.025161146s
agentlog_test.go:1085: AvgSize: 78
agentlog_test.go:1092: Resulting file size: 6521957

DONE 539 tests, 14 skipped, 1 failure in 530.078s
make[1]: *** [Makefile:99: test] Error 1

@OhmSpectator Can we make this test more tolerant of timing? It's only ~2% over the threshold.

@OhmSpectator (Member):

> Can we make this test more tolerant of timing? It's only ~2% over the threshold.

Originally the time needed for this operation was well under 1 second; 1 second was chosen as a very generous threshold. So I'm curious what changed in the system that it no longer fits... But I certainly don't mind if you increase it.
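
For what it's worth, a sketch of how that assertion could be loosened (illustrative only; the helper name, package, and the 2-second bound are assumptions, not the actual agentlog_test.go change):

```go
package agentlog_test // illustrative, not the real test file

import (
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
)

// checkTrimDuration keeps a sanity bound on how long stat-file trimming may
// take, but with enough headroom to absorb the ~2% CI overshoot seen above.
func checkTrimDuration(t *testing.T, elapsed time.Duration) {
	assert.Lessf(t, elapsed, 2*time.Second,
		"expected stat file trimming to finish quickly, got %v", elapsed)
}
```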

@andrewd-zededa (Contributor, Author):

/rerun red

@andrewd-zededa (Contributor, Author):

rebased on latest master

@andrewd-zededa (Contributor, Author):

rebased on latest master

Comment thread pkg/pillar/cmd/zedkube/drain.go

@zedi-pramodh left a comment:

I have one comment. Other than that LGTM.

rene added the "stable: Should be backported to stable release(s)" label on Apr 27, 2026

Commit: eve-k: skip pod eviction when cluster-wide simultaneous drain is detected
(The commit message body repeats the PR description above.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Andrew Durbin <andrewd@zededa.com>
@andrewd-zededa (Contributor, Author):

rebased on latest master

github-actions bot requested a review from zedi-pramodh on April 27, 2026 19:42

@zedi-pramodh left a comment:

LGTM

rene merged commit 3c01a05 into lf-edge:master on Apr 28, 2026
40 of 46 checks passed

Labels

stable: Should be backported to stable release(s)

4 participants