eve-k: skip pod eviction when cluster-wide simultaneous drain is detected #5804

Merged: rene merged 1 commit into lf-edge:master from andrewd-zededa:eve-k-cluster-reboot on Apr 28, 2026

Conversation

@andrewd-zededa (Contributor) commented Apr 15, 2026

Description

When all cluster nodes receive a drain-triggering config (reboot/upgrade)
within a short window of each other, every node cordons itself concurrently.
In this scenario pod eviction is futile: evicted pods have no schedulable
target, and distributed storage services (e.g. Longhorn instance-managers)
are evicted from every node simultaneously, deadlocking volume detachment.

After cordoning, zedkube now polls for schedulable peer nodes over a
configurable detection window. If all peers are already unschedulable, pod
eviction is skipped and the drain transitions directly to COMPLETE. The same
check is repeated after any drain failure, so a mid-drain cluster-wide
cordon is also handled without unnecessary retries.
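
For illustration, here is a minimal sketch of the peer-schedulability check described above, assuming a k8s.io/client-go clientset; the package, function name, and signature are hypothetical and not the actual zedkube code:

```go
package drainsketch // illustrative only, not part of pillar

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// anySchedulablePeer reports whether any node other than this one could
// still accept evicted pods, i.e. is not cordoned (Spec.Unschedulable).
func anySchedulablePeer(ctx context.Context, cs kubernetes.Interface, self string) (bool, error) {
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, n := range nodes.Items {
		if n.Name == self {
			continue // ignore the node we just cordoned
		}
		if !n.Spec.Unschedulable {
			return true, nil // at least one peer can host evicted pods
		}
	}
	return false, nil // every peer is cordoned: eviction would be futile
}
```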

The detection window is kubernetes.drain.allnodes.config.multiple *
timer.config.interval, making it tunable for clusters whose nodes poll
config at widely staggered intervals. The new config property
kubernetes.drain.allnodes.config.multiple (default 2, min 1, max 1000)
replaces the previous kubernetes.cluster.widedetect.window.multiple name.
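
A rough sketch of how the detection window and skip decision could fit together with the helper above (add "time" to the imports); the constant values, poll interval, and function names are assumptions, not the actual pillar config accessors or the real zedkube state machine:

```go
// Placeholder values standing in for the two config properties named above.
const (
	drainAllNodesConfigMultiple = 2                // kubernetes.drain.allnodes.config.multiple (default)
	configInterval              = 60 * time.Second // timer.config.interval (example value)
	peerPollInterval            = 15 * time.Second // how often to re-check peers (assumption)
)

// clusterWideDrainDetected polls peers for up to the detection window.
// Returning true means every peer is already cordoned, so the caller can
// skip pod eviction and move the drain straight to COMPLETE; false means a
// schedulable peer remained, so the normal eviction path proceeds.
func clusterWideDrainDetected(ctx context.Context, cs kubernetes.Interface, self string) bool {
	deadline := time.Now().Add(drainAllNodesConfigMultiple * configInterval)
	for time.Now().Before(deadline) {
		schedulable, err := anySchedulablePeer(ctx, cs, self)
		if err == nil && !schedulable {
			return true
		}
		time.Sleep(peerPollInterval)
	}
	return false
}
```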

PR dependencies

None

How to test and validate this PR

  • Create a three-node eve-k cluster.
  • Initiate a reboot of all three nodes simultaneously.
  • Observe that all three nodes complete their reboots.

Changelog notes

Support simultaneous reboots of all nodes in a cluster.

PR Backports

- 16.0-stable: Yes
- 14.5-stable: No, as the feature is not available there.
- 13.4-stable: No, as the feature is not available there.

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

And last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Please check the boxes above after submitting the PR in interactive mode.

@codecov bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 17.14%. Comparing base (2281599) to head (91bbb6e).
⚠️ Report is 612 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5804      +/-   ##
==========================================
- Coverage   19.52%   17.14%   -2.39%     
==========================================
  Files          19      474     +455     
  Lines        3021    85509   +82488     
==========================================
+ Hits          590    14658   +14068     
- Misses       2310    69334   +67024     
- Partials      121     1517    +1396     

☔ View full report in Codecov by Sentry.

Comment thread pkg/pillar/cmd/zedkube/zedkube.go
github-actions bot requested a review from zedi-pramodh on April 17, 2026 18:20
andrewd-zededa changed the title from "fix(zedkube): skip pod drain when all cluster nodes are cordoned" to "eve-k: skip pod eviction when cluster-wide simultaneous drain is detected" on Apr 20, 2026
@andrewd-zededa (Contributor, Author):

Can someone please add the stable label?

andrewd-zededa marked this pull request as ready for review on April 21, 2026 18:26
@andrewd-zededa (Contributor, Author):

Fixed the global_test failure; watching for remaining CI failures next.

@andrewd-zededa (Contributor, Author):

=== Failed
=== FAIL: agentlog TestManageStatFileSizeTargetSizesAvgThreshold (7.79s)
agentlog_test.go:1061: Original file size: 6793934
agentlog_test.go:1083:
Error Trace: /pillar/agentlog/agentlog_test.go:1083
Error: Should be true
Test: TestManageStatFileSizeTargetSizesAvgThreshold
Messages: expected duration to be less than 1 second, but got 1.025161146s
agentlog_test.go:1084: Duration: 1.025161146s
agentlog_test.go:1085: AvgSize: 78
agentlog_test.go:1092: Resulting file size: 6521957

DONE 539 tests, 14 skipped, 1 failure in 530.078s
make[1]: *** [Makefile:99: test] Error 1

@OhmSpectator Can we make this test more tolerant of timing? It's only ~2% over the threshold.

@OhmSpectator (Member):

> Can we make this test more tolerant of timing? It's only ~2% over the threshold.

Originally the time needed for this operation was well under 1 second; 1 second was chosen as a very generous threshold. So I'm curious what changed in the system that it no longer fits... But I certainly don't mind if you increase it.
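
For what it's worth, a sketch of how that assertion could be loosened (illustrative only; the helper name, package, and the 2-second bound are assumptions, not the actual agentlog_test.go change):

```go
package agentlog_test // illustrative, not the real test file

import (
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
)

// checkTrimDuration keeps a sanity bound on how long stat-file trimming may
// take, but with enough headroom to absorb the ~2% CI overshoot seen above.
func checkTrimDuration(t *testing.T, elapsed time.Duration) {
	assert.Lessf(t, elapsed, 2*time.Second,
		"expected stat file trimming to finish quickly, got %v", elapsed)
}
```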

@andrewd-zededa (Contributor, Author):

/rerun red

@andrewd-zededa (Contributor, Author):

rebased on latest master

@andrewd-zededa (Contributor, Author):

rebased on latest master

Comment thread pkg/pillar/cmd/zedkube/drain.go

@zedi-pramodh left a comment:

I have one comment. Other than that LGTM.

rene added the "stable: Should be backported to stable release(s)" label on Apr 27, 2026

Commit: eve-k: skip pod eviction when cluster-wide simultaneous drain is detected
(The commit message body repeats the PR description above.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Andrew Durbin <andrewd@zededa.com>
@andrewd-zededa (Contributor, Author):

rebased on latest master

github-actions bot requested a review from zedi-pramodh on April 27, 2026 19:42

@zedi-pramodh left a comment:

LGTM

rene merged commit 3c01a05 into lf-edge:master on Apr 28, 2026
40 of 46 checks passed

Labels

stable: Should be backported to stable release(s)

4 participants