Fix and simplify broken leader election in Kubernetes cluster #5773
Merged
Conversation
Good find @milan-zededa, hope this makes our clustering more robust.
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##           master    #5773      +/-   ##
==========================================
+ Coverage   19.52%   28.34%   +8.81%
==========================================
  Files          19       18       -1
  Lines        3021     2417     -604
==========================================
+ Hits          590      685      +95
+ Misses       2310     1588     -722
- Partials      121      144      +23

☔ View full report in Codecov by Sentry.
Contributor
thanks @milan-zededa for those race condition fixes.
rene approved these changes on Apr 12, 2026
Fix 6 bugs in the original leader election state machine:

1. **Stale TLS:** `getKubeClientSet()` was called once and cached. When K3s regenerates certs during cluster join, the client became permanently broken. Now a fresh clientset is created on each election attempt.
2. **Race condition:** rapid `ConfigGetStatus` flapping (success/fail/success within milliseconds during boot) would start then immediately cancel the election. Replace separate start/stop channels with an atomic bool (`electionShouldRun`) plus a notification channel. The atomic always holds the latest desired state, so rapid updates just overwrite it and the handler always sees the final value.
3. **Retry deadlock:** after the election goroutine exited, the 5-min retry would go through `handleControllerStatusChange`, which saw `inKubeLeaderElection=true` and skipped. Retry now directly notifies the handler via `time.AfterFunc` + `notifyElection`.
4. **Unbounded acquisition:** when the election goroutine could never acquire the lease (e.g. stale TLS), it retried every 15s forever with the same broken client. Add a 5-minute acquire timeout that cancels the context if neither `OnNewLeader` nor `OnStartedLeading` has fired (suggesting a connectivity issue), so the goroutine exits and a fresh clientset can be created on retry. Once the lease is acquired, the timeout is stopped and the context remains valid for normal leader renewal.
5. **Unsynchronized timer state:** `retryTimer` and `retryTimerStarted` were accessed from both the handler goroutine and the `RunOrDie` goroutine without synchronization. Replace them with a fire-and-forget `time.AfterFunc(5min, notifyElection)`.
6. **Unsynchronized leader identity:** `leaderIdentity` (a plain string) was written from both the handler goroutine (stop branch) and the `RunOrDie` goroutine (`OnNewLeader` callback). Move all writes to the `RunOrDie` goroutine only: the `OnNewLeader` callback sets it, and the goroutine clears it on exit.

Simplify the state machine:

- Remove `SignalStartLeaderElection` and `SignalStopLeaderElection`. `handleControllerStatusChange` now simply sets `electionShouldRun` to the desired state and calls `notifyElection()`. There is no need to check the current state: duplicate signals are harmless since the handler always reads the latest atomic value.
- Move `inKubeLeaderElection` writes from the signal methods into `handleLeaderElection`, making it the sole writer. This variable now reflects the actual handler state rather than the caller's intent.
- Replace `retryTimer`/`retryTimerStarted` with a `time.AfterFunc` that calls `notifyElection()`. No shared mutable state between goroutines.
- Flatten the select in `handleLeaderElection`: with `retryTimer` removed, only one channel remains, so a plain receive replaces the select.
- On stop, cancel the election context and use a `WaitGroup` to block until the `RunOrDie` goroutine fully exits before updating state and publishing. This ensures a single consistent publish with final values instead of multiple intermediate publishes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Milan Lenco <milan@zededa.com>
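The atomic-flag-plus-notification-channel pattern from the commit message can be sketched as below. This is a minimal illustration under assumed type and field names (`leaderManager`, `notify`); only the identifiers `electionShouldRun`, `notifyElection`, and `handleControllerStatusChange` come from the PR itself:

```go
// Sketch of coalescing start/stop signals: the atomic bool always holds the
// latest desired state, and a 1-buffered channel wakes the handler at most once.
package main

import (
	"fmt"
	"sync/atomic"
)

type leaderManager struct {
	electionShouldRun atomic.Bool   // latest desired state; rapid flips just overwrite it
	notify            chan struct{} // wake-up signal for the election handler
}

func newLeaderManager() *leaderManager {
	// Buffer of 1: a pending notification is never lost, and duplicate
	// notifications collapse into a single wake-up.
	return &leaderManager{notify: make(chan struct{}, 1)}
}

// notifyElection wakes the handler without ever blocking the caller.
func (m *leaderManager) notifyElection() {
	select {
	case m.notify <- struct{}{}:
	default: // a wake-up is already pending; the handler will read the latest state
	}
}

// handleControllerStatusChange records the desired state and pokes the handler.
func (m *leaderManager) handleControllerStatusChange(connected bool) {
	m.electionShouldRun.Store(connected)
	m.notifyElection()
}

func main() {
	m := newLeaderManager()
	// Rapid flapping during boot: success/fail/success within milliseconds.
	m.handleControllerStatusChange(true)
	m.handleControllerStatusChange(false)
	m.handleControllerStatusChange(true)
	<-m.notify                              // the handler wakes once...
	fmt.Println(m.electionShouldRun.Load()) // ...and sees only the final value
}
```

Because the handler reads the atomic after draining the channel, intermediate flips are simply overwritten and duplicate signals are harmless.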
Force-pushed a275ec6 to 1a5238a
Contributor (Author)
Rebased on top of the latest master to get all the CI improvements. Please re-approve.
Description
Fix 6 bugs in the original leader election state machine:

1. **Stale TLS:** `getKubeClientSet()` was called once and cached. When K3s regenerates certs during cluster join (when converting from single-node to cluster mode), the client became permanently broken. Now a fresh clientset is created on each election attempt.
2. **Race condition:** rapid `ConfigGetStatus` flapping (success/fail/success within milliseconds during boot) would start then immediately cancel the election. Replace separate start/stop channels with an atomic bool (`electionShouldRun`) plus a notification channel. The atomic always holds the latest desired state, so rapid updates just overwrite it and the handler always sees the final value.
3. **Retry deadlock:** after the election goroutine exited, the 5-min retry would go through `handleControllerStatusChange`, which saw `inKubeLeaderElection=true` and skipped. Retry now directly notifies the handler via `time.AfterFunc` + `notifyElection`.
4. **Unbounded acquisition:** when the election goroutine could never acquire the lease (e.g. stale TLS), it retried every 15s forever with the same broken client. Add a 5-minute acquire timeout that cancels the context if neither `OnNewLeader` nor `OnStartedLeading` has fired (suggesting a connectivity issue), so the goroutine exits and a fresh clientset can be created on retry. Once the lease is acquired, the timeout is stopped and the context remains valid for normal leader renewal.
5. **Unsynchronized timer state:** `retryTimer` and `retryTimerStarted` were accessed from both the handler goroutine and the `RunOrDie` goroutine without synchronization. Replace them with a fire-and-forget `time.AfterFunc(5min, notifyElection)`.
6. **Unsynchronized leader identity:** `leaderIdentity` (a plain string) was written from both the handler goroutine (stop branch) and the `RunOrDie` goroutine (`OnNewLeader` callback). Move all writes to the `RunOrDie` goroutine only: the `OnNewLeader` callback sets it, and the goroutine clears it on exit.

Simplify the state machine:
- Remove `SignalStartLeaderElection` and `SignalStopLeaderElection`. `handleControllerStatusChange` now simply sets `electionShouldRun` to the desired state and calls `notifyElection()`. There is no need to check the current state: duplicate signals are harmless since the handler always reads the latest atomic value.
- Move `inKubeLeaderElection` writes from the signal methods into `handleLeaderElection`, making it the sole writer. This variable now reflects the actual handler state rather than the caller's intent.
- Replace `retryTimer`/`retryTimerStarted` with a `time.AfterFunc` that calls `notifyElection()`. No shared mutable state between goroutines.
- Flatten the select in `handleLeaderElection`: with `retryTimer` removed, only one channel remains, so a plain receive replaces the select.
- On stop, cancel the election context and use a `WaitGroup` to block until the `RunOrDie` goroutine fully exits before updating state and publishing. This ensures a single consistent publish with final values instead of multiple intermediate publishes.

How to test and validate this PR
All issues addressed by this PR are race conditions; they may or may not appear in testing. They are far more likely to surface in a virtualized environment, where timing is slower and less predictable than on bare metal.
During my testing of EVE-k cluster establishment with EVE nodes running as VMs, I observed that after the Kubernetes cluster was formed, nodes often failed to elect a leader. As a result, `ZInfoKubeCluster` was never published (one of the responsibilities of the elected leader).

I recommend testing EVE-k cluster establishment repeatedly and verifying that `ZInfoKubeCluster` is eventually published with all nodes reported as ready.
I also plan to release a new testing framework (aka eden 2.0) that supports EVE-k clusters and makes it easy to repeatedly test cluster establishment in a virtualized environment.
Changelog notes
- Fixed multiple bugs (stale TLS clientset, start/stop races, retry deadlock, unbounded lease acquisition) in the Kubernetes leader election state machine.
- Simplified the state machine by replacing signal methods and shared mutable state with an atomic bool and fire-and-forget timers.
PR Backports
Checklist