
Fix and simplify broken leader election in Kubernetes cluster #5773

Merged
rene merged 1 commit into lf-edge:master from milan-zededa:fix-leader-election on Apr 14, 2026

Conversation

@milan-zededa
Contributor

@milan-zededa milan-zededa commented Apr 8, 2026

Description

Fix 6 bugs in the original leader election state machine:

  1. Stale TLS: getKubeClientSet() was called once and cached. When K3s regenerates certs during cluster join (when converting from single-node to cluster mode), the client became permanently broken. Now a fresh clientset is created on each election attempt.

  2. Race condition: rapid ConfigGetStatus flapping (success/fail/success within milliseconds during boot) would start then immediately cancel the election. Replace separate start/stop channels with an atomic bool (electionShouldRun) plus a notification channel. The atomic always holds the latest desired state, so rapid updates just overwrite it and the handler always sees the final value.

  3. Retry deadlock: after the election goroutine exited, the 5-min retry would go through handleControllerStatusChange which saw inKubeLeaderElection=true and skipped. Retry now directly notifies the handler via time.AfterFunc + notifyElection.

  4. Unbounded acquisition: when the election goroutine could never acquire the lease (e.g. stale TLS), it retried every 15s forever with the same broken client. Add a 5-minute acquire timeout that cancels the context if neither OnNewLeader nor OnStartedLeading have fired (suggesting a connectivity issue), so the goroutine exits and a fresh clientset can be created on retry. Once the lease is acquired, the timeout is stopped and the context remains valid for normal leader renewal.

  5. Unsynchronized timer state: retryTimer and retryTimerStarted were accessed from both the handler goroutine and the RunOrDie goroutine without synchronization. Replace with fire-and-forget time.AfterFunc(5min, notifyElection).

  6. Data race on leaderIdentity: leaderIdentity (a plain string) was written from both the handler goroutine (stop branch) and the RunOrDie goroutine (OnNewLeader callback). Move all writes to the RunOrDie goroutine only: the OnNewLeader callback sets it, and the goroutine clears it on exit.

Simplify the state machine:

  • Remove SignalStartLeaderElection and SignalStopLeaderElection. handleControllerStatusChange now simply sets electionShouldRun to the desired state and calls notifyElection(). No need to check current state — duplicate signals are harmless since the handler always reads the latest atomic value.

  • Move inKubeLeaderElection writes from the signal methods into handleLeaderElection, making it the sole writer. This variable now reflects the actual handler state rather than the caller's intent.

  • Replace retryTimer/retryTimerStarted with time.AfterFunc that calls notifyElection(). No shared mutable state between goroutines.

  • Flatten the select in handleLeaderElection: with retryTimer removed, only one channel remains, so a plain receive replaces the select.

  • On stop, cancel the election context and use a WaitGroup to block until the RunOrDie goroutine fully exits before updating state and publishing. This ensures a single consistent publish with final values instead of multiple intermediate publishes.

How to test and validate this PR

All issues addressed by this PR are race conditions; they may or may not appear in testing. They are far more likely to surface in a virtualized environment, where timing is slower and less predictable than on bare metal.

During my testing of EVE-k cluster establishment with EVE nodes running as VMs, I observed that after the Kubernetes cluster was formed, nodes often failed to elect a leader. As a result, ZInfoKubeCluster was never published (one of the responsibilities of the elected leader).

I recommend testing EVE-k cluster establishment repeatedly and verifying that ZInfoKubeCluster is eventually published with all nodes reported as ready.

I also plan to release a new testing framework (aka eden 2.0) that supports EVE-k clusters and makes it easy to repeatedly test cluster establishment in a virtualized environment.

Changelog notes

Fixed multiple bugs in the Kubernetes leader election state machine:

  • stale TLS clients after cert rotation
  • a race condition from rapid config flapping during boot
  • retry deadlock after election goroutine exit
  • unbounded retry with a broken client
  • unsynchronized timer access across goroutines
  • a data race on leader identity

Simplified the state machine by replacing signal methods and shared mutable state with an atomic bool and fire-and-forget timers.

PR Backports

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR
  • I've checked the boxes above, or I've provided a good reason why I didn't check them.

@milan-zededa milan-zededa self-assigned this Apr 8, 2026
@milan-zededa milan-zededa added the bug (Something isn't working) and stable (Should be backported to stable release(s)) labels Apr 8, 2026
@github-actions github-actions Bot requested a review from eriknordmark April 8, 2026 14:08
@zedi-pramodh

Good find @milan-zededa, hope this makes our clustering more robust.

@codecov

codecov Bot commented Apr 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 28.34%. Comparing base (2281599) to head (1a5238a).
⚠️ Report is 489 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5773      +/-   ##
==========================================
+ Coverage   19.52%   28.34%   +8.81%     
==========================================
  Files          19       18       -1     
  Lines        3021     2417     -604     
==========================================
+ Hits          590      685      +95     
+ Misses       2310     1588     -722     
- Partials      121      144      +23     


Contributor

@naiming-zededa naiming-zededa left a comment


LGTM

@naiming-zededa
Contributor

thanks @milan-zededa for those race condition fixes.

@milan-zededa
Contributor Author

Rebased on top of the latest master to get all the CI improvements. Please re-approve.

@rene rene merged commit 1215ac7 into lf-edge:master Apr 14, 2026
59 of 65 checks passed