
Fix and simplify broken leader election in Kubernetes cluster #5773

Merged
rene merged 1 commit into lf-edge:master from milan-zededa:fix-leader-election on Apr 14, 2026

Conversation

@milan-zededa
Contributor

@milan-zededa milan-zededa commented Apr 8, 2026

Description

Fix 6 bugs in the original leader election state machine:

  1. Stale TLS: getKubeClientSet() was called once and cached. When K3s regenerates certs during cluster join (when converting from single-node to cluster mode), the client became permanently broken. Now a fresh clientset is created on each election attempt.

  2. Race condition: rapid ConfigGetStatus flapping (success/fail/success within milliseconds during boot) would start then immediately cancel the election. Replace separate start/stop channels with an atomic bool (electionShouldRun) plus a notification channel. The atomic always holds the latest desired state, so rapid updates just overwrite it and the handler always sees the final value.

  3. Retry deadlock: after the election goroutine exited, the 5-min retry would go through handleControllerStatusChange which saw inKubeLeaderElection=true and skipped. Retry now directly notifies the handler via time.AfterFunc + notifyElection.

  4. Unbounded acquisition: when the election goroutine could never acquire the lease (e.g. stale TLS), it retried every 15s forever with the same broken client. Add a 5-minute acquire timeout that cancels the context if neither OnNewLeader nor OnStartedLeading have fired (suggesting a connectivity issue), so the goroutine exits and a fresh clientset can be created on retry. Once the lease is acquired, the timeout is stopped and the context remains valid for normal leader renewal.

  5. Unsynchronized timer state: retryTimer and retryTimerStarted were accessed from both the handler goroutine and the RunOrDie goroutine without synchronization. Replace with fire-and-forget time.AfterFunc(5min, notifyElection).

  6. Data race on leaderIdentity: leaderIdentity (a plain string) was written from both the handler goroutine (stop branch) and the RunOrDie goroutine (OnNewLeader callback). Move all writes to the RunOrDie goroutine only: the OnNewLeader callback sets it, and the goroutine clears it on exit.

Simplify the state machine:

  • Remove SignalStartLeaderElection and SignalStopLeaderElection. handleControllerStatusChange now simply sets electionShouldRun to the desired state and calls notifyElection(). No need to check current state — duplicate signals are harmless since the handler always reads the latest atomic value.

  • Move inKubeLeaderElection writes from the signal methods into handleLeaderElection, making it the sole writer. This variable now reflects the actual handler state rather than the caller's intent.

  • Replace retryTimer/retryTimerStarted with time.AfterFunc that calls notifyElection(). No shared mutable state between goroutines.

  • Flatten the select in handleLeaderElection: with retryTimer removed, only one channel remains, so a plain receive replaces the select.

  • On stop, cancel the election context and use a WaitGroup to block until the RunOrDie goroutine fully exits before updating state and publishing. This ensures a single consistent publish with final values instead of multiple intermediate publishes.

How to test and validate this PR

All issues addressed by this PR are race conditions; they may or may not appear in testing. They are far more likely to surface in a virtualized environment, where timing is slower and less predictable than on bare metal.

During my testing of EVE-k cluster establishment with EVE nodes running as VMs, I observed that after the Kubernetes cluster was formed, nodes often failed to elect a leader. As a result, ZInfoKubeCluster was never published (one of the responsibilities of the elected leader).

I recommend testing EVE-k cluster establishment repeatedly and verifying that ZInfoKubeCluster is eventually published with all nodes reported as ready.

I also plan to release a new testing framework (aka eden 2.0) that supports EVE-k clusters and makes it easy to repeatedly test cluster establishment in a virtualized environment.

Changelog notes

Fixed multiple bugs in the Kubernetes leader election state machine:

  • stale TLS clients after cert rotation
  • a race condition from rapid config flapping during boot
  • retry deadlock after election goroutine exit
  • unbounded retry with a broken client
  • unsynchronized timer access across goroutines
  • a data race on leader identity

Simplified the state machine by replacing signal methods and shared mutable state with an atomic bool and fire-and-forget timers.

PR Backports

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR
  • I've checked the boxes above, or I've provided a good reason why I didn't check them.

@milan-zededa milan-zededa self-assigned this Apr 8, 2026
@milan-zededa milan-zededa added the bug (Something isn't working) and stable (Should be backported to stable release(s)) labels Apr 8, 2026
@github-actions github-actions Bot requested a review from eriknordmark April 8, 2026 14:08
@zedi-pramodh

Good find @milan-zededa, hope this makes our clustering more robust.

@codecov

codecov Bot commented Apr 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 28.34%. Comparing base (2281599) to head (1a5238a).
⚠️ Report is 489 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5773      +/-   ##
==========================================
+ Coverage   19.52%   28.34%   +8.81%     
==========================================
  Files          19       18       -1     
  Lines        3021     2417     -604     
==========================================
+ Hits          590      685      +95     
+ Misses       2310     1588     -722     
- Partials      121      144      +23     


Contributor

@naiming-zededa naiming-zededa left a comment


LGTM

@naiming-zededa
Contributor

thanks @milan-zededa for those race condition fixes.

@milan-zededa
Contributor Author

Rebased on top of the latest master to get all the CI improvements. Please re-approve.

@rene rene merged commit 1215ac7 into lf-edge:master Apr 14, 2026
59 of 65 checks passed