
raft loop prober with counter #16713

Draft · chaochn47 wants to merge 1 commit into main from raft-loop-prober

Conversation

chaochn47 (Member, Author) commented:

cc @ahrtr

ahrtr (Member) commented on Oct 9, 2023

This approach depends on the liveness probe's frequency/interval; the principle is that the instance is considered alive as long as there has been at least one tick event since the last liveness check.

Overall this looks good. Please feel free to reuse the test case in #16710 (it might need a minor update).

@@ -1643,6 +1644,10 @@ func (s *EtcdServer) AppliedIndex() uint64 { return s.getAppliedIndex() }

func (s *EtcdServer) Term() uint64 { return s.getTerm() }

// ProbeRaftLoopProgress probes if the etcdserver raft loop is deadlocked
// since last time the function is invoked.
func (s *EtcdServer) ProbeRaftLoopProgress() bool { return s.r.safeReadTickElapsedAndClear() != 0 }
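The helper `safeReadTickElapsedAndClear` is not shown in this excerpt; a minimal sketch of the counter it implies might look like the following (the field layout and method names here are assumptions for illustration, not the PR's actual raftNode change):

```go
package raftprobe

import "sync/atomic"

// raftNode is a stand-in for etcd's raft node wrapper.
type raftNode struct {
	// tickElapsed counts raft ticks observed since the last probe read.
	tickElapsed atomic.Int64
}

// onTick would be called from the raft loop on every tick.
func (r *raftNode) onTick() {
	r.tickElapsed.Add(1)
}

// safeReadTickElapsedAndClear returns the number of ticks since the last call
// and resets the counter, so each probe only observes progress made after the
// previous probe.
func (r *raftNode) safeReadTickElapsedAndClear() int64 {
	return r.tickElapsed.Swap(0)
}
```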
serathius (Member) commented on Oct 10, 2023

The probe response should not be determined by the frequency of probing. If someone probes etcd more frequently than proposals are happening, then this will return a failure.

I still need to think more about the best approach (maybe a document that collects and compares the approaches would help?), but this could be improved by waiting for a tick instead of failing if there were no ticks since the last probe. The implementation would be tricky with concurrency; let me know if this sounds like a good approach to you so I could help. :P
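For illustration, a rough sketch of that "wait for a tick" idea, assuming the raft loop can signal a channel on each tick (the channel, timeout, and all names below are hypothetical, not existing etcd code):

```go
package raftprobe

import "time"

// tickWaiter lets a prober block until the raft loop produces its next tick.
type tickWaiter struct {
	tickCh chan struct{} // buffered with capacity 1, signaled by the raft loop
}

func newTickWaiter() *tickWaiter {
	return &tickWaiter{tickCh: make(chan struct{}, 1)}
}

// notifyTick is a non-blocking signal from the raft loop, so a slow or absent
// prober never stalls the loop itself.
func (t *tickWaiter) notifyTick() {
	select {
	case t.tickCh <- struct{}{}:
	default:
	}
}

// waitForTick reports whether a tick arrived within the timeout, instead of
// failing just because no tick landed between two probes.
func (t *tickWaiter) waitForTick(timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-t.tickCh:
		return true
	case <-timer.C:
		return false
	}
}
```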

chaochn47 (Member, Author) replied:

> Probe response should not be determined based on frequency of probing. If someone probes etcd more frequently than proposals are happening, then this will return failure.

Thanks, in that case maybe adding a simple throttler in the etcdserver raft Node would help.

> I still need to think more about the best approach (maybe a document that collects and compares the approaches would help?), but this could be improved by waiting for a tick instead of failing if there were no ticks since the last probe.

Yeah, a small doc comparing the pros and cons should help.

(Member) commented:

I am also interested in how apiserver implements /livez and /readyz.

(Contributor) commented:

As far as I can tell, health probes in the apiserver are generally not complicated. They mostly check whether something has started or initialized, for example https://github.com/kubernetes/kubernetes/blob/c7d270302c8de3afc9d7b01c70faf3a18407ce44/staging/src/k8s.io/client-go/tools/cache/shared_informer.go#L173
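For comparison, a simplified sketch of that "has it started/initialized" style of check (this is not the actual client-go code; the real example is the HasSynced link above):

```go
package healthz

import (
	"net/http"
	"sync/atomic"
)

// readiness only tracks whether initialization (e.g. a cache sync) completed.
type readiness struct {
	synced atomic.Bool
}

// markSynced is called once after startup work finishes.
func (r *readiness) markSynced() { r.synced.Store(true) }

// ServeHTTP implements a /readyz-style endpoint that checks only the flag,
// with no deeper liveness logic.
func (r *readiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if r.synced.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	http.Error(w, "not synced", http.StatusServiceUnavailable)
}
```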

chaochn47 (Member, Author) commented on Oct 18, 2023

Added a throttler so the raft loop prober check is effective at most once every 5 seconds.
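Roughly, the throttling described above could look like the following sketch (the 5-second interval comes from the comment; the type, field, and method names are assumptions):

```go
package raftprobe

import (
	"sync"
	"time"
)

// throttledProber caches the result of the underlying check so it runs at
// most once per interval, no matter how often the health endpoint is hit.
type throttledProber struct {
	mu         sync.Mutex
	interval   time.Duration // e.g. 5 * time.Second
	lastCheck  time.Time
	lastResult bool
	check      func() bool // the underlying raft loop progress check
}

// Probe returns the cached result until the interval elapses, then re-runs the
// underlying check and refreshes the cache.
func (p *throttledProber) Probe() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if !p.lastCheck.IsZero() && time.Since(p.lastCheck) < p.interval {
		return p.lastResult
	}
	p.lastCheck = time.Now()
	p.lastResult = p.check()
	return p.lastResult
}
```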

chaochn47 (Member, Author) commented on Oct 30, 2023

As promised, I wrote a small doc to compare the pros and cons.

Could you please take a look? @ahrtr @serathius @siyuanfoundation Thanks~

(Member) commented:

Thanks. Overall this PR looks good to me; I also added comments in the doc.

chaochn47 force-pushed the raft-loop-prober branch 3 times, most recently from 6e24d08 to a109a7e on October 18, 2023 04:32
Signed-off-by: Chao Chen <chaochn@amazon.com>
jmhbnz (Member) commented on May 23, 2024

Discussed during the sig-etcd triage meeting. This is still an important PR for etcd. @chaochn47, do you have some time to rebase and continue this effort?

chaochn47 (Member, Author) replied:

Yeah, I will take a look and drive it to resolution next week.

k8s-ci-robot commented:

@chaochn47: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                  Commit   Required  Rerun command
pull-etcd-verify           d126728  true      /test pull-etcd-verify
pull-etcd-unit-test-amd64  d126728  true      /test pull-etcd-unit-test-amd64
pull-etcd-unit-test-arm64  d126728  true      /test pull-etcd-unit-test-arm64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
