
raft loop prober with counter #16713

Draft · chaochn47 wants to merge 1 commit into main from raft-loop-prober

Conversation

chaochn47 (Member, Author) commented:

cc @ahrtr

ahrtr (Member) commented on Oct 9, 2023

This approach depends on the liveness probe's frequency/interval; the principle is that the instance is considered alive as long as there has been at least one tick event since the last liveness check.

Overall this looks good. Please feel free to reuse the test case in #16710 (it might need a minor update).

@@ -1643,6 +1644,10 @@ func (s *EtcdServer) AppliedIndex() uint64 { return s.getAppliedIndex() }

func (s *EtcdServer) Term() uint64 { return s.getTerm() }

// ProbeRaftLoopProgress probes if the etcdserver raft loop is deadlocked
// since last time the function is invoked.
func (s *EtcdServer) ProbeRaftLoopProgress() bool { return s.r.safeReadTickElapsedAndClear() != 0 }
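The helper `safeReadTickElapsedAndClear` is not shown in this excerpt; a minimal sketch of the counter it implies might look like the following (the field layout and method names here are assumptions for illustration, not the PR's actual raftNode change):

```go
package raftprobe

import "sync/atomic"

// raftNode is a stand-in for etcd's raft node wrapper.
type raftNode struct {
	// tickElapsed counts raft ticks observed since the last probe read.
	tickElapsed atomic.Int64
}

// onTick would be called from the raft loop on every tick.
func (r *raftNode) onTick() {
	r.tickElapsed.Add(1)
}

// safeReadTickElapsedAndClear returns the number of ticks since the last call
// and resets the counter, so each probe only observes progress made after the
// previous probe.
func (r *raftNode) safeReadTickElapsedAndClear() int64 {
	return r.tickElapsed.Swap(0)
}
```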
serathius (Member) commented on Oct 10, 2023

The probe response should not be determined by the frequency of probing. If someone probes etcd more frequently than proposals are happening, then this will return a failure.

I still need to think more about the best approach (maybe a document that collects and compares the approaches would help?), but this could be improved by waiting for a tick instead of failing if there were no ticks since the last probe. The implementation would be tricky with concurrency; let me know if this sounds like a good approach to you so I could help. :P
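For illustration, a rough sketch of that "wait for a tick" idea, assuming the raft loop can signal a channel on each tick (the channel, timeout, and all names below are hypothetical, not existing etcd code):

```go
package raftprobe

import "time"

// tickWaiter lets a prober block until the raft loop produces its next tick.
type tickWaiter struct {
	tickCh chan struct{} // buffered with capacity 1, signaled by the raft loop
}

func newTickWaiter() *tickWaiter {
	return &tickWaiter{tickCh: make(chan struct{}, 1)}
}

// notifyTick is a non-blocking signal from the raft loop, so a slow or absent
// prober never stalls the loop itself.
func (t *tickWaiter) notifyTick() {
	select {
	case t.tickCh <- struct{}{}:
	default:
	}
}

// waitForTick reports whether a tick arrived within the timeout, instead of
// failing just because no tick landed between two probes.
func (t *tickWaiter) waitForTick(timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-t.tickCh:
		return true
	case <-timer.C:
		return false
	}
}
```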

chaochn47 (Member, Author) replied:

> Probe response should not be determined based on frequency of probing. If someone probes etcd more frequently than proposals are happening, then this will return failure.

Thanks, in that case maybe adding a simple throttler in the etcdserver raft Node would help.

> I still need to think more about the best approach (maybe a document that collects and compares the approaches would help?), but this could be improved by waiting for a tick instead of failing if there were no ticks since the last probe.

Yeah, a small doc comparing the pros and cons should help.

(Member) commented:

I am also interested in how apiserver implements /livez and /readyz.

(Contributor) commented:

As far as I can tell, health probes in the apiserver are generally not complicated. They mostly check whether something has started or initialized, for example https://github.com/kubernetes/kubernetes/blob/c7d270302c8de3afc9d7b01c70faf3a18407ce44/staging/src/k8s.io/client-go/tools/cache/shared_informer.go#L173
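For comparison, a simplified sketch of that "has it started/initialized" style of check (this is not the actual client-go code; the real example is the HasSynced link above):

```go
package healthz

import (
	"net/http"
	"sync/atomic"
)

// readiness only tracks whether initialization (e.g. a cache sync) completed.
type readiness struct {
	synced atomic.Bool
}

// markSynced is called once after startup work finishes.
func (r *readiness) markSynced() { r.synced.Store(true) }

// ServeHTTP implements a /readyz-style endpoint that checks only the flag,
// with no deeper liveness logic.
func (r *readiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if r.synced.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	http.Error(w, "not synced", http.StatusServiceUnavailable)
}
```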

chaochn47 (Member, Author) commented on Oct 18, 2023

Added a throttler so the raft loop prober check is effective at most once every 5 seconds.
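Roughly, the throttling described above could look like the following sketch (the 5-second interval comes from the comment; the type, field, and method names are assumptions):

```go
package raftprobe

import (
	"sync"
	"time"
)

// throttledProber caches the result of the underlying check so it runs at
// most once per interval, no matter how often the health endpoint is hit.
type throttledProber struct {
	mu         sync.Mutex
	interval   time.Duration // e.g. 5 * time.Second
	lastCheck  time.Time
	lastResult bool
	check      func() bool // the underlying raft loop progress check
}

// Probe returns the cached result until the interval elapses, then re-runs the
// underlying check and refreshes the cache.
func (p *throttledProber) Probe() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if !p.lastCheck.IsZero() && time.Since(p.lastCheck) < p.interval {
		return p.lastResult
	}
	p.lastCheck = time.Now()
	p.lastResult = p.check()
	return p.lastResult
}
```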

chaochn47 (Member, Author) commented on Oct 30, 2023

As promised, I wrote a small doc to compare the pros and cons.

Could you please take a look? @ahrtr @serathius @siyuanfoundation Thanks~

(Member) commented:

Thanks. Overall this PR looks good to me; I also added comments in the doc.

chaochn47 force-pushed the raft-loop-prober branch 3 times, most recently from 6e24d08 to a109a7e on October 18, 2023 04:32
Signed-off-by: Chao Chen <chaochn@amazon.com>
jmhbnz (Member) commented on May 23, 2024

Discussed during the sig-etcd triage meeting. This is still an important PR for etcd. @chaochn47, do you have some time to rebase and continue this effort?

chaochn47 (Member, Author) replied:

Yeah, I will take a look and drive it to resolution next week.

k8s-ci-robot commented:

@chaochn47: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                  Commit   Required  Rerun command
pull-etcd-verify           d126728  true      /test pull-etcd-verify
pull-etcd-unit-test-amd64  d126728  true      /test pull-etcd-unit-test-amd64
pull-etcd-unit-test-arm64  d126728  true      /test pull-etcd-unit-test-arm64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
