-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add readIndex check in readyz #16792
base: main
Are you sure you want to change the base?
Conversation
} | ||
} | ||
|
||
func (s *EtcdServer) GetReadIndex(ctx context.Context) (uint64, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we re-use LinearizableReadNotify
to achieve the same semantic of readIndex
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is more clear even though there is some duplicate code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siyuanfoundation Is there any counter argument of this above comment? Could you please clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetReadIndex
and LinearizableReadNotify
sends messages to different channels, and waits on different notifiers. The former needs a index returned from the notifier, while the latter only needs the err.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, misread that the new readIndexLoop
won't wait for the applied index catch up.
Right after read index is received in linearizableReadLoop
, readIndexNotifier
would notify ReadIndex
that listens on the channel. I think that simplifies the implementation, wdyt?
In this way, we can reduce the new background go routine readIndexLoop
, avoid readMu
lock twice and don't need to add index into the notifier struct.
Please merge the e2e tests before this. |
2461ffd
to
b82cd28
Compare
@siyuanfoundation DCO check failed. Understood the corrupt alarm e2e test was authored by @serathius but you have to sign off this commit to pass the check.
|
@serathius Did you mean separating the e2e test to another PR for validating existing livez / readyz checks behaviors? ReadIndex check changes and related test case could stay in the current PR, correct? |
@siyuanfoundation Existing livez readyz e2e test cases are added. Could you please rebase on top of main and resolve the comments? |
Signed-off-by: Siyuan Zhang <sizhang@google.com>
b82cd28
to
e998d9a
Compare
{ | ||
name: "Alive ok", | ||
healthCheckURL: "/livez", | ||
readIndexError: fmt.Errorf("failed to get read index"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error is here is to test livez is ok even when server cannot get read index.
} | ||
} | ||
|
||
func (s *EtcdServer) GetReadIndex(ctx context.Context) (uint64, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, misread that the new readIndexLoop
won't wait for the applied index catch up.
Right after read index is received in linearizableReadLoop
, readIndexNotifier
would notify ReadIndex
that listens on the channel. I think that simplifies the implementation, wdyt?
In this way, we can reduce the new background go routine readIndexLoop
, avoid readMu
lock twice and don't need to add index into the notifier struct.
@chaochn47 |
} | ||
|
||
nextnr := newNotifier() | ||
s.readMu.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding a separate lock dedicated for read index so adding this loop has no impact on the linearizableReadLoop
?
nc := s.readIndexNotifier | ||
s.readMu.RUnlock() | ||
|
||
// signal linearizable loop for current notify if it hasn't been already |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// signal linearizable loop for current notify if it hasn't been already | |
// signal readIndexLoop for current notify if it hasn't been already |
// pass some values in the notifier | ||
uint64Val uint64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// pass some values in the notifier | |
uint64Val uint64 | |
readIndex uint64 |
It's an unexported pkg local struct, I think explicitly naming it is better than generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
notifier was meant to be generic. Not sure if we should add fields here.
Makes sense.
It may not be necessary if |
`[+]serializable_read ok`, | ||
`[+]data_corruption ok`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to validate other checks, just read_index
.
@@ -799,8 +794,9 @@ func (s *EtcdServer) linearizableReadLoop() { | |||
nr := s.readNotifier | |||
s.readNotifier = nextnr | |||
s.readMu.Unlock() | |||
|
|||
confirmedIndex, err := s.requestCurrentIndex(leaderChangedNotifier, requestId) | |||
ctx, cancel := context.WithTimeout(context.Background(), s.Cfg.ReqTimeout()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add timeout here?
|
||
func readIndexCheck(srv ServerHealth) func(ctx context.Context) error { | ||
return func(ctx context.Context) error { | ||
_, err := srv.GetReadIndex(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing a critical loop like linearizableReadLoop
seems like a bad idea for backport. What's wrong with directly just calling requestCurrentIndex
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just need to have a public function.
func (s *EtcdServer) RequestCurrentIndex() (uint64, error) {
requestId := s.reqIDGen.Next()
leaderChangedNotifier := s.leaderChanged.Receive()
retun s.requestCurrentIndex(leaderChangedNotifier, requestId)
}
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would problematic to use requestCurrentIndex
, because each health check would start a new instance of requestCurrentIndex
. In that function, the index responses are read from the single readStateC channel, and if there are multiple requestCurrentIndex instances running, some calls may never get to see the response.
Even though the requestId is unique, but requestCurrentIndex
ignores the response if id does not match. So if the response falls to a different requestCurrentIndex , the right owner would not see it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. requestCurrentIndex
was designed to be only called from single routine like linearizableReadLoop
. Nice catch!
Let me think about this more. Need to think on how to minimize the change to linearizableReadLoop
for a safer backport.
cc @ahrtr for their ideas
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@siyuanfoundation: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.
part of the work for #16007
cc @chaochn47 @serathius