Watch dropping an event when compacting on delete #18089
cc @MadhavJivrajani @siyuanfoundation @fuweid Can someone take a look and double check my findings?
/assign
I reran the report on the latest version of #17833 and it found that linearization was broken. Looks like a false positive due to some bug.
Ok, I encountered another case. Failpoint used:
Report: 1717749910517706368.tar.gz. Found when adding compaction to operations: serathius@5959110. Logs showing the broken watch guarantee due to the missing event:
Screenshot showing a transaction for revision 141 and the fact that key "key6" was present, so the delete really happened. Watch response showing the missing delete event. Complete log:
@fuweid I think this is related to #17780. See the following logs:
Looks like we lost the first delete after restoring from the "last compact revision".
Yes. Let me try to reproduce it locally. Will update later.
I managed to reproduce it without a multi-op TXN, just a single delete request. This extends the impact to Kubernetes. Again, the common thing is that etcd misses a watch event even though it restores the kvstore at a revision before the delete. In the last repro:
etcd bootstrap log:
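To make the scenario concrete, here is a hypothetical client-side sketch of the sequence described above, using the clientv3 API. The endpoint and key name are assumptions, and the crash/restart of etcd between the compaction and the watch is not shown; this is an illustration, not the robustness-test code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // assumed endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Write and then delete a key; the delete creates a tombstone revision.
	putResp, err := cli.Put(ctx, "key6", "v")
	if err != nil {
		panic(err)
	}
	delResp, err := cli.Delete(ctx, "key6")
	if err != nil {
		panic(err)
	}

	// Compact up to the tombstone revision. In the reported failure, etcd is
	// also crashed and restarted around this point, which is not shown here.
	if _, err := cli.Compact(ctx, delResp.Header.Revision); err != nil {
		panic(err)
	}

	// Watch from the revision of the put, i.e. just before the delete. The
	// expectation is that the watch either delivers the DELETE event or is
	// cancelled with a compact revision; silently dropping the event is the
	// bug discussed in this issue.
	wch := cli.Watch(ctx, "key6", clientv3.WithRev(putResp.Header.Revision))
	for resp := range wch {
		if resp.CompactRevision != 0 {
			fmt.Println("watch cancelled, compacted at revision", resp.CompactRevision)
			return
		}
		for _, ev := range resp.Events {
			fmt.Printf("event: %s %s at revision %d\n", ev.Type, ev.Kv.Key, ev.Kv.ModRevision)
		}
	}
}
```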
Something similar reproduced on v3.4; however, this time it was long after the crash, and it properly flagged the broken resumable guarantee. That just means it provided an incorrect event on a newly opened watch that specified a start revision.
For this one, the tombstone of
Correct. :( Currently, the watcher isn't able to send out all the possible delete events.
Did the client side receive an ErrCompacted? If yes, then it should be expected behaviour.
No, it would be recorded in the report.
@fuweid, I don't understand the statement. A distributed system needs to uphold its guarantees even if a node is down. Events in the WAL used for replay should match what users observe. For watch, the client should observe either the full history or get ErrCompacted, as @ahrtr mentioned.
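To illustrate that guarantee from the client's point of view, here is a minimal sketch of the only two acceptable outcomes when consuming a watch. It assumes the clientv3 and rpctypes packages; the consume function and its print statements are hypothetical, not robustness-test code:

```go
package example

import (
	"errors"
	"fmt"

	"go.etcd.io/etcd/api/v3/v3rpc/rpctypes"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// consume drains a watch channel and reports which acceptable outcome occurred:
// either the full event history is delivered, or the watch is cancelled with
// ErrCompacted because the requested start revision was compacted away.
func consume(wch clientv3.WatchChan) {
	for resp := range wch {
		if err := resp.Err(); err != nil {
			if errors.Is(err, rpctypes.ErrCompacted) {
				fmt.Println("start revision compacted; compact revision:", resp.CompactRevision)
				return
			}
			fmt.Println("watch cancelled:", err)
			return
		}
		for _, ev := range resp.Events {
			// Every revision from the start revision onward must appear here,
			// including DELETE events; a silently dropped event violates the
			// guarantee described above.
			fmt.Println(ev.Type, string(ev.Kv.Key), "rev", ev.Kv.ModRevision)
		}
	}
}
```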
One thing I noticed is that the WatchChan is not broken during etcd downtime, so the time it is down is short enough that the etcd client code retries the request transparently. There might be a bug there too. Still, there is a possibility of a bug in recording client responses; it is a little more complicated to record the history of watch than of KV requests. etcd/tests/robustness/client/client.go Lines 268 to 304 in 8a0054f
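As a rough illustration of why watch history is harder to record than unary KV requests (the type and method names below are hypothetical, not the actual tests/robustness/client code): responses arrive asynchronously on the channel and have to be timestamped and appended under a lock so they can later be compared against the WAL.

```go
package example

import (
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// recordedEvent is a hypothetical record of a single watch event plus the
// local time it was received, so it can later be compared against the WAL.
type recordedEvent struct {
	Received time.Time
	Type     string
	Key      string
	Revision int64
}

// watchRecorder appends every event observed on a WatchChan. Unlike recording
// unary KV requests, the responses arrive asynchronously and may interleave
// with client-side retries, which is what makes the history harder to record.
type watchRecorder struct {
	mu     sync.Mutex
	events []recordedEvent
}

func (r *watchRecorder) record(wch clientv3.WatchChan) {
	for resp := range wch {
		now := time.Now()
		r.mu.Lock()
		for _, ev := range resp.Events {
			r.events = append(r.events, recordedEvent{
				Received: now,
				Type:     ev.Type.String(),
				Key:      string(ev.Kv.Key),
				Revision: ev.Kv.ModRevision,
			})
		}
		r.mu.Unlock()
	}
}
```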
@fuweid Please feel free to ping me on slack if you want a quick discussion on this issue. |
It's expected behaviour and by design as long as etcdserver doesn't return any error (e.g. ErrCompacted).
Added logs to ensure that the issue is not in the robustness test itself, and got: They clearly show that revision 125 was not provided to the client.
Where did you add the log? I also suggest providing a clear summary of the exact steps you took.
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Code #18145
The code to handle the compacted revision in watch was written 8 years ago, and it never handled the case discussed here: etcd/server/storage/mvcc/watcher_group.go Lines 253 to 262 in 5790774
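For context, the referenced snippet is part of the sync path that decides what each watcher in the group still needs. A simplified, illustrative version (stubbed types, not the actual mvcc code) looks roughly like this: a watcher whose next revision is below the compaction revision is cancelled with CompactRevision set, and nothing attempts to resend events that compaction already removed, which is the boundary this issue is about.

```go
package example

import "math"

// Illustrative stand-ins for the mvcc types; not the actual etcd definitions.
type WatchResponse struct {
	WatchID         int64
	CompactRevision int64
}

type watcher struct {
	id     int64
	minRev int64 // next revision this watcher still needs to receive
	ch     chan WatchResponse
}

type watcherGroup struct {
	watchers map[*watcher]struct{}
}

// chooseMinRev sketches how a sync pass might pick the lowest revision the
// group still needs while cancelling watchers that fell behind the compaction:
// a watcher whose next revision is below compactRev can no longer be served
// from history, so it is sent a response with CompactRevision set and removed.
func (wg *watcherGroup) chooseMinRev(compactRev int64) int64 {
	minRev := int64(math.MaxInt64)
	for w := range wg.watchers {
		if w.minRev < compactRev {
			select {
			case w.ch <- WatchResponse{WatchID: w.id, CompactRevision: compactRev}:
				delete(wg.watchers, w)
			default:
				// Channel is full; leave the watcher in place and retry on the
				// next sync pass.
			}
			continue
		}
		if w.minRev < minRev {
			minRev = w.minRev
		}
	}
	return minRev
}
```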
Just as a thought exercise, how can the admin of an etcd cluster be alerted about the dropped watch events issue in general (see kubernetes/kubernetes#123072)? However, this issue seems difficult to catch with alarms because just one delete event is missed. A diff of 1 could also be caused by clients restarting at different times. Trying to monitor in production and measure how many Kubernetes clusters are impacted. Any insights are much appreciated!
Excellent question! Unfortunately, there's no simple way for etcd cluster admins to be directly alerted about dropped watch events in a production environment. The robustness tests achieve this by collecting comprehensive data (requests from all clients, WAL entries, etc.) and performing in-depth analysis. This level of data collection and analysis isn't feasible in production due to the sheer volume of information involved. This highlights the importance of rigorous, continuous robustness testing with diverse traffic types and failure injection. It allows us to thoroughly analyze smaller chunks of data in a controlled environment, identifying issues that might otherwise go unnoticed in the vastness of production data. On the Kubernetes side, I have some plans to improve watch cache observability. Stay tuned!
Tried replicating #18089 (comment) in an e2e test here: #18201
Thanks @MadhavJivrajani. We need to add more tests (or update existing tests); please see all the proposed test cases mentioned in #18162 (comment). Note that I also added the test cases to the doc (under option 2). @fuweid @MadhavJivrajani Please consider a whole plan to resolve this issue. Feel free to ping me on Slack if you want any discussion.
Disable robustness test detection of #18089 to allow detecting other issues
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Revert "Disable robustness test detection of #18089 to allow detecting other issues"
This reverts commit 4fe227c.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Awesome work @fuweid!
Let's close the issue and create a dedicated issue for robustness testing.
Bug report criteria
What happened?
When testing #17833 I encountered watch breaking the reliable guarantee.
Report included
1716928418720891814.zip
What did you expect to happen?
Watch should not break the guarantee.
How can we reproduce it (as minimally and precisely as possible)?
TODO
Anything else we need to know?
No response
Etcd version (please run commands below)
v3.5
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output