Fix standby cache persistence checks during failover#488
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
|
There was a problem hiding this comment.
Pull request overview
This PR updates standby-cache persistence and forward-message discard logic to correctly treat candidate-standby as shared-storage standby during rolling update / failover, preventing cache entries from becoming non-reclaimable and blocking new reads.
Changes:
- Extend shared-storage persistence checks to include
CandidateStandbyNodeTerm()(not justStandbyNodeTerm()). - Apply the same candidate-standby condition when deciding whether to discard already-checkpointed standby forward messages.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
tx_service/src/cc/cc_entry.cpp |
Treat candidate-standby as shared-storage standby when determining whether CC entries are persistent based on native NG checkpoint TS. |
tx_service/include/cc/object_cc_map.h |
Include candidate-standby in the shared-storage “already checkpointed” forward-message discard condition. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if ((Sharder::Instance().StandbyNodeTerm() >= 0 || | ||
| Sharder::Instance().CandidateStandbyNodeTerm() >= 0) && | ||
| Sharder::Instance().GetDataStoreHandler()->IsSharedStorage() && | ||
| commit_ts < Sharder::Instance().NativeNodeGroupCkptTs()) | ||
| { |
Summary
Problem
During rolling update with RocksDBCloud shared storage, Redis cluster failover can complete normally while the new master still inherits standby cache entries that never become reclaimable. In this state:
cluster nodeslooks normalCcShard::EnqueueWaitListIfMemoryFull()keeps enqueueing requestsShardCleanCc::Execute()andCcShard::Clean()run, butfree_cntstays0getrequests cannot allocate cce and remain blocked incc_wait_list_for_memory_The root cause is that shared-storage persistence checks only recognized
StandbyNodeTerm(), but failover traffic can populate cc map while the node is still inCandidateStandbyNodeTerm(). Those entries were treated as non-persistent and therefore never became freeable.Validation
free_cntstayed0Notes
I did not run a full build or regression suite in this environment.