Skip to content

Fix standby cache persistence checks during failover#488

Open
starrysky9959 wants to merge 1 commit into
eloqdata:rel_1_2_3_eloqkvfrom
starrysky9959:fix/candidate-standby-cce-clean-rel-1-2-3
Open

Fix standby cache persistence checks during failover#488
starrysky9959 wants to merge 1 commit into
eloqdata:rel_1_2_3_eloqkvfrom
starrysky9959:fix/candidate-standby-cce-clean-rel-1-2-3

Conversation

@starrysky9959

Copy link
Copy Markdown
Collaborator

Summary

  • treat candidate-standby as shared-storage standby when deciding whether a cce is already persistent
  • apply the same condition when discarding already-checkpointed forward messages
  • unblock cce cleanup after rolling update / failover so new reads do not stay stuck in the shard memory wait list

Problem

During rolling update with RocksDBCloud shared storage, Redis cluster failover can complete normally while the new master still inherits standby cache entries that never become reclaimable. In this state:

  • cluster nodes looks normal
  • CcShard::EnqueueWaitListIfMemoryFull() keeps enqueueing requests
  • ShardCleanCc::Execute() and CcShard::Clean() run, but free_cnt stays 0
  • new get requests cannot allocate cce and remain blocked in cc_wait_list_for_memory_

The root cause is that shared-storage persistence checks only recognized StandbyNodeTerm(), but failover traffic can populate cc map while the node is still in CandidateStandbyNodeTerm(). Those entries were treated as non-persistent and therefore never became freeable.

Validation

  • reproduced the bad state from gdb during rolling update / failover
  • observed wait list growth while clean was running and free_cnt stayed 0
  • code inspection shows the persistence and forward-discard checks were inconsistent with candidate-standby state

Notes

I did not run a full build or regression suite in this environment.

Copilot AI review requested due to automatic review settings June 8, 2026 09:13
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0ad72dd3-dabf-47a4-bab7-1017a57d7be3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates standby-cache persistence and forward-message discard logic to correctly treat candidate-standby as shared-storage standby during rolling update / failover, preventing cache entries from becoming non-reclaimable and blocking new reads.

Changes:

  • Extend shared-storage persistence checks to include CandidateStandbyNodeTerm() (not just StandbyNodeTerm()).
  • Apply the same candidate-standby condition when deciding whether to discard already-checkpointed standby forward messages.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
tx_service/src/cc/cc_entry.cpp Treat candidate-standby as shared-storage standby when determining whether CC entries are persistent based on native NG checkpoint TS.
tx_service/include/cc/object_cc_map.h Include candidate-standby in the shared-storage “already checkpointed” forward-message discard condition.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1912 to 1916
if ((Sharder::Instance().StandbyNodeTerm() >= 0 ||
Sharder::Instance().CandidateStandbyNodeTerm() >= 0) &&
Sharder::Instance().GetDataStoreHandler()->IsSharedStorage() &&
commit_ts < Sharder::Instance().NativeNodeGroupCkptTs())
{
@starrysky9959 starrysky9959 requested a review from liunyl June 8, 2026 09:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants