Fix standby cache persistence checks during failover by starrysky9959 · Pull Request #488 · eloqdata/tx_service

starrysky9959 · 2026-06-08T09:13:41Z

Summary

treat candidate-standby as shared-storage standby when deciding whether a cce is already persistent
apply the same condition when discarding already-checkpointed forward messages
unblock cce cleanup after rolling update / failover so new reads do not stay stuck in the shard memory wait list

Problem

During rolling update with RocksDBCloud shared storage, Redis cluster failover can complete normally while the new master still inherits standby cache entries that never become reclaimable. In this state:

cluster nodes looks normal
CcShard::EnqueueWaitListIfMemoryFull() keeps enqueueing requests
ShardCleanCc::Execute() and CcShard::Clean() run, but free_cnt stays 0
new get requests cannot allocate cce and remain blocked in cc_wait_list_for_memory_

The root cause is that shared-storage persistence checks only recognized StandbyNodeTerm(), but failover traffic can populate cc map while the node is still in CandidateStandbyNodeTerm(). Those entries were treated as non-persistent and therefore never became freeable.

Validation

reproduced the bad state from gdb during rolling update / failover
observed wait list growth while clean was running and free_cnt stayed 0
code inspection shows the persistence and forward-discard checks were inconsistent with candidate-standby state

Notes

I did not run a full build or regression suite in this environment.

coderabbitai · 2026-06-08T09:13:51Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0ad72dd3-dabf-47a4-bab7-1017a57d7be3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

CLAassistant · 2026-06-08T09:13:54Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

Copilot

Pull request overview

This PR updates standby-cache persistence and forward-message discard logic to correctly treat candidate-standby as shared-storage standby during rolling update / failover, preventing cache entries from becoming non-reclaimable and blocking new reads.

Changes:

Extend shared-storage persistence checks to include CandidateStandbyNodeTerm() (not just StandbyNodeTerm()).
Apply the same candidate-standby condition when deciding whether to discard already-checkpointed standby forward messages.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
`tx_service/src/cc/cc_entry.cpp`	Treat candidate-standby as shared-storage standby when determining whether CC entries are persistent based on native NG checkpoint TS.
`tx_service/include/cc/object_cc_map.h`	Include candidate-standby in the shared-storage “already checkpointed” forward-message discard condition.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        if ((Sharder::Instance().StandbyNodeTerm() >= 0 ||
+             Sharder::Instance().CandidateStandbyNodeTerm() >= 0) &&
            Sharder::Instance().GetDataStoreHandler()->IsSharedStorage() &&
            commit_ts < Sharder::Instance().NativeNodeGroupCkptTs())
        {


Fix standby cache persistence checks during failover

d7a1183

Copilot AI review requested due to automatic review settings June 8, 2026 09:13

Copilot started reviewing on behalf of starrysky9959 June 8, 2026 09:13 View session

Copilot AI reviewed Jun 8, 2026

View reviewed changes

starrysky9959 requested a review from liunyl June 8, 2026 09:18

liunyl approved these changes Jun 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix standby cache persistence checks during failover#488

Fix standby cache persistence checks during failover#488
starrysky9959 wants to merge 1 commit into
eloqdata:rel_1_2_3_eloqkvfrom
starrysky9959:fix/candidate-standby-cce-clean-rel-1-2-3

starrysky9959 commented Jun 8, 2026

Uh oh!

coderabbitai Bot commented Jun 8, 2026

Review skipped

Uh oh!

CLAassistant commented Jun 8, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

starrysky9959 commented Jun 8, 2026

Summary

Problem

Validation

Notes

Uh oh!

coderabbitai Bot commented Jun 8, 2026

Review skipped

Uh oh!

CLAassistant commented Jun 8, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants