fix(inkless:switch): Advance HW past stale checkpoint for sealed leader after restart #605
Merged
…after restart

Add tests for the scenario where a sealed leader partition becomes unreadable after broker restart due to a stale HW checkpoint.

## Root cause

makeLeader reloads HW from the on-disk checkpoint file. The checkpoint can be stale when:

- Unclean shutdown (no final flush)
- Checkpoint interval hadn't fired since HW advanced to the seal offset
- First restart after seal commit

Once seal() is called, no produces or follower fetches can advance HW naturally, so consumers cannot read classic data below the seal offset.

## Reproduction (3-broker cluster)

1. Create topic with RF=2 (leader=broker1, follower=broker2)
2. Produce messages (e.g. 100 records, LEO=100)
3. Switch topic to diskless (diskless.enable=true)
4. Wait for seal to commit (classicToDisklessStartOffset=100)
5. Restart the leader broker (broker1)
6. Broker1 reclaims leadership via applyDelta → makeLeader + seal
7. Consumer reading from offset 0 times out: HW is stuck at the stale checkpointed value because seal prevents advancement

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
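The failure mode in the root cause above can be illustrated with a toy model. This is only a sketch: `StaleCheckpointModel`, `SealedLog`, `restart`, `tryAppend`, and `readableUpTo` are hypothetical names invented for illustration, not Kafka or Inkless APIs, and the stale checkpoint value of 40 used in the usage note is an assumed number.

```scala
// Toy model of the stale-checkpoint scenario. All names are hypothetical
// and do not correspond to classes in Kafka or in this PR.
object StaleCheckpointModel {

  // Stand-in for a sealed classic log: it only tracks the log end offset (LEO)
  // and the high watermark (HW).
  final case class SealedLog(logEndOffset: Long, highWatermark: Long) {

    // Consumers can only read offsets below the high watermark.
    def readableUpTo: Long = math.min(highWatermark, logEndOffset)

    // A sealed log accepts no produces, so neither LEO nor HW can move.
    def tryAppend(records: Int): SealedLog = this
  }

  // On restart the LEO comes from the log itself, while the HW is reloaded
  // from the (possibly stale) checkpoint file.
  def restart(logEndOffset: Long, checkpointedHw: Long): SealedLog =
    SealedLog(logEndOffset, checkpointedHw)
}
```

With the numbers from the reproduction (LEO=100) and an assumed stale checkpoint of 40, `StaleCheckpointModel.restart(100L, 40L)` leaves offsets 40 to 99 unreadable, and no later `tryAppend` call changes that.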
…er after restart

In the post-restart leader path of applyLocalLeadersDelta, after makeLeader + seal, advance HW to the classicToDisklessStartOffset when the checkpointed value is below it. Without this, makeLeader loads the stale on-disk checkpoint and seal() permanently prevents any natural HW advancement, leaving classic data unreadable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
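The rule described in this commit can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the code added to ReplicaManager.scala: `SealedLeaderHw` and `hwAfterRestart` are hypothetical names, and the real change operates on the partition's local log rather than on plain numbers.

```scala
// Hypothetical helper illustrating the decision only; it is not the
// implementation from this PR.
object SealedLeaderHw {

  // HW to use for a sealed leader after makeLeader + seal:
  // lift a stale checkpointed value up to the seal offset
  // (classicToDisklessStartOffset), otherwise leave it unchanged.
  def hwAfterRestart(checkpointedHw: Long, sealOffset: Long): Long =
    if (checkpointedHw < sealOffset) sealOffset else checkpointedHw
}
```

The two branches correspond to the two restart scenarios the tests cover: a stale checkpoint is advanced to the seal offset, and an up-to-date checkpoint is left as is.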
Pull request overview
This PR fixes a post-restart edge case for classic-to-diskless switched partitions where a sealed leader can load a stale high watermark (HW) from disk and then never advance it, preventing consumers from reading the classic portion of the log up to the seal offset.
Changes:
- Advance the leader’s local log HW to `classicToDisklessStartOffset` after `makeLeader` + `seal()` when the checkpointed HW is below the seal offset.
- Add unit tests covering both stale-checkpoint and up-to-date-checkpoint restart scenarios for a sealed leader.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| core/src/main/scala/kafka/server/ReplicaManager.scala | On post-restart “diskless topic with local classic log” leader path, advances HW up to the seal offset when needed. |
| core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala | Adds tests asserting HW is advanced (or not) for sealed leaders depending on whether the checkpointed HW is stale. |
giuseppelillo approved these changes May 22, 2026
giuseppelillo pushed a commit that referenced this pull request May 29, 2026

…er after restart (#605)

* test(inkless:switch): Reproduce stale HW checkpoint on sealed leader after restart
* fix(inkless:switch): Advance HW past stale checkpoint for sealed leader after restart

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
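After a broker restart, sealed leader partitions can have their high watermark stuck at a stale value loaded from the on-disk checkpoint. Since `seal()` prevents any natural HW advancement (no produces, no follower fetches), consumers cannot read classic data below the seal offset.

This PR advances HW to `classicToDisklessStartOffset` in the post-restart leader path when the checkpointed value is below it.

## Root cause

`makeLeader` reloads HW from the checkpoint file. The checkpoint can be stale when:

- Unclean shutdown (no final flush)
- Checkpoint interval hadn't fired since HW advanced to the seal offset
- First restart after seal commit

Once `seal()` is called immediately after, HW is permanently stuck: no mechanism exists to advance it.

## Reproduction (3-broker cluster)

1. Create topic with RF=2 (leader=broker1, follower=broker2)
2. Produce messages (e.g. 100 records, LEO=100)
3. Switch topic to diskless (`diskless.enable=true`)
4. Wait for seal to commit (`classicToDisklessStartOffset=100`)
5. Restart the leader broker (broker1)
6. Broker1 reclaims leadership via `applyDelta` -> `makeLeader` + `seal`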