fix(inkless:switch): Advance HW past stale checkpoint for sealed leader after restart by jeqo · Pull Request #605 · aiven/inkless

jeqo · 2026-05-21T22:27:59Z

After a broker restart, sealed leader partitions can have their high watermark stuck at a stale value loaded from the on-disk checkpoint. Since seal() prevents any natural HW advancement (no produces, no follower fetches), consumers cannot read classic data below the seal offset.

This PR advances HW to classicToDisklessStartOffset in the post-restart leader path when the checkpointed value is below it.
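The rule described above can be sketched as a pure function. This is only an illustration, not the code in ReplicaManager.scala: the class and method names are hypothetical, and the sketch assumes the checkpointed HW of a sealed log never exceeds the seal offset.

```java
// Illustrative sketch only; names are hypothetical and not taken from the PR diff.
final class SealedLeaderHwSketch {

    /**
     * Returns the HW a sealed leader should expose after restart:
     * the seal offset when the checkpointed value is below it,
     * otherwise the checkpointed value unchanged.
     */
    static long hwAfterRestart(long checkpointedHw, long classicToDisklessStartOffset) {
        if (checkpointedHw < classicToDisklessStartOffset) {
            // Stale checkpoint: advance to the seal offset.
            return classicToDisklessStartOffset;
        }
        // Checkpoint already up to date: nothing to do.
        return checkpointedHw;
    }
}
```

With the numbers from the reproduction (seal offset 100), any checkpointed value below 100 is lifted to 100, while an up-to-date checkpoint of 100 is left untouched. These are the two cases the added unit tests are described as covering.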

Root cause

makeLeader reloads HW from the checkpoint file. The checkpoint can be stale when:

- Unclean shutdown (no final flush)
- Checkpoint interval hadn't fired since HW advanced to the seal offset
- First restart after seal commit

Because seal() is called immediately after makeLeader, HW stays stuck at the stale value: nothing remains that could advance it.
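A toy model may make the stuck state easier to see. Everything here is hypothetical: the class, its methods, and the choice to model sealing as a flag that turns the replication-driven update into a no-op. In the PR description the effect comes from produces and follower fetches no longer happening, not from a guard like this one.

```java
// Toy model of a leader log; not Kafka code, all names are hypothetical.
final class SealedLeaderModel {
    private final long logEndOffset;
    private long highWatermark;
    private boolean sealed;

    /** Mimics restart: the HW starts from whatever value the checkpoint held. */
    SealedLeaderModel(long logEndOffset, long checkpointedHw) {
        this.logEndOffset = logEndOffset;
        this.highWatermark = Math.min(checkpointedHw, logEndOffset);
    }

    void seal() {
        this.sealed = true;
    }

    /** Replication-driven HW update; returns true if the HW moved. */
    boolean onFollowerFetch(long followerLogEndOffset) {
        if (sealed) {
            return false; // assumption: a sealed log sees no further natural advancement
        }
        long candidate = Math.min(followerLogEndOffset, logEndOffset);
        if (candidate <= highWatermark) {
            return false;
        }
        highWatermark = candidate;
        return true;
    }

    long highWatermark() {
        return highWatermark;
    }
}
```

Constructing the model with a log end offset of 100 and a stale checkpoint of, say, 40, then calling `seal()`, leaves `highWatermark()` at 40 no matter how many follower fetches are simulated afterwards. The same sequence without `seal()` reaches 100.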

Reproduction (3-broker cluster)

1. Create topic with RF=2 (leader=broker1, follower=broker2)
2. Produce messages (e.g. 100 records, LEO=100)
3. Switch topic to diskless (diskless.enable=true)
4. Wait for seal to commit (classicToDisklessStartOffset=100)
5. Restart the leader broker (broker1)
6. Broker1 reclaims leadership via applyDelta -> makeLeader + seal
7. Consumer reading from offset 0 times out: HW stuck at the stale checkpointed value
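The timeout in the last step follows from consumers only being served records below the HW. A minimal sketch of that bound, with a hypothetical helper name and an assumed stale HW of 0 (the description does not state the actual stale value):

```java
// Illustrative helper; the name and the stale value used in the note below are assumptions.
final class FetchVisibilitySketch {

    /** Number of records a consumer at fetchOffset can be served, bounded by the HW. */
    static long readableRecords(long fetchOffset, long highWatermark, long logEndOffset) {
        long upperBound = Math.min(highWatermark, logEndOffset);
        return Math.max(0L, upperBound - fetchOffset);
    }
}
```

With LEO=100 and a stale HW of 0, `readableRecords(0, 0, 100)` is 0, so the consumer polls until it times out. Once the HW is advanced to the seal offset, `readableRecords(0, 100, 100)` is 100 and the classic data becomes readable.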

…after restart

Add tests for the scenario where a sealed leader partition becomes unreadable after broker restart due to a stale HW checkpoint.

## Root cause

makeLeader reloads HW from the on-disk checkpoint file. The checkpoint can be stale when:

- Unclean shutdown (no final flush)
- Checkpoint interval hadn't fired since HW advanced to the seal offset
- First restart after seal commit

Once seal() is called, no produces or follower fetches can advance HW naturally, so consumers cannot read classic data below the seal offset.

## Reproduction (3-broker cluster)

1. Create topic with RF=2 (leader=broker1, follower=broker2)
2. Produce messages (e.g. 100 records, LEO=100)
3. Switch topic to diskless (diskless.enable=true)
4. Wait for seal to commit (classicToDisklessStartOffset=100)
5. Restart the leader broker (broker1)
6. Broker1 reclaims leadership via applyDelta → makeLeader + seal
7. Consumer reading from offset 0 times out: HW is stuck at the stale checkpointed value because seal prevents advancement

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…er after restart

In the post-restart leader path of applyLocalLeadersDelta, after makeLeader + seal, advance HW to the classicToDisklessStartOffset when the checkpointed value is below it. Without this, makeLeader loads the stale on-disk checkpoint and seal() permanently prevents any natural HW advancement, leaving classic data unreadable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR fixes a post-restart edge case for classic-to-diskless switched partitions where a sealed leader can load a stale high watermark (HW) from disk and then never advance it, preventing consumers from reading the classic portion of the log up to the seal offset.

Changes:

- Advance the leader’s local log HW to classicToDisklessStartOffset after makeLeader + seal() when the checkpointed HW is below the seal offset.
- Add unit tests covering both stale-checkpoint and up-to-date-checkpoint restart scenarios for a sealed leader.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| core/src/main/scala/kafka/server/ReplicaManager.scala | On post-restart “diskless topic with local classic log” leader path, advances HW up to the seal offset when needed. |
| core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala | Adds tests asserting HW is advanced (or not) for sealed leaders depending on whether the checkpointed HW is stale. |

jeqo and others added 2 commits May 22, 2026 01:24

jeqo requested a review from giuseppelillo May 21, 2026 22:43

jeqo marked this pull request as ready for review May 21, 2026 22:43

jeqo requested a review from Copilot May 21, 2026 22:43

Copilot started reviewing on behalf of jeqo May 21, 2026 22:43

Copilot AI reviewed May 21, 2026

Comment thread core/src/main/scala/kafka/server/ReplicaManager.scala

giuseppelillo approved these changes May 22, 2026

giuseppelillo merged commit 2762ee2 into main May 22, 2026
10 checks passed

giuseppelillo deleted the jeqo/fix-fetch-on-topic-switching branch May 22, 2026 07:55
