fix(inkless:switch): Advance HW past stale checkpoint for sealed leader after restart #605

Merged: giuseppelillo merged 2 commits into main from jeqo/fix-fetch-on-topic-switching on May 22, 2026

@jeqo jeqo commented May 21, 2026

After a broker restart, sealed leader partitions can have their high watermark stuck at a stale value loaded from the on-disk checkpoint. Since seal() prevents any natural HW advancement (no produces, no follower fetches), consumers cannot read classic data below the seal offset.

This PR advances HW to classicToDisklessStartOffset in the post-restart leader path when the checkpointed value is below it.
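
A minimal sketch of the advancement rule, assuming the fix reduces to "raise HW to the seal offset if the checkpointed value is below it, otherwise leave it alone". The object and method names (`SealedLeaderHw`, `isStale`, `advancedHighWatermark`) are hypothetical and are not taken from `ReplicaManager.scala`; only `classicToDisklessStartOffset` comes from this PR.

```scala
// Hypothetical, simplified model of the post-restart HW fix-up.
// This is NOT the actual ReplicaManager code; names are illustrative only.
object SealedLeaderHw {

  /** True when the HW loaded from the checkpoint is below the seal offset. */
  def isStale(checkpointedHw: Long, classicToDisklessStartOffset: Long): Boolean =
    checkpointedHw < classicToDisklessStartOffset

  /** HW a sealed leader should expose: raised to the seal offset if stale, unchanged otherwise. */
  def advancedHighWatermark(checkpointedHw: Long, classicToDisklessStartOffset: Long): Long =
    if (isStale(checkpointedHw, classicToDisklessStartOffset)) classicToDisklessStartOffset
    else checkpointedHw
}
```

For example, `SealedLeaderHw.advancedHighWatermark(40L, 100L)` yields `100L`, while an up-to-date checkpoint (`100L`, `100L`) is returned unchanged, so the rule is safe to apply on every restart.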

Root cause

makeLeader reloads HW from the checkpoint file. The checkpoint can be stale when:

  • Unclean shutdown (no final flush)
  • Checkpoint interval hadn't fired since HW advanced to the seal offset
  • First restart after seal commit

Because seal() is called immediately after makeLeader, HW stays stuck at the checkpointed value: with no produces and no follower fetches, no mechanism exists to advance it.
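
The stuck state can be illustrated with a toy model. This is a sketch under stated assumptions, not Kafka's `Partition` class: `ToyPartition`, `onFollowerFetch`, and the boolean seal flag are hypothetical, and follower fetches are assumed to be the only driver of HW advancement.

```scala
// Toy model (hypothetical): shows why HW stays at the checkpointed value once sealed.
final class ToyPartition(val logEndOffset: Long) {
  private var hw: Long = 0L
  private var isSealed: Boolean = false

  def highWatermark: Long = hw

  /** Assumption: on becoming leader, HW is restored from the on-disk checkpoint. */
  def makeLeader(checkpointedHw: Long): Unit = {
    hw = math.min(checkpointedHw, logEndOffset)
  }

  /** Assumption: sealing stops produces and follower fetches for the classic log. */
  def seal(): Unit = {
    isSealed = true
  }

  /** A follower fetch can only move HW forward while the partition is not sealed. */
  def onFollowerFetch(followerLogEndOffset: Long): Unit = {
    if (!isSealed) hw = math.max(hw, math.min(followerLogEndOffset, logEndOffset))
  }
}
```

With `logEndOffset = 100` and a hypothetical stale checkpoint of `40`, calling `makeLeader(40L)`, then `seal()`, then `onFollowerFetch(100L)` leaves `highWatermark` at `40`. The same fetch before `seal()` would have moved it to `100`.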

Reproduction (3-broker cluster)

  1. Create topic with RF=2 (leader=broker1, follower=broker2)
  2. Produce messages (e.g. 100 records, LEO=100)
  3. Switch topic to diskless (diskless.enable=true)
  4. Wait for seal to commit (classicToDisklessStartOffset=100)
  5. Restart the leader broker (broker1)
  6. Broker1 reclaims leadership via applyDelta -> makeLeader + seal
  7. Consumer reading from offset 0 times out: HW stuck at the stale checkpointed value
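
The consumer-side effect of step 7 can be sketched as follows, assuming consumers may only read offsets strictly below HW. `ClassicReadModel` and `readableCount` are hypothetical names, and the stale HW value used in the note below is an illustrative assumption; the PR does not state the actual checkpointed value.

```scala
// Hypothetical model of consumer visibility on the classic (local) log.
object ClassicReadModel {

  /** Number of records readable from fetchOffset, given that reads stop at the HW. */
  def readableCount(fetchOffset: Long, highWatermark: Long): Long =
    math.max(0L, highWatermark - fetchOffset)
}
```

With an assumed stale HW of `0`, `ClassicReadModel.readableCount(0L, 0L)` is `0`, so a consumer starting at offset 0 receives nothing and eventually times out. Once HW is advanced to the seal offset from step 4, `readableCount(0L, 100L)` is `100` and all classic records are readable.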

jeqo and others added 2 commits May 22, 2026 01:24
…after restart

Add tests for the scenario where a sealed leader partition becomes
unreadable after broker restart due to a stale HW checkpoint.

## Root cause

makeLeader reloads HW from the on-disk checkpoint file. The checkpoint
can be stale when:
- Unclean shutdown (no final flush)
- Checkpoint interval hadn't fired since HW advanced to the seal offset
- First restart after seal commit

Once seal() is called, no produces or follower fetches can advance HW
naturally, so consumers cannot read classic data below the seal offset.

## Reproduction (3-broker cluster)

1. Create topic with RF=2 (leader=broker1, follower=broker2)
2. Produce messages (e.g. 100 records, LEO=100)
3. Switch topic to diskless (diskless.enable=true)
4. Wait for seal to commit (classicToDisklessStartOffset=100)
5. Restart the leader broker (broker1)
6. Broker1 reclaims leadership via applyDelta → makeLeader + seal
7. Consumer reading from offset 0 times out: HW is stuck at
   the stale checkpointed value because seal prevents advancement

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…er after restart

In the post-restart leader path of applyLocalLeadersDelta, after
makeLeader + seal, advance HW to the classicToDisklessStartOffset
when the checkpointed value is below it.

Without this, makeLeader loads the stale on-disk checkpoint and seal()
permanently prevents any natural HW advancement, leaving classic data
unreadable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jeqo jeqo requested a review from giuseppelillo May 21, 2026 22:43
@jeqo jeqo marked this pull request as ready for review May 21, 2026 22:43
@jeqo jeqo requested a review from Copilot May 21, 2026 22:43

Copilot AI left a comment

Pull request overview

This PR fixes a post-restart edge case for classic-to-diskless switched partitions where a sealed leader can load a stale high watermark (HW) from disk and then never advance it, preventing consumers from reading the classic portion of the log up to the seal offset.

Changes:

  • Advance the leader’s local log HW to classicToDisklessStartOffset after makeLeader + seal() when the checkpointed HW is below the seal offset.
  • Add unit tests covering both stale-checkpoint and up-to-date-checkpoint restart scenarios for a sealed leader.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| core/src/main/scala/kafka/server/ReplicaManager.scala | On post-restart “diskless topic with local classic log” leader path, advances HW up to the seal offset when needed. |
| core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala | Adds tests asserting HW is advanced (or not) for sealed leaders depending on whether the checkpointed HW is stale. |

Comment thread core/src/main/scala/kafka/server/ReplicaManager.scala
@giuseppelillo giuseppelillo merged commit 2762ee2 into main May 22, 2026
10 checks passed
@giuseppelillo giuseppelillo deleted the jeqo/fix-fetch-on-topic-switching branch May 22, 2026 07:55
giuseppelillo pushed a commit that referenced this pull request May 29, 2026
…er after restart (#605)

* test(inkless:switch): Reproduce stale HW checkpoint on sealed leader after restart

Add tests for the scenario where a sealed leader partition becomes
unreadable after broker restart due to a stale HW checkpoint.

## Root cause

makeLeader reloads HW from the on-disk checkpoint file. The checkpoint
can be stale when:
- Unclean shutdown (no final flush)
- Checkpoint interval hadn't fired since HW advanced to the seal offset
- First restart after seal commit

Once seal() is called, no produces or follower fetches can advance HW
naturally, so consumers cannot read classic data below the seal offset.

## Reproduction (3-broker cluster)

1. Create topic with RF=2 (leader=broker1, follower=broker2)
2. Produce messages (e.g. 100 records, LEO=100)
3. Switch topic to diskless (diskless.enable=true)
4. Wait for seal to commit (classicToDisklessStartOffset=100)
5. Restart the leader broker (broker1)
6. Broker1 reclaims leadership via applyDelta → makeLeader + seal
7. Consumer reading from offset 0 times out: HW is stuck at
   the stale checkpointed value because seal prevents advancement

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(inkless:switch): Advance HW past stale checkpoint for sealed leader after restart

In the post-restart leader path of applyLocalLeadersDelta, after
makeLeader + seal, advance HW to the classicToDisklessStartOffset
when the checkpointed value is below it.

Without this, makeLeader loads the stale on-disk checkpoint and seal()
permanently prevents any natural HW advancement, leaving classic data
unreadable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>