Skip to content

feat(metadata:diskless): implement managed replicas for diskless topics#492

Open
jeqo wants to merge 3 commits intomainfrom
jeqo/pod-2001-diskless-managed-replica
Open

feat(metadata:diskless): implement managed replicas for diskless topics#492
jeqo wants to merge 3 commits intomainfrom
jeqo/pod-2001-diskless-managed-replica

Conversation

@jeqo
Copy link
Contributor

@jeqo jeqo commented Jan 22, 2026

Implements Phase 1 of Diskless Managed Replicas (see #478 docs/inkless/ts-unification/DISKLESS_MANAGED_RF.md).

When diskless.managed.rf.enable=true, new diskless topics are created with RF = rack_count (one replica per rack) using standard KRaft replica placement, instead of legacy RF=1.

Changes

  • Add diskless.managed.rf.enable server config (default: false)
  • Compute RF from rack cardinality at topic creation
  • Use standard replicaPlacer.place() for rack-aware assignment
  • Allow manual replica assignments for operational flexibility
  • Comprehensive unit tests for managed and unmanaged modes

Known Limitations (Phase 1)

This feature is opt-in only via diskless.managed.rf.enable=true.
Existing clusters and topics are unaffected unless explicitly enabled.

When enabled:

  • Partition Reassignment: Not yet supported for managed replica topics. Reassignment will hang waiting for replica sync. Will be fixed in follow-up PR.

  • Transformer: Not updated - uses legacy routing (Phase 2). KRaft metadata shows RF=3, but clients are still routed to any alive broker. Doesn't affect correctness for DISKLESS_ONLY topics.

  • Add Partitions: Works correctly - inherits RF and uses rack-aware placement.

  • Observability metrics: Deferred to Phase 2.

Safe to merge because:

  • Feature flag is false by default
  • Only affects new topic creation when explicitly enabled
  • Existing diskless topics (RF=1) continue to work unchanged
  • No impact on classic (non-diskless) topics

Follow-up PRs Sequence

  1. This PR (Phase 1): Topic creation with managed replicas
  2. Phase 2 PR (jeqo/pod-2001-unlock-diskless-reassignment currently has):
  • Transformer changes for mode-aware routing
  • Immediate partition reassignment fix (critical!)
  • Phase 3 test for add partitions

Testing

  • Tests for: no-racks, with-racks, invalid input, internal topics, broker fencing, unregister scenarios

Configuration

Config Default Description
diskless.managed.rf.enable false When true, new diskless topics get RF=rack_count

@jeqo jeqo force-pushed the jeqo/pod-2001-diskless-managed-replica branch from 259d178 to d64e450 Compare February 5, 2026 14:54
@jeqo jeqo requested a review from Copilot February 9, 2026 09:58
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements Phase 1 of “Diskless Managed Replicas”: when enabled via a new server config, newly-created diskless topics use standard KRaft replica placement with replication factor derived from rack cardinality (instead of legacy RF=1), and manual replica assignments are allowed for diskless topics in managed mode.

Changes:

  • Add diskless.managed.rf.enable server config and plumb it through broker/controller config wiring.
  • Update ReplicationControlManager.createTopic to compute diskless RF from rack cardinality and use replicaPlacer.place() in managed mode; allow manual assignments in managed mode.
  • Extend unit tests to cover managed vs unmanaged behavior across no-rack / with-rack / invalid input / fencing & unregister scenarios.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
server-common/src/main/java/org/apache/kafka/server/config/ServerConfigs.java Introduces new diskless.managed.rf.enable config definition and documentation.
metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java Adds managed-replica enable flag, rack-cardinality RF computation, and managed-mode placement path in topic creation.
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java Wires the managed-replica flag from controller builder into ReplicationControlManager.
core/src/main/scala/kafka/server/KafkaConfig.scala Exposes disklessManagedReplicasEnabled from server properties.
core/src/main/scala/kafka/server/ControllerServer.scala Passes the new config into the QuorumController.Builder.
metadata/src/test/java/org/apache/kafka/controller/ReplicationControlManagerTest.java Adds/updates tests and test context plumbing for managed vs unmanaged diskless behavior.
checkstyle/suppressions.xml Adds a MethodLength suppression for ReplicationControlManager.java.
Comments suppressed due to low confidence (1)

metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java:792

  • The INVALID_REPLICATION_FACTOR message hard-codes the “default value (1)”, but when diskless managed replicas are enabled the effective RF is derived from rack cardinality (can be > 1). Update the error message (and/or the preceding comment) to describe the actual allowed inputs (1 or -1) and what -1/managed mode means, without implying the default is always 1.
            // Diskless RF: only -1 (system-computed from rack count) or 1 (backward compat) allowed.
            // Explicit RF > 1 rejected: users shouldn't need to know rack topology.
            if (Math.abs(topic.replicationFactor()) != 1) {
                return new ApiError(Errors.INVALID_REPLICATION_FACTOR,
                    "Replication factor for diskless topics must be 1 or -1 to use the default value (1).");
            }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

… unregister scenarios

- Add _noRacks and _withRacks test variants for consistent coverage
- Fix tests that assumed broker 0 was always the leader
- Get actual leader from partition registration before fencing/unregistering
- Use dynamic assertions based on actual partition state
- Improve assertion error messages for clarity
@jeqo jeqo force-pushed the jeqo/pod-2001-diskless-managed-replica branch from d64e450 to 3e7a42b Compare February 9, 2026 16:37
@jeqo jeqo requested a review from Copilot February 9, 2026 16:37
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@jeqo jeqo requested a review from Copilot February 9, 2026 16:38
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@jeqo jeqo requested a review from Copilot February 9, 2026 16:59
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jeqo jeqo force-pushed the jeqo/pod-2001-diskless-managed-replica branch from 3e7a42b to 6fba0c9 Compare February 9, 2026 18:02
@jeqo jeqo requested a review from Copilot February 9, 2026 18:12
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jeqo jeqo force-pushed the jeqo/pod-2001-diskless-managed-replica branch from 6fba0c9 to 1f3c7e1 Compare February 9, 2026 19:06
@jeqo jeqo requested a review from Copilot February 9, 2026 19:06
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jeqo jeqo force-pushed the jeqo/pod-2001-diskless-managed-replica branch from 1f3c7e1 to 6f01be4 Compare February 9, 2026 23:23
@jeqo jeqo requested a review from Copilot February 9, 2026 23:29
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jeqo jeqo force-pushed the jeqo/pod-2001-diskless-managed-replica branch from 6f01be4 to 82dbca5 Compare February 10, 2026 00:16
@jeqo jeqo requested a review from Copilot February 10, 2026 00:16
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

jeqo added 2 commits February 10, 2026 02:45
Add diskless.managed.rf.enable config (default: false) to control whether
diskless topics use managed replicas with RF=rack_count or legacy RF=1.

This config only affects topic creation. When enabled, new diskless topics
will be created with one replica per rack using standard KRaft placement.

Part of Phase 1: Diskless Managed Replicas
(See #478 docs/inkless/ts-unification/DISKLESS_MANAGED_RF.md)
When diskless.managed.rf.enable=true, new diskless topics are created with
RF=rack_count using standard KRaft replica placement instead of legacy RF=1.

Changes:
- Compute RF from rack cardinality via rackCardinality()
- Use standard replicaPlacer.place() for rack-aware assignment
- Allow manual replica assignments when managed replicas enabled
- Add checkstyle suppression for extended createTopic method

Phase 1 limitations:
- Add Partitions inherits RF from existing partitions (Phase 3)
- Transformer not updated, uses legacy routing (Phase 2)
- Integration tests deferred to Phase 2
(See #478 docs/inkless/ts-unification/DISKLESS_MANAGED_RF.md)
@jeqo jeqo force-pushed the jeqo/pod-2001-diskless-managed-replica branch from 82dbca5 to 5f41dde Compare February 10, 2026 00:46
@jeqo jeqo requested a review from Copilot February 10, 2026 00:46
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jeqo jeqo marked this pull request as ready for review February 10, 2026 00:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant