fix(inkless:controller): fix leader skew for managed diskless after rolling restart by jeqo · Pull Request #643 · aiven/inkless

jeqo · 2026-06-12T10:40:11Z

Three changes that together ensure balanced leadership for managed diskless topics:

1. Enable preferred leader rebalance: the previous blanket skip for all diskless topics was correct for unmanaged (RF=1) but wrong for managed (RF>1), where tiered storage upload/deletion are leader-only.
2. ISR = all replicas at creation/reassignment/addPartitions: matches unmanaged diskless semantics. Data is in object storage, so broker fencing doesn't affect availability.
3. Expand ISR on broker unfence: when a broker returns, re-add it to ISR for diskless managed partitions where it is a replica. This repairs stale ISR from prior shrinks and enables the preferred leader election to redistribute leadership.
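
To illustrate why the ISR expansion matters for rebalancing, here is a minimal sketch under stated assumptions. The class and method names (`IsrExpansionSketch`, `expandIsrOnUnfence`, `canElectPreferredLeader`) are hypothetical and do not come from `ReplicationControlManager`; the model is reduced to plain lists of broker ids. It relies only on the standard Kafka convention that the preferred leader is the first replica in the assignment and can be elected only while it is in the ISR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model. Not the actual controller implementation.
final class IsrExpansionSketch {

    // Returns the ISR after brokerId is unfenced. If the broker is a replica
    // that is missing from the ISR, it is appended; otherwise the ISR is
    // returned unchanged.
    static List<Integer> expandIsrOnUnfence(List<Integer> replicas, List<Integer> isr, int brokerId) {
        if (!replicas.contains(brokerId) || isr.contains(brokerId)) {
            return isr;
        }
        List<Integer> expanded = new ArrayList<>(isr);
        expanded.add(brokerId);
        return expanded;
    }

    // The preferred leader is the first replica; it is electable only if it is in the ISR.
    static boolean canElectPreferredLeader(List<Integer> replicas, List<Integer> isr) {
        return !replicas.isEmpty() && isr.contains(replicas.get(0));
    }

    public static void main(String[] args) {
        List<Integer> replicas = List.of(1, 2, 3);
        List<Integer> staleIsr = List.of(2, 3); // broker 1 was dropped during a restart

        System.out.println(canElectPreferredLeader(replicas, staleIsr)); // false

        List<Integer> repaired = expandIsrOnUnfence(replicas, staleIsr, 1);
        System.out.println(repaired + " " + canElectPreferredLeader(replicas, repaired)); // [2, 3, 1] true
    }
}
```

With the stale ISR, broker 1 can never regain leadership, so leadership stays skewed; once the ISR is repaired, the periodic preferred leader election can move the leader back.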

Copilot

Pull request overview

This PR adjusts the controller's preferred-leader balancing for diskless topics. Managed diskless topics (RF>1) now participate in periodic preferred leader election, which keeps leader-only tiered-storage work from concentrating on a single broker after rolling restarts. Unmanaged/legacy diskless topics (RF=1) keep the existing skip behavior.

Changes:

Update imbalance tracking (imbalancedPartitions) to include diskless topics when isDisklessManagedReplicasEnabled is true.
Allow maybeBalancePartitionLeaders / preferred leader elections to run for diskless topics when managed replicas are enabled, while continuing to skip legacy/unmanaged diskless.
Refine and extend unit tests to distinguish unmanaged vs managed diskless periodic leader balancing behavior.
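
The gating rule described in these changes can be sketched as a single predicate. This is an illustration only: the real `shouldTrackPreferredLeader` takes a topic name and its body is not shown on this page, so the two boolean parameters below are assumptions standing in for "the topic is diskless" and `isDisklessManagedReplicasEnabled`.

```java
// Hypothetical reduction of the tracking decision to two flags.
final class PreferredLeaderTrackingSketch {

    // Classic topics always participate in preferred-leader tracking.
    // Diskless topics participate only when managed replicas are enabled.
    static boolean shouldTrackPreferredLeader(boolean disklessTopic, boolean managedReplicasEnabled) {
        return !disklessTopic || managedReplicasEnabled;
    }

    public static void main(String[] args) {
        System.out.println(shouldTrackPreferredLeader(false, false)); // classic topic: true
        System.out.println(shouldTrackPreferredLeader(true, false));  // unmanaged diskless: false
        System.out.println(shouldTrackPreferredLeader(true, true));   // managed diskless: true
    }
}
```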

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java	Includes managed-diskless partitions in preferred-leader imbalance tracking and enables periodic preferred leader election for managed diskless topics.
metadata/src/test/java/org/apache/kafka/controller/ReplicationControlManagerTest.java	Splits the prior diskless balancing test into unmanaged-skip vs managed-rebalance scenarios.


jeqo · 2026-06-12T12:43:34Z

Looking good:

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java:2433

For diskless topics, brokerFilter currently allows all replicas (including fenced/controlled-shutdown) into the ISR. Since buildPartitionRegistration always selects the leader as isr.get(0), a manual assignment that lists a fenced broker first would create a partition whose leader is not active. Consider (a) rejecting diskless assignments where none of the replicas are active, and (b) ordering the diskless ISR so that active replicas come first (while still including all replicas) to ensure the initial leader is always active.

        int partitionId = startPartitionId;
        for (int i = 0; i < partitionAssignments.size(); i++) {
            PartitionAssignment partitionAssignment = partitionAssignments.get(i);
            List<Integer> isr = isrs.get(i).stream().
                filter(brokerFilter).toList();
            // If the ISR is empty, it means that all brokers are fenced or
            // in controlled shutdown. To be consistent with the replica placer,
            // we reject the create topic request with INVALID_REPLICATION_FACTOR.
            if (isr.isEmpty()) {
                throw new InvalidReplicationFactorException(
                    "Unable to replicate the partition " + replicationFactor +
                        " time(s): All brokers are currently fenced or in controlled shutdown.");
            }
            records.add(buildPartitionRegistration(partitionAssignment, isr)
                .toRecord(topicId, partitionId, new ImageWriterOptions.Builder(featureControl.metadataVersionOrThrow()).
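
One possible shape for suggestions (a) and (b) above, as a hedged sketch rather than the fix adopted in the PR: keep every replica in the ISR, move active replicas to the front so that `isr.get(0)` is always active, and reject the assignment when no replica is active. The helper name `activeFirstIsr` is hypothetical, and `IllegalStateException` stands in for Kafka's `InvalidReplicationFactorException` to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper. Not part of ReplicationControlManager.
final class DisklessIsrOrderingSketch {

    // Returns all replicas, with active ones first and relative order preserved
    // within each group. Throws if no replica is active, because the first ISR
    // entry is used as the initial leader.
    static List<Integer> activeFirstIsr(List<Integer> replicas, Predicate<Integer> isActive) {
        List<Integer> active = new ArrayList<>();
        List<Integer> inactive = new ArrayList<>();
        for (Integer replica : replicas) {
            if (isActive.test(replica)) {
                active.add(replica);
            } else {
                inactive.add(replica);
            }
        }
        if (active.isEmpty()) {
            throw new IllegalStateException(
                "All replicas are currently fenced or in controlled shutdown.");
        }
        active.addAll(inactive);
        return active;
    }

    public static void main(String[] args) {
        // Broker 1 is fenced and listed first in a manual assignment.
        List<Integer> isr = activeFirstIsr(List.of(1, 2, 3), broker -> broker != 1);
        System.out.println(isr); // [2, 3, 1]
    }
}
```

A trade-off of this ordering is that the initial leader may differ from the preferred leader (the first replica in the assignment); the preferred leader election would be expected to correct that once the fenced broker returns.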

metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java:996

Diskless topics created via manual assignments still compute ISR by filtering only active brokers (assignment.brokerIds().stream().filter(clusterControl::isActive)), which contradicts the new diskless semantics elsewhere in this PR (ISR should include all replicas regardless of fenced state, while still ensuring at least one replica is active for leader election). This means managed diskless topics created with a manual assignment can still end up with a reduced ISR.

                TopicAssignment topicAssignment;
                Predicate<Integer> brokerFilter;
                // Diskless managed-replicas uses standard rack-aware assignment
                // with user-defined RF (or defaultReplicationFactor if RF=-1)
                if (!disklessEnabled || isDisklessManagedReplicasEnabled) {
                    topicAssignment = clusterControl.replicaPlacer().place(new PlacementSpec(
                        0,
                        numPartitions,
                        replicationFactor
                    ), clusterDescriber);
                    // For diskless (managed or not): ISR = all replicas regardless of fenced state.
                    // Data lives in object storage, so broker fencing doesn't affect availability.
                    brokerFilter = disklessEnabled ? x -> true : clusterControl::isActive;
                } else {
                    topicAssignment = createDisklessAssignment(numPartitions);
                    if (topicAssignment == null) {
                        return new ApiError(Errors.BROKER_NOT_AVAILABLE, "No brokers available to create diskless topic.");
                    }
                    brokerFilter = x -> true;
                }

                for (int partitionId = 0; partitionId < topicAssignment.assignments().size(); partitionId++) {
                    PartitionAssignment partitionAssignment = topicAssignment.assignments().get(partitionId);
                    List<Integer> isr = partitionAssignment.replicas().stream().
                        filter(brokerFilter).toList();
                    // If the ISR is empty, it means that all brokers are fenced or

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java:2419

For diskless, isrs is currently set to the full replica list (including fenced brokers), and later code uses isr.get(0) as the initial leader. If the replica placer can return fenced brokers first, this can create a partition with a fenced leader. Diskless can include fenced replicas in ISR, but we should still require at least one active replica and order ISR with active replicas first.

        for (int i = 0; i < partitionAssignments.size(); i++) {
            PartitionAssignment partitionAssignment = partitionAssignments.get(i);
            List<Integer> isr = isrs.get(i).stream().

metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java:997

The subsequent ISR/leader selection logic relies on filter(brokerFilter) + isr.isEmpty() to reject the “all replicas fenced” case. But for diskless brokerFilter is x -> true, so that validation can never trigger and buildPartitionRegistration may pick a fenced broker as the initial leader (isr.get(0)). Diskless can include fenced replicas in ISR, but we still need at least one active replica and should order ISR with active replicas first.

                    brokerFilter = x -> true;
                }

                for (int partitionId = 0; partitionId < topicAssignment.assignments().size(); partitionId++) {
                    PartitionAssignment partitionAssignment = topicAssignment.assignments().get(partitionId);

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

…olling restart

Three changes that together ensure balanced leadership for managed diskless topics:

1. Enable preferred leader rebalance: the previous blanket skip for all diskless topics was correct for unmanaged (RF=1) but wrong for managed (RF>1), where tiered storage upload/deletion are leader-only.
2. ISR = all replicas at creation/reassignment/addPartitions: matches unmanaged diskless semantics. Data is in object storage, so broker fencing doesn't affect availability.
3. Expand ISR on broker unfence: when a broker returns, re-add it to ISR for diskless managed partitions where it is a replica. This repairs stale ISR from prior shrinks and enables the preferred leader election to redistribute leadership.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (2)

metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java:591

When preferred-leader tracking is disabled for a topic (e.g., unmanaged diskless), this block skips updating imbalancedPartitions but also never removes any existing entry for the partition. If the topic previously participated in tracking (before a config change), stale entries can keep arePartitionLeadersImbalanced() returning true and cause unnecessary periodic rebalance scheduling. Consider explicitly removing the partition from imbalancedPartitions when shouldTrackPreferredLeader is false.

        if (shouldTrackPreferredLeader(topicInfo.name)) {
            if (newPartInfo.hasPreferredLeader()) {
                imbalancedPartitions.remove(new TopicIdPartition(record.topicId(), record.partitionId()));
            } else {
                imbalancedPartitions.add(new TopicIdPartition(record.topicId(), record.partitionId()));

metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java:639

Same issue as in replay(PartitionRecord): when shouldTrackPreferredLeader is false, this code skips updating imbalancedPartitions but does not remove any pre-existing entry. That can leave stale imbalances around after a topic transitions into a mode where preferred-leader tracking is disabled, keeping rebalance tasks rescheduled unnecessarily.

        if (shouldTrackPreferredLeader(topicInfo.name)) {
            if (newPartitionInfo.hasPreferredLeader()) {
                imbalancedPartitions.remove(new TopicIdPartition(record.topicId(), record.partitionId()));
            } else {
                imbalancedPartitions.add(new TopicIdPartition(record.topicId(), record.partitionId()));
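
The stale-entry concern raised for both replay paths can be sketched as follows. This is a simplified model under assumptions: partitions are keyed by a plain string instead of `TopicIdPartition`, and `onPartitionChange` is a hypothetical stand-in for the relevant part of the replay logic. The point is that the "tracking disabled" branch removes the entry instead of leaving it untouched.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of imbalance tracking.
final class ImbalanceTrackingSketch {
    private final Set<String> imbalancedPartitions = new HashSet<>();

    // Records the partition as imbalanced only when it is tracked and its
    // leader is not the preferred one. In every other case, including
    // "tracking disabled", any existing entry is removed.
    void onPartitionChange(String partitionKey, boolean shouldTrack, boolean hasPreferredLeader) {
        if (shouldTrack && !hasPreferredLeader) {
            imbalancedPartitions.add(partitionKey);
        } else {
            imbalancedPartitions.remove(partitionKey);
        }
    }

    boolean arePartitionLeadersImbalanced() {
        return !imbalancedPartitions.isEmpty();
    }

    public static void main(String[] args) {
        ImbalanceTrackingSketch tracker = new ImbalanceTrackingSketch();
        tracker.onPartitionChange("topicA-0", true, false);
        System.out.println(tracker.arePartitionLeadersImbalanced()); // true

        // Tracking is later disabled for the topic; the stale entry is dropped.
        tracker.onPartitionChange("topicA-0", false, false);
        System.out.println(tracker.arePartitionLeadersImbalanced()); // false
    }
}
```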

jeqo requested a review from Copilot June 12, 2026 10:40

Copilot started reviewing on behalf of jeqo June 12, 2026 10:40