Skip to content

docs(MADR): add MADR 101 for status syncing rules in multizone#16083

Open
lahabana wants to merge 1 commit into
kumahq:masterfrom
lahabana:madr-status-syncing
Open

docs(MADR): add MADR 101 for status syncing rules in multizone#16083
lahabana wants to merge 1 commit into
kumahq:masterfrom
lahabana:madr-status-syncing

Conversation

@lahabana

@lahabana lahabana commented Apr 1, 2026

Copy link
Copy Markdown
Contributor

Consolidate conventions scattered from MADRs 014, 039, 044, 051, 056, 072, and 096 into explicit rules for resource status ownership and syncing in multizone deployments.

Consolidate conventions scattered from MADRs 014, 039, 044, 051, 056,
072, and 096 into explicit rules for resource status ownership and
syncing in multizone deployments.

Signed-off-by: Charly Molter <charly.molter@konghq.com>
Copilot AI review requested due to automatic review settings April 1, 2026 12:38
@lahabana lahabana requested a review from a team as a code owner April 1, 2026 12:38
@lahabana lahabana requested review from Automaat and lukidzi April 1, 2026 12:38
@lahabana lahabana changed the title docs(madr): add MADR 101 for status syncing rules in multizone docs(MADR): add MADR 101 for status syncing rules in multizone Apr 1, 2026
@lahabana lahabana added the ci/skip-test PR: Don't run unit and e2e tests (maybe this is just a doc change) label Apr 1, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds MADR 101 to consolidate and make explicit the project’s multizone conventions around resource status ownership, computation, and KDS syncing, so future resource/API work follows a consistent single-writer model and avoids cross-zone status overwrites.

Changes:

  • Introduces a unified set of rules: status is zone-local, computed only on zone CPs, and not synced onward cross-zone.
  • Documents the existing implementation mechanics that enforce this (global→zone status stripping + zone-side ignore/preserve behavior).
  • Clarifies how to provide global-level visibility without storing/aggregating global-computed status (computed endpoints vs KDS reverse-unary RPC forwarding).

@github-actions

github-actions Bot commented Apr 7, 2026

Copy link
Copy Markdown
Contributor

Reviewer Checklist

🔍 Each of these sections need to be checked by the reviewer of the PR 🔍:
If something doesn't apply please check the box and add a justification if the reason is non obvious.

  • Is the PR title satisfactory? Is this part of a larger feature and should be grouped using > Changelog?
  • PR description is clear and complete. It Links to relevant issue as well as docs and UI issues
  • This will not break child repos: it doesn't hardcode values (.e.g "kumahq" as an image registry)
  • IPv6 is taken into account (.e.g: no string concatenation of host port)
  • Tests (Unit test, E2E tests, manual test on universal and k8s)
    • Don't forget ci/ labels to run additional/fewer tests
  • Does this contain a change that needs to be notified to users? In this case, UPGRADE.md should be updated.
  • Does it need to be backported according to the backporting policy? (this GH action will add "backport" label based on these file globs, if you want to prevent it from adding the "backport" label use no-backport-autolabel label)

4. Computing status on global would be costly and scale in O(number of entities globally)
which we try to avoid.

Previous MADRs addressed this:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the list above and below is missing 014 and 072.

**independently** computing or aggregating status:

Global CP MUST NOT:
- Independently compute or populate status fields on resources

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Independently compute or populate status fields on resources
- Independently compute or populate status fields on resources (passively storing zone-synced data via KDS is fine)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it the 3rd point?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort of is, the "synced data into stored resources" part indicates that something is ok to be synced, wanted to make it clear. If no one else feels the need then feel free to drop this.

dedicated API. Data that needs to travel across zones belongs in `spec`.

These rules consolidate and strengthen existing conventions from MADRs 014, 039,
044, 051, 056, 072, and 096.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does enforcement look like? The guards already exist in code but they're ad-hoc per component. Is code review sufficient or should there be a systematic check?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was mostly trying to focus on persisting this. For what it's worth I got this MADR and the code reviewed by agents a few times with current use-cases for completeness. Which worked very well. I think as our usage of AI grows I think having skill to help self-reviews with targeted agents will probably get us a bigger effort/outcome than code enforcements.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should think about whether code enforcements are feasible.

Comment on lines +17 to +19
1. Resources are authored on one control plane (zone or global) and synced to
others via KDS.
2. Status contains information that is inherently local to the zone where the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These examples (VIPs, hostnames, proxy counts, availability) seem MeshService-specific - what about the other resources with HasStatus: true (MeshIdentity, MeshOpenTelemetryBackend, Workload, MeshTrust, MeshTrace, MeshMetric, MeshAccessLog, HostnameGenerator)? Should the framing be broader to cover those cases or is this MADR only about Mesh*Service?

Comment on lines +9 to +13
Across multiple MADRs (039, 044, 051, 056, 096) we have established conventions
for how resource `status` is handled in multizone deployments. These conventions
are scattered and implicit. This MADR consolidates the rules into a single
reference and strengthens them with an explicit prohibition on computing status
on Global CP.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context says these conventions are "scattered and implicit" - are there implicit ones in the code that aren't captured by any of the referenced MADRs?

Looking at the codebase:

  1. Every status updater and VIP allocator is gated by rt.GetMode() == config_core.Zone (and pkg/ipam/components.go).
  2. IsLocalMeshService() filtering in the status updater

Is it worth mentioning these as existing enforcement mechanisms, or at least acknowledging them so future contributors know the guards already exist?

(`pkg/kds/v2/store/sync.go`) ensures that locally-computed status is preserved
and not overwritten by the empty status arriving from global.

**Implication: global-originated resources have no status on global.** Since

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to compute it for some resources? For MeshService-style status (VIPs, hostnames) the answer is clear - global lacks the zone-local inputs.

Is the same true for MeshOpenTelemetryBackend, MeshIdentity etc.?

Was this considered?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this MADR only tries to persist what the state of things are. I'm purposefully not trying to expand to new things to keep it focused.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's adding more restrictions (e.g. "... strengthens them with an explicit prohibition ...") but I understand that we want to keep scope minimal.

resource instance's status.

Note that computed spec fields like `Spec.Identities` and `Spec.State` (see
Rule 4) are also written by the local zone's status updater, not by the user.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit related: should we consider rename since it's no longer "status updater"? computed fields reconciler?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we could but didn't want to create work out of this MADR

This separation eliminates race conditions between spec updates from the origin
and status updates on the destination (the scenario described in MADR 056).

### Rule 4: Data that must cross zone boundaries belongs in spec

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rule 4 formalizes "computed spec fields" as a concept but there's no structural way to distinguish them from user-authored spec fields. Spec.Identities and Spec.State sit next to user-authored fields. Should we come up with a naming pattern for this?

@lahabana lahabana Apr 8, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, though the goal of the MADR is to persist existing state not create new work.

@kumahq kumahq deleted a comment from lahabana Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/skip-test PR: Don't run unit and e2e tests (maybe this is just a doc change)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants