docs(MADR): add MADR 101 for status syncing rules in multizone#16083
docs(MADR): add MADR 101 for status syncing rules in multizone#16083lahabana wants to merge 1 commit into
Conversation
Consolidate conventions scattered from MADRs 014, 039, 044, 051, 056, 072, and 096 into explicit rules for resource status ownership and syncing in multizone deployments. Signed-off-by: Charly Molter <charly.molter@konghq.com>
There was a problem hiding this comment.
Pull request overview
Adds MADR 101 to consolidate and make explicit the project’s multizone conventions around resource status ownership, computation, and KDS syncing, so future resource/API work follows a consistent single-writer model and avoids cross-zone status overwrites.
Changes:
- Introduces a unified set of rules: status is zone-local, computed only on zone CPs, and not synced onward cross-zone.
- Documents the existing implementation mechanics that enforce this (global→zone status stripping + zone-side ignore/preserve behavior).
- Clarifies how to provide global-level visibility without storing/aggregating global-computed status (computed endpoints vs KDS reverse-unary RPC forwarding).
Reviewer Checklist🔍 Each of these sections need to be checked by the reviewer of the PR 🔍:
|
| 4. Computing status on global would be costly and scale in O(number of entities globally) | ||
| which we try to avoid. | ||
|
|
||
| Previous MADRs addressed this: |
There was a problem hiding this comment.
I think the list above and below is missing 014 and 072.
| **independently** computing or aggregating status: | ||
|
|
||
| Global CP MUST NOT: | ||
| - Independently compute or populate status fields on resources |
There was a problem hiding this comment.
| - Independently compute or populate status fields on resources | |
| - Independently compute or populate status fields on resources (passively storing zone-synced data via KDS is fine) |
There was a problem hiding this comment.
Isn't it the 3rd point?
There was a problem hiding this comment.
Sort of is, the "synced data into stored resources" part indicates that something is ok to be synced, wanted to make it clear. If no one else feels the need then feel free to drop this.
| dedicated API. Data that needs to travel across zones belongs in `spec`. | ||
|
|
||
| These rules consolidate and strengthen existing conventions from MADRs 014, 039, | ||
| 044, 051, 056, 072, and 096. |
There was a problem hiding this comment.
What does enforcement look like? The guards already exist in code but they're ad-hoc per component. Is code review sufficient or should there be a systematic check?
There was a problem hiding this comment.
I think I was mostly trying to focus on persisting this. For what it's worth I got this MADR and the code reviewed by agents a few times with current use-cases for completeness. Which worked very well. I think as our usage of AI grows I think having skill to help self-reviews with targeted agents will probably get us a bigger effort/outcome than code enforcements.
There was a problem hiding this comment.
We should think about whether code enforcements are feasible.
| 1. Resources are authored on one control plane (zone or global) and synced to | ||
| others via KDS. | ||
| 2. Status contains information that is inherently local to the zone where the |
There was a problem hiding this comment.
These examples (VIPs, hostnames, proxy counts, availability) seem MeshService-specific - what about the other resources with HasStatus: true (MeshIdentity, MeshOpenTelemetryBackend, Workload, MeshTrust, MeshTrace, MeshMetric, MeshAccessLog, HostnameGenerator)? Should the framing be broader to cover those cases or is this MADR only about Mesh*Service?
| Across multiple MADRs (039, 044, 051, 056, 096) we have established conventions | ||
| for how resource `status` is handled in multizone deployments. These conventions | ||
| are scattered and implicit. This MADR consolidates the rules into a single | ||
| reference and strengthens them with an explicit prohibition on computing status | ||
| on Global CP. |
There was a problem hiding this comment.
The context says these conventions are "scattered and implicit" - are there implicit ones in the code that aren't captured by any of the referenced MADRs?
Looking at the codebase:
- Every status updater and VIP allocator is gated by
rt.GetMode() == config_core.Zone(andpkg/ipam/components.go). IsLocalMeshService()filtering in the status updater
Is it worth mentioning these as existing enforcement mechanisms, or at least acknowledging them so future contributors know the guards already exist?
| (`pkg/kds/v2/store/sync.go`) ensures that locally-computed status is preserved | ||
| and not overwritten by the empty status arriving from global. | ||
|
|
||
| **Implication: global-originated resources have no status on global.** Since |
There was a problem hiding this comment.
Does it make sense to compute it for some resources? For MeshService-style status (VIPs, hostnames) the answer is clear - global lacks the zone-local inputs.
Is the same true for MeshOpenTelemetryBackend, MeshIdentity etc.?
Was this considered?
There was a problem hiding this comment.
I think this MADR only tries to persist what the state of things are. I'm purposefully not trying to expand to new things to keep it focused.
There was a problem hiding this comment.
It's adding more restrictions (e.g. "... strengthens them with an explicit prohibition ...") but I understand that we want to keep scope minimal.
| resource instance's status. | ||
|
|
||
| Note that computed spec fields like `Spec.Identities` and `Spec.State` (see | ||
| Rule 4) are also written by the local zone's status updater, not by the user. |
There was a problem hiding this comment.
A bit related: should we consider rename since it's no longer "status updater"? computed fields reconciler?
There was a problem hiding this comment.
Yes we could but didn't want to create work out of this MADR
| This separation eliminates race conditions between spec updates from the origin | ||
| and status updates on the destination (the scenario described in MADR 056). | ||
|
|
||
| ### Rule 4: Data that must cross zone boundaries belongs in spec |
There was a problem hiding this comment.
Rule 4 formalizes "computed spec fields" as a concept but there's no structural way to distinguish them from user-authored spec fields. Spec.Identities and Spec.State sit next to user-authored fields. Should we come up with a naming pattern for this?
There was a problem hiding this comment.
Probably, though the goal of the MADR is to persist existing state not create new work.
Consolidate conventions scattered from MADRs 014, 039, 044, 051, 056, 072, and 096 into explicit rules for resource status ownership and syncing in multizone deployments.