Is this a bug report or feature request?
- Feature Request
I consider this a feature request more than a bug report. It may also reasonably be a large cleanup-related task.
What is the use case behind this feature:
There are several minor problems that Rook users face that could be improved if Rook had one operator per CephCluster. The problems are outlined below, with additional notes about investigations related to fixing each issue individually:
The OSD wait loop is long and can starve other CephCluster reconciles
Captured in: #11330
Using a per-CephCluster operator would mean that the OSD wait loops we have written -- that help avoid unnecessary k8s and ceph API calls -- don't need to be refactored.
Logging framework does not enforce use of namespaced names in logs
Our logging framework does not automatically capture (or force use of) namespaced names when logging. Many contributing devs forget to include the namespace of a resource in logs, resulting in unclear logs in multi-CephCluster environments.
Doing the work to separate the k8s-cluster-scope functions from the ceph-cluster-scope functions may actually be less effort than replacing the logging framework with a clearer one.
In addition, users have asked in the past whether it's possible to easily see logs for only a single one of their multiple clusters. This is not easy today, because all cluster logs are interleaved into a single operator log.
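For illustration only, a minimal sketch of the kind of wrapper that would otherwise be needed to enforce namespaced names (the `clusterLogger` helper below is hypothetical and is not Rook's actual logging setup; it uses Go's standard `log/slog`):

```go
package main

import (
	"log/slog"
	"os"
)

// clusterLogger is a hypothetical helper: it returns a logger that always
// carries the CephCluster's namespaced name, so every log line is
// attributable to one cluster even when many clusters share one operator.
func clusterLogger(base *slog.Logger, namespace, name string) *slog.Logger {
	return base.With("cephcluster", namespace+"/"+name)
}

func main() {
	base := slog.New(slog.NewTextHandler(os.Stderr, nil))

	logA := clusterLogger(base, "rook-ceph", "my-cluster")
	logB := clusterLogger(base, "rook-ceph-secondary", "my-cluster")

	// Without a wrapper like this (or a per-cluster operator whose logs are
	// naturally separated), these two lines would be indistinguishable in a
	// shared operator log.
	logA.Info("reconciling OSDs")
	logB.Info("reconciling OSDs")
}
```

With a per-CephCluster operator, this enforcement becomes unnecessary: each sub-operator's log stream belongs to exactly one cluster.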
Version mismatch errors during usage of radosgw-admin
When the version of radosgw-admin in the rook-operator image is lower than the version present in the Ceph image in the CephCluster, there is a risk that radosgw-admin commands fail (with random, unexpected failure messages and error codes) due to a non-forward-compatible radosgw-admin API.
Historical issue: #13241
Abandoned to resolve the issue: #13430
The abandoned attempt to resolve the issue is workable, but it retains the hackiness of requiring a mgr sidecar. This would be totally unnecessary with a per-CephCluster operator.
In non-default network configs, rook isn't connected to the same network as the Ceph cluster
This isn't a bug, but it does require maintenance of hacky workarounds, especially in multus-enabled clusters. For example:
- needing to use a multus network CIDR detection canary job (hacky, could consume all IP allocations in DHCP environments if there is an operator CLBO bug)
- needing to exec commands in a mgr sidecar pod in multus-enabled clusters (extra cpu/mem resources needed, complicated scenario for krew plugin)
It also means that Rook lacks the flexibility in nonstandard network environments that could help implement new network features or more advanced modifications to existing features.
If a per-CephCluster operator were run in the same network space as the CephCluster, these workarounds would be unnecessary and several complicated mechanisms could be cleaned up. Admittedly, the extra CPU/mem resources would move from the mgr sidecar to the per-CephCluster operator -- a net zero change.
Ceph version-detection jobs are hacky
Rook uses Ceph version detection jobs to detect the Ceph version present in a given Ceph container image.
This could be avoided if a per-cephcluster operator were to always run on top of the same Ceph image defined in the CephCluster.
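If the sub-operator runs on the cluster's own Ceph image, version detection reduces to asking the local binary. A minimal sketch (the function name and parsing regex are illustrative, not existing Rook code), using `ceph --version`, which reports the local binary's version without needing a mon connection:

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
)

// Example `ceph --version` output: "ceph version 18.2.1 (<hash>) reef (stable)"
var versionRe = regexp.MustCompile(`ceph version (\d+)\.(\d+)\.(\d+)`)

// localCephVersion runs the ceph binary shipped in the same image the
// sub-operator runs on, avoiding a separate version-detection Job.
func localCephVersion() (string, error) {
	out, err := exec.Command("ceph", "--version").CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("failed to run ceph --version: %w", err)
	}
	m := versionRe.FindStringSubmatch(string(out))
	if m == nil {
		return "", fmt.Errorf("unexpected ceph --version output: %q", out)
	}
	return fmt.Sprintf("%s.%s.%s", m[1], m[2], m[3]), nil
}

func main() {
	v, err := localCephVersion()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("local ceph version:", v)
}
```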
What should the feature do:
The overall proposal to help resolve these issues is to have a global Rook operator handle these tasks:
- run CSI driver
- run COSI driver
- spawn a sub-operator for each CephCluster
- update the sub-operator when `spec.version` changes
- update the sub-operator when `spec.network` changes
The CephCluster sub-operator would continue to handle the same functions as the `rook-ceph-operator` handles today, with the exception of the changes handled by the global operator above. For what it's worth, this more closely resembles the suggestion that controller-runtime makes to use jobs for long-running reconcile steps: kubernetes-sigs/controller-runtime#1763 (comment)
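A rough sketch of what the global operator's CephCluster reconcile could look like, assuming controller-runtime and the existing `cephv1` types. The `ceph-operator` Deployment shape, labels, and container args below are illustrative placeholders, not a committed design:

```go
package controller

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// GlobalReconciler spawns one ceph-operator Deployment per CephCluster and
// updates it when spec.cephVersion or spec.network change.
type GlobalReconciler struct {
	client.Client
}

func (r *GlobalReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	cluster := &cephv1.CephCluster{}
	if err := r.Get(ctx, req.NamespacedName, cluster); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	dep := &appsv1.Deployment{ObjectMeta: metav1.ObjectMeta{
		Name:      "ceph-operator", // proposed sub-operator name
		Namespace: cluster.Namespace,
	}}

	// CreateOrUpdate keeps the sub-operator in sync with spec.cephVersion
	// (image) and, per the "second modification", spec.network.
	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, dep, func() error {
		labels := map[string]string{"app": "ceph-operator"}
		dep.Spec.Selector = &metav1.LabelSelector{MatchLabels: labels}
		dep.Spec.Template.Labels = labels
		dep.Spec.Template.Spec.Containers = []corev1.Container{{
			Name:  "ceph-operator",
			Image: cluster.Spec.CephVersion.Image, // run on the cluster's own Ceph image
			Args:  []string{"ceph", "operator"},   // placeholder; actual entrypoint TBD
		}}
		// Network attachment (e.g., multus annotations derived from
		// cluster.Spec.Network) would also be applied here when needed.
		return controllerutil.SetControllerReference(cluster, dep, r.Scheme())
	})
	return ctrl.Result{}, err
}
```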
Implementation suggestion:
- First modification
  - Sub-operator should watch for all changes to CephClusters, but ignore changes to `spec.version` and `spec.network` -- the global operator will instead update the "self" sub-operator pod spec when these changes occur (see the predicate sketch after this list)
  - Sub-operator should take resource requests/limits and placement from rook-ceph-operator as a starting point
  - Top-level operator should keep the same name (`rook-ceph-operator`) for easy migration, and the sub-operator will be named `ceph-operator`
- Second modification [per Travis' suggestion]
  - Sub-operator should run in the same Ceph image as defined in `spec.version`
  - Sub-operator should be configured to use the same `spec.network` config as Ceph pods when necessary -- e.g., must be attached to multus net(s) (possibly host network?)
- Follow-up modifications:
  - Refactor ceph version detection jobs to just call `ceph version` locally
  - Refactor multus network canary job to check for network CIDRs locally
  - Remove usage of mgr sidecar when multus is enabled
  - Refactor krew plugin for new architecture
  - Refactor OBC controller to ensure it (or they) operate as needed
  - Update troubleshooting docs to ensure the correct operator is referenced
  - (if needed) Create a new serviceaccount for the global operator, and refactor RBAC to clean up
  - Provide CephCluster options for `ceph-operator` placement and resource requests/limits to be specified optionally
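For the "ignore changes to `spec.version` and `spec.network`" behavior, a controller-runtime update predicate is one possible shape. This is a sketch only, assuming the existing `cephv1.ClusterSpec` fields (`CephVersion`, `Network`):

```go
package controller

import (
	"reflect"

	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// ignoreVersionAndNetworkChanges drops CephCluster updates whose only spec
// differences are cephVersion and/or network, since those fields would be
// reconciled by the global operator (which updates the sub-operator's own
// pod spec), not by the sub-operator itself.
func ignoreVersionAndNetworkChanges() predicate.Predicate {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldC, ok1 := e.ObjectOld.(*cephv1.CephCluster)
			newC, ok2 := e.ObjectNew.(*cephv1.CephCluster)
			if !ok1 || !ok2 {
				return true // not a CephCluster; let other filters decide
			}

			// Normalize the fields the sub-operator should not react to,
			// then compare the remaining spec.
			oldSpec := oldC.Spec.DeepCopy()
			newSpec := newC.Spec.DeepCopy()
			oldSpec.CephVersion = newSpec.CephVersion
			oldSpec.Network = newSpec.Network

			// Reconcile only if something other than version/network changed.
			return !reflect.DeepEqual(oldSpec, newSpec)
		},
	}
}
```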
Open questions:
- Which operator manages OBCs? Global or per-CephCluster? I propose the per-CephCluster operator, and we should take extra care to ensure that each sub-operator will only respond to OBC changes related to its namespace.
- Should we put the sub-operator on the host network if `hostNetwork: true`? I think not, since it isn't necessary and not doing so is more secure, and we can always change this in the future if needed.
Environment
All, and especially:
- multi-CephCluster environments
- Multus environments
- environments with CephObjectStores