WIP: ceph: add nvme-of gateway crd support #16689
base: master
Conversation
pkg/operator/ceph/cr_manager.go
	metricsBindAddress := k8sutil.GetOperatorSetting("ROOK_OPERATOR_METRICS_BIND_ADDRESS", "0")
	skipNameValidation := true
	cacheSyncTimeout := 10 * time.Minute
Adding the NVMeOF controller causes operator startup timeouts.
Increased CacheSyncTimeout to 10 minutes.
Rook operator logs with the private image show errors with the new NVMeoF code.
@travisn Do you know why this is happening?
2025-11-10 14:29:16.627898 D | clusterdisruption-controller: could not match failure domain. defaulting to "host"
2025-11-10 14:29:16.627938 D | exec: Running command: ceph osd metadata --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2025-11-10 14:29:16.790834 E | clusterdisruption-controller: failed to get OSD status: failed to get osd metadata: exit status 1
2025-11-10 14:29:16.963003 I | ceph-csi: cluster info for cluster "my-cluster" is not ready yet, will retry in 10s, proceeding with ready clusters
2025-11-10 14:29:16.970832 C | rookcmd: failed to run operator: gave up to run the operator manager: failed to run the controller-runtime manager: failed to wait for ceph-object-realm-controller caches to sync kind source: *v1.CephObjectRealm: timed out waiting for cache to be synced for Kind *v1.CephObjectRealm
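(As context for the timeout change described above, here is a minimal sketch of how such a value can be wired into the controller-runtime manager options; this assumes a recent controller-runtime where manager.Options exposes a Controller config, and is illustrative rather than the PR's actual code.)

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// managerOptions sketches raising the cache sync timeout so slow-to-sync
// informers (such as a newly added CRD kind) do not abort manager startup.
func managerOptions() manager.Options {
	return manager.Options{
		Controller: config.Controller{
			CacheSyncTimeout: 10 * time.Minute,
		},
	}
}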
Probably we should check whether the CephObjectRealm CRD is present.
On another test, I got this error
2025-11-11 12:41:59.314806 D | exec: Running command: ceph osd metadata --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2025-11-11 12:41:59.506880 E | clusterdisruption-controller: failed to get OSD status: failed to get osd metadata: exit status 1
2025-11-11 12:41:59.509731 C | rookcmd: failed to run operator: gave up to run the operator manager: failed to run the controller-runtime manager: failed to wait for ceph-object-zone-controller caches to sync kind source: *v1.CephObjectZone: timed out waiting for cache to be synced for Kind *v1.CephObjectZone
cc @subhamkrai, who mentioned adding the type registration to the scheme; can you try that?
It is done in PR https://github.com/rook/rook/pull/16689/files#diff-6c58bb0ea33fb8d02e4b3fbc0319f912784be096e78b84b7188e2f570b946998R68
@subhamkrai did you do this?
Maybe something like this will resolve it:
rook/pkg/operator/ceph/cluster/controller.go
Lines 153 to 161 in 8424d09
	err = addonsv1alpha1.AddToScheme(mgr.GetScheme())
	if err != nil {
		return err
	}
	err = csiopv1.AddToScheme(mgr.GetScheme())
	if err != nil {
		return err
	}
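For the new kind, the equivalent would be registering the Ceph API group with the manager's scheme; a sketch, assuming CephNVMeOFGateway is listed in the group's scheme builder:

import (
	cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// registerCephScheme adds the ceph.rook.io/v1 types (including CephNVMeOFGateway,
// once it is part of the group's known types) to the manager's scheme so the
// cache can sync watches for the kind.
func registerCephScheme(mgr manager.Manager) error {
	return cephv1.AddToScheme(mgr.GetScheme())
}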
pkg/operator/ceph/nvmeof/spec.go
	script := fmt.Sprintf(`
set -xEeuo pipefail
cat << EOF > /etc/ceph/ceph.conf
That has been manually copied from the NFS-Ganesha deployment that Rook made. I think the script is generated by this:
rook/pkg/operator/ceph/controller/spec.go
Lines 570 to 613 in de8c2d2
// GenerateMinimalCephConfInitContainer returns an init container that will generate the most basic
// Ceph config for connecting non-Ceph daemons to a Ceph cluster (e.g., nfs-ganesha). Effectively
// what this means is that it generates '/etc/ceph/ceph.conf' with 'mon_host' populated and a
// keyring path associated with the user given. 'mon_host' is determined by the 'ROOK_CEPH_MON_HOST'
// env var present in other Ceph daemon pods, and the keyring is expected to be mounted into the
// container with a Kubernetes pod volume+mount.
func GenerateMinimalCephConfInitContainer(
	username, keyringPath string,
	containerImage string,
	containerImagePullPolicy v1.PullPolicy,
	volumeMounts []v1.VolumeMount,
	resources v1.ResourceRequirements,
	securityContext *v1.SecurityContext,
) v1.Container {
	cfgPath := client.DefaultConfigFilePath()
	// Note that parameters like $(PARAM) will be replaced by Kubernetes with env var content before
	// container creation.
	confScript := `
set -xEeuo pipefail
cat << EOF > ` + cfgPath + `
[global]
mon_host = $(ROOK_CEPH_MON_HOST)
[` + username + `]
keyring = ` + keyringPath + `
EOF
chmod 444 ` + cfgPath + `
cat ` + cfgPath + `
`
	return v1.Container{
		Name:            "generate-minimal-ceph-conf",
		Command:         []string{"/bin/bash", "-c", confScript},
		Args:            []string{},
		Image:           containerImage,
		ImagePullPolicy: containerImagePullPolicy,
		VolumeMounts:    volumeMounts,
		Env:             opconfig.StoredMonHostEnvVars(),
		Resources:       resources,
		SecurityContext: securityContext,
	}
}
The nvmeof.conf needs to be generated as well, maybe with a similar workflow in its own init container?
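A rough sketch of what a parallel init container for nvmeof.conf could look like; the helper name and paths here are hypothetical and only mirror the sed-based interpolation discussed later in this thread:

import (
	v1 "k8s.io/api/core/v1"
)

// generateNVMeOFConfInitContainer (hypothetical) renders /etc/ceph/nvmeof.conf
// from the ConfigMap-mounted template before the gateway container starts.
// POD_NAME, ANA_GROUP, and POD_IP are expected as env vars (e.g., downward API).
func generateNVMeOFConfInitContainer(image string, mounts []v1.VolumeMount) v1.Container {
	script := `
set -xEeuo pipefail
sed -e "s/@@POD_NAME@@/${POD_NAME}/g" \
    -e "s/@@ANA_GROUP@@/${ANA_GROUP}/g" \
    -e "s/@@POD_IP@@/${POD_IP}/g" \
    < /config/ceph-nvmeof.conf > /etc/ceph/nvmeof.conf
cat /etc/ceph/nvmeof.conf
`
	return v1.Container{
		Name:         "generate-nvmeof-conf",
		Command:      []string{"/bin/bash", "-c", script},
		Image:        image,
		VolumeMounts: mounts,
	}
}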
)

const (
	controllerName = "ceph-nvmeof-gateway-controller"
This will be a fairly long prefix for log messages. Perhaps the -controller suffix could be omitted from the logs, similar to the ceph-bucket-notification controller?
done
[ceph]
pool = `)
	config.WriteString(poolName)
I'm not nacking this but I'm concerned this will be brittle. Perhaps it would be better to use a template?
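For illustration, a minimal sketch of the template approach with Go's text/template; the section shown matches the snippet above, and the rest of the config would follow the same pattern:

import (
	"strings"
	"text/template"
)

const cephSectionTmpl = `[ceph]
pool = {{ .Pool }}
`

// renderCephSection renders the [ceph] section from a template instead of
// concatenating strings, keeping the config layout in one place.
func renderCephSection(pool string) (string, error) {
	t, err := template.New("ceph").Parse(cephSectionTmpl)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	if err := t.Execute(&b, struct{ Pool string }{pool}); err != nil {
		return "", err
	}
	return b.String(), nil
}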
I wonder if this is sufficient and works everywhere. With my testing on different cloud provider clusters and within Ceph-CSI e2e, I found that I needed to disable some defaults. But the documentation of the gateway configuration may give better clarity than what I managed to put together at https://github.com/ceph/ceph-csi/blob/devel/e2e/nvmeof/config.yaml
I'm also concerned, and I want to make sure we are clear on why we are choosing this pattern.
We are creating a ConfigMap that defines domain-specific templating (@@VAR@@) instead of using well-understood templating semantics.
Are we reconciling the ConfigMap and overwriting any user modifications each time, or do we write it once and allow users to modify their config through it? If we overwrite it each time, we would be better off having an init container that creates this file each time, so that users don't modify it and then complain when it changes later.
Cross reference: #16689 (comment)
func (r *ReconcileCephNVMeOFGateway) generateConfigMap(nvmeof *cephv1.CephNVMeOFGateway) *v1.ConfigMap {
	poolName := nvmeof.Spec.Pool
	if poolName == "" {
		poolName = "nvmeofpool"
Is this a pool that Ceph will automatically create when an nvmeof daemon is started, or is it a default being created by Rook? If it's the latter, it might be better to require the end user to explicitly set the pool name.
@gadididi could we get a clear response here instead of a thumbs up? I'm not sure how to interpret this
The pool is not automatically created. The customer must create the CephBlockPool CR first, then create the CephNVMeOFGateway CR.
The default "nvmeofpool" is only used when spec.pool is omitted. The pool must already exist (created via CephBlockPool CR) before the gateway can use it.
If this isn't a Ceph default, I would suggest not having this default. Instead, we can make the pool name a required input on the custom resource. This will avoid a minor amount of complexity to manage internally (maintenance and test simplification) and could avoid some bugs or user confusion.
I agree strongly with @BlaineEXE that it is better to require explicit declaration of the pool rather than magic defaults.
It might be worth considering if new CRDs should require that pools already exist or go a step further and require that a CephBlockPool CR exists (and its phase is Ready). The argument for introspecting on CephBlockPool CRs is that it allows decisions to be made about continuing with a reconcile by probing only k8s state (which may already be cached) instead of having to probe ceph's state.
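A sketch of that introspection, assuming the controller has a client and that CephBlockPool reports a Ready phase in its status (as other Rook CRs do):

import (
	"context"

	"github.com/pkg/errors"
	cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// blockPoolReady checks only k8s state: the CephBlockPool CR must exist and be
// in the Ready phase before the gateway reconcile proceeds.
func blockPoolReady(ctx context.Context, c client.Client, name, namespace string) (bool, error) {
	pool := &cephv1.CephBlockPool{}
	if err := c.Get(ctx, types.NamespacedName{Name: name, Namespace: namespace}, pool); err != nil {
		return false, errors.Wrapf(err, "failed to get CephBlockPool %q", name)
	}
	return pool.Status != nil && pool.Status.Phase == cephv1.ConditionReady, nil
}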
I've made spec.pool required and removed the default nvmeofpool value. If you prefer, I can add logic in the controller to automatically create the CephBlockPool CR when a CephNVMeOFGateway is created.
Will the controller create the block pool via the Ceph CLI, similar to this function?
rook/pkg/operator/ceph/object/objectstore.go
Lines 1116 to 1117 in 70ec50b
func createRGWPool(ctx *Context, cluster *cephv1.ClusterSpec, poolSpec cephv1.PoolSpec, pgCount, requestedName string) error {
	// create the pool if it doesn't exist yet
I prefer to do initial major-feature development by stripping away as many automations and minor features as possible to still get to a useful working result for end users.
In this case, automatic pool creation is what I consider premature user-workflow optimization in Rook controllers.
From observing Rook feature development over time, I have seen that trying to automate things like this during the first development phase often results in users not having flexibility in their desired end state, requiring more developer rework over time.
@BlaineEXE Hi, I missed that you tagged me.
Yes, as Oded said, the RBD pool is not created automatically.
You can create it before or after the nvmeof GWs are up.
Also FYI, the nvmeof GWs can work with multiple pools, but I don't know the requirements.
pkg/operator/ceph/nvmeof/spec.go
sed -e "s/@@POD_NAME@@/${POD_NAME}/g" \
    -e "s/@@ANA_GROUP@@/${ANA_GROUP}/g" \
    -e "s/@@POD_IP@@/${POD_IP}/g" \
    < /config/ceph-nvmeof.conf > /etc/ceph/nvmeof.conf
The ConfigMap that's generated is only a template, with interpolation done in the init container? Why not generate the entire config here and reduce the number of moving parts?
That is surely an option. However, nvmeof.conf contains a lot of options, and it may be difficult to decide which options should be configurable in a CRD. Documentation is a little sparse (or I don't know where to find it), but here is one example of the config file: https://github.com/ceph/ceph-nvmeof/blob/devel/ceph-nvmeof.conf
The one included in the ConfigMap is a minimal version that does not depend on specific memory configurations or pre-allocations. It works, but it can probably be tuned for better resource usage or performance.
At a glance, it looks like that ini format file could be represented in a CR as a map[string]map[string]string.
I feel similarly. We can have users modify the configmap to set advanced configs, but this makes it awkward when (and there will be a "when") we have to make decisions about when it is safe for Rook to update the configmap versus when it can't because users have modified it.
I think it will be simpler for users and for code maintenance to provide some sort of passthru in Rook and avoid the configmap. The pattern of Josh's suggestion is already being used to set arbitrary ceph config options here:
rook/pkg/apis/ceph.rook.io/v1/types.go
Lines 239 to 242 in 8db5274
	// Ceph Config options
	// +optional
	// +nullable
	CephConfig map[string]map[string]string `json:"cephConfig,omitempty"`
	var configMapName, configHash string
	var err error

	if cephNVMeOFGateway.Spec.ConfigMapRef == "" {
If the CM is edited by a user, it will never be overwritten; is that an intentional design choice?
Ideally the ConfigMap with nvmeof.conf can be edited by the admin so that the configuration of the gateway can be modified according to their hardware/needs/...
If that is the intended interface, then argo/flux/fleet/etc. might create the CM before the CephNVMeOFGateway and the controller would need to accommodate that. It would also mean that the "template" in the CM becomes a stable interface and any changes would need to be carefully backwards compatible. Another downside to this approach is that typos/errors in the template string would either have to be discovered by trial and error by the user or a manual validator would need to be written. Whereas field validation comes essentially for free with a CRD.
subhamkrai left a comment
@OdedViner I was able to find the issue, RBAC is missing for cephnvmeofgateways. Please add those in common.yaml. I can share the common.yaml with you offline.
@@ -0,0 +1,67 @@
/*
Copyright 2018 The Rook Authors. All rights reserved.
nit: 2025 ⌚
done
When I run make codegen, the year is automatically changed to 2018 (Copyright 2018 The Rook Authors. All rights reserved.) in pkg/client/clientset/versioned/typed/ceph.rook.io/v1/cephnvmeofgateway.go.
I think a new nvmeof integration suite should probably be created instead of trying to add tests for this into one of the existing suites.
pkg/operator/ceph/nvmeof/spec.go
    < /config/ceph-nvmeof.conf > /etc/ceph/nvmeof.conf
ceph nvme-gw create ${POD_NAME} %s ${ANA_GROUP}
ceph nvme-gw show %s ${ANA_GROUP}
I would add parsing of the return value (JSON) from ceph nvme-gw show %s ${ANA_GROUP} and validate that everything was OK.
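A hedged sketch of such a check; it only verifies the output is well-formed JSON, since the exact schema of ceph nvme-gw show isn't shown here and field-level validation would need it:

import (
	"encoding/json"

	"github.com/pkg/errors"
)

// validateGatewayShowOutput sanity-checks the raw output of
// `ceph nvme-gw show <pool> <group>`: it must parse as a non-empty JSON object.
func validateGatewayShowOutput(raw []byte) error {
	out := map[string]interface{}{}
	if err := json.Unmarshal(raw, &out); err != nil {
		return errors.Wrap(err, "nvme-gw show did not return valid JSON")
	}
	if len(out) == 0 {
		return errors.New("nvme-gw show returned an empty JSON object")
	}
	return nil
}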
pkg/operator/ceph/nvmeof/spec.go
	}

	script := fmt.Sprintf(`
set -xEeuo pipefail
I'd suggest that we not inline this script in the go code. Instead, I've found development and maintenance to be easier with separate scripts that are loaded into strings. We do this with the MDS liveness probe:
https://github.com/rook/rook/blob/master/pkg/operator/ceph/file/mds/livenessprobe.sh
https://github.com/rook/rook/blob/master/pkg/operator/ceph/file/mds/livenessprobe.go
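One way to load the script is go:embed; a sketch, with the file name being illustrative:

import (
	_ "embed"
)

// The init script lives next to the package source instead of being inlined in Go.
//
//go:embed nvmeof-init.sh
var nvmeofInitScript string

// The embedded string can then be passed to the init container, e.g.
// Command: []string{"/bin/bash", "-c", nvmeofInitScript}.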
pkg/operator/ceph/nvmeof/spec.go
[global]
mon_host = $(CEPH_MON_HOST)
log_to_stderr = true
keyring = /etc/ceph/keyring
EOF
This was needed in the NFS script because NFS-Ganesha doesn't accept the --mon-host argument. If the commands we are running accept these args, we should use the args instead of the config file
pkg/operator/ceph/nvmeof/spec.go
    -e "s/@@POD_IP@@/${POD_IP}/g" \
    < /config/ceph-nvmeof.conf > /etc/ceph/nvmeof.conf
ceph nvme-gw create ${POD_NAME} %s ${ANA_GROUP}
We may need to be careful here when it comes to pod restarts. A k8s deployment will end up having many pod names (e.g., rook-ceph-nvme-gw-abc123) over its lifetime. Will this mean Ceph will register many unique gateway names for each pod restart? And would this mean we are responsible for cleanup of the old pod names/IDs?
If we use a more stable naming here, like deployment name, we will also need to make sure this is idempotent -- or we must make our script idempotent.
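One stable-naming option, sketched with Rook's existing index-to-letter helper; whether the gateway registration should be keyed this way (and how cleanup of old registrations works) is still the open question here:

import (
	"fmt"

	"github.com/rook/rook/pkg/operator/k8sutil"
)

// stableGatewayName derives a per-instance gateway ID from the CR name and the
// instance index (a, b, c, ...) instead of the pod name, so restarts of the same
// deployment re-register the same gateway rather than accumulating stale IDs.
func stableGatewayName(crName string, instance int) string {
	return fmt.Sprintf("%s-%s", crName, k8sutil.IndexToName(instance))
}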
	nvmeofGatewayPort    = 5500
	nvmeofMonitorPort    = 5499
	nvmeofDiscoveryPort  = 8009
	defaultNVMeOFImage   = "quay.io/ceph/nvmeof:1.5"
I'm curious about @travisn 's thoughts, but I would suggest not having a default here. If we keep a default, we will always be obligated to keep it up to date. If we require it as input, we can avoid a minor, repeated (and probably often overlooked) maintenance task
I've added an optional image parameter to the NVMeOFGatewaySpec. If not specified, it defaults to quay.io/ceph/nvmeof:1.5. This allows users to override the image when needed while maintaining backward compatibility.
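The described fallback reduces to something like this (value per the comment above, and subject to the maintenance concern about keeping the default current):

const defaultNVMeOFImage = "quay.io/ceph/nvmeof:1.5"

// gatewayImage returns the user-supplied image, falling back to the package
// default when spec.image is omitted.
func gatewayImage(specImage string) string {
	if specImage != "" {
		return specImage
	}
	return defaultNVMeOFImage
}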
pkg/operator/ceph/nvmeof/spec.go
ceph nvme-gw create ${POD_NAME} %s ${ANA_GROUP}
ceph nvme-gw show %s ${ANA_GROUP}
There is a neat bash trick we can do for arguments here that I think will help us.
Let's imagine this simplified script:
#!/usr/bin/env bash
set -xEeuo pipefail
# ARGS: all args are assumed to be 'ceph' CLI args
ceph "$@" nvme-gw <subcommand> <args>
ceph "$@" nvme-gw show %s ${ANA_GROUP}
We can then call the script like this:
bash script.sh --connect-timeout=90 --mon-host="${CEPH_MON_HOST}" --etc-etc-etc
When the script runs, it will use the mon args where "$@" exists in the script. This allows easy passthru, and we can use Rook's methods for setting CLI options for Ceph daemons in the init container to make our lives easier and to make sure we don't miss options.
@jhoblitt Done https://github.com/rook/rook/pull/16689/files#diff-e92d2b50640744c370e7d59b0510338412d324658ce89df677dbcf1105a4ae57
adds CephNVMeOFGateway crd for managing ceph nvme-of gateways. supports configurable instances, placement, resources, and networking options. Signed-off-by: Oded Viner <oviner@redhat.com>
travisn left a comment
My review mainly focused on getting the examples and documentation created to be clear how users can consume the feature.
	// For example, quay.io/ceph/nvmeof:1.5
	// If not specified, a default image will be used.
	// +optional
	Image string `json:"image,omitempty"`
The nvmeof gateway image is separate from the ceph/ceph image? Until now, everything in ceph has been a single image, and we could get it from the cephcluster CR.
This is an NVMe-oF image, not a Ceph image, based on:
https://github.com/ceph/ceph-csi/blob/devel/e2e/nvmeof/deployment.yaml#L32
@@ -0,0 +1,69 @@
# This example is for Ceph v19 and above only
Doesn't this require Ceph v20?
done
@@ -0,0 +1,69 @@
# This example is for Ceph v19 and above only
# The CephBlockPool must be created before the CephNVMeOFGateway
The order with CRDs should not matter. If the controller does not find a needed pool, it can just fail the reconcile and retry until it finds the pool. So no need to mention this order.
# - ConfigMap (if configMapRef is not specified)
# - Service (one per gateway instance)
# - Deployment (one per gateway instance)
# The ServiceAccount "ceph-nvmeof-gateway" must exist in the namespace.
Let's add the service account and needed rbac to common.yaml
apiVersion: ceph.rook.io/v1
kind: CephNVMeOFGateway
metadata:
  name: my-nvmeof
How about just removing the my prefix? In retrospect, we should not have included that in the other example names either.
Suggested change: name: my-nvmeof → name: nvmeof
done
  namespace: rook-ceph # namespace:cluster
spec:
  # Pool name that will be used by the NVMe-oF gateway
  pool: nvmeofpool
I would suggest no "pool" suffix; since it is already of type pool, the name is redundant.
Suggested change: pool: nvmeofpool → pool: nvmeof
  # Pool name that will be used by the NVMe-oF gateway
  pool: nvmeofpool
  # ANA (Asymmetric Namespace Access) group name
  group: mygroup
How about this example name instead of the "my" prefix?
Suggested change: group: mygroup → group: group-a
# This example is for Ceph v19 and above only
# The CephBlockPool must be created before the CephNVMeOFGateway
#
# Note: The operator will automatically create the following resources:
Please create a document that describes how the user can use this feature. Some thoughts:
- Create the doc: Documentation/Storage-Configuration/Block-Storage-RBD/nvme-of.md
- The content will be similar to block-storage.md: describe the scenario, show a simple yaml example, describe how the user can enable the nvmeof csi driver, and walk through all the steps to configure the scenario.
With this document, reviewers can even better understand the end-to-end flow to better review the feature, and then the feature will be documented for users to try out.
I would suggest creating two example yamls:
- nvmeof.yaml: This is the production example. Basically, just rename this file, and change the pool replication to 3.
- nvmeof-test.yaml: This will be the replica 1 pool example, and the nvmeof CR will be as simple as possible, no need to have all the comments with all possible settings.
We also need an example storage class.
Since this involves CSI as well, I liked Travis's comment from yesterday to add a deploy/examples/csi/nvmeof directory with examples similar to what we see in the other CSI dirs. We can focus on a single storageclass, pvc, and pod for now. More examples might just make things complicated at this early stage.
BlaineEXE left a comment
The most common feedback item here (of which there are more instances than I flagged) is that in Rook, it is almost always undesirable to use logger.Error() in a reconciliation loop. Only in rare cases do we actually do this in Rook. ReportReconcileResult() handles logging errors that are "bubbled up" via wrapped error returns in nearly all cases.
In other projects that use a different logging framework, logger.Error() can be more common, but not in Rook. Please remove all instances unless you believe that it is imperative to log the error. If you do decide that it's imperative, please tag me and Travis on a comment around it, and we will most likely find a way to help avoid logging the error.
For most errors, I see errors.Wrapf() used, which is generally what I am looking for when reviewing error handling in Rook.
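In other words, the preferred shape is roughly this (a generic sketch, not code from the PR):

import (
	"github.com/pkg/errors"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// reconcileStep wraps and returns the error with context so the reconcile
// result reporting can log it once, instead of calling logger.Errorf here and
// then returning the same error.
func reconcileStep(crName string, apply func() error) (reconcile.Result, error) {
	if err := apply(); err != nil {
		return reconcile.Result{}, errors.Wrapf(err, "failed to reconcile CephNVMeOFGateway %q", crName)
	}
	return reconcile.Result{}, nil
}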
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata"`
	Spec              NVMeOFGatewaySpec `json:"spec"`
	// +kubebuilder:pruning:PreserveUnknownFields
This is generally only needed to keep fields around when dealing with legacy types. I'd suggest getting rid of this here because it's a whole new resource.
	// +optional
	Image string `json:"image,omitempty"`
I would suggest making this a required input for users and setting // +kubebuilder:validation:MinLength=1
	// The number of active gateway instances
	Instances int `json:"instances"`
Current K8s API conventions suggest omitempty for all fields. Please add here and below.
They also recommend setting either // +required or // +optional for all fields. Please add one of those here and for fields below.
Optional fields are recommended to be pointers, except for maps and slices.
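Applied to this spec, those conventions would look roughly like this (the field set is illustrative, not the final shape in the PR):

type NVMeOFGatewaySpec struct {
	// Image is the NVMe-oF gateway container image, e.g. quay.io/ceph/nvmeof:1.5.
	// +kubebuilder:validation:MinLength=1
	// +required
	Image string `json:"image,omitempty"`

	// Instances is the number of active gateway instances. Optional scalar fields
	// are pointers so "unset" and "zero" remain distinguishable.
	// +optional
	Instances *int `json:"instances,omitempty"`
}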
	// NVMeOFConfig is a map of section names to key-value pairs for nvmeof.conf configuration
	// This allows users to override or add configuration options without needing to manage a ConfigMap
	// +optional
	// +nullable
// +nullable is actively discouraged by modern API conventions. Please remove any additions of nullable
const (
	packageName                   = "ceph-nvmeof-gateway"
	controllerName                = packageName + "-controller"
	CephNVMeOFGatewayNameLabelKey = "ceph_nvmeof_gateway"
Is it possible to avoid adding a new label and instead rely on ceph_daemon_id added by CephDaemonAppLabels()?
	configContent, err := getNVMeOFGatewayConfig(poolName, podName, podIP, anaGroup, nvmeof.Spec.NVMeOFConfig)
	if err != nil {
		logger.Errorf("failed to generate config: %v", err)
No need to log an error when returning the error.
	for key, value := range options {
		section.Key(key).SetValue(value)
	}
I would recommend also using section.Key(key).GetValue() here to determine if a value has already been specified. If a value is set already, let's log something like below, which can help us and users debug the code used to set defaults and apply user overrides:
logger.NamespacedDebugf("overriding section %q config %q value %q with user-specified value %q", section, key, currentValue, newValue)
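A sketch of that check with the go-ini API (using HasKey and Key(...).Value(), which the library provides; the exact logger call is up to the PR):

import (
	ini "gopkg.in/ini.v1"
)

// applyUserOptions overlays user-specified options onto a section, logging when
// a default is being overridden so the effective config is easy to debug.
func applyUserOptions(section *ini.Section, options map[string]string, debugf func(format string, args ...interface{})) {
	for key, value := range options {
		if section.HasKey(key) {
			debugf("overriding section %q config %q value %q with user-specified value %q",
				section.Name(), key, section.Key(key).Value(), value)
		}
		section.Key(key).SetValue(value)
	}
}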
cat << EOF > /etc/ceph/ceph.conf
[global]
mon_host = ${ROOK_CEPH_MON_HOST}
log_to_stderr = true
keyring = /etc/ceph/keyring
EOF
Is this needed for the nvme-gw itself, or is this used for the ceph commands below?
If it's needed for the nvme-gw itself, please provide a comment that indicates why CLI args are insufficient.
If it's for the ceph commands below, I'm quite sure this can be omitted.
| gatewaySection.Key("port").SetValue("5500") | ||
| gatewaySection.Key("enable_auth").SetValue("False") | ||
| gatewaySection.Key("state_update_notify").SetValue("True") | ||
| gatewaySection.Key("state_update_timeout_in_msec").SetValue("2000") | ||
| gatewaySection.Key("state_update_interval_sec").SetValue("5") | ||
| gatewaySection.Key("enable_spdk_discovery_controller").SetValue("False") | ||
| gatewaySection.Key("encryption_key").SetValue("/etc/ceph/encryption.key") | ||
| gatewaySection.Key("rebalance_period_sec").SetValue("7") | ||
| gatewaySection.Key("max_gws_in_grp").SetValue("16") | ||
| gatewaySection.Key("max_ns_to_change_lb_grp").SetValue("8") | ||
| gatewaySection.Key("verify_listener_ip").SetValue("False") | ||
| gatewaySection.Key("enable_monitor_client").SetValue("True") |
I'm personally fine with establishing defaults here without using package consts to define them. However, I do think we should make comments that indicate why we believe Rook's defaults to be needed instead of relying on nvme-gw's own internal defaults.
For example, if enable_auth is False by default, I would prefer to either remove Rook's default, or add a comment that explains why Rook believes it's necessary to establish default control for the value.
For things like IP and port, I think it is obvious enough that we don't need comments to doc why Rook is modifying the value. For anything beyond this, I'd prefer to see a comment to help with code maintenance and future decision making around changes to these values.
In general, in Rook we tend to prefer to allow components to do their own defaulting, which trusts the component to change configurations when necessary for user benefit.
Hypothetically, if rebalance_period_sec = 7 is nvme-gw's default and nvme-gw devs determine that 7 seconds isn't enough to account for 99% tail latencies in most deployments, Rook will benefit when nvme-gw changes to rebalance_period_sec = 9, but Rook would lose this benefit if Rook establishes its own control.
In the inverse hypothetical, if rebalance_period_sec = 5 is nvme-gw's default and this isn't enough to account for 99% tail latencies in most Rook deployments where the k8s Pod network is adding some latency, it is totally appropriate for Rook to establish a more preferred default of rebalance_period_sec = 7. In this case, we should document in a comment that nvme-gw's default was too short for Rook users and why we chose 7 specifically as the new value for Rook users.
	if err != nil {
		return "", errors.Wrap(err, "failed to create gateway-logs section")
	}
	gatewayLogsSection.Key("log_level").SetValue("debug")
Should we consider changing the default based on ROOK_LOG_LEVEL? Thoughts @travisn ?
Let's revert it. @OdedViner while testing the feature, you can just set the log level to debug in the operator.yaml, no need to set it in code.
Fixes: #15551
adds support for ceph nvme-of gateways through a new crd.
Test NVMeOF on minikube, based on: https://github.com/ceph/ceph-csi/tree/devel/e2e/nvmeof
ROOK_CSI_CEPH_IMAGE: "quay.io/gdidi/cephcsi:nvmeof"
Also set Ceph version 20: quay.io/ceph/ceph:v20.
Working with ceph version:
image: quay.io/ceph/ceph:v20 (rook/deploy/examples/cluster-test.yaml, line 18 in e3341a4)
2. Create nvmeofpool and verify it is in the Ready phase
3. Create the CephNVMeOFGateway CR
4. Verify the ceph-nvmeof-gateway pod is in Running state