Conversation

@OdedViner (Contributor) commented Nov 10, 2025

Fixes: #15551
Adds support for Ceph NVMe-oF gateways through a new CRD.

  • Introduces the CephNVMeOFGateway custom resource
  • Implements a controller for managing the NVMe-oF gateway lifecycle
  • Supports configurable instances, placement, resources, and host networking
  • Generates the gateway configuration and manages deployments/services
  • Follows existing patterns from the NFS controller implementation

Test NVMe-oF on minikube, based on: https://github.com/ceph/ceph-csi/tree/devel/e2e/nvmeof

  1. Deploy the Rook operator without the CSI operator, and use a private image for Ceph-CSI:
    ROOK_CSI_CEPH_IMAGE: "quay.io/gdidi/cephcsi:nvmeof"
    Also set the Ceph image to quay.io/ceph/ceph:v20.
kind: ConfigMap
apiVersion: v1
metadata:
  name: rook-ceph-operator-config
  # should be in the namespace of the operator
  namespace: rook-ceph # namespace:operator
data:
  ROOK_USE_CSI_OPERATOR: "false"
  ROOK_CSI_CEPH_IMAGE: "quay.io/gdidi/cephcsi:nvmeof"
$ kubectl get pods -n rook-ceph  csi-rbdplugin-l2rgc -o yaml| grep didi
    image: quay.io/gdidi/cephcsi:nvmeof
    image: quay.io/gdidi/cephcsi:nvmeof

Working with Ceph image quay.io/ceph/ceph:v20 (rather than quay.io/ceph/ceph:v19).

  2. Create the nvmeofpool CephBlockPool and verify it reaches the Ready phase

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: nvmeofpool
  namespace: rook-ceph # namespace:cluster
spec:
  failureDomain: osd
  replicated:
    size: 1
$ kubectl get Cephblockpool -A
NAMESPACE   NAME          PHASE   TYPE         FAILUREDOMAIN   AGE
rook-ceph   nvmeofpool    Ready   Replicated   osd             6m3s

  3. Create the CephNVMeOFGateway CR

apiVersion: ceph.rook.io/v1
kind: CephNVMeOFGateway
metadata:
  name: my-nvmeof
  namespace: rook-ceph # namespace:cluster
spec:
  # Pool name that will be used by the NVMe-oF gateway
  pool: nvmeofpool
  # ANA (Asymmetric Namespace Access) group name
  group: mygroup
  # Number of gateway instances to run
  instances: 1

  4. Verify the ceph-nvmeof-gateway pod is in the Running state

$ kubectl get pods ceph-nvmeof-gateway-765dffc5bc-sfscr -n rook-ceph 
NAME                                   READY   STATUS    RESTARTS   AGE
ceph-nvmeof-gateway-765dffc5bc-sfscr   1/1     Running   0          3m20s

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@OdedViner force-pushed the nmveof_new2 branch 3 times, most recently from 70f289f to b5e8c8e on November 10, 2025 08:47
@OdedViner requested a review from ideepika on November 10, 2025 08:48
@OdedViner force-pushed the nmveof_new2 branch 5 times, most recently from d2344fc to 61c2a4a on November 10, 2025 15:07

metricsBindAddress := k8sutil.GetOperatorSetting("ROOK_OPERATOR_METRICS_BIND_ADDRESS", "0")
skipNameValidation := true
cacheSyncTimeout := 10 * time.Minute
Contributor Author

Adding the NVMeOF controller causes operator startup timeouts.
Increased CacheSyncTimeout to 10 minutes.
Rook operator logs with the private image show errors with the new NVMeoF code.
@travisn Do you know why this is happening?

2025-11-10 14:29:16.627898 D | clusterdisruption-controller: could not match failure domain. defaulting to "host"
2025-11-10 14:29:16.627938 D | exec: Running command: ceph osd metadata --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2025-11-10 14:29:16.790834 E | clusterdisruption-controller: failed to get OSD status: failed to get osd metadata: exit status 1
2025-11-10 14:29:16.963003 I | ceph-csi: cluster info for cluster "my-cluster" is not ready yet, will retry in 10s, proceeding with ready clusters
2025-11-10 14:29:16.970832 C | rookcmd: failed to run operator: gave up to run the operator manager: failed to run the controller-runtime manager: failed to wait for ceph-object-realm-controller caches to sync kind source: *v1.CephObjectRealm: timed out waiting for cache to be synced for Kind *v1.CephObjectRealm

Member

Probably check if the CephObjectRealm CRD is present.

Contributor Author

On another test, I got this error

2025-11-11 12:41:59.314806 D | exec: Running command: ceph osd metadata --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2025-11-11 12:41:59.506880 E | clusterdisruption-controller: failed to get OSD status: failed to get osd metadata: exit status 1
2025-11-11 12:41:59.509731 C | rookcmd: failed to run operator: gave up to run the operator manager: failed to run the controller-runtime manager: failed to wait for ceph-object-zone-controller caches to sync kind source: *v1.CephObjectZone: timed out waiting for cache to be synced for Kind *v1.CephObjectZone

Member

cc @subhamkrai, who mentioned adding the type registration to the scheme; can you try that?


Contributor

Maybe something like this will resolve it:

err = addonsv1alpha1.AddToScheme(mgr.GetScheme())
if err != nil {
    return err
}
err = csiopv1.AddToScheme(mgr.GetScheme())
if err != nil {
    return err
}

script := fmt.Sprintf(`
set -xEeuo pipefail
cat << EOF > /etc/ceph/ceph.conf

Contributor

That has been manually copied from the NFS-Ganesha deployment that Rook made. I think the script is generated by this:

// GenerateMinimalCephConfInitContainer returns an init container that will generate the most basic
// Ceph config for connecting non-Ceph daemons to a Ceph cluster (e.g., nfs-ganesha). Effectively
// what this means is that it generates '/etc/ceph/ceph.conf' with 'mon_host' populated and a
// keyring path associated with the user given. 'mon_host' is determined by the 'ROOK_CEPH_MON_HOST'
// env var present in other Ceph daemon pods, and the keyring is expected to be mounted into the
// container with a Kubernetes pod volume+mount.
func GenerateMinimalCephConfInitContainer(
    username, keyringPath string,
    containerImage string,
    containerImagePullPolicy v1.PullPolicy,
    volumeMounts []v1.VolumeMount,
    resources v1.ResourceRequirements,
    securityContext *v1.SecurityContext,
) v1.Container {
    cfgPath := client.DefaultConfigFilePath()
    // Note that parameters like $(PARAM) will be replaced by Kubernetes with env var content before
    // container creation.
    confScript := `
set -xEeuo pipefail
cat << EOF > ` + cfgPath + `
[global]
mon_host = $(ROOK_CEPH_MON_HOST)
[` + username + `]
keyring = ` + keyringPath + `
EOF
chmod 444 ` + cfgPath + `
cat ` + cfgPath + `
`
    return v1.Container{
        Name:            "generate-minimal-ceph-conf",
        Command:         []string{"/bin/bash", "-c", confScript},
        Args:            []string{},
        Image:           containerImage,
        ImagePullPolicy: containerImagePullPolicy,
        VolumeMounts:    volumeMounts,
        Env:             opconfig.StoredMonHostEnvVars(),
        Resources:       resources,
        SecurityContext: securityContext,
    }
}

The nvmeof.conf needs to be generated as well, maybe with a similar workflow in its own init container?
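
For illustration, a rough sketch (not the PR's code) of what such an init container helper could look like, modeled on the function quoted above. The function name, container name, config keys, and env var names are assumptions, and the env wiring (downward API for POD_NAME/POD_IP, pool and ANA group from the CR) is omitted:

package nvmeof

import (
    v1 "k8s.io/api/core/v1"
)

// GenerateNVMeOFConfInitContainer (hypothetical) renders a minimal
// /etc/ceph/nvmeof.conf before the gateway container starts. Parameters like
// $(PARAM) would be replaced by Kubernetes with env var content, as in the
// function quoted above.
func GenerateNVMeOFConfInitContainer(
    containerImage string,
    containerImagePullPolicy v1.PullPolicy,
    volumeMounts []v1.VolumeMount,
    resources v1.ResourceRequirements,
    securityContext *v1.SecurityContext,
) v1.Container {
    confScript := `
set -xEeuo pipefail
cat << EOF > /etc/ceph/nvmeof.conf
[gateway]
name = $(POD_NAME)
group = $(ANA_GROUP)
addr = $(POD_IP)
[ceph]
pool = $(NVMEOF_POOL)
config_file = /etc/ceph/ceph.conf
EOF
chmod 444 /etc/ceph/nvmeof.conf
cat /etc/ceph/nvmeof.conf
`
    return v1.Container{
        Name:            "generate-nvmeof-conf",
        Command:         []string{"/bin/bash", "-c", confScript},
        Image:           containerImage,
        ImagePullPolicy: containerImagePullPolicy,
        VolumeMounts:    volumeMounts,
        Resources:       resources,
        SecurityContext: securityContext,
        // Env would supply POD_NAME/POD_IP via the downward API and the
        // pool/ANA group from the CephNVMeOFGateway spec; omitted here.
    }
}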

)

const (
controllerName = "ceph-nvmeof-gateway-controller"
Contributor

This will be a fairly long prefix for log messages. Perhaps the -controller suffix could be omitted from the logs, similar to the ceph-bucket-notification controller?

Contributor Author

done

[ceph]
pool = `)
config.WriteString(poolName)
Contributor

I'm not NACKing this, but I'm concerned this will be brittle. Perhaps it would be better to use a template?
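
For example, a minimal sketch of the template approach using Go's text/template; the type, template, and config keys here are illustrative only, not the PR's:

package main

import (
    "os"
    "text/template"
)

// cephSectionTmpl replaces the WriteString concatenation with a declared
// template, so the shape of the generated section is visible in one place.
const cephSectionTmpl = `[ceph]
pool = {{ .PoolName }}
config_file = /etc/ceph/ceph.conf
`

type cephSection struct {
    PoolName string
}

func main() {
    tmpl := template.Must(template.New("ceph").Parse(cephSectionTmpl))
    // The controller would render into a builder for the ConfigMap data;
    // stdout is used here only to keep the example runnable.
    if err := tmpl.Execute(os.Stdout, cephSection{PoolName: "nvmeofpool"}); err != nil {
        panic(err)
    }
}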

Contributor

I wonder if this is sufficient and works everywhere. With my testing on different cloud provider clusters and within Ceph-CSI e2e, I found that I needed to disable some defaults. But the documentation of the gateway configuration may give better clarity than what I managed to put together at https://github.com/ceph/ceph-csi/blob/devel/e2e/nvmeof/config.yaml

Member @BlaineEXE Nov 20, 2025

I'm also concerned, and I want to make sure we are clear on why we are choosing this pattern.

We are creating a configmap that has domain-specific templating (@@VAR@@) instead of using well-understood templating semantics.

Are we reconciling the configmap and overwriting any user modifications each time, or do we write it once and allow users to use it to modify their config? If we overwrite it each time, we would be better off having an init container that creates this file each time, so that users don't modify it and then complain when it changes later.

Member

Cross reference: #16689 (comment)

func (r *ReconcileCephNVMeOFGateway) generateConfigMap(nvmeof *cephv1.CephNVMeOFGateway) *v1.ConfigMap {
poolName := nvmeof.Spec.Pool
if poolName == "" {
poolName = "nvmeofpool"
Contributor

Is this a pool that Ceph will automatically create when an nvmeof daemon is started, or is it a default being created by Rook? If it's the latter, it might be better to require the end user to explicitly set the pool name.

Member @BlaineEXE Nov 20, 2025

@gadididi could we get a clear response here instead of a thumbs up? I'm not sure how to interpret this

Contributor Author

The pool is not automatically created. The customer must create the CephBlockPool CR first, then create the CephNVMeOFGateway CR.
The default "nvmeofpool" is only used when spec.pool is omitted. The pool must already exist (created via CephBlockPool CR) before the gateway can use it.

https://github.com/rook/rook/pull/16689/files#diff-0e2dc8235e5a70baa91cb4a7aee6db662b882a358e73af0a4a3a8ec255220cc8R15

Member @BlaineEXE Nov 24, 2025

If this isn't a Ceph default, I would suggest not having this default. Instead, we can make the pool name a required input on the custom resource. This will avoid a minor amount of complexity to manage internally (maintenance and test simplification) and could avoid some bugs or user confusion.

Contributor

I agree strongly with @BlaineEXE that it is better to require explicit declaration of the pool rather than magic defaults.

It might be worth considering if new CRDs should require that pools already exist or go a step further and require that a CephBlockPool CR exists (and its phase is Ready). The argument for introspecting on CephBlockPool CRs is that it allows decisions to be made about continuing with a reconcile by probing only k8s state (which may already be cached) instead of having to probe ceph's state.
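
A minimal sketch of that introspection, assuming Rook's existing CephBlockPool API types and the controller-runtime client already available in the reconciler (field names as in Rook's current API; treat this as illustrative, not the PR's implementation):

package nvmeof

import (
    "context"

    "github.com/pkg/errors"
    cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// poolIsReady fails the reconcile (to be retried) until the referenced
// CephBlockPool CR exists and reports a Ready phase, using only Kubernetes
// state rather than querying Ceph directly.
func poolIsReady(ctx context.Context, c client.Client, namespace, poolName string) error {
    pool := &cephv1.CephBlockPool{}
    if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: poolName}, pool); err != nil {
        return errors.Wrapf(err, "failed to get CephBlockPool %q", poolName)
    }
    if pool.Status == nil || pool.Status.Phase != cephv1.ConditionReady {
        return errors.Errorf("CephBlockPool %q is not Ready yet", poolName)
    }
    return nil
}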

Contributor Author @OdedViner Nov 30, 2025

I've made spec.pool required and removed the default nvmeofpool value. If you prefer, I can add logic in the controller to automatically create the CephBlockPool CR when a CephNVMeOFGateway is created.

Will the controller create the block pool via the Ceph CLI, similar to this function?

func createRGWPool(ctx *Context, cluster *cephv1.ClusterSpec, poolSpec cephv1.PoolSpec, pgCount, requestedName string) error {
// create the pool if it doesn't exist yet

@BlaineEXE @jhoblitt

Member

I prefer to do initial major-feature development by stripping away as many automations and minor features as possible to still get to a useful working result for end users.

In this case, automatic pool creation is what I consider premature user-workflow optimization in Rook controllers.

From observing Rook feature development over time, I have seen that trying to automate things like this during the first development phase often results in users not having flexibility in their desired end state, requiring more developer rework over time.


@BlaineEXE Hi, I missed that you tagged me.
Yes, as Oded said, the RBD pool is not created automatically.
You can create it before or after the NVMe-oF GWs are up.
Also FYI, the NVMe-oF GWs can work with multiple pools,
but I don't know the requirements.

sed -e "s/@@POD_NAME@@/${POD_NAME}/g" \
-e "s/@@ANA_GROUP@@/${ANA_GROUP}/g" \
-e "s/@@POD_IP@@/${POD_IP}/g" \
< /config/ceph-nvmeof.conf > /etc/ceph/nvmeof.conf
Contributor

The CM that's been generated is only a template to have interpolation done in the init container? Why not generate the entire config here and reduce the number of moving parts?

Contributor

That is surely an option. However, nvmeof.conf contains a lot of options, and it may be difficult to decide which of them should be configurable in a CRD. Documentation is a little sparse (or I don't know where to find it), but here is one example of the config file: https://github.com/ceph/ceph-nvmeof/blob/devel/ceph-nvmeof.conf

The one included in the ConfigMap is a minimal version that does not depend on specific memory configurations or pre-allocations. It works, but probably can be tuned for better resource usage or performance.

Contributor

At a glance, it looks like that ini format file could be represented in a CR as a map[string]map[string]string.

Member

I feel similarly. We can have users modify the configmap to set advanced configs, but this makes it awkward when (and there will be a "when") we have to make decisions about when it is safe for Rook to update the configmap versus when it can't because users have modified it.

I think it will be simpler for users and for code maintenance to provide some sort of passthru in Rook and avoid the configmap. The pattern of Josh's suggestion is already being used to set arbitrary ceph config options here:

// Ceph Config options
// +optional
// +nullable
CephConfig map[string]map[string]string `json:"cephConfig,omitempty"`
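
A rough sketch of that passthru shape, assuming the gopkg.in/ini.v1 package, which matches the section.Key(...).SetValue(...) calls seen elsewhere in this review; the section and key names below are taken from this discussion and are illustrative only:

package main

import (
    "os"

    "gopkg.in/ini.v1"
)

func main() {
    cfg := ini.Empty()

    // Defaults the controller would set.
    gw, _ := cfg.NewSection("gateway")
    gw.Key("port").SetValue("5500")

    // User passthru from the CR, e.g. a map[string]map[string]string field.
    overrides := map[string]map[string]string{
        "gateway":      {"enable_auth": "True"},
        "gateway-logs": {"log_level": "info"},
    }
    for sectionName, options := range overrides {
        section, err := cfg.NewSection(sectionName)
        if err != nil {
            panic(err)
        }
        for key, value := range options {
            section.Key(key).SetValue(value)
        }
    }

    // The controller would hand the rendered ini to the gateway (ConfigMap or
    // init container); stdout keeps the example self-contained.
    if _, err := cfg.WriteTo(os.Stdout); err != nil {
        panic(err)
    }
}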

var configMapName, configHash string
var err error

if cephNVMeOFGateway.Spec.ConfigMapRef == "" {
Contributor

If the CM is edited by a user, it will never be overwritten; is that an intentional design choice?

Contributor

Ideally the ConfigMap with nvmeof.conf can be edited by the admin so that the configuration of the gateway can be modified according to their hardware/needs/...

Contributor

If that is the intended interface, then argo/flux/fleet/etc. might create the CM before the CephNVMeOFGateway and the controller would need to accommodate that. It would also mean that the "template" in the CM becomes a stable interface and any changes would need to be carefully backwards compatible. Another downside to this approach is that typos/errors in the template string would either have to be discovered by trial and error by the user or a manual validator would need to be written. Whereas field validation comes essentially for free with a CRD.

@OdedViner changed the title from "ceph: add nvme-of gateway crd support" to "WIP: ceph: add nvme-of gateway crd support" on Nov 11, 2025
Contributor @subhamkrai left a comment

@OdedViner I was able to find the issue: RBAC is missing for cephnvmeofgateways. Please add those in common.yaml. I can share the common.yaml with you offline.

@OdedViner force-pushed the nmveof_new2 branch 4 times, most recently from 4e47848 to ff746ea on November 13, 2025 12:54
@@ -0,0 +1,67 @@
/*
Copyright 2018 The Rook Authors. All rights reserved.
Contributor

nit: 2025 ⌚

Contributor Author

done

Contributor Author @OdedViner Nov 30, 2025

When I run make codegen, the year is automatically changed to 2018 (Copyright 2018 The Rook Authors. All rights reserved.) in pkg/client/clientset/versioned/typed/ceph.rook.io/v1/cephnvmeofgateway.go.

Contributor Author

@jhoblitt
Contributor

I think a new nvmeof integration suite should probably be created instead of trying to add tests for this into one of the existing suites.

< /config/ceph-nvmeof.conf > /etc/ceph/nvmeof.conf
ceph nvme-gw create ${POD_NAME} %s ${ANA_GROUP}
ceph nvme-gw show %s ${ANA_GROUP}


I would add parsing of the return value (JSON) from ceph nvme-gw show %s ${ANA_GROUP} and validate that everything is OK.

}

script := fmt.Sprintf(`
set -xEeuo pipefail
Member

I'd suggest that we not inline this script in the Go code. Instead, I've found development and maintenance to be easier with separate scripts that are loaded into strings. We do this with the MDS liveness probe:

https://github.com/rook/rook/blob/master/pkg/operator/ceph/file/mds/livenessprobe.sh

https://github.com/rook/rook/blob/master/pkg/operator/ceph/file/mds/livenessprobe.go
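
A minimal sketch of that pattern using go:embed; the script file name here is hypothetical, and the .sh file would sit alongside the Go source:

package nvmeof

import (
    _ "embed"
    "fmt"
)

// The script lives next to the Go source as its own file, as with the MDS
// liveness probe linked above.
//
//go:embed gateway-setup.sh
var gatewaySetupScript string

// renderGatewaySetupScript fills in the pool name, mirroring the
// fmt.Sprintf call used by the inlined version.
func renderGatewaySetupScript(poolName string) string {
    return fmt.Sprintf(gatewaySetupScript, poolName)
}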


Comment on lines 211 to 306
[global]
mon_host = $(CEPH_MON_HOST)
log_to_stderr = true
keyring = /etc/ceph/keyring
EOF
Member

This was needed in the NFS script because NFS-Ganesha doesn't accept the --mon-host argument. If the commands we are running accept these args, we should use the args instead of the config file

-e "s/@@POD_IP@@/${POD_IP}/g" \
< /config/ceph-nvmeof.conf > /etc/ceph/nvmeof.conf
ceph nvme-gw create ${POD_NAME} %s ${ANA_GROUP}
Member @BlaineEXE Nov 20, 2025

We may need to be careful here when it comes to pod restarts. A k8s deployment will end up having many pod names (e.g., rook-ceph-nvme-gw-abc123) over its lifetime. Will this mean Ceph will register many unique gateway names for each pod restart? And would this mean we are responsible for cleanup of the old pod names/IDs?

If we use a more stable naming here, like deployment name, we will also need to make sure this is idempotent -- or we must make our script idempotent.

nvmeofGatewayPort = 5500
nvmeofMonitorPort = 5499
nvmeofDiscoveryPort = 8009
defaultNVMeOFImage = "quay.io/ceph/nvmeof:1.5"
Member

I'm curious about @travisn 's thoughts, but I would suggest not having a default here. If we keep a default, we will always be obligated to keep it up to date. If we require it as input, we can avoid a minor, repeated (and probably often overlooked) maintenance task

Contributor Author

I've added an optional image parameter to the NVMeOFGatewaySpec. If not specified, it defaults to quay.io/ceph/nvmeof:1.5. This allows users to override the image when needed while maintaining backward compatibility.

Comment on lines 227 to 323
ceph nvme-gw create ${POD_NAME} %s ${ANA_GROUP}
ceph nvme-gw show %s ${ANA_GROUP}
Member @BlaineEXE Nov 20, 2025

There is a neat bash trick we can do for arguments here that I think will help us.

Let's imagine this simplified script:

#!/usr/bin/env bash
set -xEeuo pipefail

# ARGS: all args are assumed to be 'ceph' CLI args

ceph "$@" nvme-gw <subcommand> <args>
ceph "$@" nvme-gw show %s ${ANA_GROUP}

We can then call the script like this:

bash script.sh --connect-timeout=90 --mon-host="${CEPH_MON_HOST}" --etc-etc-etc

When the script runs, it will use the mon args where "$@" exists in the script. This allows easy passthru, and we can use Rook's methods for setting CLI options for Ceph daemons in the init container to make our lives easier and to make sure we don't miss options.


@OdedViner force-pushed the nmveof_new2 branch 5 times, most recently from 271c1fc to 1689667 on November 24, 2025 09:34
@OdedViner (Contributor Author)

I think a new nvmeof integration suite should probably be created instead of trying to add tests for this into one of the existing suites.

@jhoblitt Done https://github.com/rook/rook/pull/16689/files#diff-e92d2b50640744c370e7d59b0510338412d324658ce89df677dbcf1105a4ae57

https://github.com/rook/rook/pull/16689/files#diff-99105bed8e71810279170b92d9b1480d330c081f15de6dd30a2e9dce6a0a027a

adds CephNVMeOFGateway crd for managing ceph nvme-of gateways.
supports configurable instances, placement, resources,
and networking options.

Signed-off-by: Oded Viner <oviner@redhat.com>
@OdedViner force-pushed the nmveof_new2 branch 3 times, most recently from 0fbbafe to e23bf35 on December 14, 2025 12:52
Member @travisn left a comment

My review mainly focused on getting the examples and documentation created to be clear how users can consume the feature.

// For example, quay.io/ceph/nvmeof:1.5
// If not specified, a default image will be used.
// +optional
Image string `json:"image,omitempty"`
Member

The nvmeof gateway image is separate from the ceph/ceph image? Until now, everything in ceph has been a single image, and we could get it from the cephcluster CR.

Contributor Author

This is an NVMe-oF image, not a Ceph image, based on:
https://github.com/ceph/ceph-csi/blob/devel/e2e/nvmeof/deployment.yaml#L32

@@ -0,0 +1,69 @@
# This example is for Ceph v19 and above only
Member

Doesn't this require Ceph v20?

Contributor Author

done

@@ -0,0 +1,69 @@
# This example is for Ceph v19 and above only
# The CephBlockPool must be created before the CephNVMeOFGateway
Member

The order with CRDs should not matter. If the controller does not find a needed pool, it can just fail the reconcile and retry until it finds the pool. So no need to mention this order.

# - ConfigMap (if configMapRef is not specified)
# - Service (one per gateway instance)
# - Deployment (one per gateway instance)
# The ServiceAccount "ceph-nvmeof-gateway" must exist in the namespace.
Member

Let's add the service account and needed rbac to common.yaml

apiVersion: ceph.rook.io/v1
kind: CephNVMeOFGateway
metadata:
name: my-nvmeof
Member

How about just removing the my prefix? In retrospect, we should not have included that in the other example names either.

Suggested change
name: my-nvmeof
name: nvmeof

Contributor Author

done

namespace: rook-ceph # namespace:cluster
spec:
# Pool name that will be used by the NVMe-oF gateway
pool: nvmeofpool
Member

I would suggest no "pool" suffix, since it is already of type pool, the name is redundant.

Suggested change
pool: nvmeofpool
pool: nvmeof

# Pool name that will be used by the NVMe-oF gateway
pool: nvmeofpool
# ANA (Asymmetric Namespace Access) group name
group: mygroup
Member

How about this example name instead of the "my" prefix?

Suggested change
group: mygroup
group: group-a

# This example is for Ceph v19 and above only
# The CephBlockPool must be created before the CephNVMeOFGateway
#
# Note: The operator will automatically create the following resources:
Member

Please create a document that describes how the user can use this feature. Some thoughts:

  • Create the doc: Documentation/Storage-Configuration/Block-Storage-RBD/nvme-of.md
  • The content will be similar as block-storage.md, to describe the scenario, show a simple yaml example, and describe how the user can enable the nvmeof csi driver, and do all the steps to configure the scenario.

With this document, reviewers can even better understand the end-to-end flow to better review the feature, and then the feature will be documented for users to try out.

Member

I would suggest creating two example yamls:

  • nvmeof.yaml: This is the production example. Basically, just rename this file, and change the pool replication to 3.
  • nvmeof-test.yaml: This will be the replica 1 pool example, and the nvmeof CR will be as simple as possible, no need to have all the comments with all possible settings.

Member

We also need an example storage class.

Member

Since this involves CSI as well, I liked Travis's comment from yesterday to add a deploy/examples/csi/nvmeof directory with examples similar to what we see in the other CSI dirs. We can focus on a single storageclass, pvc, and pod for now. More examples might just make things complicated at this early stage.

Member @BlaineEXE left a comment

The most common feedback item here (of which there are more than I flagged) is that in Rook it is almost always undesirable to use logger.Error() in a reconciliation loop. Only in rare cases do we actually do this in Rook. ReportReconcileResult() handles logging errors that are "bubbled up" via error return wrapping for nearly all cases.

In other projects that use a different logging framework, logger.Error() can be more common, but not in Rook. Please remove all instances unless you believe that it is imperative to log the error. If you do decide that it's imperative, please tag me and Travis on a comment around it, and we will most likely find a way to help avoid logging the error.

For most errors, I see errors.Wrapf() used, which is generally what I am looking for when reviewing error handling in Rook.

metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata"`
Spec NVMeOFGatewaySpec `json:"spec"`
// +kubebuilder:pruning:PreserveUnknownFields
Member

This is generally only needed to keep fields around when dealing with legacy types. I'd suggest getting rid of this here because it's a whole new resource.

Comment on lines +2792 to +2793
// +optional
Image string `json:"image,omitempty"`
Member

I would suggest making this a required input for users and setting // +kubebuilder:validation:MinLength=1

Comment on lines +2795 to +2796
// The number of active gateway instances
Instances int `json:"instances"`
Member

Current K8s API conventions suggest omitempty for all fields. Please add here and below.

They also recommend setting either // +required or // +optional for all fields. Please add one of those here and for fields below.

Optional fields are recommended to be pointers, except for maps and slices.
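
For reference, a sketch of those markers applied to a couple of fields; which fields end up required versus optional is the author's call, and hostNetwork is used here purely as an illustrative optional field:

package v1

// NVMeOFGatewaySpec sketch showing only the markers under discussion.
type NVMeOFGatewaySpec struct {
    // The number of active gateway instances
    // +required
    Instances int `json:"instances,omitempty"`

    // Optional fields are pointers per the conventions (maps and slices excepted).
    // +optional
    HostNetwork *bool `json:"hostNetwork,omitempty"`
}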

// NVMeOFConfig is a map of section names to key-value pairs for nvmeof.conf configuration
// This allows users to override or add configuration options without needing to manage a ConfigMap
// +optional
// +nullable
Member

// +nullable is actively discouraged by modern API conventions. Please remove any additions of nullable

const (
packageName = "ceph-nvmeof-gateway"
controllerName = packageName + "-controller"
CephNVMeOFGatewayNameLabelKey = "ceph_nvmeof_gateway"
Member

Is it possible to avoid adding a new label and instead rely on ceph_daemon_id added by CephDaemonAppLabels()?


configContent, err := getNVMeOFGatewayConfig(poolName, podName, podIP, anaGroup, nvmeof.Spec.NVMeOFConfig)
if err != nil {
logger.Errorf("failed to generate config: %v", err)
Member

No need to log an error when returning the error.

Comment on lines +495 to +497
for key, value := range options {
section.Key(key).SetValue(value)
}
Member

I would recommend also using section.Key(key).GetValue() here to determine if a value has already been specified. If a value is set already, let's log something like below, which can help us and users debug the code used to set defaults and apply user overrides:

logger.NamespacedDebugf("overriding section %q config %q value %q with user-specified value %q", section, key, currentValue, newValue)
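
A sketch of that check, assuming the gopkg.in/ini.v1 section type used in this file and Rook's package-level capnslog logger; the NamespacedDebugf call from the suggestion is paraphrased here with plain Debugf:

package nvmeof

import (
    "github.com/coreos/pkg/capnslog"
    "gopkg.in/ini.v1"
)

var logger = capnslog.NewPackageLogger("github.com/rook/rook", "ceph-nvmeof-gateway")

// applyOverrides sets user-supplied keys and logs whenever a user value
// replaces one the controller already set, to aid debugging of defaults.
func applyOverrides(section *ini.Section, options map[string]string) {
    for key, value := range options {
        k := section.Key(key)
        if current := k.Value(); current != "" && current != value {
            logger.Debugf("overriding section %q config %q value %q with user-specified value %q",
                section.Name(), key, current, value)
        }
        k.SetValue(value)
    }
}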

Comment on lines +4 to +9
cat << EOF > /etc/ceph/ceph.conf
[global]
mon_host = ${ROOK_CEPH_MON_HOST}
log_to_stderr = true
keyring = /etc/ceph/keyring
EOF
Member

Is this needed for the nvme-gw itself, or is this used for the ceph commands below?

If it's needed for the nvme-gw itself, please provide a comment that indicates why CLI args are insufficient.

If it's for the ceph commands below, I'm quite sure this can be omitted.

Comment on lines +419 to +430
gatewaySection.Key("port").SetValue("5500")
gatewaySection.Key("enable_auth").SetValue("False")
gatewaySection.Key("state_update_notify").SetValue("True")
gatewaySection.Key("state_update_timeout_in_msec").SetValue("2000")
gatewaySection.Key("state_update_interval_sec").SetValue("5")
gatewaySection.Key("enable_spdk_discovery_controller").SetValue("False")
gatewaySection.Key("encryption_key").SetValue("/etc/ceph/encryption.key")
gatewaySection.Key("rebalance_period_sec").SetValue("7")
gatewaySection.Key("max_gws_in_grp").SetValue("16")
gatewaySection.Key("max_ns_to_change_lb_grp").SetValue("8")
gatewaySection.Key("verify_listener_ip").SetValue("False")
gatewaySection.Key("enable_monitor_client").SetValue("True")
Member

I'm personally fine with establishing defaults here without using package consts to define them. However, I do think we should make comments that indicate why we believe Rook's defaults to be needed instead of relying on nvme-gw's own internal defaults.

For example, if enable_auth is False by default, I would prefer to either remove Rook's default, or add a comment that explains why Rook believes it's necessary to establish default control for the value.

For things like IP and port, I think it is obvious enough that we don't need comments to doc why Rook is modifying the value. For anything beyond this, I'd prefer to see a comment to help with code maintenance and future decision making around changes to these values.

In general, in Rook we tend to prefer to allow components to do their own defaulting, which trusts the component to change configurations when necessary for user benefit.

Hypothetically, if rebalance_period_sec = 7 is nvme-gw's default and nvme-gw devs determine that 7 seconds isn't enough to account for 99% tail latencies in most deployments, Rook will benefit when nvme-gw changes to rebalance_period_sec = 9, but Rook would lose this benefit if Rook establishes its own control.

In the inverse hypothetical, if rebalance_period_sec = 5 is nvme-gw's default and this isn't enough to account for 99% tail latencies in most Rook deployments where the k8s Pod network is adding some latency, it is totally appropriate for Rook to establish a more preferred default of rebalance_period_sec = 7. In this case, we should document in a comment that nvme-gw's default was too low for Rook users and why we chose 7 specifically as the new value for Rook users.

if err != nil {
return "", errors.Wrap(err, "failed to create gateway-logs section")
}
gatewayLogsSection.Key("log_level").SetValue("debug")
Member

Should we consider changing the default based on ROOK_LOG_LEVEL? Thoughts @travisn ?

Member

Let's revert it. @OdedViner while testing the feature, you can just set the log level to debug in the operator.yaml, no need to set it in code.
