Skip to content

Conversation

@BlaineEXE
Copy link
Member

@BlaineEXE BlaineEXE commented Aug 7, 2025

Implement CephX key rotation for Rook's client.admin user.

Admin user rotation is risky, so this has been tested extensively both
in unit tests as well as by manually injecting failures during runtime.
In testing, all failures were able to be recovered by the recovery
routine.

A mutex is also added to help ensure that two simultaneous admin key
rotation processes cannot be running simultaneously for any given
namespace. The mutex is tested in unit tests, and it was verified during
runtime via manual testing.

TODO:

  • admin key status initialization
  • mutex for admin rotation
  • test mutex IRL
  • test context cancellation optimization
  • manually test fail-at locations
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@BlaineEXE BlaineEXE force-pushed the admin-key-rotation branch 2 times, most recently from 873cf1d to c552777 Compare August 11, 2025 20:51
@BlaineEXE BlaineEXE force-pushed the admin-key-rotation branch 3 times, most recently from d93e30d to 7cef0ab Compare August 13, 2025 19:40
@BlaineEXE BlaineEXE force-pushed the admin-key-rotation branch 2 times, most recently from 3dcbc61 to 9b766f6 Compare August 13, 2025 22:49
@BlaineEXE BlaineEXE force-pushed the admin-key-rotation branch 2 times, most recently from 23be5b3 to 9a6ff5f Compare August 14, 2025 21:07
@mergify
Copy link

mergify bot commented Aug 15, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @BlaineEXE please rebase it. https://rook.io/docs/rook/latest/Contributing/development-flow/#updating-your-fork

@BlaineEXE BlaineEXE force-pushed the admin-key-rotation branch 3 times, most recently from 8c2a1bb to 7bfb66d Compare August 20, 2025 21:49
@BlaineEXE BlaineEXE changed the title NOT READY FOR REVIEW: core: rotate client.admin cephx key core: rotate client.admin cephx key Aug 20, 2025
@BlaineEXE BlaineEXE requested a review from sp98 August 20, 2025 22:08
Comment on lines 112 to 131
var failAt = "" // TESTING TESTING TESTING - global var to induce failures

func init() {
at, ok := os.LookupEnv("ROOK_DEV_ROTATE_ADMIN_KEY_FAIL_AT")
if !ok {
return
}
failAt = at
logger.Debugf("TESTING TESTING TESTING - set fail at: %s", failAt)
}

func failIf(s string) { // TESTING TESTING TESTING - global var to induce failures
if s == failAt {
logger.Debugf("TESTING TESTING TESTING - failing admin key rotation at %s", s)
failAt = "" // empty failAt so rotation passes next time
reloadManagerFunc()
time.Sleep(5 * time.Second)
}
logger.Debugf("TESTING TESTING TESTING - continuing admin key rotation at %s", s)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, this is how I "injected" failures in to key rotation for manual testing. Adding this comment so it is quick to add this back in if needed after final cleanup

@BlaineEXE BlaineEXE added cephx-rotation Feature required for cephx key rotation feature backport-release-1.18 labels Aug 20, 2025
@BlaineEXE BlaineEXE marked this pull request as ready for review August 20, 2025 22:23
@BlaineEXE BlaineEXE requested a review from travisn August 26, 2025 19:07
// through in the event of a CephCluster spec change, which can happen for any number of reasons
// during normal runtime. While this rotation is designed to be able to be recovered if
// interrupted, it is still safest to allow the rotation process to finish in full.
clusterInfoCopy := clusterInfo.ShallowCopy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of implementing a shallow copy method, would dereferencing the pointer have the desired effect?

Suggested change
clusterInfoCopy := clusterInfo.ShallowCopy()
clusterInfoCopy := *clusterInfo

Copy link
Member Author

@BlaineEXE BlaineEXE Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think it should be the same thing. Especially since all we want to do is change the context

I manually tested this change to make sure it really is equivalent, and it is working the same way. For future reference in case we want to test manually again, I just added this code block after the shallow copy and context replacement.

	// TESTING - edit cephcluster to check context/shallow copy
	clusterObj := cephv1.CephCluster{}
	err = clusterdCtx.Client.Get(clusterInfo.Context, clusterInfo.NamespacedName(), &clusterObj)
	if err != nil {
		panic(err)
	}
	clusterObj.Spec.Mon.AllowMultiplePerNode = !clusterObj.Spec.Mon.AllowMultiplePerNode
	err = clusterdCtx.Client.Update(clusterInfo.Context, &clusterObj)
	if err != nil {
		panic(err)
	}
	// TESTING

This simulates the situation where a cluster spec change happens right as admin key rotation begins, which is approximately the worst possible case. I verified that the rotation routine continues to run, and I also saw from logs that the admin key rotation lock does its job of preventing the next reconcile's recovery routine from acting simultaneously.

Copy link
Member

@travisn travisn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

return nil
}

desiredCephVersion := clusterInfo.CephVersion // TODO: update this when/if WithCephVersionUpdate is implemented
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need for this comment? Seems like we won't implement this policy for the admin key

Suggested change
desiredCephVersion := clusterInfo.CephVersion // TODO: update this when/if WithCephVersionUpdate is implemented
desiredCephVersion := clusterInfo.CephVersion

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is still valid as long as the admin key is 'bundled' with other daemon keys. I know we talked about the posssibility of separating admin to its own config, but I would propose to make that decision later

// as `client.admin-rotator`: run `ceph auth rotate client.admin`
logger.Infof("admin cephx key will be rotated for cluster in namespace %q. rook will restart afterwards. some reconciles and health checks may fail in between - this is normal", clusterInfo.Namespace)
newAdminKey, err := cephclient.AuthRotate(clusterdCtx, rotatorInfo, cephclient.AdminUsername)
// newAdminKey, err := cephclient.AuthGetKey(clusterdCtx, rotatorInfo, "client.admin") // TESTING TESTING TESTING
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done testing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, thought I got all those. Fixed

Implement CephX key rotation for Rook's client.admin user.

Admin user rotation is risky, so this has been tested extensively both
in unit tests as well as by manually injecting failures during runtime.
In testing, all failures were able to be recovered by the recovery
routine.

A mutex is also added to help ensure that two simultaneous admin key
rotation processes cannot be running simultaneously for any given
namespace. The mutex is tested in unit tests, and it was verified during
runtime via  manual testing.

Signed-off-by: Blaine Gardner <blaine.gardner@ibm.com>
@BlaineEXE BlaineEXE merged commit b162873 into rook:master Aug 27, 2025
53 of 55 checks passed
@BlaineEXE BlaineEXE deleted the admin-key-rotation branch August 27, 2025 21:20
Copy link
Member

@parth-gr parth-gr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!
It is a piece of art, I learned a lot reviewing it:)

// through in the event of a CephCluster spec change, which can happen for any number of reasons
// during normal runtime. While even the recovery is designed to be able to be recovered if
// interrupted, it is still safest to allow the rotation process to recover in full.
clusterInfoShallowCopy := *clusterInfo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How this make sure it has the minimalCopyClusterInfo of the admin rotator?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They serve different purposes.

This is a shallow copy of the whole struct with the purpose of switching the context, described by the long comment above.

The minimal copy primarily acts as a helper to modify the username and keyring file path. It's implemented as the smallest copy that can still run the ceph client commands, so it omits a lot of fields - thus "minimal". I deliberately avoided a shallow copy to guarantee that the commands being run as different users wouldn't affect the original context.

BlaineEXE added a commit that referenced this pull request Sep 2, 2025
core: rotate client.admin cephx key (backport #16271)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-release-1.18 cephx-rotation Feature required for cephx key rotation feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants