-
Notifications
You must be signed in to change notification settings - Fork 2.8k
core: rotate client.admin cephx key #16271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
873cf1d to
c552777
Compare
d93e30d to
7cef0ab
Compare
3dcbc61 to
9b766f6
Compare
23be5b3 to
9a6ff5f
Compare
9a6ff5f to
30f51ee
Compare
|
This pull request has merge conflicts that must be resolved before it can be merged. @BlaineEXE please rebase it. https://rook.io/docs/rook/latest/Contributing/development-flow/#updating-your-fork |
8c2a1bb to
7bfb66d
Compare
7bfb66d to
5b161b5
Compare
pkg/operator/ceph/cluster/cephx.go
Outdated
| var failAt = "" // TESTING TESTING TESTING - global var to induce failures | ||
|
|
||
| func init() { | ||
| at, ok := os.LookupEnv("ROOK_DEV_ROTATE_ADMIN_KEY_FAIL_AT") | ||
| if !ok { | ||
| return | ||
| } | ||
| failAt = at | ||
| logger.Debugf("TESTING TESTING TESTING - set fail at: %s", failAt) | ||
| } | ||
|
|
||
| func failIf(s string) { // TESTING TESTING TESTING - global var to induce failures | ||
| if s == failAt { | ||
| logger.Debugf("TESTING TESTING TESTING - failing admin key rotation at %s", s) | ||
| failAt = "" // empty failAt so rotation passes next time | ||
| reloadManagerFunc() | ||
| time.Sleep(5 * time.Second) | ||
| } | ||
| logger.Debugf("TESTING TESTING TESTING - continuing admin key rotation at %s", s) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, this is how I "injected" failures in to key rotation for manual testing. Adding this comment so it is quick to add this back in if needed after final cleanup
5b161b5 to
fa68d24
Compare
pkg/operator/ceph/cluster/cephx.go
Outdated
| // through in the event of a CephCluster spec change, which can happen for any number of reasons | ||
| // during normal runtime. While this rotation is designed to be able to be recovered if | ||
| // interrupted, it is still safest to allow the rotation process to finish in full. | ||
| clusterInfoCopy := clusterInfo.ShallowCopy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of implementing a shallow copy method, would dereferencing the pointer have the desired effect?
| clusterInfoCopy := clusterInfo.ShallowCopy() | |
| clusterInfoCopy := *clusterInfo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I think it should be the same thing. Especially since all we want to do is change the context
I manually tested this change to make sure it really is equivalent, and it is working the same way. For future reference in case we want to test manually again, I just added this code block after the shallow copy and context replacement.
// TESTING - edit cephcluster to check context/shallow copy
clusterObj := cephv1.CephCluster{}
err = clusterdCtx.Client.Get(clusterInfo.Context, clusterInfo.NamespacedName(), &clusterObj)
if err != nil {
panic(err)
}
clusterObj.Spec.Mon.AllowMultiplePerNode = !clusterObj.Spec.Mon.AllowMultiplePerNode
err = clusterdCtx.Client.Update(clusterInfo.Context, &clusterObj)
if err != nil {
panic(err)
}
// TESTINGThis simulates the situation where a cluster spec change happens right as admin key rotation begins, which is approximately the worst possible case. I verified that the rotation routine continues to run, and I also saw from logs that the admin key rotation lock does its job of preventing the next reconcile's recovery routine from acting simultaneously.
fa68d24 to
ffdd24f
Compare
travisn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
| return nil | ||
| } | ||
|
|
||
| desiredCephVersion := clusterInfo.CephVersion // TODO: update this when/if WithCephVersionUpdate is implemented |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: no need for this comment? Seems like we won't implement this policy for the admin key
| desiredCephVersion := clusterInfo.CephVersion // TODO: update this when/if WithCephVersionUpdate is implemented | |
| desiredCephVersion := clusterInfo.CephVersion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment is still valid as long as the admin key is 'bundled' with other daemon keys. I know we talked about the posssibility of separating admin to its own config, but I would propose to make that decision later
pkg/operator/ceph/cluster/cephx.go
Outdated
| // as `client.admin-rotator`: run `ceph auth rotate client.admin` | ||
| logger.Infof("admin cephx key will be rotated for cluster in namespace %q. rook will restart afterwards. some reconciles and health checks may fail in between - this is normal", clusterInfo.Namespace) | ||
| newAdminKey, err := cephclient.AuthRotate(clusterdCtx, rotatorInfo, cephclient.AdminUsername) | ||
| // newAdminKey, err := cephclient.AuthGetKey(clusterdCtx, rotatorInfo, "client.admin") // TESTING TESTING TESTING |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, thought I got all those. Fixed
Implement CephX key rotation for Rook's client.admin user. Admin user rotation is risky, so this has been tested extensively both in unit tests as well as by manually injecting failures during runtime. In testing, all failures were able to be recovered by the recovery routine. A mutex is also added to help ensure that two simultaneous admin key rotation processes cannot be running simultaneously for any given namespace. The mutex is tested in unit tests, and it was verified during runtime via manual testing. Signed-off-by: Blaine Gardner <blaine.gardner@ibm.com>
ffdd24f to
7c33186
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
It is a piece of art, I learned a lot reviewing it:)
| // through in the event of a CephCluster spec change, which can happen for any number of reasons | ||
| // during normal runtime. While even the recovery is designed to be able to be recovered if | ||
| // interrupted, it is still safest to allow the rotation process to recover in full. | ||
| clusterInfoShallowCopy := *clusterInfo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How this make sure it has the minimalCopyClusterInfo of the admin rotator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @BlaineEXE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They serve different purposes.
This is a shallow copy of the whole struct with the purpose of switching the context, described by the long comment above.
The minimal copy primarily acts as a helper to modify the username and keyring file path. It's implemented as the smallest copy that can still run the ceph client commands, so it omits a lot of fields - thus "minimal". I deliberately avoided a shallow copy to guarantee that the commands being run as different users wouldn't affect the original context.
core: rotate client.admin cephx key (backport #16271)
Implement CephX key rotation for Rook's client.admin user.
Admin user rotation is risky, so this has been tested extensively both
in unit tests as well as by manually injecting failures during runtime.
In testing, all failures were able to be recovered by the recovery
routine.
A mutex is also added to help ensure that two simultaneous admin key
rotation processes cannot be running simultaneously for any given
namespace. The mutex is tested in unit tests, and it was verified during
runtime via manual testing.
TODO:
Checklist: