fix: rolling restart admission webhooks after helm upgrade #4396
|
/assign @shinytang6 |
|
You mean you're doing helm upgrade at the same time as you deploy vcjob? |
|
I meant every time you do a |
OK it's reasonable, please fix the code verify CI, you can use |
|
/cc |
|
@JesseStutler can we have this approved and merged? |
|
@junzebao Please squash your commits into one, and then we can merge it |
Signed-off-by: Junze Bao <junze@poolside.ai>
4af5db4 to da003ee
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hwdef
The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/lgtm |
What type of PR is this?
fix
What this PR does / why we need it:
When we're creating VolcanoJobs, we've seen errors like below
I think the reason is as follows:
Every time the volcano helm chart gets upgraded, it generates a new TLS certificate and the secret is updated with the new certificate. Say we're running the admission webhook with 5 replicas and at t0 they're started with the initial certificate c0; a helm upgrade at t1 would then result in a new certificate c1 in the secret. So any newly started pod (e.g., p4) fetches the new certificate c1 while the remaining pods p0-p3 are still running with the old certificate c0.
Long story short, I think the admission webhook deployment needs a restart after the certificate is updated. However, the certificate is only generated in the admission-init job, so I can't annotate the pods with the certificate checksum.
Special notes for your reviewer:
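To make the t0/t1 timeline above concrete, here is a minimal Python sketch of the suspected failure mode. It is only a toy model, not Volcano code: the Pod class and the start_pod and stale_pods helpers are hypothetical names, and it assumes (as the description suspects, but does not confirm) that each webhook pod reads the certificate once at startup and never reloads it.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    cert: str  # certificate the pod loaded from the secret when it started


def start_pod(name: str, secret_cert: str) -> Pod:
    """Assumption: a pod reads the secret once at startup and keeps that cert."""
    return Pod(name=name, cert=secret_cert)


def stale_pods(pods: list[Pod], secret_cert: str) -> list[str]:
    """Names of pods still holding a certificate that differs from the secret."""
    return [p.name for p in pods if p.cert != secret_cert]


# t0: five replicas start with the initial certificate c0.
secret = "c0"
pods = [start_pod(f"p{i}", secret) for i in range(5)]

# t1: helm upgrade regenerates the certificate; the secret now holds c1.
secret = "c1"

# Later, p4 is recreated and picks up c1; p0-p3 keep the old c0.
pods[4] = start_pod("p4", secret)
print(stale_pods(pods, secret))

# A rolling restart recreates every pod, so all of them load c1.
pods = [start_pod(p.name, secret) for p in pods]
print(stale_pods(pods, secret))
```

In this model, p0-p3 are reported as stale after the upgrade, and nothing is stale once every pod has been recreated, which is the effect the rolling restart is meant to have.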
Does this PR introduce a user-facing change?