Skip to content

Conversation

@junzebao
Copy link
Contributor

What type of PR is this?

fix

What this PR does / why we need it:

When we're creating VolcanoJobs, we've seen errors like below

error occurred: failed calling webhook "mutatejob.volcano.sh": failed to call webhook: Post "https://volcano-admission-service.volcano-system.svc:443/jobs/mutate?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "volcano-admission-service.volcano-system.svc"

I think the reason is as follows:
Every time the volcano helm chart gets upgraded, it generates a new tls certificate and the secret is updated with the new certificate. When we're running the admission webhook with 5 replicas and let's say at t0 they're started with the initial certificate c0, a helm upgrade at t1 would result in a new certificate c1 in the secret. So any newly started pod (e.g., p4) fetches the new certificate c1 while the rest pods p0-p3 are still running with the old certificate c0.

Long story short I think the admission webhook deployment needs a restart after the certificate is updated, however the certificate is only generated in the admission-init job, so I can't annotate the pods with the certificate checksum.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@volcano-sh-bot volcano-sh-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jun 23, 2025
@junzebao
Copy link
Contributor Author

/assign @shinytang6

@JesseStutler
Copy link
Member

You mean you're doing helm upgrade at the same time as you deploy vcjob?

@junzebao
Copy link
Contributor Author

I meant every time you do a helm upgrade, the admission-init Job updates the secret that contains the webhook certificate, but the pods aren't restarted to use the new certificate, so when there is any interruption to one of the pod, the new pod will use the new certificate and thus having a different one from the rest.

@JesseStutler
Copy link
Member

I meant every time you do a helm upgrade, the admission-init Job updates the secret that contains the webhook certificate, but the pods aren't restarted to use the new certificate, so when there is any interruption to one of the pod, the new pod will use the new certificate and thus having a different one from the rest.

OK it's reasonable, please fix the code verify CI, you can use make update-development-yaml to upgrade all the development yaml files

@JesseStutler
Copy link
Member

/cc

@junzebao
Copy link
Contributor Author

junzebao commented Jul 2, 2025

@JesseStutler can we have this approved and merged?

@JesseStutler
Copy link
Member

@junzebao Please squash your commits into one, and then we can merge it

Signed-off-by: Junze Bao <junze@poolside.ai>
@junzebao junzebao force-pushed the restart-webhook-on-upgrade branch from 4af5db4 to da003ee Compare July 3, 2025 07:22
@hwdef
Copy link
Member

hwdef commented Jul 4, 2025

/approve

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hwdef

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 4, 2025
@Monokaix
Copy link
Member

Monokaix commented Jul 7, 2025

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 7, 2025
@volcano-sh-bot volcano-sh-bot merged commit 34b3ac4 into volcano-sh:master Jul 7, 2025
18 checks passed
@junzebao junzebao deleted the restart-webhook-on-upgrade branch July 8, 2025 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants