-
Notifications
You must be signed in to change notification settings - Fork 1.1k
OCPBUGS-58229: server: handle missing network namespace gracefully during networkStop #9301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@sohankunkerkar: This pull request references Jira Issue OCPBUGS-58229, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
After a reboot there would be no iptables/nftables hostport rules anyway, so I don't think this is the problem. (And I don't see anything about hostports in the linked bug, though I didn't pull the full logs from the Google drive link...) |
b281307
to
7b24a96
Compare
fa257cc
to
56a1b4e
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #9301 +/- ##
==========================================
- Coverage 66.92% 66.89% -0.03%
==========================================
Files 198 199 +1
Lines 27306 27456 +150
==========================================
+ Hits 18274 18367 +93
- Misses 7525 7568 +43
- Partials 1507 1521 +14 🚀 New features to boost your workflow:
|
56a1b4e
to
3ef3837
Compare
d0787db
to
72cbd03
Compare
/approve it's possible you'd be able to make a integration test by killing a pod and unmounting its netns, and restarting crio. may be annoying and not worth the specific case, but worth a shot I think |
72cbd03
to
95f7c65
Compare
dbf2aa3
to
bd9856e
Compare
f18dd4f
to
b1f23b2
Compare
/retest |
edbd7a7
to
c0b6856
Compare
After host reboot, network namespaces are destroyed but CRI-O attempts to clean them up during pod sandbox destruction, causing CNI plugin failures and preventing pods from restarting properly. The fix ensures pods can restart normally after host reboots. Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
…etns Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
Kata VMs use real infra containers that persist in storage, unlike normal containers that use spoofed infra containers. This fundamental architectural difference means the 'Network recovery after reboot with destroyed netns' test scenario doesn't apply to Kata VMs in the same way. Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
c0b6856
to
faec3f5
Compare
/test e2e-gcp-ovn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a non-blocking nit, otherwise LGTM
return fmt.Errorf("invalid network namespace: %w", err) | ||
} | ||
|
||
defer netns.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd remove the defer
here.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: haircommander, saschagrunert, sohankunkerkar The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/override ci/prow/ci-e2e-evented-pleg |
@saschagrunert: Overrode contexts on behalf of saschagrunert: ci/prow/ci-e2e-evented-pleg In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@sohankunkerkar: Jira Issue OCPBUGS-58229: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-58229 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/cherry-pick release-1.33 |
@sohankunkerkar: new pull request created: #9337 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
retErr := fmt.Errorf("failed to destroy network for pod sandbox %s(%s): %w", sb.Name(), sb.ID(), err) | ||
|
||
// Check if the network namespace file exists and is valid before attempting CNI teardown. | ||
// If the file doesn't exist or is invalid, skip CNI teardown and mark network as stopped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had IP leaks on containerd because of a similar change that ignored calling the network CNI plugin on some cases, causing the CNI to no release the IP address containerd/containerd#12132
I can not see this is similar or not, but @danwinship you should ensure that in this scenario the plugin gets the CNI DEL to release any resource associated to the Pod
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aojea Thanks for calling that out. I can see the potential issue there. Let me go ahead and fix the issue.
After host reboot, network namespaces are destroyed but CRI-O attempts to clean them up during pod sandbox destruction, causing CNI plugin failures and preventing pods from restarting properly. The fix ensures pods can restart normally after host reboots.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Special notes for your reviewer:
Link to journal logs
Does this PR introduce a user-facing change?