Member

@sohankunkerkar sohankunkerkar commented Jun 30, 2025

After host reboot, network namespaces are destroyed but CRI-O attempts to clean them up during pod sandbox destruction, causing CNI plugin failures and preventing pods from restarting properly. The fix ensures pods can restart normally after host reboots.

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Link to journal logs

Does this PR introduce a user-facing change?

Handle missing network namespace gracefully during networkStop

@sohankunkerkar sohankunkerkar requested a review from mrunalp as a code owner June 30, 2025 13:36
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 30, 2025
@openshift-ci-robot openshift-ci-robot added the jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. label Jun 30, 2025
@openshift-ci openshift-ci bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Jun 30, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 30, 2025
@openshift-ci-robot

@sohankunkerkar: This pull request references Jira Issue OCPBUGS-58229, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 30, 2025
@openshift-ci openshift-ci bot added kind/bug Categorizes issue or PR as related to a bug. dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jun 30, 2025
@openshift-ci openshift-ci bot requested review from littlejawa and QiWang19 June 30, 2025 13:36
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 30, 2025
@danwinship
Contributor

After a reboot there would be no iptables/nftables hostport rules anyway, so I don't think this is the problem. (And I don't see anything about hostports in the linked bug, though I didn't pull the full logs from the Google drive link...)

@openshift-ci openshift-ci bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jun 30, 2025
@sohankunkerkar sohankunkerkar force-pushed the fix-cni-issue branch 2 times, most recently from fa257cc to 56a1b4e Compare June 30, 2025 14:18

codecov bot commented Jun 30, 2025

Codecov Report

Attention: Patch coverage is 61.53846% with 15 lines in your changes missing coverage. Please review.

Project coverage is 66.89%. Comparing base (2edb23f) to head (faec3f5).
Report is 19 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9301      +/-   ##
==========================================
- Coverage   66.92%   66.89%   -0.03%     
==========================================
  Files         198      199       +1     
  Lines       27306    27456     +150     
==========================================
+ Hits        18274    18367      +93     
- Misses       7525     7568      +43     
- Partials     1507     1521      +14     

@sohankunkerkar sohankunkerkar force-pushed the fix-cni-issue branch 3 times, most recently from d0787db to 72cbd03 Compare July 1, 2025 14:42
@haircommander
Member

/approve

It's possible you could write an integration test by killing a pod, unmounting its netns, and restarting CRI-O. It may be annoying and not worth it for this specific case, but it's worth a shot, I think.

@sohankunkerkar sohankunkerkar changed the title [WIP] OCPBUGS-58229: server: handle missing network namespace gracefully during networkStop OCPBUGS-58229: server: handle missing network namespace gracefully during networkStop Jul 1, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 1, 2025
@sohankunkerkar sohankunkerkar force-pushed the fix-cni-issue branch 2 times, most recently from dbf2aa3 to bd9856e Compare July 1, 2025 18:14
@sohankunkerkar
Member Author

/retest

@sohankunkerkar sohankunkerkar force-pushed the fix-cni-issue branch 3 times, most recently from edbd7a7 to c0b6856 Compare July 10, 2025 13:03
After host reboot, network namespaces are destroyed but CRI-O attempts
to clean them up during pod sandbox destruction, causing CNI plugin
failures and preventing pods from restarting properly. The fix ensures
pods can restart normally after host reboots.

Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
…etns

Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
Kata VMs use real infra containers that persist in storage, unlike normal
containers that use spoofed infra containers. This fundamental architectural
difference means the 'Network recovery after reboot with destroyed netns'
test scenario doesn't apply to Kata VMs in the same way.

Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
@sohankunkerkar
Member Author

/test e2e-gcp-ovn

Member

@saschagrunert saschagrunert left a comment

Just a non-blocking nit, otherwise LGTM

return fmt.Errorf("invalid network namespace: %w", err)
}

defer netns.Close()
Member

I'd remove the defer here.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 11, 2025
Contributor

openshift-ci bot commented Jul 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, saschagrunert, sohankunkerkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [haircommander,saschagrunert,sohankunkerkar]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@saschagrunert
Member

/override ci/prow/ci-e2e-evented-pleg

Contributor

openshift-ci bot commented Jul 11, 2025

@saschagrunert: Overrode contexts on behalf of saschagrunert: ci/prow/ci-e2e-evented-pleg

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit 9cdf516 into cri-o:main Jul 11, 2025
85 of 89 checks passed
@openshift-ci-robot

@sohankunkerkar: Jira Issue OCPBUGS-58229: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-58229 has been moved to the MODIFIED state.

@sohankunkerkar
Member Author

/cherry-pick release-1.33

@openshift-cherrypick-robot

@sohankunkerkar: new pull request created: #9337

retErr := fmt.Errorf("failed to destroy network for pod sandbox %s(%s): %w", sb.Name(), sb.ID(), err)

// Check if the network namespace file exists and is valid before attempting CNI teardown.
// If the file doesn't exist or is invalid, skip CNI teardown and mark network as stopped.
Contributor

We had IP leaks on containerd because of a similar change that skipped calling the CNI network plugin in some cases, causing the CNI plugin not to release the IP address (containerd/containerd#12132).

I can't tell whether this is similar or not, but @danwinship, you should ensure that in this scenario the plugin still gets the CNI DEL so it can release any resources associated with the Pod.

Member Author

@aojea Thanks for calling that out. I can see the potential issue there. Let me go ahead and fix the issue.

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • dco-signoff: yes: Indicates the PR's author has DCO signed all their commits.
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lgtm: Indicates that a PR is ready to be merged.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
8 participants