Skip to content

Conversation

sohankunkerkar
Copy link
Member

@sohankunkerkar sohankunkerkar commented Aug 19, 2025

Fixes an issue where empty network namespace paths cause CNI teardown failures, preventing pod deletion and creating stuck pods. This happens when infra containers die, leaving NetNsPath() returning empty strings that CNI plugins cannot handle.

The failure cascade: dead infra container → empty NetNS → CNI failure → StopPodSandbox failure → stuck pod → systemd cgroup conflicts → new pod creation failures. Observed in MicroShift production environments.

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

The network cleanup failure is blocking pod deletion, which is causing the systemd transaction conflicts.

kubelet E0819 16:05:51.010476 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" [...] network teardown failed [...] failed to Statfs \"\": no such file or directory"

Does this PR introduce a user-facing change?

server: Fix network cleanup failures when NetNS path is empty

@openshift-ci openshift-ci bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: no Indicates the PR's author has not DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. labels Aug 19, 2025
@openshift-ci openshift-ci bot requested review from klihub and QiWang19 August 19, 2025 16:47
@openshift-ci openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Aug 19, 2025
@sohankunkerkar sohankunkerkar marked this pull request as draft August 19, 2025 17:52
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 19, 2025
Copy link

codecov bot commented Aug 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.06%. Comparing base (81e69a5) to head (02cd675).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9410      +/-   ##
==========================================
- Coverage   67.24%   67.06%   -0.18%     
==========================================
  Files         202      202              
  Lines       28025    28043      +18     
==========================================
- Hits        18845    18807      -38     
- Misses       7602     7662      +60     
+ Partials     1578     1574       -4     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sohankunkerkar sohankunkerkar changed the title server: Fix network cleanup failures when NetNS path is empty [WIP] server: Fix network cleanup failures when NetNS path is empty Aug 19, 2025
@sohankunkerkar sohankunkerkar changed the title [WIP] server: Fix network cleanup failures when NetNS path is empty server: Fix network cleanup failures when NetNS path is empty Sep 15, 2025
@sohankunkerkar sohankunkerkar marked this pull request as ready for review September 15, 2025 13:09
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 15, 2025
@openshift-ci openshift-ci bot requested a review from hasan4791 September 15, 2025 13:09
Copy link
Member

@saschagrunert saschagrunert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM, do we need an additional integration test for that?

/hold

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Sep 15, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 15, 2025
@sohankunkerkar sohankunkerkar force-pushed the fix-empty-netns branch 3 times, most recently from 4ef1089 to 0a6a3d4 Compare September 16, 2025 01:23
@sohankunkerkar sohankunkerkar force-pushed the fix-empty-netns branch 2 times, most recently from ba9c580 to 1b98d6f Compare September 16, 2025 02:53
@asahay19
Copy link

I validated that upgrading CRI-O to the latest build resolves the issue with the help of this PR:

Steps Performed:

Output: The pod was successfully created in the test-ai namespace after reboot.
This confirms that updating to the latest CRI-O fixes the pod creation issue after node reboots.

@asahay19
Copy link

/lgtm

Copy link
Contributor

openshift-ci bot commented Sep 16, 2025

@asahay19: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sohankunkerkar
Copy link
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 16, 2025
@sohankunkerkar
Copy link
Member Author

CI is broken due to the known issues with ansible snippet and the container-selinux change.

@haircommander
Copy link
Member

/approve

LGTM, @cri-o/cri-o-maintainers PTAL

Copy link
Contributor

openshift-ci bot commented Sep 16, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, saschagrunert, sohankunkerkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [haircommander,saschagrunert,sohankunkerkar]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Fixes an issue where empty network namespace paths cause CNI teardown
failures, preventing pod deletion and creating stuck pods. This happens when
infra containers die, leaving NetNsPath() returning empty strings that CNI
plugins cannot handle.

The failure cascade: dead infra container → empty NetNS → CNI failure →
StopPodSandbox failure → stuck pod → systemd cgroup conflicts → new pod
creation failures. Observed in MicroShift environments.

Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes network cleanup failures that occur when the network namespace path is empty, which prevents pod deletion and can create stuck pods in production environments. The issue arises when infra containers die, leaving NetNsPath() returning empty strings that CNI plugins cannot handle properly.

  • Added early detection and handling for empty network namespace paths during network teardown
  • Modified test to expect network as already stopped when sandbox is stopped
  • Added comprehensive integration test for pod deletion with missing NetNS scenarios

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
server/sandbox_network.go Adds logic to handle empty NetNS paths gracefully during network cleanup
server/sandbox_stop_test.go Updates test expectations and removes unused import
test/network.bats Adds integration test for pod deletion when NetNS path is missing

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines 6 to 9
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"go.uber.org/mock/gomock"
types "k8s.io/cri-api/pkg/apis/runtime/v1"
)
Copy link

Copilot AI Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of the gomock import appears incomplete. The removed test case that used gomock.InOrder() and gomock.Any() suggests this import was necessary for proper mock testing. Consider whether this test removal reduces coverage for error handling scenarios.

Copilot uses AI. Check for mistakes.

log.Debugf(ctx, "CNI teardown failed due to missing/invalid network namespace for pod sandbox %s(%s): %v", sb.Name(), sb.ID(), err)

// Clean up CNI result files even when NetNS is invalid.
s.cleanupCNIResultFiles(ctx, sb.ID())
Copy link

Copilot AI Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanupCNIResultFiles method is called but there's no documentation or context about what this method does or whether it can fail. Consider adding a comment explaining its purpose and whether its failure should be handled.

Copilot uses AI. Check for mistakes.

@sohankunkerkar
Copy link
Member Author

/retest

@bitoku
Copy link
Contributor

bitoku commented Sep 18, 2025

/retest-required

@bitoku
Copy link
Contributor

bitoku commented Sep 18, 2025

/lgtm
/hold
@sohankunkerkar please unhold when you feel ok to merge

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 18, 2025
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 18, 2025
@sohankunkerkar
Copy link
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 18, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit c4f1653 into cri-o:main Sep 18, 2025
88 of 90 checks passed
@sohankunkerkar
Copy link
Member Author

/cherry-pick release-1.34

@sohankunkerkar
Copy link
Member Author

/cherry-pick release-1.33

@openshift-cherrypick-robot

@sohankunkerkar: new pull request created: #9471

In response to this:

/cherry-pick release-1.34

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-cherrypick-robot

@sohankunkerkar: new pull request created: #9472

In response to this:

/cherry-pick release-1.33

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sohankunkerkar sohankunkerkar changed the title server: Fix network cleanup failures when NetNS path is empty [OCPBUGS-58229]: server: Fix network cleanup failures when NetNS path is empty Sep 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants