Description
After a reboot, CRI-O tears down sandbox networking using the pid recorded before the reboot, risking pid reuse.
Steps to reproduce the issue:
Sadly, reproducing this is probabilistic. It should still be easy to fix, though.
- Reboot the node. Create some containers right away, so that they get low pid numbers
- Reboot the node again
- Kubelet starts tearing down sandboxes that were killed because of the reboot
- cri-o issues a CNI delete with /proc/$pid/ns/net, even though $pid is meaningless since the reboot.
Even without a pid collision, the logs clearly show a CNI DEL being issued for a stale pid. For example, from crio logs at level Info:
About to add CNI network lo (type=loopback)
Got pod network &{Name:alertmanager-main-1 Namespace:openshift-monitoring ID:... NetNS:/proc/8036/ns/net Networks:[] RuntimeConfig:map[]}
-- reboot --
About to del CNI network lo (type=loopback)
Error deleting network: failed to Statfs "/proc/8036/ns/net": no such file or directory
This clearly shows that it is looking for /proc/8036..., which happens not to be a running process. However, reboot enough times and you will eventually get unlucky: the path will point to a running pid (but not the one started by cri-o). We typically see this in about 1-in-10 reboots.
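One way to tell a reused pid from the original process is to record more than the bare pid. The Go sketch below is only an illustration and not CRI-O code: it assumes the runtime had persisted the kernel boot ID (from /proc/sys/kernel/random/boot_id) and the process start time (field 22 of /proc/<pid>/stat) when the sandbox was created. The `procIdentity` type, `parseStartTime`, and `sameProcess` are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// procIdentity is a hypothetical record persisted next to the pid.
type procIdentity struct {
	BootID    string // boot_id contents at the time the pid was recorded
	StartTime uint64 // starttime (field 22 of /proc/<pid>/stat) at that time
}

// sampleStat mimics the contents of /proc/8036/stat; the values are made up.
const sampleStat = "8036 (my (odd) name) S 1 8036 8036 0 -1 4194560 100 0 0 0 1 2 0 0 20 0 1 0 4242 1000"

// parseStartTime extracts the starttime field from a /proc/<pid>/stat line.
// The comm field may contain spaces and parentheses, so parsing starts
// after the last ')'.
func parseStartTime(stat string) (uint64, error) {
	i := strings.LastIndexByte(stat, ')')
	if i < 0 {
		return 0, fmt.Errorf("malformed stat line: no closing parenthesis")
	}
	fields := strings.Fields(stat[i+1:])
	// fields[0] is field 3 (state), so field 22 is at index 19.
	const idx = 19
	if len(fields) <= idx {
		return 0, fmt.Errorf("malformed stat line: only %d fields after comm", len(fields))
	}
	return strconv.ParseUint(fields[idx], 10, 64)
}

// sameProcess reports whether the process currently behind the pid is the
// one that was recorded: same boot and same start time.
func sameProcess(recorded procIdentity, currentBootID, currentStat string) bool {
	if strings.TrimSpace(currentBootID) != recorded.BootID {
		return false // the node rebooted, so the recorded pid is meaningless
	}
	st, err := parseStartTime(currentStat)
	return err == nil && st == recorded.StartTime
}

func main() {
	rec := procIdentity{BootID: "boot-a", StartTime: 4242}
	fmt.Println(sameProcess(rec, "boot-a\n", sampleStat))
	fmt.Println(sameProcess(rec, "boot-b\n", sampleStat))
}
```

The boot ID check comes first because start times are counted from boot, so a start time alone could coincide across reboots.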
Describe the results you received:
We got a CNI Delete with the netns of /proc/<pid>/ns/net, which is correct, except that the node was rebooted in the meantime, and /proc/<pid>/ns/net pointed to the root netns.
Describe the results you expected:
The CNI delete should be issued with an empty netns parameter, which signals to the plugins that the namespace is gone and that only bookkeeping operations (e.g. IPAM cleanup) are to be done. CRI-O should only pass the netns parameter if it points to a known-good, crio-created process that is still running.
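The expected behaviour can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not CRI-O's implementation: `sandboxRecord`, `netnsForDelete`, and the `stillOurs` callback are hypothetical names, and the sketch assumes a boot ID was stored with the sandbox so that a reboot can be detected.

```go
package main

import "fmt"

// sandboxRecord is a hypothetical view of what the runtime persisted for a sandbox.
type sandboxRecord struct {
	NetNSPath string // e.g. /proc/<pid>/ns/net, as recorded at creation time
	BootID    string // boot ID of the node when the sandbox was created (assumed to be stored)
}

// netnsForDelete returns the netns path to pass to CNI DEL, or "" when the
// namespace can no longer be trusted, so that plugins only do bookkeeping
// such as IPAM cleanup.
func netnsForDelete(rec sandboxRecord, currentBootID string, stillOurs func(path string) bool) string {
	if rec.NetNSPath == "" || rec.BootID != currentBootID {
		return "" // rebooted since creation: the pid behind the path is stale
	}
	if !stillOurs(rec.NetNSPath) {
		return "" // process is gone or is not the one the runtime created
	}
	return rec.NetNSPath
}

func main() {
	rec := sandboxRecord{NetNSPath: "/proc/8036/ns/net", BootID: "boot-a"}
	alive := func(string) bool { return true }

	// Same boot and process verified: the path is passed through.
	fmt.Printf("%q\n", netnsForDelete(rec, "boot-a", alive))
	// Node rebooted: an empty netns is passed instead.
	fmt.Printf("%q\n", netnsForDelete(rec, "boot-b", alive))
}
```

Keeping the liveness check behind a callback leaves open how "known-good" is established, for example by comparing process start times.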
Output of crio --version:
crio version 1.14.10-0.19.dev.rhaos4.2.gita86dae7.el8