Description
What happened?
As a result of #124216, which was introduced in v1.32, a pod CPU limit calculated in ResourceConfigForPod() is rounded up to the nearest 10ms in libcontainer when the pod is resized:
- Resize a pod:
$ kubectl patch pod resize-pod --subresource=resize --patch '{"spec":{"containers":[{"name":"resize-container", "resources":{"limits":{"cpu":"417m"}}}]}}'
pod/resize-pod patched
- The container cgroup value is set with 1ms precision:
$ kubectl exec resize-pod -- cat /sys/fs/cgroup/cpu.max
41700 100000
- The pod cgroup value is rounded up:
$ cat /sys/fs/cgroup/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-burstable.slice/kubelet-kubepods-burstable-pod68a17b59_0d31_40b2_ba86_ea43f3b2f05c.slice/cpu.max
42000 100000
When the systemd cgroup driver is used, libcontainer rounds the CPU quota up before passing it to systemd:
kubernetes/vendor/github.com/opencontainers/runc/libcontainer/cgroups/systemd/common.go
Lines 304 to 311 in a4b8a3b
// systemd converts CPUQuotaPerSecUSec (microseconds per CPU second) to CPUQuota
// (integer percentage of CPU) internally. This means that if a fractional percent of
// CPU is indicated by Resources.CpuQuota, we need to round up to the nearest
// 10ms (1% of a second) such that child cgroups can set the cpu.cfs_quota_us they expect.
cpuQuotaPerSecUSec = uint64(quota*1000000) / period
if cpuQuotaPerSecUSec%10000 != 0 {
    cpuQuotaPerSecUSec = ((cpuQuotaPerSecUSec / 10000) + 1) * 10000
}
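To make the rounding concrete, here is a minimal Go sketch that applies the arithmetic from the excerpt above to the values in this report. The names `roundedQuotaPerSecUSec` and `effectiveQuota` are hypothetical and not part of runc, and the conversion back to a cpu.max quota assumes the rounded per-second value is simply scaled to the same period.

```go
package main

import "fmt"

// roundedQuotaPerSecUSec mirrors the arithmetic in the excerpt above: it
// scales quota/period to microseconds per CPU second and rounds up to the
// next multiple of 10000 (10ms, i.e. 1% of a second).
// Hypothetical name for illustration; not a runc function.
func roundedQuotaPerSecUSec(quota int64, period uint64) uint64 {
	perSec := uint64(quota*1000000) / period
	if perSec%10000 != 0 {
		perSec = ((perSec / 10000) + 1) * 10000
	}
	return perSec
}

// effectiveQuota converts a per-second value back to a quota for the given
// period. Assumption: this is the quota that appears in cpu.max when the
// rounded value is the one that takes effect.
func effectiveQuota(perSecUSec, period uint64) uint64 {
	return perSecUSec * period / 1000000
}

func main() {
	const period = 100000
	for _, quota := range []int64{41700, 36500, 42000} {
		perSec := roundedQuotaPerSecUSec(quota, period)
		fmt.Printf("quota=%d -> CPUQuotaPerSecUSec=%d -> cpu.max quota=%d\n",
			quota, perSec, effectiveQuota(perSec, period))
	}
}
```

For the 417m resize this gives 42000, which matches the observed pod cgroup value. For 365m the rounded value would be 37000, whereas the pod cgroup in the second experiment shows the unrounded 36500.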
In addition, there seems to be a race in libcontainer: after passing the rounded value to systemd, it writes the unrounded value directly to the cgroup file:
kubernetes/vendor/github.com/opencontainers/runc/libcontainer/cgroups/systemd/v2.go
Lines 489 to 493 in a4b8a3b
if err := setUnitProperties(m.dbus, getUnitName(m.cgroups), properties...); err != nil {
    return fmt.Errorf("unable to set unit properties: %w", err)
}
return m.fsMgr.Set(r)
So there are also cases where the cgroup value is set exactly as calculated. In my tests, decreasing CPU limits usually hits this case, though I'm not sure why:
- Decrease the CPU limits:
$ kubectl patch pod resize-pod --subresource=resize --patch '{"spec":{"containers":[{"name":"resize-container", "resources":{"limits":{"cpu":"365m"}}}]}}'
pod/resize-pod patched
- Both the container and the pod cgroup values are set with 1ms precision:
$ kubectl exec resize-pod -- cat /sys/fs/cgroup/cpu.max
36500 100000
$ cat /sys/fs/cgroup/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-burstable.slice/kubelet-kubepods-burstable-pod68a17b59_0d31_40b2_ba86_ea43f3b2f05c.slice/cpu.max
36500 100000
What did you expect to happen?
This roundup looks like the intended behavior of the systemd cgroup driver, because the CPU quota is also rounded up when a pod is created with a 1ms-precision CPU limit. However, I have the following concerns:
- We might need to confirm that this small gap doesn't cause an issue similar to [FG:InPlacePodVerticalScaling] containers with a CPU limit below 10m have a resize status of InProgress indefinetly #128769 when resizing pods.
- We might need to clarify why the CPU quota of pod cgroup is sometimes not rounded up. This is especially necessary to complete [FG:InPlacePodVerticalScaling] e2e node: Verify pod cgroups in resize test #127192, which is going to add pod cgroup verification to resize tests.
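Until the second point is clarified, a pod cgroup check would have to tolerate both outcomes. The following Go sketch shows one way such a check could look; it is not the e2e test's actual code. All function names are hypothetical, the millicore-to-quota conversion (`milliCPU * period / 1000`) is an assumption about how the limit maps to a quota, and any minimum-quota clamping is deliberately ignored.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expectedQuotas returns the two quota values a pod cgroup might show for a
// CPU limit given in millicores: the exact value and the value rounded up to
// a whole percent of a CPU. Assumed conversion; minimum quota is ignored.
func expectedQuotas(milliCPU, period int64) (exact, rounded int64) {
	exact = milliCPU * period / 1000
	perSec := exact * 1000000 / period
	if perSec%10000 != 0 {
		perSec = (perSec/10000 + 1) * 10000
	}
	rounded = perSec * period / 1000000
	return exact, rounded
}

// parseCPUMax parses cgroup v2 cpu.max content such as "42000 100000".
// A quota of "max" (unlimited) is reported as an error in this sketch.
func parseCPUMax(s string) (quota, period int64, err error) {
	fields := strings.Fields(s)
	if len(fields) != 2 {
		return 0, 0, fmt.Errorf("unexpected cpu.max content %q", s)
	}
	if quota, err = strconv.ParseInt(fields[0], 10, 64); err != nil {
		return 0, 0, err
	}
	if period, err = strconv.ParseInt(fields[1], 10, 64); err != nil {
		return 0, 0, err
	}
	if period <= 0 {
		return 0, 0, fmt.Errorf("invalid period %d", period)
	}
	return quota, period, nil
}

// matchesLimit reports whether the cpu.max content is consistent with the
// limit, accepting either the exact or the rounded-up quota.
func matchesLimit(cpuMax string, milliCPU int64) (bool, error) {
	quota, period, err := parseCPUMax(cpuMax)
	if err != nil {
		return false, err
	}
	exact, rounded := expectedQuotas(milliCPU, period)
	return quota == exact || quota == rounded, nil
}

func main() {
	cases := []struct {
		cpuMax   string
		milliCPU int64
	}{
		{"42000 100000", 417},
		{"36500 100000", 365},
	}
	for _, c := range cases {
		ok, err := matchesLimit(c.cpuMax, c.milliCPU)
		fmt.Println(c.cpuMax, c.milliCPU, ok, err)
	}
}
```

Both observations from this report pass the check: 42000 is the rounded value for 417m, and 36500 is the exact value for 365m.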
How can we reproduce it (as minimally and precisely as possible)?
- Use the systemd cgroup driver and enable InPlacePodVerticalScaling.
- Resize the CPU limits of a pod with 1ms precision.
Anything else we need to know?
No response
Kubernetes version
v1.32
$ kubectl version
Client Version: v1.31.4
Kustomize Version: v5.4.2
Server Version: v1.32.0
Cloud provider
N/A
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)