
[FG:InPlacePodVerticalScaling] Pod CPU limit is not configured to cgroups as calculated if systemd cgroup driver is used #129357

@hshiina


What happened?

As a result of #124216, which was introduced in v1.32, a pod CPU limit calculated in ResourceConfigForPod() is rounded up to the nearest 10ms by libcontainer when the pod is resized:

  • Resize a pod:
    $ kubectl patch pod resize-pod --subresource=resize --patch '{"spec":{"containers":[{"name":"resize-container", "resources":{"limits":{"cpu":"417m"}}}]}}'
    pod/resize-pod patched
    
  • The container cgroup value is set with 1ms precision:
    $ kubectl exec resize-pod -- cat /sys/fs/cgroup/cpu.max
    41700 100000
    
  • The pod cgroup value is rounded up:
    $ cat /sys/fs/cgroup/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-burstable.slice/kubelet-kubepods-burstable-pod68a17b59_0d31_40b2_ba86_ea43f3b2f05c.slice/cpu.max
    42000 100000
    
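For reference, the container-level value above matches the usual millicpu-to-quota arithmetic. The following is a minimal standalone sketch of that calculation (not the actual kubelet code; the helper in pkg/kubelet/cm/helpers_linux.go is the authoritative version):

package main

import "fmt"

const (
	quotaPeriod    = 100000 // default CFS period in microseconds (100ms)
	minQuotaPeriod = 1000   // smallest quota the kubelet will set
)

// milliCPUToQuota converts a milli-CPU limit into a CFS quota for the given
// period, mirroring the kubelet's conversion (sketch only).
func milliCPUToQuota(milliCPU, period int64) int64 {
	if milliCPU == 0 {
		return 0
	}
	quota := (milliCPU * period) / 1000
	if quota < minQuotaPeriod {
		quota = minQuotaPeriod
	}
	return quota
}

func main() {
	// A 417m CPU limit yields a 41700us quota per 100000us period,
	// which is what the container cgroup shows above ("41700 100000").
	fmt.Println(milliCPUToQuota(417, quotaPeriod)) // 41700
}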

When the systemd cgroup driver is used, libcontainer passes the CPU quota to systemd after rounding it up:

// systemd converts CPUQuotaPerSecUSec (microseconds per CPU second) to CPUQuota
// (integer percentage of CPU) internally. This means that if a fractional percent of
// CPU is indicated by Resources.CpuQuota, we need to round up to the nearest
// 10ms (1% of a second) such that child cgroups can set the cpu.cfs_quota_us they expect.
cpuQuotaPerSecUSec = uint64(quota*1000000) / period
if cpuQuotaPerSecUSec%10000 != 0 {
	cpuQuotaPerSecUSec = ((cpuQuotaPerSecUSec / 10000) + 1) * 10000
}
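
Applying the same conversion to the 417m example above makes the observed value concrete (my own arithmetic in a standalone sketch, not output from runc):

package main

import "fmt"

func main() {
	var quota, period uint64 = 41700, 100000 // values from the 417m example above

	// Same conversion and round-up as the runc snippet above.
	cpuQuotaPerSecUSec := quota * 1000000 / period // 417000
	if cpuQuotaPerSecUSec%10000 != 0 {
		cpuQuotaPerSecUSec = ((cpuQuotaPerSecUSec / 10000) + 1) * 10000 // 420000
	}

	// systemd applies CPUQuotaPerSecUSec per one second of wall time, which
	// shows up in the pod cgroup as 42000 per 100000us period.
	fmt.Println(cpuQuotaPerSecUSec, cpuQuotaPerSecUSec*period/1000000) // 420000 42000
}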

In addition, there seems to be a race in libcontainer: after passing the rounded value to systemd, it also writes the unrounded value directly to the cgroup file:

if err := setUnitProperties(m.dbus, getUnitName(m.cgroups), properties...); err != nil {
	return fmt.Errorf("unable to set unit properties: %w", err)
}
return m.fsMgr.Set(r)
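
To check which of the two writes ended up in effect for a given pod, I read the pod's cpu.max and compared the quota against a 1% (10ms) boundary. A small helper along these lines illustrates the check (a sketch only; the cgroup path is the one from the example above and differs per pod, and a quota of "max" is not handled):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Hypothetical pod cgroup path, copied from the example above; adjust per pod.
	path := "/sys/fs/cgroup/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-burstable.slice/kubelet-kubepods-burstable-pod68a17b59_0d31_40b2_ba86_ea43f3b2f05c.slice/cpu.max"

	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// cpu.max has the form "<quota> <period>", e.g. "42000 100000".
	fields := strings.Fields(string(data))
	if len(fields) != 2 {
		fmt.Fprintln(os.Stderr, "unexpected cpu.max contents:", string(data))
		os.Exit(1)
	}
	quota, _ := strconv.ParseUint(fields[0], 10, 64)
	period, _ := strconv.ParseUint(fields[1], 10, 64)

	perSecUSec := quota * 1000000 / period
	if perSecUSec%10000 == 0 {
		fmt.Println("quota is a whole percentage of a CPU; systemd's rounded value is in effect")
	} else {
		fmt.Println("quota has sub-1% precision; the direct cgroupfs write is in effect")
	}
}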

So, there are also cases where the cgroup value is set exactly as calculated. In my testing, decreasing the CPU limit usually hits this case, though I'm not sure why:

  • Decrease the CPU limits:

    $ kubectl patch pod resize-pod --subresource=resize --patch '{"spec":{"containers":[{"name":"resize-container", "resources":{"limits":{"cpu":"365m"}}}]}}'
    pod/resize-pod patched
    
  • Both the container and the pod cgroup values are set with 1ms precision:

    $ kubectl exec resize-pod -- cat /sys/fs/cgroup/cpu.max
    36500 100000
    $ cat /sys/fs/cgroup/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-burstable.slice/kubelet-kubepods-burstable-pod68a17b59_0d31_40b2_ba86_ea43f3b2f05c.slice/cpu.max
    36500 100000
    

What did you expect to happen?

This round-up looks like the intended behavior of the systemd cgroup driver, because the CPU quota is also rounded up when a pod is initially created with a 1ms-precision CPU limit. However, I have the following concerns:

How can we reproduce it (as minimally and precisely as possible)?

  1. Use the systemd cgroup driver and enable InPlacePodVerticalScaling.
  2. Resize a pod's CPU limits to a value with 1ms precision.

Anything else we need to know?

No response

Kubernetes version

v1.32

$ kubectl version
Client Version: v1.31.4
Kustomize Version: v5.4.2
Server Version: v1.32.0

Cloud provider

N/A

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Labels

kind/bug, sig/node, triage/accepted
