
Conversation

@iholder101
Contributor

What type of PR is this?

/kind feature
/sig node

What this PR does / why we need it:

After this PR, libcontainer's cgroup manager is used to update pod resources instead of writing directly to cgroupfs.

Beyond fixing #113857 by preventing pod updates from being overridden when the systemd daemon reloads, it simplifies the code and delegates the heavy lifting to libcontainer's cgroup manager.
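
As a rough sketch of the change (illustrative, with simplified names — not an exact excerpt from this PR), resizing now goes through the pod container manager instead of a raw cgroupfs write:

```go
// Before (conceptually): a direct cgroupfs write. systemd does not know
// about it, so a `systemctl daemon-reload` can restore the unit's stale
// properties over it:
//
//	os.WriteFile("/sys/fs/cgroup/kubepods/pod<uid>/cpu.max",
//		[]byte("50000 100000"), 0o644)

// After: ask the pod container manager, which delegates to libcontainer's
// cgroup manager; on systemd hosts this also updates the transient unit,
// so a daemon reload no longer reverts the resize. Values are illustrative.
err := pcm.SetPodCgroupConfig(pod, v1.ResourceCPU, &cm.ResourceConfig{
	CPUQuota:  &cpuQuota,  // e.g. int64(50000)
	CPUPeriod: &cpuPeriod, // e.g. uint64(100000)
})
```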

Which issue(s) this PR fixes:

Fixes #113857

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix: Avoid overwriting in-place pod vertical scaling updates on systemd daemon reloads when using the systemd cgroup driver

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Apr 7, 2024
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.30 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.30.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Sun Apr 7 07:41:51 UTC 2024.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet labels Apr 7, 2024
@iholder101
Contributor Author

/test pull-kubernetes-conformance-kind-ga-only-parallel
/test pull-kubernetes-e2e-kind
/test pull-kubernetes-e2e-gce
/test pull-kubernetes-unit

/test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2

@iholder101
Contributor Author

/test pull-kubernetes-node-e2e-containerd

@iholder101
Contributor Author

/test ?

@k8s-ci-robot
Contributor

@iholder101: The following commands are available to trigger required jobs:

  • /test pull-cadvisor-e2e-kubernetes
  • /test pull-cos-containerd-e2e-ubuntu-gce
  • /test pull-kubernetes-conformance-kind-ga-only-parallel
  • /test pull-kubernetes-coverage-unit
  • /test pull-kubernetes-dependencies
  • /test pull-kubernetes-dependencies-go-canary
  • /test pull-kubernetes-e2e-gce
  • /test pull-kubernetes-e2e-gce-100-performance
  • /test pull-kubernetes-e2e-gce-big-performance
  • /test pull-kubernetes-e2e-gce-cos
  • /test pull-kubernetes-e2e-gce-cos-canary
  • /test pull-kubernetes-e2e-gce-cos-no-stage
  • /test pull-kubernetes-e2e-gce-network-proxy-http-connect
  • /test pull-kubernetes-e2e-gce-scale-performance-manual
  • /test pull-kubernetes-e2e-kind
  • /test pull-kubernetes-e2e-kind-ipv6
  • /test pull-kubernetes-integration
  • /test pull-kubernetes-integration-go-canary
  • /test pull-kubernetes-kubemark-e2e-gce-scale
  • /test pull-kubernetes-node-e2e-containerd
  • /test pull-kubernetes-typecheck
  • /test pull-kubernetes-unit
  • /test pull-kubernetes-unit-go-canary
  • /test pull-kubernetes-update
  • /test pull-kubernetes-verify
  • /test pull-kubernetes-verify-go-canary
  • /test pull-kubernetes-verify-lint

The following commands are available to trigger optional jobs:

  • /test check-dependency-stats
  • /test pull-ci-kubernetes-unit-windows
  • /test pull-crio-cgroupv1-node-e2e-eviction
  • /test pull-crio-cgroupv1-node-e2e-features
  • /test pull-crio-cgroupv1-node-e2e-hugepages
  • /test pull-crio-cgroupv1-node-e2e-resource-managers
  • /test pull-crio-cgroupv2-imagefs-separatedisktest
  • /test pull-crio-cgroupv2-node-e2e-eviction
  • /test pull-crio-cgroupv2-splitfs-splitdisk
  • /test pull-e2e-gce-cloud-provider-disabled
  • /test pull-kubernetes-conformance-image-test
  • /test pull-kubernetes-conformance-kind-ga-only
  • /test pull-kubernetes-conformance-kind-ipv6-parallel
  • /test pull-kubernetes-cos-cgroupv1-containerd-node-e2e
  • /test pull-kubernetes-cos-cgroupv1-containerd-node-e2e-features
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-eviction
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-features
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
  • /test pull-kubernetes-crio-node-memoryqos-cgrpv2
  • /test pull-kubernetes-cross
  • /test pull-kubernetes-e2e-autoscaling-hpa-cm
  • /test pull-kubernetes-e2e-autoscaling-hpa-cpu
  • /test pull-kubernetes-e2e-capz-azure-disk
  • /test pull-kubernetes-e2e-capz-azure-disk-vmss
  • /test pull-kubernetes-e2e-capz-azure-file
  • /test pull-kubernetes-e2e-capz-azure-file-vmss
  • /test pull-kubernetes-e2e-capz-conformance
  • /test pull-kubernetes-e2e-capz-windows-alpha-feature-vpa
  • /test pull-kubernetes-e2e-capz-windows-alpha-features
  • /test pull-kubernetes-e2e-capz-windows-master
  • /test pull-kubernetes-e2e-capz-windows-serial-slow
  • /test pull-kubernetes-e2e-capz-windows-serial-slow-hpa
  • /test pull-kubernetes-e2e-containerd-gce
  • /test pull-kubernetes-e2e-ec2
  • /test pull-kubernetes-e2e-ec2-arm64
  • /test pull-kubernetes-e2e-ec2-conformance
  • /test pull-kubernetes-e2e-ec2-conformance-arm64
  • /test pull-kubernetes-e2e-ec2-device-plugin-gpu
  • /test pull-kubernetes-e2e-gce-canary
  • /test pull-kubernetes-e2e-gce-correctness
  • /test pull-kubernetes-e2e-gce-cos-alpha-features
  • /test pull-kubernetes-e2e-gce-csi-serial
  • /test pull-kubernetes-e2e-gce-device-plugin-gpu
  • /test pull-kubernetes-e2e-gce-disruptive-canary
  • /test pull-kubernetes-e2e-gce-kubelet-credential-provider
  • /test pull-kubernetes-e2e-gce-network-policies
  • /test pull-kubernetes-e2e-gce-network-proxy-grpc
  • /test pull-kubernetes-e2e-gce-providerless
  • /test pull-kubernetes-e2e-gce-serial
  • /test pull-kubernetes-e2e-gce-serial-canary
  • /test pull-kubernetes-e2e-gce-storage-disruptive
  • /test pull-kubernetes-e2e-gce-storage-slow
  • /test pull-kubernetes-e2e-gce-storage-snapshot
  • /test pull-kubernetes-e2e-gci-gce-autoscaling
  • /test pull-kubernetes-e2e-gci-gce-ingress
  • /test pull-kubernetes-e2e-gci-gce-ipvs
  • /test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2
  • /test pull-kubernetes-e2e-kind-alpha-beta-features
  • /test pull-kubernetes-e2e-kind-alpha-features
  • /test pull-kubernetes-e2e-kind-beta-features
  • /test pull-kubernetes-e2e-kind-canary
  • /test pull-kubernetes-e2e-kind-cloud-provider-loadbalancer
  • /test pull-kubernetes-e2e-kind-dual-canary
  • /test pull-kubernetes-e2e-kind-evented-pleg
  • /test pull-kubernetes-e2e-kind-ipv6-canary
  • /test pull-kubernetes-e2e-kind-ipvs-dual-canary
  • /test pull-kubernetes-e2e-kind-kms
  • /test pull-kubernetes-e2e-kind-multizone
  • /test pull-kubernetes-e2e-kind-nftables
  • /test pull-kubernetes-e2e-storage-kind-alpha-features
  • /test pull-kubernetes-e2e-storage-kind-disruptive
  • /test pull-kubernetes-kind-dra
  • /test pull-kubernetes-kind-json-logging
  • /test pull-kubernetes-kind-text-logging
  • /test pull-kubernetes-kubemark-e2e-gce-big
  • /test pull-kubernetes-linter-hints
  • /test pull-kubernetes-local-e2e
  • /test pull-kubernetes-node-arm64-e2e-containerd-ec2
  • /test pull-kubernetes-node-arm64-e2e-containerd-serial-ec2
  • /test pull-kubernetes-node-arm64-ubuntu-serial-gce
  • /test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-e2e-kubetest2
  • /test pull-kubernetes-node-crio-cgrpv2-imagefs-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-splitfs-e2e
  • /test pull-kubernetes-node-crio-e2e
  • /test pull-kubernetes-node-crio-e2e-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-1-7-dra
  • /test pull-kubernetes-node-e2e-containerd-alpha-features
  • /test pull-kubernetes-node-e2e-containerd-ec2
  • /test pull-kubernetes-node-e2e-containerd-features
  • /test pull-kubernetes-node-e2e-containerd-features-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-serial-ec2
  • /test pull-kubernetes-node-e2e-containerd-standalone-mode
  • /test pull-kubernetes-node-e2e-containerd-standalone-mode-all-alpha
  • /test pull-kubernetes-node-e2e-crio-dra
  • /test pull-kubernetes-node-kubelet-containerd-flaky
  • /test pull-kubernetes-node-kubelet-credential-provider
  • /test pull-kubernetes-node-kubelet-serial-containerd
  • /test pull-kubernetes-node-kubelet-serial-containerd-alpha-features
  • /test pull-kubernetes-node-kubelet-serial-containerd-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-containerd-sidecar-containers
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
  • /test pull-kubernetes-node-kubelet-serial-hugepages
  • /test pull-kubernetes-node-kubelet-serial-memory-manager
  • /test pull-kubernetes-node-kubelet-serial-pod-disruption-conditions
  • /test pull-kubernetes-node-kubelet-serial-topology-manager
  • /test pull-kubernetes-node-kubelet-serial-topology-manager-kubetest2
  • /test pull-kubernetes-node-swap-conformance-fedora-serial
  • /test pull-kubernetes-node-swap-conformance-ubuntu-serial
  • /test pull-kubernetes-node-swap-fedora
  • /test pull-kubernetes-node-swap-fedora-serial
  • /test pull-kubernetes-node-swap-ubuntu-serial
  • /test pull-kubernetes-unit-experimental
  • /test pull-publishing-bot-validate

Use /test all to run the following jobs that were automatically triggered:

  • pull-kubernetes-conformance-kind-ga-only-parallel
  • pull-kubernetes-dependencies
  • pull-kubernetes-e2e-ec2
  • pull-kubernetes-e2e-ec2-conformance
  • pull-kubernetes-e2e-gce
  • pull-kubernetes-e2e-gce-canary
  • pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2
  • pull-kubernetes-e2e-kind
  • pull-kubernetes-e2e-kind-ipv6
  • pull-kubernetes-integration
  • pull-kubernetes-linter-hints
  • pull-kubernetes-node-e2e-containerd
  • pull-kubernetes-typecheck
  • pull-kubernetes-unit
  • pull-kubernetes-verify
  • pull-kubernetes-verify-lint

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@iholder101 iholder101 marked this pull request as ready for review April 7, 2024 09:00
@k8s-ci-robot k8s-ci-robot requested review from mrunalp and mtaufen April 7, 2024 09:00
@iholder101 iholder101 changed the title [WIP] [FG:InPlacePodVerticalScaling] Handle systemd cgroup driver by using libcontainer for updating pod cgroup values [FG:InPlacePodVerticalScaling] Handle systemd cgroup driver by using libcontainer for updating pod cgroup values Apr 7, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 7, 2024
@iholder101
Contributor Author

/cc @tallclair @vinaykul

@ffromani
Contributor

ffromani commented Apr 7, 2024

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 7, 2024
@iholder101 iholder101 force-pushed the in-pod-vertical-scaling/update-cgroup-systemd-instead-of-cgroupfs branch 3 times, most recently from f428c16 to 1622026 Compare September 25, 2024 10:08
@iholder101
Contributor Author

/test pull-kubernetes-node-e2e-containerd
/test pull-kubernetes-unit
/test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2

@iholder101 iholder101 marked this pull request as ready for review September 25, 2024 13:12
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 25, 2024
@iholder101
Contributor Author

Can we just update the calling code to use the CgroupManager.Update function directly instead?

Nice catch!
Done :)

@hshiina
Contributor

hshiina commented Sep 30, 2024

When I tried this fix locally, the CPU quota of the pod cgroup was updated to max when the CPU limits were resized.

It looks like libcontainer updates cpu.max if either CpuQuota or CpuPeriod is set in Resources.

So, we should set CPUPeriod in podCpuResources only when CPUQuota in the pod cgroup is updated:

```go
case v1.ResourceCPU:
	podCpuResources := &cm.ResourceConfig{CPUPeriod: podResources.CPUPeriod}
	if setLimitValue {
		podCpuResources.CPUQuota = podResources.CPUQuota
	} else {
		podCpuResources.CPUShares = podResources.CPUShares
	}
	err = pcm.SetPodCgroupConfig(pod, rName, podCpuResources)
```

Otherwise, CPUQuota will be updated to max when CPUShares is updated while resizing a pod cgroup.
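
For context, this behavior comes from quota and period sharing the single cpu.max file on cgroup v2. The sketch below paraphrases runc's fs2 CPU handling (names and error handling simplified, not a verbatim excerpt): a write carrying a period but no quota rewrites the quota to "max".

```go
package fs2sketch

import (
	"fmt"
	"strconv"

	"github.com/opencontainers/runc/libcontainer/cgroups"
)

// setCPUMax paraphrases libcontainer's cgroup v2 CPU update path.
func setCPUMax(dirPath string, quota int64, period uint64) error {
	if quota == 0 && period == 0 {
		return nil // neither field set: leave cpu.max alone
	}
	str := "max" // no positive quota means "unlimited"
	if quota > 0 {
		str = strconv.FormatInt(quota, 10)
	}
	if period == 0 {
		period = 100000 // kernel default period (100ms)
	}
	// cpu.max holds "$QUOTA $PERIOD" in one file, so a write triggered
	// only by the period also rewrites the quota, resetting it to "max".
	return cgroups.WriteFile(dirPath, "cpu.max", fmt.Sprintf("%s %d", str, period))
}
```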

@iholder101 iholder101 force-pushed the in-pod-vertical-scaling/update-cgroup-systemd-instead-of-cgroupfs branch from a3a2a11 to e80235b Compare October 30, 2024 11:58
Signed-off-by: Itamar Holder <iholder@redhat.com>
libcontainer's cgroup manager is agnostic to both the cgroup version and to whether systemd is used. This way, if systemd is used, the cgroup manager can update resources properly, so that the changes are not reverted if the daemon is restarted.

Signed-off-by: Itamar Holder <iholder@redhat.com>
Signed-off-by: Itamar Holder <iholder@redhat.com>
Signed-off-by: Itamar Holder <iholder@redhat.com>
@iholder101 iholder101 force-pushed the in-pod-vertical-scaling/update-cgroup-systemd-instead-of-cgroupfs branch from e80235b to 590cc46 Compare October 30, 2024 11:58
@iholder101
Contributor Author

When I tried this fix locally, the CPU quota of the pod cgroup was updated to max when the CPU limits were resized.

It looks like libcontainer updates cpu.max if either CpuQuota or CpuPeriod is set in Resources.

So, we should set CPUPeriod in podCpuResources only when CPUQuota in the pod cgroup is updated:

```go
case v1.ResourceCPU:
	podCpuResources := &cm.ResourceConfig{CPUPeriod: podResources.CPUPeriod}
	if setLimitValue {
		podCpuResources.CPUQuota = podResources.CPUQuota
	} else {
		podCpuResources.CPUShares = podResources.CPUShares
	}
	err = pcm.SetPodCgroupConfig(pod, rName, podCpuResources)
```

Otherwise, CPUQuota will be updated to max when CPUShares is updated while resizing a pod cgroup.

Great catch! Thanks a lot @hshiina!
Fixed 👍

Signed-off-by: Itamar Holder <iholder@redhat.com>
@iholder101 iholder101 force-pushed the in-pod-vertical-scaling/update-cgroup-systemd-instead-of-cgroupfs branch from 590cc46 to f21473b Compare October 30, 2024 12:21
```go
	}
}
if resourceConfig.CPUShares != nil {
	cpuWeight := cpuSharesToCPUWeight(*resourceConfig.CPUShares)
```
Member


I think we need to preserve this conversion. I think this can just be done in the common path, so both CPU shares and CPU weight are set, and libcontainer just uses the correct one depending on whether it's cgroups v1 or v2.

// NOTE: .CpuShares is not used here. Conversion is the caller's responsibility.

Contributor Author


Hey @tallclair!
I believe this conversion already takes place in the current state of this PR.

setCgroupCPUConfig is now unified here: https://github.com/kubernetes/kubernetes/pull/124216/files#diff-2f332b11d91d53583a6eeca75fc6e124aee6bc547a264e6987b89b1c786e0674R257.

In turn, the Update() function uses libctCgroupConfig() to convert the cgroup config to a runc struct:

```go
func (m *cgroupCommon) Update(cgroupConfig *CgroupConfig) error {
	start := time.Now()
	defer func() {
		metrics.CgroupManagerDuration.WithLabelValues("update").Observe(metrics.SinceInSeconds(start))
	}()
	libcontainerCgroupConfig := m.libctCgroupConfig(cgroupConfig, true)
```

And indeed this function handles the conversion from CPU shares to CPU weight:

```go
func (m *cgroupCommon) libctCgroupConfig(in *CgroupConfig, needResources bool) *libcontainerconfigs.Cgroup {
	config := &libcontainerconfigs.Cgroup{
		Systemd: m.useSystemd,
	}
	if needResources {
		config.Resources = m.toResources(in.ResourceParameters)
```

```go
func (m *cgroupCommon) toResources(resourceConfig *ResourceConfig) *libcontainerconfigs.Resources {
	resources := &libcontainerconfigs.Resources{
		SkipDevices:     true,
		SkipFreezeOnSet: true,
	}
	if resourceConfig == nil {
		return resources
	}
	if resourceConfig.Memory != nil {
		resources.Memory = *resourceConfig.Memory
	}
	if resourceConfig.CPUShares != nil {
		if libcontainercgroups.IsCgroup2UnifiedMode() {
			resources.CpuWeight = getCPUWeight(resourceConfig.CPUShares)
		} else {
			resources.CpuShares = *resourceConfig.CPUShares
		}
	}
```

```go
func getCPUWeight(cpuShares *uint64) uint64 {
	if cpuShares == nil {
		return 0
	}
	if *cpuShares >= 262144 {
		return 10000
	}
	return 1 + ((*cpuShares-2)*9999)/262142
}
```
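
For a concrete feel of this mapping (an illustrative snippet, not code from the PR), the formula projects the cgroup v1 shares range [2, 262144] onto the cgroup v2 weight range [1, 10000]:

```go
package main

import "fmt"

// Same formula as getCPUWeight above, taking a value instead of a pointer.
func sharesToWeight(cpuShares uint64) uint64 {
	if cpuShares >= 262144 {
		return 10000
	}
	return 1 + ((cpuShares-2)*9999)/262142
}

func main() {
	fmt.Println(sharesToWeight(2))      // 1     (minimum shares -> minimum weight)
	fmt.Println(sharesToWeight(1024))   // 39    (the common default of 1024 shares)
	fmt.Println(sharesToWeight(262144)) // 10000 (clamped to the maximum weight)
}
```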

WDYT?

Member


Ah, thanks. I missed that. Looks good then.

@tallclair
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 4, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: fb92c006d96019ec5848b889171e8a812c0882e4

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iholder101, tallclair

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 4, 2024
@k8s-ci-robot k8s-ci-robot merged commit a8e1f41 into kubernetes:master Nov 4, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Nov 4, 2024