
Conversation

@tallclair
Member

What type of PR is this?

/kind feature

What this PR does / why we need it:

Motivation & Background:

The Kubelet's PLEG only triggers events and fetches the latest PodStatus from the runtime when a container changes state. This is a problem for in-place pod resize, since resource updates are not considered a state change. To get around this, pleg.UpdateCache was called at the end of SyncPod if a pod was resizing, to force a "reinspection" of the pod status. This violates some principles of the PLEG, and also means that the Kubelet needs to wait a full resync period (1 minute) to detect that a resize completed.

What this PR does:

This PR adds a new concept of "Watch Conditions" to the PLEG. If a pod has a watch condition set, then the pod is reinspected (via runtime.GetPodStatus) on every Relist loop (every 2 seconds) until the watch condition function returns true, indicating that the condition was reached. The Kubelet sets a watch condition for each resource resize, so it no longer needs to manually force the reinspection via pleg.UpdateCache.
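
For illustration, the mechanism described above might look roughly like the following Go sketch. This is not the code in this PR: the pleg type, PodStatus, WatchCondition, SetWatchCondition, and the Relist signature are simplified, hypothetical stand-ins, and the getPodStatus field only plays the role of runtime.GetPodStatus. Returning the completed pod IDs from Relist is also an assumption, chosen to show where a follow-up sync could be triggered.

```go
package main

import "fmt"

// PodStatus is a hypothetical, heavily simplified pod status.
type PodStatus struct {
	CPUMilli int64
}

// WatchCondition returns true once the expected state has been observed.
type WatchCondition func(status PodStatus) bool

// pleg is a toy stand-in for the PLEG; every name here is illustrative.
type pleg struct {
	getPodStatus func(podID string) PodStatus // plays the role of runtime.GetPodStatus
	conditions   map[string]map[string]WatchCondition // podID -> key -> condition
}

func newPLEG(get func(string) PodStatus) *pleg {
	return &pleg{getPodStatus: get, conditions: map[string]map[string]WatchCondition{}}
}

// SetWatchCondition registers a condition for a pod, overwriting any
// existing condition stored under the same key.
func (p *pleg) SetWatchCondition(podID, key string, c WatchCondition) {
	if p.conditions[podID] == nil {
		p.conditions[podID] = map[string]WatchCondition{}
	}
	p.conditions[podID][key] = c
}

// Relist reinspects only pods that still have pending watch conditions.
// It drops each condition that reports true and returns the IDs of pods
// whose conditions have all been met.
func (p *pleg) Relist() []string {
	var completed []string
	for podID, conds := range p.conditions {
		status := p.getPodStatus(podID)
		for key, cond := range conds {
			if cond(status) {
				delete(conds, key) // deleting during range is safe in Go
			}
		}
		if len(conds) == 0 {
			delete(p.conditions, podID)
			completed = append(completed, podID)
		}
	}
	return completed
}

func main() {
	// Fake runtime: the resize becomes visible on the third inspection.
	calls := 0
	p := newPLEG(func(string) PodStatus {
		calls++
		if calls >= 3 {
			return PodStatus{CPUMilli: 500}
		}
		return PodStatus{CPUMilli: 250}
	})
	p.SetWatchCondition("pod-a", "app:resize:cpu", func(s PodStatus) bool {
		return s.CPUMilli == 500
	})
	for i := 1; i <= 4; i++ {
		fmt.Printf("relist %d: completed=%v pending=%d\n", i, p.Relist(), len(p.conditions))
	}
}
```

In this sketch the pod is reinspected on each of the first three relists, the condition is dropped once the fake runtime reports the new CPU value, and the fourth relist does not query the runtime at all.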

Not Yet Implemented: When a watch condition completes, the PLEG emits an event to trigger SyncPod. This should greatly speed up pod resize completion.

Which issue(s) this PR fixes:

Fixes #123940

Special notes for your reviewer:

This PR starts with some refactoring. I can pull the refactor into a separate PR if you prefer.

Does this PR introduce a user-facing change?

NONE

/sig node
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/node Categorizes an issue or PR as relevant to SIG Node. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 3, 2024
@tallclair
Member Author

/triage accepted
/milestone v1.32

@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Nov 3, 2024
@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Nov 3, 2024
@k8s-ci-robot k8s-ci-robot requested review from dims and feiskyer November 3, 2024 21:56
@k8s-ci-robot k8s-ci-robot added area/kubelet approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 3, 2024
@tallclair tallclair force-pushed the pleg-watch-conditions branch from bd966a6 to f460c5b Compare November 5, 2024 00:52
@tallclair tallclair changed the title [WIP] [FG:InPlacePodVerticalScaling] PLEG watch conditions: rapid polling for expected changes [FG:InPlacePodVerticalScaling] PLEG watch conditions: rapid polling for expected changes Nov 5, 2024
@k8s-ci-robot k8s-ci-robot added area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 5, 2024
@tallclair tallclair force-pushed the pleg-watch-conditions branch from f460c5b to 826044c Compare November 5, 2024 00:53
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tallclair

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 5, 2024
@tallclair
Member Author

I pulled in the change to block on the ResizeStatus being cleared from #128377, and ran the e2e tests locally. Without this PR, the tests timed out (at 27/35 resize tests). With this PR, the tests complete in about 10 minutes.

@tallclair
Member Author

/assign @yujuhong
/cc @smarterclayton

@tallclair tallclair force-pushed the pleg-watch-conditions branch from de4e5f0 to 7fce6f2 Compare November 6, 2024 19:06
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 6, 2024
"pod", format.Pod(pod), "resourceName", resourceName)
return err
}
resizeKey := fmt.Sprintf("%s:resize:%s", container.Name, resourceName)
Contributor

Do we only run this function (updatePodContainerResources) when there's actual resizing to perform? That is, will a subsequent sync skip this function even if the resources haven't converged yet?

I'm thinking about corner cases such as a kubelet restart: will the watch condition be set again? I'm also wondering how often we may see the watch condition being set.

Member Author

This function will get called on every SyncPod iteration until the resources converge. I think that's OK here. Updating the resources should take much less than the 1 minute resync period, so I'd expect the watch condition to be updated only a small number of times. In the restart case, the watch condition will get added back if the resources still haven't converged.
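
As a rough illustration of why repeated calls are cheap, here is a small Go sketch. It assumes (this is not confirmed by the thread) that watch conditions are stored in a map keyed by the resizeKey string from the diff above; conditionSet, set, and the resizeKey helper are hypothetical names used only for this example.

```go
package main

import "fmt"

// conditionSet is a hypothetical store of watch conditions, keyed by string.
type conditionSet map[string]func() bool

// resizeKey mirrors the key format shown in the diff:
// "<container name>:resize:<resource name>".
func resizeKey(containerName, resourceName string) string {
	return fmt.Sprintf("%s:resize:%s", containerName, resourceName)
}

// set stores a condition, replacing any earlier one under the same key.
func (s conditionSet) set(key string, cond func() bool) {
	s[key] = cond
}

func main() {
	conds := conditionSet{}
	notYet := func() bool { return false }

	// Simulate three SyncPod iterations that run before the resize converges.
	for i := 0; i < 3; i++ {
		conds.set(resizeKey("app", "cpu"), notYet)
		conds.set(resizeKey("app", "memory"), notYet)
	}

	// Only one condition per (container, resource) pair remains.
	fmt.Println(len(conds))
}
```

Under that assumption, setting the condition again on each SyncPod iteration, or after a restart, replaces the existing entry rather than accumulating new ones.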

@knabben
Member

knabben commented Nov 6, 2024

Hey @yujuhong @tallclair
⚠️ Do we still intend to merge this for v1.32? Just a reminder that the code freeze is starting 02:00 UTC Friday November 8th 2024 (a little less than 1 week from now). Please make sure the PR has both lgtm and approved labels before the code freeze. Thanks!

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 7, 2024
@yujuhong
Contributor

yujuhong commented Nov 7, 2024

/lgtm

Can we also run the resize test to verify?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 69f7ae5ac4a623c24beca49cf1a397b1eeb58481

@tallclair tallclair force-pushed the pleg-watch-conditions branch from 71b3ea2 to 24443b6 Compare November 7, 2024 01:01
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2024
@k8s-ci-robot k8s-ci-robot requested a review from yujuhong November 7, 2024 01:01
@tallclair
Member Author

Fixed lint error

@tallclair
Member Author

/test ?

@k8s-ci-robot
Contributor

@tallclair: The following commands are available to trigger required jobs:

  • /test pull-cos-containerd-e2e-ubuntu-gce
  • /test pull-kubernetes-conformance-kind-ga-only-parallel
  • /test pull-kubernetes-coverage-unit
  • /test pull-kubernetes-dependencies
  • /test pull-kubernetes-dependencies-go-canary
  • /test pull-kubernetes-e2e-gce
  • /test pull-kubernetes-e2e-gce-100-performance
  • /test pull-kubernetes-e2e-gce-cos
  • /test pull-kubernetes-e2e-gce-cos-canary
  • /test pull-kubernetes-e2e-gce-cos-no-stage
  • /test pull-kubernetes-e2e-gce-network-proxy-http-connect
  • /test pull-kubernetes-e2e-gce-pull-through-cache
  • /test pull-kubernetes-e2e-kind
  • /test pull-kubernetes-e2e-kind-ipv6
  • /test pull-kubernetes-integration
  • /test pull-kubernetes-integration-go-canary
  • /test pull-kubernetes-kubemark-e2e-gce-scale
  • /test pull-kubernetes-node-e2e-containerd
  • /test pull-kubernetes-typecheck
  • /test pull-kubernetes-unit
  • /test pull-kubernetes-unit-go-canary
  • /test pull-kubernetes-update
  • /test pull-kubernetes-verify
  • /test pull-kubernetes-verify-go-canary
  • /test pull-kubernetes-verify-lint

The following commands are available to trigger optional jobs:

  • /test check-dependency-stats
  • /test pull-crio-cgroupv1-node-e2e-eviction
  • /test pull-crio-cgroupv1-node-e2e-eviction-kubetest2
  • /test pull-crio-cgroupv1-node-e2e-features
  • /test pull-crio-cgroupv1-node-e2e-features-kubetest2
  • /test pull-crio-cgroupv1-node-e2e-hugepages
  • /test pull-crio-cgroupv1-node-e2e-hugepages-kubetest2
  • /test pull-crio-cgroupv1-node-e2e-resource-managers
  • /test pull-crio-cgroupv1-node-e2e-resource-managers-kubetest2
  • /test pull-crio-cgroupv2-imagefs-separatedisktest
  • /test pull-crio-cgroupv2-imagefs-separatedisktest-kubetest2
  • /test pull-crio-cgroupv2-node-e2e-eviction
  • /test pull-crio-cgroupv2-node-e2e-eviction-kubetest2
  • /test pull-crio-cgroupv2-node-e2e-hugepages
  • /test pull-crio-cgroupv2-node-e2e-hugepages-kubetest2
  • /test pull-crio-cgroupv2-node-e2e-resource-managers
  • /test pull-crio-cgroupv2-node-e2e-resource-managers-kubetest2
  • /test pull-crio-cgroupv2-splitfs-separate-disk
  • /test pull-crio-cgroupv2-splitfs-separate-disk-kubetest2
  • /test pull-e2e-gce-cloud-provider-disabled
  • /test pull-e2e-gci-gce-alpha-enabled-default
  • /test pull-kubernetes-apidiff
  • /test pull-kubernetes-conformance-image-test
  • /test pull-kubernetes-conformance-kind-ga-only
  • /test pull-kubernetes-conformance-kind-ipv6-parallel
  • /test pull-kubernetes-cos-cgroupv1-containerd-node-e2e
  • /test pull-kubernetes-cos-cgroupv1-containerd-node-e2e-features
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-eviction
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-features
  • /test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
  • /test pull-kubernetes-crio-node-memoryqos-cgrpv2
  • /test pull-kubernetes-crio-node-memoryqos-cgrpv2-kubetest2
  • /test pull-kubernetes-cross
  • /test pull-kubernetes-e2e-autoscaling-hpa-cm
  • /test pull-kubernetes-e2e-autoscaling-hpa-cpu
  • /test pull-kubernetes-e2e-capz-azure-disk
  • /test pull-kubernetes-e2e-capz-azure-disk-vmss
  • /test pull-kubernetes-e2e-capz-azure-file
  • /test pull-kubernetes-e2e-capz-azure-file-vmss
  • /test pull-kubernetes-e2e-capz-conformance
  • /test pull-kubernetes-e2e-capz-master-windows-nodelogquery
  • /test pull-kubernetes-e2e-capz-windows-alpha-feature-vpa
  • /test pull-kubernetes-e2e-capz-windows-alpha-features
  • /test pull-kubernetes-e2e-capz-windows-master
  • /test pull-kubernetes-e2e-capz-windows-serial-slow
  • /test pull-kubernetes-e2e-capz-windows-serial-slow-hpa
  • /test pull-kubernetes-e2e-containerd-gce
  • /test pull-kubernetes-e2e-ec2
  • /test pull-kubernetes-e2e-ec2-arm64
  • /test pull-kubernetes-e2e-ec2-conformance
  • /test pull-kubernetes-e2e-ec2-conformance-arm64
  • /test pull-kubernetes-e2e-ec2-device-plugin-gpu
  • /test pull-kubernetes-e2e-gce-canary
  • /test pull-kubernetes-e2e-gce-correctness
  • /test pull-kubernetes-e2e-gce-cos-alpha-features
  • /test pull-kubernetes-e2e-gce-csi-serial
  • /test pull-kubernetes-e2e-gce-device-plugin-gpu
  • /test pull-kubernetes-e2e-gce-disruptive-canary
  • /test pull-kubernetes-e2e-gce-kubelet-credential-provider
  • /test pull-kubernetes-e2e-gce-network-policies
  • /test pull-kubernetes-e2e-gce-network-proxy-grpc
  • /test pull-kubernetes-e2e-gce-serial
  • /test pull-kubernetes-e2e-gce-serial-canary
  • /test pull-kubernetes-e2e-gce-storage-disruptive
  • /test pull-kubernetes-e2e-gce-storage-selinux
  • /test pull-kubernetes-e2e-gce-storage-slow
  • /test pull-kubernetes-e2e-gce-storage-snapshot
  • /test pull-kubernetes-e2e-gci-gce-autoscaling
  • /test pull-kubernetes-e2e-gci-gce-ingress
  • /test pull-kubernetes-e2e-gci-gce-ipvs
  • /test pull-kubernetes-e2e-gci-gce-nftables
  • /test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2
  • /test pull-kubernetes-e2e-kind-alpha-beta-features
  • /test pull-kubernetes-e2e-kind-alpha-features
  • /test pull-kubernetes-e2e-kind-beta-features
  • /test pull-kubernetes-e2e-kind-canary
  • /test pull-kubernetes-e2e-kind-cloud-provider-loadbalancer
  • /test pull-kubernetes-e2e-kind-dual-canary
  • /test pull-kubernetes-e2e-kind-evented-pleg
  • /test pull-kubernetes-e2e-kind-ipv6-canary
  • /test pull-kubernetes-e2e-kind-ipvs
  • /test pull-kubernetes-e2e-kind-kms
  • /test pull-kubernetes-e2e-kind-multizone
  • /test pull-kubernetes-e2e-kind-nftables
  • /test pull-kubernetes-e2e-relaxed-environment-variable-validation
  • /test pull-kubernetes-e2e-storage-kind-alpha-beta-features
  • /test pull-kubernetes-e2e-storage-kind-disruptive
  • /test pull-kubernetes-e2e-storage-kind-volume-group-snapshots
  • /test pull-kubernetes-kind-dra
  • /test pull-kubernetes-kind-dra-all
  • /test pull-kubernetes-kind-json-logging
  • /test pull-kubernetes-kind-text-logging
  • /test pull-kubernetes-kubemark-e2e-gce-big
  • /test pull-kubernetes-linter-hints
  • /test pull-kubernetes-local-e2e
  • /test pull-kubernetes-node-arm64-e2e-containerd-ec2
  • /test pull-kubernetes-node-arm64-e2e-containerd-serial-ec2
  • /test pull-kubernetes-node-arm64-ubuntu-serial-gce
  • /test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
  • /test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2
  • /test pull-kubernetes-node-crio-cgrpv2-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-e2e-kubetest2
  • /test pull-kubernetes-node-crio-cgrpv2-imagefs-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-imagefs-e2e-kubetest2
  • /test pull-kubernetes-node-crio-cgrpv2-imagevolume-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-imagevolume-e2e-kubetest2
  • /test pull-kubernetes-node-crio-cgrpv2-splitfs-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-splitfs-e2e-kubetest2
  • /test pull-kubernetes-node-crio-cgrpv2-userns-e2e-serial
  • /test pull-kubernetes-node-crio-cgrpv2-userns-e2e-serial-kubetest2
  • /test pull-kubernetes-node-crio-e2e
  • /test pull-kubernetes-node-crio-e2e-kubetest2
  • /test pull-kubernetes-node-e2e-alpha-ec2
  • /test pull-kubernetes-node-e2e-containerd-1-7-dra
  • /test pull-kubernetes-node-e2e-containerd-alpha-features
  • /test pull-kubernetes-node-e2e-containerd-ec2
  • /test pull-kubernetes-node-e2e-containerd-features
  • /test pull-kubernetes-node-e2e-containerd-features-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-serial-ec2
  • /test pull-kubernetes-node-e2e-containerd-serial-ec2-eks
  • /test pull-kubernetes-node-e2e-containerd-standalone-mode
  • /test pull-kubernetes-node-e2e-containerd-standalone-mode-all-alpha
  • /test pull-kubernetes-node-e2e-cri-proxy-serial
  • /test pull-kubernetes-node-e2e-crio-cgrpv1-dra
  • /test pull-kubernetes-node-e2e-crio-cgrpv1-dra-kubetest2
  • /test pull-kubernetes-node-e2e-crio-cgrpv2-dra
  • /test pull-kubernetes-node-e2e-crio-cgrpv2-dra-kubetest2
  • /test pull-kubernetes-node-e2e-resource-health-status
  • /test pull-kubernetes-node-kubelet-containerd-flaky
  • /test pull-kubernetes-node-kubelet-credential-provider
  • /test pull-kubernetes-node-kubelet-serial-containerd
  • /test pull-kubernetes-node-kubelet-serial-containerd-alpha-features
  • /test pull-kubernetes-node-kubelet-serial-containerd-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-containerd-sidecar-containers
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv1-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv2-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-hugepages
  • /test pull-kubernetes-node-kubelet-serial-memory-manager
  • /test pull-kubernetes-node-kubelet-serial-podresize
  • /test pull-kubernetes-node-kubelet-serial-podresources
  • /test pull-kubernetes-node-kubelet-serial-topology-manager
  • /test pull-kubernetes-node-kubelet-serial-topology-manager-kubetest2
  • /test pull-kubernetes-node-swap-conformance-fedora-serial
  • /test pull-kubernetes-node-swap-conformance-ubuntu-serial
  • /test pull-kubernetes-node-swap-fedora
  • /test pull-kubernetes-node-swap-fedora-serial
  • /test pull-kubernetes-node-swap-ubuntu-serial
  • /test pull-kubernetes-scheduler-perf
  • /test pull-kubernetes-unit-experimental
  • /test pull-publishing-bot-validate

Use /test all to run the following jobs that were automatically triggered:

  • pull-kubernetes-conformance-kind-ga-only-parallel
  • pull-kubernetes-dependencies
  • pull-kubernetes-e2e-ec2
  • pull-kubernetes-e2e-ec2-conformance
  • pull-kubernetes-e2e-gce
  • pull-kubernetes-e2e-gce-canary
  • pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2
  • pull-kubernetes-e2e-kind
  • pull-kubernetes-e2e-kind-ipv6
  • pull-kubernetes-integration
  • pull-kubernetes-linter-hints
  • pull-kubernetes-node-e2e-containerd
  • pull-kubernetes-typecheck
  • pull-kubernetes-unit
  • pull-kubernetes-verify
  • pull-kubernetes-verify-lint

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@tallclair
Member Author

/test pull-e2e-gci-gce-alpha-enabled-default

@tallclair
Member Author

/retest

@yujuhong
Contributor

yujuhong commented Nov 7, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: b63385b6bdf2f370f2fcb77d61649b9c29d9d0f3

@k8s-ci-robot k8s-ci-robot merged commit 25101d3 into kubernetes:master Nov 7, 2024
15 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

[FG:InPlacePodVerticalScaling] PLEG doesn't work well with alpha feature InPlacePodVerticalScaling

4 participants