Resolved compatibility issue between Kubelet PLEG and inplace VPA #123941

Jeffwan · 2024-03-14T20:39:01Z

Make the in-place VPA feature work with PLEG relist. Use the auxiliary runtime pod status and the PLEG cache pod status to identify a resized pod, and make sure it generates the correct PLEG event and that the event reaches the event channel.

What type of PR is this?

/kind bug

What this PR does / why we need it:

PLEG doesn't handle resized pods well. See #123940 for more details.

This is part of kubernetes/enhancements#4433

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

I have not added tests yet. I did some manual e2e tests. If the idea looks good to you, I will spend some time improving the test coverage.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Jeffwan · 2024-03-14T20:39:25Z

/sig node

Jeffwan · 2024-03-14T20:46:51Z

pkg/kubelet/kubelet.go

-	if utilfeature.DefaultFeatureGate.Enabled(features.InPlacePodVerticalScaling) && isPodResizeInProgress(pod, &apiPodStatus) {
-		// While resize is in progress, periodically call PLEG to update pod cache
-		runningPod := kubecontainer.ConvertPodStatusToRunningPod(kl.getRuntime().Type(), podStatus)
-		if err, _ := kl.pleg.UpdateCache(&runningPod, pod.UID); err != nil {


The only purpose of this logic was to update the cache. However, UpdateCache internally invokes runtime.GetPodStatus(), which retrieves the latest CRI status. The cache then stores that latest state, so it can no longer serve as the baseline for state comparison in a future Relist() loop.
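The effect described above can be illustrated with a small, self-contained sketch. It is not the kubelet code: `resources`, `statusCache`, and `relistDetectsResize` are hypothetical stand-ins, assuming only that relist compares a cached status against the status the runtime reports now.

```go
package main

import "fmt"

// resources is a hypothetical, simplified stand-in for the container
// resources recorded in a pod status.
type resources struct {
	CPUMilli int64
	MemBytes int64
}

// statusCache is a hypothetical stand-in for the PLEG pod status cache,
// keyed by container ID.
type statusCache map[string]resources

// relistDetectsResize reports whether a relist pass would see a resize.
// It compares the cached resources with what the runtime reports now,
// then stores the observed value as the new baseline.
func relistDetectsResize(cache statusCache, id string, observed resources) bool {
	old, ok := cache[id]
	cache[id] = observed
	return ok && old != observed
}

func main() {
	before := resources{CPUMilli: 500, MemBytes: 256 << 20}
	after := resources{CPUMilli: 1000, MemBytes: 256 << 20}

	// Case 1: only relist writes the cache, so the old baseline survives.
	c1 := statusCache{"c1": before}
	fmt.Println("relist only:", relistDetectsResize(c1, "c1", after))

	// Case 2: an out-of-band refresh overwrites the baseline first,
	// which is what an UpdateCache-style call effectively does.
	c2 := statusCache{"c1": before}
	c2["c1"] = after
	fmt.Println("after out-of-band refresh:", relistDetectsResize(c2, "c1", after))
}
```

In the second case the comparison has nothing left to detect, because the baseline already equals the observed state.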

Jeffwan · 2024-03-14T20:49:40Z

/cc @smarterclayton @bobbypage @liggitt @kubernetes/sig-node-pr-reviews

k8s-ci-robot · 2024-03-14T20:49:44Z

@Jeffwan: GitHub didn't allow me to request PR reviews from the following users: kubernetes/sig-node-pr-reviews.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

smarterclayton · 2024-04-02T17:17:33Z

pkg/kubelet/pleg/generic.go

 		oldPod := g.podRecords.getOld(pid)
 		pod := g.podRecords.getCurrent(pid)
+
+		var cachePodStatus *kubecontainer.PodStatus


We would also need to calculate in the evented pleg, so make sure we don't miss it there.

Jeffwan · Apr 15, 2024

That's a good point. I will cover it in an upcoming commit.

dchen1107 · 2024-04-02T17:18:36Z

/assign @tallclair

bart0sh · 2024-04-05T11:19:40Z

/triage accepted
/priority important-longterm

bart0sh · 2024-04-05T11:21:10Z

@Jeffwan Please fix CI test failures, thanks.

As this is a bugfix, it would be great to get this use case covered by e2e tests.

linux-foundation-easycla · 2024-05-14T23:36:08Z

The committers listed above are authorized under a signed CLA.

✅ login: horacexd / name: Zewei Ding (74310c3)
✅ login: Jeffwan / name: Jiaxin Shan (58b3c49)

hshiina · 2024-05-23T09:25:26Z

pkg/kubelet/pleg/generic.go

+		newContainerStatus := podStatus.FindContainerStatusByContainerID(cid)
+		if oldContainerStatus != nil && newContainerStatus != nil && !containerResourceSame(oldContainerStatus.Resources, newContainerStatus.Resources) {
+			klog.V(5).InfoS("resize pods triggers the plegContainerUnknown event", "oldContainerStatus", oldContainerStatus, "newContainerStatus", newContainerStatus)
+			return generateEvents(pid, cid.ID, oldState, plegContainerUnknown)


It might be better to consider an edge case where a resized container has exited at the same time (newState==exited).
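One possible way to handle this edge case, sketched under assumptions, is to check the lifecycle transition before comparing resources. The types below are hypothetical, and `ContainerResized` is an illustrative name rather than an existing PLEG event type; the PR itself emits `plegContainerUnknown` for a resource change.

```go
package main

import "fmt"

// containerState is a simplified, hypothetical container lifecycle state.
type containerState string

const (
	stateRunning containerState = "running"
	stateExited  containerState = "exited"
)

// snapshot is a hypothetical view of one container at relist time.
type snapshot struct {
	State    containerState
	CPUMilli int64
}

// classify picks one event name for a container. A lifecycle transition is
// checked before the resource comparison, so a container that was resized
// and exited in the same interval is still reported as died.
// "ContainerResized" is an illustrative name, not an existing event type.
func classify(old, cur snapshot) string {
	switch {
	case old.State != cur.State && cur.State == stateExited:
		return "ContainerDied"
	case old.State != cur.State && cur.State == stateRunning:
		return "ContainerStarted"
	case old.CPUMilli != cur.CPUMilli:
		return "ContainerResized"
	default:
		return "" // nothing changed
	}
}

func main() {
	old := snapshot{State: stateRunning, CPUMilli: 500}
	// Resized while still running.
	fmt.Println(classify(old, snapshot{State: stateRunning, CPUMilli: 1000}))
	// Resized and exited within the same relist interval.
	fmt.Println(classify(old, snapshot{State: stateExited, CPUMilli: 1000}))
}
```

Ordering the checks this way keeps the exit from being masked by the resize; whether a second event for the resize is also needed is a separate design question.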

hshiina · 2024-05-23T09:48:24Z

pkg/kubelet/pleg/generic.go

 		for i := range events {
 			// Filter out events that are not reliable and no other components use yet.
 			if events[i].Type == ContainerChanged {
-				continue


This prevents a ContainerChanged event from being sent when a container status is created, which is sometimes detected and converted to unknown at L99. Even if this event is sent, it doesn't seem to cause a problem at a glance. However, to avoid unexpected side effects, it would be better not to send an event when a container status is created, either by using another event (PodSync or a new one) for resizing or by checking the container status.

hshiina · 2024-05-31T12:21:56Z

pkg/kubelet/pleg/generic.go

+				return
+			}
+			if pod != nil {
+				podStatus, err = g.runtime.GetPodStatus(ctx, pod.ID, pod.Name, pod.Namespace)


Do we have to call it for all pods? Wouldn't it be enough to get a pod status from the runtime only when the pod is being resized (InProgress)?
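The suggestion can be sketched as a filter in front of the runtime call. This is a minimal illustration under assumptions, not the PR's code: `podInfo`, `statusFetcher`, and `refreshResizingPods` are hypothetical, and the resize flag is assumed to be known without querying the runtime.

```go
package main

import "fmt"

// podInfo is a hypothetical, trimmed-down pod record.
type podInfo struct {
	ID               string
	ResizeInProgress bool
}

// statusFetcher stands in for a runtime status call; the real call is
// assumed to be comparatively expensive.
type statusFetcher func(podID string) error

// refreshResizingPods calls fetch only for pods with a resize in progress
// and returns how many runtime calls were made.
func refreshResizingPods(pods []podInfo, fetch statusFetcher) (int, error) {
	calls := 0
	for _, p := range pods {
		if !p.ResizeInProgress {
			continue // unchanged pods keep using the cached status
		}
		calls++
		if err := fetch(p.ID); err != nil {
			return calls, fmt.Errorf("fetching status for pod %s: %w", p.ID, err)
		}
	}
	return calls, nil
}

func main() {
	pods := []podInfo{{ID: "a"}, {ID: "b", ResizeInProgress: true}, {ID: "c"}}
	calls, err := refreshResizingPods(pods, func(string) error { return nil })
	fmt.Println(calls, err)
}
```

With this shape, the cost of a relist pass grows with the number of pods being resized rather than with the total number of pods on the node.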

k8s-triage-robot · 2024-08-29T13:05:38Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jeffwan · 2024-09-09T17:46:43Z

/remove-lifecycle stale

Jeffwan · 2024-09-09T17:53:59Z

@horacexd and I are still working on this story, so I removed the stale label. We will address the comments and polish the code to production grade soon.

k8s-ci-robot · 2024-09-09T18:36:04Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Jeffwan
Once this PR has been reviewed and has the lgtm label, please ask for approval from tallclair. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/kubelet/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Make In-place VPA feature work with PLEG relist. Use auxiliary runtime pod status and PLEG cache pod status to distinguish the resize pod and make sure it generate correct PLEG event and come into the event channel.

Co-authored-by: Lingyan Yin <yin.387@osu.edu>
Co-authored-by: Zewei Ding <horace.d@outlook.com>
Co-authored-by: Shengjie Xue <3150104939@zju.edu.cn>

Change-Id: I7a715b8525832f0c39ae0fa25dc42cbb3b9043f9

k8s-ci-robot · 2024-10-22T00:28:00Z

@Jeffwan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kubernetes-typecheck	`74310c3`	link	true	`/test pull-kubernetes-typecheck`
pull-kubernetes-verify-lint	`74310c3`	link	true	`/test pull-kubernetes-verify-lint`
pull-kubernetes-unit	`74310c3`	link	true	`/test pull-kubernetes-unit`
pull-kubernetes-verify	`74310c3`	link	true	`/test pull-kubernetes-verify`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

tallclair · 2024-10-25T17:14:38Z

pkg/kubelet/pleg/generic.go

+				return
+			}
+			if pod != nil {
+				podStatus, err = g.runtime.GetPodStatus(ctx, pod.ID, pod.Name, pod.Namespace)


If I'm reading this correctly, this is going to call GetPodStatus() on every pod, on every relist loop? Won't that make the relist too expensive? I assume the logic that made this conditional originally was intentional to avoid this.

I think this needs a lot of scale & performance testing before we can proceed with this change.

k8s-ci-robot · 2024-10-26T02:21:46Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

tallclair · 2024-11-03T21:58:41Z

I'm proposing an alternative approach to this in #128518

k8s-triage-robot · 2025-02-02T16:41:06Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen
Mark this PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2025-02-02T16:41:11Z

@k8s-triage-robot: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added sig/node, area/kubelet and removed do-not-merge/needs-sig labels Mar 14, 2024

k8s-ci-robot requested review from dchen1107 and derekwaynecarr March 14, 2024 20:39

Jeffwan mentioned this pull request Mar 14, 2024

[FG:InPlacePodVerticalScaling] PLEG doesn't work well with alpha feature InPlacePodVerticalScaling #123940

Closed

k8s-ci-robot requested review from bobbypage, liggitt and smarterclayton March 14, 2024 20:49

Jeffwan force-pushed the jiaxin/kep-1287-pleg-optimization branch from 8c90373 to e58fbfe Compare March 15, 2024 00:12

smarterclayton reviewed Apr 2, 2024

k8s-ci-robot assigned tallclair Apr 2, 2024

k8s-ci-robot added the size/L label May 14, 2024

k8s-ci-robot added cncf-cla: no and removed cncf-cla: yes labels May 14, 2024

hshiina reviewed May 23, 2024

hshiina reviewed May 31, 2024

k8s-ci-robot added the lifecycle/stale label Aug 29, 2024

k8s-ci-robot removed the lifecycle/stale label Sep 9, 2024

horacexd force-pushed the jiaxin/kep-1287-pleg-optimization branch from 1f53437 to 1700047 Compare September 9, 2024 18:35

k8s-ci-robot added cncf-cla: yes and removed cncf-cla: no labels Sep 9, 2024

k8s-ci-robot added the needs-rebase label Sep 17, 2024

Jeffwan and others added 2 commits October 21, 2024 16:36

Add unit test for Kubelet PLEG with InPlacePodVerticalScaling

74310c3

Change-Id: I7a715b8525832f0c39ae0fa25dc42cbb3b9043f9

horacexd force-pushed the jiaxin/kep-1287-pleg-optimization branch from 1700047 to 74310c3 Compare October 21, 2024 23:36

k8s-ci-robot removed the needs-rebase label Oct 21, 2024

tallclair reviewed Oct 25, 2024

tallclair mentioned this pull request Oct 25, 2024

[FG:InPlacePodVerticalScaling] Rework handling of allocated resources #128269

Merged

k8s-ci-robot added the needs-rebase label Oct 26, 2024

dims added the lifecycle/rotten label Jan 3, 2025

k8s-ci-robot closed this Feb 2, 2025
