-
Notifications
You must be signed in to change notification settings - Fork 41.6k
[FG:InPlacePodVerticalScaling] PLEG watch conditions: rapid polling for expected changes #128518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FG:InPlacePodVerticalScaling] PLEG watch conditions: rapid polling for expected changes #128518
Conversation
|
/triage accepted |
bd966a6 to
f460c5b
Compare
f460c5b to
826044c
Compare
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tallclair The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
I pulled in the change to block on the ResizeStatus being cleared from #128377, and ran the e2e tests locally. Without this PR, the tests timed out (at 27/35 resize tests). With this PR, the tests complete in about 10 minutes. |
|
/assign @yujuhong |
de4e5f0 to
7fce6f2
Compare
| "pod", format.Pod(pod), "resourceName", resourceName) | ||
| return err | ||
| } | ||
| resizeKey := fmt.Sprintf("%s:resize:%s", container.Name, resourceName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we only run this function updatePodContainerResources when there''s actual resizing to perform, i.e., subsequent sync will skip this function even if the resources haven't converged yet?
I'm thinking about the corner cases such as kubelet restart, whether the watch condition will be set again. And also how often we may see the watch condition being set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function will get called on every SyncPod iteration until the resources converge. I think that's OK here. Updating the resources should take much less than the 1 minute resync period, so I'd expect this to be a very small number of times updating the watch condition. In the restart case, the watch condition will get added back if the resources still haven't converged.
|
Hey @yujuhong @tallclair |
|
/lgtm Can we also run the resize test to verify? |
|
LGTM label has been added. Git tree hash: 69f7ae5ac4a623c24beca49cf1a397b1eeb58481
|
71b3ea2 to
24443b6
Compare
|
Fixed lint error |
|
/test ? |
|
@tallclair: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/test pull-e2e-gci-gce-alpha-enabled-default |
|
/retest |
|
/lgtm |
|
LGTM label has been added. Git tree hash: b63385b6bdf2f370f2fcb77d61649b9c29d9d0f3
|
What type of PR is this?
/kind feature
What this PR does / why we need it:
Motivation & Background:
The Kubelet's PLEG only triggers events and fetches the latest PodStatus from the runtime when a container changes state. This is a problem for in-place pod resize, since resource updates are not considered a state change. To get around this,
pleg.UpdateCachewas called at the end of SyncPod if a pod was resizing to force a "reinspection" of the pod status. This violates some principles of the PLEG, and also means that resizes need to wait a full resync period (1 minute) to detect that the resize completed.What this PR does:
This PR adds a new concept of "Watch Conditions" to the PLEG. If a pod has a watch condition set, then the pod is reinspected (via
runtime.GetPodStatus) on everyRelistloop (every 2 seconds) until the watch condition function returns true, indicating that the condition was reached. The Kubelet sets a watch condition for each resource resize, so it no longer needs to manually do the reinspection via update cache.Not Yet Implemented: When a watch condition is completed, an event is emitted by the PLEG to trigger SyncPod. This should greatly speed up pod resize completion.
Which issue(s) this PR fixes:
Fixes #123940
Special notes for your reviewer:
This PR starts with some refactoring. I can pull the refactor into a separate PR if you prefer.
Does this PR introduce a user-facing change?
/sig node
/priority important-longterm