[FG:InPlacePodVerticalScaling] Move resize allocation logic out of the sync loop #131612

natasha41575 · 2025-05-05T16:33:55Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR moves the in-place pod resize allocation logic out of the sync loop. It is organized into the following 5 commits:

1. Untangle the HandlePodResourcesResize unit tests and move them into the allocation package.
2. Add helper methods IsPodResizeInfeasible and IsPodResizeDeferred to the status_manager.
3. Update the allocation_manager methods to hold all the control logic required for handling pod resize allocation, and update the kubelet to no longer attempt to allocate pod resizes in the sync loop (with unit tests updated accordingly).
4. Update the allocation manager's unit tests to cover PushPendingResizes and RetryPendingResizes.
5. Skip pending resize evaluation if sources aren't ready (per discussion on the previous PR).
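
For reviewers who want a mental model of how the pieces in these commits fit together, here is a minimal sketch. It is not the code in this PR: the method names IsPodResizeInfeasible, IsPodResizeDeferred, PushPendingResizes and RetryPendingResizes come from the commit list, but every type, field, signature, and the `evaluate` callback are hypothetical simplifications. In particular, keeping infeasible resizes in the pending queue is an assumption of the sketch.

```go
package main

import "fmt"

// resizeState is a hypothetical stand-in for the resize conditions on a pod.
type resizeState int

const (
	resizeNone       resizeState = iota // nothing pending; the resize was allocated
	resizeDeferred                      // does not fit right now, may fit later
	resizeInfeasible                    // cannot fit on this node
)

// statusManager is a hypothetical, simplified status manager keyed by pod UID.
type statusManager struct {
	state map[string]resizeState
}

func (s *statusManager) IsPodResizeDeferred(uid string) bool {
	return s.state[uid] == resizeDeferred
}

func (s *statusManager) IsPodResizeInfeasible(uid string) bool {
	return s.state[uid] == resizeInfeasible
}

// allocationManager is a hypothetical, simplified allocation manager.
type allocationManager struct {
	status       *statusManager
	evaluate     func(uid string) resizeState // assumed admission check
	sourcesReady func() bool                  // assumed readiness check
	pending      []string                     // pod UIDs with a pending resize
}

func newAllocationManager(evaluate func(string) resizeState, sourcesReady func() bool) *allocationManager {
	return &allocationManager{
		status:       &statusManager{state: map[string]resizeState{}},
		evaluate:     evaluate,
		sourcesReady: sourcesReady,
	}
}

// PushPendingResizes queues pods for a later retry, ignoring duplicates.
func (m *allocationManager) PushPendingResizes(uids ...string) {
	for _, uid := range uids {
		queued := false
		for _, p := range m.pending {
			if p == uid {
				queued = true
				break
			}
		}
		if !queued {
			m.pending = append(m.pending, uid)
		}
	}
}

// RetryPendingResizes re-evaluates every queued resize and returns the UIDs
// that were allocated. It does nothing while sources are not ready.
func (m *allocationManager) RetryPendingResizes() []string {
	if !m.sourcesReady() {
		return nil
	}
	var allocated, stillPending []string
	for _, uid := range m.pending {
		state := m.evaluate(uid)
		m.status.state[uid] = state
		if state == resizeNone {
			allocated = append(allocated, uid)
		} else {
			stillPending = append(stillPending, uid)
		}
	}
	m.pending = stillPending
	return allocated
}

func main() {
	roomFreed := false
	m := newAllocationManager(func(uid string) resizeState {
		if roomFreed {
			return resizeNone
		}
		return resizeDeferred
	}, func() bool { return true })

	m.PushPendingResizes("pod-a")
	fmt.Println(m.RetryPendingResizes(), m.status.IsPodResizeDeferred("pod-a"))
	roomFreed = true
	fmt.Println(m.RetryPendingResizes(), m.status.IsPodResizeDeferred("pod-a"))
}
```

The point of the sketch is only that the sync loop no longer decides anything: callers push work and trigger retries, and the allocation manager owns the decision and the resulting status.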

The intention of this PR is to reattempt pending resizes:

- whenever HandlePodAdditions or HandlePodUpdates receives a resize request that it didn't already have,
- upon deletion of another pod,
- upon the successful actuation of another resize,
- or periodically. This PR sets the timer to fire every 3 minutes, but we should consider whether that is the right interval.

Special notes for your reviewer

Intended follow-ups:

- This PR is a prerequisite for prioritized resizes but does not implement them. The PR was already getting too large to review, and the design for prioritized resizes is still pending (KEP-1287: Priority of Resize Requests enhancements#5266). This change is also useful on its own without prioritized resizes, and I left a TODO for them.
- Some cleanup, such as moving some unit tests around, unexporting functions that no longer need to be exported, and removing code that is no longer needed. I left some of these things out of this PR to keep the size down.

Which issue(s) this PR fixes:

Does not yet fix it, but this is part of #116971.

Does this PR introduce a user-facing change?

NONE

/sig node
/priority important-soon
/triage accepted
/cc @tallclair

TODO:

- ~~retry deferred resizes in HandlePodCleanups~~
  - I don't think anything in HandlePodCleanups affects the admission decision (but I could be wrong). The admission decision appears to depend on the pod manager as the source of truth (through kl.podManager.GetPods), and the pod manager is not updated in HandlePodCleanups, so I don't think retrying the pending resizes here is necessary.
- double check that the logic in HandlePodAdditions and HandlePodUpdates is correct (maybe add unit tests covering resize cases)
- allocation manager unit tests
- fix an issue where, even when the resize is deferred and neither allocated nor actuated, the pod status shows updated allocated and actual resources
- fix an issue where a pending resize that gets reverted does not have its pending condition cleared quickly enough
- sanity check by running this e2e locally
- there seems to be more latency than necessary in accepting a pending resize after another pod is scaled down to make room; I want to investigate this, but it doesn't necessarily have to be blocking
- skip retry of pending resizes if sources aren't ready (!kl.sourcesReady.AllReady())
- rebase on "move pod admission and resize logic into the allocation manager" (#131801)
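
The reasoning in the HandlePodCleanups item above can be made concrete with a small sketch. It assumes, as that item suggests, that the admission decision is a function of what the pod manager reports and nothing else, so only events that change the pod manager can change the outcome of a retry. The `pod` and `podManager` types, `RemovePod`, and `resizeFits` are hypothetical simplifications that track CPU only; only the GetPods name comes from the text.

```go
package main

import "fmt"

// pod is a hypothetical, simplified pod: a UID plus allocated CPU in millicores.
type pod struct {
	UID      string
	MilliCPU int64
}

// podManager is a hypothetical stand-in for the kubelet's pod manager.
type podManager struct {
	pods map[string]pod
}

// GetPods returns every pod the manager currently knows about.
func (pm *podManager) GetPods() []pod {
	out := make([]pod, 0, len(pm.pods))
	for _, p := range pm.pods {
		out = append(out, p)
	}
	return out
}

// RemovePod models a pod deletion reaching the pod manager.
func (pm *podManager) RemovePod(uid string) {
	delete(pm.pods, uid)
}

// resizeFits reports whether resizing pod uid to want millicores fits within
// capacity, counting the current allocation of every other pod.
func resizeFits(pm *podManager, capacity int64, uid string, want int64) bool {
	var used int64
	for _, p := range pm.GetPods() {
		if p.UID != uid {
			used += p.MilliCPU
		}
	}
	return used+want <= capacity
}

func main() {
	pm := &podManager{pods: map[string]pod{
		"pod-a": {UID: "pod-a", MilliCPU: 400},
		"pod-b": {UID: "pod-b", MilliCPU: 500},
	}}
	const capacity = 1000

	fmt.Println("before deletion:", resizeFits(pm, capacity, "pod-a", 600))
	pm.RemovePod("pod-b")
	fmt.Println("after deletion:", resizeFits(pm, capacity, "pod-a", 600))
}
```

Under that assumption, a retry placed somewhere that never changes the pod manager's contents would just recompute the same answer, which is the argument for dropping the HandlePodCleanups retry.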

k8s-ci-robot · 2025-05-05T16:33:58Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all