Fix nil pointer access panic in kubelet from uninitialized pod allocation checkpoint manager in standalone kubelet scenario #116271

vinaykul · 2023-03-04T08:16:35Z

What type of PR is this?

/kind bug

What this PR does / why we need it: The in-place pod resize checkpoint store code panics when a standalone kubelet attempts to start. The current code has a nil pointer access bug. This PR fixes the issue.

Which issue(s) this PR fixes: #116262

Fixes #116262

Special notes for your reviewer:

Testing done:

Hacky way: by not initializing the checkpoint manager
Pods come up normally with local cluster.
Verified local cluster pod resize E2E tests work.
Unit tests

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

vinaykul · 2023-03-04T08:18:34Z

/assign @derekwaynecarr @liggitt @thockin @Random-Liu

liggitt · 2023-03-04T23:13:37Z

pkg/kubelet/kubelet.go

-			otherPods = append(otherPods, op)
+			attrs.OtherPods = otherPods
+		} else {
+			klog.ErrorS(nil, "pod resource allocation checkpoint manager is not initialized.")


If checkpoint state is nil in non-error cases (like standalone kubelet) we can't log errors on every pod sync. We'll flood the error log with these otherwise.

Is there a way we can make the VPA feature completely inactive/inert for standalone kubelets, rather than sprinkling nil checks throughout, since a kubelet running that way won't be getting pod updates from the API that change resources?

Thanks for the review and catching this issue. I think I can add a couple of Get interfaces to status manager (similar pattern to GetPodStatus) and avoid these nil checks. I'll work on it.

BTW, how did you find this issue? Is this it: https://github.com/kelseyhightower/standalone-kubelet-tutorial ?

how did you find this issue?

we (GKE) run standalone kubelets, and our CI started failing once the VPA change merged to master

I think I can add a couple of Get interfaces to status manager (similar pattern to GetPodStatus) and avoid these nil checks

Does the VPA feature make sense for a standalone kubelet? I don't think it does. It would be easier to reason about if standalone kubelets didn't ever hit VPA code paths.

maybe something like this:

mkdir -p /var/run/kubernetes/static-pods
echo "
kind: Pod
apiVersion: v1
metadata:
  name: busybox
spec:
  containers:
  - name: busybox
    image: busybox
" > /var/run/kubernetes/static-pods/busybox.yaml
START_MODE=kubeletonly FEATURE_GATES=InPlacePodVerticalScaling=true hack/local-up-cluster.sh

but I bet local-up-cluster.sh has atrophied and START_MODE=kubeletonly won't work quite right

@SergeyKanzhelev @smarterclayton looks like we have a general test gap for standalone kubelet that regularly causes problems:

kubelet: fix nil pointer in startReflector for standalone mode #113501

Managing nil pointer in VolumeManager #108442

standalone kubelet panic because of nil pointer in VolumeManager #108063

Standalone kubelet demands pki/kubelet.key #87558

kubelet: Observed a panic: "invalid memory address or nil pointer dereference" #77174

ignore kubeclient nil in csi plugin init #75308 (this actually seems similar to this PR in keeping code paths active and trying to add nil checks, which exposed later code to unexpected nils)

START_MODE=kubeletonly FEATURE_GATES=InPlacePodVerticalScaling=true hack/local-up-cluster.sh

but I bet local-up-cluster.sh has atrophied and START_MODE=kubeletonly won't work quite right

Thanks, that works. Needs one additional envvar:
POD_MANIFEST_PATH=/var/run/kubernetes/static-pods START_MODE=kubeletonly FEATURE_GATES=InPlacePodVerticalScaling=true hack/local-up-cluster.sh

Anyways, I have another commitment I need to get to. I'll verify this with the above suggested optimization later in the day today.

Thanks Jordan for raising this, AFAIU I don't think we have good coverage of standalone kubelet in CI today. I think we could think about extending node e2e to also support a mode where we spin up kubelet in standalone mode specifically and only run static pods as a few tests.

In general, it may also help to have an environment we can dedicate to static-pod-related tests, since we have a few issues there that @smarterclayton and others have been trying to tackle recently.

Looking more closely, kubeletonly mode conveniently brings up just the kubelet (the POD_MANIFEST_PATH default works), but kubeClient is still created. However, I was able to set kubeDeps.KubeClient = nil just before NewMainKubelet and get a repro to verify the before and after.

We don't need a new check for kubeClient==nil before invoking handlePodResourcesResize because the !kubetypes.IsStaticPod(pod) check achieves the same goal.
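
To illustrate the point about the static pod check, here is a minimal standalone sketch, not the kubelet code. The `pod` type, its `source` field, and `shouldHandleResize` are hypothetical names made up for this example; `isStaticPod` only loosely follows the idea behind `kubetypes.IsStaticPod` (a pod that does not come from the API server is static). Since a standalone kubelet only runs static pods, the guard keeps it out of the resize path without a separate kubeClient nil check.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for *v1.Pod with only what this sketch needs.
type pod struct {
	name   string
	source string // hypothetical: "api" for API-server pods, "file" or "http" for static pods
}

// isStaticPod loosely follows the idea behind kubetypes.IsStaticPod:
// a pod whose source is anything other than the API server is static.
func isStaticPod(p *pod) bool {
	return p.source != "api"
}

// shouldHandleResize reports whether the resize path should run for p.
// Static pods are always skipped, so a kubelet that only runs static pods
// never reaches the resize code.
func shouldHandleResize(featureEnabled bool, p *pod) bool {
	return featureEnabled && !isStaticPod(p)
}

func main() {
	static := &pod{name: "busybox", source: "file"}
	fromAPI := &pod{name: "web", source: "api"}

	fmt.Println(shouldHandleResize(true, static))   // false
	fmt.Println(shouldHandleResize(true, fromAPI))  // true
	fmt.Println(shouldHandleResize(false, fromAPI)) // false
}
```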

bart0sh · 2023-03-05T10:00:34Z

/triage accepted
/priority important-soon

valaparthvi · 2023-03-06T05:38:21Z

/cc

vinaykul · 2023-03-06T09:52:30Z

/test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2

liggitt · 2023-03-07T03:25:24Z

pkg/kubelet/status/status_manager.go


+// GetContainerResourceAllocation returns the last checkpointed ResourcesAllocated values
+// If checkpoint manager has not been initialized, it returns nil, false
+func (m *manager) GetContainerResourceAllocation(podUID string, containerName string) (v1.ResourceList, bool) {


check nil and return not found before locking?

state will still be nil in all of these functions if the feature gate is disabled.

sweeping all the callers, it's pretty twisty to tell that all of the call sites are guarded by feature gate checks (these are sometimes called from functions which are only called under feature gate guard, or are called from declared closures which are then invoked inside a feature gate guard)

this PR is an improvement on current state of master, but this still seems like a sharp edge to improve before 1.27 cut
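
A rough sketch of the "check nil and return not found before locking" suggestion, written as a standalone program rather than against the real status manager. The `resourceList`, `allocationState`, and `memState` types are hypothetical stand-ins for this example, and it assumes `state` is set once at construction and never reassigned, so reading it without the lock is safe.

```go
package main

import (
	"fmt"
	"sync"
)

// resourceList is a hypothetical stand-in for v1.ResourceList.
type resourceList map[string]string

// allocationState is a hypothetical, minimal view of the checkpoint state.
type allocationState interface {
	GetContainerResourceAllocation(podUID, containerName string) (resourceList, bool)
}

// memState is a hypothetical in-memory state keyed by pod UID, then container name.
type memState map[string]map[string]resourceList

func (s memState) GetContainerResourceAllocation(podUID, containerName string) (resourceList, bool) {
	alloc, ok := s[podUID][containerName]
	return alloc, ok
}

type manager struct {
	mu    sync.RWMutex
	state allocationState // nil when the checkpoint manager was never initialized
}

// GetContainerResourceAllocation returns nil, false when state is nil,
// without taking the lock and without logging an error.
func (m *manager) GetContainerResourceAllocation(podUID, containerName string) (resourceList, bool) {
	if m.state == nil {
		return nil, false
	}
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.state.GetContainerResourceAllocation(podUID, containerName)
}

func main() {
	standalone := &manager{}
	_, found := standalone.GetContainerResourceAllocation("uid-1", "busybox")
	fmt.Println(found) // false

	withState := &manager{state: memState{"uid-1": {"busybox": {"cpu": "100m"}}}}
	alloc, found := withState.GetContainerResourceAllocation("uid-1", "busybox")
	fmt.Println(alloc["cpu"], found) // 100m true
}
```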

True. I can add feature gates check in these functions as a conservative check that makes it easier for future reviews vs. adding a nil check and forgetting to remove at GA. It is one extra compare and branch instruction we don't need on every pod sync (if we can count on pull & CI tests).

this PR is an improvement on current state of master, but this still seems like a sharp edge to improve before 1.27 cut

What other things do you want to see fixed? I initially wanted to move away from node local checkpoint in favor of relying on values persisted in PodStatus (if it is legit use of KEP 2527), but that wouldn't fly in standalone kubelet scenario. I feel kubelet could use a more generic KV abstraction that better decouples the checkpointing implementation details, but that is a bit larger scope work.. jmho.

probably not leaving those manager methods NPE landmines if the feature-gate is disabled? either doing nil checks in those functions, or assigning a dummy/no-op impl to state when the feature gate is off

but a follow-up is fine... I'm more concerned with getting alpha standalone kubelets green again at this point

no-op is a good idea. Since this fix is urgent and already LGTM'd, I'll create a separate follow-up PR or tack it on to an existing follow-up PR.

Does PR #116351 get it done?
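
For readers following the no-op idea from this thread, a minimal sketch of what "assigning a dummy/no-op impl to state when the feature gate is off" could look like. This is an illustration under assumptions, not the code from #116351: `allocationState`, `noopState`, `memState`, and `newAllocationState` are hypothetical names, and the feature gate is reduced to a plain bool.

```go
package main

import "fmt"

// resourceList is a hypothetical stand-in for v1.ResourceList.
type resourceList map[string]string

// allocationState is a hypothetical, minimal checkpoint state interface.
type allocationState interface {
	GetContainerResourceAllocation(podUID, containerName string) (resourceList, bool)
	SetContainerResourceAllocation(podUID, containerName string, alloc resourceList) error
}

// noopState never stores anything, so callers need no nil checks.
type noopState struct{}

func (noopState) GetContainerResourceAllocation(string, string) (resourceList, bool) {
	return nil, false
}

func (noopState) SetContainerResourceAllocation(string, string, resourceList) error {
	return nil
}

// memState is a hypothetical in-memory implementation used when the feature is on.
type memState struct {
	allocs map[string]resourceList
}

func key(podUID, containerName string) string { return podUID + "/" + containerName }

func (s *memState) GetContainerResourceAllocation(podUID, containerName string) (resourceList, bool) {
	alloc, ok := s.allocs[key(podUID, containerName)]
	return alloc, ok
}

func (s *memState) SetContainerResourceAllocation(podUID, containerName string, alloc resourceList) error {
	s.allocs[key(podUID, containerName)] = alloc
	return nil
}

// newAllocationState always returns a usable, non-nil state.
func newAllocationState(featureEnabled bool) allocationState {
	if !featureEnabled {
		return noopState{}
	}
	return &memState{allocs: map[string]resourceList{}}
}

func main() {
	for _, enabled := range []bool{false, true} {
		st := newAllocationState(enabled)
		_ = st.SetContainerResourceAllocation("uid-1", "busybox", resourceList{"cpu": "100m"})
		_, found := st.GetContainerResourceAllocation("uid-1", "busybox")
		fmt.Println(enabled, found)
	}
}
```

With this shape the manager methods can call through `state` unconditionally; the cost is that a disabled feature silently drops writes, which is the intended behavior here.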

pkg/kubelet/status/status_manager.go

pacoxu · 2023-03-07T07:01:52Z

/test pull-kubernetes-e2e-gce-cos-alpha-features

vinaykul · 2023-03-07T14:56:29Z

/test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2

liggitt · 2023-03-07T15:07:58Z

/lgtm
/approve

k8s-ci-robot · 2023-03-07T15:08:06Z

LGTM label has been added.

Git tree hash: f1154edb10756ad19d1cfe4a9bf49248c16118fa

liggitt · 2023-03-07T15:08:21Z

/hold for testing if desired

k8s-ci-robot · 2023-03-07T15:08:23Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, vinaykul

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/OWNERS~~ [liggitt]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vinaykul · 2023-03-07T17:43:57Z

The containerd/main e2e test I care about has passed. https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/116271/pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2/1633119440885256192

liggitt · 2023-03-07T17:44:44Z

/hold cancel

Fix nil pointer access panic in kubelet from uninitialized pod alloca…

12435b2

…tion checkpoint manager in standalone kubelet scenario

k8s-ci-robot requested review from bart0sh and yujuhong March 4, 2023 08:17

k8s-ci-robot assigned derekwaynecarr, liggitt, Random-Liu and thockin Mar 4, 2023

vinaykul mentioned this pull request Mar 4, 2023

In-place Pod Vertical Scaling feature #102884

Merged

liggitt reviewed Mar 4, 2023

View reviewed changes

k8s-ci-robot requested a review from valaparthvi March 6, 2023 05:38

k8s-ci-robot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) and removed the size/L (denotes a PR that changes 100-499 lines, ignoring generated files) and size/M labels Mar 6, 2023

k8s-ci-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Mar 6, 2023

Add Get interfaces for container's checkpointed ResourcesAllocated an…

b0dce92

…d Resize values, remove error logging for valid standalone kubelet scenario

vinaykul force-pushed the restart-free-pod-vertical-scaling-kubelet-panic-fix branch from 531263c to b0dce92 Compare March 6, 2023 09:51

liggitt reviewed Mar 7, 2023

View reviewed changes

panic on pod resources alloc checkpoint failure

98e8f42

k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) Mar 7, 2023

k8s-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Mar 7, 2023

k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Mar 7, 2023

liggitt added this to the v1.27 milestone Mar 7, 2023

k8s-ci-robot merged commit 6bce018 into kubernetes:master Mar 7, 2023

vinaykul mentioned this pull request Mar 8, 2023

Initialize pod resource allocation checkpoint manager to noop #116351

Merged

pacoxu mentioned this pull request Sep 5, 2023

In-Place Update of Pod Resources kubernetes/enhancements#1287

Open

95 tasks

bobbypage commented Mar 6, 2023 (edited)
