
pod_status_manager_state: checkpoint is corrupted #117589

@aheng-ch

Description


What happened?

After enabling In-Place Pod Vertical Scaling, if a pod is deployed without setting a memory (or CPU) request, the kubelet fails on its second restart.

What did you expect to happen?

The kubelet restarts successfully.

How can we reproduce it (as minimally and precisely as possible)?

  1. start kubelet with In-Place Pod Vertical Scaling enabled
  2. deploy a pod without setting a memory request, e.g.:

         apiVersion: v1
         kind: Pod
         metadata:
           name: test-pod
         spec:
           containers:
           - image: nginx:1.24.0
             imagePullPolicy: IfNotPresent
             name: nginx
             resources:
               requests:
                 cpu: 100m

  3. restart kubelet for the first time
  4. restart kubelet for the second time
  5. kubelet fails with the error: panic: could not restore state from checkpoint: checkpoint is corrupted, please drain this node and delete pod allocation checkpoint file "/var/lib/kubelet/pod_status_manager_state" before restarting Kubelet

Anything else we need to know?

After deploying the pod, the following information is saved in the file pod_status_manager_state:

      "nginx": {
        "cpu": "100m"
      }

Then, after the first kubelet restart, the relevant content in the file becomes:

      "nginx": {
        "cpu": "100m",
        "memory": "0"
      }

When the pod is deployed, there is no record for it in the checkpoint (pod_status_manager_state), so the kubelet takes the resource config directly from pod.Spec.Containers[i].Resources and saves it to the checkpoint (without a memory request):

    if utilfeature.DefaultFeatureGate.Enabled(features.InPlacePodVerticalScaling) {
        // To handle kubelet restarts, test pod admissibility using AllocatedResources values
        // (for cpu & memory) from checkpoint store. If found, that is the source of truth.
        podCopy := pod.DeepCopy()
        for _, c := range podCopy.Spec.Containers {
            allocatedResources, found := kl.statusManager.GetContainerResourceAllocation(string(pod.UID), c.Name)
            if c.Resources.Requests != nil && found {
                c.Resources.Requests[v1.ResourceCPU] = allocatedResources[v1.ResourceCPU]
                c.Resources.Requests[v1.ResourceMemory] = allocatedResources[v1.ResourceMemory]
            }
        }
        // Check if we can admit the pod; if not, reject it.
        if ok, reason, message := kl.canAdmitPod(activePods, podCopy); !ok {
            kl.rejectPod(pod, reason, message)
            continue
        }
        // For new pod, checkpoint the resource values at which the Pod has been admitted
        if err := kl.statusManager.SetPodAllocation(podCopy); err != nil {
            // ...

When the kubelet restarts for the first time, it restores the previously saved data from the checkpoint. Since the memory request was never set, it is filled in with an empty (zero) value and re-saved to the checkpoint:
    for _, c := range podCopy.Spec.Containers {
        allocatedResources, found := kl.statusManager.GetContainerResourceAllocation(string(pod.UID), c.Name)
        if c.Resources.Requests != nil && found {
            c.Resources.Requests[v1.ResourceCPU] = allocatedResources[v1.ResourceCPU]
            c.Resources.Requests[v1.ResourceMemory] = allocatedResources[v1.ResourceMemory]
        }
    }

When the kubelet restarts for the second time, it parses memory: "0" as an explicit zero value (not an empty one), so the checksum calculated during data recovery no longer matches the previously stored one, and the checkpoint is reported as corrupted.

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Labels

kind/bug: Categorizes issue or PR as related to a bug.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.


Status

Done
