DRA: kubelet: checkpoint checksum changes when API gets extended #123552

@pohly

Description

What happened?

When the StructuredData field was added to resourcev1alpha2.ResourceHandle, existing checkpoint files became invalid: the checksum, which is calculated by hashing the spew output in pkg/util/hash, no longer matches the value stored in the file. This caused the kubelet to fail to start with:

E0228 10:46:47.293483  363727 run.go:74] "command failed" err="failed to run Kubelet: failed to create claimInfo cache: error calling GetOrCreate() on checkpoint state: failed to get checkpoint dra_manager_state: checkpoint is corrupted"

Here is the string that the checksum gets calculated over in a unit test:

"(*state.DRAManagerCheckpoint){Version:(string)v1 Entries:(state.ClaimInfoStateList)[{DriverName:(string)[test-driver.cdi.k8s.io](http://test-driver.cdi.k8s.io/) ClassName:(string)class-name ClaimUID:(types.UID)067798be-454e-4be4-9047-1aa06aea63f7 ClaimName:(string)example Namespace:(string)default PodUIDs:(sets.Set[string])map[139cdb46-f989-4f17-9561-ca10cfb509a6:{}] ResourceHandles:([]v1alpha2.ResourceHandle)[{DriverName:(string)[test-driver.cdi.k8s.io](http://test-driver.cdi.k8s.io/) Data:(string){\"a\": \"b\"} StructuredData:(*v1alpha2.StructuredResourceHandle)<nil>}] CDIDevices:(map[string][]string)map[[test-driver.cdi.k8s.io](http://test-driver.cdi.k8s.io/):[[example.com/example=cdi-example]]}]](http://example.com/example=cdi-example]]%7D]) Checksum:(checksum.Checksum)0}"

Note the new StructuredData:(*v1alpha2.StructuredResourceHandle)<nil>.
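
To make the mechanism concrete, here is a hedged, self-contained sketch (not the actual kubelet code; the handleV1/handleV2 types are made up for illustration) of how a checksum computed over a spew dump changes as soon as a struct gains a field, even one that is nil:

```go
// Minimal sketch of why the checksum breaks: the checksum is a hash over a
// spew dump of the whole struct, so adding a field changes the dump even
// when the new field is nil. Illustrative types only, not the real API.
package main

import (
	"fmt"
	"hash/fnv"

	"github.com/davecgh/go-spew/spew"
)

// handleV1 mimics ResourceHandle before the API change.
type handleV1 struct {
	DriverName string
	Data       string
}

// handleV2 mimics ResourceHandle after StructuredData was added.
type handleV2 struct {
	DriverName     string
	Data           string
	StructuredData *struct{} // stand-in for *StructuredResourceHandle
}

// checksum hashes the spew dump of an object, similar in spirit to
// pkg/util/hash plus pkg/kubelet/checkpointmanager/checksum.
func checksum(obj interface{}) uint32 {
	h := fnv.New32a()
	spew.Fprintf(h, "%#v", obj)
	return h.Sum32()
}

func main() {
	before := handleV1{DriverName: "test-driver.cdi.k8s.io", Data: `{"a": "b"}`}
	after := handleV2{DriverName: "test-driver.cdi.k8s.io", Data: `{"a": "b"}`}

	// The new field shows up in the dump even though it is nil, so the two
	// checksums differ and an old checkpoint no longer validates.
	fmt.Println(checksum(before), checksum(after))
}
```

Any consumer that stores such a checksum next to the data, as the DRA checkpoint does, will therefore reject old files after the API type is extended.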

What did you expect to happen?

I'm not sure what the right behavior is. Perhaps the kubelet should have ignored the invalid checkpoint instead of failing to start?

How can we reproduce it (as minimally and precisely as possible)?

Revert the pkg/kubelet/cm/dra/state/state_checkpoint_test.go update in #123516.

Anything else we need to know?

The state.DRAManagerCheckpoint contains a version field, which is set to v1. I think that's where things fall apart: that version must exactly match the type that gets serialized, which is not the case here. The type changes over time, but the version stays fixed.
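
For illustration, the checkpoint layout implied by the spew dump above is roughly the following (field names taken from that dump, not copied from the source tree; see pkg/kubelet/cm/dra/state for the real definitions):

```go
// Approximate shape of the DRA checkpoint as inferred from the dump above.
type DRAManagerCheckpoint struct {
	Version  string             // fixed at "v1"
	Entries  ClaimInfoStateList // embeds API types such as v1alpha2.ResourceHandle
	Checksum checksum.Checksum  // hash over the spew dump of this struct
}
```

So extending the embedded API types silently changes what the checksum covers, while Version still claims "v1".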

Let's reconsider what should get checkpointed and how we can make that stable. Also, is checkpointing really worth it? What bad things happen if we remove it?

Kubernetes version

master

Cloud provider

n/a

OS version

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

/sig node
/cc @bart0sh @klueska

Metadata

Labels

kind/feature: Categorizes issue or PR as related to a new feature.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
