What happened?
When adding the StructuredData field to resourcev1alpha2.ResourceHandle, existing checkpoint files became invalid: the checksum, which pkg/util/hash calculates by hashing the spew output of the object, no longer matches the stored value. This caused kubelet to fail to start with:
E0228 10:46:47.293483 363727 run.go:74] "command failed" err="failed to run Kubelet: failed to create claimInfo cache: error calling GetOrCreate() on checkpoint state: failed to get checkpoint dra_manager_state: checkpoint is corrupted"
Here's the string that the checksum gets calculated over in one of the unit tests:
"(*state.DRAManagerCheckpoint){Version:(string)v1 Entries:(state.ClaimInfoStateList)[{DriverName:(string)[test-driver.cdi.k8s.io](http://test-driver.cdi.k8s.io/) ClassName:(string)class-name ClaimUID:(types.UID)067798be-454e-4be4-9047-1aa06aea63f7 ClaimName:(string)example Namespace:(string)default PodUIDs:(sets.Set[string])map[139cdb46-f989-4f17-9561-ca10cfb509a6:{}] ResourceHandles:([]v1alpha2.ResourceHandle)[{DriverName:(string)[test-driver.cdi.k8s.io](http://test-driver.cdi.k8s.io/) Data:(string){\"a\": \"b\"} StructuredData:(*v1alpha2.StructuredResourceHandle)<nil>}] CDIDevices:(map[string][]string)map[[test-driver.cdi.k8s.io](http://test-driver.cdi.k8s.io/):[[example.com/example=cdi-example]]}]](http://example.com/example=cdi-example]]%7D]) Checksum:(checksum.Checksum)0}"
Note the new StructuredData:(*v1alpha2.StructuredResourceHandle)<nil>.
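The mismatch can be reproduced in isolation. Below is a minimal, self-contained sketch (not the actual kubelet code) that mimics how pkg/util/hash derives the checksum: it feeds a deterministic spew dump of the object into an FNV-32a hash. handleV0 and handleV1 are hypothetical stand-ins for the old and new ResourceHandle; adding the nil StructuredData pointer changes the dump, and therefore the checksum, even though no data changed.

```go
package main

import (
	"fmt"
	"hash/fnv"

	"github.com/davecgh/go-spew/spew"
)

// handleV0 stands in for ResourceHandle before StructuredData was added.
type handleV0 struct {
	DriverName string
	Data       string
}

// handleV1 adds a pointer field that stays nil, mirroring the new
// StructuredData *v1alpha2.StructuredResourceHandle field.
type handleV1 struct {
	DriverName     string
	Data           string
	StructuredData *struct{}
}

// checksum mimics pkg/util/hash.DeepHashObject: a deterministic spew
// dump of the object is written into an FNV-32a hasher.
func checksum(obj interface{}) uint32 {
	hasher := fnv.New32a()
	printer := spew.ConfigState{
		Indent:         " ",
		SortKeys:       true,
		DisableMethods: true,
		SpewKeys:       true,
	}
	printer.Fprintf(hasher, "%#v", obj)
	return hasher.Sum32()
}

func main() {
	before := handleV0{DriverName: "test-driver.cdi.k8s.io", Data: `{"a": "b"}`}
	after := handleV1{DriverName: "test-driver.cdi.k8s.io", Data: `{"a": "b"}`}
	// The two dumps differ because the nil pointer field is still
	// printed, so the checksums disagree and verification fails.
	fmt.Println(checksum(before), checksum(after))
}
```

In other words, any field addition, removal, or rename in the checkpointed types invalidates every checkpoint on disk, even when the semantic content is unchanged.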
What did you expect to happen?
Not sure what the right behavior is. Perhaps kubelet should have ignored the invalid checkpoint?
How can we reproduce it (as minimally and precisely as possible)?
Revert the pkg/kubelet/cm/dra/state/state_checkpoint_test.go update in #123516.
Anything else we need to know?
The state.DRAManagerCheckpoint contains a version field, which is set to v1. I think that's where things fall apart: that version must exactly match the type that gets serialized, which is not the case here. The type changes over time, but the version stays fixed.
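For reference, the shape of the checkpoint can be reconstructed from the spew dump above (a simplified sketch, not the verbatim source):

```go
// Approximate shape, reconstructed from the spew dump above.
type DRAManagerCheckpoint struct {
	Version  string             // fixed at "v1", never bumped alongside the types
	Entries  ClaimInfoStateList // evolving Go types; ResourceHandle just gained StructuredData
	Checksum checksum.Checksum  // FNV hash over the spew dump of the fields above
}
```

The stored Checksum was computed against the old type layout, but verification re-hashes with the new layout, so the comparison fails and kubelet reports the checkpoint as corrupted.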
Let's reconsider what should get checkpointed and how we can make that stable. Also, is checkpointing really worth it? What bad things happen if we remove it?
Kubernetes version
master
Cloud provider
n/a
OS version
No response
Install tools
No response
Container runtime (CRI) and version (if applicable)
No response
Related plugins (CNI, CSI, ...) and versions (if applicable)