DRA: kubelet: checkpoint checksum changes when API gets extended #123552

@pohly

Description

What happened?

When the StructuredData field was added to resourcev1alpha2.ResourceHandle, existing checkpoint files became invalid: the checksum, which is calculated by hashing the spew output in pkg/util/hash, no longer matches the value stored in the file. This caused the kubelet to fail to start with:

E0228 10:46:47.293483  363727 run.go:74] "command failed" err="failed to run Kubelet: failed to create claimInfo cache: error calling GetOrCreate() on checkpoint state: failed to get checkpoint dra_manager_state: checkpoint is corrupted"

Here is the string that the checksum gets calculated over in a unit test:

"(*state.DRAManagerCheckpoint){Version:(string)v1 Entries:(state.ClaimInfoStateList)[{DriverName:(string)[test-driver.cdi.k8s.io](http://test-driver.cdi.k8s.io/) ClassName:(string)class-name ClaimUID:(types.UID)067798be-454e-4be4-9047-1aa06aea63f7 ClaimName:(string)example Namespace:(string)default PodUIDs:(sets.Set[string])map[139cdb46-f989-4f17-9561-ca10cfb509a6:{}] ResourceHandles:([]v1alpha2.ResourceHandle)[{DriverName:(string)[test-driver.cdi.k8s.io](http://test-driver.cdi.k8s.io/) Data:(string){\"a\": \"b\"} StructuredData:(*v1alpha2.StructuredResourceHandle)<nil>}] CDIDevices:(map[string][]string)map[[test-driver.cdi.k8s.io](http://test-driver.cdi.k8s.io/):[[example.com/example=cdi-example]]}]](http://example.com/example=cdi-example]]%7D]) Checksum:(checksum.Checksum)0}"

Note the new StructuredData:(*v1alpha2.StructuredResourceHandle)<nil>.
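
To make the mechanism concrete, here is a hedged, self-contained sketch (not the actual kubelet code; the handleV1/handleV2 types are made up for illustration) of how a checksum computed over a spew dump changes as soon as a struct gains a field, even one that is nil:

```go
// Minimal sketch of why the checksum breaks: the checksum is a hash over a
// spew dump of the whole struct, so adding a field changes the dump even
// when the new field is nil. Illustrative types only, not the real API.
package main

import (
	"fmt"
	"hash/fnv"

	"github.com/davecgh/go-spew/spew"
)

// handleV1 mimics ResourceHandle before the API change.
type handleV1 struct {
	DriverName string
	Data       string
}

// handleV2 mimics ResourceHandle after StructuredData was added.
type handleV2 struct {
	DriverName     string
	Data           string
	StructuredData *struct{} // stand-in for *StructuredResourceHandle
}

// checksum hashes the spew dump of an object, similar in spirit to
// pkg/util/hash plus pkg/kubelet/checkpointmanager/checksum.
func checksum(obj interface{}) uint32 {
	h := fnv.New32a()
	spew.Fprintf(h, "%#v", obj)
	return h.Sum32()
}

func main() {
	before := handleV1{DriverName: "test-driver.cdi.k8s.io", Data: `{"a": "b"}`}
	after := handleV2{DriverName: "test-driver.cdi.k8s.io", Data: `{"a": "b"}`}

	// The new field shows up in the dump even though it is nil, so the two
	// checksums differ and an old checkpoint no longer validates.
	fmt.Println(checksum(before), checksum(after))
}
```

Any consumer that stores such a checksum next to the data, as the DRA checkpoint does, will therefore reject old files after the API type is extended.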

What did you expect to happen?

I'm not sure what the right behavior is. Perhaps the kubelet should have ignored the invalid checkpoint instead of failing to start?

How can we reproduce it (as minimally and precisely as possible)?

Revert the pkg/kubelet/cm/dra/state/state_checkpoint_test.go update in #123516.

Anything else we need to know?

The state.DRAManagerCheckpoint contains a version field, which is set to v1. I think that's where things fall apart: that version must exactly match the type that gets serialized, which is not the case here. The type changes over time, but the version stays fixed.
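
For illustration, the checkpoint layout implied by the spew dump above is roughly the following (field names taken from that dump, not copied from the source tree; see pkg/kubelet/cm/dra/state for the real definitions):

```go
// Approximate shape of the DRA checkpoint as inferred from the dump above.
type DRAManagerCheckpoint struct {
	Version  string             // fixed at "v1"
	Entries  ClaimInfoStateList // embeds API types such as v1alpha2.ResourceHandle
	Checksum checksum.Checksum  // hash over the spew dump of this struct
}
```

So extending the embedded API types silently changes what the checksum covers, while Version still claims "v1".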

Let's reconsider what should get checkpointed and how we can make that stable. Also, is checkpointing really worth it? What bad things happen if we remove it?

Kubernetes version

master

Cloud provider

n/a

OS version

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

/sig node
/cc @bart0sh @klueska

Metadata

Labels

kind/feature: Categorizes issue or PR as related to a new feature.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
