-
Notifications
You must be signed in to change notification settings - Fork 41.6k
[InPlacePodVerticalScaling] fix restore checkpoint bug: failed to verify pod status checkpoint checksum because of different behaviors of func Quantity.Marshal and Quantity.Unmarshal #126620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Please note that we're already in Test Freeze for the Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Sat Aug 10 11:05:50 UTC 2024. |
|
Welcome @yunwang0911! |
|
Hi @yunwang0911. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@yunwang0911 This reminds me #123552. To properly fix this, checksum should be calculated after marshalling, as we did in #126303. |
9bc1d68 to
9058cb9
Compare
@bart0sh Thank you for your suggestion. I have updated the code. Please help to review again |
|
@vinaykul @iholder101 @tallclair could you please take a look? |
…ead from file pod_status are different
| // Checkpoint represents a structure to store pod resource allocation checkpoint data | ||
| type Checkpoint struct { | ||
| // Data is a JSON serialized checkpoint data | ||
| Data string `json:"data"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason to use the type string rather than []byte?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No particular reason, just following existing code in dra checkpoint link. How do you think? Which is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[]byte seems more idiomatic, and makes the intent clearer. No strong preference though.
That said, a lot of this checkpoint code is identical to the DRA checkpoint, including this Checkpoint type. Should we just move all of this to the pkg/kubelet/checkpointmanager package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, a lot of this checkpoint code is identical to the DRA checkpoint, including this Checkpoint type. Should we just move all of this to the pkg/kubelet/checkpointmanager package?
Could we separate this into a different task? Since there are many checkpoints, cpu, memory, dra, device and status, refactoring the code will impact not only InPlacePodVerticalScaling feature gate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think? @tallclair
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, sounds good. Can you file an issue and/or leave a TODO?
Co-authored-by: Tim Allclair <timallclair@gmail.com>
Co-authored-by: Tim Allclair <timallclair@gmail.com>
Co-authored-by: Tim Allclair <timallclair@gmail.com>
Co-authored-by: Tim Allclair <timallclair@gmail.com>
…ead from file pod_status are different
| // Checkpoint represents a structure to store pod resource allocation checkpoint data | ||
| type Checkpoint struct { | ||
| // Data is a JSON serialized checkpoint data | ||
| Data string `json:"data"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[]byte seems more idiomatic, and makes the intent clearer. No strong preference though.
That said, a lot of this checkpoint code is identical to the DRA checkpoint, including this Checkpoint type. Should we just move all of this to the pkg/kubelet/checkpointmanager package?
Co-authored-by: Tim Allclair <timallclair@gmail.com>
Co-authored-by: Tim Allclair <timallclair@gmail.com>
|
/lgtm |
|
LGTM label has been added. Git tree hash: 4c3f8b4864ddf73a20eb80f7fb2cf261597e2b9f
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tallclair, yunwang0911 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fix one bug of the feature
pod in-place resource resize, feature gate: InPlacePodVerticalScalingWhich issue(s) this PR fixes:
Fixes #128375
Special notes for your reviewer:
One line bug description: the content func stateCheckpoint.storeState writes into the file
/var/lib/kubelet/pod_status_manager_stateis different from it func stateCheckpoint.restoreState reads from the same file, which causes function VerifyChecksum always fails for some format of Quantity value. It should only happen for static pods in static dir.**How to reproduce? **
{ "apiVersion": "v1", "kind": "Pod", "metadata": { "name": "nginx2" }, "spec": { "containers": [ { "name": "nginx", "image": "nginx:1.14.2", "ports": [ { "containerPort": 80 } ], "resources": { "requests": { "cpu": "0.4", "memory": "1Gi" }, "limits": { "cpu": "1.5", "memory": "1Gi" } } } ] } }/etc/kubernetes/manifest/in master nodesystemctl restart kubelet/var/lib/kubelet/pod_status_manager_stateis deletedRoot cause
The json marshal function of Quantity type will convert itself to
CanonicalByteshttps://github.com/kubernetes/apimachinery/blob/95b78024e3feada7739b40426690b4f287933fd8/pkg/api/resource/quantity.go#L452C25-L452C41
But the json unmarshal function of Quantity type won't do that.
https://github.com/kubernetes/apimachinery/blob/95b78024e3feada7739b40426690b4f287933fd8/pkg/api/resource/quantity.go#L701
For some formats for Quantity value, the Quantity contents, wrote to and read from the same file, might be different.
For example, if the pod cpu request is defined as
cpu: "0.4", the Quantity content iscpu:{{4 -1} {<nil>} DecimalSIbefore writing it into the file, but the content iscpu:{{400 -3} {<nil>} 400m DecimalSIafter reading from the same file.According to the reason above, the function VerifyChecksum might fail.
Detailed test case
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
None