[FG:InPlacePodVerticalScaling] kubelet: Propagate error in doPodResizeAction() to the caller #127300
Conversation
Hi @hshiina. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.

I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
```diff
 // MilliCPUToShares converts the milliCPU to CFS shares.
-func MilliCPUToShares(milliCPU int64) int64 {
+func MilliCPUToShares(milliCPU int64) uint64 {
```
This is being fixed separately in #127293. After it is merged, I will rebase this PR.
/cc @esotsal
/ok-to-test
/cc @vinaykul
/triage accepted
I'm not sure about mockery. The use of mocks is sparse and I didn't find precise guidelines (but I haven't ever looked too hard either). That's pretty minor though.
/test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2
No, this PR does not fix #123441. I mentioned this issue because I guessed it was caused by an error in `doPodResizeAction()`.
Ack. Let's use this opportunity to make the error propagation more precise and consistent with Create|Kill-container/Create|Kill-podsandbox. Please add two SyncActions for UpdateContainer & UpdatePodSandbox, define two new errors associated with them, and follow the killPodWithSyncResult pattern for doPodResizeAction. Thanks.
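To make the suggested pattern concrete, here is a minimal, self-contained sketch. It uses trimmed-down stand-ins for the kubelet's SyncResult/PodSyncResult types (the real ones live in pkg/kubelet/container/sync_result.go), and updatePodSandboxResources is a hypothetical helper, not a real kubelet function:

```go
package main

import (
	"errors"
	"fmt"
)

// Trimmed-down stand-ins for the kubelet's SyncResult/PodSyncResult types.
type SyncResult struct {
	Action  string
	Target  string
	Error   error
	Message string
}

type PodSyncResult struct {
	SyncResults []*SyncResult
	SyncError   error
}

func (p *PodSyncResult) AddSyncResult(r *SyncResult) {
	p.SyncResults = append(p.SyncResults, r)
}

// A named error per action, mirroring ErrKillPodSandbox and friends.
var ErrUpdatePodSandbox = errors.New("UpdatePodSandboxError")

// doPodResizeAction records one SyncResult per action it performs and marks
// it failed on error, instead of only returning a bare error.
func doPodResizeAction(podUID string, result *PodSyncResult) {
	update := &SyncResult{Action: "UpdatePodSandbox", Target: podUID}
	result.AddSyncResult(update)
	if err := updatePodSandboxResources(podUID); err != nil {
		update.Error = ErrUpdatePodSandbox
		update.Message = err.Error()
	}
}

// Hypothetical helper standing in for the actual cgroup/runtime calls;
// it always fails here for demonstration purposes.
func updatePodSandboxResources(podUID string) error {
	return fmt.Errorf("runtime rejected resize for %s", podUID)
}

func main() {
	var result PodSyncResult
	doPodResizeAction("pod-uid-1", &result)
	for _, r := range result.SyncResults {
		fmt.Printf("%+v\n", r)
	}
}
```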
It also looks like the pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2 job is failing a lot: https://prow.k8s.io/?job=pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2 .. mostly the scheduler-focused test, but there is some other flakiness. It's not related to this change, but it needs to be fixed.
@hshiina It seems like this may be a more complicated change. Would you prefer to work on this and have the event emission as a follow-up, or should they be in the same PR?
#127479 is adopting this approach. It might be better to continue this work on #127479. |
```go
}
if len(podContainerChanges.ContainersToUpdate[v1.ResourceMemory]) > 0 || podContainerChanges.UpdatePodResources {
	if podResources.Memory == nil {
		klog.ErrorS(nil, "podResources.Memory is nil", "pod", pod.Name)
```
delete this
```go
//TODO(vinaykul,InPlacePodVerticalScaling): Figure out best way to get enforceMemoryQoS value (parameter #4 below) in platform-agnostic way
podResources := cm.ResourceConfigForPod(pod, m.cpuCFSQuota, uint64((m.cpuCFSQuotaPeriod.Duration)/time.Microsecond), false)
if podResources == nil {
	klog.ErrorS(nil, "Unable to get resource configuration", "pod", pod.Name)
```
delete this
```diff
-klog.ErrorS(nil, "Unable to get resource configuration", "pod", pod.Name)
-result.Fail(fmt.Errorf("Unable to get resource configuration processing resize for pod %s", pod.Name))
-return
+return fmt.Errorf("unable to get resource configuration processing resize for pod %s", pod.Name)
```
| return fmt.Errorf("unable to get resource configuration processing resize for pod %s", pod.Name) | |
| return fmt.Errorf("unable to get resource configuration processing resize for pod %s", format.Pod(pod)) |
```diff
-klog.ErrorS(nil, "podResources.Memory is nil", "pod", pod.Name)
-result.Fail(fmt.Errorf("podResources.Memory is nil for pod %s", pod.Name))
-return
+return fmt.Errorf("podResources.Memory is nil for pod %s", pod.Name)
```
| return fmt.Errorf("podResources.Memory is nil for pod %s", pod.Name) | |
| return fmt.Errorf("podResources.Memory is nil for pod %s", format.Pod(pod)) |
```go
}
currentPodMemoryConfig, err := pcm.GetPodCgroupConfig(pod, v1.ResourceMemory)
if err != nil {
	klog.ErrorS(err, "GetPodCgroupConfig for memory failed", "pod", pod.Name)
```
Add this context to the returned error instead
```go
}
currentPodCpuConfig, err := pcm.GetPodCgroupConfig(pod, v1.ResourceCPU)
if err != nil {
	klog.ErrorS(err, "GetPodCgroupConfig for CPU failed", "pod", pod.Name)
```
delete log & add context to return
```diff
 int64(*currentPodCpuConfig.CPUShares), int64(*podResources.CPUShares)); errResize != nil {
-	result.Fail(errResize)
-	return
+	return errResize
```
Add context to this error.
```diff
 if len(podContainerChanges.ContainersToUpdate) > 0 || podContainerChanges.UpdatePodResources {
-	m.doPodResizeAction(pod, podStatus, podContainerChanges, result)
+	if err := m.doPodResizeAction(pod, podStatus, podContainerChanges); err != nil {
+		result.Fail(err)
```
Log the error here too
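For instance, a sketch of what is being asked for (the exact message and key/value pairs are a judgment call):

```go
if err := m.doPodResizeAction(pod, podStatus, podContainerChanges); err != nil {
	klog.ErrorS(err, "doPodResizeAction failed", "pod", klog.KObj(pod))
	result.Fail(err)
}
```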
Suggested change:
```diff
-// This test focuses on verifying `doPodResizeAction()` propagates an error correctly with a few error cases for PR #127300.
+// This test focuses on verifying `doPodResizeAction()` propagates an error correctly with a few error cases.
```
No need to add the PR context, that's what git blame is for
```go
	Limits: v1.ResourceList{v1.ResourceCPU: cpu200m, v1.ResourceMemory: mem200M},
},
qosClass:      v1.PodQOSGuaranteed,
expectedError: "not implemented",
```
Where is this error coming from? The fake runtime? Add a comment to explain.
@hshiina Hope you don't mind, but as I started thinking holistically about this in the context of PRs #127479 and #127184, I wrote some code to flesh out what I have in mind to model it consistently with killPodWithSyncResult. Please feel free to pick up the patch below and run with it. The change includes some of the changes @tallclair requested and seems to work, but it needs more testing and unit tests.

```diff
diff --git a/pkg/kubelet/container/sync_result.go b/pkg/kubelet/container/sync_result.go
index 39ab04a3cac..f243f389c7d 100644
--- a/pkg/kubelet/container/sync_result.go
+++ b/pkg/kubelet/container/sync_result.go
@@ -45,6 +45,12 @@ var (
ErrConfigPodSandbox = errors.New("ConfigPodSandboxError")
// ErrKillPodSandbox returned when runtime failed to stop pod's sandbox.
ErrKillPodSandbox = errors.New("KillPodSandboxError")
+ // ErrUpdatePodSandbox returned when runtime failed to update the pod's sandbox config.
+ ErrUpdatePodSandbox = errors.New("UpdatePodSandboxError")
+ // ErrUpdateContainerMemory returned when runtime failed to update the pod's container config.
+ ErrUpdateContainerMemory = errors.New("UpdateContainerMemoryError")
+ // ErrUpdateContainerCPU returned when runtime failed to update the pod's container config.
+ ErrUpdateContainerCPU = errors.New("UpdateContainerCPUError")
)
// SyncAction indicates different kind of actions in SyncPod() and KillPod(). Now there are only actions
@@ -68,6 +74,12 @@ const (
ConfigPodSandbox SyncAction = "ConfigPodSandbox"
// KillPodSandbox action
KillPodSandbox SyncAction = "KillPodSandbox"
+ // UpdatePodSandbox action
+ UpdatePodSandbox SyncAction = "UpdatePodSandbox"
+ // UpdateContainerMemory action
+ UpdateContainerMemory SyncAction = "UpdateContainerMemory"
+ // UpdateContainerCPU action
+ UpdateContainerCPU SyncAction = "UpdateContainerCPU"
)
// SyncResult is the result of sync action.
diff --git a/pkg/kubelet/kuberuntime/kuberuntime_container.go b/pkg/kubelet/kuberuntime/kuberuntime_container.go
index 2a1ef3d15cd..80d12aefb4c 100644
--- a/pkg/kubelet/kuberuntime/kuberuntime_container.go
+++ b/pkg/kubelet/kuberuntime/kuberuntime_container.go
@@ -391,12 +391,11 @@ func (m *kubeGenericRuntimeManager) generateContainerConfig(ctx context.Context,
return config, cleanupAction, nil
}
-func (m *kubeGenericRuntimeManager) updateContainerResources(pod *v1.Pod, container *v1.Container, containerID kubecontainer.ContainerID) error {
+func (m *kubeGenericRuntimeManager) updateContainerResources(ctx context.Context, pod *v1.Pod, container *v1.Container, containerID kubecontainer.ContainerID) error {
containerResources := m.generateContainerResources(pod, container)
if containerResources == nil {
return fmt.Errorf("container %q updateContainerResources failed: cannot generate resources config", containerID.String())
}
- ctx := context.Background()
err := m.runtimeService.UpdateContainerResources(ctx, containerID.ID, containerResources)
if err != nil {
klog.ErrorS(err, "UpdateContainerResources failed", "container", containerID.String())
diff --git a/pkg/kubelet/kuberuntime/kuberuntime_manager.go b/pkg/kubelet/kuberuntime/kuberuntime_manager.go
index 0c7a9b24c63..3936357a489 100644
--- a/pkg/kubelet/kuberuntime/kuberuntime_manager.go
+++ b/pkg/kubelet/kuberuntime/kuberuntime_manager.go
@@ -650,7 +650,9 @@ func (m *kubeGenericRuntimeManager) computePodResizeAction(pod *v1.Pod, containe
return true
}
-func (m *kubeGenericRuntimeManager) doPodResizeAction(pod *v1.Pod, podStatus *kubecontainer.PodStatus, podContainerChanges podActions, result kubecontainer.PodSyncResult) {
+func (m *kubeGenericRuntimeManager) doPodResizeAction(ctx context.Context, pod *v1.Pod, podStatus *kubecontainer.PodStatus, podContainerChanges podActions) (result kubecontainer.PodSyncResult) {
+ updatePodResult := kubecontainer.NewSyncResult(kubecontainer.UpdatePodSandbox, pod.UID)
+ result.AddSyncResult(updatePodResult)
pcm := m.containerManager.NewPodContainerManager()
//TODO(vinaykul,InPlacePodVerticalScaling): Figure out best way to get enforceMemoryQoS value (parameter #4 below) in platform-agnostic way
podResources := cm.ResourceConfigForPod(pod, m.cpuCFSQuota, uint64((m.cpuCFSQuotaPeriod.Duration)/time.Microsecond), false)
@@ -696,9 +698,13 @@ func (m *kubeGenericRuntimeManager) doPodResizeAction(pod *v1.Pod, podStatus *ku
}
}
if len(podContainerChanges.ContainersToUpdate[rName]) > 0 {
- if err = m.updatePodContainerResources(pod, rName, podContainerChanges.ContainersToUpdate[rName]); err != nil {
- klog.ErrorS(err, "updatePodContainerResources failed", "pod", format.Pod(pod), "resource", rName)
- return err
+ updateContainerResults, errUpdate := m.updatePodContainerResources(ctx, pod, rName, podContainerChanges.ContainersToUpdate[rName])
+ for _, containerResult := range updateContainerResults {
+ result.AddSyncResult(containerResult)
+ }
+ if errUpdate != nil {
+ klog.ErrorS(errUpdate, "updatePodContainerResources failed", "pod", format.Pod(pod), "resource", rName)
+ return errUpdate
}
}
// At downsizing, requests should shrink prior to limits in order to keep "requests <= limits".
@@ -760,17 +766,20 @@ func (m *kubeGenericRuntimeManager) doPodResizeAction(pod *v1.Pod, podStatus *ku
return
}
}
+ return
}
-func (m *kubeGenericRuntimeManager) updatePodContainerResources(pod *v1.Pod, resourceName v1.ResourceName, containersToUpdate []containerToUpdateInfo) error {
+func (m *kubeGenericRuntimeManager) updatePodContainerResources(ctx context.Context, pod *v1.Pod, resourceName v1.ResourceName, containersToUpdate []containerToUpdateInfo) (updateResults []*kubecontainer.SyncResult, err error) {
klog.V(5).InfoS("Updating container resources", "pod", klog.KObj(pod))
for _, cInfo := range containersToUpdate {
+ var updateContainerResult *kubecontainer.SyncResult
container := pod.Spec.Containers[cInfo.apiContainerIdx].DeepCopy()
// If updating memory limit, use most recently configured CPU request and limit values.
// If updating CPU request and limit, use most recently configured memory request and limit values.
switch resourceName {
case v1.ResourceMemory:
+ updateContainerResult = kubecontainer.NewSyncResult(kubecontainer.UpdateContainerMemory, container.Name)
container.Resources.Limits = v1.ResourceList{
v1.ResourceCPU: *resource.NewMilliQuantity(cInfo.currentContainerResources.cpuLimit, resource.DecimalSI),
v1.ResourceMemory: *resource.NewQuantity(cInfo.desiredContainerResources.memoryLimit, resource.BinarySI),
@@ -780,6 +789,7 @@ func (m *kubeGenericRuntimeManager) updatePodContainerResources(pod *v1.Pod, res
v1.ResourceMemory: *resource.NewQuantity(cInfo.desiredContainerResources.memoryRequest, resource.BinarySI),
}
case v1.ResourceCPU:
+ updateContainerResult = kubecontainer.NewSyncResult(kubecontainer.UpdateContainerCPU, container.Name)
container.Resources.Limits = v1.ResourceList{
v1.ResourceCPU: *resource.NewMilliQuantity(cInfo.desiredContainerResources.cpuLimit, resource.DecimalSI),
v1.ResourceMemory: *resource.NewQuantity(cInfo.currentContainerResources.memoryLimit, resource.BinarySI),
@@ -789,12 +799,19 @@ func (m *kubeGenericRuntimeManager) updatePodContainerResources(pod *v1.Pod, res
v1.ResourceMemory: *resource.NewQuantity(cInfo.currentContainerResources.memoryRequest, resource.BinarySI),
}
}
- if err := m.updateContainerResources(pod, container, cInfo.kubeContainerID); err != nil {
+ updateResults = append(updateResults, updateContainerResult)
+ if err = m.updateContainerResources(ctx, pod, container, cInfo.kubeContainerID); err != nil {
// Log error and abort as container updates need to succeed in the order determined by computePodResizeAction.
// The recovery path is for SyncPod to keep retrying at later times until it succeeds.
klog.ErrorS(err, "updateContainerResources failed", "container", container.Name, "cID", cInfo.kubeContainerID,
"pod", format.Pod(pod), "resourceName", resourceName)
- return err
+ switch resourceName {
+ case v1.ResourceMemory:
+ updateContainerResult.Fail(kubecontainer.ErrUpdateContainerMemory, err.Error())
+ case v1.ResourceCPU:
+ updateContainerResult.Fail(kubecontainer.ErrUpdateContainerCPU, err.Error())
+ }
+ return
}
// If UpdateContainerResources is error-free, it means desired values for 'resourceName' was accepted by runtime.
// So we update currentContainerResources for 'resourceName', which is our view of most recently configured resources.
@@ -808,7 +825,7 @@ func (m *kubeGenericRuntimeManager) updatePodContainerResources(pod *v1.Pod, res
cInfo.currentContainerResources.cpuRequest = cInfo.desiredContainerResources.cpuRequest
}
}
- return nil
+ return
}
// computePodActions checks whether the pod spec has changed and returns the changes if true.
@@ -1308,7 +1325,11 @@ func (m *kubeGenericRuntimeManager) SyncPod(ctx context.Context, pod *v1.Pod, po
// Step 7: For containers in podContainerChanges.ContainersToUpdate[CPU,Memory] list, invoke UpdateContainerResources
if IsInPlacePodVerticalScalingAllowed(pod) {
if len(podContainerChanges.ContainersToUpdate) > 0 || podContainerChanges.UpdatePodResources {
- m.doPodResizeAction(pod, podStatus, podContainerChanges, result)
+ resizeResult := m.doPodResizeAction(ctx, pod, podStatus, podContainerChanges)
+ result.AddPodSyncResult(resizeResult)
+ if resizeResult.Error() != nil {
+ klog.ErrorS(resizeResult.Error(), "doPodResizeAction failed")
+ }
}
}
```
Force-pushed from 91cab7c to e16d23e.
Thank you for the feedback. I updated this PR.
This fix makes doPodResizeAction() return the result instead of setting an error on the `result` argument, which should have been passed as a pointer, so that the error is propagated to the caller. This fix also makes the usage of PodSyncResult more consistent with other operations, like starting and killing a container.
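To illustrate the pass-by-value bug in isolation, here is a self-contained sketch using a trimmed-down stand-in for PodSyncResult (not the real kubelet type): mutating a struct received by value only changes the local copy, so the caller never observes the error.

```go
package main

import (
	"errors"
	"fmt"
)

type PodSyncResult struct{ SyncError error }

func (r *PodSyncResult) Fail(err error) { r.SyncError = err }

// byValue mutates only its local copy of the struct, so the caller never
// sees the error; this is the shape of the bug being fixed.
func byValue(result PodSyncResult) { result.Fail(errors.New("resize failed")) }

// byPointer mutates the caller's struct, so the error propagates.
func byPointer(result *PodSyncResult) { result.Fail(errors.New("resize failed")) }

func main() {
	var r1, r2 PodSyncResult
	byValue(r1)
	byPointer(&r2)
	fmt.Println(r1.SyncError) // <nil>
	fmt.Println(r2.SyncError) // resize failed
}
```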
Force-pushed from e16d23e to 5562cb1.
Can you help me understand the impact of this change? Is this mostly just a cleanup, or does this result in errors being handled differently?
It adds more precise logging and any associated errors for the action, in a manner similar to what is done in killPodWithSyncResult. E.g., from my quick-n-dirty test kubelet log:

```
E1104 06:40:39.043476 589620 kubelet.go:1988] DBG: SR: &{Action:UpdatePodSandbox Target:bbdfbdad-fd0d-47ec-8666-3ce74d260256 Error:<nil> Message:}
E1104 06:40:39.043480 589620 kubelet.go:1988] DBG: SR: &{Action:UpdateContainerMemory Target:ctr Error:<nil> Message:}
E1104 06:40:39.043484 589620 kubelet.go:1988] DBG: SR: &{Action:UpdateContainerCPU Target:ctr Error:<nil> Message:}
```
/lgtm
LGTM label has been added. Git tree hash: 31b031d05d823a3c8a639c172435fd7f82381565
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hshiina, tallclair, vinaykul

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/milestone v1.32
```diff
 	klog.ErrorS(err, "Failed to set cgroup config", "resource", rName, "pod", pod.Name)
 }
-return err
+return fmt.Errorf("failed to set cgroup config for %s of pod %s: %w", rName, format.Pod(pod), err)
```
This is unconditionally returning an error for this function.
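A self-contained sketch of the bug being pointed out, with setCgroupConfig as a hypothetical stand-in for the real call: because the wrapped return ended up outside the `if err != nil` block, the function returns a non-nil error even when the call succeeds.

```go
package main

import "fmt"

// Hypothetical stand-in for the real call; it succeeds.
func setCgroupConfig() error { return nil }

// buggy mirrors the reviewed change: the wrapped return sits after the
// if block, so it executes unconditionally, wrapping a nil err on success.
func buggy() error {
	err := setCgroupConfig()
	if err != nil {
		// the real code logged here
	}
	return fmt.Errorf("failed to set cgroup config: %w", err)
}

// fixed only wraps and returns when the call actually failed.
func fixed() error {
	if err := setCgroupConfig(); err != nil {
		return fmt.Errorf("failed to set cgroup config: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(buggy()) // failed to set cgroup config: %!w(<nil>)
	fmt.Println(fixed()) // <nil>
}
```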
I'm sorry I forgot to run tests after updating.
I believe the e2e tests presubmit job runs when these files are touched, but I overlooked checking the outcome on this occasion and missed the bug in the review as well, my bad. (Note to self: don't review code with tired eyes.)

IMHO this may be a good time to make pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2 visible in PRs.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This fix makes `doPodResizeAction()` return the result instead of setting an error on the `result` argument, which should have been passed as a pointer, so that the error is propagated to the caller. Even with this fix, errors from `doPodResizeAction()` still do not surface to users (xref: #123441). So, no description will be added to the release notes.

Which issue(s) this PR fixes:

Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: