Conversation

@kevin85421 kevin85421 commented Aug 17, 2023

Why are these changes needed?

This PR addresses the following questions:

1. KubeRay is aggressive in deleting running Pods

Without this PR, the RayFTTestCase test is somewhat flaky. When GCS fault tolerance is enabled, KubeRay injects both liveness and readiness probes into the head and worker Pods. However, KubeRay deletes the Pod immediately when it receives an unhealthy event from the probes, regardless of the failureThreshold value. Hence, it is very easy to trigger the deletion of running Pods, which is undesirable from any perspective. In CI, the initialDelaySeconds of the probes is 30 seconds, but the GCS server may take more than 30 seconds to be ready. Hence, the probe sometimes generates an unhealthy event because the GCS server is not ready in time, and KubeRay then deletes the head Pod and creates a new one in the next reconcile loop.
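
For reference, the kubelet itself only treats a container as unhealthy after failureThreshold consecutive probe failures; the problem above is that KubeRay reacted to the very first unhealthy event. Below is a minimal, illustrative Go sketch of such a probe. It is not KubeRay's actual defaults: the exec command and the numeric values are assumptions, and it assumes a k8s.io/api version where the handler field is named ProbeHandler.

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

func main() {
    // Illustrative readiness probe whose timing can race with GCS startup.
    probe := &corev1.Probe{
        ProbeHandler: corev1.ProbeHandler{
            Exec: &corev1.ExecAction{
                // Hypothetical health-check command; KubeRay's real command differs.
                Command: []string{"bash", "-c", "ray health-check"},
            },
        },
        InitialDelaySeconds: 30, // matches the CI value mentioned above
        PeriodSeconds:       5,
        TimeoutSeconds:      1,
        FailureThreshold:    3, // the kubelet only acts after 3 consecutive failures
    }
    fmt.Printf("%+v\n", probe)
}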

2. KubeRay listens to "all unhealthy events" in the Kubernetes cluster.

Without this PR, KubeRay listens to all unhealthy events in the Kubernetes cluster, regardless of whether they are related to KubeRay. This may cause the operator to OOM if there are many unhealthy events in the cluster; see #601 (comment) for more details.

3. Undefined lifecycles of "Failed" and "Succeeded" Pods

KubeRay does not define the lifecycle of Pods in terminated states (i.e., "Failed" or "Succeeded"). To be more specific, when the head Pod fails, KubeRay will neither delete it nor create a new one. When a worker Pod fails, KubeRay will not delete the failed Pod but will create a new worker Pod.

  • Example 1: restartPolicy: Never, no Ray Autoscaler, kill the ray start process in the head Pod

    • The head Pod's status will change to "Failed" because the container's primary process is killed, and the failed Pod will remain there indefinitely. Subsequently, the worker Pod will also fail because it cannot connect to the GCS beyond a certain threshold.
  • Example 2: Same setup as Example 1, but this time we kill the ray start process in the worker Pod. The failed worker Pod will not be deleted, and a new worker Pod will be created.

  • Example 3: restartPolicy: Never, add command: ["exit 1"] to the worker's spec (see the sketch after this list).

    • The worker Pod will fail automatically. The KubeRay operator will then create a new worker Pod, which will also fail, leading to the creation of another worker Pod, and so on. This loop repeats forever and consumes all the resources in the Kubernetes cluster.
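
For illustration, here is a hedged Go sketch of the worker setup in Example 3. The container name, image, and field values are assumptions used only to make the snippet self-contained; they are not taken from the PR's test code.

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// failingWorkerPodSpec sketches Example 3: restartPolicy Never plus a command
// that cannot succeed, so the Pod ends up in the "Failed" phase and can never
// recover on its own.
func failingWorkerPodSpec() corev1.PodSpec {
    return corev1.PodSpec{
        RestartPolicy: corev1.RestartPolicyNever,
        Containers: []corev1.Container{
            {
                Name:    "ray-worker",
                Image:   "rayproject/ray:2.5.0",
                Command: []string{"exit 1"}, // no such executable, so the container fails immediately
            },
        },
    }
}

func main() {
    fmt.Printf("%+v\n", failingWorkerPodSpec())
}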

Behavior of this PR

  • If a Pod is able to restart, KubeRay will not delete it. KubeRay only deletes Pods that are in terminated states (i.e., "Failed" or "Succeeded") and unable to restart in place (i.e., restartPolicy is not Always); see the sketch after this list.
  • KubeRay will not try to delete any "Running" Pods at this moment.
  • KubeRay does not listen to Kubernetes events anymore.
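
A minimal sketch of this deletion rule, using plain k8s.io/api types; it illustrates the described behavior and is not the PR's actual implementation.

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// shouldDeletePod captures the rule described above: delete a Pod only if it
// has terminated ("Failed" or "Succeeded") and cannot restart in place
// (restartPolicy is not Always). Running Pods are never deleted.
func shouldDeletePod(pod corev1.Pod) bool {
    terminated := pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodSucceeded
    canRestartInPlace := pod.Spec.RestartPolicy == corev1.RestartPolicyAlways
    return terminated && !canRestartInPlace
}

func main() {
    pod := corev1.Pod{
        Spec:   corev1.PodSpec{RestartPolicy: corev1.RestartPolicyNever},
        Status: corev1.PodStatus{Phase: corev1.PodFailed},
    }
    fmt.Println(shouldDeletePod(pod)) // true: terminated and unable to restart
}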

Follow up

  • Good default values for probes
  • Consider the case with sidecar containers
  • Delete Pods based on the value of restartCount

Related issue number

This PR closes most of the issues that were used to track the flakiness of the GCS FT tests.

Closes #1144
Closes #618
Closes #990
Closes #638
Closes #691
Closes #692

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
  • I used the following script to run the fault tolerance test 50 times on my devbox. It passed 50 times and failed 0 times.
    #!/bin/bash
    for iter in {1..50}
    do
        RAY_IMAGE=rayproject/ray:2.5.0 OPERATOR_IMAGE=controller:latest python3 tests/compatibility-test.py RayFTTestCase 2>&1 | tee logs/log_${iter} 
    done
  • Verify: for i in {1..50}; do tail -n 1 log_${i}; done | grep OK | wc -l => 50

@kevin85421 kevin85421 changed the title [GCS FT][Refactor] [WIP][GCS FT][Refactor] Aug 17, 2023
@kevin85421 kevin85421 changed the title [WIP][GCS FT][Refactor] [GCS FT][Refactor] Redefine the behavior for deleting Pods and stop listening to Kubernetes events Aug 18, 2023
@kevin85421 kevin85421 marked this pull request as ready for review August 18, 2023 22:39

@architkulkarni architkulkarni left a comment

Overall looks good, no major concerns.

  • In this PR the logic is to raise an error on the first reconciliation, and only bring up the new head pod on the second reconciliation. Why not do it in the same reconciliation? Is it just for simplicity or is there some other reason?

  • Perhaps we should leave #709 open, since other tests may be flaky or may become flaky in the future.

  • This PR removes the part about the RayCluster controller watching the readiness probe. Can you please remind me which other parts of KubeRay currently use the readiness and liveness probes?

kevin85421 and others added 3 commits August 21, 2023 10:46
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
@kevin85421
Member Author

  • In this PR the logic is to raise an error on the first reconciliation, and only bring up the new head pod on the second reconciliation. Why not do it in the same reconciliation? Is it just for simplicity or is there some other reason?

In my view, if a Pod keeps failing due to application logic, we should delay the recreation process so that other RayClusters' Pods can be scheduled on the k8s node. Otherwise, if we recreate it immediately, other Pods may have a hard time getting those resources.
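
To make the ordering concrete, here is a hedged sketch in controller-runtime terms (not the PR's actual code): returning an error from a reconciliation causes controller-runtime to requeue the request with exponential backoff, so the replacement head Pod is only created on a later pass.

package main

import (
    "context"
    "fmt"

    ctrl "sigs.k8s.io/controller-runtime"
)

// reconcileHeadPod sketches the idea: the reconciliation that observes the
// failed head Pod cleans it up and returns an error, which triggers a requeue
// with backoff; a later reconciliation creates the replacement Pod.
func reconcileHeadPod(ctx context.Context, headPodFailed bool) (ctrl.Result, error) {
    if headPodFailed {
        // Hypothetical cleanup (deleting the failed Pod) would happen here.
        return ctrl.Result{}, fmt.Errorf("head Pod failed; requeue before recreating")
    }
    // A later reconciliation, after the failed Pod is gone, creates the new head Pod.
    return ctrl.Result{}, nil
}

func main() {
    res, err := reconcileHeadPod(context.Background(), true)
    fmt.Println(res, err)
}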

Done.

  • This PR removes the part about the RayCluster controller watching the readiness probe. Can you please remind me which other parts of KubeRay currently use the readiness and liveness probes?

When GCS fault tolerance is enabled, KubeRay will inject the probes into Pods automatically.

if enabledString, ok := podTemplateSpec.Annotations[RayFTEnabledAnnotationKey]; ok {
    if strings.ToLower(enabledString) == "true" {
        // Ray FT is enabled and we need to add health checks
        if pod.Spec.Containers[rayContainerIndex].ReadinessProbe == nil {
            // It is possible that some users set the probe parameters to override the default;
            // in that case, this if condition is skipped.
            probe := &v1.Probe{
                InitialDelaySeconds: DefaultReadinessProbeInitialDelaySeconds,
                TimeoutSeconds:      DefaultReadinessProbeTimeoutSeconds,
                PeriodSeconds:       DefaultReadinessProbePeriodSeconds,
                SuccessThreshold:    DefaultReadinessProbeSuccessThreshold,
                FailureThreshold:    DefaultReadinessProbeFailureThreshold,
            }
            pod.Spec.Containers[rayContainerIndex].ReadinessProbe = probe
        }
        // Add the readiness probe exec command in case it is missing.
        initReadinessProbeHandler(pod.Spec.Containers[rayContainerIndex].ReadinessProbe, rayNodeType)

        if pod.Spec.Containers[rayContainerIndex].LivenessProbe == nil {
            // It is possible that some users set the probe parameters to override the default;
            // in that case, this if condition is skipped.
            probe := &v1.Probe{
                InitialDelaySeconds: DefaultLivenessProbeInitialDelaySeconds,
                TimeoutSeconds:      DefaultLivenessProbeTimeoutSeconds,
                PeriodSeconds:       DefaultLivenessProbePeriodSeconds,
                SuccessThreshold:    DefaultLivenessProbeSuccessThreshold,
                FailureThreshold:    DefaultLivenessProbeFailureThreshold,
            }
            pod.Spec.Containers[rayContainerIndex].LivenessProbe = probe
        }
        // Add the liveness probe exec command in case it is missing.
        initLivenessProbeHandler(pod.Spec.Containers[rayContainerIndex].LivenessProbe, rayNodeType)
    }
}


architkulkarni commented Aug 21, 2023

In my view, if a Pod keeps failing due to application logic, we should delay the recreation process so that other RayClusters' Pods can be scheduled on the k8s node. Otherwise, if we recreate it immediately, other Pods may have a hard time getting those resources.

I see, thanks!

When GCS fault tolerance is enabled, KubeRay will inject the probes into Pods automatically.

Ah, I meant to ask who reads the probes. Is it some other part of KubeRay, or is it the user's application code? Resolved offline: the probes are only read by the kubelet, not by any part of the KubeRay code or the application code.

blublinsky pushed a commit to blublinsky/kuberay that referenced this pull request Aug 22, 2023
Redefine the behavior for deleting Pods and stop listening to Kubernetes events (ray-project#1341)
blublinsky pushed a commit to blublinsky/kuberay that referenced this pull request Aug 25, 2023
Redefine the behavior for deleting Pods and stop listening to Kubernetes events (ray-project#1341)
blublinsky pushed a commit to blublinsky/kuberay that referenced this pull request Aug 29, 2023
Redefine the behavior for deleting Pods and stop listening to Kubernetes events (ray-project#1341)
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Redefine the behavior for deleting Pods and stop listening to Kubernetes events (ray-project#1341)