Conversation

@kevin85421 kevin85421 commented Aug 17, 2023

Why are these changes needed?

This PR addresses the following questions:

1. KubeRay is aggressive in deleting running Pods

Without this PR, the RayFTTestCase test is somewhat flaky. When GCS fault tolerance is enabled, KubeRay injects both liveness and readiness probes into the head and worker Pods. However, KubeRay deletes the Pod immediately when it receives an unhealthy event from the probes, regardless of the failureThreshold value. Hence, it is very easy to trigger the deletion of running Pods, which is undesirable from any perspective. In CI, the initialDelaySeconds of the probes is 30 seconds, but the GCS server may take more than 30 seconds to be ready. Hence, the probe sometimes generates an unhealthy event because the GCS server is not ready in time, and KubeRay then deletes the head Pod and creates a new one in the next reconcile loop.
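
For reference, the kubelet itself only treats a container as unhealthy after failureThreshold consecutive probe failures; the problem above is that KubeRay reacted to the very first unhealthy event. Below is a minimal, illustrative Go sketch of such a probe. It is not KubeRay's actual defaults: the exec command and the numeric values are assumptions, and it assumes a k8s.io/api version where the handler field is named ProbeHandler.

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

func main() {
    // Illustrative readiness probe whose timing can race with GCS startup.
    probe := &corev1.Probe{
        ProbeHandler: corev1.ProbeHandler{
            Exec: &corev1.ExecAction{
                // Hypothetical health-check command; KubeRay's real command differs.
                Command: []string{"bash", "-c", "ray health-check"},
            },
        },
        InitialDelaySeconds: 30, // matches the CI value mentioned above
        PeriodSeconds:       5,
        TimeoutSeconds:      1,
        FailureThreshold:    3, // the kubelet only acts after 3 consecutive failures
    }
    fmt.Printf("%+v\n", probe)
}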

2. KubeRay listens to "all unhealthy events" in the Kubernetes cluster.

Without this PR, KubeRay listens to all unhealthy events in the Kubernetes cluster, regardless of whether they are related to KubeRay. This may cause the operator to OOM if there are many unhealthy events in the cluster; see #601 (comment) for more details.

3. Undefined lifecycles of "Failed" and "Succeeded" Pods

KubeRay does not define the lifecycle of Pods in terminated states (i.e., "Failed" or "Succeeded"). To be more specific, when the head Pod fails, KubeRay will neither delete it nor create a new one. When a worker Pod fails, KubeRay will not delete the failed Pod but will create a new worker Pod.

  • Example 1: restartPolicy: Never, no Ray Autoscaler, kill the ray start process in the head Pod

    • The head Pod's status will change to "Failed" because the container's primary process is killed, and the failed Pod will remain there indefinitely. Subsequently, the worker Pod will also fail because it cannot connect to the GCS beyond a certain threshold.
  • Example 2: Same setup as Example 1, but this time we kill the ray start process in the worker Pod. The failed worker Pod will not be deleted, and a new worker Pod will be created.

  • Example 3: restartPolicy: Never, add command: ["exit 1"] to the worker's spec (see the sketch after this list).

    • The worker Pod will fail automatically. The KubeRay operator will then create a new worker Pod, which will also fail, leading to the creation of another worker Pod, and so on. This loop repeats forever and consumes all the resources in the Kubernetes cluster.
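
For illustration, here is a hedged Go sketch of the worker setup in Example 3. The container name, image, and field values are assumptions used only to make the snippet self-contained; they are not taken from the PR's test code.

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// failingWorkerPodSpec sketches Example 3: restartPolicy Never plus a command
// that cannot succeed, so the Pod ends up in the "Failed" phase and can never
// recover on its own.
func failingWorkerPodSpec() corev1.PodSpec {
    return corev1.PodSpec{
        RestartPolicy: corev1.RestartPolicyNever,
        Containers: []corev1.Container{
            {
                Name:    "ray-worker",
                Image:   "rayproject/ray:2.5.0",
                Command: []string{"exit 1"}, // no such executable, so the container fails immediately
            },
        },
    }
}

func main() {
    fmt.Printf("%+v\n", failingWorkerPodSpec())
}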

Behavior of this PR

  • If a Pod is able to restart, KubeRay will not delete it. KubeRay only deletes Pods that are in terminated states (i.e., "Failed" or "Succeeded") and unable to restart in place (i.e., restartPolicy is not Always); see the sketch after this list.
  • KubeRay will not try to delete any "Running" Pods at this moment.
  • KubeRay does not listen to Kubernetes events anymore.
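
A minimal sketch of this deletion rule, using plain k8s.io/api types; it illustrates the described behavior and is not the PR's actual implementation.

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// shouldDeletePod captures the rule described above: delete a Pod only if it
// has terminated ("Failed" or "Succeeded") and cannot restart in place
// (restartPolicy is not Always). Running Pods are never deleted.
func shouldDeletePod(pod corev1.Pod) bool {
    terminated := pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodSucceeded
    canRestartInPlace := pod.Spec.RestartPolicy == corev1.RestartPolicyAlways
    return terminated && !canRestartInPlace
}

func main() {
    pod := corev1.Pod{
        Spec:   corev1.PodSpec{RestartPolicy: corev1.RestartPolicyNever},
        Status: corev1.PodStatus{Phase: corev1.PodFailed},
    }
    fmt.Println(shouldDeletePod(pod)) // true: terminated and unable to restart
}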

Follow up

  • Good default values for probes
  • Consider the case with sidecar containers
  • Delete Pods based on the value of restartCount

Related issue number

This PR closes most of the issues that were used to track the flakiness of the GCS FT tests.

Closes #1144
Closes #618
Closes #990
Closes #638
Closes #691
Closes #692

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
  • I used the following script to run the fault tolerance test 50 times on my devbox. It passed 50 times and failed 0 times.
    #!/bin/bash
    for iter in {1..50}
    do
        RAY_IMAGE=rayproject/ray:2.5.0 OPERATOR_IMAGE=controller:latest python3 tests/compatibility-test.py RayFTTestCase 2>&1 | tee logs/log_${iter} 
    done
  • Verify: for i in {1..50}; do tail -n 1 log_${i}; done | grep OK | wc -l => 50

@kevin85421 kevin85421 changed the title [GCS FT][Refactor] [WIP][GCS FT][Refactor] Aug 17, 2023
@kevin85421 kevin85421 changed the title [WIP][GCS FT][Refactor] [GCS FT][Refactor] Redefine the behavior for deleting Pods and stop listening to Kubernetes events Aug 18, 2023
@kevin85421 kevin85421 marked this pull request as ready for review August 18, 2023 22:39

@architkulkarni architkulkarni left a comment

Overall looks good, no major concerns.

  • In this PR the logic is to raise an error on the first reconciliation, and only bring up the new head pod on the second reconciliation. Why not do it in the same reconciliation? Is it just for simplicity or is there some other reason?

  • Perhaps we should leave #709 open, since other tests may be flaky or may become flaky in the future.

  • This PR removes the part about the RayCluster controller watching the readiness probe. Can you please remind me which other parts of KubeRay currently use the readiness and liveness probes?

kevin85421 and others added 3 commits August 21, 2023 10:46
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
@kevin85421
Member Author

  • In this PR the logic is to raise an error on the first reconciliation, and only bring up the new head pod on the second reconciliation. Why not do it in the same reconciliation? Is it just for simplicity or is there some other reason?

In my view, if a Pod keeps failing due to application logic, we should delay the recreation process so that other RayClusters' Pods can be scheduled on the k8s node. Otherwise, if we recreate it immediately, other Pods may have a hard time getting those resources.
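
To make the ordering concrete, here is a hedged sketch in controller-runtime terms (not the PR's actual code): returning an error from a reconciliation causes controller-runtime to requeue the request with exponential backoff, so the replacement head Pod is only created on a later pass.

package main

import (
    "context"
    "fmt"

    ctrl "sigs.k8s.io/controller-runtime"
)

// reconcileHeadPod sketches the idea: the reconciliation that observes the
// failed head Pod cleans it up and returns an error, which triggers a requeue
// with backoff; a later reconciliation creates the replacement Pod.
func reconcileHeadPod(ctx context.Context, headPodFailed bool) (ctrl.Result, error) {
    if headPodFailed {
        // Hypothetical cleanup (deleting the failed Pod) would happen here.
        return ctrl.Result{}, fmt.Errorf("head Pod failed; requeue before recreating")
    }
    // A later reconciliation, after the failed Pod is gone, creates the new head Pod.
    return ctrl.Result{}, nil
}

func main() {
    res, err := reconcileHeadPod(context.Background(), true)
    fmt.Println(res, err)
}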

Done.

  • This PR removes the part about the RayCluster controller watching the readiness probe. Can you please remind me which other parts of KubeRay currently use the readiness and liveness probes?

When GCS fault tolerance is enabled, KubeRay will inject the probes into Pods automatically.

if enabledString, ok := podTemplateSpec.Annotations[RayFTEnabledAnnotationKey]; ok {
    if strings.ToLower(enabledString) == "true" {
        // Ray FT is enabled and we need to add health checks
        if pod.Spec.Containers[rayContainerIndex].ReadinessProbe == nil {
            // It is possible that some users set the probe parameters to override the default;
            // in that case, this if condition is skipped.
            probe := &v1.Probe{
                InitialDelaySeconds: DefaultReadinessProbeInitialDelaySeconds,
                TimeoutSeconds:      DefaultReadinessProbeTimeoutSeconds,
                PeriodSeconds:       DefaultReadinessProbePeriodSeconds,
                SuccessThreshold:    DefaultReadinessProbeSuccessThreshold,
                FailureThreshold:    DefaultReadinessProbeFailureThreshold,
            }
            pod.Spec.Containers[rayContainerIndex].ReadinessProbe = probe
        }
        // Add the readiness probe exec command in case it is missing.
        initReadinessProbeHandler(pod.Spec.Containers[rayContainerIndex].ReadinessProbe, rayNodeType)

        if pod.Spec.Containers[rayContainerIndex].LivenessProbe == nil {
            // It is possible that some users set the probe parameters to override the default;
            // in that case, this if condition is skipped.
            probe := &v1.Probe{
                InitialDelaySeconds: DefaultLivenessProbeInitialDelaySeconds,
                TimeoutSeconds:      DefaultLivenessProbeTimeoutSeconds,
                PeriodSeconds:       DefaultLivenessProbePeriodSeconds,
                SuccessThreshold:    DefaultLivenessProbeSuccessThreshold,
                FailureThreshold:    DefaultLivenessProbeFailureThreshold,
            }
            pod.Spec.Containers[rayContainerIndex].LivenessProbe = probe
        }
        // Add the liveness probe exec command in case it is missing.
        initLivenessProbeHandler(pod.Spec.Containers[rayContainerIndex].LivenessProbe, rayNodeType)
    }
}


architkulkarni commented Aug 21, 2023

In my view, if a Pod keeps failing due to application logic, we should delay the recreation process so that other RayClusters' Pods can be scheduled on the k8s node. Otherwise, if we recreate it immediately, other Pods may have a hard time getting those resources.

I see, thanks!

When GCS fault tolerance is enabled, KubeRay will inject the probes into Pods automatically.

Ah, I meant to ask who reads the probes. Is it some other part of KubeRay, or is it the user's application code? Resolved offline: the probes are only read by the kubelet, not by any part of the KubeRay code or the application code.

blublinsky pushed a commit to blublinsky/kuberay that referenced this pull request Aug 22, 2023
Redefine the behavior for deleting Pods and stop listening to Kubernetes events (ray-project#1341)
blublinsky pushed a commit to blublinsky/kuberay that referenced this pull request Aug 25, 2023
Redefine the behavior for deleting Pods and stop listening to Kubernetes events (ray-project#1341)
blublinsky pushed a commit to blublinsky/kuberay that referenced this pull request Aug 29, 2023
Redefine the behavior for deleting Pods and stop listening to Kubernetes events (ray-project#1341)
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Redefine the behavior for deleting Pods and stop listening to Kubernetes events (ray-project#1341)