[GCS FT][Refactor] Redefine the behavior for deleting Pods and stop listening to Kubernetes events #1341
Conversation
Overall looks good, no major concerns.
-
In this PR the logic is to raise an error on the first reconciliation, and only bring up the new head pod on the second reconciliation. Why not do it in the same reconciliation? Is it just for simplicity or is there some other reason?
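For readers less familiar with the pattern, here is a minimal sketch of the two-pass flow the question refers to, assuming controller-runtime's requeue-on-error semantics; the names are illustrative and not the PR's actual code.

```go
package example

import (
	"context"
	"errors"

	ctrl "sigs.k8s.io/controller-runtime"
)

var errHeadPodDeleted = errors.New("old head Pod deleted; requeue to create the new one")

// reconcileHead illustrates the two-pass flow: on the first pass the old head
// Pod has just been deleted and an error is returned, so controller-runtime
// requeues the request with backoff; the new head Pod is only created on the
// next pass.
func reconcileHead(ctx context.Context, oldHeadPodJustDeleted bool) (ctrl.Result, error) {
	if oldHeadPodJustDeleted {
		return ctrl.Result{}, errHeadPodDeleted
	}
	// Second pass: no head Pod exists, so create the replacement here.
	return ctrl.Result{}, nil
}
```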
-
Perhaps we should leave #709 open, since other tests may be flaky or may become flaky in the future.
-
This PR removes the part about the RayCluster controller watching the readiness probe. Can you please remind me which other parts of KubeRay currently use the readiness and liveness probes?
In my view, if a Pod keeps failing due to application logic, we should delay the recreation process so that other RayClusters' Pods can be scheduled on the k8s node. Otherwise, if we recreate it immediately, it may be hard for other Pods to get the resources they need.
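As one possible illustration of that suggestion (not KubeRay's actual code), a reconciler built on controller-runtime could delay the recreation by returning a RequeueAfter result; the delay value below is an arbitrary assumption.

```go
package example

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// delayRecreation asks the manager to run the next reconciliation only after a
// delay, so other RayClusters' Pods get a chance to be scheduled on the node
// before the repeatedly failing Pod is recreated.
func delayRecreation() (ctrl.Result, error) {
	return ctrl.Result{RequeueAfter: 2 * time.Minute}, nil // delay chosen arbitrarily for the sketch
}
```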
Done.
When GCS fault tolerance is enabled, KubeRay will inject the probes into Pods automatically. See kuberay/ray-operator/controllers/ray/common/pod.go, lines 376 to 409 at 8be0a21.
I see, thanks!
Why are these changes needed?
This PR addresses the following issues:
1. KubeRay is aggressive in deleting running Pods
Without this PR, the RayFTTestCase test is somewhat flaky. When GCS fault tolerance is enabled, KubeRay injects both liveness and readiness probes into the head and worker Pods. However, KubeRay deletes a Pod immediately whenever it receives an unhealthy event from the probes, regardless of the value of failureThreshold. Hence, it is very easy to trigger the deletion of running Pods, which is undesirable from every perspective. In CI, the initialDelaySeconds of the probes is 30 seconds, but the GCS server may take more than 30 seconds to become ready. As a result, the probe sometimes emits an unhealthy event because the GCS server isn't ready in time, and KubeRay then deletes the head Pod and creates a new one in the next reconcile loop.
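For context, a probe of the kind discussed here looks roughly like the following Go sketch; the endpoint, port, and numeric values are assumptions for illustration, not KubeRay's actual defaults.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// exampleHeadProbe sketches a liveness/readiness probe for the head Pod.
// Respecting failureThreshold means the kubelet only acts after several
// consecutive failures; deleting the Pod on the first unhealthy event, as the
// old behavior did, bypasses that threshold entirely.
func exampleHeadProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/api/gcs_healthz",    // illustrative health endpoint
				Port: intstr.FromInt(52365), // illustrative dashboard agent port
			},
		},
		InitialDelaySeconds: 30, // in CI the GCS server can take longer than this to be ready
		PeriodSeconds:       5,
		TimeoutSeconds:      2,
		FailureThreshold:    20, // several consecutive failures before the kubelet reacts
	}
}
```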
2. KubeRay listens to "all unhealthy events" in the Kubernetes cluster.
Without this PR, KubeRay listens to "all unhealthy events" in the Kubernetes cluster, regardless of whether they are related to KubeRay. This may cause an OOM if there are many unhealthy events in the cluster; see #601 (comment) for more details.
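A minimal sketch of the resulting controller wiring, assuming controller-runtime; the import path and type names are illustrative and may differ from the actual KubeRay source.

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"

	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// RayClusterReconciler is a stand-in for the real reconciler.
type RayClusterReconciler struct{}

func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// SetupWithManager reconciles on RayCluster changes and on Pods owned by a
// RayCluster. Crucially, there is no Watches(...) on corev1.Event, so the
// controller no longer caches or reacts to every unhealthy Event in the cluster.
func (r *RayClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rayv1alpha1.RayCluster{}).
		Owns(&corev1.Pod{}).
		Complete(r)
}
```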
3. Undefined lifecycles of "Failed" and "Succeeded" Pods
KubeRay does not define the lifecycle of Pods in terminated states (i.e., "Failed" or "Succeeded"). To be more specific, when the head Pod fails, KubeRay neither deletes it nor creates a new one. When a worker Pod fails, KubeRay does not delete the failed Pod but does create a new worker Pod.
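One way to make that lifecycle explicit is sketched below, under assumed names; this illustrates the idea rather than reproducing the PR's exact logic. The examples that follow reproduce the current (pre-PR) behavior described above.

```go
package example

import corev1 "k8s.io/api/core/v1"

// shouldDeleteTerminatedPod reports whether the operator should delete the Pod
// and let the next reconciliation create a replacement: the Pod has reached a
// terminal phase, and the kubelet will not restart its containers in place.
func shouldDeleteTerminatedPod(pod *corev1.Pod) bool {
	terminated := pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodSucceeded
	// With restartPolicy: Always the kubelet keeps restarting the containers, so
	// the Pod does not stay in a terminal phase and no replacement is needed.
	return terminated && pod.Spec.RestartPolicy != corev1.RestartPolicyAlways
}
```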
Example 1: `restartPolicy: Never`, no Ray Autoscaler, delete the `ray start` process in the head Pod.
Example 2: Same gist as Example 1. In this example, we delete the `ray start` process in the worker Pod. The failed worker Pod will not be deleted, and a new worker Pod will be created.
Example 3: `restartPolicy: Never`, add `command: ["exit 1"]` to the worker's spec.
Behavior of this PR
… (`restartPolicy` is not `Always`).
Follow up
… `restartCount`
Related issue number
I closed most of the issues that were used to track the flakiness of the GCS FT tests.
Closes #1144
Closes #618
Closes #990
Closes #638
Closes #691
Closes #692
Checks
`for i in {1..50}; do tail -n 1 log_${i}; done | grep OK | wc -l` => 50