[RayCluster][Status][1/n] Remove ClusterState Unhealthy #2068
Conversation
cc @Yicheng-Lu-llll would you mind reviewing this PR? Thanks!
```go
	return nil
}

// ValidateHeadRayStartParams will validate the head node's RayStartParams.
```
Agree that we should let Ray do the validation. One question: if Ray's validation fails, is the user able to see the error message and tell which parameter is wrong?
I agree that validating the start params from KubeRay is not necessary. But if the Ray head is crash looping due to bad params, we should also surface the cluster state as either unhealthy or failed, right? From @Yicheng-Lu-llll's test in #2068 (comment), the state is not updated at all.
> But if Ray head is crash looping due to bad params, we should also surface cluster state as either unhealthy or failed right?
Yes, we should add a field in HeadInfo to indicate whether the head Pod is ready or not as I mentioned in #1930 (review).
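For illustration only, a minimal sketch of what that could look like, assuming `HeadInfo` keeps its existing `PodIP` and `ServiceIP` fields (visible in the `kubectl describe` output below); the `PodReady` field name is hypothetical, not the final API:

```go
package v1

// Sketch only: field names other than PodIP/ServiceIP are assumptions.
type HeadInfo struct {
	PodIP     string `json:"podIP,omitempty"`
	ServiceIP string `json:"serviceIP,omitempty"`
	// PodReady (hypothetical name) would mirror the head Pod's Ready
	// condition, so a crash-looping head is visible in the RayCluster
	// status even without a ClusterState value.
	PodReady bool `json:"podReady,omitempty"`
}
```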
lgtm! I also manually tested the case where the ray start command is incorrect:
ubuntu@ip-172-31-38-87:~/others/kuberay$ kubectl get pod
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-8b57f8d84-klv75 1/1 Running 0 2m36s
# raycluster-kuberay-head-56cfw 0/1 CrashLoopBackOff 3 (32s ago) 115s
# raycluster-kuberay-worker-workergroup-8p9np 0/1 Init:0/1 0 115s
ubuntu@ip-172-31-38-87:~/others/kuberay$ kubectl logs raycluster-kuberay-head-56cfw
# Usage: ray start [OPTIONS]
# Try 'ray start --help' for help.
# Error: No such option: --appple
# The ClusterState will not be shown.
ubuntu@ip-172-31-38-87:~/others/kuberay$ kubectl describe raycluster
# Status:
# Desired CPU: 2
# Desired GPU: 0
# Desired Memory: 3G
# Desired TPU: 0
# Desired Worker Replicas: 1
# Endpoints:
# Client: 10001
# Dashboard: 8265
# Metrics: 8080
# Redis: 6379
# Serve: 8000
# Head:
# Pod IP: 10.244.0.6
# Service IP: 10.96.249.54
# Last Update Time: 2024-04-05T23:36:52Z
# Max Worker Replicas: 3
# Min Worker Replicas: 1
# Observed Generation: 1
@andrewsykim Does #2068 (comment) make sense to you? Thanks!
Removing states from ClusterState, or even the entire field as suggested in #1930 (review), is a breaking change I would consider fine for alpha or beta APIs, but not a v1 API. Adding new fields in HeadInfo is fine, but removing possible values of ClusterState, or the ClusterState field itself, breaks compatibility.
My personal suggestion is to first add the new field in HeadInfo and then re-evaluate whether removing the ClusterState states is still necessary.
Thank you for raising awareness about compatibility. I agree that we should be careful regarding the compatibility of ClusterState.
Thanks, I missed this part. I agree it's fine to remove if it only pertains to validating rayStartParams. But #1930 proposes expanding the definition of unhealthy.
I replied in #1930 (comment) to explain why it is a breaking change and how it may create "unknown unknowns" for users, as I mentioned in #1930 (comment). If we can't reach a consensus, we can have a chat next week.
Thanks @kevin85421 -- thinking about this more, I think this PR is good to merge based on the current definition of unhealthy. We can always re-add it after discussion in #1930 and before v1.2.
Why are these changes needed?
We can think of RayCluster as a collection of K8s ReplicaSets. In this case, RayCluster status should be similar to ReplicaSetStatus. The ReplicaSetStatus doesn't check the Pod commands which belong to the data plane logic. The Ray command validation should be done by Ray instead of KubeRay in my opinion.
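As a point of comparison (not part of this PR's diff), `ReplicaSetStatus` in `k8s.io/api/apps/v1` only surfaces replica counts, conditions, and the observed generation; whether the container command is valid is left to the Pods themselves. A small sketch of reading that status:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

func main() {
	// ReplicaSetStatus reports desired vs. ready replicas; it never inspects
	// or validates the Pod's command line. The argument in this PR is that
	// RayClusterStatus should follow the same pattern and leave ray start
	// validation to Ray itself.
	status := appsv1.ReplicaSetStatus{Replicas: 3, ReadyReplicas: 2}
	fmt.Printf("replicas=%d ready=%d\n", status.Replicas, status.ReadyReplicas)
}
```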
Related issue number
Checks