
Conversation

@kevin85421 (Member) commented Apr 5, 2024

Why are these changes needed?

We can think of a RayCluster as a collection of K8s ReplicaSets. In that case, the RayCluster status should be similar to ReplicaSetStatus. ReplicaSetStatus doesn't check the Pod commands, which belong to the data-plane logic. In my opinion, Ray command validation should be done by Ray itself rather than by KubeRay.
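
To make the scope concrete, here is a minimal sketch of a RayCluster fragment (the param values and image tag are illustrative, not taken from this PR). With this change, KubeRay forwards rayStartParams to "ray start" as-is, so an invalid value is rejected by Ray at startup rather than by the operator:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-example
spec:
  headGroupSpec:
    # Forwarded to `ray start` without KubeRay-side validation.
    rayStartParams:
      dashboard-host: "0.0.0.0"  # valid param, passed through unchanged
      memory: "abc"              # invalid value; `ray start` itself rejects it
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0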

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 (Member Author)

cc @Yicheng-Lu-llll, would you mind reviewing this PR? Thanks!

return nil
}

// ValidateHeadRayStartParams will validate the head node's RayStartParams.

Contributor:
Agree that we should let Ray do the validation. One question: if Ray validation fails, will the user be able to see the error message and tell which param is wrong?

@kevin85421 (Member Author):

If the error message is propagated to the STDOUT or STDERR of the "ray start" process, users will be able to see it. For example, I set rayStartParams as follows, and then I can see the error message in the head Pod's log.

rayStartParams:
  memory: "abc"
[Screenshot: head Pod log showing the `ray start` error for the invalid memory value]

Member:
I agree that validating the start params in KubeRay is not necessary. But if the Ray head is crash looping due to bad params, we should also surface the cluster state as either unhealthy or failed, right? From @Yicheng-Lu-llll's test in #2068 (comment), the state is not updated at all.

@kevin85421 (Member Author):

But if Ray head is crash looping due to bad params, we should also surface cluster state as either unhealthy or failed right?

Yes, we should add a field in HeadInfo to indicate whether the head Pod is ready or not, as I mentioned in #1930 (review).
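
As a rough sketch of that idea (the podReady field name is hypothetical, invented here for illustration; the actual field is to be decided in #1930), head readiness could be surfaced in the RayCluster status like this:

# Hypothetical status fragment; `podReady` is not an actual KubeRay field.
status:
  head:
    podIP: 10.244.0.6
    serviceIP: 10.96.249.54
    podReady: false   # false while the head Pod is crash looping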

@kevin85421 marked this pull request as ready for review April 5, 2024 23:34
@Yicheng-Lu-llll (Member) commented Apr 5, 2024

lgtm!

I also manually tested the case where the ray start command is incorrect:

ubuntu@ip-172-31-38-87:~/others/kuberay$ kubectl get pod
# NAME                                          READY   STATUS             RESTARTS      AGE
# kuberay-operator-8b57f8d84-klv75              1/1     Running            0             2m36s
# raycluster-kuberay-head-56cfw                 0/1     CrashLoopBackOff   3 (32s ago)   115s
# raycluster-kuberay-worker-workergroup-8p9np   0/1     Init:0/1           0             115s

ubuntu@ip-172-31-38-87:~/others/kuberay$ kubectl logs raycluster-kuberay-head-56cfw
# Usage: ray start [OPTIONS]
# Try 'ray start --help' for help.

# Error: No such option: --appple



# The ClusterState is not shown.
ubuntu@ip-172-31-38-87:~/others/kuberay$ kubectl describe raycluster
# Status:
#   Desired CPU:              2
#   Desired GPU:              0
#   Desired Memory:           3G
#   Desired TPU:              0
#   Desired Worker Replicas:  1
#   Endpoints:
#     Client:     10001
#     Dashboard:  8265
#     Metrics:    8080
#     Redis:      6379
#     Serve:      8000
#   Head:
#     Pod IP:             10.244.0.6
#     Service IP:         10.96.249.54
#   Last Update Time:     2024-04-05T23:36:52Z
#   Max Worker Replicas:  3
#   Min Worker Replicas:  1
#   Observed Generation:  1

@kevin85421 (Member Author)

@andrewsykim Does #2068 (comment) make sense to you? Thanks!

@andrewsykim (Member)

Yes, we should add a field in HeadInfo to indicate whether the head Pod is ready or not as I mentioned in #1930 (review).

Removing states from ClusterState, or even the entire field as suggested in #1930 (review), is a breaking change I would consider fine for alpha or beta APIs, but not for a v1 API.

Adding new fields to HeadInfo is fine, but removing possible values of ClusterState, or the ClusterState field itself, breaks compatibility.

@andrewsykim (Member) commented Apr 8, 2024

My personal suggestion is to first add the new field to HeadInfo and then re-evaluate whether removing the unhealthy cluster state is necessary. Can we instead just have ClusterState = unhealthy reflect the value of the state in HeadInfo, without removing it?

@kevin85421 (Member Author)

My personal suggestion is to first add the new field to HeadInfo and then re-evaluate whether removing the unhealthy cluster state is necessary. Can we instead just have ClusterState = unhealthy reflect the value of the state in HeadInfo, without removing it?

Thank you for raising awareness about compatibility. I agree that we should be careful regarding the compatibility of ready. However, unhealthy is a different case. You can take a look at the definition of unhealthy; it pertains only to a very niche case involving rayStartParams. rayStartParams is typically empty for most users. In the 1.5 years that I have been working on KubeRay, I haven’t encountered the codepath related to unhealthy or seen any user running into it.

@andrewsykim (Member) commented Apr 8, 2024

However, unhealthy is a different case. You can take a look at the definition of unhealthy; it pertains only to a very niche case involving rayStartParams

Thanks, I missed this part. I agree it's fine to remove if it only pertains to validating rayStartParams. But #1930 proposes expanding the definition of unhealthy as well, which I think is valid. The definitions of ready and unhealthy are connected, because "not ready" is inherently "unhealthy". See my latest comment (#1930 (comment)) for why I think it's fine to change the definition of ready, which I think also applies to unhealthy.

@kevin85421 (Member Author)

I think it's fine to change the definition of ready

I replied in #1930 (comment) to explain why it is a breaking change and how it may create "unknown unknowns" for users, as I mentioned in #1930 (comment). If we can't reach a consensus, we can have a chat next week.

@andrewsykim (Member)

Thanks @kevin85421 -- thinking about this more, I think this PR is good to merge based on the current definition of unhealthy. We can always re-add it after the discussion in #1930 and before v1.2.

@kevin85421 merged commit 17809bc into ray-project:master on Apr 10, 2024