Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
ray-operator, Others
What happened + What you expected to happen
In ray-service.sample.yaml, the serveConfigV2 defines deployments for MangoStand, OrangeStand, and PearStand.
- If two of these deployments are removed (e.g., keeping only MangoStand and FruitMarket), running kubectl get rayservice shows the state as WaitForServeDeploymentReady, and the service does not reach a ready state; see the inspection commands below.
- However, if only one deployment is removed (e.g., keeping two of the three), the service works as expected.
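To see where the service gets stuck, a few standard inspection commands can be used. This is only a sketch: it assumes the sample's metadata.name rayservice-sample and a KubeRay operator installed as a Deployment named kuberay-operator (the operator name depends on how it was installed).

# Show the RayService status, conditions, and recent events.
kubectl describe rayservice rayservice-sample

# Dump the full status block, including per-application Serve statuses.
kubectl get rayservice rayservice-sample -o yaml

# Check the operator logs for Serve-related reconcile errors (operator Deployment name is an assumption).
kubectl logs deployment/kuberay-operator | grep -i serve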
Reproduction script
1. Edit the ray-service.sample.yaml file to remove two of the three deployments in serveConfigV2 (e.g., keep only MangoStand).
2. Apply the updated file with kubectl apply -f ray-service.sample.yaml.
3. Run kubectl get rayservice and observe the status remaining in WaitForServeDeploymentReady (see the commands and the full edited manifest below).
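For convenience, the reproduction steps as shell commands. This is a sketch: it assumes the edited manifest is saved as ray-service.sample.yaml in the current directory and that the RayService keeps the sample name rayservice-sample.

# Apply the edited manifest.
kubectl apply -f ray-service.sample.yaml

# Watch the RayService status; it stays in WaitForServeDeploymentReady.
kubectl get rayservice rayservice-sample -w

The full edited manifest used for this reproduction follows.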
# Make sure to increase resource requests and limits before using this example in production.
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-sample
spec:
  # serveConfigV2 takes a yaml multi-line scalar, which should be a Ray Serve multi-application config. See https://docs.ray.io/en/latest/serve/multi-app.html.
  serveConfigV2: |
    applications:
      - name: fruit_app
        import_path: fruit.deployment_graph
        route_prefix: /fruit
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: MangoStand
            num_replicas: 2
            max_replicas_per_node: 1
            user_config:
              price: 3
            ray_actor_options:
              num_cpus: 0.1
          - name: FruitMarket
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.1
      - name: math_app
        import_path: conditional_dag.serve_dag
        route_prefix: /calc
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: Adder
            num_replicas: 1
            user_config:
              increment: 3
            ray_actor_options:
              num_cpus: 0.1
          - name: Multiplier
            num_replicas: 1
            user_config:
              factor: 5
            ray_actor_options:
              num_cpus: 0.1
          - name: Router
            num_replicas: 1
  rayClusterConfig:
    rayVersion: '2.9.0' # should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      # Pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
              resources:
                limits:
                  cpu: 2
                  memory: 4Gi
                requests:
                  cpu: 2
                  memory: 4Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        # Pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
                image: rayproject/ray:2.9.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "1"
                    memory: "2Gi"
                  requests:
                    cpu: "500m"
                    memory: "2Gi"
Anything else
We need to investigate why removing two deployments causes the issue while removing only one deployment does not. It seems like there might be a threshold or configuration issue in serveConfigV2.
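One way to narrow this down is to apply the failing (two deployments removed) and working (one deployment removed) variants in turn and compare what the controller records for each Serve application. A sketch, assuming the per-application statuses are exposed under .status.activeServiceStatus (the exact field path may vary by KubeRay version):

# Capture the recorded application statuses for the failing variant.
kubectl get rayservice rayservice-sample -o jsonpath='{.status.activeServiceStatus}' > two-removed-status.txt

# Re-apply the variant with only one deployment removed, wait for reconciliation, then capture again.
kubectl get rayservice rayservice-sample -o jsonpath='{.status.activeServiceStatus}' > one-removed-status.txt

# Diff the two snapshots to see which deployments never become healthy.
diff two-removed-status.txt one-removed-status.txt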
Are you willing to submit a PR?
- Yes I am willing to submit a PR!