Conversation

Yicheng-Lu-llll commented Feb 26, 2023

Why are these changes needed?

This PR:

  1. Injects the --block option into the ray start command automatically if the user has not set the --block option. See #675 (Remove ray-cluster.without-block.yaml) and this comment on #912 ([Feature][Docs] Explain how to specify container command for head pod).
  2. According to the ray start options documentation, many parameters should be settable in the form 'parameter': 'true' or 'parameter': 'false' (e.g. --block, --disable-usage-stats) in rayStartParams. However, KubeRay could not handle a 'parameter': 'false' entry (it would fail to run all the pods). This PR enables the user to set a 'false' value; see the sketch after this list.
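To illustrate point 2, here is a minimal sketch of the mapping (the rayStartParams values are hypothetical examples; the generated commands are modeled on the real ones quoted further down in this description):

# rayStartParams entry          ->  generated ray start flag
#   'num-cpus': '1'             ->  --num-cpus=1
#   'block': 'true'             ->  --block                   (boolean option set to true: bare flag)
#   'block': 'false'            ->  (omitted)                 (boolean option set to false: no flag, no error)
#   'include-dashboard': 'true' ->  --include-dashboard=true  (special option: argument is kept)
#
# With 'block': 'true' (or with block unset, which this PR now treats the same way),
# the head command looks like:
#   ulimit -n 65536; ray start --head --num-cpus=1 --block --dashboard-host=0.0.0.0
# With 'block': 'false', ray start returns immediately, so the container needs the
# trailing 'sleep infinity' to stay alive:
#   ulimit -n 65536; ray start --head --num-cpus=1 --dashboard-host=0.0.0.0 && sleep infinity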

Proof that, without --block, it can take much longer to detect an unhealthy condition and restart the pod:

You can find without_block.yaml and with_block.yaml here.

# without block case:
# Create the head and worker pods.
# Note: we need to enable fault tolerance here, because the readiness and liveness
# probes are only installed when fault tolerance is enabled.
# See https://github.com/ray-project/kuberay/blob/master/docs/guidance/gcs-ft.md
kubectl apply -f /home/lyc/without_block.yaml
#  ulimit -n 65536; ray start  --num-cpus=1  --metrics-export-port=8080  --address=raycluster-external-redis-head-svc:6379  && sleep infinity
#  ulimit -n 65536; ray start --head  --num-cpus=1  --metrics-export-port=8080  --dashboard-host=0.0.0.0  && sleep infinity

# Kill the GCS server; the pod will keep running.
# Note: you could also run 'ray stop -f' to kill all ray processes instead.
# Either way the pod keeps running, because the 'sleep infinity' process stays alive.
date && kubectl exec -it $(kubectl get pods -o=name | grep head) --  pkill gcs_server
# List all events sorted by creation time to see when the liveness probe fails.
kubectl get events --sort-by='{.metadata.creationTimestamp}' -o yaml | grep -E 'message|firstTimestamp' 
# get pod's last exit time
kubectl get  $(kubectl get pods -o=name | grep head) -o yaml | grep finishedAt
# get pod's restart time
kubectl get  $(kubectl get pods -o=name | grep head) -o yaml | grep startedAt


# I ran the steps above three times; here are the results:

#   kill gcs server at:                    Sun Feb 26 04:59:39 UTC 2023
#   liveness/readiness probe fails at:     "2023-02-26T05:01:41Z"
#   pod finishedAt:                        "2023-02-26T05:04:04Z"
#   pod restartedAt:                       "2023-02-26T05:04:04Z"

#   kill gcs server at:                    Sun Feb 26 05:23:51 UTC 2023
#   liveness/readiness probe fails at:     "2023-02-26T05:25:54Z"
#   pod finishedAt:                        "2023-02-26T05:28:17Z"
#   pod restartedAt:                       "2023-02-26T05:28:17Z"

#   kill gcs server at:                    Sun Feb 26 05:38:52 UTC 2023
#   liveness/readiness probe fails at:     "2023-02-26T05:40:56Z"
#   pod finishedAt:                        "2023-02-26T05:43:19Z"
#   pod restartedAt:                       "2023-02-26T05:43:19Z"

# with block case:
# Create the head and worker pods.
kubectl apply -f /home/lyc/with_block.yaml
# 'ulimit -n 65536; ray start  --address=raycluster-complete-head-svc:6379  --metrics-export-port=8080  --num-cpus=1  --memory=1000000000  --block '
# 'ulimit -n 65536; ray start --head  --num-cpus=1  --memory=2000000000  --block  --dashboard-host=0.0.0.0  --metrics-export-port=8080 '

# Kill all ray processes; the pod will exit almost immediately.
date && kubectl exec -it $(kubectl get pods -o=name | grep head) --  ray stop -f
# get pod's last exit time
kubectl get  $(kubectl get pods -o=name | grep head) -o yaml | grep finishedAt
# get pod's restart time
kubectl get  $(kubectl get pods -o=name | grep head) -o yaml | grep startedAt


# I ran the steps above three times; here are the results:

#   kill all ray processes at:   Sun Feb 26 04:44:50 UTC 2023
#   pod finishedAt :             "2023-02-26T04:44:53Z"
#   pod restartedAt:             "2023-02-26T04:44:53Z"

#   kill all ray processes at:   Sun Feb 26 04:51:07 UTC 2023
#   pod finishedAt :             "2023-02-26T04:51:09Z"
#   pod restartedAt:             "2023-02-26T04:51:10Z"

#   kill all ray processes at:   Sun Feb 26 04:55:09 UTC 2023
#   pod finishedAt :             "2023-02-26T04:55:11Z"
#   pod restartedAt:             "2023-02-26T04:55:12Z"

To summarize the results, in a single-node, otherwise healthy environment:

  • With the --block option, the pod restarts within a few seconds of the ray processes being killed.
  • Without the --block option,
    • the liveness/readiness probes fail about 123 seconds after the ray processes are killed (122 s, 123 s, and 124 s in the three runs above), and
    • the pod restarts 143 seconds after the probes fail.

Questions:

Should we enforce adding the --block option (even if the user sets 'block': 'false' in rayStartParams)?

Not every user enables fault tolerance, and the readiness and liveness probes are only installed when fault tolerance is enabled. That is, if fault tolerance is off and the ray processes somehow crash, KubeRay will never detect the failure and restart the pod for the user.
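To make the probe situation concrete, here is one way to check whether the head pod actually has a liveness probe installed (same kubectl style as the experiments above; jsonpath prints nothing when no probe is configured):

kubectl get $(kubectl get pods -o=name | grep head) \
  -o jsonpath='{.spec.containers[0].livenessProbe}'
# Empty output: no liveness probe, so without --block a crashed ray process
# inside a still-sleeping container would never be detected.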

Related issue number

Closes #915

Checks

I've made sure the tests are passing in various situations:

  • 'block': 'false' in rayStartParams (screenshot)

  • 'block': 'true' in rayStartParams (screenshot)

  • block not set (screenshot)

  • Test convertParamMap

import (
	"strings"
	"testing"
)

func TestConvertParamMap(t *testing.T) {
	rayStartParams := map[string]string{
		"booleanOptionsTrue":  "true",
		"booleanOptionsFalse": "false",
		"ParameterOptions":    "arguments",
		// The following are specialParameterOptions; their arguments can be true or false.
		"log-color":         "false",
		"include-dashboard": "true",
	}
	s := convertParamMap(rayStartParams)
	// Expected flags (order varies, since Go maps iterate in random order):
	//   --booleanOptionsTrue  --ParameterOptions=arguments  --log-color=false  --include-dashboard=true
	for _, flag := range []string{"--booleanOptionsTrue", "--ParameterOptions=arguments", "--log-color=false", "--include-dashboard=true"} {
		if !strings.Contains(s, flag) {
			t.Errorf("expected %q in %q", flag, s)
		}
	}
	// A boolean option set to "false" must be omitted entirely.
	if strings.Contains(s, "--booleanOptionsFalse") {
		t.Errorf("did not expect --booleanOptionsFalse in %q", s)
	}
}

@Yicheng-Lu-llll force-pushed the InjectBlockAutomatically branch from 220124b to a689ce6 on March 15, 2023
@kevin85421 left a comment:

  1. Add a unit test.

  2. Open an issue to remove --block from YAML files after 0.5.0, and update #940.

@kevin85421 left a comment:

LGTM!

Approving this PR without running it myself because: (1) this PR has enough unit tests, and (2) the screenshots in the PR description are reliable.

@kevin85421 kevin85421 merged commit 480e128 into ray-project:master Mar 16, 2023
@kevin85421 commented:

  • Without fault tolerance

    • Case 1 (without --block): Pods will never detect errors, including pkill gcs_server and ray stop -f.
    • Case 2 (with --block): Pods will be terminated after gcs_rpc_server_reconnect_timeout_s seconds (default: 60s; a quick check is sketched below).
  • With fault tolerance

    • Case 1 (without --block): Pods can detect errors via the liveness/readiness probes.
    • Case 2 (with --block): Pods will be terminated after gcs_rpc_server_reconnect_timeout_s seconds (default: 60s), or errors are detected earlier via the liveness/readiness probes.
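For reference, a quick way to see whether this timeout has been overridden on a pod (the RAY_gcs_rpc_server_reconnect_timeout_s environment variable is the tuning knob described in the GCS fault tolerance docs linked earlier; no output means the 60-second default is in effect):

kubectl exec -it $(kubectl get pods -o=name | grep worker) -- \
  printenv RAY_gcs_rpc_server_reconnect_timeout_s
# No output: the variable is unset and the default (60 seconds) applies.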

Development: successfully merging this pull request may close issue #915 ([Feature] Inject the --block option to ray start command automatically).