Conversation

Yicheng-Lu-llll commented Feb 26, 2023

Why are these changes needed?

This PR:

  1. Injects the --block option into the ray start command automatically if the user has not set the --block option. See #675 (Remove ray-cluster.without-block.yaml) and this comment on #912 ([Feature][Docs] Explain how to specify container command for head pod).
  2. According to the ray start options documentation, many parameters should be settable in the form 'parameter': 'true' or 'parameter': 'false' (e.g. --block, --disable-usage-stats) in rayStartParams. However, KubeRay could not handle a 'parameter': 'false' entry (it would fail to run all the pods). This PR enables the user to set a 'false' value; see the sketch after this list.
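To illustrate point 2, here is a minimal sketch of the mapping (the rayStartParams values are hypothetical examples; the generated commands are modeled on the real ones quoted further down in this description):

# rayStartParams entry          ->  generated ray start flag
#   'num-cpus': '1'             ->  --num-cpus=1
#   'block': 'true'             ->  --block                   (boolean option set to true: bare flag)
#   'block': 'false'            ->  (omitted)                 (boolean option set to false: no flag, no error)
#   'include-dashboard': 'true' ->  --include-dashboard=true  (special option: argument is kept)
#
# With 'block': 'true' (or with block unset, which this PR now treats the same way),
# the head command looks like:
#   ulimit -n 65536; ray start --head --num-cpus=1 --block --dashboard-host=0.0.0.0
# With 'block': 'false', ray start returns immediately, so the container needs the
# trailing 'sleep infinity' to stay alive:
#   ulimit -n 65536; ray start --head --num-cpus=1 --dashboard-host=0.0.0.0 && sleep infinity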

Proof that, without --block, it can take much longer to detect an unhealthy condition and restart the pod:

You can find without_block.yaml and with_block.yaml here.

# without block case:
# Create the head and worker pods.
# Note: we need to enable fault tolerance here, because the readiness and liveness
# probes are only installed when fault tolerance is enabled.
# See https://github.com/ray-project/kuberay/blob/master/docs/guidance/gcs-ft.md
kubectl apply -f /home/lyc/without_block.yaml
#  ulimit -n 65536; ray start  --num-cpus=1  --metrics-export-port=8080  --address=raycluster-external-redis-head-svc:6379  && sleep infinity
#  ulimit -n 65536; ray start --head  --num-cpus=1  --metrics-export-port=8080  --dashboard-host=0.0.0.0  && sleep infinity

# Kill the GCS server; the pod will keep running.
# Note: you could also run 'ray stop -f' to kill all ray processes instead.
# Either way the pod keeps running, because the 'sleep infinity' process stays alive.
date && kubectl exec -it $(kubectl get pods -o=name | grep head) --  pkill gcs_server
# List all events sorted by creation time to see when the liveness probe fails.
kubectl get events --sort-by='{.metadata.creationTimestamp}' -o yaml | grep -E 'message|firstTimestamp' 
# get pod's last exit time
kubectl get  $(kubectl get pods -o=name | grep head) -o yaml | grep finishedAt
# get pod's restart time
kubectl get  $(kubectl get pods -o=name | grep head) -o yaml | grep startedAt


# I ran the steps above three times; here are the results:

#   kill gcs server at:                    Sun Feb 26 04:59:39 UTC 2023
#   liveness/readiness probe fails at:     "2023-02-26T05:01:41Z"
#   pod finishedAt:                        "2023-02-26T05:04:04Z"
#   pod restartedAt:                       "2023-02-26T05:04:04Z"

#   kill gcs server at:                    Sun Feb 26 05:23:51 UTC 2023
#   liveness/readiness probe fails at:     "2023-02-26T05:25:54Z"
#   pod finishedAt:                        "2023-02-26T05:28:17Z"
#   pod restartedAt:                       "2023-02-26T05:28:17Z"

#   kill gcs server at:                    Sun Feb 26 05:38:52 UTC 2023
#   liveness/readiness probe fails at:     "2023-02-26T05:40:56Z"
#   pod finishedAt:                        "2023-02-26T05:43:19Z"
#   pod restartedAt:                       "2023-02-26T05:43:19Z"

# with block case:
# Create the head and worker pods.
kubectl apply -f /home/lyc/with_block.yaml
# 'ulimit -n 65536; ray start  --address=raycluster-complete-head-svc:6379  --metrics-export-port=8080  --num-cpus=1  --memory=1000000000  --block '
# 'ulimit -n 65536; ray start --head  --num-cpus=1  --memory=2000000000  --block  --dashboard-host=0.0.0.0  --metrics-export-port=8080 '

# Kill all ray processes; the pod will exit almost immediately.
date && kubectl exec -it $(kubectl get pods -o=name | grep head) --  ray stop -f
# get pod's last exit time
kubectl get  $(kubectl get pods -o=name | grep head) -o yaml | grep finishedAt
# get pod's restart time
kubectl get  $(kubectl get pods -o=name | grep head) -o yaml | grep startedAt


# I ran the steps above three times; here are the results:

#   kill all ray processes at:   Sun Feb 26 04:44:50 UTC 2023
#   pod finishedAt :             "2023-02-26T04:44:53Z"
#   pod restartedAt:             "2023-02-26T04:44:53Z"

#   kill all ray processes at:   Sun Feb 26 04:51:07 UTC 2023
#   pod finishedAt :             "2023-02-26T04:51:09Z"
#   pod restartedAt:             "2023-02-26T04:51:10Z"

#   kill all ray processes at:   Sun Feb 26 04:55:09 UTC 2023
#   pod finishedAt :             "2023-02-26T04:55:11Z"
#   pod restartedAt:             "2023-02-26T04:55:12Z"

To summarize the results, in a single-node, otherwise healthy environment:

  • With the --block option, the pod restarts within a few seconds of the ray processes being killed.
  • Without the --block option,
    • the liveness/readiness probes fail about 123 seconds after the ray processes are killed (122 s, 123 s, and 124 s in the three runs above), and
    • the pod restarts 143 seconds after the probes fail.

Questions:

Should we enforce adding the --block option (even if the user sets 'block': 'false' in rayStartParams)?

Not every user enables fault tolerance, and the readiness and liveness probes are only installed when fault tolerance is enabled. That is, if fault tolerance is off and the ray processes somehow crash, KubeRay will never detect the failure and restart the pod for the user.
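To make the probe situation concrete, here is one way to check whether the head pod actually has a liveness probe installed (same kubectl style as the experiments above; jsonpath prints nothing when no probe is configured):

kubectl get $(kubectl get pods -o=name | grep head) \
  -o jsonpath='{.spec.containers[0].livenessProbe}'
# Empty output: no liveness probe, so without --block a crashed ray process
# inside a still-sleeping container would never be detected.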

Related issue number

Closes #915

Checks

I've made sure the tests are passing in various situations:

  • 'block': 'false' in rayStartParams (screenshot)

  • 'block': 'true' in rayStartParams (screenshot)

  • block not set (screenshot)

  • Test convertParamMap

import (
	"strings"
	"testing"
)

func TestConvertParamMap(t *testing.T) {
	rayStartParams := map[string]string{
		"booleanOptionsTrue":  "true",
		"booleanOptionsFalse": "false",
		"ParameterOptions":    "arguments",
		// The following are specialParameterOptions; their arguments can be true or false.
		"log-color":         "false",
		"include-dashboard": "true",
	}
	s := convertParamMap(rayStartParams)
	// Expected flags (order varies, since Go maps iterate in random order):
	//   --booleanOptionsTrue  --ParameterOptions=arguments  --log-color=false  --include-dashboard=true
	for _, flag := range []string{"--booleanOptionsTrue", "--ParameterOptions=arguments", "--log-color=false", "--include-dashboard=true"} {
		if !strings.Contains(s, flag) {
			t.Errorf("expected %q in %q", flag, s)
		}
	}
	// A boolean option set to "false" must be omitted entirely.
	if strings.Contains(s, "--booleanOptionsFalse") {
		t.Errorf("did not expect --booleanOptionsFalse in %q", s)
	}
}

@Yicheng-Lu-llll force-pushed the InjectBlockAutomatically branch from 220124b to a689ce6 on March 15, 2023
@kevin85421 left a comment:

  1. Add a unit test.

  2. Open an issue to remove --block from YAML files after 0.5.0, and update #940.

@kevin85421 left a comment:

LGTM!

Approving this PR without running it myself because: (1) this PR has enough unit tests, and (2) the screenshots in the PR description are reliable.

@kevin85421 kevin85421 merged commit 480e128 into ray-project:master Mar 16, 2023
@kevin85421 commented:

  • Without fault tolerance

    • Case 1 (without --block): Pods will never detect errors, including pkill gcs_server and ray stop -f.
    • Case 2 (with --block): Pods will be terminated after gcs_rpc_server_reconnect_timeout_s seconds (default: 60s; a quick check is sketched below).
  • With fault tolerance

    • Case 1 (without --block): Pods can detect errors via the liveness/readiness probes.
    • Case 2 (with --block): Pods will be terminated after gcs_rpc_server_reconnect_timeout_s seconds (default: 60s), or errors are detected earlier via the liveness/readiness probes.
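For reference, a quick way to see whether this timeout has been overridden on a pod (the RAY_gcs_rpc_server_reconnect_timeout_s environment variable is the tuning knob described in the GCS fault tolerance docs linked earlier; no output means the 60-second default is in effect):

kubectl exec -it $(kubectl get pods -o=name | grep worker) -- \
  printenv RAY_gcs_rpc_server_reconnect_timeout_s
# No output: the variable is unset and the default (60 seconds) applies.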

Development: successfully merging this pull request may close issue #915 ([Feature] Inject the --block option to ray start command automatically).