Description
What happened + What you expected to happen
From a Ray Slack thread:
[A user is] working with Kuberay and Ray 2.0.0, trying to deploy a RayService. It was not easy to get the cluster to start up, because it failed and restarted often. [The user] managed to get it running, but [they] saw that the “task:run_graph” got aborted a lot of times on startup.
The user posted logs from the Serve controller and the Kubernetes operator that imply the controller was receiving new Serve config deployment requests more than once per second. E.g.:
INFO 2022-09-20 00:39:34,046 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,288 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,446 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,572 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,750 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
This likely doesn't give Serve enough time to deploy the config completely before the request is canceled and re-issued, preventing the Serve application from being deployed. This in turn likely caused the cluster to be marked unhealthy and restarted.
Ideally, the Serve application should have enough time to be started without being interrupted.
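The starvation mechanism described above can be sketched with a minimal asyncio model (this is illustrative only, not Ray Serve code): a long-running "deploy" task that is cancelled and re-issued faster than it takes to finish never completes while requests keep arriving. The durations and function names are hypothetical, chosen to mirror the ~5 requests/second seen in the logs.

```python
import asyncio

DEPLOY_SECONDS = 1.0    # hypothetical time a full deploy needs
REISSUE_SECONDS = 0.2   # new config requests arrive ~5x/sec, as in the logs

completions = 0  # how many deploys actually finished

async def deploy_config():
    """Stand-in for deploying a Serve config; sleeps to simulate work."""
    global completions
    await asyncio.sleep(DEPLOY_SECONDS)
    completions += 1

async def controller(num_requests: int):
    task = None
    for _ in range(num_requests):
        if task is not None and not task.done():
            task.cancel()  # "Cancelling previous request."
        task = asyncio.create_task(deploy_config())
        await asyncio.sleep(REISSUE_SECONDS)
    # Only after requests stop arriving can the last deploy finish.
    await task

asyncio.run(controller(5))
```

Because each re-issue arrives before the in-flight deploy can finish (0.2 s < 1.0 s), `completions` stays at 0 for the whole request storm; the single successful deploy happens only once the storm stops, matching the hypothesis that the application never gets deployed while the operator keeps re-sending the config.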
Versions / Dependencies
Ray 2.0.0 and KubeRay.
Reproduction script
See the Ray Slack thread for logs.
The user observed the issue on the FruitStand example.
Issue Severity
No response