[Serve] Kubernetes operator reapplies Serve config too frequently #28652

@shrekris-anyscale

What happened + What you expected to happen

From a Ray Slack thread:

[A user is] working with Kuberay and Ray 2.0.0 and [trying] to deploy a RayService. It was not easy to get the cluster to start up, because it failed and restarted often. [The user] managed to get it running, but [they] saw that the “task:run_graph” got aborted many times on startup.

The user posted logs from the Serve controller and the Kubernetes operator that suggest the controller was receiving new Serve config deployment requests several times per second, e.g.:

INFO 2022-09-20 00:39:34,046 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,288 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,446 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,572 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,750 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.

The requests in this excerpt arrive roughly 120 to 250 ms apart. That likely doesn't give Serve enough time to deploy the config completely before each request is canceled and re-issued, so the Serve application never finishes deploying. This in turn likely caused the cluster to be marked unhealthy and restarted.
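To make the request rate concrete, here is a small sketch that parses the timestamps of the five controller log lines quoted above and prints the gap between consecutive requests. The helper name `parse_timestamp` is made up for this example; the log lines themselves are copied verbatim from the excerpt.

```python
from datetime import datetime, timedelta

# Controller log lines copied from the excerpt above.
LOG_LINES = [
    "INFO 2022-09-20 00:39:34,046 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.",
    "INFO 2022-09-20 00:39:34,288 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.",
    "INFO 2022-09-20 00:39:34,446 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.",
    "INFO 2022-09-20 00:39:34,572 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.",
    "INFO 2022-09-20 00:39:34,750 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.",
]


def parse_timestamp(line: str) -> datetime:
    """Extract the timestamp from a line shaped like 'LEVEL DATE TIME,ms ...'."""
    _level, date, time, *_rest = line.split()
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S,%f")


timestamps = [parse_timestamp(line) for line in LOG_LINES]

# Gap in whole milliseconds between each request and the one before it.
gaps_ms = [
    (later - earlier) // timedelta(milliseconds=1)
    for earlier, later in zip(timestamps, timestamps[1:])
]

print(gaps_ms)  # [242, 158, 126, 178]
```

Every gap in this excerpt is well under one second, which matches the observation that each request is cancelled before Serve can finish deploying.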

Ideally, the Serve application should have enough time to be started without being interrupted.
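One possible way to get that behavior is for the caller to skip re-submitting a config that has not changed since the last submission. The sketch below is purely illustrative: it is not how the KubeRay operator (or Serve) is implemented, and `ConfigApplier`, `reconcile`, and the placeholder config contents are all hypothetical names invented for this example.

```python
import hashlib
import json
from typing import Callable, Optional


class ConfigApplier:
    """Hypothetical helper: forwards a config only when its content changes."""

    def __init__(self, submit: Callable[[dict], None]) -> None:
        self._submit = submit  # stand-in for "send the config to Serve"
        self._last_digest: Optional[str] = None

    def reconcile(self, config: dict) -> bool:
        """Submit `config` if it differs from the last one; return True if submitted."""
        # Serialize with sorted keys so equal configs always hash identically.
        payload = json.dumps(config, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        if digest == self._last_digest:
            return False  # unchanged: do not interrupt an in-flight deployment
        self._submit(config)
        self._last_digest = digest
        return True


# Simulate a reconcile loop that fires repeatedly with the same config.
submitted = []
applier = ConfigApplier(submitted.append)
config = {"import_path": "example_module.graph"}  # placeholder value

results = [applier.reconcile(config) for _ in range(5)]
changed = applier.reconcile({"import_path": "example_module.other_graph"})

print(results, changed, len(submitted))
```

With this guard, five identical reconcile passes produce a single submission, and only a genuinely different config triggers a second one.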

Versions / Dependencies

Ray 2.0.0 and Kuberay.

Reproduction script

See the Ray Slack thread for logs.

The user observed the issue on the FruitStand example.

Issue Severity

No response

Metadata

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
serve: Ray Serve Related Issue
