[Bug] compatibility test for the nightly Ray image fails #1055
Why are these changes needed?
Use an HTTP request to verify the Serve deployment after the cluster recovers from a failure, to fix [Serve] Cannot get serve deployment after a RayCluster recovers ray#34799.
For Ray 2.1.0, containers require tens of seconds to become "READY" after the Pod is running. In addition, the Serve deployment takes a few seconds to be ready to serve requests after all containers are "READY".
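A minimal sketch of such an HTTP readiness check, assuming a hypothetical Serve endpoint URL and expected response body (both are placeholders, not the actual test's values):

```python
import time
import urllib.request


def wait_for_serve(url: str, expected: str, timeout_s: float = 120.0) -> bool:
    """Poll a Serve HTTP endpoint until it returns the expected body.

    Retrying absorbs both the tens of seconds the containers need to become
    "READY" and the extra few seconds the deployment needs after that.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.read().decode() == expected:
                    return True
        except OSError:
            pass  # connection refused or timed out: cluster still recovering
        time.sleep(2)
    return False
```

Polling like this is more robust than checking Pod status alone, since a "READY" container does not yet guarantee the deployment can serve requests.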
With [Bug][GCS FT] Worker pods crash unexpectedly when gcs_server on head pod is killed #1036, the worker Pods will not crash after the GCS server process is killed. Hence, the output of `test_detached_actor_2.py` may differ depending on whether the actor is assigned to the head or a worker. `test_detached_actor_1.py` calls `increment()` twice, and `test_detached_actor_2.py` calls `increment()` once. Hence, if the actor is scheduled on a worker, the output of `test_detached_actor_2.py` should be 3. On the other hand, the output should be 1 if it is scheduled on the head Pod.

- Set `num-cpus: 0` to prevent the actor from being scheduled on the head.
- Update the assertion in `test_detached_actor_2.py` (`assert(val == 3)`).
- Set `ray_namespace`.
- Update `rayStartParams` in `ray-service.yaml.template`. A worker Pod with `node-ip-address: $$MY_POD_IP` cannot connect to the head (see [Bug] Job Sample YAML `ray_v1alpha1_rayjob.yaml` fails with empty `node-ip-address` `$MY_POD_IP` #805 for more details).
- Connect to GCS (`ray.init()`) rather than the Ray client (port 10001) ([Feature] Connect to RayCluster via GCS port rather than Ray client in compatibility test #848). Without using `ray.init()` instead of the Ray client, the tests become very unstable.
- (Bug? Ray 2.1.0) In some cases, HTTPProxy will not be created on the head Pod after the cluster recovers from a failure.
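The expected outputs above can be modeled with a plain-Python stand-in for the detached counter actor (the class and variable names are illustrative; the real tests use a Ray detached actor that survives driver exits):

```python
class Counter:
    """Simplified model of the detached actor used by the compatibility tests."""

    def __init__(self):
        self.val = 0

    def increment(self) -> int:
        self.val += 1
        return self.val


# test_detached_actor_1.py calls increment() twice.
actor = Counter()
actor.increment()
actor.increment()

# Case 1: the actor was scheduled on a worker Pod, so its state survives the
# head-Pod failure; test_detached_actor_2.py's single increment() returns 3.
assert actor.increment() == 3

# Case 2: the actor was on the head Pod and is restarted from scratch, so
# test_detached_actor_2.py sees a fresh counter; increment() returns 1.
restarted = Counter()
assert restarted.increment() == 1
```

This is why the test forces scheduling onto a worker (`num-cpus: 0` on the head) and asserts `val == 3`: only then is the post-recovery output deterministic.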
Experiments for 73b37b1
I ran `RayFTTestCase` 25 times on my devbox. `test_detached_actor` never fails. `test_ray_serve` fails 6 times:

- `test_ray_serve_1.py` * 5
- `test_ray_serve_2.py` * 1
The error message from `test_ray_serve_1.py` is from the link. The reason seems to be a failure to get HTTPProxy actors, which is similar to my observation above: "(Bug? Ray 2.1.0) In some cases, HTTPProxy will not be created on the head Pod after the cluster recovers from a failure." I will file an issue later, but the check seems to be legacy code.
Related issue number
Closes #1053
#848
Closes ray-project/ray#34799
Checks