[Bug] compatibility test for the nightly Ray image fails #1055
Why are these changes needed?
Use an HTTP request to verify the Serve deployment after the cluster recovers from a failure, to fix [Serve] Cannot get serve deployment after a RayCluster recovers ray#34799.
For Ray 2.1.0, containers require tens of seconds to become "READY" after the Pod is running. In addition, the Serve deployment takes a few seconds to be ready to serve requests after all containers are "READY".
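A minimal sketch of such an HTTP readiness check, assuming a hypothetical Serve endpoint URL and expected response body (both are placeholders, not the actual test's values):

```python
import time
import urllib.request


def wait_for_serve(url: str, expected: str, timeout_s: float = 120.0) -> bool:
    """Poll a Serve HTTP endpoint until it returns the expected body.

    Retrying absorbs both the tens of seconds the containers need to become
    "READY" and the extra few seconds the deployment needs after that.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.read().decode() == expected:
                    return True
        except OSError:
            pass  # connection refused or timed out: cluster still recovering
        time.sleep(2)
    return False
```

Polling like this is more robust than checking Pod status alone, since a "READY" container does not yet guarantee the deployment can serve requests.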
With [Bug][GCS FT] Worker pods crash unexpectedly when gcs_server on head pod is killed #1036, the worker Pods will not crash after the GCS server process is killed. Hence, the output of `test_detached_actor_2.py` may differ depending on whether the actor is assigned to the head or a worker. `test_detached_actor_1.py` calls `increment()` twice, and `test_detached_actor_2.py` calls `increment()` once. Hence, if the actor is scheduled on a worker, the output of `test_detached_actor_2.py` should be 3. On the other hand, the output should be 1 if it is scheduled on the head Pod.

- Set `num-cpus: 0` to prevent the actor from being scheduled on the head.
- Update the assertion in `test_detached_actor_2.py` (`assert(val == 3)`).
- Set `ray_namespace`.
- Update `rayStartParams` in `ray-service.yaml.template`. A worker Pod with `node-ip-address: $$MY_POD_IP` cannot connect to the head (see [Bug] Job Sample YAML `ray_v1alpha1_rayjob.yaml` fails with empty `node-ip-address` `$MY_POD_IP` #805 for more details).
- Connect to GCS (`ray.init()`) rather than the Ray client (port 10001) ([Feature] Connect to RayCluster via GCS port rather than Ray client in compatibility test #848). Without using `ray.init()` instead of the Ray client, the tests become very unstable.
- (Bug? Ray 2.1.0) In some cases, HTTPProxy will not be created on the head Pod after the cluster recovers from a failure.
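The expected outputs above can be modeled with a plain-Python stand-in for the detached counter actor (the class and variable names are illustrative; the real tests use a Ray detached actor that survives driver exits):

```python
class Counter:
    """Simplified model of the detached actor used by the compatibility tests."""

    def __init__(self):
        self.val = 0

    def increment(self) -> int:
        self.val += 1
        return self.val


# test_detached_actor_1.py calls increment() twice.
actor = Counter()
actor.increment()
actor.increment()

# Case 1: the actor was scheduled on a worker Pod, so its state survives the
# head-Pod failure; test_detached_actor_2.py's single increment() returns 3.
assert actor.increment() == 3

# Case 2: the actor was on the head Pod and is restarted from scratch, so
# test_detached_actor_2.py sees a fresh counter; increment() returns 1.
restarted = Counter()
assert restarted.increment() == 1
```

This is why the test forces scheduling onto a worker (`num-cpus: 0` on the head) and asserts `val == 3`: only then is the post-recovery output deterministic.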
Experiments for 73b37b1
I ran `RayFTTestCase` 25 times on my devbox. `test_detached_actor` never fails. `test_ray_serve` fails 6 times:

- `test_ray_serve_1.py` * 5
- `test_ray_serve_2.py` * 1
The error message from `test_ray_serve_1.py` is from the link. The reason seems to be a failure to get HTTPProxy actors, which is similar to my observation above: "(Bug? Ray 2.1.0) In some cases, HTTPProxy will not be created on the head Pod after the cluster recovers from a failure." I will file an issue later, but the check seems to be legacy code.
Related issue number
Closes #1053
#848
Closes ray-project/ray#34799
Checks