Open
Labels
P2 (important issue, but not time-critical), bug (something that is supposed to be working, but isn't), core (issues that should be addressed in Ray Core), usability
Description
What happened + What you expected to happen
- I am working in a managed Kubernetes environment. We have a three-node setup (managed K8S Deployment + Service + Ingress): one head node and two worker nodes. Using the Service and Ingress configurations, I expose port 8265 of my container through the (internal) URL http://head-node-dashboard.company.internal.domain.com, and port 6379 through http://head-node-gcs.company.internal.domain.com.
When I try to submit jobs to the dashboard URL, everything works fine:

ray job submit --working-dir ./ --address='http://head-node-dashboard.company.internal.domain.com' -- python ./script.py

But when I try to connect to the GCS, it fails. There are two ways that this happens:
- Connecting a worker node to the head node with ray start:

$ > ray start --address='head-node-gcs.company.internal.domain.com:80'
Local node IP: 10.251.222.101
2023-03-18 06:51:17,521 WARNING utils.py:1446 -- Unable to connect to GCS at head-node-gcs.company.internal.domain.com:80. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

- Connecting to the head node through ray.init():
$ > python
>>> import ray

a) If I connect without any protocol defined:
>>> ray.init(address='head-node-gcs.company.internal.domain.com:80')
2023-03-18 06:58:11,670 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: head-node-gcs.company.internal.domain.com:80...
2023-03-18 06:58:16,743 WARNING utils.py:1333 -- Unable to connect to GCS at head-node-gcs.company.internal.domain.com:80. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

b) If I connect with the http:// protocol specified:
>>> ray.init(address='http://head-node-gcs.company.internal.domain.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 1230, in init
builder = ray.client(address, _deprecation_warn_enabled=False)
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 382, in client
builder = _get_builder_from_address(address)
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 350, in _get_builder_from_address
assert "ClientBuilder" in dir(
AssertionError: Module: http does not have ClientBuilder.

c) If I connect with the ray:// protocol specified:
>>> ray.init(address='ray://head-node-gcs.company.internal.domain.com')
/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py:253: UserWarning: Ray Client connection timed out. Ensure that the Ray Client port on the head node is reachable from your local machine. See https://docs.ray.io/en/latest/cluster/ray-client.html#step-2-check-ports for more information.
warnings.warn(
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 1248, in init
ctx = builder.connect()
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 178, in connect
client_info_dict = ray.util.client_connect.connect(
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client_connect.py", line 47, in connect
conn = ray.connect(
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/__init__.py", line 252, in connect
conn = self.get_context().connect(*args, **kw_args)
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/__init__.py", line 94, in connect
self.client_worker = Worker(
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py", line 139, in __init__
self._connect_channel()
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py", line 260, in _connect_channel
raise ConnectionError("ray client connection timeout")
ConnectionError: ray client connection timeout

- The worker-to-head-node connection should work with the URL specified. It works if I give the local IP of the head node:
$ > ray start --address='10.251.222.100:6379'
Local node IP: 10.251.222.101
2023-03-18 07:20:41,943 WARNING services.py:1791 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=2.47gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
[2023-03-18 07:20:41,964 I 115596 115596] global_state_accessor.cc:356: This node has an IP address of 10.251.222.101, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
--------------------
Ray runtime started.
--------------------
To terminate the Ray runtime, run
ray stop

This is the behavior I'm hoping to get from the command ray start --address='head-node-gcs.company.internal.domain.com:80'.
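Since the IP works but the DNS name doesn't, a quick sanity check is whether the name actually exposes a raw TCP path to the GCS port at all. The GCS speaks gRPC over plain TCP, so an Ingress that answers HTTP GETs for the host does not imply this check will pass. A minimal sketch (function name is mine, not a Ray API):

```python
import socket

def tcp_reachable(host, port, timeout=3.0):
    """Return True if a raw TCP connection to (host, port) succeeds.

    A plain HTTP 200 from the Ingress URL is not sufficient evidence:
    Ray's GCS traffic is gRPC over TCP, not HTTP, so this checks the
    thing ray start actually needs.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `tcp_reachable('head-node-gcs.company.internal.domain.com', 80)` returning False while `tcp_reachable('10.251.222.100', 6379)` returns True would point at the Ingress layer rather than at Ray.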
- This is the relevant part of the Service config of the head node:
ports:
  - name: ray-dashboard
    port: 8265
    targetPort: 8265
    protocol: TCP
  - name: ray-gcs
    port: 6379
    targetPort: 6379
    protocol: TCP
  - name: ray-client
    port: 10001
    targetPort: 10001
    protocol: TCP
  - name: ray-serve
    port: 8000
    targetPort: 8000
    protocol: TCP
type: ClusterIP

This is the relevant part of the Ingress config of the head node:
spec:
  rules:
    - host: head-node-dashboard.company.internal.domain.com
      http:
        paths:
          - path: /
            backend:
              serviceName: head-node-svc
              servicePort: 8265
    - host: head-node-gcs.company.internal.domain.com
      http:
        paths:
          - path: /
            backend:
              serviceName: head-node-svc
              servicePort: 6379
    - host: head-node-client.company.internal.domain.com
      http:
        paths:
          - path: /
            backend:
              serviceName: head-node-svc
              servicePort: 10001
    - host: head-node-serve.company.internal.domain.com
      http:
        paths:
          - path: /
            backend:
              serviceName: head-node-svc
              servicePort: 8000

Versions / Dependencies
$ > ray --version
ray, version 2.3.0
$ > python --version
Python 3.7.4
$ > uname -a
Linux head-node-659568794c-rwmpk 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 GNU/Linux

Reproduction script
I don't think this is reproducible since I'm running this in a managed Kubernetes environment. But the Service and Ingress configuration snippets provided above should help set up the basic networking.
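The http:// AssertionError from case b), at least, is reproducible without any cluster: judging from the traceback, ray.client_builder._get_builder_from_address splits the address on "://", imports the module named after the scheme, and asserts it provides a ClientBuilder. A simplified sketch of that dispatch (paraphrased, not Ray's actual code; `resolve_address` is my name):

```python
import importlib

def resolve_address(address):
    """Simplified sketch of how ray.init() appears to dispatch on the
    address scheme (paraphrased from the client_builder.py traceback)."""
    if "://" not in address:
        return "direct-gcs"   # e.g. 'host:6379' -> connect straight to GCS
    scheme = address.split("://", 1)[0]
    if scheme == "ray":
        return "ray-client"   # 'ray://host:10001' -> Ray Client builder
    # Any other scheme: import the module named after the scheme and
    # expect it to define a ClientBuilder. The stdlib 'http' module has
    # no such attribute, hence the AssertionError above.
    module = importlib.import_module(scheme)
    assert "ClientBuilder" in dir(module), (
        f"Module: {scheme} does not have ClientBuilder."
    )
    return module
```

So `resolve_address('head-node-gcs.company.internal.domain.com:80')` takes the direct-GCS path, while `resolve_address('http://...')` raises the same AssertionError seen in case b).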
Issue Severity
High: It blocks me from completing my task.
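Not a fix, but a possible workaround sketch: since the GCS (6379) and Ray Client (10001) ports carry gRPC/raw TCP rather than HTTP, exposing them through a TCP-capable Service (e.g. type LoadBalancer, or an Ingress controller's TCP passthrough feature) instead of an HTTP Ingress may behave differently. Untested; the Service name and the `app: ray-head` selector below are illustrative assumptions, not taken from my cluster:

```yaml
# Untested sketch: expose GCS and Ray Client over raw TCP instead of
# routing them through an HTTP Ingress. Adjust the selector to match
# the head-node Deployment's pod labels.
apiVersion: v1
kind: Service
metadata:
  name: head-node-tcp
spec:
  type: LoadBalancer
  selector:
    app: ray-head      # assumed pod label
  ports:
    - name: ray-gcs
      port: 6379
      targetPort: 6379
      protocol: TCP
    - name: ray-client
      port: 10001
      targetPort: 10001
      protocol: TCP
```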