
[Core] Cannot Connect to Head Node GCS Through URL #33428

@RishabhMalviya

Description

What happened + What you expected to happen

  1. I am working in a managed Kubernetes environment. We have three nodes set up (each as a managed K8s Deployment + Service + Ingress): one head node and two worker nodes. Using the Service and Ingress configurations, I expose port 8265 of my container through the (internal) URL http://head-node-dashboard.company.internal.domain.com, and port 6379 through http://head-node-gcs.company.internal.domain.com.

When I try to submit jobs to the dashboard URL, everything works fine:

ray job submit --working-dir ./ --address='http://head-node-dashboard.company.internal.domain.com' -- python ./script.py

But when I try to connect to the GCS, it fails. There are two ways that this happens:

  • Connecting a worker node to the head node with ray start:
$ > ray start --address='head-node-gcs.company.internal.domain.com:80'
Local node IP: 10.251.222.101
2023-03-18 06:51:17,521 WARNING utils.py:1446 -- Unable to connect to GCS at head-node-gcs.company.internal.domain.com:80. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
  • Connecting to the head node through ray.init().
$ > python
>>> import ray

a) If I connect without any protocol defined:

>>> ray.init(address='head-node-gcs.company.internal.domain.com:80')
2023-03-18 06:58:11,670 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: head-node-gcs.company.internal.domain.com:80...
2023-03-18 06:58:16,743 WARNING utils.py:1333 -- Unable to connect to GCS at head-node-gcs.company.internal.domain.com:80. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

b) If I connect with the http:// protocol specified:

>>> ray.init(address='http://head-node-gcs.company.internal.domain.com')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 1230, in init
    builder = ray.client(address, _deprecation_warn_enabled=False)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 382, in client
    builder = _get_builder_from_address(address)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 350, in _get_builder_from_address
    assert "ClientBuilder" in dir(
AssertionError: Module: http does not have ClientBuilder.

c) If I connect with the ray:// protocol specified:

>>> ray.init(address='ray://head-node-gcs.company.internal.domain.com')
/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py:253: UserWarning: Ray Client connection timed out. Ensure that the Ray Client port on the head node is reachable from your local machine. See https://docs.ray.io/en/latest/cluster/ray-client.html#step-2-check-ports for more information.
  warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 1248, in init
    ctx = builder.connect()
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 178, in connect
    client_info_dict = ray.util.client_connect.connect(
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client_connect.py", line 47, in connect
    conn = ray.connect(
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/__init__.py", line 252, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/__init__.py", line 94, in connect
    self.client_worker = Worker(
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py", line 139, in __init__
    self._connect_channel()
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py", line 260, in _connect_channel
    raise ConnectionError("ray client connection timeout")
ConnectionError: ray client connection timeout
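
The three address forms tried above differ only in the prefix before `://`. The traceback in (b) suggests that `ray.init()` treats that prefix as a module name and looks for a `ClientBuilder` in it, which would explain why `http://` fails with an assertion instead of a connection error. Below is a minimal sketch of that kind of splitting. `split_address` is a hypothetical helper written for illustration, not Ray's actual code.

```python
from typing import Optional, Tuple


def split_address(address: str) -> Tuple[Optional[str], str]:
    """Split 'scheme://rest' into (scheme, rest).

    Hypothetical helper: returns (None, address) when no scheme is present.
    """
    if "://" not in address:
        return None, address
    scheme, rest = address.split("://", 1)
    return scheme, rest


# The three forms from cases (a), (b) and (c).
for addr in (
    "head-node-gcs.company.internal.domain.com:80",
    "http://head-node-gcs.company.internal.domain.com",
    "ray://head-node-gcs.company.internal.domain.com",
):
    print(split_address(addr))
```

Under this reading, only case (a) is handled as a direct GCS address, case (c) goes through Ray Client, and case (b) never attempts a connection at all.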
  2. The worker-to-head-node connection should work with the URL specified. It works if I give the local IP of the head node:
$ > ray start --address='10.251.222.100:6379'
Local node IP: 10.251.222.101
2023-03-18 07:20:41,943 WARNING services.py:1791 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=2.47gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
[2023-03-18 07:20:41,964 I 115596 115596] global_state_accessor.cc:356: This node has an IP address of 10.251.222.101, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.

--------------------
Ray runtime started.
--------------------

To terminate the Ray runtime, run
  ray stop

This is the behavior I'm hoping to get from the command ray start --address='head-node-gcs.company.internal.domain.com:80'.
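
To compare the two addresses at the network level, a plain TCP probe can help separate "nothing is listening" from "something answers but Ray cannot talk to it". The sketch below uses only the standard library; `can_open_tcp` is a hypothetical helper name. A successful probe only shows that some listener (possibly the Ingress controller, not the GCS) accepted the connection.

```python
import socket


def can_open_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and performs the handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example usage with the two addresses from this report:
#   can_open_tcp("10.251.222.100", 6379)
#   can_open_tcp("head-node-gcs.company.internal.domain.com", 80)
```

If both probes succeed but only the direct IP works with ray start, the difference is likely in what sits between the client and port 6379, not in basic reachability.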

  3. This is the relevant part of the Service config of the head node:
  ports:
  - name: ray-dashboard
    port: 8265
    targetPort: 8265
    protocol: TCP
  - name: ray-gcs
    port: 6379
    targetPort: 6379
    protocol: TCP
  - name: ray-client
    port: 10001
    targetPort: 10001
    protocol: TCP
  - name: ray-serve
    port: 8000
    targetPort: 8000
    protocol: TCP
  type: ClusterIP

This is the relevant part of the Ingress config of the head node:

spec:
  rules:
  - host: head-node-dashboard.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 8265
  - host: head-node-gcs.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 6379
  - host: head-node-client.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 10001
  - host: head-node-serve.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 8000
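
The Service and Ingress snippets above can be condensed into a lookup that shows which backend port each address ends up at. This is a sketch that assumes the Ingress routes purely by host name, as the rules are written. The dictionaries copy values from the configs, and `backend_for` is a hypothetical helper.

```python
from urllib.parse import urlsplit

# Ingress host -> servicePort, copied from the Ingress rules.
INGRESS_ROUTES = {
    "head-node-dashboard.company.internal.domain.com": 8265,
    "head-node-gcs.company.internal.domain.com": 6379,
    "head-node-client.company.internal.domain.com": 10001,
    "head-node-serve.company.internal.domain.com": 8000,
}

# Service port -> port name, copied from the Service config.
SERVICE_PORT_NAMES = {
    8265: "ray-dashboard",
    6379: "ray-gcs",
    10001: "ray-client",
    8000: "ray-serve",
}


def backend_for(address: str) -> str:
    """Return the Service port that the address's host is routed to."""
    # urlsplit only recognizes the host when a scheme or '//' is present.
    if "://" not in address:
        address = "//" + address
    host = urlsplit(address).hostname
    port = INGRESS_ROUTES[host]
    return f"{SERVICE_PORT_NAMES[port]} ({port})"


print(backend_for("ray://head-node-gcs.company.internal.domain.com"))
```

Under that assumption, the `ray://` attempt in case (c) is routed to the ray-gcs port, while the host mapped to the ray-client port is head-node-client.company.internal.domain.com.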

Versions / Dependencies

$ > ray --version
ray, version 2.3.0

$ > python --version
Python 3.7.4

$ > uname -a
Linux head-node-659568794c-rwmpk 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 GNU/Linux

Reproduction script

I don't think this is reproducible since I'm running this in a managed Kubernetes environment. However, the Service and Ingress configuration snippets provided above should help set up the basic networking.

Issue Severity

High: It blocks me from completing my task.

Labels

P2 (important issue, but not time-critical), bug (something that is supposed to be working, but isn't), core (issues that should be addressed in Ray Core), usability
