
Conversation

@Wonki4 (Contributor) commented Aug 31, 2025

What type of PR is this?

This PR adds Ray plugin support for Volcano Job.

What this PR does / why we need it:

This PR lets users easily build a Ray cluster composed of head and worker nodes.

Example

--- # without ray plugin
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ray-cluster-job
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    svc: []
  queue: default
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: head
      template:
        spec:
          containers:
            - name: head
              command:
                - sh
                - -c
                - ray start --head --block --port=6379 --dashboard-host=0.0.0.0;
              image: rayproject/ray:latest-py311-cpu
              ports:
                - containerPort: 8265
                  name: dashboard
                - containerPort: 6379
                  name: gcs
                - containerPort: 10001
                  name: client
              resources: {}
          restartPolicy: OnFailure
    - replicas: 2
      name: worker
      template:
        spec:
          containers:
            - name: worker
              command:
                - sh
                - -c
                - |
                  ray start --block --address=ray-cluster-job-head-0.ray-cluster-job:6379
              image: rayproject/ray:latest-py311-cpu
              resources: {}
          restartPolicy: OnFailure 
--- # with ray plugin
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ray-cluster-job
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    ray: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  queue: default
  tasks:
    - replicas: 1
      name: head

      template:
        spec:
          containers:
            - name: head
              image: rayproject/ray:latest-py311-cpu
              resources: {}
          restartPolicy: OnFailure
    - replicas: 2
      name: worker
      template:
        spec:
          containers:
            - name: worker
              image: rayproject/ray:latest-py311-cpu
              resources: {}
          restartPolicy: OnFailure

If a user uses the ray plugin, configuring the cluster is easier than before. The plugin takes care of:

  • Head node command setup
  • Worker node command setup
  • Port setup (GCS, dashboard, and client server)
  • Ray cluster configuration
  • A Service connected to the head task pod
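As a rough illustration of the command injection the plugin performs, the sketch below builds the same `ray start` commands that appear in the manual example above. This is a hypothetical simplification, not the PR's actual implementation; the helper names and the `--dashboard-port` flag threading are assumptions.

```go
package main

import "fmt"

// headCommand sketches the command the plugin might inject into the head
// container: start Ray in head mode, blocking, on the configured GCS and
// dashboard ports. (Hypothetical helper; names are not from the PR.)
func headCommand(gcsPort, dashboardPort string) []string {
	return []string{"sh", "-c", fmt.Sprintf(
		"ray start --head --block --port=%s --dashboard-host=0.0.0.0 --dashboard-port=%s",
		gcsPort, dashboardPort)}
}

// workerCommand sketches the worker-side command: workers reach the head
// through the stable DNS name created by the svc plugin,
// <job>-<headTask>-0.<job>:<gcsPort>.
func workerCommand(jobName, headTask, gcsPort string) []string {
	addr := fmt.Sprintf("%s-%s-0.%s:%s", jobName, headTask, jobName, gcsPort)
	return []string{"sh", "-c", "ray start --block --address=" + addr}
}

func main() {
	fmt.Println(headCommand("6379", "8265")[2])
	fmt.Println(workerCommand("ray-cluster-job", "head", "6379")[2])
}
```

For the example job above, the worker address resolves to `ray-cluster-job-head-0.ray-cluster-job:6379`, matching the manually written command in the "without ray plugin" spec.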

Here are the ray plugin arguments:

  • head: the name of the head task
  • headContainer: the container name in the head task
  • worker: the name of the worker task
  • workerContainer: the container name in the worker task
  • port: the GCS port
  • dashboardPort: the Ray cluster dashboard port
  • clientServerPort: the client server port
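As a sketch, these arguments can be passed in the Job's plugins section using the same `--flag=value` form used by the PR's E2E test; the values below are illustrative, not defaults confirmed by the PR:

```yaml
spec:
  plugins:
    svc: []
    ray:
      - --head=head
      - --headContainer=head
      - --worker=worker
      - --workerContainer=worker
      - --port=6379
      - --dashboardPort=8265
      - --clientServerPort=10001
```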

The architecture of the Ray cluster in a Volcano Job:

[ray plugin architecture diagram (ray_plugin_diagram.drawio)]

Which issue(s) this PR fixes:

Fixes #4182

Special notes for your reviewer:

I know there is another PR for this issue (#4193), but there has been no progress for 2 months, so I decided to start working on it.

Does this PR introduce a user-facing change?


@volcano-sh-bot (Contributor)

Welcome @Wonki4!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @Wonki4, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a dedicated Ray plugin for Volcano, significantly simplifying the deployment and lifecycle management of Ray clusters within the Volcano job scheduling framework. By abstracting away complex manual configurations, it allows users to easily define and run distributed Ray applications, enhancing Volcano's capabilities for AI/ML and distributed computing workloads.

Highlights

  • Ray Plugin Introduction: A new ray plugin is added to Volcano, enabling simplified deployment and management of Ray clusters.
  • Automated Configuration: The plugin automates the setup of Ray head and worker node commands, port configurations (GCS, dashboard, client server), and creates a Kubernetes Service for the Ray head node.
  • Customizable Parameters: Users can specify custom names for head/worker tasks and containers, and configure Ray-specific ports via plugin arguments.

@volcano-sh-bot volcano-sh-bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 31, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new ray plugin for Volcano, which simplifies the deployment of Ray clusters. The implementation is well-structured, covering command injection for head and worker pods, service creation, and lifecycle management. My review focuses on improving code clarity, fixing a potential bug related to loop variable usage, and correcting a flaw in the new E2E test to ensure the plugin's functionality is properly verified. Addressing these points will enhance the robustness and maintainability of this new feature.

},
},
Plugins: map[string][]string{
"ray": {"--head=head", "--worker=worker", "--port=2345", "--headContainer=rayproject", "--workerContainer=rayproject"},


Severity: high

This E2E test is not correctly testing the plugin's main functionality. The plugin is configured here with "--headContainer=rayproject" and "--workerContainer=rayproject". However, the tasks defined in the job spec (lines 49-64) do not specify any containers. The test utility will create a default container, likely named "default-container".

Because the container names do not match, the plugin's OnPodCreate logic will never find the target containers, and therefore the ray start commands will not be injected. The test may pass if the base rayproject/ray image has a default entrypoint that doesn't immediately fail, but the plugin's core command injection logic is not being exercised.

To fix this, you should explicitly define containers named rayproject within the head and worker task specs.

@Wonki4 (Contributor, Author):

The container name is based on the container image repository name, so I set the container names to "rayproject". Should I change the way I test?

@volcano-sh-bot volcano-sh-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 31, 2025
@Wonki4 Wonki4 force-pushed the ray branch 2 times, most recently from cab9682 to 158a256 Compare August 31, 2025 12:24
@JesseStutler (Member)

/cc @Monokaix I think we can take on this feature in v1.13
/ok-to-test

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Sep 1, 2025
@Monokaix (Member) commented Sep 2, 2025

/cc @Monokaix I think we can take on this feature in v1.13 /ok-to-test

+1

@Monokaix (Member) commented Sep 2, 2025

Welcome! Thanks for your contribution.
Please check CI
[CI failure screenshot]

@Wonki4 (Contributor, Author) commented Sep 2, 2025

Okay! I'll fix it!

@Wonki4 Wonki4 force-pushed the ray branch 2 times, most recently from 4950290 to 88989b6 Compare September 6, 2025 16:27
@Wonki4 (Contributor, Author) commented Sep 6, 2025

@Monokaix
The problem with the ray e2e test was related to the ray image size (about 850 MB). I changed the test image and resolved the issue.
But I wonder: how large is the disk on the test Linux instance?

@kingeasternsun (Contributor)

Thanks for your contribution.

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 12, 2025
@kingeasternsun (Contributor)

/lgtm

@Monokaix (Member)

Please also add the yaml example into example/integrations: )

@Monokaix (Member)

Can you also check for functionalities and API fields we missed that could be added to this plugin? It seems a little simple; can it meet production requirements under the current implementation?

@Wonki4 (Contributor, Author) commented Sep 17, 2025

Can you also check about the functionalities and API fields we missed and can be added into this plugin? Seems it's a little simple, or if it can meet the production requirements under current implementation?

Okay, I'll check other scenarios that use Ray. This plugin focuses on supporting the creation of a Ray cluster.

@Monokaix (Member)

Can you also check about the functionalities and API fields we missed and can be added into this plugin? Seems it's a little simple, or if it can meet the production requirements under current implementation?

Okay, I'll check other scenarios using ray. This plugin focus on supporting to create a ray cluster.

Great, but you can resolve the other comments first; we can continuously iterate on this plugin: )

@volcano-sh-bot volcano-sh-bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 19, 2025
@Wonki4 Wonki4 force-pushed the ray branch 2 times, most recently from 12692a3 to c6dd4df Compare September 21, 2025 13:36
@Wonki4 (Contributor, Author) commented Sep 21, 2025

@Monokaix
The E2E test failed due to disk size.
The ray image is at least 400 MB (the ray package alone is about 300 MB), and the tensorflow images are also big. If the ray and tensorflow workers are scheduled on the same k8s node, the pods end up in the Pending state because there is not enough disk space.

Could we increase the disk size of the node? Or could you advise me on another way to solve it?

Sep 19 16:33:27 integration-worker3 kubelet[272]: E0919 16:33:27.945147 272 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to "StartContainer" for "tensorflow" with ErrImagePull: "failed to pull and unpack image \"docker.io/volcanosh/dist-mnist-tf-example:0.0.1\": failed to extract layer sha256:5061983e267fa42a4403de0a25f41e4ca5ba0c75db3b2a4ceae769f94035816b: write /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/154/fs/opt/conda/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: no space left on device"" pod="x2n8bo60/tensorflow-dist-mnist-ps-0" podUID="89d374cc-6b09-4498-a235-163431fcdb72"

@kingeasternsun (Contributor)

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 22, 2025
@Monokaix (Member)

(quoting the kubelet "no space left on device" error above)

You can use sudo docker system prune -a -f or sudo crictl rmi --prune to release space before running the e2e.

            - name: worker
              image: bitnami/ray:2.49.0
              resources: {}
          restartPolicy: Never
Member:

It seems changing to OnFailure is better, because the worker may not be able to connect to the head if the head starts after the worker.

Member:

I have tested locally and the worker failed to start, so we should change to OnFailure here.

@Wonki4 (Contributor, Author):

I changed the restartPolicy. (Never -> OnFailure)

| 6 | dashboardPort | string | 8265 | No | The port to open for the Ray dashboard | --dashboardPort=8265 |
| 7 | clientServerPort | string | 10001 | No | The port to open for the client server | --clientServerPort=10001 |

## Examples
Member:

Maybe we need also add ray cluster operation doc to users, like https://docs.ray.io/en/master/cluster/kubernetes/getting-started/raycluster-quick-start.html?

@Wonki4 (Contributor, Author):

Thank you for checking!
Adding that would make the docs more convenient and useful.

@Wonki4 (Contributor, Author) commented Sep 22, 2025

Because we assume that a user doesn't have a Ray operator,
I think the five docs below fit our users:

  1. Step 4 ~ Step 5 of the RayCluster Quick Guide can be useful (Step 1 ~ Step 3 are related to a Ray Operator)
  2. What we create and provide: Ray Cluster Key Concepts
  3. What we are based on: Launching an On-Premise Cluster
  4. How to use a ray cluster: the Ray Jobs API
  5. With a ray cluster, we can serve a model: Model Deploy on ray cluster

Could you comment on this? (I think these five links will help users work with a ray cluster.)
(If you have one, could you share a template for the ray cluster operation doc?)

Member:

I think it's ok: we have provided a ray cluster and users can run their jobs, so a simple guide that links to the upstream docs is enough.

@Wonki4 (Contributor, Author):

The simple guide was integrated into the documentation.

Member:

👍

@Monokaix (Member)

Thanks for your contribution! We are planning to release a new version this week, so it's better to resolve all comments so we can update and merge it today: )

@Wonki4 (Contributor, Author) commented Sep 23, 2025

Thanks for your contribution! And we are planning to release new version this week, so it's better to solve all comments to update and merge it today: )

Okay! I'll do it!

Signed-off-by: Wongi, Baek <qordnjsrl13@naver.com>
@volcano-sh-bot volcano-sh-bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 23, 2025
@Monokaix (Member)

/approve
Great job!

@volcano-sh-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Monokaix

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 24, 2025
@JesseStutler (Member)

/lgtm
Thanks

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 24, 2025
@volcano-sh-bot volcano-sh-bot merged commit c3ad660 into volcano-sh:master Sep 24, 2025
20 checks passed
@YeonghyeonKO

LGTM @Wonki4 Thanks for contributing your code.


Development

Successfully merging this pull request may close these issues.

Volcano job natively support Ray framework

6 participants