
Conversation

@Wonki4 (Contributor) commented Aug 31, 2025

What type of PR is this?

This PR adds Ray plugin support for Volcano Job.

What this PR does / why we need it:

This PR lets users easily build a Ray cluster composed of head and worker nodes.

Example

--- # without ray plugin
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ray-cluster-job
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    svc: []
  queue: default
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: head
      template:
        spec:
          containers:
            - name: head
              command:
                - sh
                - -c
                - ray start --head --block --port=6379 --dashboard-host=0.0.0.0;
              image: rayproject/ray:latest-py311-cpu
              ports:
                - containerPort: 8265
                  name: dashboard
                - containerPort: 6379
                  name: gcs
                - containerPort: 10001
                  name: client
              resources: {}
          restartPolicy: OnFailure
    - replicas: 2
      name: worker
      template:
        spec:
          containers:
            - name: worker
              command:
                - sh
                - -c
                - |
                  ray start --block --address=ray-cluster-job-head-0.ray-cluster-job:6379
              image: rayproject/ray:latest-py311-cpu
              resources: {}
          restartPolicy: OnFailure 
--- # with ray plugin
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ray-cluster-job
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    ray: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  queue: default
  tasks:
    - replicas: 1
      name: head

      template:
        spec:
          containers:
            - name: head
              image: rayproject/ray:latest-py311-cpu
              resources: {}
          restartPolicy: OnFailure
    - replicas: 2
      name: worker
      template:
        spec:
          containers:
            - name: worker
              image: rayproject/ray:latest-py311-cpu
              resources: {}
          restartPolicy: OnFailure

If a user uses the ray plugin, configuring the cluster is easier than before. The plugin takes care of:

  • Head node command setup
  • Worker node command setup
  • Port setup (GCS, dashboard, and client server)
  • Ray cluster configuration
  • A Service connected to the head task pod
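As a rough illustration of the command injection the plugin performs, the sketch below builds the same `ray start` commands that appear in the manual example above. This is a hypothetical simplification, not the PR's actual implementation; the helper names and the `--dashboard-port` flag threading are assumptions.

```go
package main

import "fmt"

// headCommand sketches the command the plugin might inject into the head
// container: start Ray in head mode, blocking, on the configured GCS and
// dashboard ports. (Hypothetical helper; names are not from the PR.)
func headCommand(gcsPort, dashboardPort string) []string {
	return []string{"sh", "-c", fmt.Sprintf(
		"ray start --head --block --port=%s --dashboard-host=0.0.0.0 --dashboard-port=%s",
		gcsPort, dashboardPort)}
}

// workerCommand sketches the worker-side command: workers reach the head
// through the stable DNS name created by the svc plugin,
// <job>-<headTask>-0.<job>:<gcsPort>.
func workerCommand(jobName, headTask, gcsPort string) []string {
	addr := fmt.Sprintf("%s-%s-0.%s:%s", jobName, headTask, jobName, gcsPort)
	return []string{"sh", "-c", "ray start --block --address=" + addr}
}

func main() {
	fmt.Println(headCommand("6379", "8265")[2])
	fmt.Println(workerCommand("ray-cluster-job", "head", "6379")[2])
}
```

For the example job above, the worker address resolves to `ray-cluster-job-head-0.ray-cluster-job:6379`, matching the manually written command in the "without ray plugin" spec.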

Here are the ray plugin arguments:

  • head: the name of the head task
  • headContainer: the container name in the head task
  • worker: the name of the worker task
  • workerContainer: the container name in the worker task
  • port: the GCS port
  • dashboardPort: the Ray cluster dashboard port
  • clientServerPort: the client server port
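As a sketch, these arguments can be passed in the Job's plugins section using the same `--flag=value` form used by the PR's E2E test; the values below are illustrative, not defaults confirmed by the PR:

```yaml
spec:
  plugins:
    svc: []
    ray:
      - --head=head
      - --headContainer=head
      - --worker=worker
      - --workerContainer=worker
      - --port=6379
      - --dashboardPort=8265
      - --clientServerPort=10001
```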

The architecture of the Ray cluster in a Volcano Job:

[ray plugin architecture diagram (ray_plugin_diagram.drawio)]

Which issue(s) this PR fixes:

Fixes #4182

Special notes for your reviewer:

I know there is another PR for this issue (#4193), but there has been no progress for 2 months, so I decided to start working on it.

Does this PR introduce a user-facing change?


@volcano-sh-bot (Contributor)

Welcome @Wonki4!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @Wonki4, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a dedicated Ray plugin for Volcano, significantly simplifying the deployment and lifecycle management of Ray clusters within the Volcano job scheduling framework. By abstracting away complex manual configurations, it allows users to easily define and run distributed Ray applications, enhancing Volcano's capabilities for AI/ML and distributed computing workloads.

Highlights

  • Ray Plugin Introduction: A new ray plugin is added to Volcano, enabling simplified deployment and management of Ray clusters.
  • Automated Configuration: The plugin automates the setup of Ray head and worker node commands, port configurations (GCS, dashboard, client server), and creates a Kubernetes Service for the Ray head node.
  • Customizable Parameters: Users can specify custom names for head/worker tasks and containers, and configure Ray-specific ports via plugin arguments.

@volcano-sh-bot volcano-sh-bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 31, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new ray plugin for Volcano, which simplifies the deployment of Ray clusters. The implementation is well-structured, covering command injection for head and worker pods, service creation, and lifecycle management. My review focuses on improving code clarity, fixing a potential bug related to loop variable usage, and correcting a flaw in the new E2E test to ensure the plugin's functionality is properly verified. Addressing these points will enhance the robustness and maintainability of this new feature.

},
},
Plugins: map[string][]string{
"ray": {"--head=head", "--worker=worker", "--port=2345", "--headContainer=rayproject", "--workerContainer=rayproject"},


Severity: high

This E2E test is not correctly testing the plugin's main functionality. The plugin is configured here with "--headContainer=rayproject" and "--workerContainer=rayproject". However, the tasks defined in the job spec (lines 49-64) do not specify any containers. The test utility will create a default container, likely named "default-container".

Because the container names do not match, the plugin's OnPodCreate logic will never find the target containers, and therefore the ray start commands will not be injected. The test may pass if the base rayproject/ray image has a default entrypoint that doesn't immediately fail, but the plugin's core command injection logic is not being exercised.

To fix this, you should explicitly define containers named rayproject within the head and worker task specs.

@Wonki4 (Contributor, Author):

The container name is based on the container image repository name, so I set the container names to "rayproject". Should I change the way I test?

@volcano-sh-bot volcano-sh-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 31, 2025
@Wonki4 Wonki4 force-pushed the ray branch 2 times, most recently from cab9682 to 158a256 Compare August 31, 2025 12:24
@JesseStutler (Member)

/cc @Monokaix I think we can take on this feature in v1.13
/ok-to-test

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Sep 1, 2025
@Monokaix (Member) commented Sep 2, 2025

/cc @Monokaix I think we can take on this feature in v1.13 /ok-to-test

+1

@Monokaix (Member) commented Sep 2, 2025

Welcome! Thanks for your contribution.
Please check CI
[CI failure screenshot]

@Wonki4 (Contributor, Author) commented Sep 2, 2025

Okay! I'll fix it!

@Wonki4 Wonki4 force-pushed the ray branch 2 times, most recently from 4950290 to 88989b6 Compare September 6, 2025 16:27
@Wonki4 (Contributor, Author) commented Sep 6, 2025

@Monokaix
The problem with the ray e2e test was related to the ray image size (about 850 MB). I changed the test image and resolved the issue.
But I wonder: how large is the disk on the test Linux instance?

@kingeasternsun (Contributor)

Thanks for your contribution.

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 12, 2025
@kingeasternsun (Contributor)

/lgtm

@Monokaix (Member)

Please also add the yaml example into example/integrations: )

@Monokaix (Member)

Can you also check for functionalities and API fields we missed that could be added to this plugin? It seems a little simple; can it meet production requirements under the current implementation?

@Wonki4 (Contributor, Author) commented Sep 17, 2025

Can you also check about the functionalities and API fields we missed and can be added into this plugin? Seems it's a little simple, or if it can meet the production requirements under current implementation?

Okay, I'll check other scenarios that use Ray. This plugin focuses on supporting the creation of a Ray cluster.

@Monokaix (Member)

Can you also check about the functionalities and API fields we missed and can be added into this plugin? Seems it's a little simple, or if it can meet the production requirements under current implementation?

Okay, I'll check other scenarios using ray. This plugin focus on supporting to create a ray cluster.

Great, but you can resolve the other comments first; we can continuously iterate on this plugin: )

@volcano-sh-bot volcano-sh-bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 19, 2025
@Wonki4 Wonki4 force-pushed the ray branch 2 times, most recently from 12692a3 to c6dd4df Compare September 21, 2025 13:36
@Wonki4 (Contributor, Author) commented Sep 21, 2025

@Monokaix
The E2E test failed due to disk size.
The ray image is at least 400 MB (the ray package alone is about 300 MB), and the tensorflow images are also big. If the ray and tensorflow workers are scheduled on the same k8s node, the pods end up in the Pending state because there is not enough disk space.

Could we increase the disk size of the node? Or could you advise me on another way to solve it?

Sep 19 16:33:27 integration-worker3 kubelet[272]: E0919 16:33:27.945147 272 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to "StartContainer" for "tensorflow" with ErrImagePull: "failed to pull and unpack image \"docker.io/volcanosh/dist-mnist-tf-example:0.0.1\": failed to extract layer sha256:5061983e267fa42a4403de0a25f41e4ca5ba0c75db3b2a4ceae769f94035816b: write /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/154/fs/opt/conda/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: no space left on device"" pod="x2n8bo60/tensorflow-dist-mnist-ps-0" podUID="89d374cc-6b09-4498-a235-163431fcdb72"

@kingeasternsun (Contributor)

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 22, 2025
@Monokaix (Member)

(quoting the kubelet "no space left on device" error above)

You can use sudo docker system prune -a -f or sudo crictl rmi --prune to release space before running the e2e.

            - name: worker
              image: bitnami/ray:2.49.0
              resources: {}
          restartPolicy: Never
Member:

It seems changing to OnFailure is better, because the worker may not be able to connect to the head if the head starts after the worker.

Member:

I have tested locally and the worker failed to start, so we should change to OnFailure here.

@Wonki4 (Contributor, Author):

I changed the restartPolicy. (Never -> OnFailure)

| 6 | dashboardPort | string | 8265 | No | The port to open for the Ray dashboard | --dashboardPort=8265 |
| 7 | clientServerPort | string | 10001 | No | The port to open for the client server | --clientServerPort=10001 |

## Examples
Member:

Maybe we need also add ray cluster operation doc to users, like https://docs.ray.io/en/master/cluster/kubernetes/getting-started/raycluster-quick-start.html?

@Wonki4 (Contributor, Author):

Thank you for checking!
Adding that would make the docs more convenient and useful.

@Wonki4 (Contributor, Author) commented Sep 22, 2025

Because we assume that a user doesn't have a Ray operator,
I think the five docs below fit our users:

  1. Step 4 ~ Step 5 of the RayCluster Quick Guide can be useful (Step 1 ~ Step 3 are related to a Ray Operator)
  2. What we create and provide: Ray Cluster Key Concepts
  3. What we are based on: Launching an On-Premise Cluster
  4. How to use a ray cluster: the Ray Jobs API
  5. With a ray cluster, we can serve a model: Model Deploy on ray cluster

Could you comment on this? (I think these five links will help users work with a ray cluster.)
(If you have one, could you share a template for the ray cluster operation doc?)

Member:

I think it's ok: we have provided a ray cluster and users can run their jobs, so a simple guide that links to the upstream docs is enough.

@Wonki4 (Contributor, Author):

The simple guide was integrated into the documentation.

Member:

👍

@Monokaix (Member)

Thanks for your contribution! We are planning to release a new version this week, so it's better to resolve all comments so we can update and merge it today: )

@Wonki4 (Contributor, Author) commented Sep 23, 2025

Thanks for your contribution! And we are planning to release new version this week, so it's better to solve all comments to update and merge it today: )

Okay! I'll do it!

Signed-off-by: Wongi, Baek <qordnjsrl13@naver.com>
@volcano-sh-bot volcano-sh-bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 23, 2025
@Monokaix (Member)

/approve
Great job!

@volcano-sh-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Monokaix

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 24, 2025
@JesseStutler (Member)

/lgtm
Thanks

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 24, 2025
@volcano-sh-bot volcano-sh-bot merged commit c3ad660 into volcano-sh:master Sep 24, 2025
20 checks passed
@YeonghyeonKO

LGTM @Wonki4 Thanks for contributing your code.


Development

Successfully merging this pull request may close these issues.

Volcano job natively support Ray framework

6 participants