-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feat: add ray plugin for job #4581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Welcome @Wonki4! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @Wonki4, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a dedicated Ray plugin for Volcano, significantly simplifying the deployment and lifecycle management of Ray clusters within the Volcano job scheduling framework. By abstracting away complex manual configurations, it allows users to easily define and run distributed Ray applications, enhancing Volcano's capabilities for AI/ML and distributed computing workloads.
Highlights
- Ray Plugin Introduction: A new
rayplugin is added to Volcano, enabling simplified deployment and management of Ray clusters. - Automated Configuration: The plugin automates the setup of Ray head and worker node commands, port configurations (GCS, dashboard, client server), and creates a Kubernetes Service for the Ray head node.
- Customizable Parameters: Users can specify custom names for head/worker tasks and containers, and configure Ray-specific ports via plugin arguments.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new ray plugin for Volcano, which simplifies the deployment of Ray clusters. The implementation is well-structured, covering command injection for head and worker pods, service creation, and lifecycle management. My review focuses on improving code clarity, fixing a potential bug related to loop variable usage, and correcting a flaw in the new E2E test to ensure the plugin's functionality is properly verified. Addressing these points will enhance the robustness and maintainability of this new feature.
test/e2e/jobseq/ray_plugin.go
Outdated
| }, | ||
| }, | ||
| Plugins: map[string][]string{ | ||
| "ray": {"--head=head", "--worker=worker", "--port=2345", "--headContainer=rayproject", "--workerContainer=rayproject"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This E2E test is not correctly testing the plugin's main functionality. The plugin is configured here with "--headContainer=rayproject" and "--workerContainer=rayproject". However, the tasks defined in the job spec (lines 49-64) do not specify any containers. The test utility will create a default container, likely named "default-container".
Because the container names do not match, the plugin's OnPodCreate logic will never find the target containers, and therefore the ray start commands will not be injected. The test may pass if the base rayproject/ray image has a default entrypoint that doesn't immediately fail, but the plugin's core command injection logic is not being exercised.
To fix this, you should explicitly define containers named rayproject within the head and worker task specs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The container name is based on the container image repository name. so, I set up the container names as a "rayproject". Should I change the way I test?
pkg/controllers/job/plugins/distributed-framework/ray/ray_test.go
Outdated
Show resolved
Hide resolved
cab9682 to
158a256
Compare
|
/cc @Monokaix I think we can take on this feature in v1.13 |
+1 |
|
Okay! I'll fix it! |
4950290 to
88989b6
Compare
|
@Monokaix |
|
Thanks for your contribution. |
|
/lgtm |
|
Please also add the yaml example into |
|
Can you also check about the functionalities and API fields we missed and can be added into this plugin? Seems it's a little simple, or if it can meet the production requirements under current implementation? |
Okay, I'll check other scenarios using ray. This plugin focus on supporting to create a ray cluster. |
Great, but you can resolve other comments first, we can continuously iterate on this plugin: ) |
12692a3 to
c6dd4df
Compare
|
@Monokaix Could we increase a disk size of node? or Could you advice me to find a other way to solve it.
|
|
/lgtm |
You can use |
| - name: worker | ||
| image: bitnami/ray:2.49.0 | ||
| resources: {} | ||
| restartPolicy: Never |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems change to OnFailure is better, because the worker may cannot connect the header if the header starts after worker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tested locally and the worker failed to start, so we should change to OnFailure here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the restartPolicy. (Never -> OnFailure)
| | 6 | dashboardPort | string | 8265 | No | The port to open for the Ray dashboard | --dashboardPort=8265 | | ||
| | 7 | clientServerPort | string | 10001 | No | The port to open for the client server | --clientServerPort=10001 | | ||
|
|
||
| ## Examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we need also add ray cluster operation doc to users, like https://docs.ray.io/en/master/cluster/kubernetes/getting-started/raycluster-quick-start.html?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your check!
Maybe, it will be more comfortable and useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we suppose that a user doesn't have a ray operator,
I think that below 5 docs are fit for our users.
- Step 4 ~ Step 5 of RayCluster Quick Guide can be useful (Step 1 ~ Step 3 is related to a Ray Operator)
- What we create and provide is Ray Cluster Key Concepts
- What we based on is Launching an On-Premise Cluster
- How to use a ray cluster with Ray Jobs API
- With a ray cluster, we can serve a model Model Deploy on ray cluster
Could you comment this? (I think the 5 links are helpful for users to use a ray cluster.)
(If you have, could you tell me a form about the ray cluster operation doc?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's ok, we have provided a ray cluster and user can run their job, and we just give a simple guide and link the guide doc is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simple guide was integrated into the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
Thanks for your contribution! And we are planning to release new version this week, so it's better to solve all comments to update and merge it today: ) |
Okay! I'll do it! |
Signed-off-by: Wongi, Baek <qordnjsrl13@naver.com>
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Monokaix The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/lgtm |
|
LGTM @Wonki4 Thanks for contributing your code. |
What type of PR is this?
This PR includes the ray plugin support for volcano job.
What this PR does / why we need it:
This PR supports users to build a ray cluster is composed of head and worker easily.
Example
If a user use ray plugin, the cluster configuration is easier than before
Here is the ray plugin arguments
The architecture of ray cluster in Volcano Job
Which issue(s) this PR fixes:
Fixes #4182
Special notes for your reviewer:
I know that there is another PR about this issue(#4193 ).
but there's no progress for 2 months so I decided to start working on it.
Does this PR introduce a user-facing change?