Skip to content

Conversation

@rueian
Copy link
Collaborator

@rueian rueian commented Mar 21, 2025

Why are these changes needed?

As shown in #3211:

In v1.3, I observe that the new behavior is:

  • Attempt 1: submit job -> tail logs -> error when job fails
  • Attempt 2: ray job status -> ray job logs -> completed

In Attempt 2, the pod successfully completes with exit code 0 because "ray job logs" returns exit 0 even if the job fails. >Here's an example from my cluster:

$ kubectl get po
NAME                                                        READY   STATUS      RESTARTS   AGE
pytorch-text-classifier-6q6ff-x59k6                         0/1     Error       0          11m          # first attempt
pytorch-text-classifier-6q6ff-xqhk7                         0/1     Completed   0          10m.  # second attempt

What I expected was the submitter Job to stop after attempt 1 since that is a terminal state and retrying does not achieve anything.

This PR makes the submitter job stop after attempt 1 by replacing ray job submit with ray job submit --no-wait + ray job logs --follow so that the submitter job will not exit with failures when the ray job fails.

The full script will look like this:

if ! ray job status --address http://$RAY_ADDRESS $RAY_JOB_SUBMISSION_ID >/dev/null 2>&1 ;
then ray job submit --address http://$RAY_ADDRESS --submission-id $RAY_JOB_SUBMISSION_ID --no-wait -- ... ;
fi ; ray job logs --address http://$RAY_ADDRESS --follow $RAY_JOB_SUBMISSION_ID

Related issue number

#3211

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

…error return code to the log tailing

Signed-off-by: Rueian <rueiancsie@gmail.com>
@rueian rueian force-pushed the rayjob-no-wait-submit branch from 24ea655 to 50196fe Compare March 21, 2025 06:27
"echo no quote 'single quote' \"double quote\"",
";", "fi",
";", "fi", ";",
"ray", "job", "logs", "--address", "http://127.0.0.1:8265", "--follow", "testJobId",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, the submitter Job status will almost always be Completed even if the underlying Ray job fails, is this the behavior we want?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe so. The submitter pods should be marked as Error only when they fail to submit the Ray job or fail to tail the Ray job logs. spec.submitterConfig.backoffLimit will retry these kinds of failures. Ray job failures, on the other hand, will be covered by spec.backoffLimit.

@rueian rueian force-pushed the rayjob-no-wait-submit branch from 757ba18 to 1f2d15b Compare March 21, 2025 23:48
…error return code to the log tailing

Signed-off-by: Rueian <rueiancsie@gmail.com>
@rueian rueian force-pushed the rayjob-no-wait-submit branch from 1f2d15b to 4cbd80a Compare March 22, 2025 01:04
@rueian rueian marked this pull request as ready for review March 22, 2025 03:52
}
submitterTemplate.Spec.Containers[utils.RayContainerIndex].Command = []string{"/bin/bash"}
submitterTemplate.Spec.Containers[utils.RayContainerIndex].Args = []string{"-c", strings.Join(k8sJobCommand, " ")}
submitterTemplate.Spec.Containers[utils.RayContainerIndex].Args = []string{"-ce", strings.Join(k8sJobCommand, " ")}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the e option to make the bash script exit when a command fails.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what will happen if a command fails without -e?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the -e option, the bash script will not stop when the ray job submit --no-wait command fails. The script will then continue to execute ray job logs and fail again since the submission did not succeed. The result logs will be messed up because they now contain two failure messages, one for the ray job submit --no-wait and one for the ray job logs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment like "Without the -e option, the Bash script will continue executing even if a command returns a non-zero exit code."?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@rueian
Copy link
Collaborator Author

rueian commented Mar 22, 2025

@MortalHappiness, @kevin85421, @andrewsykim Could you please help review this?

@andrewsykim
Copy link
Member

@rueian can you share what the new submitter Job behavior looks like with the change? Both when job fails and when job succeeds. Is it now a completed Job with 0 retries?

@rueian
Copy link
Collaborator Author

rueian commented Mar 24, 2025

Hi @andrewsykim,

This is the result of a failed ray job (spec.submitterConfig.backoffLimit=1). It stops at attempt 1:
image

This is the result of a successful ray job:
image

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you help me understand when the K8s Job submitter increments Failed or Succeeded?

}
submitterTemplate.Spec.Containers[utils.RayContainerIndex].Command = []string{"/bin/bash"}
submitterTemplate.Spec.Containers[utils.RayContainerIndex].Args = []string{"-c", strings.Join(k8sJobCommand, " ")}
submitterTemplate.Spec.Containers[utils.RayContainerIndex].Args = []string{"-ce", strings.Join(k8sJobCommand, " ")}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what will happen if a command fails without -e?

job, err := test.Client().Core().BatchV1().Jobs(namespace.Name).Get(test.Ctx(), rayJob.Name, metav1.GetOptions{})
g.Expect(err).NotTo(HaveOccurred())
g.Expect(job.Spec.BackoffLimit).To(HaveValue(Equal(int32(1))))
g.Expect(job.Status.Failed).To(Equal(int32(0))) // RayJob failures are not counted as Job failures.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update comment to something like this:

"Ray job failures (at the application level) are not counted as K8s job submitter failures; the K8s job submitter only focuses on issues such as network disconnections."

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I slightly changed the comment: "Ray job failures (at the application level) are not counted as K8s job submitter failures; K8s job submitter failures only count for issues like network errors."

Signed-off-by: Rueian <rueiancsie@gmail.com>
@andrewsykim
Copy link
Member

I like the new behavior and I think it makes sense. But I'm a bit worried that it can be misleading to users since in previous versions, job failures resulted in the submitter Job also failing. Do we need to update docs / release notes to reflect this change more clearly and give users a heads up?

@rueian
Copy link
Collaborator Author

rueian commented Mar 24, 2025

Could you help me understand when the K8s Job submitter increments Failed or Succeeded?

@kevin85421, A K8s Job submitter increments the Failed counter when the bash script fails, that are:

  • ray job submit --no-wait fails.
  • ray job logs fails.

As you said in your previous suggestion, the two cases of failures are mostly caused by network failures.

A K8s Job submitter increments the Succeeded counter when the bash succeeds, that is:

  • ray job logs succeeds.

That means the Ray job submission has succeeded, and its logs have been tailed by the submitter successfully.

@rueian
Copy link
Collaborator Author

rueian commented Mar 24, 2025

I like the new behavior and I think it makes sense. But I'm a bit worried that it can be misleading to users since in previous versions, job failures resulted in the submitter Job also failing. Do we need to update docs / release notes to reflect this change more clearly and give users a heads up?

How about adding a section at the end of the quick start to show the difference between v1.2, v1.3, and v1.4?

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}
submitterTemplate.Spec.Containers[utils.RayContainerIndex].Command = []string{"/bin/bash"}
submitterTemplate.Spec.Containers[utils.RayContainerIndex].Args = []string{"-c", strings.Join(k8sJobCommand, " ")}
submitterTemplate.Spec.Containers[utils.RayContainerIndex].Args = []string{"-ce", strings.Join(k8sJobCommand, " ")}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment like "Without the -e option, the Bash script will continue executing even if a command returns a non-zero exit code."?

@kevin85421
Copy link
Member

I will leave it to @andrewsykim for a final review and merge, as Andrew is still reviewing this PR.

Signed-off-by: Rueian <rueiancsie@gmail.com>
@kevin85421 kevin85421 merged commit bcc8c09 into ray-project:master Mar 25, 2025
21 checks passed
@kevin85421
Copy link
Member

@andrewsykim I merge this PR for now. Feel free to discuss with @rueian if you have any comments.

andrewsykim pushed a commit to andrewsykim/kuberay that referenced this pull request Apr 2, 2025
…error return code to the log tailing (ray-project#3216)

* [RayJob][Fix] Use --no-wait for job submission to avoid carrying the error return code to the log tailing

Signed-off-by: Rueian <rueiancsie@gmail.com>

* [RayJob][Fix] Use --no-wait for job submission to avoid carrying the error return code to the log tailing

Signed-off-by: Rueian <rueiancsie@gmail.com>

* chore: update comments

Signed-off-by: Rueian <rueiancsie@gmail.com>

* chore: add a comment about bash -e

Signed-off-by: Rueian <rueiancsie@gmail.com>

---------

Signed-off-by: Rueian <rueiancsie@gmail.com>
@andrewsykim andrewsykim mentioned this pull request Apr 2, 2025
4 tasks
andrewsykim added a commit that referenced this pull request Apr 2, 2025
* [RayJob][Fix] Use --no-wait for job submission to avoid carrying the error return code to the log tailing (#3216)

* [RayJob][Fix] Use --no-wait for job submission to avoid carrying the error return code to the log tailing

Signed-off-by: Rueian <rueiancsie@gmail.com>

* [RayJob][Fix] Use --no-wait for job submission to avoid carrying the error return code to the log tailing

Signed-off-by: Rueian <rueiancsie@gmail.com>

* chore: update comments

Signed-off-by: Rueian <rueiancsie@gmail.com>

* chore: add a comment about bash -e

Signed-off-by: Rueian <rueiancsie@gmail.com>

---------

Signed-off-by: Rueian <rueiancsie@gmail.com>

* kubectl ray job submit: provide entrypoint (#3186)

* [kubectl-plugin] Add head/worker node selector option (#3228)

* add node selector option for kubectl plugin create cluster

Signed-off-by: Troy Chiu <y.troychiu@gmail.com>

* nit

Signed-off-by: Troy Chiu <y.troychiu@gmail.com>

---------

Signed-off-by: Troy Chiu <y.troychiu@gmail.com>

* add node selector option for kubectl plugin create worker group (#3235)

* add node selector option for kubectl plugin create work group

Signed-off-by: Troy Chiu <y.troychiu@gmail.com>

* nit

Signed-off-by: Troy Chiu <y.troychiu@gmail.com>

* code review: fix usage

Signed-off-by: Troy Chiu <y.troychiu@gmail.com>

---------

Signed-off-by: Troy Chiu <y.troychiu@gmail.com>

* [kubectl-plugin] remove CPU limits by default (#3243)

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>

* [Chore][CI] Limit the release-image-build github workflow to only take tag as input (#3117)

* remove all inputs from workflow_dispatch

Signed-off-by: Tina Wu <j6vupz97@gmail.com>

* use tag only

Signed-off-by: Tina Wu <j6vupz97@gmail.com>

* align case

Signed-off-by: Tina Wu <j6vupz97@gmail.com>

* change sha

Signed-off-by: Tina Wu <j6vupz97@gmail.com>

* extract tag

* lint fix

Signed-off-by: Tina Wu <j6vupz97@gmail.com>

* update github_env

Signed-off-by: Tina Wu <j6vupz97@gmail.com>

* directly take tag

Signed-off-by: Tina Wu <j6vupz97@gmail.com>

* add env,

Signed-off-by: Tina Wu <j6vupz97@gmail.com>

* directly use tag

Signed-off-by: Tina Wu <j6vupz97@gmail.com>

* use env. when in script

Signed-off-by: Tina Wu <j6vupz97@gmail.com>

* env.tag when with

Signed-off-by: Tina Wu <j6vupz97@gmail.com>

* use env.tag for all

Signed-off-by: Tina Wu <j6vupz97@gmail.com>

---------

Signed-off-by: Tina Wu <j6vupz97@gmail.com>
Co-authored-by: tinaxfwu <twu@synchron.com>

* [CI] Remove create tag step from release (#3249)

Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>

---------

Signed-off-by: Rueian <rueiancsie@gmail.com>
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Tina Wu <j6vupz97@gmail.com>
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Spencer Peterson <spencerjp@google.com>
Co-authored-by: Troy Chiu <114708546+troychiu@users.noreply.github.com>
Co-authored-by: Tina Wu <j6vupz97@gmail.com>
Co-authored-by: tinaxfwu <twu@synchron.com>
Co-authored-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
win5923 pushed a commit to win5923/kuberay that referenced this pull request Apr 27, 2025
…error return code to the log tailing (ray-project#3216)

* [RayJob][Fix] Use --no-wait for job submission to avoid carrying the error return code to the log tailing

Signed-off-by: Rueian <rueiancsie@gmail.com>

* [RayJob][Fix] Use --no-wait for job submission to avoid carrying the error return code to the log tailing

Signed-off-by: Rueian <rueiancsie@gmail.com>

* chore: update comments

Signed-off-by: Rueian <rueiancsie@gmail.com>

* chore: add a comment about bash -e

Signed-off-by: Rueian <rueiancsie@gmail.com>

---------

Signed-off-by: Rueian <rueiancsie@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants