Skip to content

Conversation

@Yicheng-Lu-llll
Copy link
Member

@Yicheng-Lu-llll Yicheng-Lu-llll commented Feb 24, 2024

Why are these changes needed?

After this PR:

  1. There will be two terminal deployment statuses(Inspired by how K8s jobs define the status and how the Spark operator defines the terminal statuses):

    • JobDeploymentStatusComplete
    • JobDeploymentStatusFailed.
  2. If RayJob runs into JobDeploymentStatusFailed(Inspired by how k8s job use reason and message fields):

    • the reason field will include a machine-readable string that describes the failure reason. Other controllers like Flyte, can take actions based on this field.
    • The message field will include a human-readable explanation of the failure reason.
  3. There are three failed reason currently:

    • SubmissionFailed: The submitter job failed to submit the user code.
    • DeadlineExceeded: This is analogous to JobReasonDeadlineExceeded in Kubernetes Jobs.
    • AppFailed: The user code throw an error.

Additional Note:

Ideally, users should only consider the reason and message fields when the Deployment Status is Failed.
But Considering the fact that the submitter job might retry, these two fields have slightly different behavior:

  • The reason field is set only when the Deployment Status changes to Failed. Therefore, no failure reason will be displayed during the retry period, as the Deployment Status remains running.
  • the message field will always display the message for the current run. So, might have error message even if the Deployment Status remains running.

Related issue number

Checks

I have changed the CI test accordingly; all changes are covered by the CI test, but I have also manually tested it to give an idea of how the current status looks like:

  • The submitter job failed to submit the user code:
  Job Deployment Status:  Failed
  Job Id:                 rayjob-sample-wd89p
  Message:                Job submission has failed. Reason: BackoffLimitExceeded. Message: Job has reached the specified backoff limit
  Reason:      SubmissionFailed
  • deadline exceed:
  End Time:               2024-02-24T21:58:47Z
  Job Deployment Status:  Failed
  Job Id:                 rayjob-sample-7gj87
  Message:                The RayJob has passed the activeDeadlineSeconds. StartTime: 2024-02-24 21:58:46 +0000 UTC. ActiveDeadlineSeconds: 1
  Reason:      DeadlineExceeded
  • The user code throw an error:
  Job Deployment Status:  Failed
  Job Id:                 rayjob-sample-5bk57
  Job Status:             FAILED
  Message:                Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
2024-02-24 14:02:23,886   INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at ^[[1m^[[32mhttp://10.244.0.13:8265 ^[[39m^[[22m
test_counter got 1
test_counter got 2
test_counter got 3
test_counter got 4
test_counter got 5
Traceback (most recent call last):
  File "/home/ray/samples/sample_code.py", line 28, in <module>
    assert requests.__version__ == "2.26.0dewfrgsthsnrdyukjuytretyhgjk"
AssertionError
  Reason:                 AppFailed
  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 self-requested a review February 25, 2024 23:12
@kevin85421 kevin85421 self-assigned this Feb 25, 2024
rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusComplete
r.Log.Info("The submitter Kubernetes Job has failed. Attempting to transition the status to `Failed`.", "RayJob", rayJob.Name, "Submitter K8s Job", job.Name, "Reason", cond.Reason, "Message", cond.Message)
rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
// The Submitter Job needs to wait for the user code to finish and retrieve its logs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// The Submitter Job needs to wait for the user code to finish and retrieve its logs.
// The submitter Job needs to wait for the user code to finish and retrieve its logs.

JobDeploymentStatusSuspending JobDeploymentStatus = "Suspending"
JobDeploymentStatusSuspended JobDeploymentStatus = "Suspended"
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define a new type type JobFailedReason string?

)

const (
JobReasonSubmissionFailed string = "SubmissionFailed"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
JobReasonSubmissionFailed string = "SubmissionFailed"
SubmissionFailed string = "SubmissionFailed"


const (
JobReasonSubmissionFailed string = "SubmissionFailed"
JobReasonDeadlineExceeded string = "DeadlineExceeded"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
JobReasonDeadlineExceeded string = "DeadlineExceeded"
DeadlineExceeded string = "DeadlineExceeded"

const (
JobReasonSubmissionFailed string = "SubmissionFailed"
JobReasonDeadlineExceeded string = "DeadlineExceeded"
JobReasonApplicationLevelFailed string = "ApplicationLevelFailed"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
JobReasonApplicationLevelFailed string = "ApplicationLevelFailed"
AppFailed string = "AppFailed"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JobFailed?

Copy link
Member Author

@Yicheng-Lu-llll Yicheng-Lu-llll Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel users might be confused about whether it is a RayJob (KubeRay), a job (Ray), or a Kubernetes submitter job if we use JobFailed.

// indefinitely.
// If the submitting Kubernetes Job reaches the backoff limit, transition the status to `Complete` or `Failed`.
// This is because, beyond this point, it becomes impossible for the submitter to submit any further Ray jobs.
// For light-weight mode, we don't transition the status to `Complete` or Failed` based on the number of failed requests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// For light-weight mode, we don't transition the status to `Complete` or Failed` based on the number of failed requests.
// For light-weight mode, we don't transition the status to `Complete` or `Failed` based on the number of failed requests.

…iled DeadlineExceeded ApplicationLevelFailed

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
@Yicheng-Lu-llll Yicheng-Lu-llll force-pushed the improve-rayjob-observability branch from ef6723e to e33111d Compare February 27, 2024 05:47
Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
@Yicheng-Lu-llll Yicheng-Lu-llll force-pushed the improve-rayjob-observability branch from c17cbec to 317e7dd Compare February 27, 2024 15:15
Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kevin85421 kevin85421 merged commit fe981a2 into ray-project:master Feb 28, 2024
@kevin85421
Copy link
Member

cc @pingsutw

@pingsutw
Copy link
Contributor

Thank you @kevin85421 @Yicheng-Lu-llll

ravishtiwari pushed a commit to ravishtiwari/kuberay that referenced this pull request Mar 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants