-
Notifications
You must be signed in to change notification settings - Fork 670
[RayJob] Add JobDeploymentStatusFailed Status and Reason Field to Enhance Observability for Flyte/RayJob Integration #1942
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RayJob] Add JobDeploymentStatusFailed Status and Reason Field to Enhance Observability for Flyte/RayJob Integration #1942
Conversation
| rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusComplete | ||
| r.Log.Info("The submitter Kubernetes Job has failed. Attempting to transition the status to `Failed`.", "RayJob", rayJob.Name, "Submitter K8s Job", job.Name, "Reason", cond.Reason, "Message", cond.Message) | ||
| rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed | ||
| // The Submitter Job needs to wait for the user code to finish and retrieve its logs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // The Submitter Job needs to wait for the user code to finish and retrieve its logs. | |
| // The submitter Job needs to wait for the user code to finish and retrieve its logs. |
| JobDeploymentStatusSuspending JobDeploymentStatus = "Suspending" | ||
| JobDeploymentStatusSuspended JobDeploymentStatus = "Suspended" | ||
| ) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define a new type type JobFailedReason string?
| ) | ||
|
|
||
| const ( | ||
| JobReasonSubmissionFailed string = "SubmissionFailed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| JobReasonSubmissionFailed string = "SubmissionFailed" | |
| SubmissionFailed string = "SubmissionFailed" |
|
|
||
| const ( | ||
| JobReasonSubmissionFailed string = "SubmissionFailed" | ||
| JobReasonDeadlineExceeded string = "DeadlineExceeded" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| JobReasonDeadlineExceeded string = "DeadlineExceeded" | |
| DeadlineExceeded string = "DeadlineExceeded" |
| const ( | ||
| JobReasonSubmissionFailed string = "SubmissionFailed" | ||
| JobReasonDeadlineExceeded string = "DeadlineExceeded" | ||
| JobReasonApplicationLevelFailed string = "ApplicationLevelFailed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| JobReasonApplicationLevelFailed string = "ApplicationLevelFailed" | |
| AppFailed string = "AppFailed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JobFailed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel users might be confused about whether it is a RayJob (KubeRay), a job (Ray), or a Kubernetes submitter job if we use JobFailed.
| // indefinitely. | ||
| // If the submitting Kubernetes Job reaches the backoff limit, transition the status to `Complete` or `Failed`. | ||
| // This is because, beyond this point, it becomes impossible for the submitter to submit any further Ray jobs. | ||
| // For light-weight mode, we don't transition the status to `Complete` or Failed` based on the number of failed requests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // For light-weight mode, we don't transition the status to `Complete` or Failed` based on the number of failed requests. | |
| // For light-weight mode, we don't transition the status to `Complete` or `Failed` based on the number of failed requests. |
…iled DeadlineExceeded ApplicationLevelFailed Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
ef6723e to
e33111d
Compare
Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
c17cbec to
317e7dd
Compare
kevin85421
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
cc @pingsutw |
|
Thank you @kevin85421 @Yicheng-Lu-llll |
…ance Observability for Flyte/RayJob Integration (ray-project#1942)
Why are these changes needed?
After this PR:
There will be two terminal deployment statuses(Inspired by how K8s jobs define the status and how the Spark operator defines the terminal statuses):
JobDeploymentStatusCompleteJobDeploymentStatusFailed.If RayJob runs into
JobDeploymentStatusFailed(Inspired by how k8s job usereasonandmessagefields):reasonfield will include a machine-readable string that describes the failure reason. Other controllers like Flyte, can take actions based on this field.messagefield will include a human-readable explanation of the failure reason.There are three failed reason currently:
SubmissionFailed: The submitter job failed to submit the user code.DeadlineExceeded: This is analogous toJobReasonDeadlineExceededin Kubernetes Jobs.AppFailed: The user code throw an error.Additional Note:
Ideally, users should only consider the
reasonandmessagefields when the Deployment Status isFailed.But Considering the fact that the submitter job might retry, these two fields have slightly different behavior:
reasonfield is set only when the Deployment Status changes toFailed. Therefore, no failure reason will be displayed during the retry period, as the Deployment Status remainsrunning.messagefield will always display the message for the current run. So, might have error message even if the Deployment Status remainsrunning.Related issue number
Checks
I have changed the CI test accordingly; all changes are covered by the CI test, but I have also manually tested it to give an idea of how the current status looks like: