
@kevin85421
Member

@kevin85421 kevin85421 commented Sep 3, 2023

Why are these changes needed?

In PR #1341, KubeRay is configured to delete the pods that have a Never restart policy and are in terminal states (i.e., Succeeded, Failed). However, an edge case exists for Pods equipped with sidecar containers.

According to the Kubernetes Pod lifecycle documentation, a Pod status of Running indicates that "at least one container is still running, or is in the process of starting or restarting." As a result, the Ray cluster may never recover from a failure, as the following steps illustrate:

  • Create a Ray Pod housing two containers: a primary Ray container and a sidecar container, with the Pod's restart policy designated as Never.
  • Terminate the ray start process in the Ray container, which subsequently will not restart.
  • The Pod maintains a Running status, given that the sidecar container continues to operate.

This PR also checks the status of the Ray container, rather than relying on the Pod's status alone.
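
The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Pod` and `ContainerStatus` types are simplified stand-ins for the Kubernetes API types, `shouldDeletePodSketch` and the container names `ray-head` and `sidecar` are hypothetical, and the sketch only acts on the Never restart policy because that is the case the description covers.

```go
package main

import "fmt"

// PodPhase mirrors the handful of Pod phases this sketch cares about.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// ContainerStatus is a simplified stand-in for a container's reported state.
type ContainerStatus struct {
	Name       string
	Terminated bool // true once the container's process has exited
}

// Pod is a simplified stand-in holding only the fields the check needs.
type Pod struct {
	RestartPolicy     string // "Always", "OnFailure", or "Never"
	Phase             PodPhase
	ContainerStatuses []ContainerStatus
}

// rayContainerTerminated reports whether the named container has terminated.
// It returns false when no container with that name is found.
func rayContainerTerminated(pod Pod, rayContainerName string) bool {
	for _, cs := range pod.ContainerStatuses {
		if cs.Name == rayContainerName {
			return cs.Terminated
		}
	}
	return false
}

// shouldDeletePodSketch (hypothetical) returns true when the Pod will not
// recover on its own and should be deleted so that it can be recreated.
func shouldDeletePodSketch(pod Pod, rayContainerName string) bool {
	// Assumption: only the Never restart policy is handled here.
	if pod.RestartPolicy != "Never" {
		return false
	}
	// Terminal Pod phases: the Pod as a whole has finished.
	if pod.Phase == PodSucceeded || pod.Phase == PodFailed {
		return true
	}
	// Sidecar edge case: the Pod is still Running because another container
	// is alive, but the Ray container itself has exited.
	return pod.Phase == PodRunning && rayContainerTerminated(pod, rayContainerName)
}

func main() {
	pod := Pod{
		RestartPolicy: "Never",
		Phase:         PodRunning,
		ContainerStatuses: []ContainerStatus{
			{Name: "ray-head", Terminated: true},
			{Name: "sidecar", Terminated: false},
		},
	}
	fmt.Println(shouldDeletePodSketch(pod, "ray-head")) // prints "true"
}
```

With a phase-only check, the Pod in `main` would be left alone because it still reports Running; checking the Ray container's own state is what lets it be replaced.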

Related issue number

Closes #1355

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 changed the title from "[WIP][GCS FT] Consider the case of sidecar containers" to "[GCS FT] Consider the case of sidecar containers" Sep 5, 2023
@kevin85421 kevin85421 marked this pull request as ready for review September 5, 2023 17:40
@architkulkarni architkulkarni self-assigned this Sep 5, 2023
Contributor

@architkulkarni architkulkarni left a comment

Looks great! A couple of very minor notes:

  • Currently, the definition of "when does KubeRay restart a Ray pod" only appears in the PR description, and the implementation of shouldDeletePod. I think it should appear in user facing docs somewhere, what do you think? Or is that too technical?
  • [Nit] Consider parametrizing the new test with 6 cases; I think it can be done using "table driven tests".
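
For readers unfamiliar with the pattern, a table-driven layout might look roughly like the sketch below. It is written as a plain program so that it runs without the `testing` package; in a `_test.go` file each row would typically become a `t.Run` subtest. The function `shouldDelete`, its signature, and the rows are all hypothetical illustrations, not the PR's actual test or the six cases mentioned above.

```go
package main

import "fmt"

// shouldDelete is a hypothetical, simplified stand-in for the decision under test.
func shouldDelete(restartPolicy, phase string, rayTerminated bool) bool {
	if restartPolicy != "Never" {
		return false
	}
	if phase == "Succeeded" || phase == "Failed" {
		return true
	}
	return phase == "Running" && rayTerminated
}

// testCase is one row of the table: the inputs plus the expected result.
type testCase struct {
	name          string
	restartPolicy string
	phase         string
	rayTerminated bool
	want          bool
}

// cases returns the illustrative table of scenarios.
func cases() []testCase {
	return []testCase{
		{"failed pod", "Never", "Failed", true, true},
		{"succeeded pod", "Never", "Succeeded", true, true},
		{"running, ray container alive", "Never", "Running", false, false},
		{"running, ray container exited, sidecar alive", "Never", "Running", true, true},
		{"restart policy Always", "Always", "Running", true, false},
	}
}

// runCases evaluates every row and returns the names of rows that failed.
func runCases() []string {
	var failed []string
	for _, tc := range cases() {
		if got := shouldDelete(tc.restartPolicy, tc.phase, tc.rayTerminated); got != tc.want {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	fmt.Printf("%d failing cases\n", len(runCases()))
}
```

Adding a scenario then means adding one row to the table rather than writing another test function.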

@kevin85421
Member Author

> Looks great! A couple of very minor notes:
>
>   • Currently, the definition of "when does KubeRay restart a Ray pod" only appears in the PR description, and the implementation of shouldDeletePod. I think it should appear in user facing docs somewhere, what do you think? Or is that too technical?
>   • [Nit] Consider parametrizing the new test with 6 cases; I think it can be done using "table driven tests".

Created issues #1392 and #1393 for these two points.

@kevin85421 kevin85421 merged commit 1c9de23 into ray-project:master Sep 5, 2023
z103cb pushed a commit to z103cb/kuberay that referenced this pull request Sep 11, 2023
z103cb pushed a commit to z103cb/kuberay that referenced this pull request Sep 11, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023

Development

Successfully merging this pull request may close these issues.

[GCS FT] Consider the case of sidecar containers