-
Notifications
You must be signed in to change notification settings - Fork 670
[Test] Implement RayService In-place update test in Golang #2536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Cheyu Wu <cheyu1220@gmail.com>
Signed-off-by: Cheyu Wu <cheyu1220@gmail.com>
This reverts commit 6ac63c2.
|
I've made all the necessary changes. If there’s anything else that needs to be modified, please let me know. Thanks! |
|
@MortalHappiness The pipeline is stuck. Could you help me to rerun it? |
ray-operator/test/support/ray.go
Outdated
| return cluster.Status.DesiredWorkerReplicas | ||
| } | ||
|
|
||
| func UnderlyingRayCluster(rayService *rayv1.RayService) func() (*rayv1.RayCluster, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func UnderlyingRayCluster(rayService *rayv1.RayService) func() (*rayv1.RayCluster, error) { | |
| func UnderlyingRayClusterName(rayService *rayv1.RayService) string { |
Change the function signature. And then we can use
Eventually(RayService(...)).WithTransform(UnderlyingRayClusterName, ...).Should(...)
rayClusterName := UnderlyingRayClusterName(...)| namespace := test.NewTestNamespace() | ||
| test.StreamKubeRayOperatorLogs() | ||
|
|
||
| fileName := "ray-service.sample.yaml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we now put this file under e2e test folder, we should not depend on the yaml file. Please apply the RayService with ApplyConfiguration
| type RayServiceApplyConfiguration struct { |
You can also reference other test files under e2e folder that import rayv1ac "github.com/ray-project/kuberay/ray-operator/pkg/client/applyconfiguration/ray/v1" to see how to use it.
…ge the comment in curlRayServicePod
|
Well, This is my revised version. Free to let me know if there are any issues because I'm not entirely sure if this is the best approach. I turned each config into a separate function instead of modifying the original template function to make it suitable as a shared function. |
| g.Expect(err).NotTo(HaveOccurred()) | ||
| g.Expect(rayService).NotTo(BeNil()) | ||
|
|
||
| // Wait for RayCluster name to be populated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Wait for RayCluster name to be populated
g.Eventually(RayService(test, namespace.Name, rayServiceName), TestTimeoutShort).Should(
WithTransform(UnderlyingRayClusterName, Not(BeEmpty())),
)
rayClusterName := UnderlyingRayClusterName(rayService)
test.T().Logf("Waiting for RayCluster %s/%s to be ready", namespace.Name, rayClusterName)
g.Eventually(RayCluster(test, namespace.Name, rayClusterName), TestTimeoutMedium).
Should(WithTransform(RayClusterState, Equal(rayv1.Ready)))
rayCluster, err := GetRayCluster(test, namespace.Name, rayClusterName)
g.Expect(err).NotTo(HaveOccurred())
// Check if the head pod is ready
g.Eventually(HeadPod(test, rayCluster), TestTimeoutShort).Should(WithTransform(sampleyaml.IsPodRunningAndReady, BeTrue()))
// Check if all worker pods are ready
g.Eventually(WorkerPods(test, rayCluster), TestTimeoutShort).Should(WithTransform(sampleyaml.AllPodsRunningAndReady, BeTrue()))We don't need to check the RayCluster or the readiness of Pods for RayService. We only need to make sure RayService.Status.ServiceStatus is Running. The readiness of worker Pods also depends on the Ray Serve configurations. For example, if Ray Serve doesn't schedule any Ray Serve replicas to a worker Pod, its readiness probe will fail, preventing it from receiving any requests. Hence, it is possible for a healthy RayService to have non-ready worker Pods. I think we can remove the code I quoted above.
| stdout, _ = curlRayServicePod(test, rayService, curlPod, curlContainerName, "/calc", `["MUL", 3]`) | ||
| g.Expect(stdout.String()).To(Equal("15 pizzas please!")) | ||
|
|
||
| // In place update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
| // In place update | |
| // In-place update |
| ) | ||
| g.Expect(err).NotTo(HaveOccurred()) | ||
|
|
||
| g.Eventually(func(g Gomega) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ObservedGenerationis not well-maintained in the KubeRay codebase. It's better not to use this field.- The observability of the progress of in-place updates highly depends on Ray Serve's implementation, which may vary across different Ray releases.
Maybe we can use Eventually to send requests via the curl Pod to ensure that the return values are eventually updated and remove this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this g.Eventually(func(g Gomega) {...}).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, sorry, I misunderstood your meaning.
ray-operator/test/e2e/support.go
Outdated
| WithContainers(corev1ac.Container(). | ||
| WithName("ray-worker"). | ||
| WithImage(GetRayImage()). | ||
| WithLifecycle(corev1ac.Lifecycle(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is unnecessary
|
|
||
| func rayServiceSampleYamlApplyConfiguration() *rayv1ac.RayServiceSpecApplyConfiguration { | ||
| return rayv1ac.RayServiceSpec().WithServeConfigV2(`applications: | ||
| - name: fruit_app |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove the unnecessary parts of ServeConfigV2? For example, OrangeStand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ve tried several times, and it turns out that at least two fruit stands need to be retained; otherwise, it will keep getting stuck at "wait for deployment ready."
There are three types of fruits here. Do we want to remove one, or keep them all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at least two fruit stands need to be retained; otherwise, it will keep getting stuck at "wait for deployment ready."
Interesting. Could you open an issue about the reproduction? We don't need to handle the issue in this PR, but we should track the progress.
There are three types of fruits here. Do we want to remove one, or keep them all?
Remove one. Just try to make this AC as short as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I have created this issue #2557
|
|
||
| func rayServiceSampleYamlApplyConfiguration() *rayv1ac.RayServiceSpecApplyConfiguration { | ||
| return rayv1ac.RayServiceSpec().WithServeConfigV2(`applications: | ||
| - name: fruit_app |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at least two fruit stands need to be retained; otherwise, it will keep getting stuck at "wait for deployment ready."
Interesting. Could you open an issue about the reproduction? We don't need to handle the issue in this PR, but we should track the progress.
There are three types of fruits here. Do we want to remove one, or keep them all?
Remove one. Just try to make this AC as short as possible.
| num_cpus: 0.1 | ||
| - name: Router | ||
| num_replicas: 1`). | ||
| WithRayClusterSpec(rayv1ac.RayClusterSpec(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does reusing rayClusterSpec() work?
kuberay/ray-operator/test/e2e/support.go
Line 102 in 8d60d61
| func rayClusterSpec() *rayv1ac.RayClusterSpecApplyConfiguration { |
| ) | ||
| rayClusterName := UnderlyingRayClusterName(rayService) | ||
|
|
||
| test.T().Logf("Waiting for RayCluster %s/%s to be ready", namespace.Name, rayClusterName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible for a healthy RayService to have its underlying RayCluster status not be rayv1.Ready, and for some worker Pods to be not ready. We can remove the checks of Equal(rayv1.Ready) and WorkerPods(test, rayCluster).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you are correct. We only need to make sure the Head Pod is ready.
| g.Expect(rayClusterName).NotTo(BeEmpty()) | ||
| }, TestTimeoutShort).Should(Succeed()) | ||
| g.Eventually(RayService(test, namespace.Name, rayServiceFromYaml.Name), TestTimeoutShort).Should( | ||
| WithTransform(UnderlyingRayClusterName, Not(BeEmpty())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can get ActiveServiceStatus.RayClusterName directly because the RayService has become Running, and zero-downtime has not been triggered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will also remove function UnderlyingRayClusterName, because this is not used anymore
kevin85421
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a comment. Others look good to me.
| ) | ||
| g.Expect(err).NotTo(HaveOccurred()) | ||
|
|
||
| g.Eventually(func(g Gomega) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this g.Eventually(func(g Gomega) {...}).
|
Please fix the CI error. Thanks! |
I’d like to discuss a situation where the CI pipeline is stuck at I’d like to ask which approach would be better:
|
|
@kevin85421 Do you think it's better to lower the resource requests in our codebase, or to move the test to buildkite, which has more resources? |
@MortalHappiness @CheyuWu I think the issue is not related to physical resources, so moving to Buildkite will likely result in the same issue. Instead, the issue seems to be related to Ray's logical resources. @CheyuWu, - name: Router
num_replicas: 1 |
I tested it locally by increasing the CPU and memory for the Let me check if the pipeline passes. |
|
@CheyuWu, you can check the Ray dashboard to see which Ray Serve replica is not scheduled. A possible reason might be |
I conducted a few experiments and checked Ray Serve's logs on my local machine. Here are my findings:
|
|
@CheyuWu got it. maybe use buildkite instead and increase the resources in |
I would like to ask if it’s okay to directly modify the resource in this way using headPodTemplateApplyConfiguration() within rayClusterSpec() at this location? kuberay/ray-operator/test/e2e/rayjob_lightweight_test.go Lines 45 to 54 in 925effe
I’m not familiar with it and because the testcase is named "lightweight," I’m worried it might affect the original setup. |
Why are these changes needed?
Implement test_in_place_update in Golang, and add a file rayservice_in_place_update_test.go under https://github.com/ray-project/kuberay/tree/master/ray-operator/test/e2e.
Related issue number
Close #2516
Checks