Skip to content

Conversation

@CheyuWu
Copy link
Collaborator

@CheyuWu CheyuWu commented Nov 12, 2024

Why are these changes needed?

Implement test_in_place_update in Golang, and add a file rayservice_in_place_update_test.go under https://github.com/ray-project/kuberay/tree/master/ray-operator/test/e2e.

Related issue number

Close #2516

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@CheyuWu
Copy link
Collaborator Author

CheyuWu commented Nov 12, 2024

@kevin85421 @MortalHappiness
PTAL

@CheyuWu
Copy link
Collaborator Author

CheyuWu commented Nov 14, 2024

I've made all the necessary changes. If there’s anything else that needs to be modified, please let me know. Thanks!

@CheyuWu
Copy link
Collaborator Author

CheyuWu commented Nov 16, 2024

@MortalHappiness The pipeline is stuck. Could you help me to rerun it?

return cluster.Status.DesiredWorkerReplicas
}

func UnderlyingRayCluster(rayService *rayv1.RayService) func() (*rayv1.RayCluster, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func UnderlyingRayCluster(rayService *rayv1.RayService) func() (*rayv1.RayCluster, error) {
func UnderlyingRayClusterName(rayService *rayv1.RayService) string {

Change the function signature. And then we can use

Eventually(RayService(...)).WithTransform(UnderlyingRayClusterName, ...).Should(...)
rayClusterName := UnderlyingRayClusterName(...)

namespace := test.NewTestNamespace()
test.StreamKubeRayOperatorLogs()

fileName := "ray-service.sample.yaml"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we now put this file under e2e test folder, we should not depend on the yaml file. Please apply the RayService with ApplyConfiguration

type RayServiceApplyConfiguration struct {

You can also reference other test files under e2e folder that import rayv1ac "github.com/ray-project/kuberay/ray-operator/pkg/client/applyconfiguration/ray/v1" to see how to use it.

@CheyuWu
Copy link
Collaborator Author

CheyuWu commented Nov 17, 2024

Well, This is my revised version. Free to let me know if there are any issues because I'm not entirely sure if this is the best approach. I turned each config into a separate function instead of modifying the original template function to make it suitable as a shared function.

g.Expect(err).NotTo(HaveOccurred())
g.Expect(rayService).NotTo(BeNil())

// Wait for RayCluster name to be populated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

	// Wait for RayCluster name to be populated
	g.Eventually(RayService(test, namespace.Name, rayServiceName), TestTimeoutShort).Should(
		WithTransform(UnderlyingRayClusterName, Not(BeEmpty())),
	)
	rayClusterName := UnderlyingRayClusterName(rayService)

	test.T().Logf("Waiting for RayCluster %s/%s to be ready", namespace.Name, rayClusterName)
	g.Eventually(RayCluster(test, namespace.Name, rayClusterName), TestTimeoutMedium).
		Should(WithTransform(RayClusterState, Equal(rayv1.Ready)))
	rayCluster, err := GetRayCluster(test, namespace.Name, rayClusterName)
	g.Expect(err).NotTo(HaveOccurred())

	// Check if the head pod is ready
	g.Eventually(HeadPod(test, rayCluster), TestTimeoutShort).Should(WithTransform(sampleyaml.IsPodRunningAndReady, BeTrue()))

	// Check if all worker pods are ready
	g.Eventually(WorkerPods(test, rayCluster), TestTimeoutShort).Should(WithTransform(sampleyaml.AllPodsRunningAndReady, BeTrue()))

We don't need to check the RayCluster or the readiness of Pods for RayService. We only need to make sure RayService.Status.ServiceStatus is Running. The readiness of worker Pods also depends on the Ray Serve configurations. For example, if Ray Serve doesn't schedule any Ray Serve replicas to a worker Pod, its readiness probe will fail, preventing it from receiving any requests. Hence, it is possible for a healthy RayService to have non-ready worker Pods. I think we can remove the code I quoted above.

stdout, _ = curlRayServicePod(test, rayService, curlPod, curlContainerName, "/calc", `["MUL", 3]`)
g.Expect(stdout.String()).To(Equal("15 pizzas please!"))

// In place update
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
// In place update
// In-place update

)
g.Expect(err).NotTo(HaveOccurred())

g.Eventually(func(g Gomega) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. ObservedGeneration is not well-maintained in the KubeRay codebase. It's better not to use this field.
  2. The observability of the progress of in-place updates highly depends on Ray Serve's implementation, which may vary across different Ray releases.

Maybe we can use Eventually to send requests via the curl Pod to ensure that the return values are eventually updated and remove this line?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be removed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this g.Eventually(func(g Gomega) {...}).

Copy link
Collaborator Author

@CheyuWu CheyuWu Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, sorry, I misunderstood your meaning.

WithContainers(corev1ac.Container().
WithName("ray-worker").
WithImage(GetRayImage()).
WithLifecycle(corev1ac.Lifecycle().
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is unnecessary


func rayServiceSampleYamlApplyConfiguration() *rayv1ac.RayServiceSpecApplyConfiguration {
return rayv1ac.RayServiceSpec().WithServeConfigV2(`applications:
- name: fruit_app
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove the unnecessary parts of ServeConfigV2? For example, OrangeStand.

Copy link
Collaborator Author

@CheyuWu CheyuWu Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve tried several times, and it turns out that at least two fruit stands need to be retained; otherwise, it will keep getting stuck at "wait for deployment ready."

There are three types of fruits here. Do we want to remove one, or keep them all?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least two fruit stands need to be retained; otherwise, it will keep getting stuck at "wait for deployment ready."

Interesting. Could you open an issue about the reproduction? We don't need to handle the issue in this PR, but we should track the progress.

There are three types of fruits here. Do we want to remove one, or keep them all?

Remove one. Just try to make this AC as short as possible.

Copy link
Collaborator Author

@CheyuWu CheyuWu Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I have created this issue #2557


func rayServiceSampleYamlApplyConfiguration() *rayv1ac.RayServiceSpecApplyConfiguration {
return rayv1ac.RayServiceSpec().WithServeConfigV2(`applications:
- name: fruit_app
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least two fruit stands need to be retained; otherwise, it will keep getting stuck at "wait for deployment ready."

Interesting. Could you open an issue about the reproduction? We don't need to handle the issue in this PR, but we should track the progress.

There are three types of fruits here. Do we want to remove one, or keep them all?

Remove one. Just try to make this AC as short as possible.

num_cpus: 0.1
- name: Router
num_replicas: 1`).
WithRayClusterSpec(rayv1ac.RayClusterSpec().
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does reusing rayClusterSpec() work?

func rayClusterSpec() *rayv1ac.RayClusterSpecApplyConfiguration {

)
rayClusterName := UnderlyingRayClusterName(rayService)

test.T().Logf("Waiting for RayCluster %s/%s to be ready", namespace.Name, rayClusterName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible for a healthy RayService to have its underlying RayCluster status not be rayv1.Ready, and for some worker Pods to be not ready. We can remove the checks of Equal(rayv1.Ready) and WorkerPods(test, rayCluster).

Copy link
Collaborator Author

@CheyuWu CheyuWu Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you are correct. We only need to make sure the Head Pod is ready.

g.Expect(rayClusterName).NotTo(BeEmpty())
}, TestTimeoutShort).Should(Succeed())
g.Eventually(RayService(test, namespace.Name, rayServiceFromYaml.Name), TestTimeoutShort).Should(
WithTransform(UnderlyingRayClusterName, Not(BeEmpty())),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get ActiveServiceStatus.RayClusterName directly because the RayService has become Running, and zero-downtime has not been triggered.

Copy link
Collaborator Author

@CheyuWu CheyuWu Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will also remove function UnderlyingRayClusterName, because this is not used anymore

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment. Others look good to me.

)
g.Expect(err).NotTo(HaveOccurred())

g.Eventually(func(g Gomega) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this g.Eventually(func(g Gomega) {...}).

@MortalHappiness
Copy link
Member

Please fix the CI error. Thanks!

@CheyuWu
Copy link
Collaborator Author

CheyuWu commented Nov 21, 2024

Does reusing rayClusterSpec() work?

func rayClusterSpec() *rayv1ac.RayClusterSpecApplyConfiguration {

@MortalHappiness @kevin85421

I’d like to discuss a situation where the CI pipeline is stuck at Ray Serve applications are not ready to serve requests.
The issue seems to be that using the default rayClusterSpec() leads to insufficient CPU and memory resources.

I’d like to ask which approach would be better:

  1. Revert this commit. 63fb8db
  2. Modify rayClusterSpec() to accept parameters (but this would require changes to other scripts and functions as well).
  3. Any other suggestions I haven’t considered?

@MortalHappiness
Copy link
Member

MortalHappiness commented Nov 21, 2024

@kevin85421 Do you think it's better to lower the resource requests in our codebase, or to move the test to buildkite, which has more resources?

@kevin85421
Copy link
Member

Do you think it's better to lower the resource requests in our codebase, or to move the test to buildkite, which has more resources?

@MortalHappiness @CheyuWu I think the issue is not related to physical resources, so moving to Buildkite will likely result in the same issue. Instead, the issue seems to be related to Ray's logical resources.

@CheyuWu, Router will use 1 CPU. Can you set ray_actor_options to use num_cpus: 0.1?

          - name: Router
            num_replicas: 1

@CheyuWu CheyuWu closed this Nov 21, 2024
@CheyuWu CheyuWu reopened this Nov 21, 2024
@CheyuWu
Copy link
Collaborator Author

CheyuWu commented Nov 21, 2024

Do you think it's better to lower the resource requests in our codebase, or to move the test to buildkite, which has more resources?

@MortalHappiness @CheyuWu I think the issue is not related to physical resources, so moving to Buildkite will likely result in the same issue. Instead, the issue seems to be related to Ray's logical resources.

@CheyuWu, Router will use 1 CPU. Can you set ray_actor_options to use num_cpus: 0.1?

          - name: Router
            num_replicas: 1

2. rayClusterSpec()

I tested it locally by increasing the CPU and memory for the head pod to match the values specified in ray-service.sample.yaml, and it worked. However, if your approach works, that's even better since we can directly use rayClusterSpec().

Let me check if the pipeline passes.

@kevin85421
Copy link
Member

@CheyuWu, you can check the Ray dashboard to see which Ray Serve replica is not scheduled. A possible reason might be MangoStand. There are two replicas for MangoStand, with at most one replica per Ray node. However, it is possible for the Ray head Pod to be occupied by other Ray Serve replicas, so MangoStand cannot be scheduled on the Ray head. You can try removing max_replicas_per_node: 1 and changing num_replicas from 2 to 1.

          - name: MangoStand
            num_replicas: 2
            max_replicas_per_node: 1
            user_config:
              price: 3
            ray_actor_options:
              num_cpus: 0.1

@kevin85421 kevin85421 self-assigned this Nov 22, 2024
@CheyuWu
Copy link
Collaborator Author

CheyuWu commented Nov 22, 2024

@CheyuWu, you can check the Ray dashboard to see which Ray Serve replica is not scheduled. A possible reason might be MangoStand. There are two replicas for MangoStand, with at most one replica per Ray node. However, it is possible for the Ray head Pod to be occupied by other Ray Serve replicas, so MangoStand cannot be scheduled on the Ray head. You can try removing max_replicas_per_node: 1 and changing num_replicas from 2 to 1.

          - name: MangoStand
            num_replicas: 2
            max_replicas_per_node: 1
            user_config:
              price: 3
            ray_actor_options:
              num_cpus: 0.1

I conducted a few experiments and checked Ray Serve's logs on my local machine. Here are my findings:

  1. I found that if fruit_stand is not included in WithServeConfigV2, it will automatically apply the default configuration to deployments that are not explicitly set. This causes unused fruit_stand instances not to be properly removed.

I couldn't get rid of those fruit stand

  1. If the num_replicas for some fruit_stand deployments is set to 0, Ray Serve fails to initialize properly.

  2. The Ray Serve logs show issues related to CPU and OOM.
    serve_controller_416.log

@kevin85421
Copy link
Member

@CheyuWu got it. maybe use buildkite instead and increase the resources in rayClusterSpec().

@CheyuWu
Copy link
Collaborator Author

CheyuWu commented Nov 23, 2024

@CheyuWu got it. maybe use buildkite instead and increase the resources in rayClusterSpec().

I would like to ask if it’s okay to directly modify the resource in this way using headPodTemplateApplyConfiguration() within rayClusterSpec() at this location?

WithRayClusterSpec(rayv1ac.RayClusterSpec().
WithRayVersion(GetRayVersion()).
WithHeadGroupSpec(rayv1ac.HeadGroupSpec().
WithRayStartParams(map[string]string{
"dashboard-host": "0.0.0.0",
"num-gpus": "4",
"num-cpus": "4",
"resources": `'{"R1": 4}'`,
}).
WithTemplate(podTemplateSpecApplyConfiguration(headPodTemplateApplyConfiguration(),

I’m not familiar with it and because the testcase is named "lightweight," I’m worried it might affect the original setup.

@kevin85421 kevin85421 merged commit 8dda8b3 into ray-project:master Nov 24, 2024
25 checks passed
Ygnas pushed a commit to Ygnas/kuberay that referenced this pull request Mar 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Implement RayService In-place update test in Golang

3 participants