Conversation

@Yicheng-Lu-llll
Member

Why are these changes needed?

The test "should be able to update all Pods to Running" is flaky. The reason is that we get Kubernetes cluster information from the local cache, and the local cache may lag behind the Kubernetes API server.

So, the fix is to get the true status of the Pods directly from the API server in the operator test.

The client.New documentation says:

New returns a new Client using the provided config and Options. The returned client reads and writes directly from the server (it doesn't use object caches). It understands how to work with normal types (both custom resources and aggregated/built-in resources), as well as unstructured types.

The kubebuilder writing-tests documentation says:

Note that we set up both a “live” k8s client and a separate client from the manager. This is because when making assertions in tests, you generally want to assert against the live state of the API server. If you use the client from the manager (k8sManager.GetClient), you’d end up asserting against the contents of the cache instead, which is slower and can introduce flakiness into your tests. We could use the manager’s APIReader to accomplish the same thing, but that would leave us with two clients in our test assertions and setup (one for reading, one for writing), and it’d be easy to make mistakes.

So, we should use the k8sClient created from client.New instead of the one from k8sManager.GetClient().

In the current implementation, we use the k8sClient from k8sManager.GetClient():

	k8sClient = mgr.GetClient()
	Expect(k8sClient).ToNot(BeNil())

Thus, I removed the k8sClient obtained from k8sManager.GetClient() above and use the k8sClient created from client.New:

	k8sClient, err = client.New(cfg, client.Options{Scheme: scheme.Scheme})
	Expect(err).ToNot(HaveOccurred())
	Expect(k8sClient).ToNot(BeNil())
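
With this live client, test assertions read state straight from the envtest API server instead of the manager's cache. For illustration, here is a minimal sketch of such an assertion, assuming the Gomega and controller-runtime imports already used by the suite; the namespace, label selector, and timeouts are placeholders rather than the actual values from the test:

	// Sketch only: poll the live API server (not the manager's cache)
	// until every matching Pod reports Running.
	// Assumed imports: context, time, corev1 "k8s.io/api/core/v1",
	// "sigs.k8s.io/controller-runtime/pkg/client", and Gomega.
	Eventually(func() bool {
		podList := &corev1.PodList{}
		err := k8sClient.List(context.Background(), podList,
			client.InNamespace("default"), // placeholder namespace
			client.MatchingLabels{"ray.io/cluster": "raycluster-sample"}) // placeholder selector
		if err != nil {
			return false
		}
		for _, pod := range podList.Items {
			if pod.Status.Phase != corev1.PodRunning {
				return false
			}
		}
		return len(podList.Items) > 0
	}, 30*time.Second, 500*time.Millisecond).Should(BeTrue())

Because these reads bypass the cache, the assertion only passes once the API server itself reports the Pods as Running, which removes the cache-lag source of flakiness.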

Related issue number

Closes #894

Checks

On a VM with 2 CPU cores and 7 GB RAM (to simulate GitHub's standard Linux runner), I ran the test 100 times:
Before this change, "should be able to update all Pods to Running" failed in 34 of the 100 runs.
After this change, it did not fail in any of the 100 runs.

@kevin85421 kevin85421 (Member) left a comment

Thank you for this contribution! QQ: Where is client.New?

@Yicheng-Lu-llll
Member Author

Thank you for this contribution! QQ: Where is client.New?

In the current implementation, k8sClient is assigned in two places:

https://github.com/Jeffwan/kuberay/blob/b3665af42481dee798f75e64ee2ea38f980d6066/ray-operator/controllers/suite_test.go#L72 (from client.New)

https://github.com/Jeffwan/kuberay/blob/b3665af42481dee798f75e64ee2ea38f980d6066/ray-operator/controllers/suite_test.go#L90 (overwrites the previous one with the client from mgr.GetClient())

So, we only need to delete the assignment at line 90.
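
Roughly, the relevant part of suite_test.go after this change looks like the following condensed sketch (error handling and unrelated setup omitted; it assumes the imports and variables already present in the suite):

	// Condensed sketch: the live client is created once from client.New,
	// and the later reassignment from mgr.GetClient() is removed, so all
	// test reads and writes go straight to the envtest API server.
	cfg, err := testEnv.Start()
	Expect(err).ToNot(HaveOccurred())

	k8sClient, err = client.New(cfg, client.Options{Scheme: scheme.Scheme})
	Expect(err).ToNot(HaveOccurred())
	Expect(k8sClient).ToNot(BeNil())

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{Scheme: scheme.Scheme})
	Expect(err).ToNot(HaveOccurred())
	// Note: no "k8sClient = mgr.GetClient()" reassignment anymore.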

@kevin85421 kevin85421 (Member) left a comment

This is very helpful! LGTM!

@kevin85421
Member

cc @davidxia

@kevin85421 kevin85421 merged commit 2fa40fb into ray-project:master Feb 8, 2023
Yicheng-Lu-llll added a commit to Yicheng-Lu-llll/kuberay that referenced this pull request Feb 9, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023