Skip to content

Conversation

@HarshAgarwal11
Copy link
Contributor

Why are these changes needed?

Have updated Probes Timeout Seconds for Ray Clusters to 2 seconds, as the value used in the probes command for wget timeout is 2 seconds. Also added the wget timeout to be picked from readiness default timeout seconds.
Initially Default Timeout Seconds for both the probes were 1 second, which used to result in the failure of probes, even before the wget command gets finished.

Related issue number

#2264

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests

@kevin85421 kevin85421 self-assigned this Jul 24, 2024
@kevin85421
Copy link
Member

Hi @HarshAgarwal11, would you mind fixing the CI error and running unit tests locally? You can read this doc for more details https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md#running-the-tests

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix CI errors

@HarshAgarwal11
Copy link
Contributor Author

Fix CI errors

@kevin85421 CI errors have been fixed

@kevin85421 kevin85421 merged commit 71984fb into ray-project:master Jul 28, 2024
// Ray FT default readiness probe values
DefaultReadinessProbeInitialDelaySeconds = 10
DefaultReadinessProbeTimeoutSeconds = 1
DefaultReadinessProbeTimeoutSeconds = 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into issues with the probe timeout in v1.1 as well.

I am thinking this probe timeout should actually be 4 or 5 seconds for the Head pod. This is because the probe for head pod runs both the agent heath check and GCS health check:

wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
 && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success

Which is collectively up to 4 seconds. Thoughts @kevin85421 @HarshAgarwal11

Copy link
Contributor Author

@HarshAgarwal11 HarshAgarwal11 Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should be 4 to 5 seconds. Because of the OR statement, timeout might get add up. And with 2 sec I was still getting some timeouts, not as frequent as earlier. But after changing it to 5 seconds, I didn't see any timeouts, there were some failures though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #2353

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened a separete issue to track exec probe issues: #2355

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants