-
Notifications
You must be signed in to change notification settings - Fork 469
feat: Retry stalled GetStorageLayout API call for zonal buckets #3561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Retry stalled GetStorageLayout API call for zonal buckets #3561
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @gargnitingoogle, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily focuses on refining the retry behavior for control-client API calls by reducing their maximum timeout duration. Concurrently, it includes significant updates to the project's testing infrastructure, specifically modifying how gcsfuse is built and tested in presubmit performance checks, and streamlining the e2e test suite by removing emulator-specific test runs. A minor adjustment to logging file naming conventions is also part of these changes.
Highlights
- Control Client API Timeout Adjustment: The maximum retry sleep duration for control-client API calls has been explicitly set to 10 seconds, overriding a previously configurable value. This change, found in
internal/storage/storageutil/control_client.go, aims to reduce the upper limit of retry delays for these API interactions. - Logging File Path Simplification: In
cmd/legacy_main.go, the automatic appending of the.stderrsuffix to thestderrlog file name has been removed. This meansstderroutput will now be directed to the main log file path as configured. - Presubmit Performance Test Script Updates: The
perfmetrics/scripts/presubmit_test/pr_perf_test/build.shscript has been updated to ensuregcsfuseis built and installed from the current pull request's commit. It also switches the e2e test execution for zonal buckets fromimproved_run_e2e_tests.shtorun_e2e_tests.sh. - E2E Test Execution Streamlining: The
tools/integration_tests/run_e2e_tests.shscript has been modified to remove the execution and status checks for emulator-based e2e tests, simplifying the overall test execution flow within this script.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This PR aims to lower the API call retry duration for the control client. The core change in internal/storage/storageutil/control_client.go achieves this by hardcoding the max retry sleep. My feedback includes suggestions to improve the maintainability of this change by using a constant. Additionally, the PR includes several changes to CI scripts and application logic. I've pointed out some unrelated changes, potential regressions in script quality, and a significant reduction in test coverage by removing emulator tests. These changes should be clarified or reverted.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #3561 +/- ##
==========================================
- Coverage 82.25% 82.21% -0.04%
==========================================
Files 144 144
Lines 22910 22989 +79
==========================================
+ Hits 18844 18901 +57
- Misses 3522 3540 +18
- Partials 544 548 +4
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
1f84a6f to
8dde544
Compare
8dde544 to
0544d56
Compare
a01632c to
31ba72b
Compare
31ba72b to
cde26c0
Compare
c6fe29c to
f942e12
Compare
f942e12 to
53cf1fd
Compare
53cf1fd to
8ea6b60
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a robust retry mechanism for the GetStorageLayout API call, which is a valuable addition for improving the reliability of GCSFuse. The implementation of exponential backoff with jitter is well-structured. My review includes a suggestion to correct the jitter calculation to ensure it is effective on the first retry, which is crucial for preventing potential thundering herd issues, and a minor fix for a typo in a comment to improve code clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, few super nits.
The algorithm for defining the jitter in the next backoff duration now more closely resembles the algorithm in gax retries. Refer https://github.com/googleapis/gax-go/blob/d68660c4840e4d5aa3f78eee0705e0391082140d/v2/call_option.go#L202 for jittery backoff in gax retries.
74dfebf to
e1d9077
Compare
* add retry-stall timeout for storagelayout api and its tests * fix lint error * adress some review comments * stop handling improper inputs in withRetryOnStorageLayoutStall * add exponential jittery backoff in-between retries in contrl-client retrier * minor fix in interface of control-client wrapper creations * rename minRetryDeadline to initialRetryDeadline * remove stall from retry in name everywhere * split out backoff as a separate struct and its own unit tests * remove gax retry for GetStorageLayout API call * address some self and gemini review comments * remove exponential scaling from retry-deadline * add min backoff duration * enhance comment on a function * create a new backoff object every api call * introduce jitter in first retry backoff The algorithm for defining the jitter in the next backoff duration now more closely resembles the algorithm in gax retries. Refer https://github.com/googleapis/gax-go/blob/d68660c4840e4d5aa3f78eee0705e0391082140d/v2/call_option.go#L202 for jittery backoff in gax retries. * rename min backoff to initial backoff * address review comments
* add retry-stall timeout for storagelayout api and its tests * fix lint error * adress some review comments * stop handling improper inputs in withRetryOnStorageLayoutStall * add exponential jittery backoff in-between retries in contrl-client retrier * minor fix in interface of control-client wrapper creations * rename minRetryDeadline to initialRetryDeadline * remove stall from retry in name everywhere * split out backoff as a separate struct and its own unit tests * remove gax retry for GetStorageLayout API call * address some self and gemini review comments * remove exponential scaling from retry-deadline * add min backoff duration * enhance comment on a function * create a new backoff object every api call * introduce jitter in first retry backoff The algorithm for defining the jitter in the next backoff duration now more closely resembles the algorithm in gax retries. Refer https://github.com/googleapis/gax-go/blob/d68660c4840e4d5aa3f78eee0705e0391082140d/v2/call_option.go#L202 for jittery backoff in gax retries. * rename min backoff to initial backoff * address review comments
Description
This pull request introduces a robust retry mechanism into GCSFuse to enhance its reliability when interacting with Google Cloud Storage. The primary goal is to prevent operations from timing out prematurely by progressively increasing the context timeout duration for retried calls. This feature specifically targets the 'GetStorageLayout' API call, ensuring it is more resilient against stalls. The changes include a generic retry wrapper with configurable parameters and comprehensive unit tests to validate the new retry logic.
This is based on the design in go/gcsfuse-control-client-retries
Highlights
storageControlClientWithRetryOnStallandexecuteWithStallRetry) has been implemented to handle stalled control-client API calls with exponential backoff and time-bound retries.GetStorageLayoutAPI call to ensure its resilience against stalls.Changelog
This also includes the fix made in ci: Fix bash installation/version-check errors in e2e #3681 which is needed to fix random failures in e2e presubmit runs related to bash installation failure.Link to the issue in case of a bug fix.
b/432458611
b/434621770
b/437834129
b/438631653
Testing details
exit 1errors with permission issues in TestMountTimeout/TestMountDualRegionUSBucketWithTimeout test for flat bucket in non-zb in some attempts (transient failures)Software caused connection abortandTransport endpoint is not connected. Looks like unrelated issue, but need to debug.Any backward incompatible change? If so, please explain.