Skip to content

Conversation

@gargnitingoogle
Copy link
Collaborator

@gargnitingoogle gargnitingoogle commented Jul 22, 2025

Description

This pull request introduces a robust retry mechanism into GCSFuse to enhance its reliability when interacting with Google Cloud Storage. The primary goal is to prevent operations from timing out prematurely by progressively increasing the context timeout duration for retried calls. This feature specifically targets the 'GetStorageLayout' API call, ensuring it is more resilient against stalls. The changes include a generic retry wrapper with configurable parameters and comprehensive unit tests to validate the new retry logic.

This is based on the design in go/gcsfuse-control-client-retries

Highlights

  • Generic Retry Mechanism: A generic retry wrapper (storageControlClientWithRetryOnStall and executeWithStallRetry) has been implemented to handle stalled control-client API calls with exponential backoff and time-bound retries.
  • Resilience for GetStorageLayout: The new retry logic has been specifically applied to the GetStorageLayout API call to ensure its resilience against stalls.
  • Enhanced Folder Operations for Zonal Buckets: The retry mechanism has been extended to all folder-related control-client operations (Create, Get, Rename, Delete Folder) specifically for zonal buckets, making these operations more robust.
  • Configurable Retry Parameters: Default retry parameters (minimum/maximum deadline, retry multiplier, and total retry budget) are now centralized and configurable as constants, improving maintainability.
Changelog
  • internal/storage/control_client_wrapper.go
    • Added necessary imports for 'fmt', 'time', 'logger', and 'storageutil'.
    • Introduced constants for default retry parameters: 'defaultControlClientMinRetryDeadline', 'defaultControlClientMaxRetryDeadline', 'defaultControlClientRetryMultiplier', and 'defaultControlClientTotalRetryBudget'.
    • Defined 'storageControlClientWithRetryOnStall' struct to encapsulate retry configuration and the raw client.
    • Implemented 'executeWithStallRetry', a generic function that applies time-bound, exponential backoff retry logic to any control client API call.
    • Modified the 'GetStorageLayout' method within 'storageControlClientWithRetryOnStall' to leverage the new 'executeWithStallRetry' function, enabling conditional retries.
    • Added 'newRetryWrapper' and 'withRetryOnStorageLayoutStall' helper functions to create wrapped control clients with specific retry behaviors, including parameter sanitization.
  • internal/storage/control_client_wrapper_test.go
    • Added a new test file to provide comprehensive unit tests for the new retry mechanism.
    • Introduced 'stallingStorageControlClient' to simulate network latency for testing timeout-based retries.
    • Implemented test suites ('ControlClientStallRetryWrapperTest', 'StorageLayoutStallRetryWrapperTest') to cover various scenarios, including successful calls, retryable errors, non-retryable errors, attempts timing out and then succeeding, and all attempts timing out.
    • Included tests for parameter sanitization in the 'newRetryWrapper' function.
  • internal/storage/storage_handle.go
    • Integrated the 'withRetryOnStorageLayoutStall' wrapper when initializing the 'StorageControlClient', ensuring resilience for 'GetStorageLayout' calls.
  • internal/storage/storage_handle_test.go
    • Updated the 'TestNewStorageHandleWithBillingProject' test to correctly assert the new nested wrapper structure, reflecting the revised order of applying client wrappers.
  • perfmetrics/scripts/install_bash.sh
    • Modified the bash installation verification logic to include checks for the expected binary path, its presence in the system's PATH, and the version of the accessible bash executable.
  • perfmetrics/scripts/presubmit_test/pr_perf_test/build.sh
    • Changed explicit calls to '/usr/local/bin/bash' to simply 'bash' for running e2e tests on zonal and non-zonal buckets.

Link to the issue in case of a bug fix.

b/432458611
b/434621770
b/437834129
b/438631653

Testing details

  1. Manual - NA
  2. Unit tests - NA
  3. Integration tests - Ran through presubmits
    • run1 - e2e presubmit zb and non-zb - passed
    • run2 - - e2e presubmit zb and non-zb - failed
    • run3 - e2e presubmit zb and non-zb - failed (failure fixed in ci: Update test TestNewFileUnderImplicitDirectoryShouldNotGetSyncedToGCSTillClose for zonal/rapid bucket #3644)
    • run4 - reran with the fix in ci: Update test TestNewFileUnderImplicitDirectoryShouldNotGetSyncedToGCSTillClose for zonal/rapid bucket #3644 - e2e zb only - passed
    • run5 - e2e non-zb and perf only - failed with weird exit 1 errors with permission issues in TestMountTimeout/TestMountDualRegionUSBucketWithTimeout test for flat bucket in non-zb in some attempts (transient failures)
    • run6 - e2e non-zb - passed
    • run7 - e2e zb - passed
    • run8 - e2e zb - passed
    • run9 - e2e zb and non-zb - failed with unrelated test TestUnfinalizedObjectOperationTest/TestUnfinalizedObjectCantBeRenamedIfCreatedFromDifferentMount.
    • run10 - e2e presubmit zb and non-zb - with fix for the prev failing test and latest code - passed
    • run11 - presubmit perf and e2e presubmit zb and non-zb - got aborted during fio run for some reason.
    • run12 - presubmit perf and e2e presubmit zb and non-zb - got aborted at perf test run again with transport endpoint error.
    • run13 - presubmit e2e presubmit zb and non-zb - passed
    • run14 - perf tests - failing with errors Software caused connection abort and Transport endpoint is not connected . Looks like unrelated issue, but need to debug.
    • run15 - perf tests - passed.
    • run16 - e2e zb presubmit - got aborted unintentionally on re-push
    • run17 - e2e zb presubmit - passed
    • run18 - e2e zb and non-zb presubmit - re-run after restructure (only getStorageLayout changes in this PR now) - passed
    • run19 - e2e zb and non-zb - with jittery backoff on retry - non-zb failed because of the random failure fixed in test: Stabilize mount_timeout test by waiting for unmount #3664 .
    • run20 - e2e zb and non-zb - with jittery backoff on retry - passed .
    • run21 - e2e zb and non-zb - with jittery backoff on retry - passed .
    • run22 - e2e zb and non-zb - passed
    • run23 - e2e zb and non-zb - passed
    • run24 - e2e zb and non-zb - passed
  4. Perf testing
+--------+------------+--------------+------------+--------------+--------------+
| Branch | File Size  |   Read BW    |  Write BW  | RandRead BW  | RandWrite BW |
+--------+------------+--------------+------------+--------------+--------------+
| Master |  0.25MiB   | 580.21MiB/s  |  1.3MiB/s  |  78.63MiB/s  |  1.22MiB/s   |
|   PR   |  0.25MiB   | 593.88MiB/s  | 1.24MiB/s  |  79.96MiB/s  |   1.2MiB/s   |
|        |            |              |            |              |              |
|        |            |              |            |              |              |
| Master | 48.828MiB  | 3706.94MiB/s | 86.73MiB/s | 1498.15MiB/s |  84.7MiB/s   |
|   PR   | 48.828MiB  | 3672.92MiB/s | 85.04MiB/s | 1522.84MiB/s |  86.12MiB/s  |
|        |            |              |            |              |              |
|        |            |              |            |              |              |
| Master | 976.562MiB | 3684.57MiB/s | 37.63MiB/s | 836.78MiB/s  |  39.05MiB/s  |
|   PR   | 976.562MiB | 3692.63MiB/s | 38.08MiB/s | 1118.22MiB/s |  39.3MiB/s   |
|        |            |              |            |              |              |
|        |            |              |            |              |              |
+--------+------------+--------------+------------+--------------+--------------+

Any backward incompatible change? If so, please explain.

@gargnitingoogle gargnitingoogle added the execute-integration-tests-on-zb To run E2E tests on zonal bucket. label Jul 22, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @gargnitingoogle, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on refining the retry behavior for control-client API calls by reducing their maximum timeout duration. Concurrently, it includes significant updates to the project's testing infrastructure, specifically modifying how gcsfuse is built and tested in presubmit performance checks, and streamlining the e2e test suite by removing emulator-specific test runs. A minor adjustment to logging file naming conventions is also part of these changes.

Highlights

  • Control Client API Timeout Adjustment: The maximum retry sleep duration for control-client API calls has been explicitly set to 10 seconds, overriding a previously configurable value. This change, found in internal/storage/storageutil/control_client.go, aims to reduce the upper limit of retry delays for these API interactions.
  • Logging File Path Simplification: In cmd/legacy_main.go, the automatic appending of the .stderr suffix to the stderr log file name has been removed. This means stderr output will now be directed to the main log file path as configured.
  • Presubmit Performance Test Script Updates: The perfmetrics/scripts/presubmit_test/pr_perf_test/build.sh script has been updated to ensure gcsfuse is built and installed from the current pull request's commit. It also switches the e2e test execution for zonal buckets from improved_run_e2e_tests.sh to run_e2e_tests.sh.
  • E2E Test Execution Streamlining: The tools/integration_tests/run_e2e_tests.sh script has been modified to remove the execution and status checks for emulator-based e2e tests, simplifying the overall test execution flow within this script.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gargnitingoogle gargnitingoogle changed the title Gargnitin/control client api timeout fix/v1 ci(e2 tests): Lower control-client API call retry duration upper limit to 10s Jul 22, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This PR aims to lower the API call retry duration for the control client. The core change in internal/storage/storageutil/control_client.go achieves this by hardcoding the max retry sleep. My feedback includes suggestions to improve the maintainability of this change by using a constant. Additionally, the PR includes several changes to CI scripts and application logic. I've pointed out some unrelated changes, potential regressions in script quality, and a significant reduction in test coverage by removing emulator tests. These changes should be clarified or reverted.

@codecov
Copy link

codecov bot commented Jul 22, 2025

Codecov Report

❌ Patch coverage is 78.75000% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.21%. Comparing base (3050002) to head (accfb7b).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
internal/storage/control_client_wrapper.go 77.92% 14 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3561      +/-   ##
==========================================
- Coverage   82.25%   82.21%   -0.04%     
==========================================
  Files         144      144              
  Lines       22910    22989      +79     
==========================================
+ Hits        18844    18901      +57     
- Misses       3522     3540      +18     
- Partials      544      548       +4     
Flag Coverage Δ
unittests 82.21% <78.75%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v1 branch from 1f84a6f to 8dde544 Compare July 23, 2025 05:26
@gargnitingoogle gargnitingoogle changed the title ci(e2 tests): Lower control-client API call retry duration upper limit to 10s ci(e2 tests): Lower control-client API call retry duration upper limit to 10s and log these retries Jul 23, 2025
@gargnitingoogle gargnitingoogle added the execute-integration-tests Run only integration tests label Jul 23, 2025
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v1 branch from 8dde544 to 0544d56 Compare July 30, 2025 04:12
@gargnitingoogle gargnitingoogle removed execute-integration-tests Run only integration tests execute-integration-tests-on-zb To run E2E tests on zonal bucket. labels Aug 1, 2025
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v1 branch 2 times, most recently from a01632c to 31ba72b Compare August 4, 2025 01:10
@gargnitingoogle gargnitingoogle added the execute-integration-tests-on-zb To run E2E tests on zonal bucket. label Aug 4, 2025
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v1 branch from 31ba72b to cde26c0 Compare August 4, 2025 07:02
@gargnitingoogle gargnitingoogle changed the title ci(e2 tests): Lower control-client API call retry duration upper limit to 10s and log these retries ci(e2 tests): Run e2e zb daily tests with gcsfuse-level retry on control-client idempotent APIs Aug 4, 2025
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v1 branch 2 times, most recently from c6fe29c to f942e12 Compare August 4, 2025 11:15
@gargnitingoogle gargnitingoogle changed the title ci(e2 tests): Run e2e zb daily tests with gcsfuse-level retry on control-client idempotent APIs ci(e2 tests): Retry control-client calls from GCSFuse Aug 4, 2025
@gargnitingoogle gargnitingoogle added execute-perf-test Execute performance test in PR execute-integration-tests Run only integration tests and removed execute-perf-test Execute performance test in PR labels Aug 4, 2025
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v1 branch from f942e12 to 53cf1fd Compare August 4, 2025 11:45
@gargnitingoogle gargnitingoogle changed the title ci(e2 tests): Retry control-client calls from GCSFuse fix: Retry control-client calls from GCSFuse Aug 4, 2025
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v1 branch from 53cf1fd to 8ea6b60 Compare August 5, 2025 02:41
@gargnitingoogle gargnitingoogle marked this pull request as ready for review August 5, 2025 02:48
@gargnitingoogle gargnitingoogle requested a review from a team as a code owner August 5, 2025 02:48
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust retry mechanism for the GetStorageLayout API call, which is a valuable addition for improving the reliability of GCSFuse. The implementation of exponential backoff with jitter is well-structured. My review includes a suggestion to correct the jitter calculation to ensure it is effective on the first retry, which is crucial for preventing potential thundering herd issues, and a minor fix for a typo in a comment to improve code clarity.

Copy link
Collaborator

@raj-prince raj-prince left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, few super nits.

@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v1 branch from 74dfebf to e1d9077 Compare August 21, 2025 15:43
@gargnitingoogle gargnitingoogle enabled auto-merge (squash) August 22, 2025 04:50
@gargnitingoogle gargnitingoogle merged commit c90fb5e into master Aug 22, 2025
14 checks passed
@gargnitingoogle gargnitingoogle deleted the gargnitin/control-client-api-timeout-fix/v1 branch August 22, 2025 05:20
alleaditya pushed a commit that referenced this pull request Aug 25, 2025
* add retry-stall timeout for storagelayout api and its tests

* fix lint error

* adress some review comments

* stop handling improper inputs in withRetryOnStorageLayoutStall

* add exponential jittery backoff in-between retries in contrl-client retrier

* minor fix in interface of control-client wrapper creations

* rename minRetryDeadline to initialRetryDeadline

* remove stall from retry in name everywhere

* split out backoff as a separate struct and its own unit tests

* remove gax retry for GetStorageLayout API call

* address some self and gemini review comments

* remove exponential scaling from retry-deadline

* add min backoff duration

* enhance comment on a function

* create a new backoff object every api call

* introduce jitter in first retry backoff

The algorithm for defining the jitter in the next
backoff duration now more closely resembles the algorithm in gax retries.
Refer https://github.com/googleapis/gax-go/blob/d68660c4840e4d5aa3f78eee0705e0391082140d/v2/call_option.go#L202
for jittery backoff in gax retries.

* rename min backoff to initial backoff

* address review comments
Tulsishah pushed a commit that referenced this pull request Aug 25, 2025
* add retry-stall timeout for storagelayout api and its tests

* fix lint error

* adress some review comments

* stop handling improper inputs in withRetryOnStorageLayoutStall

* add exponential jittery backoff in-between retries in contrl-client retrier

* minor fix in interface of control-client wrapper creations

* rename minRetryDeadline to initialRetryDeadline

* remove stall from retry in name everywhere

* split out backoff as a separate struct and its own unit tests

* remove gax retry for GetStorageLayout API call

* address some self and gemini review comments

* remove exponential scaling from retry-deadline

* add min backoff duration

* enhance comment on a function

* create a new backoff object every api call

* introduce jitter in first retry backoff

The algorithm for defining the jitter in the next
backoff duration now more closely resembles the algorithm in gax retries.
Refer https://github.com/googleapis/gax-go/blob/d68660c4840e4d5aa3f78eee0705e0391082140d/v2/call_option.go#L202
for jittery backoff in gax retries.

* rename min backoff to initial backoff

* address review comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

execute-integration-tests Run only integration tests execute-integration-tests-on-zb To run E2E tests on zonal bucket. remind-reviewers Auto remind reviewers in attention set for review post 24hrs of inactivity on PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants