Skip to content

Conversation

@gargnitingoogle
Copy link
Collaborator

@gargnitingoogle gargnitingoogle commented Aug 12, 2025

Description

This is a reopen of #3682 which got unintentionally closed.

This pull request introduces a new retry mechanism for stalled Folder API calls specifically for zonal buckets in gcsfuse. It replaces the existing GAX retries for these operations, aiming to improve robustness for zonal bucket interactions without affecting non-zonal buckets.

Highlights

  • Enhanced Retry Logic: Implemented a new retry mechanism for DeleteFolder, GetFolder, RenameFolder, and CreateFolder API calls, specifically for zonal buckets.
  • Conditional Application: The new retry logic is conditionally applied based on a new enableRetriesOnFolderAPIs flag, ensuring it only affects zonal buckets.
  • Refactored Client Wrapping: Updated the BucketHandle method to dynamically apply the appropriate control client wrapper (either the new retry logic for zonal buckets or GAX retries for non-zonal buckets) based on the bucket's type.
  • Comprehensive Testing: Expanded the test suite to include detailed scenarios for the new retry mechanism, covering success, retryable errors, non-retryable errors, and timeouts for all affected Folder APIs.

Link to the issue in case of a bug fix.

b/432458611
b/434621770
b/437834129
b/438631653

Testing details

  1. Manual - NA
  2. Unit tests - NA
  3. Integration tests - Ran as presubmits
  4. perf tests - Passed as presubmit - log
+--------+------------+--------------+------------+--------------+--------------+
| Branch | File Size  |   Read BW    |  Write BW  | RandRead BW  | RandWrite BW |
+--------+------------+--------------+------------+--------------+--------------+
| Master |  0.25MiB   |  540.3MiB/s  | 1.22MiB/s  |  76.53MiB/s  |  1.15MiB/s   |
|   PR   |  0.25MiB   | 571.54MiB/s  | 1.34MiB/s  |  80.13MiB/s  |  1.33MiB/s   |
|        |            |              |            |              |              |
|        |            |              |            |              |              |
| Master | 48.828MiB  | 3886.34MiB/s | 82.86MiB/s | 1501.91MiB/s |  84.48MiB/s  |
|   PR   | 48.828MiB  | 3965.45MiB/s | 83.02MiB/s | 1537.33MiB/s |  84.49MiB/s  |
|        |            |              |            |              |              |
|        |            |              |            |              |              |
| Master | 976.562MiB | 3974.53MiB/s | 38.44MiB/s |  693.2MiB/s  |  38.8MiB/s   |
|   PR   | 976.562MiB | 4034.75MiB/s | 37.06MiB/s | 1142.56MiB/s |  38.57MiB/s  |
|        |            |              |            |              |              |
|        |            |              |            |              |              |
+--------+------------+--------------+------------+--------------+--------------+

Any backward incompatible change? If so, please explain.

@gargnitingoogle gargnitingoogle requested a review from a team as a code owner August 12, 2025 18:24
@gargnitingoogle gargnitingoogle requested review from ashmeenkaur and removed request for a team August 12, 2025 18:24
@gargnitingoogle gargnitingoogle changed the title Gargnitin/control client api timeout fix/v2 feat: Retry stalled Folder API calls for zonal buckets Aug 12, 2025
@gargnitingoogle gargnitingoogle requested review from raj-prince and removed request for ashmeenkaur August 12, 2025 18:26
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v2 branch from 9d222f0 to 9d268f7 Compare August 12, 2025 19:31
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v1 branch from a2ae2a5 to 6a4248d Compare August 12, 2025 19:37
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v2 branch from 9d268f7 to 78c0893 Compare August 12, 2025 19:37
Copy link
Collaborator Author

@gargnitingoogle gargnitingoogle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some nit comments.

@gargnitingoogle gargnitingoogle added execute-integration-tests Run only integration tests execute-integration-tests-on-zb To run E2E tests on zonal bucket. labels Aug 13, 2025
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v1 branch from 6a4248d to cb7ce40 Compare August 13, 2025 10:22
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v2 branch from 78c0893 to c5c8019 Compare August 13, 2025 10:22
@gargnitingoogle gargnitingoogle changed the base branch from gargnitin/control-client-api-timeout-fix/v1 to master August 13, 2025 10:35
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v2 branch from c5c8019 to c7255ce Compare August 13, 2025 16:37
@codecov
Copy link

codecov bot commented Aug 13, 2025

Codecov Report

❌ Patch coverage is 88.49558% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.35%. Comparing base (acdf037) to head (3f79e51).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
internal/storage/control_client_wrapper.go 84.33% 11 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #3684       +/-   ##
===========================================
+ Coverage        0   82.35%   +82.35%     
===========================================
  Files           0      144      +144     
  Lines           0    23059    +23059     
===========================================
+ Hits            0    18991    +18991     
- Misses          0     3523     +3523     
- Partials        0      545      +545     
Flag Coverage Δ
unittests 82.35% <88.49%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v2 branch from c7255ce to 7b2bfc7 Compare August 14, 2025 04:03
@gargnitingoogle gargnitingoogle changed the base branch from master to gargnitin/control-client-api-timeout-fix/v1 August 14, 2025 06:26
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v1 branch from b04ec3b to 7e367ff Compare August 14, 2025 11:36
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v2 branch from 7b2bfc7 to 803e375 Compare August 14, 2025 11:36
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v1 branch from 7e367ff to 0e31d53 Compare August 19, 2025 04:33
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/control-client-api-timeout-fix/v2 branch from 803e375 to 5e00092 Compare August 19, 2025 04:34
@gargnitingoogle gargnitingoogle added execute-integration-tests Run only integration tests execute-integration-tests-on-zb To run E2E tests on zonal bucket. labels Aug 22, 2025
@gargnitingoogle gargnitingoogle removed execute-integration-tests Run only integration tests execute-integration-tests-on-zb To run E2E tests on zonal bucket. labels Aug 25, 2025
@raj-prince raj-prince self-requested a review August 25, 2025 07:29
Copy link
Collaborator

@raj-prince raj-prince left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, couple of nits which can be taken as different PR.

@gargnitingoogle gargnitingoogle merged commit a49233f into master Aug 25, 2025
16 checks passed
@gargnitingoogle gargnitingoogle deleted the gargnitin/control-client-api-timeout-fix/v2 branch August 25, 2025 08:25
alleaditya pushed a commit that referenced this pull request Aug 25, 2025
* add retry-stall timeout for folder APIs and their tests

* call -> API for consistency

* remove stall from stallRetry everywhere

* address gemini comment

* remove gax-based retries from folder API calls from ZB

add back gax-retries for non-zb for folder APIs

empty commit

temp

wip

wip

wip

final code change and debug logs

empty commit

* fix rebase issues

* rename storageControlClientRetryOptions -> storageControlClientGaxRetryOptions

* remove debug logs and add useful comments

* address some self-comments

* address more self-comments

* address some self-review comments

* add more unit tests

* enhance existing unit tests

* refactor BucketHandle() and add more unit tests to it

* empty commit
gargnitingoogle added a commit that referenced this pull request Sep 1, 2025
gargnitingoogle added a commit that referenced this pull request Sep 1, 2025
gargnitingoogle added a commit that referenced this pull request Sep 1, 2025
gargnitingoogle added a commit that referenced this pull request Sep 2, 2025
gargnitingoogle added a commit that referenced this pull request Sep 2, 2025
gargnitingoogle added a commit that referenced this pull request Sep 2, 2025
)

* address a comment in a previous commit

Fix wrt comment #3684 (comment) .

* Fix handling of gax-retries for non-ZB

- Create and hold raw storage control clients for
  both with and without gax retry scenarios.
- For shared controlClient held in
  storageHandle, wrap the raw storage control client
  without gax retries with billing project, and
  enhanced retries for storage-layout.
- For ZB, wrap the raw storage control client
  without gax retries with billing project, and
  enhanced retries for all APIs.
- For non-ZB, wrap the raw storage control client
  with gax retries for folder APIs, with billing project, and
  enhanced retries for storage-layout.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants