Bound R2 blob storage client timeouts and retries #5663
Open
enescakir wants to merge 1 commit into
We have seen recurring web request timeouts on POST /runtime/github/caches (reserveCache), surfacing on the Heroku router as H12 "Request timeout" after the full 30s ceiling, e.g.:
at=error code=H12 desc="Request timeout" method=POST
path="/runtime/github/caches?runId=..." service=30000ms status=503
The handler talks to Cloudflare R2 through the Aws::S3::Client built in GithubRepository#s3_client (create_multipart_upload and friends). That client was constructed without explicit HTTP timeouts or a retry cap, so it inherited the AWS SDK defaults of a 15s open timeout, a 60s read timeout, and 3 retries. When R2 is slow or a connection hangs, a single create_multipart_upload can therefore block well past the 30s router timeout, and the request is killed mid-flight rather than failing cleanly. The 5s Octokit timeouts in lib/github.rb do not help here: they only govern GitHub API calls, not the S3/R2 client.
Every call we make through this client is a small control-plane operation (multipart create/complete/abort, object and bucket delete, list multipart uploads). The actual cache blob data never flows through it; that moves over presigned URLs directly between the runner and R2. Short timeouts are therefore safe for all operations on this client.
Bound the client to a 5s open timeout, an 8s read timeout, and 2 retries. The worst case for a hung connection becomes (1 + retry_limit) * http_read_timeout = ~24s, which stays under the 30s web request timeout while still allowing two retries to ride out transient throttling or 5xx errors. The retry budget matters for commitCache (complete_multipart_upload), which has no application-level retry of its own, unlike reserveCache.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
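As a rough illustration of the bound described above, the sketch below collects the three settings and checks the worst-case arithmetic. The option keys `http_open_timeout`, `http_read_timeout`, and `retry_limit` are the `Aws::S3::Client` constructor options the description refers to; the constant names and both helper methods are hypothetical and are not taken from the actual diff. The estimate counts read timeouts only, so retry backoff and connection setup would add to it.

```ruby
# Hypothetical constants; the values are the ones named in the description.
R2_HTTP_OPEN_TIMEOUT = 5   # seconds allowed to establish the connection
R2_HTTP_READ_TIMEOUT = 8   # seconds allowed for a single read to stall
R2_RETRY_LIMIT       = 2   # retries after the first attempt
WEB_REQUEST_TIMEOUT  = 30  # router ceiling that produces H12

# Options that would be merged into the existing Aws::S3::Client.new call.
def bounded_r2_client_options
  {
    http_open_timeout: R2_HTTP_OPEN_TIMEOUT,
    http_read_timeout: R2_HTTP_READ_TIMEOUT,
    retry_limit: R2_RETRY_LIMIT
  }
end

# (1 + retry_limit) * http_read_timeout, ignoring backoff and connect time.
def worst_case_read_stall(options)
  (1 + options[:retry_limit]) * options[:http_read_timeout]
end

bounded = bounded_r2_client_options
sdk_defaults = {http_open_timeout: 15, http_read_timeout: 60, retry_limit: 3}

puts worst_case_read_stall(bounded)       # 24
puts worst_case_read_stall(sdk_defaults)  # 240
puts worst_case_read_stall(bounded) < WEB_REQUEST_TIMEOUT  # true

# Assumed usage, left commented out because it needs the aws-sdk-s3 gem and
# the endpoint/credentials that GithubRepository#s3_client already supplies:
#   require "aws-sdk-s3"
#   Aws::S3::Client.new(**existing_options, **bounded_r2_client_options)
```

With the SDK defaults the same formula gives 240s, which is why a single hung call could outlive the 30s router timeout many times over.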
/run-e2e E2E tests triggered: https://github.com/ubicloud/ubicloud/actions/runs/27468617375
jeremyevans approved these changes on Jun 13, 2026