Bound R2 blob storage client timeouts and retries #5663

Open: enescakir wants to merge 1 commit into main from enes/bound-r2-blob-storage-client
@enescakir

Member

We have seen recurring web request timeouts on POST /runtime/github/caches (reserveCache), surfacing on the Heroku router as H12 "Request timeout" after the full 30s ceiling, e.g.:

at=error code=H12 desc="Request timeout" method=POST
path="/runtime/github/caches?runId=..." service=30000ms status=503

The handler talks to Cloudflare R2 through the Aws::S3::Client built in GithubRepository#s3_client (create_multipart_upload and friends). That client was constructed without explicit HTTP timeouts or a retry cap, so it inherited the AWS SDK defaults of a 15s open timeout, a 60s read timeout, and 3 retries. When R2 is slow or a connection hangs, a single create_multipart_upload can therefore block well past the 30s router timeout, and the request is killed mid-flight rather than failing cleanly. The 5s Octokit timeouts in lib/github.rb do not help here: they only govern GitHub API calls, not the S3/R2 client.
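To make the arithmetic concrete, here is a rough sketch of the worst case under the SDK defaults quoted above. The constant and method names are hypothetical, and the estimate ignores the backoff sleeps the SDK adds between attempts:

```ruby
# SDK defaults as quoted in the description above.
# SDK_DEFAULTS, ROUTER_TIMEOUT_SECONDS and hung_read_seconds are illustrative
# names, not identifiers from the codebase.
SDK_DEFAULTS = {http_open_timeout: 15, http_read_timeout: 60, retry_limit: 3}.freeze
ROUTER_TIMEOUT_SECONDS = 30

# Seconds spent if every attempt (the first call plus each retry) hangs until
# the read timeout fires.
def hung_read_seconds(opts)
  (1 + opts[:retry_limit]) * opts[:http_read_timeout]
end

puts hung_read_seconds(SDK_DEFAULTS) # prints 240
```

Even a single hung attempt (60s) already exceeds the 30s router ceiling, before any retry is counted.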

Every call we make through this client is a small control-plane operation (multipart create/complete/abort, object and bucket delete, list multipart uploads). The actual cache blob data never flows through it; that moves over presigned URLs directly between the runner and R2. Short timeouts are therefore safe for all operations on this client.

Bound the client to a 5s open timeout, an 8s read timeout, and 2 retries. The worst case for a hung connection becomes (1 + retry_limit) * http_read_timeout = 3 * 8s = ~24s (plus the SDK's short backoff between attempts), which stays under the 30s web request timeout while still allowing two retries to ride out transient throttling or 5xx errors. The retry budget matters for commitCache (complete_multipart_upload), which has no application-level retry of its own, unlike reserveCache.
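The bounds described above could be expressed roughly as follows. This is a minimal sketch, not the actual diff: `http_open_timeout`, `http_read_timeout`, and `retry_limit` are real `Aws::S3::Client` constructor options, while the constant and helper names are hypothetical and the existing client options (endpoint, credentials, region) are assumed rather than shown.

```ruby
# Proposed bounds from the description; the keys match Aws::S3::Client options.
# R2_CLIENT_BOUNDS and the helpers below are illustrative names only.
R2_CLIENT_BOUNDS = {
  http_open_timeout: 5,
  http_read_timeout: 8,
  retry_limit: 2
}.freeze

WEB_REQUEST_TIMEOUT_SECONDS = 30

# Seconds spent if the first call and every retry hang until the read timeout
# fires (backoff sleeps between attempts are not included).
def worst_case_hung_seconds(bounds)
  (1 + bounds[:retry_limit]) * bounds[:http_read_timeout]
end

# True when the worst case stays strictly under the web request budget.
def fits_router_budget?(bounds, budget = WEB_REQUEST_TIMEOUT_SECONDS)
  worst_case_hung_seconds(bounds) < budget
end

# With the aws-sdk-s3 gem loaded, the bounds would be merged into the client
# construction, for example:
#   Aws::S3::Client.new(**existing_options, **R2_CLIENT_BOUNDS)
# where existing_options is an assumed hash of the options already in use.

puts worst_case_hung_seconds(R2_CLIENT_BOUNDS) # prints 24
puts fits_router_budget?(R2_CLIENT_BOUNDS)     # prints true
```

Keeping the check as a small helper makes it easy to see that raising `retry_limit` to 3 at the same read timeout would reach 32s and no longer fit the 30s budget.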

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@enescakir enescakir requested a review from jeremyevans June 13, 2026 13:49
@enescakir enescakir self-assigned this Jun 13, 2026
@enescakir

enescakir commented Jun 13, 2026

Member Author

/run-e2e

E2E tests triggered: https://github.com/ubicloud/ubicloud/actions/runs/27468617375
