Bound R2 blob storage client timeouts and retries by enescakir · Pull Request #5663 · ubicloud/ubicloud

enescakir · 2026-06-13T13:49:12Z

We have seen recurring web request timeouts on POST /runtime/github/caches (reserveCache), surfacing on the Heroku router as H12 "Request timeout" after the full 30s ceiling, e.g.:

at=error code=H12 desc="Request timeout" method=POST
path="/runtime/github/caches?runId=..." service=30000ms status=503

The handler talks to Cloudflare R2 through the Aws::S3::Client built in GithubRepository#s3_client (create_multipart_upload and friends). That client was constructed without explicit HTTP timeouts or a retry cap, so it inherited the AWS SDK defaults of a 15s open timeout, a 60s read timeout, and 3 retries. When R2 is slow or a connection hangs, a single create_multipart_upload can therefore block well past the 30s router timeout, and the request is killed mid-flight rather than failing cleanly. The 5s Octokit timeouts in lib/github.rb do not help here: they only govern GitHub API calls, not the S3/R2 client.

Every call we make through this client is a small control-plane operation (multipart create/complete/abort, object and bucket delete, list multipart uploads). The actual cache blob data never flows through it; that moves over presigned URLs directly between the runner and R2. Short timeouts are therefore safe for all operations on this client.

Bound the client to a 5s open timeout, an 8s read timeout, and 2 retries. The worst case for a hung connection becomes (1 + retry_limit) * http_read_timeout = ~24s, which stays under the 30s web request timeout while still allowing two retries to ride out transient throttling or 5xx errors. The retry budget matters for commitCache (complete_multipart_upload), which has no application-level retry of its own, unlike reserveCache.
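The budget arithmetic above can be checked with a small sketch. It is an illustration, not the code from this PR: `worst_case_seconds` and the local variable names are hypothetical, and the estimate counts only read timeouts, ignoring connect time and any backoff the SDK adds between retries. The option names `http_open_timeout`, `http_read_timeout`, and `retry_limit` are the `Aws::S3::Client` settings named in the description.

```ruby
# Hypothetical helper: worst-case seconds one hung call can block,
# counting only read timeouts (no connect time, no retry backoff).
def worst_case_seconds(read_timeout:, retry_limit:)
  (1 + retry_limit) * read_timeout
end

router_timeout = 30 # web request ceiling reported as H12, in seconds

# AWS SDK defaults as cited in the description.
sdk_defaults = {http_open_timeout: 15, http_read_timeout: 60, retry_limit: 3}

# Values proposed in this PR.
bounded = {http_open_timeout: 5, http_read_timeout: 8, retry_limit: 2}

[sdk_defaults, bounded].each do |opts|
  worst = worst_case_seconds(
    read_timeout: opts[:http_read_timeout],
    retry_limit: opts[:retry_limit]
  )
  verdict = (worst < router_timeout) ? "under" : "over"
  puts "#{opts} -> #{worst}s (#{verdict} the #{router_timeout}s limit)"
end

# With the aws-sdk-s3 gem loaded, these options would be passed along the
# lines of Aws::S3::Client.new(**bounded, **other_options), where
# other_options (endpoint, credentials, region) is assumed here.
```

Under the same simplification, the defaults allow a single call to block for several minutes, which is why the request hits the router ceiling instead of failing cleanly.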

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

enescakir · 2026-06-13T13:50:01Z

/run-e2e

E2E tests triggered: https://github.com/ubicloud/ubicloud/actions/runs/27468617375

enescakir requested a review from jeremyevans June 13, 2026 13:49

enescakir self-assigned this Jun 13, 2026

github-actions Bot temporarily deployed to E2E-gcp June 13, 2026 13:50 Inactive

github-actions Bot temporarily deployed to E2E-aws June 13, 2026 13:50 Inactive

github-actions Bot temporarily deployed to E2E-metal June 13, 2026 13:50 Inactive

github-actions Bot temporarily deployed to E2E-gcp June 13, 2026 15:11 Inactive

github-actions Bot temporarily deployed to E2E-aws June 13, 2026 15:11 Inactive

jeremyevans approved these changes Jun 13, 2026

enescakir wants to merge 1 commit into main from enes/bound-r2-blob-storage-client
