Too many API calls when restoring a backup from a S3 increment chain#1361

Merged
Slach merged 6 commits into Altinity:master from Sfaynet:fix_list_duration_s3_api_walk_5
May 20, 2026

@Sfaynet Sfaynet commented Apr 17, 2026

fix #1362

Hello everyone!
I'd like to share the following issue:

When restoring (restore_remote) a large number of ClickHouse tables from S3, including incremental backups, the logs fill with list_duration messages (pkg/storage/general.go:222), and more time appears to be spent on that listing logic than on downloading the data itself. Restoring 280 GB of data across approximately 3,500 tables from S3 takes 8 hours, whereas restoring the same backup after it has been downloaded locally takes 11 minutes. Sequential downloads from S3 (MinIO) run at around a gigabyte per second.
I tried tweaking the parameters, but it didn't help:
S3_ALLOW_MULTIPART_DOWNLOAD="true"
DOWNLOAD_BY_PART="true"
DOWNLOAD_CONCURRENCY="255"
S3_CONCURRENCY
After analyzing the repo with AI, claude-sonnet identified the cause:
Excessive S3 API calls (Walk) - up to 17,500 calls for each table in the increment chain.
The AI suggested the following as the primary solution:
In-Memory Cache (pkg/storage/general.go)
Added an in-memory cache for backup metadata
Cache key: {storage_type}:{backup_name}
When requesting metadata for the same backup again, the data is retrieved from the cache without calling S3.
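
The caching idea described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `BackupMeta`, `metaCache`, `cacheKey`, and the `fetch` callback are hypothetical names standing in for the real metadata type and the expensive remote listing.

```go
package main

import "sync"

// BackupMeta is a hypothetical stand-in for parsed backup metadata.
type BackupMeta struct {
	Name           string
	RequiredBackup string
}

// metaCache is an illustrative in-memory cache, not the PR's actual type.
type metaCache struct {
	mu      sync.Mutex
	entries map[string]BackupMeta
}

func newMetaCache() *metaCache {
	return &metaCache{entries: make(map[string]BackupMeta)}
}

// cacheKey builds the {storage_type}:{backup_name} key.
func cacheKey(storageType, backupName string) string {
	return storageType + ":" + backupName
}

// get returns cached metadata and calls fetch (the expensive remote
// lookup) only on a miss. The lock is held during fetch, so concurrent
// callers are serialized; that keeps the sketch simple.
func (c *metaCache) get(storageType, backupName string,
	fetch func(name string) (BackupMeta, error)) (BackupMeta, error) {
	key := cacheKey(storageType, backupName)
	c.mu.Lock()
	defer c.mu.Unlock()
	if m, ok := c.entries[key]; ok {
		return m, nil
	}
	m, err := fetch(backupName)
	if err != nil {
		return BackupMeta{}, err
	}
	c.entries[key] = m
	return m, nil
}
```

With such a cache, repeated `get("s3", "backup1", fetch)` calls would invoke `fetch` once and serve later requests from memory.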

I'm not a Go expert, but the changes suggested by the AI reduced the backup restore time from 8 hours to 30 minutes.

@Slach Slach self-requested a review April 23, 2026 05:23
Slach added 5 commits April 23, 2026 08:24
… chains

Replace the in-memory backupListCache added in PR Altinity#1361 with a direct
metadata.json fetch + on-disk cache reuse on the fast path.

BackupList(parseMetadata=true, parseMetadataOnly=name) used to Walk the
whole bucket root on every call. For incremental-chain restores this was
invoked per table (downloadDiffParts -> ReadBackupMetadataRemote), giving
N_tables * chain_length list calls (~17500 on a real workload).
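
As a rough worked example of that product: the chain length of 5 used below is an assumption (it is not stated in this PR), chosen because it is consistent with the approximately 3,500 tables from the description and the ~17500 figure. `listCalls` is a hypothetical helper.

```go
package main

// listCalls estimates the number of remote list calls when each table
// triggers one listing per backup in the incremental chain.
func listCalls(tables, chainLength int) int {
	return tables * chainLength
}
```

Under that assumption, `listCalls(3500, 5)` evaluates to 17500.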

Now:
- Fast path looks up name in the existing /tmp/.clickhouse-backup-metadata.cache.{kind}
  file; on hit, 0 S3 calls.
- On miss: one HEAD + one GET on name/metadata.json (no Walk), then merge
  into the on-disk cache and atomically rewrite it.
- metadataCacheLock removed; saveMetadataCache now writes via tempfile+rename
  so concurrent callers can't observe a torn file.
- prefetchBackupMetadataChain and ClearBackupListCache deleted.

Slow path (parseMetadataOnly="") unchanged, list_duration log preserved for
TestLongListRemote.

Verified: TestLongListRemote, TestS3, TestS3NoDeletePermission.
…kupList

When metadata.json is absent, walk the backup prefix to determine whether
the folder has content. If it does, return a broken backup entry with the
last modified time. If the prefix is empty, return an empty list as before.
@Slach Slach merged commit 662ae9f into Altinity:master May 20, 2026
53 of 54 checks passed
@Slach Slach added this to the 2.7.0 milestone May 23, 2026

Development

Successfully merging this pull request may close these issues.

Too many API calls when restoring a backup from a S3 increment chain

2 participants