kirillgarbar (Contributor) commented Dec 1, 2025

Summary by Sourcery

Split object storage space usage accounting by replica and adjust schema and queries accordingly.

New Features:

  • Report object storage space usage per replica instead of only global aggregates.

Enhancements:

  • Redesign local blobs and space usage tables to include replica identifiers and track state-specific sizes in a normalized form.
  • Update space usage population logic to generate per-replica metrics across ClickHouse versions while preserving orphaned size aggregation.
  • Adjust object storage space usage retrieval to return a nested mapping keyed by replica and state.

Tests:

  • Update object storage feature tests to validate the new per-replica space usage output format.

sourcery-ai bot commented Dec 1, 2025

Reviewer's Guide

Refactors object storage space usage accounting to store and return usage split by replica, adjusting table schemas, fill queries, and tests accordingly.

Sequence diagram for collecting and filling replica-split space usage

sequenceDiagram
    participant Caller
    participant _collect_space_usage
    participant _get_fill_space_usage_query
    participant ClickHouse

    Caller->>_collect_space_usage: invoke with ctx, tables, config
    _collect_space_usage->>_get_fill_space_usage_query: build space usage fill queries
    _get_fill_space_usage_query-->>_collect_space_usage: list[Query] space_usage_queries

    loop for each Query in space_usage_queries
        _collect_space_usage->>ClickHouse: execute_query(ctx, query, settings)
        ClickHouse-->>_collect_space_usage: execution result (ignored)
    end

    _collect_space_usage->>ClickHouse: TRUNCATE old space_usage_table
    ClickHouse-->>_collect_space_usage: ok
    _collect_space_usage->>ClickHouse: RENAME space_usage_table_new TO space_usage_table
    ClickHouse-->>_collect_space_usage: ok
    _collect_space_usage-->>Caller: return

Sequence diagram for reading replica-split space usage

sequenceDiagram
    participant Caller
    participant get_object_storage_space_usage
    participant ClickHouse

    Caller->>get_object_storage_space_usage: ctx, space_usage_table, scope, cluster_name
    get_object_storage_space_usage->>ClickHouse: SELECT toString(replica) replicas, state, size FROM space_usage_table
    ClickHouse-->>get_object_storage_space_usage: JSON rows[{replicas, state, size}]

    loop for each row in rows
        get_object_storage_space_usage->>get_object_storage_space_usage: ensure result[replicas] exists
        get_object_storage_space_usage->>get_object_storage_space_usage: result[replicas][state] = size
    end

    get_object_storage_space_usage-->>Caller: dict[replica][state] -> size
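The read loop in the diagram reduces to a small fold over the result rows. A minimal sketch (hypothetical helper; the real get_object_storage_space_usage also takes ctx, scope, and cluster_name):

```python
def build_space_usage_result(rows: list[dict]) -> dict:
    # Each row from ClickHouse looks like {"replicas": ..., "state": ..., "size": ...};
    # fold them into a nested mapping: replica -> state -> size.
    result: dict = {}
    for row in rows:
        result.setdefault(row["replicas"], {})[row["state"]] = row["size"]
    return result
```

For example, rows for one replica plus the 'all_replicas' orphaned bucket come back as `{"host1": {"active": 100, ...}, "all_replicas": {"orphaned": 5}}`.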

ER diagram for updated object storage space usage tables

erDiagram
    local_blobs_table {
        String replica
        String obj_path
        String state
        UInt64 obj_size
        UInt32 ref_count
    }

    orphaned_blobs_table {
        String obj_path
        UInt64 obj_size
    }

    space_usage_table {
        Array_String replica
        String state
        UInt64 size
    }

    local_blobs_table ||--o{ space_usage_table : aggregates_into
    orphaned_blobs_table ||--o{ space_usage_table : contributes_orphaned_size
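For reference, the space_usage_table row in the ER diagram maps naturally onto a small Python type (an illustrative model, not code from the PR):

```python
from dataclasses import dataclass

@dataclass
class SpaceUsageRow:
    # Mirrors (replica Array(String), state String, size UInt64).
    replica: list[str]  # e.g. ["host1"], or ["all_replicas"] for orphaned data
    state: str          # 'active', 'unique_frozen', 'unique_detached' or 'orphaned'
    size: int           # byte count; UInt64 in ClickHouse
```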

File-Level Changes

Change: Switch space usage aggregation to per-replica, returning a nested mapping of replica → state → size instead of flat totals.
  • Update get_object_storage_space_usage to select replica, state, and size from the space usage table, ordered by replica and state.
  • Replace the single-row JSON result handling with construction of a dict keyed by replica, whose inner dicts map state to size.
  • Adjust feature tests to assert the new response shape with replica keys (including an 'all_replicas' bucket for orphaned data).
Files: ch_tools/common/commands/object_storage.py, tests/features/object_storage.feature

Change: Redesign the space usage storage schema and fill logic to track replica and state explicitly rather than in aggregated columns.
  • Change the space_usage table schema from four numeric columns (active, unique_frozen, unique_detached, orphaned) to (replica Array(String), state String, size UInt64) with ORDER BY (replica, state).
  • Update _get_fill_space_usage_query to return a list of Query objects instead of a single Query, with different query shapes depending on the ClickHouse version.
  • For ClickHouse 25.3+, compute per-replica usage via a CTE that aggregates local_blobs by (replica, obj_path, state) and inserts rows for active, unique_frozen, and unique_detached plus an 'all_replicas' orphaned row; for older versions, split this into four INSERTs (active, unique_frozen, unique_detached, orphaned).
  • Update _collect_space_usage to iterate over the list of fill queries and execute each with the same timeout/settings.
Files: ch_tools/common/commands/object_storage.py

Change: Extend the local_blobs table and its writers to record replica information for each object.
  • Modify the _create_local_blobs_table schema to add a replica String column, switch the engine from SummingMergeTree to MergeTree, and change ORDER BY to include replica.
  • Update _insert_local_blobs_batch to insert a constant replica identifier ('missing_backups') as the first column and include replica in the INSERT column list.
  • Modify _get_fill_local_blobs_table_query to insert hostName() as the replica column when scanning disks into local_blobs.
Files: ch_tools/common/commands/object_storage.py

sourcery-ai bot left a comment

Hey there - I've reviewed your changes and found some issues that need to be addressed.

  • In _get_fill_space_usage_query the groupArraySorted(10)(replica) limit is a silent constraint on the number of replicas; consider either removing the hard-coded 10 or making it a named constant with a short comment so the behavior is explicit and maintainable.
  • The two branches of _get_fill_space_usage_query (25.3+ vs older) now duplicate a fair amount of logic for computing active, unique_frozen, and unique_detached; it may be worth factoring out the shared subqueries/structure to reduce the chance of the branches drifting out of sync.
  • In _create_space_usage_table, the ctx line has a stale inline comment # , unique_frozen UInt64, unique_detached UInt64, orphaned UInt64) which no longer matches the schema; consider cleaning this up to avoid confusion about the current table layout.
Individual Comments

Comment 1
Location: ch_tools/common/commands/object_storage.py:1205
Code context:

                UNION ALL
                SELECT replica, 'unique_detached', size FROM unique_detached
                UNION ALL
                SELECT ['all_replicas'], 'orphaned', toFloat64(sum(obj_size)) FROM {orphaned_blobs_table}
        """
        ]

Issue (bug_risk): Casting the orphaned size to Float64 conflicts with the UInt64 column type and may introduce precision or implicit-cast issues.

size in {space_usage_table_new} is UInt64, but this INSERT uses toFloat64(sum(obj_size)), forcing an implicit cast and potential precision/plan issues. Since obj_size is already UInt64, keep sum(obj_size) as UInt64 and remove toFloat64 here (and in the non-25.3 branch).
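The precision concern is easy to demonstrate: Float64 has a 53-bit mantissa, so integer byte counts above 2**53 (roughly 9 PB) are no longer exact. Python's float is an IEEE 754 double, the same representation as ClickHouse's Float64:

```python
# Float64 represents every integer up to 2**53 exactly; beyond that,
# odd values get rounded to a neighbouring even value.
EXACT_LIMIT = 2**53

assert float(EXACT_LIMIT) == EXACT_LIMIT              # still exact
assert float(EXACT_LIMIT + 1) == float(EXACT_LIMIT)   # +1 byte silently lost
assert int(float(EXACT_LIMIT + 1)) != EXACT_LIMIT + 1
```

Keeping `sum(obj_size)` as UInt64 avoids this rounding entirely.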

Comment 2
Location: ch_tools/common/commands/object_storage.py:1160
Code context:

            WITH
            aggregated_blobs AS (
                SELECT
                    groupArraySorted(10)(replica) as replica,
                    obj_path,
                    state,

Issue (bug_risk): Limiting groupArraySorted to 10 replicas can silently undercount space usage on clusters with more replicas.

Because only the first 10 replicas per obj_path+state are kept in the replica array, any additional replicas are dropped and never appear in the downstream per-replica aggregation. That will under-report usage on clusters with more than 10 replicas. If you need all replicas to be counted, remove the limit (groupArraySorted(replica)) or use another approach that preserves the full replica set.
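A small Python model of the capped aggregate makes the undercount concrete (`group_array_sorted` below is a hypothetical stand-in for ClickHouse's `groupArraySorted(N)(x)`):

```python
def group_array_sorted(cap: int, values: list[str]) -> list[str]:
    # Models ClickHouse groupArraySorted(N)(x): at most N values, kept sorted.
    return sorted(values)[:cap]

# A 12-replica cluster: two replicas never make it into the array, so their
# space usage never reaches the downstream per-replica aggregation.
replicas = [f"replica{i:02d}" for i in range(12)]
kept = group_array_sorted(10, replicas)
assert len(kept) == 10
assert "replica11" not in kept
```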



"""
batch_values = ",".join(
f"('{item.key if item.key_is_full else os.path.join(disk_conf.prefix, item.key)}','shadow',{item.size},1)"
f"('missing_backups','{item.key if item.key_is_full else os.path.join(disk_conf.prefix, item.key)}','shadow',{item.size},1)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic was here but let's make missing_backups, shadow as function parameter here because is is named generally _insert_local_blobs_batch
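The suggested parameterization might look like this sketch (hypothetical helper; the real _insert_local_blobs_batch also issues the INSERT and handles item.key_is_full, and items here are simplified to (key, size) pairs):

```python
import os

def build_batch_values(items, prefix: str,
                       replica: str = "missing_backups",
                       state: str = "shadow") -> str:
    # Build the VALUES payload with replica as the first column, matching the
    # new (replica, obj_path, state, obj_size, ref_count) column order.
    return ",".join(
        f"('{replica}','{os.path.join(prefix, key)}','{state}',{size},1)"
        for key, size in items
    )
```

Defaulting `replica` and `state` keeps the current call site unchanged while letting other callers reuse the batching logic.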
