Skip to content

Conversation

@dimbo4ka
Copy link

@dimbo4ka dimbo4ka commented Dec 3, 2025

To run the migration, execute:
chadmin dictionary migrate

Summary by Sourcery

Add a chadmin command to migrate ClickHouse external XML dictionaries into DDL dictionaries stored in a dedicated database, with parsing and conversion logic plus coverage tests.

New Features:

  • Introduce chadmin dictionary migrate command to convert XML-defined external dictionaries into DDL CREATE DICTIONARY objects in the _dictionaries database.
  • Add XML parsing utilities that generate CREATE DICTIONARY SQL from ClickHouse dictionary configuration files, including support for various key structures, layouts, lifetimes, and attribute defaults.

Enhancements:

  • Restrict ClickHouse config directory merging to XML and YAML files when loading configuration.
  • Expose dictionary value verification step in tests to assert migrated dictionaries return expected results.

Tests:

  • Add unit tests for converting XML dictionary definitions to DDL SQL, including multiple dictionaries and various null_value cases.
  • Add an integration feature test for the chadmin dictionary migrate command that validates migrated dictionaries against source table data.

@sourcery-ai
Copy link
Contributor

sourcery-ai bot commented Dec 3, 2025

Reviewer's Guide

Adds a new chadmin dictionary migrate command that scans ClickHouse XML dictionary configs, converts them into CREATE DICTIONARY DDL in the _dictionaries database, and verifies behavior via unit and integration tests; also tightens config loading to only merge XML/YAML includes.

Sequence diagram for chadmin dictionary migrate command

sequenceDiagram
    actor Admin
    participant CLI as chadmin_dictionary_migrate
    participant DictModule as dictionary_internal
    participant ConfigUtils as config_utils
    participant FS as Filesystem
    participant CH as ClickHouseServer

    Admin->>CLI: run chadmin dictionary migrate
    CLI->>DictModule: migrate_dictionaries(ctx)

    DictModule->>CH: execute_query CREATE DATABASE IF NOT EXISTS _dictionaries
    CH-->>DictModule: ok

    DictModule->>DictModule: config_glob_pattern = _get_dictionaries_config_pattern()
    DictModule->>ConfigUtils: _load_config(CLICKHOUSE_SERVER_CONFIG_PATH)
    ConfigUtils->>FS: read main config.xml
    FS-->>ConfigUtils: config.xml content
    ConfigUtils->>FS: list config.d directory
    FS-->>ConfigUtils: config.d entries
    ConfigUtils->>FS: read *.xml, *.yaml includes
    FS-->>ConfigUtils: include contents
    ConfigUtils-->>DictModule: merged config

    DictModule->>DictModule: extract <clickhouse><dictionaries_config>
    DictModule->>FS: glob(config_directory, config_glob_pattern)
    FS-->>DictModule: dictionary config files

    loop for each dictionary config file
        DictModule->>DictModule: generate_ddl_dictionary_from_xml(config_file)
        DictModule->>ConfigUtils: _load_config(config_file)
        ConfigUtils->>FS: read XML
        FS-->>ConfigUtils: parsed config dict
        ConfigUtils-->>DictModule: dictionary config

        DictModule->>DictModule: _build_dictionary_ddl_from_config(attrs)
        DictModule->>DictModule: _build_source_block / _build_layout_block
        DictModule->>DictModule: _build_lifetime_block
        DictModule->>DictModule: _build_structure_and_primary_key

        loop for each generated query
            DictModule->>CH: execute_query CREATE DICTIONARY _dictionaries.name ...
            CH-->>DictModule: ok or error
        end
    end

    DictModule-->>CLI: migration completed
    CLI-->>Admin: report success or error
Loading

File-Level Changes

Change Details Files
Introduce migration logic from XML dictionary configuration files to DDL statements and wire it into the chadmin CLI.
  • Add migrate_dictionaries to create _dictionaries database, discover dictionary config files from ClickHouse server config, and execute generated DDL with logging and error handling.
  • Implement generate_ddl_dictionary_from_xml and helpers to parse XML config into a structured dict, validate required sections, and build CREATE DICTIONARY statements including source, layout, lifetime, structure, and primary key.
  • Normalize null_value handling for dictionary attributes across multiple ClickHouse types and generate proper DEFAULT clauses in DDL.
  • Add _get_dictionaries_config_pattern to read <clickhouse><dictionaries_config> from main server config and use it as the glob pattern for dictionary XML files.
ch_tools/chadmin/internal/dictionary.py
Expose the migration as a CLI subcommand under chadmin dictionary migrate.
  • Register a new migrate command in the dictionary_group Click CLI group.
  • Delegate the CLI handler to migrate_dictionaries with a short command description.
ch_tools/chadmin/cli/dictionary_group.py
Restrict server config includes to XML/YAML files only when merging configs.
  • Update load_config to only merge included files whose names end with .xml or .yaml.
ch_tools/common/clickhouse/config/utils.py
Configure test ClickHouse to use dictionaries_config and add integration coverage for migration and dictionary lookups.
  • Extend test ClickHouse config to set <dictionaries_config>dictionaries/*.xml</dictionaries_config>.
  • Add a Behave step to assert dictionary values via dictGet after reloading dictionaries, comparing actual vs expected values per row.
  • Introduce an integration feature that writes a sample XML dictionary config into the container, restarts ClickHouse, runs chadmin dictionary migrate, and validates resulting _dictionaries.* dictionaries return expected values.
tests/images/clickhouse/config/config.xml
tests/steps/chadmin.py
tests/features/chadmin_dictionary_migration.feature
Add unit tests for XML-to-DDL parsing behavior, including null_value handling and multiple dictionaries per file.
  • Create parametrized tests that feed various XML snippets into generate_ddl_dictionary_from_xml and assert normalized SQL matches expected DDL for different lifetime, layout, key, and null_value combinations.
  • Add tests verifying multiple <dictionary> entries in one XML yield multiple CREATE DICTIONARY statements.
  • Introduce a helper normalize_sql to ignore formatting differences when comparing SQL strings.
tests/unit/chadmin/test_parse_xml_to_sql.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes - here's some feedback:

  • The _build_layout_block implementation looks incorrect: layout_type is always a string key so the if isinstance(layout_type, str) branch makes the rest of the function unreachable, and the later use of layout_params in the f-string is wrong (it prints the dict instead of the layout name); consider refactoring this to correctly handle layouts with/without parameters.
  • In _build_lifetime_block, the branch if max_val is None and not isinstance(min_val, (str, int)) returns LIFETIME(({min_val})), which both formats a non-scalar object into SQL and adds extra parentheses; it would be safer either to validate and error here or normalize min_val to the expected scalar string/int.
  • The dictionary migration CLI help text says "Migrate XML-dictionaries to DLL"; consider correcting the wording to "DDL" so the command description matches the behavior implemented in migrate_dictionaries.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `_build_layout_block` implementation looks incorrect: `layout_type` is always a string key so the `if isinstance(layout_type, str)` branch makes the rest of the function unreachable, and the later use of `layout_params` in the f-string is wrong (it prints the dict instead of the layout name); consider refactoring this to correctly handle layouts with/without parameters.
- In `_build_lifetime_block`, the branch `if max_val is None and not isinstance(min_val, (str, int))` returns `LIFETIME(({min_val}))`, which both formats a non-scalar object into SQL and adds extra parentheses; it would be safer either to validate and error here or normalize `min_val` to the expected scalar string/int.
- The dictionary migration CLI help text says "Migrate XML-dictionaries to DLL"; consider correcting the wording to "DDL" so the command description matches the behavior implemented in `migrate_dictionaries`.

## Individual Comments

### Comment 1
<location> `ch_tools/chadmin/internal/dictionary.py:176-185` </location>
<code_context>
+    layout = attrs.get("layout")
</code_context>

<issue_to_address>
**issue (bug_risk):** Layout parsing logic never handles parameters and appears to use the wrong variable in the final string.

Because `layout_type = list(layout.keys())[0]` is always a `str`, the `if isinstance(layout_type, str)` branch always runs and returns `LAYOUT(<TYPE>())`, so layouts with parameters (e.g. `<type ...>`) are never handled. In addition, in the parameterized case the final `return` currently uses `layout_params` instead of `layout_type`:

```python
return f"LAYOUT({layout_params}({str_params}))"
```
which will format the dict rather than the layout name.

You can simplify and fix both issues by validating the dict shape up front, unpacking via `layout_type, layout_params = next(iter(layout.items()))`, then:
- If `layout_params` is falsy, return `LAYOUT(<TYPE>())`.
- Otherwise, validate `layout_params` is a dict of `str` values, build `str_params = ", ".join(...)`, and return `LAYOUT(<LAYOUT_TYPE>({str_params}))` using the layout name, not the params dict.
</issue_to_address>

### Comment 2
<location> `ch_tools/chadmin/internal/dictionary.py:220-221` </location>
<code_context>
+            "At least one of <min> or <max> must be specified in <lifetime>"
+        )
+
+    if max_val is None and not isinstance(min_val, (str, int)):
+        return f"LIFETIME(({min_val}))"
+
+    if not isinstance(min_val, (str, int)) or not isinstance(max_val, (str, int)):
</code_context>

<issue_to_address>
**issue (bug_risk):** Lifetime block handling returns an invalid expression when `min` has an unexpected type.

In `_build_lifetime_block`, this branch:

```python
if max_val is None and not isinstance(min_val, (str, int)):
    return f"LIFETIME(({min_val}))"
```

will emit a `LIFETIME` clause with an invalid value instead of failing fast, and the double parentheses look syntactically wrong.

Consider validating and raising on bad input instead, e.g.:

```python
if max_val is None:
    if not isinstance(min_val, (str, int)):
        raise RuntimeError("<min> in <lifetime> must be a string or integer")
    return f"LIFETIME({min_val})"
```

while keeping the existing strict checks for the `min`/`max` pair.
</issue_to_address>

### Comment 3
<location> `ch_tools/chadmin/internal/dictionary.py:273-282` </location>
<code_context>
+    defaults = {
</code_context>

<issue_to_address>
**issue (bug_risk):** Default null value normalization can raise a KeyError for unsupported types.

In `_normalize_null_default_value`, when `null_value is None` and `attr_type` is missing from `defaults`, `defaults[attr_type]` will raise a `KeyError`, causing an unexpected hard failure for any unlisted type. Consider either using `defaults.get(attr_type, "NULL")` (or another appropriate fallback) or raising a more explicit error that clearly indicates the type is unsupported.
</issue_to_address>

### Comment 4
<location> `ch_tools/chadmin/cli/dictionary_group.py:64-66` </location>
<code_context>
+    """
+    Migrate XML-dictionaries to DLL.
+    """
</code_context>

<issue_to_address>
**nitpick (typo):** The command description says 'DLL' instead of 'DDL'.

This docstring should say "DDL" rather than "DLL" to match data definition language and avoid confusion in the CLI help output.

```suggestion
    """
    Migrate XML-dictionaries to DDL.
    """
```
</issue_to_address>

### Comment 5
<location> `tests/steps/chadmin.py:84` </location>
<code_context>
+@then('dictionary "{dict_name}" returns expected values on {node:w}')
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Assert on `SYSTEM RELOAD DICTIONARIES` exit code to avoid silently running checks against stale dictionaries

Right now the step ignores the exit code of `SYSTEM RELOAD DICTIONARIES`, so a failed reload could go unnoticed while subsequent lookups still run. Please capture the result of `container.exec_run('clickhouse-client -q "SYSTEM RELOAD DICTIONARIES"')` and assert `exit_code == 0` with a clear error message, similar to the existing assertions on `dictGet` queries.

```suggestion
    reload_result = container.exec_run('clickhouse-client -q "SYSTEM RELOAD DICTIONARIES"')
    assert (
        reload_result.exit_code == 0
    ), (
        f"SYSTEM RELOAD DICTIONARIES failed with exit code "
        f"{reload_result.exit_code}, output:\n{reload_result.output.decode().strip()}"
    )
```
</issue_to_address>

### Comment 6
<location> `tests/unit/chadmin/test_parse_xml_to_sql.py:9-18` </location>
<code_context>
+from ch_tools.chadmin.internal.dictionary import generate_ddl_dictionary_from_xml
+
+
+@pytest.fixture
+def test_data_dir(tmp_path: Path) -> Path:
+    data_dir = tmp_path / "test_dictionary.xml"
+    data_dir.mkdir()
+    return data_dir
+
+
+@pytest.fixture
+def create_xml_file(test_data_dir: Path) -> Callable[[Path, str], str]:
+    def _create(filename: Path, content: str) -> str:
+        filepath = test_data_dir / filename
+        filepath.write_text(content)
+        return str(filepath)
+
+    return _create
+
+
</code_context>

<issue_to_address>
**nitpick:** Align `create_xml_file` fixture type hints with how it is used in tests

The fixture is typed as `Callable[[Path, str], str]`, but `filename` is a `Path` in the implementation while all call sites pass a `str` (e.g. "test_dict.xml"). To match usage and improve tooling support, consider updating the type to `Callable[[str, str], str]` with `filename: str`, or use `Union[Path, str]` if both should be allowed.
</issue_to_address>

### Comment 7
<location> `tests/unit/chadmin/test_parse_xml_to_sql.py:284-293` </location>
<code_context>
+        ),
+    ],
+)
+def test_parse_xml_to_sql(
+    create_xml_file: Callable[[str, str], str], xml_content: str, expected_queries: str
+) -> None:
+    filepath = create_xml_file("test_dict.xml", xml_content)
+
+    result = generate_ddl_dictionary_from_xml(filepath)
+
+    assert len(result) == len(expected_queries)
+
+    for actual, expected in zip(result, expected_queries):
+        assert normalize_sql(actual) == normalize_sql(expected)
+
</code_context>

<issue_to_address>
**nitpick:** Fix type annotation of `expected_queries` parameter to reflect that it is a list of SQL strings

`expected_queries` is currently annotated as `str`, but the parametrization passes a list of SQL strings and the test iterates over it. Please change the type to `list[str]` (or `Sequence[str]`) so the signature matches actual usage and tools/readers aren’t misled.
</issue_to_address>

### Comment 8
<location> `tests/unit/chadmin/test_parse_xml_to_sql.py:26-35` </location>
<code_context>
+@pytest.mark.parametrize(
</code_context>

<issue_to_address>
**suggestion (testing):** Add negative tests for invalid or incomplete XML configurations to cover error paths in `generate_ddl_dictionary_from_xml`

`generate_ddl_dictionary_from_xml` has several branches that raise `RuntimeError` for malformed configs (e.g. missing `<dictionaries>` / `<dictionary>`, missing `<name>`, invalid `<source>` / `<structure>`, or both `<id>` and `<key>` present). Please add a parametrized test (e.g. `test_parse_xml_to_sql_invalid_configs`) that uses such broken XML snippets and asserts on the expected `RuntimeError` messages so these validation paths are covered and don’t regress.

Suggested implementation:

```python
@pytest.fixture
def create_xml_file(test_data_dir: Path) -> Callable[[Path, str], str]:
    def _create(filename: Path, content: str) -> str:
        filepath = test_data_dir / filename
        filepath.write_text(content)
        return str(filepath)

    return _create


@pytest.mark.parametrize(
    "xml_content,expected_error_substring",
    [
        pytest.param(
            # Missing <dictionaries> root element
            "<config></config>",
            "dictionaries",
            id="missing-dictionaries-root",
        ),
        pytest.param(
            # Missing <dictionary> element inside <dictionaries>
            "<dictionaries></dictionaries>",
            "dictionary",
            id="missing-dictionary",
        ),
        pytest.param(
            # Missing <name> element in <dictionary>
            """
<dictionaries>
  <dictionary>
    <source>
      <clickhouse>
        <db>db</db>
        <table>tbl</table>
      </clickhouse>
    </source>
    <structure>
      <attribute>
        <name>value</name>
        <type>String</type>
      </attribute>
    </structure>
  </dictionary>
</dictionaries>
""",
            "name",
            id="missing-name",
        ),
        pytest.param(
            # Missing <source>/<structure> configuration
            """
<dictionaries>
  <dictionary>
    <name>dict_without_source</name>
  </dictionary>
</dictionaries>
""",
            "source",
            id="missing-source-or-structure",
        ),
        pytest.param(
            # Both <id> and <key> present in <structure>
            """
<dictionaries>
  <dictionary>
    <name>dict_with_id_and_key</name>
    <source>
      <clickhouse>
        <db>db</db>
        <table>tbl</table>
      </clickhouse>
    </source>
    <structure>
      <id>
        <name>id</name>
        <type>UInt64</type>
      </id>
      <key>
        <attribute>
          <name>k</name>
          <type>UInt64</type>
        </attribute>
      </key>
      <attribute>
        <name>value</name>
        <type>String</type>
      </attribute>
    </structure>
  </dictionary>
</dictionaries>
""",
            "id and key",
            id="id-and-key-present",
        ),
    ],
)
def test_parse_xml_to_sql_invalid_configs(
    xml_content: str,
    expected_error_substring: str,
    create_xml_file: Callable[[Path, str], str],
) -> None:
    # Use the existing helper to create an XML file on disk
    xml_path = create_xml_file(Path("invalid_dictionary.xml"), xml_content)

    with pytest.raises(RuntimeError) as excinfo:
        generate_ddl_dictionary_from_xml(xml_path)

    # Assert that the raised error contains the expected validation message
    assert expected_error_substring in str(excinfo.value)


@pytest.mark.parametrize(

```

1. This edit assumes `generate_ddl_dictionary_from_xml` is already imported or defined in `tests/unit/chadmin/test_parse_xml_to_sql.py` (it likely is, since valid-config tests call it). If it is not, add the appropriate import at the top of this test file.
2. The `expected_error_substring` values (`"dictionaries"`, `"dictionary"`, `"name"`, `"source"`, `"id and key"`) should match substrings of the actual `RuntimeError` messages raised in `generate_ddl_dictionary_from_xml`. If the implementation uses different wording, adjust these substrings accordingly to match the real messages while keeping the tests focused on the specific validation branch they cover.
</issue_to_address>

### Comment 9
<location> `ch_tools/chadmin/internal/dictionary.py:203` </location>
<code_context>
    lifetime = attrs.get("lifetime", None)

</code_context>

<issue_to_address>
**suggestion (code-quality):** Replace `dict.get(x, None)` with `dict.get(x)` ([`remove-none-from-default-get`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/remove-none-from-default-get))

```suggestion
    lifetime = attrs.get("lifetime")
```

<br/><details><summary>Explanation</summary>When using a dictionary's `get` method you can specify a default to return if
the key is not found. This defaults to `None`, so it is unnecessary to specify
`None` if this is the required behaviour. Removing the unnecessary argument
makes the code slightly shorter and clearer.
</details>
</issue_to_address>

### Comment 10
<location> `ch_tools/chadmin/internal/dictionary.py:85` </location>
<code_context>
def generate_ddl_dictionary_from_xml(config_path: str) -> list[str]:
    """
    Parse XML dictionary config and generate CREATE DICTIONARY statements.
    """
    config: dict[str, Any] = _load_config(config_path)

    queries: list[str] = []
    dictionaries = config.get("dictionaries")

    if dictionaries is None:
        raise RuntimeError("Dictionary config is missing <dictionaries> section")

    dictionary_nodes = dictionaries.get("dictionary")
    if dictionary_nodes is None:
        raise RuntimeError(
            "Dictionary config is missing <dictionaries><dictionary> section"
        )

    if not isinstance(dictionary_nodes, list) or isinstance(dictionary_nodes, dict):
        dictionary_nodes = [dictionary_nodes]

    for attrs in dictionary_nodes:
        if isinstance(attrs, dict):
            queries.append(_build_dictionary_ddl_from_config(attrs))

    return queries

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Move assignment closer to its usage within a block ([`move-assign-in-block`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/move-assign-in-block/))
- Convert for loop into list comprehension ([`list-comprehension`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/list-comprehension/))
</issue_to_address>

### Comment 11
<location> `ch_tools/chadmin/internal/dictionary.py:138-140` </location>
<code_context>
def _build_dictionary_ddl_from_config(attrs: dict[str, Any]) -> str:
    name = _get_dictionary_name(attrs)
    source = _build_source_block(attrs)
    lifetime = _build_lifetime_block(attrs)
    layout = _build_layout_block(attrs)
    primary_key, structure = _build_structure_and_primary_key(attrs)

    query = _build_create_dictionary_statement(
        name, structure, primary_key, source, layout, lifetime, "", ""
    )

    return query

</code_context>

<issue_to_address>
**issue (code-quality):** Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))
</issue_to_address>

### Comment 12
<location> `ch_tools/chadmin/internal/dictionary.py:273` </location>
<code_context>
def _normalize_null_default_value(attr_type: str, null_value: str) -> str:
    defaults = {
        "Int8": "0",
        "Int16": "0",
        "Int32": "0",
        "Int64": "0",
        "Int128": "0",
        "Int256": "0",
        "UInt8": "0",
        "UInt16": "0",
        "UInt32": "0",
        "UInt64": "0",
        "UInt128": "0",
        "UInt256": "0",
        "Float32": "0",
        "Float64": "0",
        "Decimal": "0",
        "Decimal32": "0",
        "Decimal64": "0",
        "Decimal128": "0",
        "Decimal256": "0",
        "String": "''",
        "FixedString": "''",
        "Date": "1970-01-01",
        "Date32": "1900-01-01",
        "DateTime": "1970-01-01 00:00:00",
        "DateTime64": "1970-01-01 00:00:00.000",
        "Bool": "0",
        "UUID": "00000000-0000-0000-0000-000000000000",
        "IPv4": "0.0.0.0",
        "IPv6": "::",
        "Array": "[]",
        "Tuple": "()",
        "Map": "{}",
        "Nullable": "NULL",
        "Point": "(0, 0)",
        "Ring": "[]",
        "Polygon": "[]",
        "MultiPolygon": "[]",
    }
    if null_value is None:
        return defaults[attr_type]
    if attr_type in ("String", "FixedString"):
        return f"'{null_value}'"
    return null_value

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Move assignments closer to their usage ([`move-assign`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/move-assign/))
- Use set when checking membership of a collection of literals ([`collection-into-set`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/collection-into-set/))
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

def step_check_dict_values(context: ContextT, dict_name: str, node: str) -> None:
container = get_container(context, node)

container.exec_run('clickhouse-client -q "SYSTEM RELOAD DICTIONARIES"')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Assert on SYSTEM RELOAD DICTIONARIES exit code to avoid silently running checks against stale dictionaries

Right now the step ignores the exit code of SYSTEM RELOAD DICTIONARIES, so a failed reload could go unnoticed while subsequent lookups still run. Please capture the result of container.exec_run('clickhouse-client -q "SYSTEM RELOAD DICTIONARIES"') and assert exit_code == 0 with a clear error message, similar to the existing assertions on dictGet queries.

Suggested change
container.exec_run('clickhouse-client -q "SYSTEM RELOAD DICTIONARIES"')
reload_result = container.exec_run('clickhouse-client -q "SYSTEM RELOAD DICTIONARIES"')
assert (
reload_result.exit_code == 0
), (
f"SYSTEM RELOAD DICTIONARIES failed with exit code "
f"{reload_result.exit_code}, output:\n{reload_result.output.decode().strip()}"
)

Comment on lines 26 to 35
@pytest.mark.parametrize(
"xml_content,expected_queries",
[
pytest.param(
"""
<dictionaries>
<dictionary>
<name>test_dict1</name>
<source>
<clickhouse>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add negative tests for invalid or incomplete XML configurations to cover error paths in generate_ddl_dictionary_from_xml

generate_ddl_dictionary_from_xml has several branches that raise RuntimeError for malformed configs (e.g. missing <dictionaries> / <dictionary>, missing <name>, invalid <source> / <structure>, or both <id> and <key> present). Please add a parametrized test (e.g. test_parse_xml_to_sql_invalid_configs) that uses such broken XML snippets and asserts on the expected RuntimeError messages so these validation paths are covered and don’t regress.

Suggested implementation:

@pytest.fixture
def create_xml_file(test_data_dir: Path) -> Callable[[Path, str], str]:
    def _create(filename: Path, content: str) -> str:
        filepath = test_data_dir / filename
        filepath.write_text(content)
        return str(filepath)

    return _create


@pytest.mark.parametrize(
    "xml_content,expected_error_substring",
    [
        pytest.param(
            # Missing <dictionaries> root element
            "<config></config>",
            "dictionaries",
            id="missing-dictionaries-root",
        ),
        pytest.param(
            # Missing <dictionary> element inside <dictionaries>
            "<dictionaries></dictionaries>",
            "dictionary",
            id="missing-dictionary",
        ),
        pytest.param(
            # Missing <name> element in <dictionary>
            """
<dictionaries>
  <dictionary>
    <source>
      <clickhouse>
        <db>db</db>
        <table>tbl</table>
      </clickhouse>
    </source>
    <structure>
      <attribute>
        <name>value</name>
        <type>String</type>
      </attribute>
    </structure>
  </dictionary>
</dictionaries>
""",
            "name",
            id="missing-name",
        ),
        pytest.param(
            # Missing <source>/<structure> configuration
            """
<dictionaries>
  <dictionary>
    <name>dict_without_source</name>
  </dictionary>
</dictionaries>
""",
            "source",
            id="missing-source-or-structure",
        ),
        pytest.param(
            # Both <id> and <key> present in <structure>
            """
<dictionaries>
  <dictionary>
    <name>dict_with_id_and_key</name>
    <source>
      <clickhouse>
        <db>db</db>
        <table>tbl</table>
      </clickhouse>
    </source>
    <structure>
      <id>
        <name>id</name>
        <type>UInt64</type>
      </id>
      <key>
        <attribute>
          <name>k</name>
          <type>UInt64</type>
        </attribute>
      </key>
      <attribute>
        <name>value</name>
        <type>String</type>
      </attribute>
    </structure>
  </dictionary>
</dictionaries>
""",
            "id and key",
            id="id-and-key-present",
        ),
    ],
)
def test_parse_xml_to_sql_invalid_configs(
    xml_content: str,
    expected_error_substring: str,
    create_xml_file: Callable[[Path, str], str],
) -> None:
    # Use the existing helper to create an XML file on disk
    xml_path = create_xml_file(Path("invalid_dictionary.xml"), xml_content)

    with pytest.raises(RuntimeError) as excinfo:
        generate_ddl_dictionary_from_xml(xml_path)

    # Assert that the raised error contains the expected validation message
    assert expected_error_substring in str(excinfo.value)


@pytest.mark.parametrize(
  1. This edit assumes generate_ddl_dictionary_from_xml is already imported or defined in tests/unit/chadmin/test_parse_xml_to_sql.py (it likely is, since valid-config tests call it). If it is not, add the appropriate import at the top of this test file.
  2. The expected_error_substring values ("dictionaries", "dictionary", "name", "source", "id and key") should match substrings of the actual RuntimeError messages raised in generate_ddl_dictionary_from_xml. If the implementation uses different wording, adjust these substrings accordingly to match the real messages while keeping the tests focused on the specific validation branch they cover.

Comment on lines 59 to 62


@dictionary_group.command("migrate")
@pass_context
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we need parameters here:

  1. dry-run mode to check correctness of the generated sql
  2. Options that allows to migrage selectively eg with regex
  3. keep-going ? Not sure but for consistency with other commands looks useful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep-going looks bad for this command, it is better to fail as soon as possible before some dicts are converted, otherwise we will need to distinguish already migrated dictionaries from non-migrated ones on restart and make sure that you do not delete the file with the non-migrated dictionary

Migrate external dictionaries to DDL.
"""
logging.info("Creating '_dictionaries' database")
execute_query(ctx, "CREATE DATABASE IF NOT EXISTS _dictionaries", format_=None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets move the name of database into config

return queries


def _get_dictionaries_config_pattern() -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _get_dictionaries_config_pattern() -> str:
def _get_dictionaries_config_paths_pattern() -> str:

The best option I can think of, but the current name doesn't represent that we're getting the regexp for paths to XML files

And there are no function desc.

Comment on lines 62 to 65
config_path = Path(CLICKHOUSE_SERVER_CONFIG_PATH)
config_directory = config_path.parent
config_path_list = list(config_directory.glob(config_glob_pattern))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regexp in <dictionaries_config> also can represent abs paths.

config_path_list = list(config_directory.glob(config_glob_pattern))

logging.info("Starting external dictionaries migration")
for i, config_file in enumerate(config_path_list, start=1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO execute_tasks_in_parallel can fit here very well, free parallelization


def _get_dictionary_name(attrs: dict[str, Any]) -> str:
name = attrs.get("name")
if name is None or not isinstance(name, str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if name is None or not isinstance(name, str):
if not isinstance(name, str):

isinstance(None, str) is fine ihmo

return name


def _build_source_block(attrs: dict[str, Any]) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acutually lets add the xml snippets for such functions, because it is hard to understand without looking into the config example

)

if max_val is None:
if not isinstance(min_val, str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why str? The min/max for lifetime is integer value

Copy link
Author

@dimbo4ka dimbo4ka Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xmltodict parses all text as str


parts = [name, attr_type]

if "null_value" in attr:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if "null_value" in attr:
if 'null_value' not in attr:
raise ...

Null value is reqired

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think that it is not required

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parts = [name, attr_type]

if "null_value" in attr:
null_value = _normalize_null_default_value(attr_type, attr["null_value"])
Copy link
Contributor

@MikhailBurdukov MikhailBurdukov Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if in config the null value is specified as

<null_value />

Will it be equal to None?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will be {"null_value": None}

Comment on lines 36 to 38
@when("we execute command on {node:w}")
@then("we execute command on {node:w}")
def step_command(context: ContextT, node: str) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it is not used in tests. Given and when should also be enough in any case

Comment on lines +116 to +118
@when("we execute command on {node:w} expecting failure")
@then("we execute command on {node:w} expecting failure")
def step_execute_command_expecting_failure(context: ContextT, node: str) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is already implemented like this

When we try to execute command on clickhouse01
"""
....
"""
Then it fails with response contains
"""
.....
"""

Comment on lines +134 to +136
except Exception as e:
raise RuntimeError(f"Error while removing '{config_file}'") from e
logging.info("Removing external dictionaries completed successfully")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reload can be added here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also makes sense to add an option to decide if new SQL dictionaries should be loaded after the migration, in case lazy load is enabled

Comment on lines +348 to +350

attributes = structure.get("attribute", [])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line was moved lower, but old line is node deleted here

Comment on lines +208 to +209
config_path = CLICKHOUSE_SERVER_CONFIG_PATH
parsed_config = load_config_file(CLICKHOUSE_SERVER_CONFIG_PATH)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_clickhouse_config can be reused here

Comment on lines +43 to +50
for config_file in config_path_list:
queries = _generate_ddl_dictionaries_from_xml(str(config_file), target_database)
for dict_name, query in queries:
all_dictionaries.append((config_file, dict_name, query))

filtered_dictionaries = _filter_dictionaries(
all_dictionaries, include_pattern, exclude_pattern
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing filters to _generate_ddl_dictionaries_from_xml or filtering by file name before calling _generate_ddl_dictionaries_from_xml should be more optimal

Comment on lines +147 to +151
if include_pattern:
if not (
fnmatch(dict_name, include_pattern)
or fnmatch(config_file_str, include_pattern)
or fnmatch(config_file.name, include_pattern)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it can be simplified to matching config_file_str only. Whether to check only filename or abs path can be decided with a pattern itself

Comment on lines +168 to +171
def _get_dictionary_nodes_from_config(config: dict[str, Any]) -> list[dict[str, Any]]:
dictionary_node_list = []
if not isinstance(config, dict):
return dictionary_node_list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such checks look redundant, config is already declared as dict[str, Any]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants