Implement migration from XML dictionaries to DDL #426

dimbo4ka · 2025-12-03T10:49:20Z

To run the migration, execute:
chadmin dictionary migrate

Summary by Sourcery

Add a chadmin command to migrate ClickHouse external XML dictionaries into DDL dictionaries stored in a dedicated database, with parsing and conversion logic plus coverage tests.

New Features:

Introduce chadmin dictionary migrate command to convert XML-defined external dictionaries into DDL CREATE DICTIONARY objects in the _dictionaries database.
Add XML parsing utilities that generate CREATE DICTIONARY SQL from ClickHouse dictionary configuration files, including support for various key structures, layouts, lifetimes, and attribute defaults.

Enhancements:

Restrict ClickHouse config directory merging to XML and YAML files when loading configuration.
Expose dictionary value verification step in tests to assert migrated dictionaries return expected results.

Tests:

Add unit tests for converting XML dictionary definitions to DDL SQL, including multiple dictionaries and various null_value cases.
Add an integration feature test for the chadmin dictionary migrate command that validates migrated dictionaries against source table data.

sourcery-ai · 2025-12-03T10:49:27Z

Reviewer's Guide

Adds a new chadmin dictionary migrate command that scans ClickHouse XML dictionary configs, converts them into CREATE DICTIONARY DDL in the _dictionaries database, and verifies behavior via unit and integration tests; also tightens config loading to only merge XML/YAML includes.

Sequence diagram for chadmin dictionary migrate command

sequenceDiagram
    actor Admin
    participant CLI as chadmin_dictionary_migrate
    participant DictModule as dictionary_internal
    participant ConfigUtils as config_utils
    participant FS as Filesystem
    participant CH as ClickHouseServer

    Admin->>CLI: run chadmin dictionary migrate
    CLI->>DictModule: migrate_dictionaries(ctx)

    DictModule->>CH: execute_query CREATE DATABASE IF NOT EXISTS _dictionaries
    CH-->>DictModule: ok

    DictModule->>DictModule: config_glob_pattern = _get_dictionaries_config_pattern()
    DictModule->>ConfigUtils: _load_config(CLICKHOUSE_SERVER_CONFIG_PATH)
    ConfigUtils->>FS: read main config.xml
    FS-->>ConfigUtils: config.xml content
    ConfigUtils->>FS: list config.d directory
    FS-->>ConfigUtils: config.d entries
    ConfigUtils->>FS: read *.xml, *.yaml includes
    FS-->>ConfigUtils: include contents
    ConfigUtils-->>DictModule: merged config

    DictModule->>DictModule: extract <clickhouse><dictionaries_config>
    DictModule->>FS: glob(config_directory, config_glob_pattern)
    FS-->>DictModule: dictionary config files

    loop for each dictionary config file
        DictModule->>DictModule: generate_ddl_dictionary_from_xml(config_file)
        DictModule->>ConfigUtils: _load_config(config_file)
        ConfigUtils->>FS: read XML
        FS-->>ConfigUtils: parsed config dict
        ConfigUtils-->>DictModule: dictionary config

        DictModule->>DictModule: _build_dictionary_ddl_from_config(attrs)
        DictModule->>DictModule: _build_source_block / _build_layout_block
        DictModule->>DictModule: _build_lifetime_block
        DictModule->>DictModule: _build_structure_and_primary_key

        loop for each generated query
            DictModule->>CH: execute_query CREATE DICTIONARY _dictionaries.name ...
            CH-->>DictModule: ok or error
        end
    end

    DictModule-->>CLI: migration completed
    CLI-->>Admin: report success or error

File-Level Changes

Change	Details	Files
Introduce migration logic from XML dictionary configuration files to DDL statements and wire it into the chadmin CLI.	Add `migrate_dictionaries` to create `_dictionaries` database, discover dictionary config files from ClickHouse server config, and execute generated DDL with logging and error handling. Implement `generate_ddl_dictionary_from_xml` and helpers to parse XML config into a structured dict, validate required sections, and build CREATE DICTIONARY statements including source, layout, lifetime, structure, and primary key. Normalize `null_value` handling for dictionary attributes across multiple ClickHouse types and generate proper DEFAULT clauses in DDL. Add `_get_dictionaries_config_pattern` to read `<clickhouse><dictionaries_config>` from main server config and use it as the glob pattern for dictionary XML files.	`ch_tools/chadmin/internal/dictionary.py`
Expose the migration as a CLI subcommand under `chadmin dictionary migrate`.	Register a new `migrate` command in the `dictionary_group` Click CLI group. Delegate the CLI handler to `migrate_dictionaries` with a short command description.	`ch_tools/chadmin/cli/dictionary_group.py`
Restrict server config includes to XML/YAML files only when merging configs.	Update `load_config` to only merge included files whose names end with `.xml` or `.yaml`.	`ch_tools/common/clickhouse/config/utils.py`
Configure test ClickHouse to use dictionaries_config and add integration coverage for migration and dictionary lookups.	Extend test ClickHouse config to set `<dictionaries_config>dictionaries/.xml</dictionaries_config>`. Add a Behave step to assert dictionary values via `dictGet` after reloading dictionaries, comparing actual vs expected values per row. Introduce an integration feature that writes a sample XML dictionary config into the container, restarts ClickHouse, runs `chadmin dictionary migrate`, and validates resulting `_dictionaries.` dictionaries return expected values.	`tests/images/clickhouse/config/config.xml` `tests/steps/chadmin.py` `tests/features/chadmin_dictionary_migration.feature`
Add unit tests for XML-to-DDL parsing behavior, including null_value handling and multiple dictionaries per file.	Create parametrized tests that feed various XML snippets into `generate_ddl_dictionary_from_xml` and assert normalized SQL matches expected DDL for different lifetime, layout, key, and null_value combinations. Add tests verifying multiple `<dictionary>` entries in one XML yield multiple CREATE DICTIONARY statements. Introduce a helper `normalize_sql` to ignore formatting differences when comparing SQL strings.	`tests/unit/chadmin/test_parse_xml_to_sql.py`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

The _build_layout_block implementation looks incorrect: layout_type is always a string key so the if isinstance(layout_type, str) branch makes the rest of the function unreachable, and the later use of layout_params in the f-string is wrong (it prints the dict instead of the layout name); consider refactoring this to correctly handle layouts with/without parameters.
In _build_lifetime_block, the branch if max_val is None and not isinstance(min_val, (str, int)) returns LIFETIME(({min_val})), which both formats a non-scalar object into SQL and adds extra parentheses; it would be safer either to validate and error here or normalize min_val to the expected scalar string/int.
The dictionary migration CLI help text says "Migrate XML-dictionaries to DLL"; consider correcting the wording to "DDL" so the command description matches the behavior implemented in migrate_dictionaries.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The `_build_layout_block` implementation looks incorrect: `layout_type` is always a string key so the `if isinstance(layout_type, str)` branch makes the rest of the function unreachable, and the later use of `layout_params` in the f-string is wrong (it prints the dict instead of the layout name); consider refactoring this to correctly handle layouts with/without parameters.
- In `_build_lifetime_block`, the branch `if max_val is None and not isinstance(min_val, (str, int))` returns `LIFETIME(({min_val}))`, which both formats a non-scalar object into SQL and adds extra parentheses; it would be safer either to validate and error here or normalize `min_val` to the expected scalar string/int.
- The dictionary migration CLI help text says "Migrate XML-dictionaries to DLL"; consider correcting the wording to "DDL" so the command description matches the behavior implemented in `migrate_dictionaries`.

## Individual Comments

### Comment 1
<location> `ch_tools/chadmin/internal/dictionary.py:176-185` </location>
<code_context>
+    layout = attrs.get("layout")
</code_context>

<issue_to_address>
**issue (bug_risk):** Layout parsing logic never handles parameters and appears to use the wrong variable in the final string.

Because `layout_type = list(layout.keys())[0]` is always a `str`, the `if isinstance(layout_type, str)` branch always runs and returns `LAYOUT(<TYPE>())`, so layouts with parameters (e.g. `<type ...>`) are never handled. In addition, in the parameterized case the final `return` currently uses `layout_params` instead of `layout_type`:

```python
return f"LAYOUT({layout_params}({str_params}))"
```
which will format the dict rather than the layout name.

You can simplify and fix both issues by validating the dict shape up front, unpacking via `layout_type, layout_params = next(iter(layout.items()))`, then:
- If `layout_params` is falsy, return `LAYOUT(<TYPE>())`.
- Otherwise, validate `layout_params` is a dict of `str` values, build `str_params = ", ".join(...)`, and return `LAYOUT(<LAYOUT_TYPE>({str_params}))` using the layout name, not the params dict.
</issue_to_address>

### Comment 2
<location> `ch_tools/chadmin/internal/dictionary.py:220-221` </location>
<code_context>
+            "At least one of <min> or <max> must be specified in <lifetime>"
+        )
+
+    if max_val is None and not isinstance(min_val, (str, int)):
+        return f"LIFETIME(({min_val}))"
+
+    if not isinstance(min_val, (str, int)) or not isinstance(max_val, (str, int)):
</code_context>

<issue_to_address>
**issue (bug_risk):** Lifetime block handling returns an invalid expression when `min` has an unexpected type.

In `_build_lifetime_block`, this branch:

```python
if max_val is None and not isinstance(min_val, (str, int)):
    return f"LIFETIME(({min_val}))"
```

will emit a `LIFETIME` clause with an invalid value instead of failing fast, and the double parentheses look syntactically wrong.

Consider validating and raising on bad input instead, e.g.:

```python
if max_val is None:
    if not isinstance(min_val, (str, int)):
        raise RuntimeError("<min> in <lifetime> must be a string or integer")
    return f"LIFETIME({min_val})"
```

while keeping the existing strict checks for the `min`/`max` pair.
</issue_to_address>

### Comment 3
<location> `ch_tools/chadmin/internal/dictionary.py:273-282` </location>
<code_context>
+    defaults = {
</code_context>

<issue_to_address>
**issue (bug_risk):** Default null value normalization can raise a KeyError for unsupported types.

In `_normalize_null_default_value`, when `null_value is None` and `attr_type` is missing from `defaults`, `defaults[attr_type]` will raise a `KeyError`, causing an unexpected hard failure for any unlisted type. Consider either using `defaults.get(attr_type, "NULL")` (or another appropriate fallback) or raising a more explicit error that clearly indicates the type is unsupported.
</issue_to_address>

### Comment 4
<location> `ch_tools/chadmin/cli/dictionary_group.py:64-66` </location>
<code_context>
+    """
+    Migrate XML-dictionaries to DLL.
+    """
</code_context>

<issue_to_address>
**nitpick (typo):** The command description says 'DLL' instead of 'DDL'.

This docstring should say "DDL" rather than "DLL" to match data definition language and avoid confusion in the CLI help output.

```suggestion
    """
    Migrate XML-dictionaries to DDL.
    """
```
</issue_to_address>

### Comment 5
<location> `tests/steps/chadmin.py:84` </location>
<code_context>
+@then('dictionary "{dict_name}" returns expected values on {node:w}')
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Assert on `SYSTEM RELOAD DICTIONARIES` exit code to avoid silently running checks against stale dictionaries

Right now the step ignores the exit code of `SYSTEM RELOAD DICTIONARIES`, so a failed reload could go unnoticed while subsequent lookups still run. Please capture the result of `container.exec_run('clickhouse-client -q "SYSTEM RELOAD DICTIONARIES"')` and assert `exit_code == 0` with a clear error message, similar to the existing assertions on `dictGet` queries.

```suggestion
    reload_result = container.exec_run('clickhouse-client -q "SYSTEM RELOAD DICTIONARIES"')
    assert (
        reload_result.exit_code == 0
    ), (
        f"SYSTEM RELOAD DICTIONARIES failed with exit code "
        f"{reload_result.exit_code}, output:\n{reload_result.output.decode().strip()}"
    )
```
</issue_to_address>

### Comment 6
<location> `tests/unit/chadmin/test_parse_xml_to_sql.py:9-18` </location>
<code_context>
+from ch_tools.chadmin.internal.dictionary import generate_ddl_dictionary_from_xml
+
+
+@pytest.fixture
+def test_data_dir(tmp_path: Path) -> Path:
+    data_dir = tmp_path / "test_dictionary.xml"
+    data_dir.mkdir()
+    return data_dir
+
+
+@pytest.fixture
+def create_xml_file(test_data_dir: Path) -> Callable[[Path, str], str]:
+    def _create(filename: Path, content: str) -> str:
+        filepath = test_data_dir / filename
+        filepath.write_text(content)
+        return str(filepath)
+
+    return _create
+
+
</code_context>

<issue_to_address>
**nitpick:** Align `create_xml_file` fixture type hints with how it is used in tests

The fixture is typed as `Callable[[Path, str], str]`, but `filename` is a `Path` in the implementation while all call sites pass a `str` (e.g. "test_dict.xml"). To match usage and improve tooling support, consider updating the type to `Callable[[str, str], str]` with `filename: str`, or use `Union[Path, str]` if both should be allowed.
</issue_to_address>

### Comment 7
<location> `tests/unit/chadmin/test_parse_xml_to_sql.py:284-293` </location>
<code_context>
+        ),
+    ],
+)
+def test_parse_xml_to_sql(
+    create_xml_file: Callable[[str, str], str], xml_content: str, expected_queries: str
+) -> None:
+    filepath = create_xml_file("test_dict.xml", xml_content)
+
+    result = generate_ddl_dictionary_from_xml(filepath)
+
+    assert len(result) == len(expected_queries)
+
+    for actual, expected in zip(result, expected_queries):
+        assert normalize_sql(actual) == normalize_sql(expected)
+
</code_context>

<issue_to_address>
**nitpick:** Fix type annotation of `expected_queries` parameter to reflect that it is a list of SQL strings

`expected_queries` is currently annotated as `str`, but the parametrization passes a list of SQL strings and the test iterates over it. Please change the type to `list[str]` (or `Sequence[str]`) so the signature matches actual usage and tools/readers aren’t misled.
</issue_to_address>

### Comment 8
<location> `tests/unit/chadmin/test_parse_xml_to_sql.py:26-35` </location>
<code_context>
+@pytest.mark.parametrize(
</code_context>

<issue_to_address>
**suggestion (testing):** Add negative tests for invalid or incomplete XML configurations to cover error paths in `generate_ddl_dictionary_from_xml`

`generate_ddl_dictionary_from_xml` has several branches that raise `RuntimeError` for malformed configs (e.g. missing `<dictionaries>` / `<dictionary>`, missing `<name>`, invalid `<source>` / `<structure>`, or both `<id>` and `<key>` present). Please add a parametrized test (e.g. `test_parse_xml_to_sql_invalid_configs`) that uses such broken XML snippets and asserts on the expected `RuntimeError` messages so these validation paths are covered and don’t regress.

Suggested implementation:

```python
@pytest.fixture
def create_xml_file(test_data_dir: Path) -> Callable[[Path, str], str]:
    def _create(filename: Path, content: str) -> str:
        filepath = test_data_dir / filename
        filepath.write_text(content)
        return str(filepath)

    return _create


@pytest.mark.parametrize(
    "xml_content,expected_error_substring",
    [
        pytest.param(
            # Missing <dictionaries> root element
            "<config></config>",
            "dictionaries",
            id="missing-dictionaries-root",
        ),
        pytest.param(
            # Missing <dictionary> element inside <dictionaries>
            "<dictionaries></dictionaries>",
            "dictionary",
            id="missing-dictionary",
        ),
        pytest.param(
            # Missing <name> element in <dictionary>
            """
<dictionaries>
  <dictionary>
    <source>
      <clickhouse>
        <db>db</db>
        <table>tbl</table>
      </clickhouse>
    </source>
    <structure>
      <attribute>
        <name>value</name>
        <type>String</type>
      </attribute>
    </structure>
  </dictionary>
</dictionaries>
""",
            "name",
            id="missing-name",
        ),
        pytest.param(
            # Missing <source>/<structure> configuration
            """
<dictionaries>
  <dictionary>
    <name>dict_without_source</name>
  </dictionary>
</dictionaries>
""",
            "source",
            id="missing-source-or-structure",
        ),
        pytest.param(
            # Both <id> and <key> present in <structure>
            """
<dictionaries>
  <dictionary>
    <name>dict_with_id_and_key</name>
    <source>
      <clickhouse>
        <db>db</db>
        <table>tbl</table>
      </clickhouse>
    </source>
    <structure>
      <id>
        <name>id</name>
        <type>UInt64</type>
      </id>
      <key>
        <attribute>
          <name>k</name>
          <type>UInt64</type>
        </attribute>
      </key>
      <attribute>
        <name>value</name>
        <type>String</type>
      </attribute>
    </structure>
  </dictionary>
</dictionaries>
""",
            "id and key",
            id="id-and-key-present",
        ),
    ],
)
def test_parse_xml_to_sql_invalid_configs(
    xml_content: str,
    expected_error_substring: str,
    create_xml_file: Callable[[Path, str], str],
) -> None:
    # Use the existing helper to create an XML file on disk
    xml_path = create_xml_file(Path("invalid_dictionary.xml"), xml_content)

    with pytest.raises(RuntimeError) as excinfo:
        generate_ddl_dictionary_from_xml(xml_path)

    # Assert that the raised error contains the expected validation message
    assert expected_error_substring in str(excinfo.value)


@pytest.mark.parametrize(

```

1. This edit assumes `generate_ddl_dictionary_from_xml` is already imported or defined in `tests/unit/chadmin/test_parse_xml_to_sql.py` (it likely is, since valid-config tests call it). If it is not, add the appropriate import at the top of this test file.
2. The `expected_error_substring` values (`"dictionaries"`, `"dictionary"`, `"name"`, `"source"`, `"id and key"`) should match substrings of the actual `RuntimeError` messages raised in `generate_ddl_dictionary_from_xml`. If the implementation uses different wording, adjust these substrings accordingly to match the real messages while keeping the tests focused on the specific validation branch they cover.
</issue_to_address>

### Comment 9
<location> `ch_tools/chadmin/internal/dictionary.py:203` </location>
<code_context>
    lifetime = attrs.get("lifetime", None)

</code_context>

<issue_to_address>
**suggestion (code-quality):** Replace `dict.get(x, None)` with `dict.get(x)` ([`remove-none-from-default-get`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/remove-none-from-default-get))

```suggestion
    lifetime = attrs.get("lifetime")
```

<br/><details><summary>Explanation</summary>When using a dictionary's `get` method you can specify a default to return if
the key is not found. This defaults to `None`, so it is unnecessary to specify
`None` if this is the required behaviour. Removing the unnecessary argument
makes the code slightly shorter and clearer.
</details>
</issue_to_address>

### Comment 10
<location> `ch_tools/chadmin/internal/dictionary.py:85` </location>
<code_context>
def generate_ddl_dictionary_from_xml(config_path: str) -> list[str]:
    """
    Parse XML dictionary config and generate CREATE DICTIONARY statements.
    """
    config: dict[str, Any] = _load_config(config_path)

    queries: list[str] = []
    dictionaries = config.get("dictionaries")

    if dictionaries is None:
        raise RuntimeError("Dictionary config is missing <dictionaries> section")

    dictionary_nodes = dictionaries.get("dictionary")
    if dictionary_nodes is None:
        raise RuntimeError(
            "Dictionary config is missing <dictionaries><dictionary> section"
        )

    if not isinstance(dictionary_nodes, list) or isinstance(dictionary_nodes, dict):
        dictionary_nodes = [dictionary_nodes]

    for attrs in dictionary_nodes:
        if isinstance(attrs, dict):
            queries.append(_build_dictionary_ddl_from_config(attrs))

    return queries

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Move assignment closer to its usage within a block ([`move-assign-in-block`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/move-assign-in-block/))
- Convert for loop into list comprehension ([`list-comprehension`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/list-comprehension/))
</issue_to_address>

### Comment 11
<location> `ch_tools/chadmin/internal/dictionary.py:138-140` </location>
<code_context>
def _build_dictionary_ddl_from_config(attrs: dict[str, Any]) -> str:
    name = _get_dictionary_name(attrs)
    source = _build_source_block(attrs)
    lifetime = _build_lifetime_block(attrs)
    layout = _build_layout_block(attrs)
    primary_key, structure = _build_structure_and_primary_key(attrs)

    query = _build_create_dictionary_statement(
        name, structure, primary_key, source, layout, lifetime, "", ""
    )

    return query

</code_context>

<issue_to_address>
**issue (code-quality):** Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))
</issue_to_address>

### Comment 12
<location> `ch_tools/chadmin/internal/dictionary.py:273` </location>
<code_context>
def _normalize_null_default_value(attr_type: str, null_value: str) -> str:
    defaults = {
        "Int8": "0",
        "Int16": "0",
        "Int32": "0",
        "Int64": "0",
        "Int128": "0",
        "Int256": "0",
        "UInt8": "0",
        "UInt16": "0",
        "UInt32": "0",
        "UInt64": "0",
        "UInt128": "0",
        "UInt256": "0",
        "Float32": "0",
        "Float64": "0",
        "Decimal": "0",
        "Decimal32": "0",
        "Decimal64": "0",
        "Decimal128": "0",
        "Decimal256": "0",
        "String": "''",
        "FixedString": "''",
        "Date": "1970-01-01",
        "Date32": "1900-01-01",
        "DateTime": "1970-01-01 00:00:00",
        "DateTime64": "1970-01-01 00:00:00.000",
        "Bool": "0",
        "UUID": "00000000-0000-0000-0000-000000000000",
        "IPv4": "0.0.0.0",
        "IPv6": "::",
        "Array": "[]",
        "Tuple": "()",
        "Map": "{}",
        "Nullable": "NULL",
        "Point": "(0, 0)",
        "Ring": "[]",
        "Polygon": "[]",
        "MultiPolygon": "[]",
    }
    if null_value is None:
        return defaults[attr_type]
    if attr_type in ("String", "FixedString"):
        return f"'{null_value}'"
    return null_value

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Move assignments closer to their usage ([`move-assign`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/move-assign/))
- Use set when checking membership of a collection of literals ([`collection-into-set`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/collection-into-set/))
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

ch_tools/chadmin/internal/dictionary.py

ch_tools/chadmin/cli/dictionary_group.py

sourcery-ai · 2025-12-03T10:51:16Z

tests/steps/chadmin.py

+def step_check_dict_values(context: ContextT, dict_name: str, node: str) -> None:
+    container = get_container(context, node)
+
+    container.exec_run('clickhouse-client -q "SYSTEM RELOAD DICTIONARIES"')


suggestion (bug_risk): Assert on SYSTEM RELOAD DICTIONARIES exit code to avoid silently running checks against stale dictionaries

Right now the step ignores the exit code of SYSTEM RELOAD DICTIONARIES, so a failed reload could go unnoticed while subsequent lookups still run. Please capture the result of container.exec_run('clickhouse-client -q "SYSTEM RELOAD DICTIONARIES"') and assert exit_code == 0 with a clear error message, similar to the existing assertions on dictGet queries.

Suggested change

container.exec_run('clickhouse-client -q "SYSTEM RELOAD DICTIONARIES"')

reload_result = container.exec_run('clickhouse-client -q "SYSTEM RELOAD DICTIONARIES"')

assert (

reload_result.exit_code == 0

), (

f"SYSTEM RELOAD DICTIONARIES failed with exit code "

f"{reload_result.exit_code}, output:\n{reload_result.output.decode().strip()}"

)

sourcery-ai · 2025-12-03T10:51:16Z

tests/unit/chadmin/test_parse_xml_to_sql.py

+@pytest.mark.parametrize(
+    "xml_content,expected_queries",
+    [
+        pytest.param(
+            """
+<dictionaries>
+  <dictionary>
+    <name>test_dict1</name>
+    <source>
+      <clickhouse>


suggestion (testing): Add negative tests for invalid or incomplete XML configurations to cover error paths in generate_ddl_dictionary_from_xml

generate_ddl_dictionary_from_xml has several branches that raise RuntimeError for malformed configs (e.g. missing <dictionaries> / <dictionary>, missing <name>, invalid <source> / <structure>, or both <id> and <key> present). Please add a parametrized test (e.g. test_parse_xml_to_sql_invalid_configs) that uses such broken XML snippets and asserts on the expected RuntimeError messages so these validation paths are covered and don’t regress.

Suggested implementation:

@pytest.fixture def create_xml_file(test_data_dir: Path) -> Callable[[Path, str], str]: def _create(filename: Path, content: str) -> str: filepath = test_data_dir / filename filepath.write_text(content) return str(filepath) return _create @pytest.mark.parametrize( "xml_content,expected_error_substring", [ pytest.param( # Missing <dictionaries> root element "<config></config>", "dictionaries", id="missing-dictionaries-root", ), pytest.param( # Missing <dictionary> element inside <dictionaries> "<dictionaries></dictionaries>", "dictionary", id="missing-dictionary", ), pytest.param( # Missing <name> element in <dictionary> """ <dictionaries> <dictionary> <source> <clickhouse> <db>db</db> <table>tbl</table> </clickhouse> </source> <structure> <attribute> <name>value</name> <type>String</type> </attribute> </structure> </dictionary> </dictionaries> """, "name", id="missing-name", ), pytest.param( # Missing <source>/<structure> configuration """ <dictionaries> <dictionary> <name>dict_without_source</name> </dictionary> </dictionaries> """, "source", id="missing-source-or-structure", ), pytest.param( # Both <id> and <key> present in <structure> """ <dictionaries> <dictionary> <name>dict_with_id_and_key</name> <source> <clickhouse> <db>db</db> <table>tbl</table> </clickhouse> </source> <structure> <id> <name>id</name> <type>UInt64</type> </id> <key> <attribute> <name>k</name> <type>UInt64</type> </attribute> </key> <attribute> <name>value</name> <type>String</type> </attribute> </structure> </dictionary> </dictionaries> """, "id and key", id="id-and-key-present", ), ], ) def test_parse_xml_to_sql_invalid_configs( xml_content: str, expected_error_substring: str, create_xml_file: Callable[[Path, str], str], ) -> None: # Use the existing helper to create an XML file on disk xml_path = create_xml_file(Path("invalid_dictionary.xml"), xml_content) with pytest.raises(RuntimeError) as excinfo: generate_ddl_dictionary_from_xml(xml_path) # Assert that the raised error contains the expected validation message assert expected_error_substring in str(excinfo.value) @pytest.mark.parametrize(

This edit assumes generate_ddl_dictionary_from_xml is already imported or defined in tests/unit/chadmin/test_parse_xml_to_sql.py (it likely is, since valid-config tests call it). If it is not, add the appropriate import at the top of this test file.

The expected_error_substring values ("dictionaries", "dictionary", "name", "source", "id and key") should match substrings of the actual RuntimeError messages raised in generate_ddl_dictionary_from_xml. If the implementation uses different wording, adjust these substrings accordingly to match the real messages while keeping the tests focused on the specific validation branch they cover.

ch_tools/chadmin/internal/dictionary.py

ch_tools/chadmin/cli/dictionary_group.py

ch_tools/chadmin/internal/dictionary.py

MikhailBurdukov · 2025-12-04T13:21:42Z

ch_tools/chadmin/cli/dictionary_group.py

+
+
+@dictionary_group.command("migrate")
+@pass_context


IMO we need parameters here:

dry-run mode to check correctness of the generated sql

Options that allows to migrage selectively eg with regex

keep-going ? Not sure but for consistency with other commands looks useful.

keep-going looks bad for this command, it is better to fail as soon as possible before some dicts are converted, otherwise we will need to distinguish already migrated dictionaries from non-migrated ones on restart and make sure that you do not delete the file with the non-migrated dictionary

MikhailBurdukov · 2025-12-04T13:25:10Z

ch_tools/chadmin/internal/dictionary.py

+    Migrate external dictionaries to DDL.
+    """
+    logging.info("Creating '_dictionaries' database")
+    execute_query(ctx, "CREATE DATABASE IF NOT EXISTS _dictionaries", format_=None)


Lets move the name of database into config

MikhailBurdukov · 2025-12-04T13:29:29Z

ch_tools/chadmin/internal/dictionary.py

+    return queries
+
+
+def _get_dictionaries_config_pattern() -> str:


Suggested change

def _get_dictionaries_config_pattern() -> str:

def _get_dictionaries_config_paths_pattern() -> str:

The best option I can think of, but the current name doesn't represent that we're getting the regexp for paths to XML files

And there are no function desc.

MikhailBurdukov · 2025-12-04T13:34:57Z

ch_tools/chadmin/internal/dictionary.py

+    config_path = Path(CLICKHOUSE_SERVER_CONFIG_PATH)
+    config_directory = config_path.parent
+    config_path_list = list(config_directory.glob(config_glob_pattern))
+


The regexp in <dictionaries_config> also can represent abs paths.

MikhailBurdukov · 2025-12-04T13:37:21Z

ch_tools/chadmin/internal/dictionary.py

+    config_path_list = list(config_directory.glob(config_glob_pattern))
+
+    logging.info("Starting external dictionaries migration")
+    for i, config_file in enumerate(config_path_list, start=1):


IMO execute_tasks_in_parallel can fit here very well, free parallelization

MikhailBurdukov · 2025-12-04T14:10:04Z

ch_tools/chadmin/internal/dictionary.py

+
+def _get_dictionary_name(attrs: dict[str, Any]) -> str:
+    name = attrs.get("name")
+    if name is None or not isinstance(name, str):


Suggested change

if name is None or not isinstance(name, str):

if not isinstance(name, str):

isinstance(None, str) is fine ihmo

MikhailBurdukov · 2025-12-04T14:11:13Z

ch_tools/chadmin/internal/dictionary.py

+    return name
+
+
+def _build_source_block(attrs: dict[str, Any]) -> str:


Acutually lets add the xml snippets for such functions, because it is hard to understand without looking into the config example

MikhailBurdukov · 2025-12-04T14:18:22Z

ch_tools/chadmin/internal/dictionary.py

+        )
+
+    if max_val is None:
+        if not isinstance(min_val, str):


Why str? The min/max for lifetime is integer value

xmltodict parses all text as str

MikhailBurdukov · 2025-12-04T14:51:11Z

ch_tools/chadmin/internal/dictionary.py

+
+    parts = [name, attr_type]
+
+    if "null_value" in attr:


Suggested change

if "null_value" in attr:

if 'null_value' not in attr:

raise ...

Null value is reqired

i think that it is not required

https://clickhouse.com/docs/sql-reference/dictionaries#attributes

According to the doc: it is required.

MikhailBurdukov · 2025-12-04T14:52:48Z

ch_tools/chadmin/internal/dictionary.py

+    parts = [name, attr_type]
+
+    if "null_value" in attr:
+        null_value = _normalize_null_default_value(attr_type, attr["null_value"])


What if in config the null value is specified as

<null_value />

Will it be equal to None?

will be {"null_value": None}

… extend functionality

kirillgarbar · 2025-12-18T13:05:17Z

tests/steps/common.py

 @when("we execute command on {node:w}")
+@then("we execute command on {node:w}")
 def step_command(context: ContextT, node: str) -> None:


Looks like it is not used in tests. Given and when should also be enough in any case

kirillgarbar · 2025-12-18T13:07:36Z

tests/steps/chadmin.py

+@when("we execute command on {node:w} expecting failure")
+@then("we execute command on {node:w} expecting failure")
+def step_execute_command_expecting_failure(context: ContextT, node: str) -> None:


This logic is already implemented like this

When we try to execute command on clickhouse01 """ .... """ Then it fails with response contains """ ..... """

kirillgarbar · 2025-12-18T13:22:02Z

ch_tools/chadmin/internal/dictionary_migration.py

+        except Exception as e:
+            raise RuntimeError(f"Error while removing '{config_file}'") from e
+    logging.info("Removing external dictionaries completed successfully")


Reload can be added here

It also makes sense to add an option to decide if new SQL dictionaries should be loaded after the migration, in case lazy load is enabled

kirillgarbar · 2025-12-18T13:27:56Z

ch_tools/chadmin/internal/dictionary_migration.py

+
+    attributes = structure.get("attribute", [])
+


This line was moved lower, but old line is node deleted here

kirillgarbar · 2025-12-18T13:41:32Z

ch_tools/chadmin/internal/dictionary_migration.py

+    config_path = CLICKHOUSE_SERVER_CONFIG_PATH
+    parsed_config = load_config_file(CLICKHOUSE_SERVER_CONFIG_PATH)


get_clickhouse_config can be reused here

kirillgarbar · 2025-12-18T16:32:07Z

ch_tools/chadmin/internal/dictionary_migration.py

+    for config_file in config_path_list:
+        queries = _generate_ddl_dictionaries_from_xml(str(config_file), target_database)
+        for dict_name, query in queries:
+            all_dictionaries.append((config_file, dict_name, query))
+
+    filtered_dictionaries = _filter_dictionaries(
+        all_dictionaries, include_pattern, exclude_pattern
+    )


Passing filters to _generate_ddl_dictionaries_from_xml or filtering by file name before calling _generate_ddl_dictionaries_from_xml should be more optimal

kirillgarbar · 2025-12-18T16:38:52Z

ch_tools/chadmin/internal/dictionary_migration.py

+        if include_pattern:
+            if not (
+                fnmatch(dict_name, include_pattern)
+                or fnmatch(config_file_str, include_pattern)
+                or fnmatch(config_file.name, include_pattern)


I guess it can be simplified to matching config_file_str only. Whether to check only filename or abs path can be decided with a pattern itself

kirillgarbar · 2025-12-18T16:44:43Z

ch_tools/chadmin/internal/dictionary_migration.py

+def _get_dictionary_nodes_from_config(config: dict[str, Any]) -> list[dict[str, Any]]:
+    dictionary_node_list = []
+    if not isinstance(config, dict):
+        return dictionary_node_list


Such checks look redundant, config is already declared as dict[str, Any]

dimbo4ka added 4 commits December 3, 2025 10:26

feat: implement external dictionary migration

def3aa5

fix: skip non-config files when loading config.d directory

e4022fa

test: add unit tests for dictionary migration

eee17f3

test: add integration tests for dictionary migration

29a0a85

sourcery-ai bot reviewed Dec 3, 2025

View reviewed changes

dimbo4ka added 3 commits December 3, 2025 12:17

test: add negative unit tests

0b48b49

chore: apply minor fixes from code review

05a99e8

feat: delete migrated dictionary config files after migration

99b276e

kirillgarbar requested changes Dec 4, 2025

View reviewed changes

MikhailBurdukov reviewed Dec 4, 2025

View reviewed changes

dimbo4ka added 9 commits December 15, 2025 11:42

feat: add new flags for migration

d5cf46e

refactor: extract dictionary migration logic into separate module and…

9312cbc

… extend functionality

test: fix unit tests

13e91e2

test: extend integration tests

0f0171a

refactor: rename unit-test names to kebab-case

423804a

fix: correct empty clause parsing

ee4bd8b

fix: correct <attribute> <expression> parsing

4a345cd

fix: correct attribute parsing

0cf8968

fix: correct null_value parsing

f9ac4cb

kirillgarbar requested changes Dec 19, 2025

View reviewed changes

		return queries


		def _get_dictionaries_config_pattern() -> str:

	def _get_dictionaries_config_pattern() -> str:
	def _get_dictionaries_config_paths_pattern() -> str:

	if name is None or not isinstance(name, str):
	if not isinstance(name, str):

		return name


		def _build_source_block(attrs: dict[str, Any]) -> str:

	if "null_value" in attr:
	if 'null_value' not in attr:
	raise ...

		config_path = CLICKHOUSE_SERVER_CONFIG_PATH
		parsed_config = load_config_file(CLICKHOUSE_SERVER_CONFIG_PATH)



		@dictionary_group.command("migrate")
		@pass_context

Implement migration from XML dictionaries to DDL #426

Are you sure you want to change the base?

Implement migration from XML dictionaries to DDL #426

Uh oh!

Conversation

dimbo4ka commented Dec 3, 2025 • edited by sourcery-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Uh oh!

sourcery-ai bot commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Sequence diagram for chadmin dictionary migrate command

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sourcery-ai bot Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dimbo4ka Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MikhailBurdukov Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dimbo4ka commented Dec 3, 2025 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Dec 3, 2025 •

edited

Loading

dimbo4ka Dec 15, 2025 •

edited

Loading

MikhailBurdukov Dec 4, 2025 •

edited

Loading