Hive partition column missing when WHERE is used with file(..., Parquet) #538

@mariannacao

Describe what's wrong

Hive partition columns inferred from directory names disappear when a WHERE clause is present in queries using the file(..., Parquet) table function.

Without WHERE, the partition column is correctly detected.
When a WHERE clause is added, the query fails with:

DB::Exception: Not found column date in block (NOT_FOUND_COLUMN_IN_BLOCK)

Disabling hive partitioning (SETTINGS use_hive_partitioning = 0) avoids the error.


Repro

example directory layout (with hive partitioning)

tmp/date=2025-01-01/file.parquet
tmp/date=2025-01-02/file.parquet

Parquet file schema includes column:

value Float64
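
To recreate the layout above locally, something like the following sketch may help. It assumes pyarrow is installed (the issue does not say how the files were produced), and the sample values are made up; only the directory names and the single Float64 column `value` come from the report. The helper names `partition_file` and `write_repro_dataset` are hypothetical.

```python
from pathlib import Path


def partition_file(root, date):
    """Return the hive-style path <root>/date=<date>/file.parquet."""
    return Path(root) / f"date={date}" / "file.parquet"


def write_repro_dataset(root="tmp", dates=("2025-01-01", "2025-01-02")):
    """Write one small Parquet file per date partition under `root`."""
    # Assumption: pyarrow is available. It is imported lazily so that
    # partition_file() can be used without it.
    import pyarrow as pa
    import pyarrow.parquet as pq

    for date in dates:
        target = partition_file(root, date)
        target.parent.mkdir(parents=True, exist_ok=True)
        # Made-up sample data; the file contains only the column `value`.
        table = pa.table({"value": pa.array([0.5, 1.5], type=pa.float64())})
        pq.write_table(table, str(target))
```

Calling `write_repro_dataset()` from the working directory should produce the two files listed above, with no `date` column stored inside the Parquet files themselves.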

queries without a WHERE clause work, and the partition column is detected:

SELECT *
FROM file('tmp/**/*.parquet', Parquet);

but queries with a WHERE clause fail, with or without the setting given explicitly:

SELECT *
FROM file('tmp/**/*.parquet', Parquet)
WHERE value > 1.0;

SELECT *
FROM file('tmp/**/*.parquet', Parquet)
WHERE value > 1.0
SETTINGS use_hive_partitioning = 1;

the query works when I turn off hive partitioning:

SELECT *
FROM file('tmp/**/*.parquet', Parquet)
WHERE value > 1.0
SETTINGS use_hive_partitioning = 0;

I'm calling these via the chdb.query() function:

import chdb

chdb.query(
    "SELECT * FROM file('tmp/**/*.parquet', Parquet) WHERE value > 1.0",
    "ArrowTable",
)

the error message produced is as follows:

DB::Exception: Not found column date in block (NOT_FOUND_COLUMN_IN_BLOCK)

I worked around this by disabling hive partitioning and reconstructing the partition columns manually from _path, which hurts performance. The issue appears only when a WHERE clause is present; queries without WHERE work as expected.
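
For reference, the workaround looks roughly like the sketch below. It is not necessarily the exact query I run: the helper names `build_workaround_query` and `run_workaround` are hypothetical, the regular expression assumes a `key=value` directory segment, and the reconstructed column comes back as a String rather than a Date.

```python
def build_workaround_query(glob, partition_key, predicate):
    """Build a query that turns hive partitioning off and derives the
    partition column from the _path virtual column instead."""
    return (
        # extract() returns the first capture group, i.e. the text after
        # "<partition_key>=" up to the next "/" in the file path.
        f"SELECT *, extract(_path, '{partition_key}=([^/]+)') AS {partition_key} "
        f"FROM file('{glob}', Parquet) "
        f"WHERE {predicate} "
        "SETTINGS use_hive_partitioning = 0"
    )


def run_workaround(glob="tmp/**/*.parquet"):
    """Run the workaround query through chdb and return an Arrow table."""
    # Imported lazily so the query builder can be used without chdb.
    import chdb

    sql = build_workaround_query(glob, "date", "value > 1.0")
    return chdb.query(sql, "ArrowTable")
```

The inputs are interpolated into the SQL text without escaping, so this is only suitable for trusted, hard-coded values.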


env:

  • chdb: 4.1.0
  • ClickHouse engine version: 26.1.2.1
  • Python: 3.10
