Describe what's wrong
Hive partition columns inferred from directory names disappear when a WHERE clause is present in queries using the file(..., Parquet) table function.
Without WHERE, the partition column is correctly detected.
When a WHERE clause is added, the query fails with:
DB::Exception: Not found column date in block (NOT_FOUND_COLUMN_IN_BLOCK)
Disabling hive partitioning (SETTINGS use_hive_partitioning = 0) avoids the error.
Repro
example directory layout (with hive partitioning)
tmp/date=2025-01-01/file.parquet
tmp/date=2025-01-02/file.parquet
Parquet file schema includes column:
queries without a WHERE clause work:
SELECT *
FROM file('tmp/**/*.parquet', Parquet);
but queries with a WHERE clause fail, with or without the setting spelled out:
SELECT *
FROM file('tmp/**/*.parquet', Parquet)
WHERE value > 1.0;
SELECT *
FROM file('tmp/**/*.parquet', Parquet)
WHERE value > 1.0
SETTINGS use_hive_partitioning = 1;
the query works when I turn off hive partitioning:
SELECT *
FROM file('tmp/**/*.parquet', Parquet)
WHERE value > 1.0
SETTINGS use_hive_partitioning = 0;
I'm calling these via the chdb.query() function:
import chdb
chdb.query(
"SELECT * FROM file('tmp/**/*.parquet', Parquet) WHERE value > 1.0",
"ArrowTable",
)
the error message produced is as follows:
DB::Exception: Not found column date in block (NOT_FOUND_COLUMN_IN_BLOCK)
I worked around this by disabling hive partitioning and reconstructing the partition columns manually from _path, which hurts performance. The issue appears only when a WHERE clause is present; queries without WHERE work as expected.
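For anyone hitting the same error, here is a minimal sketch of that workaround. It assumes the partition key is `date`, as in the layout above. The helper names (`build_workaround_query`, `partition_value`, `run_workaround`) and the `toDate(extract(...))` expression are illustrative and not the exact code I run; only `chdb.query()`, `_path`, and the `use_hive_partitioning` setting come from this report.

```python
import re

# Hypothetical pattern for the `date=...` directory component in the example layout.
PARTITION_PATTERN = "date=([^/]+)"


def build_workaround_query(glob: str, predicate: str) -> str:
    """Build a query that derives `date` from _path with hive partitioning disabled."""
    return (
        f"SELECT *, toDate(extract(_path, '{PARTITION_PATTERN}')) AS date "
        f"FROM file('{glob}', Parquet) "
        f"WHERE {predicate} "
        "SETTINGS use_hive_partitioning = 0"
    )


def partition_value(path: str) -> str:
    """Pure-Python mirror of the SQL extract(), useful for checking the pattern."""
    match = re.search(PARTITION_PATTERN, path)
    return match.group(1) if match else ""


def run_workaround(glob: str, predicate: str):
    # Imported lazily so the helpers above can be used without chdb installed.
    import chdb

    return chdb.query(build_workaround_query(glob, predicate), "ArrowTable")
```

Calling `run_workaround('tmp/**/*.parquet', 'value > 1.0')` would issue the same filter as the failing query, with `date` rebuilt from the file path. The extra string extraction per row is presumably part of the performance cost mentioned above.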
env:
- chdb: 4.1.0
- ClickHouse engine version: 26.1.2.1
- Python: 3.10