Closed
duckdb/duckdb-r
#1805
Description
What happens?
When writing Parquet files, if every value in a float or double column is infinity, or every value is -infinity, the min and max statistics written to the Parquet metadata are incorrect.
For example, when the column type is float and all values are -std::numeric_limits<float>::infinity(), the min and max are collected by the following code in
duckdb/extension/parquet/include/writer/parquet_write_operators.hpp
(lines 71 to 96 in 27aec99):
    struct FloatingPointOperator : public BaseParquetOperator {
        template <class SRC, class TGT>
        static TGT Operation(SRC input) {
            return TGT(input);
        }

        template <class SRC, class TGT>
        static unique_ptr<ColumnWriterStatistics> InitializeStats() {
            return make_uniq<FloatingPointStatisticsState<SRC, TGT, BaseParquetOperator>>();
        }

        template <class SRC, class TGT>
        static void HandleStats(ColumnWriterStatistics *stats, TGT target_value) {
            auto &numeric_stats = stats->Cast<FloatingPointStatisticsState<SRC, TGT, BaseParquetOperator>>();
            if (Value::IsNan(target_value)) {
                numeric_stats.has_nan = true;
            } else {
                if (LessThan::Operation(target_value, numeric_stats.min)) {
                    numeric_stats.min = target_value;
                }
                if (GreaterThan::Operation(target_value, numeric_stats.max)) {
                    numeric_stats.max = target_value;
                }
            }
        }
    };
However, the min and max written for this column should be [-Inf, -Inf].
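A minimal standalone sketch of how such a mismatch can arise. It is not DuckDB code: `SketchStats`, `UpdateStats`, and `CollectStats` are hypothetical names, and the initial values (min seeded with the largest finite value, max with the lowest finite value) are an assumption, since the initializer of `FloatingPointStatisticsState` is not shown in this issue. Only the comparison logic mirrors `HandleStats`.

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical stand-in for the writer's statistics state.
// ASSUMPTION: min starts at the largest finite value and max at the
// lowest finite value; the real initializer is not shown in the issue.
template <class T>
struct SketchStats {
    T min = std::numeric_limits<T>::max();
    T max = std::numeric_limits<T>::lowest();
    bool has_nan = false;
};

// Same update rule as the HandleStats function quoted in the issue.
template <class T>
void UpdateStats(SketchStats<T> &stats, T value) {
    if (std::isnan(value)) {
        stats.has_nan = true;
    } else {
        if (value < stats.min) {
            stats.min = value;
        }
        if (value > stats.max) {
            stats.max = value;
        }
    }
}

// Folds a whole column into one statistics object.
template <class T>
SketchStats<T> CollectStats(const std::vector<T> &values) {
    SketchStats<T> stats;
    for (T v : values) {
        UpdateStats(stats, v);
    }
    return stats;
}
```

Under that assumption, a column containing only -infinity updates min to -infinity, but max never changes, because -infinity is not greater than the lowest finite value. The symmetric case (only +infinity) leaves min stuck at the largest finite value.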
To Reproduce
    template <class SRC, class TGT>
    static void HandleStats(ColumnWriterStatistics *stats, TGT target_value) {
        auto &numeric_stats = stats->Cast<FloatingPointStatisticsState<SRC, TGT, BaseParquetOperator>>();
        if (Value::IsNan(target_value)) {
            numeric_stats.has_nan = true;
        } else {
            if (LessThan::Operation(target_value, numeric_stats.min)) {
                numeric_stats.min = target_value;
            }
            if (GreaterThan::Operation(target_value, numeric_stats.max)) {
                numeric_stats.max = target_value;
            }
        }
    }
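One possible way to make this update rule robust to all-infinite columns, sketched under assumptions rather than taken from DuckDB: seed min with +infinity and max with -infinity, and track separately whether any non-NaN value was seen. `SeededStats`, `UpdateSeeded`, and `CollectSeeded` are hypothetical names used only for illustration.

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical statistics state (not DuckDB's): min/max are seeded with
// +infinity / -infinity so that any non-NaN input, including an
// infinity, compares correctly against the seed.
template <class T>
struct SeededStats {
    T min = std::numeric_limits<T>::infinity();
    T max = -std::numeric_limits<T>::infinity();
    bool has_nan = false;
    bool has_value = false; // true once a non-NaN value has been seen
};

template <class T>
void UpdateSeeded(SeededStats<T> &stats, T value) {
    if (std::isnan(value)) {
        stats.has_nan = true;
        return;
    }
    stats.has_value = true;
    if (value < stats.min) {
        stats.min = value;
    }
    if (value > stats.max) {
        stats.max = value;
    }
}

// Folds a whole column into one statistics object.
template <class T>
SeededStats<T> CollectSeeded(const std::vector<T> &values) {
    SeededStats<T> stats;
    for (T v : values) {
        UpdateSeeded(stats, v);
    }
    return stats;
}
```

With these seeds a column of only -infinity ends with min and max both equal to -infinity. The `has_value` flag is there because an empty or all-NaN column would otherwise report the seeds themselves as statistics, so a writer would need to check it before emitting min and max.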
OS:
x86_64
DuckDB Version:
v1.4.2
DuckDB Client:
command line
Hardware:
No response
Full Name:
Ping Liu
Affiliation:
IBM
Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?
- Yes, I have
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant data sets for reproducing the issue?
Not applicable - the reproduction does not require a data set