
Incorrect min max parquet stats when the column value are infinity or -infinity #20014

@PingLiuPing

Description


What happens?

When writing Parquet files, if every value in a float or double column is infinity, or every value is -infinity, the min and max statistics written to the Parquet metadata are incorrect.
For example, given a column of type float in which all values are -std::numeric_limits<float>::infinity(), the min and max collected in

struct FloatingPointOperator : public BaseParquetOperator {
	template <class SRC, class TGT>
	static TGT Operation(SRC input) {
		return TGT(input);
	}
	template <class SRC, class TGT>
	static unique_ptr<ColumnWriterStatistics> InitializeStats() {
		return make_uniq<FloatingPointStatisticsState<SRC, TGT, BaseParquetOperator>>();
	}
	template <class SRC, class TGT>
	static void HandleStats(ColumnWriterStatistics *stats, TGT target_value) {
		auto &numeric_stats = stats->Cast<FloatingPointStatisticsState<SRC, TGT, BaseParquetOperator>>();
		if (Value::IsNan(target_value)) {
			numeric_stats.has_nan = true;
		} else {
			if (LessThan::Operation(target_value, numeric_stats.min)) {
				numeric_stats.min = target_value;
			}
			if (GreaterThan::Operation(target_value, numeric_stats.max)) {
				numeric_stats.max = target_value;
			}
		}
	}
};
are [-Inf, std::numeric_limits<float>::lowest()].
The expected result is [-Inf, -Inf].
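
The reported [-Inf, lowest()] pair is what strict comparisons produce if the running min and max start at the largest and lowest finite values. The standalone sketch below illustrates this. It assumes that seeding (not verified against FloatingPointStatisticsState), and SketchStats, SketchHandleStats and CollectConstantColumn are hypothetical names, not DuckDB APIs.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical stand-in for the writer's statistics state.
// Assumption: min/max are seeded with the largest and lowest *finite* floats.
struct SketchStats {
	float min = std::numeric_limits<float>::max();
	float max = std::numeric_limits<float>::lowest();
	bool has_nan = false;
};

// Mirrors the strict comparisons used in HandleStats.
inline void SketchHandleStats(SketchStats &stats, float value) {
	if (std::isnan(value)) {
		stats.has_nan = true;
		return;
	}
	if (value < stats.min) {
		stats.min = value;
	}
	if (value > stats.max) {
		stats.max = value;
	}
}

// Feeds `count` copies of `value` through the sketch and returns the state.
inline SketchStats CollectConstantColumn(float value, int count) {
	assert(count >= 0);
	SketchStats stats;
	for (int i = 0; i < count; i++) {
		SketchHandleStats(stats, value);
	}
	return stats;
}
```

With a column of only -Inf, the min becomes -Inf, but -Inf is never greater than lowest(), so the max keeps its seed. A column of only +Inf shows the mirror image: the max becomes +Inf while the min stays at the finite maximum.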

To Reproduce

	template <class SRC, class TGT>
	static void HandleStats(ColumnWriterStatistics *stats, TGT target_value) {
		auto &numeric_stats = stats->Cast<FloatingPointStatisticsState<SRC, TGT, BaseParquetOperator>>();
		if (Value::IsNan(target_value)) {
			numeric_stats.has_nan = true;
		} else {
			if (LessThan::Operation(target_value, numeric_stats.min)) {
				numeric_stats.min = target_value;
			}
			if (GreaterThan::Operation(target_value, numeric_stats.max)) {
				numeric_stats.max = target_value;
			}
		}
	}
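
One possible way to make the comparison logic above handle all-infinity columns, sketched under assumptions: seed the running min with +infinity and the running max with -infinity, so that the seeds are already correct when the only values seen are infinities. InfSeededStats and UpdateStats are hypothetical names; this is an illustration, not the fix adopted by DuckDB.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical statistics state seeded with infinities instead of finite limits.
template <class T>
struct InfSeededStats {
	T min = std::numeric_limits<T>::infinity();
	T max = -std::numeric_limits<T>::infinity();
	bool has_nan = false;
	bool has_value = false; // true once any non-NaN value has been seen
};

template <class T>
void UpdateStats(InfSeededStats<T> &stats, T value) {
	static_assert(std::numeric_limits<T>::has_infinity, "T must support infinity");
	if (std::isnan(value)) {
		stats.has_nan = true;
		return;
	}
	stats.has_value = true;
	// Strict comparisons suffice: if value equals a seed, the seed is already correct.
	if (value < stats.min) {
		stats.min = value;
	}
	if (value > stats.max) {
		stats.max = value;
	}
	// After at least one non-NaN value, the bounds are ordered.
	assert(stats.min <= stats.max);
}
```

The has_value flag is needed because the seeds are now legal data values: without it, an empty or all-NaN column would be indistinguishable from one whose bounds are genuinely infinite.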

OS:

x86_64

DuckDB Version:

v1.4.2

DuckDB Client:

command line

Hardware:

No response

Full Name:

Ping Liu

Affiliation:

IBM

Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?

  • Yes, I have

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set
