Skip to content

Conversation

@flashmouse
Copy link
Contributor

This PR is a simple fix for #19575. If we still want to create a COMPARE_LESSTHAN filter, we need to handle the prefix_string at the character encoding level, not the byte level.

auto upper_bound = make_uniq<ConstantFilter>(ExpressionType::COMPARE_LESSTHAN, Value(prefix_string));
table_filters.PushFilter(column_index, std::move(upper_bound));
}
return FilterPushdownResult::PUSHED_DOWN_FULLY;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels off, returning FilterPushdownResult::PUSHED_DOWN_FULLY when the table filter is not pushed

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! Left a comment on how to improve this fix. Can we also target this towards v1.4-andium?

auto upper_bound = make_uniq<ConstantFilter>(ExpressionType::COMPARE_LESSTHAN, Value(prefix_string));
table_filters.PushFilter(column_index, std::move(lower_bound));
table_filters.PushFilter(column_index, std::move(upper_bound));
prefix_string[prefix_string.size() - 1]++;
Copy link
Collaborator

@Mytherin Mytherin Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of just incrementing the last byte (which might result in invalid UTF8) - could we find the next UTF8 codepoint?

This would look something like this:

// find the start of the last codepoint
idx_t last_codepoint_start;
for(last_codepoint_start = prefix_string.size(); last_codepoint_start > 0; last_codepoint_start--) {
    if (IsCharacter(prefix_string[last_codepoint_start - 1])) {
        break;
    }
}
D_ASSERT(last_codepoint_start > 0);
last_codepoint_start--;
int codepoint_size;
auto codepoint = Utf8Proc ::UTF8ToCodepoint(prefix_string.c_str() + last_codepoint_start, codepoint_size);
auto next_codepoint = codepoint + 1;
if (next_codepoint >= 0xD800 && next_codepoint <= 0xDFFF) {
    // next codepoint falls within surrogate range increment to next valid character
    next_codepoint = 0xE000;
}
char next_codepoint_text[4];
int next_codepoint_size;
if (!Utf8Proc::CodepointToUtf8(codepoint + 1, next_codepoint_size, next_codepoint_text)) {
        // invalid codepoint
		return FilterPushdownResult::NO_PUSHDOWN;
}
string upper_bound = prefix_string.substr(0, last_codepoint_start) + string(next_codepoint_text, next_codepoint_size);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In UTF8 a higher codepoint is always lexicographically higher than a lower codepoint, we can verify this using DuckDB itself, e.g.:

select codepoint, try(chr(codepoint::INTEGER)) unicode
from range(0, '0x10FFFF'::BIGINT + 1) t(codepoint)
where unicode is not null
qualify codepoint < lag(codepoint) over (order by unicode);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can trigger the "invalid codepoint" codepath in a test using a prefix filter on chr('0x10FFFF') which is the highest unicode codepoint possible, and the surrogate range codepath using a filter on chr('0xD7FF')

@flashmouse flashmouse marked this pull request as draft November 15, 2025 12:56
@flashmouse flashmouse changed the base branch from main to v1.4-andium November 15, 2025 12:56
@flashmouse flashmouse marked this pull request as ready for review November 15, 2025 15:13
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! Looks good - one comment

bool FilterCombiner::FindNextLegalUTF8(string &prefix_string) {
// find the start of the last codepoint
idx_t last_codepoint_start;
for (last_codepoint_start = prefix_string.size() - 1; last_codepoint_start >= 0; last_codepoint_start--) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last_codepoint_start >= 0 is always true because last_codepoint_start is an idx_t - this will instead underflow (and loop infinitely) if this does not find a character. Could you modify this to the following loop for safety purposes - otherwise we might get infinite loops if invalid UTF8 somehow enters the system:

idx_t last_codepoint_start;
for(last_codepoint_start = prefix_string.size(); last_codepoint_start > 0; last_codepoint_start--) {
    if (IsCharacter(prefix_string[last_codepoint_start - 1])) {
        break;
    }
}
if (last_codepoint_start == 0) {
    throw InvalidInputException("Invalid UTF8 found in string \"%s\"", prefix_string);
}
last_codepoint_start--;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reply! I missed that last_codepoint_start is unsigned.

And I have another one question:
I want to keep find the legal utf8 codepoint until we find the first one or all bytes in prefix_string are processed, do you think it's correct and useful?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that would be possible yeah - although I think the last codepoint is very rarely used

@flashmouse flashmouse marked this pull request as draft November 17, 2025 14:38
@flashmouse flashmouse marked this pull request as ready for review November 17, 2025 14:38
@Mytherin Mytherin merged commit b878a5d into duckdb:v1.4-andium Nov 18, 2025
96 of 100 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Nov 26, 2025
Fix duckdb/duckdb#19575: Fix illegal utf8 character (duckdb/duckdb#19699)
It should probably be `${EXTENSION_CONFIG_BUILD}` instead of `EXTENSION_CONFIG_BUILD` (duckdb/duckdb#19694)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Nov 26, 2025
Fix duckdb/duckdb#19575: Fix illegal utf8 character (duckdb/duckdb#19699)
It should probably be `${EXTENSION_CONFIG_BUILD}` instead of `EXTENSION_CONFIG_BUILD` (duckdb/duckdb#19694)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants