-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Fix #19575: Fix illegal utf8 character #19699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/optimizer/filter_combiner.cpp
Outdated
| auto upper_bound = make_uniq<ConstantFilter>(ExpressionType::COMPARE_LESSTHAN, Value(prefix_string)); | ||
| table_filters.PushFilter(column_index, std::move(upper_bound)); | ||
| } | ||
| return FilterPushdownResult::PUSHED_DOWN_FULLY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this feels off, returning FilterPushdownResult::PUSHED_DOWN_FULLY when the table filter is not pushed
Mytherin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix! Left a comment on how to improve this fix. Can we also target this towards v1.4-andium?
src/optimizer/filter_combiner.cpp
Outdated
| auto upper_bound = make_uniq<ConstantFilter>(ExpressionType::COMPARE_LESSTHAN, Value(prefix_string)); | ||
| table_filters.PushFilter(column_index, std::move(lower_bound)); | ||
| table_filters.PushFilter(column_index, std::move(upper_bound)); | ||
| prefix_string[prefix_string.size() - 1]++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of just incrementing the last byte (which might result in invalid UTF8) - could we find the next UTF8 codepoint?
This would look something like this:
// find the start of the last codepoint
idx_t last_codepoint_start;
for(last_codepoint_start = prefix_string.size(); last_codepoint_start > 0; last_codepoint_start--) {
if (IsCharacter(prefix_string[last_codepoint_start - 1])) {
break;
}
}
D_ASSERT(last_codepoint_start > 0);
last_codepoint_start--;
int codepoint_size;
auto codepoint = Utf8Proc ::UTF8ToCodepoint(prefix_string.c_str() + last_codepoint_start, codepoint_size);
auto next_codepoint = codepoint + 1;
if (next_codepoint >= 0xD800 && next_codepoint <= 0xDFFF) {
// next codepoint falls within surrogate range increment to next valid character
next_codepoint = 0xE000;
}
char next_codepoint_text[4];
int next_codepoint_size;
if (!Utf8Proc::CodepointToUtf8(codepoint + 1, next_codepoint_size, next_codepoint_text)) {
// invalid codepoint
return FilterPushdownResult::NO_PUSHDOWN;
}
string upper_bound = prefix_string.substr(0, last_codepoint_start) + string(next_codepoint_text, next_codepoint_size);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In UTF8 a higher codepoint is always lexicographically higher than a lower codepoint, we can verify this using DuckDB itself, e.g.:
select codepoint, try(chr(codepoint::INTEGER)) unicode
from range(0, '0x10FFFF'::BIGINT + 1) t(codepoint)
where unicode is not null
qualify codepoint < lag(codepoint) over (order by unicode);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can trigger the "invalid codepoint" codepath in a test using a prefix filter on chr('0x10FFFF') which is the highest unicode codepoint possible, and the surrogate range codepath using a filter on chr('0xD7FF')
13c5f6a to
4939f9a
Compare
Mytherin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes! Looks good - one comment
src/optimizer/filter_combiner.cpp
Outdated
| bool FilterCombiner::FindNextLegalUTF8(string &prefix_string) { | ||
| // find the start of the last codepoint | ||
| idx_t last_codepoint_start; | ||
| for (last_codepoint_start = prefix_string.size() - 1; last_codepoint_start >= 0; last_codepoint_start--) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last_codepoint_start >= 0 is always true because last_codepoint_start is an idx_t - this will instead underflow (and loop infinitely) if this does not find a character. Could you modify this to the following loop for safety purposes - otherwise we might get infinite loops if invalid UTF8 somehow enters the system:
idx_t last_codepoint_start;
for(last_codepoint_start = prefix_string.size(); last_codepoint_start > 0; last_codepoint_start--) {
if (IsCharacter(prefix_string[last_codepoint_start - 1])) {
break;
}
}
if (last_codepoint_start == 0) {
throw InvalidInputException("Invalid UTF8 found in string \"%s\"", prefix_string);
}
last_codepoint_start--;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reply! I missed that last_codepoint_start is unsigned.
And I have another one question:
I want to keep find the legal utf8 codepoint until we find the first one or all bytes in prefix_string are processed, do you think it's correct and useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that would be possible yeah - although I think the last codepoint is very rarely used
|
Thanks! |
Fix duckdb/duckdb#19575: Fix illegal utf8 character (duckdb/duckdb#19699) It should probably be `${EXTENSION_CONFIG_BUILD}` instead of `EXTENSION_CONFIG_BUILD` (duckdb/duckdb#19694)
Fix duckdb/duckdb#19575: Fix illegal utf8 character (duckdb/duckdb#19699) It should probably be `${EXTENSION_CONFIG_BUILD}` instead of `EXTENSION_CONFIG_BUILD` (duckdb/duckdb#19694) Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
This PR is a simple fix for #19575. If we still want to create a
COMPARE_LESSTHANfilter, we need to handle theprefix_stringat the character encoding level, not the byte level.