Fix #19575: Fix illegal utf8 character #19699

flashmouse · 2025-11-08T07:36:49Z

This PR is a simple fix for #19575. If we still want to create a COMPARE_LESSTHAN filter, we need to handle the prefix_string at the character encoding level, not the byte level.
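To illustrate the byte-level versus encoding-level distinction, here is a minimal standalone sketch (not DuckDB code). `IsStructurallyValidUtf8` and `IncrementLastByte` are hypothetical helpers introduced only for this illustration. The check looks at lead/continuation byte layout only and ignores overlong encodings and surrogates, and the increment assumes a non-empty prefix.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: structural UTF-8 check (byte layout only).
// It does not reject overlong encodings or surrogates.
bool IsStructurallyValidUtf8(const std::string &s) {
	size_t i = 0;
	while (i < s.size()) {
		unsigned char lead = static_cast<unsigned char>(s[i]);
		size_t len;
		if (lead < 0x80) {
			len = 1;
		} else if ((lead & 0xE0) == 0xC0) {
			len = 2;
		} else if ((lead & 0xF0) == 0xE0) {
			len = 3;
		} else if ((lead & 0xF8) == 0xF0) {
			len = 4;
		} else {
			return false; // stray continuation byte or invalid lead byte
		}
		if (i + len > s.size()) {
			return false; // truncated sequence
		}
		for (size_t k = 1; k < len; k++) {
			if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) {
				return false; // expected a continuation byte (10xxxxxx)
			}
		}
		i += len;
	}
	return true;
}

// Hypothetical helper: the byte-level approach, bumping the last byte.
// Assumes the prefix is not empty.
std::string IncrementLastByte(std::string prefix) {
	prefix[prefix.size() - 1]++;
	return prefix;
}
```

With these helpers, `"abc"` becomes `"abd"`, which is still valid, but a prefix ending in U+00FF (bytes `C3 BF`) becomes `C3 C0`, and `C0` is no longer a continuation byte, so the result is not valid UTF-8.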

samansmink · 2025-11-10T16:28:57Z

src/optimizer/filter_combiner.cpp

+		auto upper_bound = make_uniq<ConstantFilter>(ExpressionType::COMPARE_LESSTHAN, Value(prefix_string));
+		table_filters.PushFilter(column_index, std::move(upper_bound));
+	}
 	return FilterPushdownResult::PUSHED_DOWN_FULLY;


This feels off: it returns FilterPushdownResult::PUSHED_DOWN_FULLY even when the table filter is not pushed.

Mytherin

Thanks for the fix! Left a comment on how to improve this fix. Can we also target this towards v1.4-andium?

Mytherin · 2025-11-11T07:20:55Z

src/optimizer/filter_combiner.cpp

-	auto upper_bound = make_uniq<ConstantFilter>(ExpressionType::COMPARE_LESSTHAN, Value(prefix_string));
 	table_filters.PushFilter(column_index, std::move(lower_bound));
-	table_filters.PushFilter(column_index, std::move(upper_bound));
+	prefix_string[prefix_string.size() - 1]++;


Instead of just incrementing the last byte (which might result in invalid UTF8) - could we find the next UTF8 codepoint?

This would look something like this:

// find the start of the last codepoint
idx_t last_codepoint_start;
for (last_codepoint_start = prefix_string.size(); last_codepoint_start > 0; last_codepoint_start--) {
	if (IsCharacter(prefix_string[last_codepoint_start - 1])) {
		break;
	}
}
D_ASSERT(last_codepoint_start > 0);
last_codepoint_start--;
int codepoint_size;
auto codepoint = Utf8Proc::UTF8ToCodepoint(prefix_string.c_str() + last_codepoint_start, codepoint_size);
auto next_codepoint = codepoint + 1;
if (next_codepoint >= 0xD800 && next_codepoint <= 0xDFFF) {
	// next codepoint falls within surrogate range
	// increment to next valid character
	next_codepoint = 0xE000;
}
char next_codepoint_text[4];
int next_codepoint_size;
// encode next_codepoint (not codepoint + 1) so that the surrogate adjustment above takes effect
if (!Utf8Proc::CodepointToUtf8(next_codepoint, next_codepoint_size, next_codepoint_text)) {
	// invalid codepoint
	return FilterPushdownResult::NO_PUSHDOWN;
}
string upper_bound = prefix_string.substr(0, last_codepoint_start) + string(next_codepoint_text, next_codepoint_size);

In UTF8, a higher codepoint always sorts lexicographically after a lower codepoint. We can verify this using DuckDB itself, e.g.:

select codepoint, try(chr(codepoint::INTEGER)) unicode
from range(0, '0x10FFFF'::BIGINT + 1) t(codepoint)
where unicode is not null
qualify codepoint < lag(codepoint) over (order by unicode);
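The same ordering property can also be checked outside of DuckDB. The following is a rough standalone sketch: `EncodeUtf8` and `CodepointOrderMatchesByteOrder` are hypothetical names, the encoder is hand-written rather than taken from any library, and it relies on `std::string` comparing bytes as unsigned values.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one codepoint (assumed <= 0x10FFFF) as UTF-8.
std::string EncodeUtf8(uint32_t cp) {
	std::string out;
	if (cp < 0x80) {
		out += static_cast<char>(cp);
	} else if (cp < 0x800) {
		out += static_cast<char>(0xC0 | (cp >> 6));
		out += static_cast<char>(0x80 | (cp & 0x3F));
	} else if (cp < 0x10000) {
		out += static_cast<char>(0xE0 | (cp >> 12));
		out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
		out += static_cast<char>(0x80 | (cp & 0x3F));
	} else {
		out += static_cast<char>(0xF0 | (cp >> 18));
		out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
		out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
		out += static_cast<char>(0x80 | (cp & 0x3F));
	}
	return out;
}

// Walk all codepoints in increasing order (skipping surrogates) and check
// that each encoding compares strictly greater than the previous one.
bool CodepointOrderMatchesByteOrder() {
	std::string previous = EncodeUtf8(0);
	for (uint32_t cp = 1; cp <= 0x10FFFF; cp++) {
		if (cp >= 0xD800 && cp <= 0xDFFF) {
			continue; // surrogates are not valid in UTF-8
		}
		std::string current = EncodeUtf8(cp);
		if (!(previous < current)) {
			return false; // byte order disagrees with codepoint order
		}
		previous = current;
	}
	return true;
}
```

Like the query above, which is expected to return no rows, `CodepointOrderMatchesByteOrder()` is expected to find no violation.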

We can trigger the "invalid codepoint" codepath in a test using a prefix filter on chr('0x10FFFF'), which is the highest possible Unicode codepoint, and the surrogate range codepath using a filter on chr('0xD7FF'), the last codepoint before the surrogate range.
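As a way to reason about these two edge cases without building DuckDB, here is a self-contained sketch of the same idea. It is not the code that was merged: `NextPrefixUpperBound` and `EncodeUtf8` are hypothetical names, the decoding is hand-written instead of using `Utf8Proc`, and the function simply returns false where the suggestion above returns `NO_PUSHDOWN`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one codepoint (assumed <= 0x10FFFF) as UTF-8.
std::string EncodeUtf8(uint32_t cp) {
	std::string out;
	if (cp < 0x80) {
		out += static_cast<char>(cp);
	} else if (cp < 0x800) {
		out += static_cast<char>(0xC0 | (cp >> 6));
		out += static_cast<char>(0x80 | (cp & 0x3F));
	} else if (cp < 0x10000) {
		out += static_cast<char>(0xE0 | (cp >> 12));
		out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
		out += static_cast<char>(0x80 | (cp & 0x3F));
	} else {
		out += static_cast<char>(0xF0 | (cp >> 18));
		out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
		out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
		out += static_cast<char>(0x80 | (cp & 0x3F));
	}
	return out;
}

// Replace the last codepoint of `prefix` with the next valid codepoint.
// Returns false if no upper bound can be built (empty or malformed input,
// or the last codepoint is already U+10FFFF).
bool NextPrefixUpperBound(const std::string &prefix, std::string &upper_bound) {
	// walk back over continuation bytes (10xxxxxx) to the last lead byte
	size_t start = prefix.size();
	while (start > 0 && (static_cast<unsigned char>(prefix[start - 1]) & 0xC0) == 0x80) {
		start--;
	}
	if (start == 0) {
		return false; // empty string, or continuation bytes only
	}
	start--;
	unsigned char lead = static_cast<unsigned char>(prefix[start]);
	size_t len = prefix.size() - start;
	uint32_t cp;
	if (lead < 0x80 && len == 1) {
		cp = lead;
	} else if ((lead & 0xE0) == 0xC0 && len == 2) {
		cp = lead & 0x1F;
	} else if ((lead & 0xF0) == 0xE0 && len == 3) {
		cp = lead & 0x0F;
	} else if ((lead & 0xF8) == 0xF0 && len == 4) {
		cp = lead & 0x07;
	} else {
		return false; // lead byte does not match the sequence length
	}
	for (size_t i = start + 1; i < prefix.size(); i++) {
		cp = (cp << 6) | (static_cast<unsigned char>(prefix[i]) & 0x3F);
	}
	uint32_t next = cp + 1;
	if (next >= 0xD800 && next <= 0xDFFF) {
		next = 0xE000; // skip the surrogate range
	}
	if (next > 0x10FFFF) {
		return false; // no codepoint after U+10FFFF
	}
	upper_bound = prefix.substr(0, start) + EncodeUtf8(next);
	return true;
}
```

Under these assumptions, a prefix ending in U+D7FF (bytes `ED 9F BF`) gets an upper bound ending in U+E000 (bytes `EE 80 80`), and a prefix ending in U+10FFFF yields no upper bound at all.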

Mytherin

Thanks for the changes! Looks good - one comment

Mytherin · 2025-11-17T11:30:31Z

src/optimizer/filter_combiner.cpp

+bool FilterCombiner::FindNextLegalUTF8(string &prefix_string) {
+	// find the start of the last codepoint
+	idx_t last_codepoint_start;
+	for (last_codepoint_start = prefix_string.size() - 1; last_codepoint_start >= 0; last_codepoint_start--) {


last_codepoint_start >= 0 is always true because last_codepoint_start is an idx_t, which is unsigned. If the loop does not find a character, the index wraps around instead of going negative, and the loop never terminates. Could you change this to the following loop for safety? Otherwise we might get infinite loops if invalid UTF8 somehow enters the system:

idx_t last_codepoint_start;
for (last_codepoint_start = prefix_string.size(); last_codepoint_start > 0; last_codepoint_start--) {
	if (IsCharacter(prefix_string[last_codepoint_start - 1])) {
		break;
	}
}
if (last_codepoint_start == 0) {
	throw InvalidInputException("Invalid UTF8 found in string \"%s\"", prefix_string);
}
last_codepoint_start--;
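The pattern in this loop (start at `size()`, test `> 0`, index with `- 1`) is a common way to scan backwards with an unsigned index. A small standalone sketch of it follows, using `size_t` in place of `idx_t`. `IsLeadByte`, `FindLastCodepointStart` and `WrappedDecrement` are hypothetical names, `IsLeadByte` is only assumed to behave like the `IsCharacter` call above, and the sketch returns false rather than throwing so that it stays free of DuckDB types.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for IsCharacter: true for any byte that is not a
// UTF-8 continuation byte (10xxxxxx).
bool IsLeadByte(char c) {
	return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
}

// Shows why `i >= 0` cannot end a loop over an unsigned index:
// decrementing 0 wraps around to the largest representable value.
size_t WrappedDecrement() {
	size_t i = 0;
	i--;
	return i;
}

// Backwards scan that is safe for unsigned indices. `i` is always one past
// the byte being inspected, so it never needs to go below zero.
bool FindLastCodepointStart(const std::string &s, size_t &start) {
	size_t i;
	for (i = s.size(); i > 0; i--) {
		if (IsLeadByte(s[i - 1])) {
			break;
		}
	}
	if (i == 0) {
		return false; // empty input, or no lead byte found
	}
	start = i - 1;
	return true;
}
```

For `"ab"` followed by the bytes `C3 A9`, the scan stops at index 2, and for input made of continuation bytes only it terminates and reports failure instead of wrapping around.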

flashmouse

Thanks for the reply! I missed that last_codepoint_start is unsigned.

I have one more question: I would like to keep looking for a legal UTF8 codepoint until we find the first one or all bytes in prefix_string have been processed. Do you think that is correct and useful?

Mytherin

I guess that would be possible, yeah, although I think the last codepoint is very rarely used.

Mytherin · 2025-11-18T09:39:05Z

Thanks!

Fix duckdb/duckdb#19575: Fix illegal utf8 character (duckdb/duckdb#19699) It should probably be `${EXTENSION_CONFIG_BUILD}` instead of `EXTENSION_CONFIG_BUILD` (duckdb/duckdb#19694)

Fix duckdb/duckdb#19575: Fix illegal utf8 character (duckdb/duckdb#19699) It should probably be `${EXTENSION_CONFIG_BUILD}` instead of `EXTENSION_CONFIG_BUILD` (duckdb/duckdb#19694) Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>

samansmink reviewed Nov 10, 2025

Mytherin reviewed Nov 11, 2025

Mytherin added the Changes Requested label Nov 11, 2025

fix illegal utf8 character

4939f9a

flashmouse force-pushed the issue-19575 branch from 13c5f6a to 4939f9a Compare November 15, 2025 12:55

flashmouse marked this pull request as draft November 15, 2025 12:56

flashmouse changed the base branch from main to v1.4-andium November 15, 2025 12:56

flashmouse added 2 commits November 15, 2025 14:17

copy code boss teach me

3d61064

add ut ensure correct

2ee9aba

flashmouse marked this pull request as ready for review November 15, 2025 15:13

Mytherin added Ready For Review and removed Changes Requested labels Nov 16, 2025

Mytherin reviewed Nov 17, 2025

Mytherin added Changes Requested and removed Ready For Review labels Nov 17, 2025

fix potential overflow

97b77a3

flashmouse marked this pull request as draft November 17, 2025 14:38

flashmouse marked this pull request as ready for review November 17, 2025 14:38

Mytherin merged commit b878a5d into duckdb:v1.4-andium Nov 18, 2025
96 of 100 checks passed

github-actions bot mentioned this pull request Nov 26, 2025

vendor: Update vendored sources to duckdb/duckdb@b878a5db0cbf37ab52361fb0d48e7f28598d5992 duckdb/duckdb-r#1742

Merged
