Skip to content

Conversation

@krlmlr
Copy link
Contributor

@krlmlr krlmlr commented Oct 4, 2025

This is in response to Valgrind errors observed in the R package.

In the presence of NULL values (and perhaps with other conditions), sort keys can become shorter than the originally anticipated length. This leads to uninitialized memory access.

To prevent UMA, we can either ensure that the uninitialized memory is not read (as this PR does), or fill it with some value.

Submitting the PR to get a second opinion and also to perhaps run the tests on CI/CD. Merging will make my patch in the R package obsolete.

Details

Failing valgrind run: https://github.com/duckdb/duckdb-r/actions/runs/18214180255/job/51860150272#step:6:1717

Reproducible example in R:

library(duckdb)

# Autogenerated
drv <- duckdb()
con <- dbConnect(drv)

dbExecute(con, "SET threads TO 1")

df1 <- tibble::remove_rownames(data.frame(a = seq(1, 6, by = 1), b = 2, g = c(1L, 2L, 2L, 3L, 3L, 3L))[1:2, ])

"union_all"
rel1 <- duckdb:::rel_from_df(con, df1)
print("rel1:")
print(duckdb:::rel_to_altrep(rel1))

df2 <- data.frame(a = 1L, b = 3, g = 2L)

"union_all"
rel2 <- duckdb:::rel_from_df(con, df2)
print("rel2:")
print(duckdb:::rel_to_altrep(rel2))

"union_all"
rel3 <- duckdb:::rel_project(
  rel1,
  list(
    {
      tmp_expr <- duckdb:::expr_reference("a")
      duckdb:::expr_set_alias(tmp_expr, "a")
      tmp_expr
    },
    {
      tmp_expr <- duckdb:::expr_reference("b")
      duckdb:::expr_set_alias(tmp_expr, "b")
      tmp_expr
    },
    {
      tmp_expr <- duckdb:::expr_reference("g")
      duckdb:::expr_set_alias(tmp_expr, "g")
      tmp_expr
    },
    {
      tmp_expr <- duckdb:::expr_window(duckdb:::expr_function("row_number", list()), list(), list(), offset_expr = NULL, default_expr = NULL)
      duckdb:::expr_set_alias(tmp_expr, "___row_number_x")
      tmp_expr
    },
    {
      tmp_expr <- duckdb:::expr_constant(NA_integer_)
      duckdb:::expr_set_alias(tmp_expr, "___row_number_y")
      tmp_expr
    }
  )
)
print("rel3:")
print(duckdb:::rel_to_altrep(rel3))

"union_all"
rel4 <- duckdb:::rel_project(
  rel2,
  list(
    {
      tmp_expr <- duckdb:::expr_reference("a")
      duckdb:::expr_set_alias(tmp_expr, "a")
      tmp_expr
    },
    {
      tmp_expr <- duckdb:::expr_reference("b")
      duckdb:::expr_set_alias(tmp_expr, "b")
      tmp_expr
    },
    {
      tmp_expr <- duckdb:::expr_reference("g")
      duckdb:::expr_set_alias(tmp_expr, "g")
      tmp_expr
    },
    {
      tmp_expr <- duckdb:::expr_constant(NA_integer_)
      duckdb:::expr_set_alias(tmp_expr, "___row_number_x")
      tmp_expr
    },
    {
      tmp_expr <- duckdb:::expr_window(duckdb:::expr_function("row_number", list()), list(), list(), offset_expr = NULL, default_expr = NULL)
      duckdb:::expr_set_alias(tmp_expr, "___row_number_y")
      tmp_expr
    }
  )
)
print("rel4:")
print(duckdb:::rel_to_altrep(rel4))

"union_all"
rel5 <- duckdb:::rel_union_all(rel3, rel4)
print("rel5:")
print(duckdb:::rel_to_altrep(rel5))

# df5 <- duckdb:::rel_to_altrep(rel5)
# rel5 <- duckdb:::rel_from_df(con, df5)

"union_all"
rel6 <- duckdb:::rel_order(
  rel5,
  list(duckdb:::expr_reference("___row_number_x"), duckdb:::expr_reference("___row_number_y"))
)
print("rel6:")
print(duckdb:::rel_to_altrep(rel6))

dbDisconnect(con, shutdown = TRUE)

// call Finalize on the result
for (idx_t r = 0; r < size; r++) {
result_data[r].Finalize();
result_data[r].SetSizeAndFinalize(offsets[r],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #18885 this sounds like it could also result in a problem here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Yes, I saw this. Do we need to take extra care in this PR?

Earlier, I wondered if SetSizeAndFinalize() should also zero out the remaining INLINE_LENGTH - size bytes that were not filled by the memcpy(), to satisfy the check in VerifyCharacters() .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it already does that
What it doesnt do is deal with the case where you allocate for > inlined_size and then SetSizeAndFinalize(<= inlined_size)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it did? I'm seeing in my code:

	void SetSizeAndFinalize(uint32_t size, idx_t allocated_size) {
		value.inlined.length = size;
		if (allocated_size > INLINE_LENGTH && IsInlined()) {
			//! Data was written to the 'value.pointer.ptr', has to be copied to the inlined bytes
			D_ASSERT(value.pointer.ptr);
			auto ptr = value.pointer.ptr;
			memcpy(GetDataWriteable(), ptr, size);
		}
		Finalize();
		VerifyCharacters();
	}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tishj I think this should work properly here. The allocation could be greater than inlined size (while the actual written string is not), but SetSizeAndFinalize handles that case nicely.

Copy link
Contributor

@Tishj Tishj Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I thought that somehow landed in a dev branch, since I thought I only saw one argument to SetSizeAndFinalize but I was mistaken

Thats why I linked to that PR
Anyway, ignore this thread

Copy link
Collaborator

@lnkuiper lnkuiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! This should always truncate the string to the actual written size (stored in offsets), which excludes bytes for fixed-width columns that are NULL (that were initially added to constant_length).

@hannes hannes merged commit 0b33496 into duckdb:v1.4-andium Oct 8, 2025
53 checks passed
@hannes
Copy link
Member

hannes commented Oct 8, 2025

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Oct 8, 2025
Pass lambda bindings to next binder (duckdb/duckdb#19144)
feat: Fix length of sort keys (duckdb/duckdb#19260)
Fix regex optimization to remove flags when converting to contains (duckdb/duckdb#19290)
bump duckdb-azure ref for 1.4.1 (duckdb/duckdb#19275)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Oct 8, 2025
Pass lambda bindings to next binder (duckdb/duckdb#19144)
feat: Fix length of sort keys (duckdb/duckdb#19260)
Fix regex optimization to remove flags when converting to contains (duckdb/duckdb#19290)
bump duckdb-azure ref for 1.4.1 (duckdb/duckdb#19275)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants