Hive partitioned write: lazy partitioning initialization #11765
Merged
This PR changes the way that partitions are initialized for hive partitioned writes. Previously, all found partitions would be eagerly initialized for each partitioned collection through the methods `SynchronizeLocalMap`, `GrowAppendState` and `GrowPartitions`. This initializes a `ColumnDataCollection` and `DataChunk` for each partition, for each thread. After #10976, we flush the intermediate partitioned states after a set number of rows (500K by default), and then reset the partitioned state. In the current set-up, after the reset happens, the synchronize methods are called and all partitioned states are initialized again. When dealing with many partitions, especially on wider tables, this consumes a lot of memory. We are also doing work that might not be necessary, as not every batch of 500K rows will necessarily contain all partitions.
This PR reworks the initialization of the partitioned states so that, when a new partition is found locally, we initialize the partitioned states for that partition. The rest of the states are left uninitialized. This greatly reduces memory usage in many cases when there are many partitions. This is especially true when the partitions are not randomly spread but co-located, for example when partitioning by dates and the dates are sorted.
Benchmarks
Below are some timings running the same query as in #10976, with `n` replaced by the number of partitions. This time we go up to the total number of unique values in the `CounterID` column (6506). The table itself is wide (~100 columns, with long strings), which makes this partitioning more challenging as well. As shown, the old version generally performs worse with more partitions, and runs `OOM` in both the temp-directory and non-temp-directory cases for the full data set, whereas the temp directory is not necessary in the current implementation.