Skip to content

Conversation

@Mytherin
Copy link
Collaborator

This PR changes the way that partitions are initialized for hive partitioned writes. Previously, all found partitions would be eagerly initialized for each partitioned collection through the methods SynchronizeLocalMap, GrowAppendState and GrowPartitions. This initializes a ColumnDataCollection and DataChunk for each partition for each thread.

After #10976, we flush the intermediate partitioned states after a set amount of rows (500K by default). After this, we reset the partitioned state. In the current set-up, after the reset happens, the synchronize methods are called and all partitioned states are initialized again. When dealing with many partitions, especially on wider tables, this consumes a lot of memory. We are also doing work that might not be necessary, as not every batch of 500K rows might contain all partitions.

This PR reworks the initialization of the partitioned states so that, when a new partition is found locally, we initialize the partitioned states for that partition. The rest of the states are left uninitialized. This greatly reduces memory usage in many cases when there are many partitions. This is especially true when the partitions are not randomly spread but co-located, for example when partitioning by dates and the dates are sorted.

Benchmarks

Below are some timings running the same query as in #10976, with n replaced by the number of partitions. This time we go up to the total number of unique values in the CounterID column (6506). The table itself is wide (100~ columns, with long strings) which makes this partitioning more challenging as well.

copy (select CounterID%$n as partition_key, * from '~/Data/hits.parquet') TO 'hits_partitioned' (FORMAT PARQUET, PARTITION_BY partition_key);

As shown the old version generally performs worse with more partitions, and runs OOM in both the temp directory and the non-temp directory cases for the full data set, whereas the temp directory is not necessary in the current implementation.

Partitions Version 52GB mem limit, no temp dir 52GB mem limit, temp dir
10 v0.10.2 61.1s
New 58.8s
100 v0.10.2 65.8s
New 57.8s
1000 v0.10.2 73.8s
New 60.8s
2000 v0.10.2 82.5s
New 62.9s
4000 v0.10.2 OOM 116s
New 62.8s
6506 v0.10.2 OOM OOM
New 66s

@Mytherin Mytherin requested a review from samansmink April 22, 2024 12:57
Copy link
Collaborator

@samansmink samansmink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Mytherin Mytherin merged commit 14e192d into duckdb:main Apr 22, 2024
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request May 5, 2024
Merge pull request duckdb/duckdb#11765 from Mytherin/lazypartitioninit
@Mytherin Mytherin deleted the lazypartitioninit branch June 7, 2024 12:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants