Add file size filtering options to merge_adjacent_files compaction #632
Adds `min_file_size` and `max_file_size` optional parameters to `ducklake_merge_adjacent_files` to filter which files are considered for compaction based on their size. This enables tiered compaction strategies and more granular control over the compaction process. Related discussion: duckdb#627

**Tiered Compaction**: Implement Lucene-like tiered compaction where files are merged in stages:

- Tier 0 → Tier 1: Merge small files (< 10KB) into ~50KB files
- Tier 1 → Tier 2: Merge medium files (10KB-100KB) into ~200KB files
- Tier 2 → Tier 3: Merge large files (100KB-500KB) into ~1MB files

This provides more predictable I/O amplification and better incremental compaction for streaming workloads.

tldr;

- Added `min_file_size` and `max_file_size` parameters to filter files by size
- Files smaller than `min_file_size` (if set) are excluded from compaction
- Files at or larger than `max_file_size` (if set) are excluded from compaction
- If `max_file_size` is not specified, it defaults to `target_file_size`
- Added validation: `max_file_size` must be > 0, and `min_file_size` < `max_file_size` if both are set
- Added `merge_adjacent_file_size_filter.test` covering filtering, error handling, and data integrity
- Added `merge_adjacent_tiered.test` demonstrating a full tiered compaction workflow
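For illustration, a tiered pass over these thresholds could be driven as below. This is a hedged sketch rather than the PR's verbatim API: it assumes the catalog is attached as `my_lake`, that sizes are given in bytes, and that the filters are passed as named parameters; the output file size itself remains governed by the existing `target_file_size` setting.

```sql
-- Hypothetical tiered compaction run; catalog name and parameter style are assumptions.

-- Tier 0 → Tier 1: only files smaller than 10KB are candidates
CALL ducklake_merge_adjacent_files('my_lake', max_file_size => 10240);

-- Tier 1 → Tier 2: only files in the 10KB-100KB range are candidates
CALL ducklake_merge_adjacent_files('my_lake', min_file_size => 10240, max_file_size => 102400);

-- Tier 2 → Tier 3: only files in the 100KB-500KB range are candidates
CALL ducklake_merge_adjacent_files('my_lake', min_file_size => 102400, max_file_size => 512000);
```

Per the validation rules above, passing `max_file_size => 0`, or a `min_file_size` at or above `max_file_size`, should raise an error.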
Hi @redox, thanks for the PR! You should execute the size filters at the query in […]
Ah yes of course! Let me know what y'all think about it; happy to submit a docs PR to https://github.com/duckdb/ducklake-web just after.
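For context on the suggestion above: applying the filters "at the query" presumably means pushing the size predicate into the metadata scan that lists candidate files, rather than filtering the results afterwards in C++. A hedged sketch, assuming the DuckLake metadata schema where data files are recorded in `ducklake_data_file` with a `file_size_bytes` column (the PR's actual query will differ):

```sql
-- Illustrative only: list compaction candidates with the size filters applied
-- directly in the metadata query.
SELECT data_file_id, path, file_size_bytes
FROM ducklake_data_file
WHERE end_snapshot IS NULL              -- only files that are still live
  AND file_size_bytes >= 10240          -- min_file_size, when set
  AND file_size_bytes < 102400;         -- max_file_size (defaults to target_file_size)
```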
pdet left a comment:
Just have a small comment, then I think it's good to go.
```cpp
	uint64_t max_files;
	//! New in this PR: files smaller than this are excluded from compaction (when set)
	optional_idx min_file_size;
	//! New in this PR: files at or above this are excluded; defaults to target_file_size
	optional_idx max_file_size;
};
```
@pdet I hesitated with `DuckLakeMergeOptions`; let me know if you have a POV.
I think this looks great, thanks!
Thanks for all the work! It looks excellent!