
Conversation

redox (Contributor) commented Dec 18, 2025

This follows discussion on #627

Adds `min_file_size` and `max_file_size` optional parameters to `ducklake_merge_adjacent_files` to filter which files are considered for compaction based on their size. This enables tiered compaction strategies and more granular control over the compaction process.

The original idea was to implement Lucene-like tiered compaction, where files are merged in stages to address real-time/streaming ingestion patterns:

- Tier 0 → Tier 1: Merge small files (< 10KB) into ~50KB files
- Tier 1 → Tier 2: Merge medium files (10KB-100KB) into ~200KB files
- Tier 2 → Tier 3: Merge large files (100KB-500KB) into ~1MB files

Such a compaction strategy provides more predictable I/O amplification and better incremental compaction for streaming workloads; a sketch of the workflow follows.
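
A minimal sketch of that workflow, assuming a DuckLake catalog attached as `my_lake` and the catalog-name-first call style of DuckLake's other maintenance functions (the catalog name and byte thresholds are placeholders; `min_file_size`/`max_file_size` are the parameters added in this PR, while the target output sizes would be governed separately, e.g. via `target_file_size`):

```sql
-- Tier 0 → Tier 1: merge only files smaller than 10KB
CALL ducklake_merge_adjacent_files('my_lake', max_file_size => 10240);

-- Tier 1 → Tier 2: merge files in the 10KB-100KB band
CALL ducklake_merge_adjacent_files('my_lake',
        min_file_size => 10240, max_file_size => 102400);

-- Tier 2 → Tier 3: merge files in the 100KB-500KB band
CALL ducklake_merge_adjacent_files('my_lake',
        min_file_size => 102400, max_file_size => 512000);
```

Each pass only touches files inside its size band, so repeated runs converge instead of re-merging already-compacted files.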

TL;DR:

- Added `min_file_size` and `max_file_size` parameters to filter files by size
- Files smaller than `min_file_size` (if set) are excluded from compaction
- Files at or larger than `max_file_size` (if set) are excluded from compaction
- If `max_file_size` is not specified, it defaults to `target_file_size`
- Added validation: `max_file_size` must be > 0, and `min_file_size` < `max_file_size` if both are set (illustrated below)
- Added `merge_adjacent_file_size_filter.test` covering filtering, error handling, and data integrity
- Added `merge_adjacent_tiered.test_slow` demonstrating a full tiered compaction workflow
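
To make the validation rules concrete, both of the following calls would be rejected (`my_lake` is again a placeholder catalog name; exact error messages are omitted here):

```sql
-- Rejected: max_file_size must be > 0
CALL ducklake_merge_adjacent_files('my_lake', max_file_size => 0);

-- Rejected: min_file_size must be strictly less than max_file_size
CALL ducklake_merge_adjacent_files('my_lake',
        min_file_size => 1000, max_file_size => 500);
```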

pdet (Collaborator) commented Dec 18, 2025

Hi @redox, thanks for the PR!

You should apply the size filters in the query issued by
`vector<DuckLakeCompactionFileEntry> DuckLakeMetadataManager::GetFilesForCompaction(DuckLakeTableEntry &table, CompactionType type, double deletion_threshold, DuckLakeSnapshot snapshot);`
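
For context, a minimal sketch of what pushing the size predicate into that metadata query could look like (the `ducklake_data_file` table and its `file_size_bytes` and `end_snapshot` columns follow the DuckLake spec; the actual query text in `GetFilesForCompaction` is an assumption here):

```sql
-- Illustrative only: restrict compaction candidates by size in the
-- metadata query rather than filtering the result set afterwards.
SELECT data_file_id, path, file_size_bytes
FROM ducklake_data_file
WHERE table_id = 42               -- placeholder for the table being compacted
  AND end_snapshot IS NULL        -- only currently active files
  AND file_size_bytes >= 10240    -- min_file_size, when set
  AND file_size_bytes <  102400;  -- max_file_size (or target_file_size by default)
```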

redox marked this pull request as ready for review December 19, 2025 08:04
redox (Contributor, Author) commented Dec 19, 2025

> Hi @redox, thanks for the PR!
>
> You should apply the size filters in the query issued by …

Ah yes, of course! Let me know what y'all think about it; happy to submit a docs PR to https://github.com/duckdb/ducklake-web right after.

pdet (Collaborator) left a comment

I just have a small comment; then I think it's good to go.

uint64_t max_files;
optional_idx min_file_size; // new in this PR: lower size bound for compaction candidates
optional_idx max_file_size; // new in this PR: upper size bound (defaults to target_file_size)
};
redox (Contributor, Author) replied:

@pdet I hesitated over `DuckLakeMergeOptions`; let me know if you have a POV.

pdet (Collaborator) replied:

I think this looks great, thanks!

pdet merged commit 502cad9 into duckdb:v1.4-andium on Dec 19, 2025
48 checks passed
pdet (Collaborator) commented Dec 19, 2025

Thanks for all the work! It looks excellent!

redox added a commit to altertable-ai/ducklake-web that referenced this pull request Dec 19, 2025
redox deleted the su/tiered-merge branch December 19, 2025 21:27