Compaction Maps for Apache Iceberg

This branch is a research prototype built on Apache Iceberg 1.10.1 (tag apache-iceberg-1.10.1). It adds compaction maps: a compact data structure that records the position transformations applied by a compaction so that concurrent transactions writing position deletes (or deletion vectors) can be rebased onto the new layout instead of restarted.

Compactions and concurrent updates logically commute — compaction does not change table contents — but in current table formats they conflict on direct file references. A compaction map captures, per run of rows, the move from a source file to one or more target files. Either side of a conflict can use the map to rewrite its position-delete references and commit, with no global coordination beyond Iceberg's existing snapshot pointer swap.
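To make the idea concrete, here is a minimal sketch of a compaction map and of rewriting a single position-delete reference through it. It is not the code on this branch: the class, field, and method names (CompactionMapSketch, Run, RowRef, remap) are hypothetical. The sketch assumes each run lands in exactly one target file and uses a linear scan for clarity.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: all names here are hypothetical, not the classes on this branch.
class CompactionMapSketch {

  // One run: `length` consecutive rows moved from a source file to a target file.
  static final class Run {
    final String sourceFile;
    final long sourceStart;
    final String targetFile;
    final long targetStart;
    final long length;

    Run(String sourceFile, long sourceStart, String targetFile, long targetStart, long length) {
      this.sourceFile = sourceFile;
      this.sourceStart = sourceStart;
      this.targetFile = targetFile;
      this.targetStart = targetStart;
      this.length = length;
    }

    // True if (file, pos) falls inside this run's source range.
    boolean covers(String file, long pos) {
      return sourceFile.equals(file) && pos >= sourceStart && pos < sourceStart + length;
    }
  }

  // A (file, position) reference, as carried by a position delete.
  static final class RowRef {
    final String file;
    final long pos;

    RowRef(String file, long pos) {
      this.file = file;
      this.pos = pos;
    }
  }

  private final List<Run> runs = new ArrayList<>();

  void addRun(Run run) {
    runs.add(run);
  }

  // Rewrites a reference into the compacted layout.
  // References to files the compaction did not touch are returned unchanged.
  RowRef remap(RowRef ref) {
    for (Run run : runs) { // linear scan for clarity, not speed
      if (run.covers(ref.file, ref.pos)) {
        return new RowRef(run.targetFile, run.targetStart + (ref.pos - run.sourceStart));
      }
    }
    return ref;
  }

  public static void main(String[] args) {
    CompactionMapSketch map = new CompactionMapSketch();
    // a.parquet (100 rows) and b.parquet (50 rows) are concatenated into c.parquet.
    map.addRun(new Run("a.parquet", 0, "c.parquet", 0, 100));
    map.addRun(new Run("b.parquet", 0, "c.parquet", 100, 50));

    RowRef moved = map.remap(new RowRef("b.parquet", 7));
    System.out.println(moved.file + "@" + moved.pos); // prints c.parquet@107
  }
}
```

A delete that a concurrent writer recorded against row 7 of b.parquet is rewritten to row 107 of c.parquet, so the writer can commit against the compacted layout instead of restarting.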

Paper

Chris Douglas and Joseph M. Hellerstein. Commutative Compaction. 1st International Workshop on Data FORMATS for Modern Architectures and Workloads (FORMATS '26), May 31–June 5, 2026, Bengaluru, India. ACM. https://doi.org/10.1145/3802514.3809174

The paper introduces compaction maps, the rebase operation, and the remapping policy, and evaluates an Apache Iceberg prototype (this branch) on Apache Spark 3.5 and 4.0 with both v2 position delete files and v3 deletion vectors.

Benchmark Results

End-to-end remapping was measured against object storage in three clouds (AWS S3 us-west-2, Azure ADLSv2 westus2, GCS us-west1) on commodity VMs (4 vCPU, 16 GiB), varying the number of runs in the compaction map (10 to 10K) and the size of the position delete file or deletion vector (1K to 1M deletes):

Repairing a 1M-delete commit against a 10K-run compaction map, including all I/O, takes 0.34–0.45 s for deletion vectors and 1.8–2.3 s for position delete files in every cloud.
At 10K deletes (typical commit size) latency never exceeds half a second in any cloud, regardless of run count.
Deletion vectors outperform position delete files across the board; Parquet encoding dominates the position-delete cost, while the deletion-vector roaring-bitmap region is only a few KiB.
The compaction map itself is small: a 10-run map occupies 2.6 KiB and a 10K-run map occupies 8.9 KiB on disk.

Cost is negligible relative to the compaction it commutes with — compactions typically run for minutes to hours.

See docs/docs/compaction_maps_bench.md for the full benchmark methodology and the JMH microbenchmark suite that drives the remapping-algorithm policy.

Status

Implementation is on the cmpmap branch:

Core data structure, builder, Avro storage, and manifest-list reference.
Conflict detection and SERIALIZABLE-isolation integration in BaseRowDelta / MergingSnapshotProducer.
Automatic remapping in RewriteDataFilesCommitManager for Spark 3.5 and 4.0, for both position delete files (v2) and deletion vectors (v3).
Empirically-tuned remapping policy (IntervalTree / RangeQuery / StreamJoin) selected at runtime from the shape of the inputs.
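The sketch below illustrates why the shape of the inputs matters when choosing a remapping strategy. It is not the policy implemented on this branch: the class and method names are hypothetical, the two strategies shown (a binary search per delete and a single merge pass over sorted inputs) are only one plausible reading of what a range lookup and a stream join could mean, and the cost rule in preferMerge is an invented placeholder rather than the empirically tuned thresholds. See compaction_maps_impl_pseudocode.md for the actual strategies.

```java
import java.util.Arrays;

// Illustrative sketch only: hypothetical names and cost rule, not the policy on this branch.
class RemapPolicySketch {

  // Runs for one source file, sorted by start: run i covers
  // [starts[i], starts[i] + lengths[i]) and lands at targetStarts[i] in the target file.

  // Binary search per delete: roughly d * log2(r) steps for d deletes and r runs.
  static long[] remapBySearch(long[] starts, long[] lengths, long[] targetStarts, long[] deletes) {
    long[] out = new long[deletes.length];
    for (int i = 0; i < deletes.length; i++) {
      int idx = Arrays.binarySearch(starts, deletes[i]);
      int run = idx >= 0 ? idx : -idx - 2; // last run starting at or before the position
      out[i] = translate(starts, lengths, targetStarts, run, deletes[i]);
    }
    return out;
  }

  // One merge pass over sorted deletes and sorted runs: roughly d + r steps.
  static long[] remapByMerge(long[] starts, long[] lengths, long[] targetStarts, long[] deletes) {
    long[] out = new long[deletes.length];
    int run = 0;
    for (int i = 0; i < deletes.length; i++) {
      // Advance to the last run that starts at or before this delete position.
      while (run + 1 < starts.length && starts[run + 1] <= deletes[i]) {
        run++;
      }
      out[i] = translate(starts, lengths, targetStarts, run, deletes[i]);
    }
    return out;
  }

  // Shifts a position by the offset of its run; rejects positions no run covers.
  static long translate(long[] starts, long[] lengths, long[] targetStarts, int run, long pos) {
    if (run < 0 || run >= starts.length || pos < starts[run] || pos >= starts[run] + lengths[run]) {
      throw new IllegalArgumentException("position " + pos + " is not covered by any run");
    }
    return targetStarts[run] + (pos - starts[run]);
  }

  // Hypothetical cost rule: prefer the merge pass when d + r <= d * log2(r).
  static boolean preferMerge(int deleteCount, int runCount) {
    double searchCost = deleteCount * (Math.log(Math.max(runCount, 2)) / Math.log(2));
    return deleteCount + runCount <= searchCost;
  }

  public static void main(String[] args) {
    long[] starts = {0, 100, 250};
    long[] lengths = {100, 150, 50};
    long[] targetStarts = {200, 0, 150};
    long[] deletes = {5, 100, 260}; // sorted, as the merge pass requires

    long[] remapped = preferMerge(deletes.length, starts.length)
        ? remapByMerge(starts, lengths, targetStarts, deletes)
        : remapBySearch(starts, lengths, targetStarts, deletes);
    System.out.println(Arrays.toString(remapped)); // prints [205, 0, 160]
  }
}
```

Under this placeholder rule, a large delete set against many runs favors the single merge pass, while a handful of deletes against many runs favors per-delete lookups. Both paths return the same remapped positions.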

Documentation

Detailed documentation lives in docs/docs/:

compaction_maps.md — user guide, table properties, conflict-resolution workflow.
compaction_maps_impl.md — implementation walkthrough, schema, integration points.
compaction_maps_impl_pseudocode.md — pseudocode for the remapping strategies.
compaction_maps_bench.md — benchmark suites, methodology, and how to reproduce.
compaction_maps_errata.md — design scope and known limitations (e.g. sort/z-order rewrites are out of scope).

Building

This branch builds with the standard Iceberg toolchain (Gradle, Java 11/17/21):

./gradlew :iceberg-core:compileJava
./gradlew :iceberg-core:test --tests "*CompactionMap*"
./gradlew :iceberg-core:test --tests "*Remapping*"
./gradlew spotlessApply

For general Iceberg build, engine-compatibility, and contribution information, see the upstream project at https://iceberg.apache.org.
