2 releases
| 0.1.1 | Nov 5, 2025 |
|---|---|
| 0.1.0 | Nov 5, 2025 |
#527 in Biology
205 downloads per month
Used in seqwish
31KB
503 lines
iitree-rs
Implicit Augmented Interval Tree in Rust with Disk-Backed Storage
A Rust implementation of the IAITree/cgranges algorithm with memory-mapped disk storage for handling large interval datasets.
Features
- Disk-backed storage: ALL data stored on disk via memory mapping (matches C++ exactly)
- Optimal complexity: O(n) space on disk, O(log n + k) queries, O(n log n) build
- Memory efficient: 32 bytes per interval on disk (for u64/u64 types)
- Zero-copy queries: Direct queries on memory-mapped files
- Parallel sorting: Uses rayon for fast in-place sorting on disk
- Concurrent writer: Lock-free queue with background writer thread
- Generic: Works with any
Ord + Copytypes - Battle-tested: Ported from production C++ code (mmmulti::iitree)
- Comprehensive tests: Includes cgranges validation suite
Algorithm
Intervals are stored in a sorted flat array and interpreted as an implicit complete binary tree. Each node is augmented with the maximum end position in its subtree, enabling efficient pruning during queries.
Based on the cgranges algorithm by Heng Li.
Complexity
| Operation | Time | Space |
|---|---|---|
| Build (index) | O(n log n) | O(n) |
| Query (overlap) | O(log n + k) | - |
| Memory per interval | - | 32 bytes* |
*For Interval<u64, u64>. Scales with type sizes.
Usage
Add to your Cargo.toml:
[dependencies]
iitree-rs = "0.1"
Basic Example
use iitree_rs::{IITree, Interval};
fn main() -> std::io::Result<()> {
// Create tree backed by a file on disk
let mut tree = IITree::new("my_intervals.iitree")?;
// Start writer thread
tree.open_writer()?;
// Add intervals [start, end) with associated data
tree.add(10, 20, 1)?;
tree.add(15, 25, 2)?;
tree.add(30, 40, 3)?;
// Build the index (sorts and augments on disk)
tree.index()?;
// Query overlaps with [18, 22)
tree.overlap(18, 22, |idx, start, end, data| {
println!("Interval {}: [{}, {}) data={}", idx, start, end, data);
})?;
// Output:
// Interval 0: [10, 20) data=1
// Interval 1: [15, 25) data=2
Ok(())
}
With Custom Types
use iitree_rs::IITree;
// Use custom position and data types
let mut tree: IITree<u32, String> = IITree::new();
tree.add(100, 200, "gene1".to_string());
tree.add(150, 250, "gene2".to_string());
tree.index();
tree.overlap(175, 225, |_, start, end, data| {
println!("[{}, {}): {}", start, end, data);
});
Collecting Results
use iitree_rs::IITree;
let mut tree = IITree::new();
tree.add(10, 30, "A");
tree.add(20, 40, "B");
tree.add(25, 35, "C");
tree.index();
// Collect all overlapping intervals
let mut results = Vec::new();
tree.overlap(22, 28, |_, start, end, data| {
results.push((start, end, data));
});
assert_eq!(results.len(), 3);
assert_eq!(results, vec![(10, 30, "A"), (20, 40, "B"), (25, 35, "C")]);
Genomic Intervals Example
use iitree_rs::IITree;
// Store genomic features with their positions
let mut features = IITree::new();
features.add(1000, 2000, "exon1");
features.add(3000, 4000, "exon2");
features.add(5000, 6000, "exon3");
features.add(1500, 3500, "intron1");
features.index();
// Find features overlapping position 3500
features.overlap(3500, 3501, |_, start, end, name| {
println!("{}: [{}, {})", name, start, end);
});
// Output:
// intron1: [1500, 3500)
// exon2: [3000, 4000)
API Reference
Interval<S, T>
Represents an interval with start, end, and associated data.
pub struct Interval<S, T> {
pub st: S, // Start position (inclusive)
pub en: S, // End position (exclusive)
pub max: S, // Max end in subtree (computed during indexing)
pub data: T, // Associated data
}
IITree<S, T>
The interval tree container.
Methods
new() -> Self- Create an empty treeadd(start: S, end: S, data: T)- Add an intervalindex()- Sort and build the tree (required before queries)overlap<F>(st: S, en: S, func: F)- Find overlaps with[st, en)- Callback:
FnMut(index: usize, start: S, end: S, data: T)
- Callback:
len() -> usize- Number of intervalsis_empty() -> bool- Check if tree is emptyget(index: usize) -> Option<&Interval<S, T>>- Direct access to interval
Performance
Benchmarked on:
- 10K random intervals over 10K positions: ~0.05s (all queries)
- 1M intervals indexed: <1s (parallel sorting with rayon)
Performance characteristics match the C++ mmmulti::iitree implementation.
Testing
Run tests:
cargo test
Run comprehensive cgranges validation:
cargo test --release -- test_cgranges
The cgranges tests validate correctness by:
- Generating thousands of random intervals
- Querying every position in the range
- Verifying all returned intervals actually contain the query position
Current Status
✅ Fully Implemented (v0.1):
- Core IAITree algorithm (index_core, overlap)
- Memory-mapped disk storage (ALL data on disk)
- Thread-safe concurrent writer with lock-free queue
- Parallel sorting with rayon (in-place on disk)
- Comprehensive test suite (7/7 passing + cgranges validation)
- Generic over position and data types
- Matches C++ mmmulti::iitree behavior exactly
🚧 Future Enhancements:
- Serde serialization support
- Iterator interface
- Additional utility methods
Roadmap
- v0.1 - Core algorithm (current)
- v0.2 - Memory-mapped I/O for disk-backed storage
- v0.3 - Concurrent writer for parallel insertions
- v0.4 - Serde support and additional utilities
References
- Original algorithm: cgranges by Heng Li
- C++ implementation: mmmulti::iitree
- Used in: seqwish - pangenome graph inducer
License
MIT
Contributing
Contributions welcome! This crate is being developed as part of the seqwish Rust migration project.
Acknowledgments
- Heng Li for the cgranges algorithm
- Erik Garrison for the mmmulti C++ implementation
- The Rust bio/bioinformatics community
Dependencies
~1.5MB
~30K SLoC