Systematic integration of long- and short-read sequencing improves somatic structural variant detection
SomaSV is a hybrid method for high-accuracy somatic structural variant (SSV) detection from long-read sequencing data. By integrating tumor long-read data with matched normal data composed of both long-read and short-read sequencing, SomaSV achieves superior detection performance while significantly reducing sequencing costs.
SomaSV supports two operating modes:
- Hybrid mode: combines 30× tumor long-read, 10× normal long-read, and 30× normal short-read data for cost-effective, high-accuracy somatic SV calling.
- Long-read-only mode: uses tumor and matched normal long-read data without short-read input.

Key features:
- Hybrid sequencing design: uses inexpensive short-read data as matched normal to replace a substantial portion of long-read sequencing, making somatic SV detection more accessible and cost-effective.
- Multi-platform support: compatible with Oxford Nanopore Technologies (ONT) and PacBio HiFi long-read platforms.
- High accuracy: achieves state-of-the-art F1 scores across multiple benchmark datasets and SV types (DEL, DUP, INV, INS, BND).
- Clinical relevance: capable of identifying somatic SVs in cancer-associated genes with potential diagnostic and screening value.
git clone https://github.com/eioyuou/SomaSV.git
cd SomaSV
pip install -e .

Hybrid mode sequencing design:

| Data Type | Coverage | Cost Ratio |
|---|---|---|
| Tumor long-read | 30× | High |
| Normal long-read | 10× | Medium |
| Normal short-read | 30× | Low |
This hybrid design reduces overall sequencing cost by approximately 19% compared to a full long-read-only approach while maintaining or improving detection accuracy.
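
The cost comparison above can be illustrated with a small calculation. This is only a sketch: the relative per-coverage prices and the long-read-only baseline (30× tumor plus 30× normal long-read) are assumptions chosen for illustration and are not stated by SomaSV. Under these assumed values the saving comes out at roughly 18%, in the same range as the figure quoted above.

```python
def sequencing_cost(coverage_by_type, unit_cost_by_type):
    """Total cost = sum over data types of coverage (x) times cost per 1x."""
    return sum(cov * unit_cost_by_type[kind] for kind, cov in coverage_by_type.items())


# Hypothetical relative prices per 1x of coverage (assumptions, not from SomaSV).
UNIT_COST = {"long_read": 1.0, "short_read": 0.3}

# Hybrid design from the table: 30x tumor LR + 10x normal LR + 30x normal SR.
hybrid = {"long_read": 30 + 10, "short_read": 30}

# Assumed baseline (not from the source): 30x tumor LR + 30x normal LR.
long_read_only = {"long_read": 30 + 30, "short_read": 0}

hybrid_cost = sequencing_cost(hybrid, UNIT_COST)
baseline_cost = sequencing_cost(long_read_only, UNIT_COST)

# Fraction of the baseline cost saved by the hybrid design.
saving = 1 - hybrid_cost / baseline_cost
print(f"Relative saving under the assumed prices: {saving:.1%}")
```

Replacing the entries in `UNIT_COST` with actual quotes from a sequencing provider gives the saving for a specific project.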

Long-read-only mode sequencing design:

| Data Type | Coverage | Cost Ratio |
|---|---|---|
| Tumor long-read | — | High |
| Normal long-read | — | High |
This mode uses standard matched tumor–normal long-read sequencing without short-read data. Coverage can be adjusted based on project requirements and budget.
The commands below show the standard usage of SomaSV with user-provided input files. For a minimal runnable example using the bundled demo dataset, see the demo instructions further down.
| File | Description |
|---|---|
| Tumor long-read BAM | Long-read sequencing alignment of the tumor sample |
| Normal long-read BAM | Long-read sequencing alignment of the matched normal sample |
| Reference genome | Reference FASTA file (e.g., GRCh38) |
| PoN VCF | Panel of Normals VCF for germline SV filtering (e.g., gnomAD SV v4.1) |
| Normal short-read BAM | (Hybrid mode only) Short-read sequencing alignment of the matched normal sample |
python main.py \
--tumor-bam tumor_long_read.bam \
--normal-bam normal_long_read.bam \
--reference ref.fasta \
--output-dir results/ \
--sample-platform HIFI \
--mode hybrid \
--short-read-normal-bam normal_short_read.bam \
--short-read-coverage 30 \
--pon-vcf gnomad.v4.1.sv.sites.vcf.gz

python main.py \
--tumor-bam tumor_long_read.bam \
--normal-bam normal_long_read.bam \
--reference ref.fasta \
--output-dir results/ \
--sample-platform ONT \
--mode long-read-only

| Parameter | Required | Description |
|---|---|---|
| --tumor-bam | ✓ | Path to tumor long-read BAM file |
| --normal-bam | ✓ | Path to normal long-read BAM file |
| --reference | ✓ | Path to reference genome FASTA file |
| --output-dir | ✓ | Output directory for results |
| --sample-platform | ✓ | Sequencing platform: HIFI or ONT |
| --mode | ✓ | Running mode: hybrid or long-read-only |
| --pon-vcf | Hybrid | Path to Panel of Normals (PoN) VCF for germline filtering |
| --short-read-normal-bam | Hybrid | Path to normal short-read BAM file |
| --short-read-coverage | Hybrid | Coverage of normal short-read data (e.g., 30) |
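
When SomaSV is called from a script, the parameter rules in the table can be checked before launching the run. The helper below is a minimal sketch, not part of SomaSV: `build_somasv_command` is a hypothetical name, and the only flags it emits are the ones listed in the table. It assumes that the three "Hybrid" parameters are mandatory in hybrid mode and omitted otherwise.

```python
import shlex


def build_somasv_command(tumor_bam, normal_bam, reference, output_dir,
                         sample_platform, mode, pon_vcf=None,
                         short_read_normal_bam=None, short_read_coverage=None):
    """Return the SomaSV command as an argument list (hypothetical helper)."""
    if sample_platform not in ("HIFI", "ONT"):
        raise ValueError("sample_platform must be HIFI or ONT")
    if mode not in ("hybrid", "long-read-only"):
        raise ValueError("mode must be hybrid or long-read-only")

    cmd = [
        "python", "main.py",
        "--tumor-bam", tumor_bam,
        "--normal-bam", normal_bam,
        "--reference", reference,
        "--output-dir", output_dir,
        "--sample-platform", sample_platform,
        "--mode", mode,
    ]

    if mode == "hybrid":
        # Assumption: all three hybrid parameters must be given in hybrid mode.
        hybrid_args = {
            "--short-read-normal-bam": short_read_normal_bam,
            "--short-read-coverage": short_read_coverage,
            "--pon-vcf": pon_vcf,
        }
        missing = [flag for flag, value in hybrid_args.items() if value is None]
        if missing:
            raise ValueError("hybrid mode requires: " + ", ".join(missing))
        for flag, value in hybrid_args.items():
            cmd += [flag, str(value)]
    return cmd


# Build (but do not run) the hybrid command from the example above.
example = build_somasv_command(
    "tumor_long_read.bam", "normal_long_read.bam", "ref.fasta", "results/",
    "HIFI", "hybrid",
    pon_vcf="gnomad.v4.1.sv.sites.vcf.gz",
    short_read_normal_bam="normal_short_read.bam",
    short_read_coverage=30,
)
print(shlex.join(example))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to start the run.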
The main output is a standard VCF file written to the directory given by --output-dir. With the example commands above, it is located at:
results/final_somatic_variants.vcf
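
A quick way to inspect the result is to count calls per SV type. The sketch below uses only the Python standard library and assumes that each record carries an `SVTYPE=` entry in its INFO column, as is conventional for structural variant VCFs; whether SomaSV writes this field has not been verified here. `count_sv_types` is a hypothetical helper, not part of SomaSV.

```python
import gzip
from collections import Counter


def count_sv_types(vcf_path):
    """Count VCF records per SVTYPE value found in the INFO column."""
    # Support both plain-text and gzip-compressed VCF files.
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8:
                continue  # not a complete VCF record
            svtype = "UNKNOWN"
            for entry in fields[7].split(";"):
                if entry.startswith("SVTYPE="):
                    svtype = entry.split("=", 1)[1]
                    break
            counts[svtype] += 1
    return counts
```

For example, `count_sv_types("results/final_somatic_variants.vcf")` returns a `Counter` keyed by SV type such as DEL, DUP, INV, INS, or BND.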
A lightweight demo dataset is provided in data/demo/ for quick testing and validation of SomaSV.
These demo files can be used directly with the example commands below to verify installation and basic pipeline execution.
The demo files are small subset BAMs derived from chromosome 20 of sample HG008, as indicated by the chr20 tag in the file names, and are intended for demonstration purposes only. They do not represent complete sequencing datasets.
Example demo files:
- data/demo/HG008_30X_HiFi_chr20_tumor_demo_subset.bam
- data/demo/HG008_10X_HiFi_chr20_normal_demo_subset.bam
- data/demo/HG008_30X_illumina_chr20_normal_demo_subset.bam

python main.py \
--tumor-bam data/demo/HG008_30X_HiFi_chr20_tumor_demo_subset.bam \
--normal-bam data/demo/HG008_10X_HiFi_chr20_normal_demo_subset.bam \
--reference ref.fasta \
--output-dir demo_results/ \
--sample-platform HIFI \
--mode hybrid \
--short-read-normal-bam data/demo/HG008_30X_illumina_chr20_normal_demo_subset.bam \
--short-read-coverage 30 \
--pon-vcf gnomad.v4.1.sv.sites.vcf.gz

python main.py \
--tumor-bam data/demo/HG008_30X_HiFi_chr20_tumor_demo_subset.bam \
--normal-bam data/demo/HG008_10X_HiFi_chr20_normal_demo_subset.bam \
--reference ref.fasta \
--output-dir demo_results_long_read_only/ \
--sample-platform HIFI \
--mode long-read-only

The main output VCF is written to the directory given by --output-dir. For the hybrid demo command above, this is:
demo_results/final_somatic_variants.vcf
- The demo BAM files contain only a small subset region from chromosome 20.
- These files are intended for demonstration purposes only and do not represent complete datasets.
- For real analyses, please use complete datasets and an appropriate reference genome and PoN resource.
If you use SomaSV in your research, please cite:
Systematic integration of long- and short-read sequencing improves somatic structural variant detection
Manuscript in preparation
This project is licensed under the MIT License.
For questions, bug reports, or feature requests, please open an issue on GitHub.