EchoSV is a versatile tool for comparing and merging structural variant (SV) call sets that were generated using different reference genomes. It studies how SVs "echo" across these references through a hybrid workflow that combines lift-over and graph-based matching.
Given two or more SV call sets from the same sample—each aligned to a different reference—EchoSV can perform two primary operations:
- Merge: Consolidates multiple SV call sets into a single, unified output. For example, it can merge two DSA haplotype–based call sets into one consolidated file.
- Compare: Generates a detailed comparison identifying overlapping variants and those exclusive to a specific reference, such as when analyzing calls across GRCh38, CHM13, and DSA.
EchoSV depends on the following Python packages:
- pysam: Read and write BAM/CRAM files and VCF records for variant processing.
- intervaltree: Efficiently store and query genomic intervals to detect overlapping SVs.
- Biopython (Bio): Parse and manipulate sequence data during liftover steps.
- scipy: Perform statistical analysis and numerical computations on variant metrics.
- networkx: Construct and traverse graphs that model structural variant matches.
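As a quick sanity check before installing, the import names of the packages above can be probed from Python. This is a minimal sketch and not part of EchoSV; the helper name `missing_modules` is made up for illustration, and the only assumption is that the packages use their usual import names (Biopython is imported as `Bio`).

```python
from importlib.util import find_spec

# Import names of the packages listed above (Biopython is imported as "Bio").
REQUIRED_MODULES = ["pysam", "intervaltree", "Bio", "scipy", "networkx"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```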
Option 1: From GitHub (recommended)
# Clone the repo and install
git clone git@github.com:parklab/EchoSV.git
cd EchoSV
pip install .
Option 2: Via PyPI
pip install echosv
The EchoSV workflow consists of three main steps: chain, genotype, and match. Below are detailed instructions and examples using the test data under test_data/input_data.
wget http://genomebrowser-uploads.hms.harvard.edu/data/yuz006/test_data.tar.gz
tar -xzvf test_data.tar.gz
The chain command generates liftover chain files for mapping SV coordinates across reference assemblies.
Before running it, please create a ref2-to-ref1 alignment using minimap2's asm-to-asm mapping:
minimap2 -a -x asm5 --cs <ref1.fa> <ref2.fa> | samtools view -hSb - | samtools sort -O BAM -o ref2_to_ref1.bam
# Generate chain file between two references
echosv chain -b test_data/input_data/chm13_to_grch38.bam -f test_data/input_data/chm13.fa -o test_data/chm13_to_grch38.chain.gz
Parameters:
- -b: Path to the ref2-to-ref1 alignment (BAM format)
- -f: Path to the ref2 reference genome (FASTA format)
- -o: Output chain file for coordinate mapping (and BED file for alignment coverage)
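To illustrate what a chain file encodes, the sketch below parses the standard UCSC chain format and lifts a single position from the target assembly to the query assembly. It is a simplified illustration and not EchoSV's implementation: the function names `parse_chain` and `lift_position` and the toy chain record are hypothetical, only plus-strand chains are handled, and positions that fall in an alignment gap are reported as unmappable.

```python
# A toy chain record in UCSC chain format (made up for this example):
# header: chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
# body:   "size dt dq" per aligned block, with the final line holding only "size".
EXAMPLE_CHAIN = """\
chain 1000 chrA 1000 + 100 400 chrB 1200 + 150 460 1
100 20 30
180
"""


def parse_chain(text):
    """Parse chain text into a list of (header, blocks) pairs.

    Each block is (target_start, query_start, size) in 0-based coordinates.
    """
    chains = []
    blocks, t_pos, q_pos = [], 0, 0
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "chain":
            header = {
                "t_name": fields[2], "t_strand": fields[4], "t_start": int(fields[5]),
                "q_name": fields[7], "q_strand": fields[9], "q_start": int(fields[10]),
            }
            blocks = []
            chains.append((header, blocks))
            t_pos, q_pos = header["t_start"], header["q_start"]
        else:
            size = int(fields[0])
            blocks.append((t_pos, q_pos, size))
            t_pos += size
            q_pos += size
            if len(fields) == 3:
                # Skip the unaligned gap on the target (dt) and query (dq) side.
                t_pos += int(fields[1])
                q_pos += int(fields[2])
    return chains


def lift_position(chains, chrom, pos):
    """Map a 0-based target position to (query_name, query_pos), or None."""
    for header, blocks in chains:
        # This sketch only handles plus-strand query alignments.
        if header["t_name"] != chrom or header["q_strand"] != "+":
            continue
        for t_start, q_start, size in blocks:
            if t_start <= pos < t_start + size:
                return header["q_name"], q_start + (pos - t_start)
    return None


chains = parse_chain(EXAMPLE_CHAIN)
print(lift_position(chains, "chrA", 150))  # ('chrB', 200): inside the first block
print(lift_position(chains, "chrA", 210))  # None: falls in the gap between blocks
print(lift_position(chains, "chrA", 230))  # ('chrB', 290): inside the second block
```

An SV breakpoint that lands in a gap has no direct counterpart in the other assembly under this scheme, which is one reason lift-over alone may not be enough to match every call.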
The genotype command collects supporting reads for each SV from the corresponding BAM files and prepares them for graph-based matching.
# Genotype SVs from the first call set
echosv genotype --longread -i test_data/input_data/grch38_colo829_somatic_svs.vcf.gz -b BAM [BAMs...] -o test_data/grch38_colo829_genotyped.vcf.gz
Parameters:
- --longread: If given, genotype SVs from long-read alignments
- --shortread: If given, genotype SVs from short-read alignments
- -i: Input SV VCF file
- -b: BAM file(s)
- -o: Output VCF with genotyping information
The match command compares SV call sets and either merges them or identifies overlapping and reference-unique variants.
# Compare and merge two SV call sets
echosv match -i test_data/test_colo829_config.json --merge
# Or compare to find unique and shared variants
echosv match -i test_data/test_colo829_config.json
Parameters:
- -i: Input config file; see ./test_data/test_colo829_config.json for an example.
- --multiplat: If given, use multi-platform genotyping info.
- --merge: If given, merge concordant SVs across references and derive a single VCF.
- --min_echo_score: Minimum echo score to consider an SV for matching (default: 0.5).
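The idea of graph-based matching can be sketched as follows: each SV call is a node, an edge links two calls judged to be the same event, and each connected component is one matched group. This is a conceptual illustration only, written with the standard library; the function names, the `(reference, sv_id)` node labels, and the toy edges are assumptions, and it does not reproduce how EchoSV scores or links variants.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into connected components using an iterative depth-first search."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, component = [node], set()
        seen.add(node)
        while stack:
            current = stack.pop()
            component.add(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def classify_components(components, n_refs):
    """Split components into those seen on every reference and all others."""
    shared, unique = [], []
    for component in components:
        references = {reference for reference, _ in component}
        (shared if len(references) == n_refs else unique).append(component)
    return shared, unique


# Hypothetical calls, labelled as (reference, sv_id).
nodes = [("grch38", "sv1"), ("chm13", "svA"), ("grch38", "sv2")]
# Hypothetical match: sv1 on GRCh38 and svA on CHM13 describe the same event.
edges = [(("grch38", "sv1"), ("chm13", "svA"))]

shared, unique = classify_components(connected_components(nodes, edges), n_refs=2)
print(len(shared), len(unique))  # 1 1
```

In this picture, a merge would emit one record per component, while a comparison would report which components are shared and which are exclusive to one reference.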
This project is licensed under the MIT License - see the LICENSE.md file for details.
Feel free to open an issue on GitHub or contact Yuwei Zhang (yuwei_zhang@hms.harvard.edu) if you have any problems using EchoSV.