EchoSV

EchoSV is a versatile tool for comparing and merging structural variant (SV) call sets generated using different reference genomes. It studies how SVs "echo" across these references through a hybrid workflow that combines liftover and graph-based matching.

Given two or more SV call sets from the same sample—each aligned to a different reference—EchoSV can perform two primary operations:

Compare: Generates a detailed comparison identifying overlapping variants and those exclusive to a specific reference, e.g., calls across GRCh38, CHM13, and a donor-specific assembly (DSA).
Merge: Consolidates multiple SV call sets into a single, unified output, e.g., merging two DSA haplotype-based call sets into one consolidated file.

Requirements

EchoSV depends on the following Python packages:

pysam: Read and write BAM/CRAM files and VCF records for variant processing.
intervaltree: Efficiently store and query genomic intervals to detect overlapping SVs.
Biopython (Bio): Parse and manipulate sequence data during liftover steps.
scipy: Perform statistical analysis and numerical computations on variant metrics.
networkx: Construct and traverse graphs that model structural variant matches.
pandas: Tabular data manipulation for comparison outputs.
numpy: Numerical computations on variant metrics.
rich: Formatted terminal output and progress display.

Installation

Option 1: From GitHub (recommended)

git clone git@github.com:parklab/EchoSV.git
cd EchoSV
pip install -r requirements.txt
pip install .

Option 2: Via PyPI

pip install echosv

Usage

The EchoSV workflow consists of four main steps: chain, merge (optional), genotype, and match. Below are detailed instructions and examples using the test data (can be downloaded from Zenodo).

Step 0: Download and uncompress test data

Download the EchoSV test data echosv_test_data.tar.gz from Zenodo and decompress it:

tar -xzvf echosv_test_data.tar.gz

Step 1: Generate chains

The chain command generates a liftover chain file that maps coordinates from ref2 (the source assembly) to ref1 (the target reference). Before running chain, align ref2 against ref1 using minimap2's asm-to-asm mode and index the output:

minimap2 -a -x asm5 --cs ref1.fa ref2.fa \
    | samtools view -hSb - \
    | samtools sort -O BAM -o ref2_to_ref1.bam
samtools index ref2_to_ref1.bam

Then generate the chain file. EchoSV looks for a pre-built index automatically to parse the contig lengths; if none exist, the FASTA is parsed directly (slower for large assemblies). You can generate an index with samtools faidx ref2.fa or samtools dict ref2.fa > ref2.fa.dict.

echosv chain \
    -b test_data/input_data/chm13_to_grch38.bam \
    -f test_data/input_data/chm13.fa \
    -o test_data/chm13_to_grch38.chain.gz

Parameters

-b: Path to the ref2-to-ref1 alignment (BAM format, must be indexed)
-f: Path to the ref2 reference FASTA
-o: Output chain file for coordinate mapping (a coverage BED file is also written alongside)

Step 2: Merge SV call sets from the same reference (optional)

Merge multi-caller VCFs from the same reference into one call set before genotyping.

The merge command merges multiple SV call sets that were called against the same reference genome (e.g., outputs from multiple callers). This step is typically run before genotype and match so that each reference has a single unified call set for cross-reference comparison. Scripts to reproduce the analysis from our paper are available in scripts/.

# Merge multiple VCFs from the same reference into a single call set
echosv merge \
    -i grch38_colo829_caller1.vcf.gz grch38_colo829_caller2.vcf.gz [...] \
    -o grch38_colo829_svs.vcf.gz \
    --merge --new

# Extract high-confidence SVs (≥4 supporting callers, ≥2 platforms)
echosv merge \
    -i grch38_colo829_svs.vcf.gz \
    -o grch38_colo829_svs_highconf.vcf.gz \
    --extract

Pre-built gap BED files for the references used in this study are provided in the src/echosv/beds/ directory; a new gap BED can be passed by using --gapbed.

Parameters:

-i: Input VCF file(s) — space-separated list for --merge, single file for --extract
-o: Output file path
-a / --atol: Positional tolerance in bp for matching breakpoints (default: 500)
-s / --sizetol: Minimum size-similarity ratio for matching SVs (default: 0.5)
-c / --checksvtype: Require matching SV types when merging
--merge: Write a merged VCF from the comparison result
--new: Build merged VCF records from scratch (use with --merge)
--extract: Extract high-confidence SVs (≥4 supporting callers and ≥2 platforms)
--gaps-bed: BED file of reference gap / N regions; SVs near gaps are excluded when using --extract

Step 3: Collect supporting reads

The genotype command collects supporting reads for each SV from BAM files and annotates the VCF with allele-frequency and read-name fields used by the graph-based matching in Step 4.

echosv genotype --longread \
    -i test_data/input_data/grch38_colo829_somatic_svs.vcf.gz \
    -b test_data/input_data/chm13_to_grch38.bam \
    -o test_data/grch38_colo829_genotyped.vcf.gz

Parameters

--longread: Collect supporting reads from long-read alignments
--shortread: Collect supporting reads from short-read alignments
-i: Input SV VCF file
-b: BAM file(s) — multiple BAMs can be provided space-separated
-o: Output VCF with annotated supporting-read information

Step 4: Match SVs across references

The match command compares SV call sets across different reference genomes using a two-step hybrid approach: liftover-based coordinate matching followed by graph-based matching on shared supporting reads (echo score).

# Compare SV call sets and report concordant / reference-exclusive variants
echosv match -i test_data/test_colo829_config.json

# Compare SV call sets between DSA haplotypes and also produce a merged DSA-based VCF
echosv match -i dsa_merge_colo829_config.json --merge

The input is a JSON config file specifying reference labels, genotyped VCFs, chain files, and the output path. See test_data/test_colo829_config.json below for a working example.

Example JSON

{
    "refs":   { "1": "grch38", "2": "chm13", "3": "dsa" },
    "vcfs":   { "1": "./test_data/grch38_colo829_genotyped.vcf.gz",
                "2": "./test_data/chm13_colo829_genotyped.vcf.gz",
                "3": "./test_data/dsa_colo829_genotyped.vcf.gz" },
    "chains": { "2_to_1": "./test_data/chm13_to_grch38.chain.gz",
                "3_to_1": "./test_data/colo829bl_hap*_grch38.chain.gz" },
    "output": "./test_data/colo829_svs_comparison.txt"
}

Parameters

-i: Input config JSON file
--merge: Merge concordant SVs across references and write a unified VCF
--multiplat: Use multi-platform genotyping information during matching
-m / --min_echo_score: Minimum echo score to consider two SVs a match (default: 0.5)

License

This project is licensed under the MIT License — see the LICENSE file for details.

Contact

Feel free to open an issue on GitHub or contact Yuwei Zhang (yuwei_zhang@hms.harvard.edu) if you have any questions about EchoSV.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
scripts		scripts
src/echosv		src/echosv
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
echosv_workflow.jpg		echosv_workflow.jpg
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EchoSV

Table of Contents

Requirements

Installation

Usage

Step 0: Download and uncompress test data

Step 1: Generate chains

Step 2: Merge SV call sets from the same reference (optional)

Step 3: Collect supporting reads

Step 4: Match SVs across references

License

Contact

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

EchoSV

Table of Contents

Requirements

Installation

Usage

Step 0: Download and uncompress test data

Step 1: Generate chains

Step 2: Merge SV call sets from the same reference (optional)

Step 3: Collect supporting reads

Step 4: Match SVs across references

License

Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages