Skip to content

maiziezhoulab/CNVeil

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

104 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CNVeil

Table of content

CNVeil

Install CNVeil

You can install CNVeil through github

git clone git@github.com:maiziezhoulab/CNVeil.git

To set up the environment, you should have conda installed on your machine. Then simply run

# create conda env CNVeil_py3
conda env create -f CNVeil/env_yamls/environment_py3.yml 
# create conda env CNVeil_py2
conda env create -f CNVeil/env_yamls/environment_py2.yml
# create conda env CNVeil_R451
conda env create -f CNVeil/env_yamls/environment_R451.yml

You will have 3 virtual environments installed (CNVeil_py3, CNVeil_py2 and CNVeil_R451). And they will be used in diferent steps. You will need to load the corresponding virtual environment before you run each step in CNVeil pipeline.

Run CNVeil

Step 1 Extract read depth matrix

This step bins reads from the input BAM files, performs GC correction, and reformats the corrected bin-level read depth into a matrix for downstream CNVeil analysis.

Script:

bin/CNVeil_extract_read_depth.py

Required conda environment:

conda activate CNVeil_R451

Arguments

Argument Short Required Description
--merged_bam_file -mgbam Yes Path to the merged BAM file containing reads from all cells. This is used for binning and GC correction.
--split_bam_dir -spbam Yes Directory containing split BAM files for individual cells.
--work_dir -w Yes Working directory for CNVeil output. The script will create a Total_CN/ subdirectory inside the working folder.
--reftype -rt Yes Reference genome version. Must be either hg19 or hg38.
--num_threads -t No Number of threads to use. Default: 20.
--cell_node -cn No Optional cell-node mapping file. This is mainly designed for simulation data generated from SimscTree. If provided, cell IDs will be transformed to node IDs in the final matrix. For real data, this is not needed.

Example usage

For real data:

python bin/CNVeil_extract_read_depth.py \
  --merged_bam_file /path/to/merged.bam \
  --split_bam_dir /path/to/split_bams \
  --work_dir /path/to/work_dir \
  --reftype hg38 \
  --num_threads 20

For simulation data with a cell-node mapping file:

python bin/CNVeil_extract_read_depth.py \
  --merged_bam_file /path/to/merged.bam \
  --split_bam_dir /path/to/split_bams \
  --work_dir /path/to/work_dir \
  --reftype hg38 \
  --num_threads 20 \
  --cell_node /path/to/cell_node.tsv

Output files

The script creates the following output directory:

<work_dir>/Total_CN/

Main output:

<work_dir>/Total_CN/RDmatrix.csv

This is the final read-depth matrix used by downstream CNVeil steps.

Step 2 Infer Total Copy Number

This step takes the read-depth matrix generated from Step 1, and outputs a total copy-number matrix for downstream CNVeil analysis.

Script:

bin/CNVeil_total_CN.py

Required conda environment:

conda activate CNVeil_py3

This script expects the Step 1 output directory structure:

<work_dir>/Total_CN/

Required input file:

<work_dir>/Total_CN/RDmatrix.csv
Arguments
Argument Short Required Description
--work_dir -w Yes Working directory for CNVeil. The script reads input from <work_dir>/Total_CN/RDmatrix.csv and writes outputs to <work_dir>/Total_CN/.
--reftype -rt Yes Reference genome version. Must be either hg19 or hg38.
--seq_type -st Yes Sequencing type. Must be either paired-end or single-end. This argument is currently parsed but not directly used in the script.
--data_type -dtype Yes Data type. Must be either 10x or other. This affects QC and low-coverage bin filtering behavior.
--sample -sp No Sample name used in output file names. Default: sample.
--num_threads -t No Number of threads. Default: 20. This argument is currently parsed but not directly used in the main CN inference steps.
--cell_node -cn No Optional cell-node mapping file. Mainly designed for simulation data. If provided, cell names are transformed to node labels in the final CNV output.
Example usage

For real 10x data:

python bin/CNVeil_total_CN.py \
  --work_dir /path/to/work_dir \
  --reftype hg38 \
  --seq_type <paired-end or single-end> \
  --data_type 10x \
  --sample SAMPLE_NAME \
  --num_threads 20

For other single-cell sequencing data (non-10x):

python bin/CNVeil_total_CN.py \
  --work_dir /path/to/work_dir \
  --reftype hg38 \
  --seq_type <paired-end or single-end> \
  --data_type other \
  --sample SAMPLE_NAME

For simulation data with a cell-node mapping file:

python bin/CNVeil_total_CN.py \
  --work_dir /path/to/work_dir \
  --reftype hg38 \
  --seq_type <paired-end or single-end> \
  --data_type 10x \
  --sample SAMPLE_NAME \
  --cell_node /path/to/cell_node.tsv
Main output files

All outputs are written under:

<work_dir>/Total_CN/

Main total copy-number matrix:

<work_dir>/Total_CN/CNV_<sample>.csv

This file has cells as rows and genomic bins as columns.

Transposed copy-number matrix:

<work_dir>/Total_CN/CNV_<sample>.csv.T

This file has genomic bins as rows and cells as columns.

Step 3: Infer allele-specific copy number

This step uses the total copy-number results from Step 2 together with SNP allele-count information to infer allele-specific copy number (ASCN). You need to first extract per cell allele count before running the main script.

To get allele-count information, follow the instruction here.

After that, you will get var_all.rds in the output folder and you need to process it using bin/load_snp.R.

Activate the R environment before running:

conda activate CNVeil_R451

Run:

Rscript load_snp.R <rds_dir> <work_dir> <sample> <num_threads>

Example:

Rscript load_snp.R \
  /path/to/rds_dir \
  /path/to/work_dir \
  TN1 \
  8

Required input files in <rds_dir>:

var_all.rds
ref_all.rds
alt_all.rds

Required total CN file:

<work_dir>/Total_CN/CNV_<sample>.csv

This step outputs files to:

<work_dir>/Allele_CN/allele_count_by_cell/

Main outputs:

<sample>_baf.tsv
<sample>_baf_chr1.tsv
...
<sample>_baf_chr22.tsv
<sample>_pos_baf_chr1.tsv.gz
...
<sample>_pos_baf_chr22.tsv.gz

The *_baf_chr*.tsv files contain cell-level allele counts:

chrom    pos    cell    ref    alt

The *_pos_baf_chr*.tsv.gz files contain position-level BAF features:

chrom    pos    ref_sum    alt_sum    n_cells    n_obs    depth    baf    mbaf

Normal cells are inferred from average total CN:

|avg_CN - 2| <= 0.1

Normal cells are excluded when computing position-level BAF files.

Afterwards, you may run bin/CNVeil_Allele_CN.py to generate allele-specifc CN.

Script:

bin/CNVeil_allele_CN.py

Required conda environment:

conda activate CNVeil_py3
Required input from previous steps

This step expects the CNVeil working directory to already contain the total CN result from Step 2:

<work_dir>/Total_CN/CNV_<sample>.csv

It also requires a directory containing SNP/RDS-derived input files:

<var_dir>/

The SNP loader expects this directory to contain:

var_all.rds
ref_all.rds
alt_all.rds

Arguments

Argument Short Required Description
--work_dir Yes Working directory for the CNVeil pipeline. The script reads total CN results from <work_dir>/Total_CN/ and writes allele-specific CN results to <work_dir>/Allele_CN/.
--var_dir Yes Directory containing SNP/RDS-derived files, including var_all.rds, ref_all.rds, and alt_all.rds.
--ref_type Yes Reference genome type. Must be hg19 or hg38.
--convert No Optional cell name/barcode converter file. Use this if cell names in allele-count files and total CN files need to be matched.
--dtype No Optional sequencing data type. Must be '10x' or 'other'. By default it is set to 'other'.
--snp_file No Optional SNP genotype file. Usually only needed when external genotype information is available.
--num_threads -t No Number of parallel threads. Default: 22.
Example usage

Basic usage:

python bin/CNVeil_allele_CN.py \
  --work_dir /path/to/work_dir \
  --var_dir /path/to/var_rds_dir \
  --ref_type hg38 \
  --num_threads 22

With a cell-name converter:

python bin/CNVeil_allele_CN.py \
  --work_dir /path/to/work_dir \
  --var_dir /path/to/var_rds_dir \
  --ref_type hg38 \
  --convert /path/to/collected_barcodes_SAMPLE.txt \
  --num_threads 22

When running for 10x data:

python bin/CNVeil_allele_CN.py \
  --work_dir /path/to/work_dir \
  --var_dir /path/to/var_rds_dir \
  --ref_type hg38 \
  --convert /path/to/collected_barcodes_SAMPLE.txt \
  --dtype 10x \
  --num_threads 22
Main output files

The script writes allele-specific CN results under:

<work_dir>/Allele_CN/

Important outputs include:

<work_dir>/Allele_CN/ascn_seg.csv

Allele-specific copy number inferred at the cell/segment level.

Step 4: Infer haplotype-specific copy number

This step uses the allele-specific copy-number results from Step 3 and phases allele-specific states along the genome to infer haplotype-specific copy number (HAP-CN).

Script:

bin/CNVeil_hap_CN.py

Required conda environment:

conda activate CNVeil_py2

Required input from previous steps

This step expects the CNVeil working directory to already contain the allele-specific CN result from Step 3:

<work_dir>/Allele_CN/ascn_seg.csv

This file contains allele-specific copy number states per cell and segment, formatted as:

chr,start,end,cell1,cell2,...
chr1,100000,200000,2|1,1|2,...

Each entry A|B represents the allele-specific copy numbers for a given segment and cell.

Arguments

Argument Short Required Description
--ascn_file Yes Input allele-specific CN segment file from Step 3.
--out_dir Yes Output directory for haplotype-specific CN results.

Example usage

Basic usage:

python2.7 bin/CNVeil_hap_CN.py \
  --ascn_file /path/to/work_dir/Allele_CN/ascn_seg.csv \
  --out_dir /path/to/work_dir/Haplotype_CN 

Main output files

The script writes haplotype-specific CN results under:

<work_dir>/Haplotype_CN/

Important outputs include:

Haplotype-specific copy number inferred at the cell/segment level with consistent allele orientation across the genome.

<work_dir>/Haplotype_CN/phasing_output.pkl

Serialized phased copy-number object for downstream analysis.

Benchmark

Prepare scDNA-seq data

Here is an example of generating the bam files from fastq files downloaded from NCBI cell by cell. The patient is named as 'KTN126'.

patient="KTN126"
file=YOUR_PATH_FOR_SRR_LIST
srapath=YOUR_PATH_TO_STORE_SRA_FILE
fastqpath=YOUR_PATH_TO_STORE_FASTQ_FILE
mkdir ${fastqpath}

 while read line; do
     echo $line
     YOUR_PATH_TO_SRATOOLKIT/bin/fastq-dump -I --split-files $srapath$line"/"$line".sra" 
     -O $fastqpath$line
 done < $file

Generate the fastq files from the. sra file with sra toolkit.

ml Intel/2017.4.196 BWA/0.7.17
ml GCC/11.3.0 SAMtools/1.18
sampath=${srapath}/sampath_SraAccList${patient}/
align_dir=${srapath}/sampath_SraAccList${patient}/
readgroup_path=${srapath}/bampath_SraAccList${patient}_readgroup/
dedup_path=${srapath}/bampath_SraAccList${patient}_dedup/

mkdir ${sampath} ${readgroup_path} ${dedup_path}
 
while read line; do
    echo $line
    cd $align_dir 
     bwa mem -M -t 8 YOUR_PATH_TO_REF_GENOME_FILE $fastqpath/$line/"$line"_1.fastq $fastqpath/$line/"$line"_2.fastq > $align_dir/"$line".sam
     samtools view -bS $align_dir/"$line".sam > $align_dir/"$line".bam 
     java -jar YOUR_PATH_TO_PICARD/picard.jar SortSam -I $align_dir/"$line".bam -SORT_ORDER coordinate -O $align_dir/"$line"_hg38.sorted.bam 
     java -jar YOUR_PATH_TO_PICARD/picard.jar AddOrReplaceReadGroups -I $align_dir/"$line"_hg38.sorted.bam -O $readgroup_path/"$line"_hg38.sorted.rg.bam -RGID $readgroup_path/"$line"_hg38/ -RGLB NAVIN_Et_Al -RGPL ILLUMINA -RGPU machine -RGSM $readgroup_path/"$line"_hg38/
     java -jar YOUR_PATH_TO_PICARD/picard.jar MarkDuplicates -REMOVE_DUPLICATES true -I $readgroup_path/"$line"_hg38.sorted.rg.bam -O $dedup_path/"$line"_hg38.sorted.rg.dedup.bam -METRICS_FILE $dedup_path/"$line"_hg38.sorted.rg.dedup.metrics.txt -PROGRAM_RECORD_ID MarkDuplicates -PROGRAM_GROUP_VERSION null -PROGRAM_GROUP_NAME MarkDuplicates
     java -jar YOUR_PATH_TO_PICARD/picard.jar BuildBamIndex -I $dedup_path/"$line"_hg38.sorted.rg.dedup.bam -O $dedup_path/"$line"_hg38.sorted.rg.dedup.bai
done < $file

Generate the bam files after alignment.

Next, we generate the bed files with hg19/hg38 reference.

ml GCC/8.2.0 BEDTools/2.28.0
bedpath_hg38=${srapath}/bedpath_SraAccList${patient}_hg38
bedpath_hg19=${srapath}/bedpath_SraAccList${patient}_hg19
mkdir ${bedpath_hg38} ${bedpath_hg19}
while read line; do
    echo $line
    bamToBed -i $dedup_path/$line'_hg38.sorted.rg.dedup.bam' > \
        ${bedpath_hg38}/$line'_grch38_rmdup.bed'
done < $file

while read line; do
    echo $line
    YOUR_PATH_TO_LIFTOVER/liftOver \
        ${bedpath_hg38}/$line'_grch38_rmdup.bed' \
        'YOUR_PATH_TO_LIFTOVER_CHAINFILE/liftOver_ChainFiles/hg38ToHg19.over.chain' \
        ${bedpath_hg19}/$line'_hg19_rmdup.bed' \
        ${bedpath_hg19}/$line'_hg19_unlifted_rmdup.bed'
done < $file

Zipped the bed file, which is recommended for the Ginkgo web platform to analyze.

cd ${bedpath_hg19}

About

CNVeil enables accurate and robust tumor subclone identification and copy number estimation from single-cell DNA sequencing data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors