Skip to content

Xinglab/scCirRL-seq

Repository files navigation

Haplotype-resolved full-length transcriptome analysis in single cells by scCirRL-seq

Updates (v0.0.1)

  • Added split-parallel-merge workflow for barcode & UMI calling on large BAM files (scCirRL_extract_reads, scCirRL_collect_ref_bc, scCirRL_assign_bc_from_tsv, scCirRL_merge_bc_umi)

What is scCirRL-seq

scCirRL-seq is a computational pipeline designed for single-cell RNA-seq long-read data.

It mainly consists of three parts:

  1. cell barcode/UMI calling: a standalone module for barcode/UMI calling and gene/transcript quantification
  2. alternative splicing analysis: modules for identification of cell type-specific and allele-specific splicing
  3. visualization of gene and transcript expression in single cells

Table of Contents

Installation

Operating system and python version

scCirRL-seq was written in python3 and tested on Linux/Unix systems, so it may not work well with python2 and/or other systems(Windows/Mac).

Via conda (install locally for now)

git clone git@github.com:Xinglab/scCirRL-seq.git
cd scCirRL-seq
conda create --prefix ./conda_env
conda activate ./conda_env
conda install -c conda-forge -c bioconda python=3.8 --file conda_requirements.txt
python setup.py install

From source files

git clone git@github.com:Xinglab/scCirRL-seq.git
cd scCirRL-seq && pip install .

0. Preprocessing

0.1 Mapping

For mapping, we recommend using minimap2 in RNA splice mode. Any other long-read RNA-seq alignment tools can also be used here.

  • Input

    • long_read.fq/fa: 1D long reads or RCA consensus sequences in fastq/fasta format
    • ref.fa: reference genome
    • (optional) anno.gtf: gene annotation file in GTF format
    • (optional) n_threads: number of threads to use
  • Command

# 1. convert GTF to BED12 using paftools.js from minimap2
paftools.js gff2bed anno.gtf > anno_junc.bed12

# 2. mapping with minimap2
minimap2 ref.fa long_read.fq/fa         \
         --junc-bed anno_junc.bed12     \
         -Yax splice -ub -k14 -w4       \
         --sam-hit-only --secondary=no  \
         -t n_threads -o long_read.sam

# 3. convert sam to sorted bam
samtools view long_read.sam -b long_read.bam
samtools sort long_read.bam -@ n_threads -o long_read.sorted.bam

0.2 (optional) Mark potential chimeric long reads

Optionally, you can use scCirRL_split_chimeric_read to mark potential chimeric long reads, which may rescue some reads from secondary alignments.

scCirRL_split_chimeric_read long_read.sorted.bam long_read.split.sorted.bam

0.3 Transcript identification

For transcript identification, we recommend using ESPRESSO(≥1.3.1), other tools like Bambu or IsoQuant can also be used.

  • Input
    • long_read.sorted.bam: sorted long-read alignment file in BAM format
    • ref.fa: reference genome file in FASTA format
    • anno.gtf: gene annotation file in GTF format
    • esp_output_dir: output directory
  • Output
    • esp_output_dir/esp_cmpt.tsv: compatible isoforms for all reads
    • esp_output_dir/for_esp_input_N2_R0_updated.gtf: updated GTF annotation
  • Command
# 1. create tab-separated input file for ESPRESSO
mkdir esp_output_dir 2> /dev/null
in_tsv=esp_output_dir/for_esp_input.tsv
abs_path=$(realpath long_read.sorted.bam)
base_name=$(basename long_read.sorted.bam)
echo -e "$abs_path\t$base_name" > $in_tsv

# 2. ESPRESSO S step
perl ESPRESSO_S.pl -L esp_output_dir/for_esp_input.tsv  \
                   -F ref.fa -A anno.gtf            \
                   -M notFilterOutchrM -T n_threads \
                   -O esp_output_dir

# 3. ESPRESSO C step
perl ESPRESSO_C.pl -I esp_output_dir -F ref.fa \
                   -X 0 -T n_threads

# 4. ESPRESSO Q step
perl ESPRESSO_Q.pl -L esp_output_dir/for_esp_input.updated \
                   -A anno.gtf -T n_threads            \
                   -V esp_output_dir/esp_cmpt.tsv

Note that -V esp_cmpt.tsv is optional in ESPRESSO Q step, but it is required if you want scCirRL-seq to output gene/transcript quantification information.

1. Barcode & UMI calling

scCirRL-seq identifies barcode and UMI from sorted alignment BAM file of single-cell long reads without requiring reference barcode from short-read data.

1.1 Input

  • Required:
    • long_read.sorted.bam: sorted long-read BAM (recommend using minimap2)
  • Optional:
    • read_isoform_compatible.tsv: tabular file of compatible isoforms for all reads, generated by ESPRESSO(≥1.3.1), Bambu or IsoQuant
    • updated.gtf: updated GTF annotation, generated by ESPRESSO(≥1.3.1), Bambu or IsoQuant
    • annotation.gtf: reference annotation GTF file, to retrieve gene names if gene names are not provided in updated.gtf
    • cell_barcode.tsv: reference cell barcode list. If provided, scCirRL-seq will directly use it to guide the barcode calling, only long reads with cell barcodes in the provided list will be kept
    • barcode sequence length (default: 16)
    • max. allowed edit distance between barcode and reference barcode (default: 2)
    • UMI sequence length (default: 12)
    • max. allowed edit distance between PCR-duplicated UMIs (default: 1)

1.2 Command

scCirRL long_read.sorted.bam \
        output_dir           \
        -t updated.gtf       \
        -m read_isoform_compatible.tsv \
        -g annotation.gtf

1.3 Output

  • bc_umi.bam: BAM file with barcode/UMI information for each long read, BAM tags: CB and UB, for barcode and UMI. Note that only long read with barcode/UMI called are kept
  • bc_umi.tsv: tabular file with barcode/UMI/gene/transcript information for all barcode-called long reads
  • expression_matrix/: folder containing 10X format sparse expression matrix at both gene and transcript level, can be directly parsed using standard single-cell analysis tools, like Seurat/Azimuth
  • ref_bc.tsv: reference cell barcode list identified from long reads. Note that if cell barcode list was provided to scCirRL-seq, ref_bc.tsv file will not be generated
  • high_qual_bc_umi.rank.tsv: cell barcode list ranked by unique UMI count based on all high-quality long reads. Note that if cell barcode list was provided to scCirRL-seq, high_qual_bc_umi.rank.tsv file will not be generated

Example of bc_umi.tsv:

cell barcode UMI # reads compatible transcript id gene id gene name read names (separeted by `,`)
ATCACGACACTTTAGG ATCACATCCATG 3 ENST00000407249, ENST00000341832 ENSG00000248333 CDK11B 741aa2c2-5840-4a29-bd90-3bdcb71604ba, 05f8432a-7f7e-446d-a6d5-8ab9e4eb5102, 6bb13ee1-7397-435c-8840-aeb2a90cf4ab

1.4 Parallel barcode & UMI calling (for large BAM files)

For large BAM files, the barcode and UMI calling step can be split into parallel jobs and then merged. This six-step workflow produces identical outputs to the main scCirRL command.

Overview

Step Tool Parallelisable Description
1 scCirRL_extract_reads BAM region → lightweight reads TSV
2 scCirRL_collect_ref_bc All TSVs → reference barcode list
3 scCirRL_split_cmpt_tsv Read–isoform compatible TSV → per-chromosome TSVs
4 scCirRL_assign_bc_from_tsv TSV chunk → bc_umi.tsv + tagged BAM chunk
5 scCirRL_merge_bc_umi Merge TSV and BAM chunks
6 scCirRL_make_expression_matrix bc_umi.tsv → gene/transcript expression matrix

The intermediate reads TSV produced in step 1 stores the pre-extracted barcode/UMI candidate information per read (qname, mapping position, bc, umi, bu, is_perfect, ts_tag), so steps 2–6 never need to re-open the original BAM (except for the optional BAM output in step 4).

Step 1 — Extract reads (run in parallel, one job per chromosome/region)

# split by chromosome
# Note: chrM (MT) is required to be included for single-cell analysis, as it is used to estimate the ambient RNA contamination level
for CHR in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 \
           chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 \
           chr20 chr21 chr22 chrX chrY chrM; do
    scCirRL_extract_reads long_read.sorted.bam \
                          ${CHR}.reads.tsv.gz  \
                          -r ${CHR} &
done
wait
Option Default Description
-r / --region all reads Genomic region, e.g. chr1 or chr1:10000-50000
-b / --bc-len 16 Barcode length
-u / --umi-len 12 UMI length
-5 / --five-ada CTACACGACGCTCTTCCGATCT 5′ adapter sequence
-s / --skip-chimeric False Skip chimeric reads (reads with SA tag)

Step 2 — Collect reference barcodes (single pass over all TSVs, runs once)

scCirRL_collect_ref_bc chr*.reads.tsv.gz \
                        ref_bc.tsv       \
                        --high-qual-bc-rank high_qual_bc.rank.tsv

If a reference barcode list is already available (e.g. from 10X Cell Ranger):

scCirRL_collect_ref_bc chr*.reads.tsv.gz ref_bc.tsv -l barcodes.tsv.gz
Option Default Description
--high-qual-bc-rank high_qual_bc.rank.tsv BC rank table output path
-c / --cell-count auto Force top-N cell count; -1 = knee-point detection
-l / --bc-list Pre-existing reference barcode list (skips knee-point step)

Step 3 — Split read–isoform compatible TSV by chromosome (runs once)

When a read–isoform compatible TSV is available (from ESPRESSO or Bambu), split it into per-chromosome files so that step 4 can filter each chunk on-the-fly without scanning the whole file.

scCirRL_split_cmpt_tsv read_isoform_compatible.tsv \
                        updated.gtf                 \
                        cmpt_by_chrom/              \
                        --prefix cmpt
# produces: cmpt_by_chrom/cmpt.chr1.tsv, cmpt.chr2.tsv, ..., cmpt.unassigned.tsv
Option Default Description
-p / --prefix cmpt Filename prefix for output files

Step 3 can be skipped if no read–isoform compatible TSV is available; in that case omit -m in step 4.

Step 4 — Assign barcodes & UMIs, tag BAM (run in parallel)

for CHR in chr1 chr2 ... chrX chrY chrM; do
    scCirRL_assign_bc_from_tsv                            \
        ${CHR}.reads.tsv.gz ref_bc.tsv                    \
        ${CHR}.bc_umi.tsv                                 \
        --region  ${CHR}                                  \
        --in-bam  long_read.sorted.bam                    \
        --out-bam ${CHR}.bc_umi.bam                       \
        -g annotation.gtf                                 \
        -t updated.gtf                                    \
        -m cmpt_by_chrom/cmpt.${CHR}.tsv &
done
wait
Option Default Description
--region all reads Genomic region to process (e.g. chr1 or chr1:10000-50000). Filters reads in the TSV by coordinate, restricts the per-chromosome read–isoform compatible TSV to transcripts overlapping the region (via -t updated GTF), and limits the BAM pass to the same region. Should match the region used in step 1.
--in-bam Original sorted BAM (required for BAM output)
--out-bam Output tagged BAM chunk
-g / --anno-gtf Reference annotation GTF
-t / --updated-gtf Updated GTF from ESPRESSO/Bambu (required for region-aware filtering of the read–isoform compatible TSV)
-m / --cmpt-tsv Per-chromosome read–isoform compatible TSV from step 3
-b, -e, -u, -d, -5 (same as scCirRL) Barcode/UMI parameters

Each chunk BAM is indexed automatically after writing.

Note: Omit --in-bam and --out-bam if you only need the bc_umi.tsv output.

Step 5 — Merge

scCirRL_merge_bc_umi bc_umi.tsv                  \
    --tsv-chunks chr*.bc_umi.tsv                 \
    --bam-chunks chr*.bc_umi.bam                 \
    --out-bam    bc_umi.sorted.bam

Omit --bam-chunks and --out-bam if no BAM output was requested in step 4.

Step 6 — Generate expression matrix

scCirRL_make_expression_matrix bc_umi.tsv \
    expression_matrix/
Option Default Description
-c / --bc-cluster Cell barcode–cluster TSV (enables cluster-wise EM abundance estimation)
-t / --read-cate all Only keep reads with specific transcript category, e.g. FSM. Comma-separated for multiple types.
-s / --splice False Only keep reads with splice tag (minimap2 ts tag)
-a / --hap-list Haplotype assignment file from WhatsHap (enables haplotype-wise matrix)

Output files are written under expression_matrix/gene/ and expression_matrix/transcript/, each containing matrix.mtx.gz, features.tsv.gz, and barcodes.tsv.gz in 10X sparse format.

The merged bc_umi.tsv and bc_umi.sorted.bam from step 5 are drop-in replacements for the outputs of the main scCirRL command and are fully compatible with all downstream scCirRL-seq tools.

2. Cell clustering and annotation

All the downstream single-cell long-read analyses rely on the cell type clustering result, which could be accomplished by running Seurat on the gene expression matrix.

Annotating the cell clusters is the process of assigning cell types to each cell cluster. This could be performed manually or based on known reference annotation like Azimuth.

For example, for human peripheral blood mononuclear cells (PBMC), the clustering and annotation result can be obtained by mapping to the Azimuth human PBMC reference dataset.

We provide an example R script for human PBMC data. For data from other species/tissues, this needs to be done manually by users.

Note that not having cell clusters annotated with cell types does not affect the downstream splicing analyses by scCirRL-seq. You can simply provide scCirRL-seq with the cluster IDs for cell barcodes.

Here is an example of the output file for this step, bc_to_cell_type.tsv:

barcode cell type
TACGCCGAGCAGGCCA CD4 T
ACGATCGAGGCATCAG Mono
... ...

Or, if no cell type annotation was available:

barcode cluster ID
TACGCCGAGCAGGCCA 1
ACGATCGAGGCATCAG 2
... ...

3. Cell type-specific splicing analysis

scCirRL-seq performs pairwise comparison to identify differentially spliced genes/transcripts between each two cell types/clusters.

3.1 Input

  • Required
    • expression_matrix/transcript: transcript count matrix directory, generated by scCirRL-seq
    • bc_to_cell_type.tsv: list of barcode and corresponding cell type or cluster ID, generated by Seurat/Azimuth
    • out_prefix: prefix of output files
  • Optional
    • min. number of reads to keep a transcript for each cell type (default: 10)
    • min. number of transcripts to keep a gene for each cell type (default: 2)
    • list of genes of interest. If provided, all genes will still be processed, but only differentially spliced genes that are in this list will be output

3.2 Command

scCirRL_cell_type_specific_splicing expression_matrix/transcript \
                                    bc_to_cell_type.tsv          \
                                    out_prefix

3.3 Output

  • *_cell_specific_spliced_genes.tsv: differentially spliced genes between 2 cell types. Gene FDR ≤0.05, max. delta ratio ≥0.05
  • *_cell_specific_spliced_transcripts.tsv: differentially spliced transcripts between 2 cell types. Gene FDR ≤0.05, transcript p value ≤0.05, delta ratio ≥0.05

Example of *_cell_specific_spliced_genes.tsv:

cell_type1 cell_type2 gene_id gene_name gene_fdr max_delta_ratio transcript_ids cell1_counts cell2_counts cell1_ratios cell2_ratios
Epithelial Mesenchymal ENSG00000026508 CD44 7.96e-65 0.68 ENST00000263398, ENST00000433892, ENST00000527326, ESPRESSO:chr11:443:2, ESPRESSO:chr11:443:3 61.43,122.97, 20.0,23.41,13.63 322.74,4.24,15.0,1.33-05,1.33e-05 0.25,0.50,0.08,0.09,0.05 0.94,0.01,0.04,3.89e-08,3.89e-08

Example of *_cell_specific_spliced_transcripts.tsv:

cell_type1 cell_type2 gene_id gene_name transcript_id gene_fdr transcript_p delta_ratio count1 count2 ratio1 ratio2
Epithelial Mesenchymal ENSG00000026508 CD44 ENST00000263398 7.96e-65 8.25e-73 0.68 61.43 322.74 0.25 0.94

4. Allele-specific splicing analysis

With additional whole-genome sequencing data, scCirRL-seq can identify allele-specific splicing within each cell type. The allele-specific splicing analysis mainly consists of three steps:

  1. Phase long reads with whatshap
  2. Identify allele-specific spliced genes/transcripts within each cell type
  3. Search for disease-associated GWAS SNPs from identified allele-specific spliced genes

4.1 Phase long reads with whatshap

  • Input
    • wgs_phased.vcf: WGS phased VCF file, generated from WGS short-read data using GATK and SHAPEIT
    • ref.fa: reference genome fasta file
    • bc_umi.bam: BAM file with barcode/UMI information, generated by scCirRL-seq
  • Command
# sort and index `bc_umi.bam`
samtools sort bc_umi.bam -o bc_umi_sorted.bam
samtools index bc_umi_sorted.bam

# identify haplotype
whatshap haplotag wgs_phased.vcf bc_umi_sorted.bam \
                  --reference ref.fa               \
                  --ignore-read-groups             \
                  --skip-missing-contigs           \
                  -o bc_umi_hap.bam                \
                  --output-haplotag-list hap_list.tsv
  • Output
    • bc_umi_hap.bam: BAM file with barcode/UMI and additional haplotype information
    • hap_list.tsv: tabular file containing haplotype information for barcode-called long reads

4.2 Identify allele-specific splicing

  • Input

    • bc_umi.tsv: tabular file with barcode/UMI/gene/transcript information, generated by scCirRL-seq
    • hap_list.tsv: tabular file containing haplotype information, generated by whatshap, see above
    • bc_to_cell_type.tsv: list of barcode and corresponding cell type or cluster ID, generated by Seurat/Azimuth
  • Command

scCirRL_allele_specific_splicing bc_umi.tsv \
                                 bc_to_cell_type.tsv \
                                 hap_list.tsv \
                                 output_prefix

  • Output
    • *_allele_specific_spliced_genes.tsv: allele-specific spliced genes. Gene FDR ≤0.05, max. delta ratio ≥0.05
    • *_allele_specific_spliced_transcripts.tsv: allele-specific spliced transcripts. Gene FDR ≤0.05, transcript p value ≤0.05, delta ratio ≥0.05

Example of *_allele_specific_spliced_genes.tsv:

cell_type gene_id gene_name gene_fdr transcripts hap1_counts hap2_counts hap1_ratios hap2_ratios
Mono ENSG00000105383 CD33 4.42e-07 ENST00000262262, ENST00000421133 58,9 24,41 0.86,0.13 0.36,0.63

Example of *_allele_specific_spliced_transcripts.tsv:

cell_type transcript_id transcript_p gene_id gene_name gene_fdr hap1_count hap2_count hap1_ratio hap2_ratio
Mono ENST00000262262 3.58e-09 ENSG00000105383 CD33 4.42e-07 58 24 0.86 0.36

4.3 Identify disease-associated GWAS SNPs from allele-specific spliced genes

  • Input

    • *_allele_specific_spliced_genes.tsv: allele-specific spliced genes, generated by scCirRL_allele_specific_splicing
    • wgs_phased.vcf: WGS phased VCF file, generated from WGS short-read data using GATK and SHAPEIT
    • annotation.gtf: annotation GTF file
    • gwas_catalog_efo.tsv: GWAS catalog database with EFO term
    • ld.duckdb: linkage disequilibrium (LD) score database in duckdb format
  • Command

scCirRL_gene_with_gwas_disease -g allele_spliced_genes.tsv \
                               annotation.gtf       \
                               wgs_phased.vcf       \
                               gwas_catalog_efo.tsv \
                               ld.duckdb            \
                               output_prefix
  • Output
    • *_gwas_efo_term_stat.tsv: GWAS disease-associated genes, with traits described in EFO term
    • *_gwas_snps_ld: directory of LD scores and GWAS traits for all SNPs, each gene has one separate .ld file

Example of *_gwas_efo_term_stat.tsv:

gene_name gene_id Cancer Immune system disorder Neurological disorder Cardiovascular disease Digestive system disorder Metabolic disorder Response to drug Other disease
CD33 ENSG00000105383 False False True False False False False False

Example of *_gwas_snps_ld/gene.ld:

snp_id rs3865444 rs2459141 rs12459419 rs2455069 rs7245846 rs33978622 rs34813869 rs1354106 haplotype chrom pos type gwas_trait gwas_trait_efo_term gwas_pvalue
rs3865444 1.0 0.326844 1.0 0.327703 0.891224 0.967195 0.887773 0.881092 H2 chr19 51224706 gwas Alzheimer disease Neurological disorder 2e-09

5. Visualization

Given genes/transcripts of interest (cell-type-specific splicing/allele-specific spciling), scCirRL-seq also offers visualization of your results in the single-cell UMAP space, based on the Seurat R package.

# gene/transcript expression folder generated by scCirRL-seq
gene_mtx <- "output_dir/expression_matrix/gene"
transcript_mtx <- "output_dir/expression_matrix/transcript"
# script for UMAP plot
source("visualization.R)

5.1. UMAP plot of gene VIM and EPCAM

scCirRL_umap_plot(gene_mtx_dir = gene_mtx,
                  # `feature_mtx_dir` needs to be only one folder/object or the same size as `feature_list`
                  feature_mtx_dir = gene_mtx,
                  feature_list = c("VIM", "EPCAM"))

5.2. UMAP plot of CD44 transcripts

scCirRL_umap_plot(gene_mtx_dir = gene_mtx,
                  # `feature_mtx_dir` needs to be only one folder/object or the same size as `feature_list`
                  feature_mtx_dir = transcript_mtx,
                  feature_list = c("ENST00000433892", "ENST00000263398"))

5.3. UMAP plot of both CD44 gene and transcripts

scCirRL_umap_plot(gene_mtx_dir = gene_mtx,
                  # `feature_mtx_dir` needs to be only one folder/object or the same size as `feature_list`
                  feature_mtx_dir = c(gene_mtx, transcript_mtx, transcript_mtx),
                  feature_list = c("CD44", "ENST00000433892", "ENST00000263398"))

Note that, for gene_mtx_dir and feature_mtx_dir, you can also provide with Seurat object instead of a matrix folder.

gene_obj = make_seurat_obj(gene_mtx)
transcript_obj = make_seurat_obj(transcript_mtx)

scCirRL_umap_plot(gene_mtx_dir = gene_obj,
                  # `feature_mtx_dir` needs to be only one folder/object or same size as `feature_list`
                  feature_mtx_dir = c(gene_obj, transcript_obj, transcript_obj),
                  feature_list = c("CD44", "ENST00000433892", "ENST00000263398"))

About

Haplotype-resolved full-length transcriptome analysis in single cells by scCirRL-seq

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors