Data: RNA-seq data
Source: Allim and Fear et.al 2016.
Goal: This pipeline is designed to detect allele specific expression in population RNA-seq data.
Download the files.
bamtools
samtools
bedtools
hisat2
biopython, numpy, collections (python)
1 Extract all the homologous single nucleotide mismatches from two parents
2 Replace all these sites into 'N' and generate a masked reference
3 Estimate mapping bias by simulation.
3.1 Generate reference sequence for two parents by replacing mismatches
3.2 Simulate non-error short reads from two parents
3.3 Align two parents simulated reads to the masked reference
3.4 Count the number of read for each single nucleotide mismatch
4 Estimate the expression for each exon mismatch
3.1 Align all the raw data to the masked reference
3.2 Count the number of read for each single nucleotide mismatch
Usage: ./pipeline.pl --vcf1 --vcf2 --ref --fastq --outDir --bin --exon
Options:
--vcf1 STR Parent1 genome vcf file
--vcf2 STR Parent2 genome vcf file
--ref STR Genome reference file
--fastq STR A file contains all the fastqs.
Header format: fastq1 fastq2 Info1 Info2 Info3...
Body format: fq1 fq2 Info1Value Info2Value Info3Value...
Note1: Info[1-n] and their values will appear in the final table header and body, respectively.
Note2: The combination of 'Info1-Info2-...-Info[n]' should be unique.
--bin STR Bin directory. 'InstallationPath/bin'
--exon STR A non-overlap file contains all the exon and gene infomation.
Format: chromosome start end exonName geneName
--mode (run|script) run: run the pipeline directly [run]
script: generate script instead of running (step1.sh step2[id1,id2,...].sh step3.sh)
--vcf1Name STR Name for vcf1. [190]
--vcf2Name STR Name for vcf2. [226]
--outDir STR Output directory
--cpu INT Number of cpu. [1]
--readlength INT Maximum read length for simulation. [151]
--insertsize INT Read insert size for simulation. [350]
--hisat2 STR Path for hisat2 [hisat2]
--hisat2-build STR Path for hisat2-build [histat2-build]
--samtools STR Path for samtools [samtools]
--bedtools STR Path for bedtools [bedtools]
Example data and script are provided in "example" directory.
cd example; sh example.sh
simulation.csv
chromosome site Parent1Allele1 Parent1Allele2 Parent2Allele1 Parent2Allele2 exonID geneID
scf4459 10579 290 0 0 299 maker-scf4459-snap-gene-0.23-mRNA-1_exon_2 maker-scf4459-snap-gene-0.23
scf4459 10766 294 0 0 291 maker-scf4459-snap-gene-0.23-mRNA-1_exon_3 maker-scf4459-snap-gene-0.23
ASE.csv
chromosome site Parent1_hybrid Parent2_hybrid OtherAllele exon_id gene_id Exon_expression Gene_expression Run line id
scf4459 10579 16 0 0 maker-scf4459-snap-gene-0.23-mRNA-1_exon_2 maker-scf4459-snap-gene-0.23 20 264 Run_1 33 1
scf4459 13114 1 0 0 maker-scf4459-snap-gene-0.23-mRNA-10_exon_4 maker-scf4459-snap-gene-0.23 26 264 Run_1 33 1
Email: uqhshao at uq.edu.au or haojingshao at gmail.com
Manuscript in preparation.