Nextflow pipeline (DSL2) that runs ATLAS to infer genotype likelihoods and the run GLIMPSE to impute ancient genomes.
-v 0.1.0
- Supports bams as input.
- Option to run glimpse or not and only get ATLAS output.
- Option to get only the 1240k
- Scalability and reproducibility via a Nextflow-based framework.
- Compatible with nf-core institutional configs.
- Ubuntu 24.04.4 LTS
- macOS
| Requirement | Version | Required Commands * |
|---|---|---|
| bcftools | 1.23 | reheader,view,index,merge,annotate,call,concat |
| samools | 1.23 | view,merge |
| nextflow | 26.04.3 | nextflow |
| GLIMPSE | 1.1 | chunk,ligate,phase |
| atlas | v1.4.0.2 | call,pmd,recal |
* These commands must be accessible from your $PATH (i.e. you should be able to invoke them from your command line).
Download nf-haplotype-selection from Github repository:
git clone https://github.com/jbv2/call_impute.git
To test call_impute execution using test data, run:
nextflow run main.nf -profile <test>This pipeline can use nf-core Institutional profiles. For MPI EVA people, please use the following to test on GRACE:
module load apptainer/1.5.0 ## to be able to run Nextflow
module load java/1.21.0 ## to use java for nextflow
module load bcftools/1.23.1
module load samtools/1.23.1
module load bcftools/1.23.1
module load htslib/1.23.1
module load GLIMPSE/1.1.1-static
which atlas ##verify you have atlas in your $PATH
nextflow run main.nf -profile eva_grace,testTo run call_impute go to the pipeline directory and execute:
nextflow run main.nf \
-c call_impute.config \ ## Config with the specified paths for the references
--input 'test/data/input.tsv' \ ## Input tsv
--outdir 'test/results' \ ## Output directory
--project_name 'test' \ ## general name
--get_1240k <true/false> \ ## specify if you want to subset to the 1240k. Defaulf: false
--csv_1240k 'test/reference/1240kSNPs/' ## Path to the directory with the 1240k csvs per chromosome
Where your config file has these options:
profiles {
hs37d5_1KGP {
params {
config_profile_name = 'hs37d5_1KGP'
config_profile_description = 'Imputation of low-coverage ancient genomes'
// References
fasta = "<PATH to the fasta reference genome>"
fai = "<PATH to the fasta index>"
// ATLAS parameters
atlas_pmd_length = 50
atlas_recal_regions = "<PATH to the `.bed` file for atlas to run recal"
atlas_recal_chr = 20 ## Chromosome that ATLAS uses to run recal
method = 'MLE' ## Method for genotyping
alleles = "<PATH to the `alleles` file for Atlas to infer genotypes likelihoods>"
// bcftools call parameters
bcftools_alleles = "<PATH to the `alleles` file for bcftools to infer genotypes>"
ref_snps_only = "<PATH to `vcf.gz` with its index (`vcf.gz.tbi`) for bcftools to infer genotypes>"
// GLIMPSE references
run_glimpse = true
glimpse_ref = "<PATH to the`GLIMPSE` reference panel directory.>"
glimpse_map = "<PATH to the`GLIMPSE` genetic map directory directory.>"
glimpse_chunks = "<PATH to the`GLIMPSE` chunks.>"
}
}
}-
input.tsv- A
TSVfile with three columns:sample bam bai. Example line(s):
sample bam bai PC537 test/data/PC537_chr20_1.0x.bam test/data/PC537_chr20_1.0x.bam.bai PC537 test/data/PC537_chr21_1.0x.bam test/data/PC537_chr21_1.0x.bam.bai PC537 test/data/PC537_chr22_1.0x.bam test/data/PC537_chr22_1.0x.bam.bai
1A. The
.bamfiles you want to input, together with their.bai. - A
-
A
.fastafile with its index (.fai) of the reference genome you used to aligne your BAMS. -
A
.bedfile for atlas to run recal. Example line(s):20 5556502 5556504 20 5556505 5556507 20 5556508 5556510 20 5556512 5556513
-
An
allelesfile for Atlas to infer genotypes likelihoods. Example line(s):20 60479 C T 20 60571 C A 20 60828 T G 20 61098 C T
-
An
allelesfile for bcftools to infer genotypes. Example line(s):20 60479 C,T 20 60571 C,A 20 60828 T,G 20 61098 C,T
-
A
vcf.gzwith its index (vcf.gz.tbi) for bcftools to infer genotypes. For more information about the VCF format, please go to the next link: Variant Call Format Example line(s):##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO 20 60479 rs149529999 C T . PASS AN=5008;AF=0.00339457;AC=17 20 60571 rs116145529 C A . PASS AN=5008;AF=0.00199681;AC=10 20 60828 rs187713677 T G . PASS AN=5008;AF=0.00119808;AC=6 20 61098 rs6078030 C T . PASS AN=5008;AF=0.287141;AC=1438
-
A
GLIMPSEreference panel directory that contains:
- A
VCFwith its index (.vcf.gz.tbi) with snps per chromosome. Usually the 1KGP reference panel is used. Example line(s):##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00105 HG00106 HG00107 HG00108 HG00109 HG00110 HG00111 HG00112 HG00113 HG00114 HG00115 HG00116 HG00117 HG00118 HG00119 HG00120 HG001> 20 60479 rs149529999 C T . PASS AN=5008;AF=0.00339457;AC=17 GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 > 20 60571 rs116145529 C A . PASS AN=5008;AF=0.00199681;AC=10 GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 > 20 60828 rs187713677 T G . PASS AN=5008;AF=0.00119808;AC=6 GT 0|0 0|1 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 > 20 61098 rs6078030 C T . PASS AN=5008;AF=0.287141;AC=1438 GT 0|0 0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|1 0|0 0|1 0|0 0|0 0|0 1|0 0|0 1|0 >
- A
GLIMPSEgenetic map directory that contains:
- A
.gmap.gzper chromosome from the version of the genome you are using. Example line(s):pos chr cM 61795 20 0.000000 63231 20 0.001056 63244 20 0.001066 63799 20 0.001473
- A
GLIMPSEchunks directory that contains:
Output from running GLIMPSE_chunk on the reference panel you are using. Better to only run it once.
- A
.txtper chromosome. Example line(s):0 20 20:60479-3867430 20:60479-3667395 3606917 41914 1 20 20:3467342-7285687 20:3667406-7085233 3417828 41913 2 20 20:6885259-11156251 20:7085271-10955878 3870608 41914 3 20 20:10755798-15077388 20:10955980-14877295 3921316 41912
- If
--get_1240k true,--csv_1240kneeds a path to the directory with the 1240k csvs per chromosome.
- A
.csvper chromosome with the 1240k positions. Example line(s):20 63244 20 63799 20 68749 20 69094 20 74347
-
A compressed imputed
VCFwith its index.vcf.gz.csiper chromosome. Example line(s):##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PC537 20 63244 rs6139074 A C . . RAF=0.218251;AF=0.0025;INFO=0.003;AC=0;AN=2 GT:GP:DS 0|0:0.995,0.005,0:0.005 20 63799 rs1418258 C T . . RAF=0.439497;AF=0.003;INFO=0.003;AC=0;AN=2 GT:GP:DS 0|0:0.994,0.006,0:0.006 20 68749 rs6086616 T C . . RAF=0.571286;AF=0.926;INFO=0;AC=2;AN=2 GT:GP:DS 1|1:0.006,0.136,0.858:1.852
-
A
imputation_qc.tsvfile with the number of sites that have GP_0.99 per individual. Example line(s):sample n_sites_GP_0.99 PC537 1017889
-
For the 1240k subset Results will be in the directory:
<outputdir>/vcfs_1240k
-
For the whole-genome Results will be in the directory:
<outputdir>/vcfs_wg
Under the hood call_impute uses some coding tools, please include the following ciations in your work:
- Wegmann, D., Sell, C., Kousathanas, A., & Veeramah, K. R. (2017). ATLAS: Analysis Tools for Low-depth and Ancient Samples.
- Rubinacci, S., Ribeiro, D. M., Hofmeister, R. J., & Delaneau, O. (2021). Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nature genetics, 53(1), 120-126.
- Narasimhan, V., Danecek, P., Scally, A., Xue, Y., Tyler-Smith, C., & Durbin, R. (2016). BCFtools/RoH: a hidden Markov model approach for detecting autozygosity from next-generation sequencing data. Bioinformatics, 32(11), 1749-1751.
- Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... & 1000 Genome Project Data Processing Subgroup. (2009). The sequence alignment/map format and SAMtools. bioinformatics, 25(16), 2078-2079.
- Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature biotechnology, 35(4), 316-319.
- Ewels, P. A., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., ... & Nahnsen, S. (2020). The nf-core framework for community-curated bioinformatics pipelines. Nature biotechnology, 38(3), 276-278.
This is based on scripts and input from:
- Guido Alberto Gnecchi Ruscone
- Selina Carlhoff
- Luca Traverso
This pipeline also takes nf-core modules (although I changed some things in some for necessity, reason why this is not an nf-core pipeline at the end). Thanks to those authors as well:
- "@maxibor"
- "@merszym"
- "@jfy133"
- "@ltcrod"
- "@joseespinosa"
- "@drpatelh"
- "@nvnieuwk"
- "@ramprasadn"
- "@bjohnnyd"
- "@jemten"
- "@ramprasadn"
- "@louislenezet"
If you have questions, requests, or bugs to report, please email judith.vballesteros@gmail.com
Judith Ballesteros-Villascán judith.vballesteros@gmail.com