amoghpj/primergen

Overview

Given a set of genome FASTA files, generate primer candidates (for qPCR) that are specific to each genome.

Approach

  1. Focus on coding sequences - first annotate each genome to extract protein coding sequences.
  2. For every genome G, concatenate every other genome FASTA file to make an “excluded” genome G’. Primers specific to G should not bind to G’.
  3. Index each genome and its cognate excluded genome using bowtie2.
  4. Propose primer candidates with parameters suitable for qPCR using primer3. Format the forward and reverse primers as paired “reads”.
  5. Align these primer “read” files to both the genome and the excluded genome.
  6. Filter out primer pairs with non-specific binding. I am using bowtie2 with a default alignment range of [0,5000]. This is meant to filter out short concordant alignments.
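Steps 2 and 4 can be sketched in a few lines of Python. This is a minimal illustration, not the code used by the pipeline: the function names `build_excluded_genome` and `primers_to_reads` are hypothetical, and it assumes the primer pairs have already been parsed from the primer3 output into `(name, forward, reverse)` tuples.

```python
from pathlib import Path


def build_excluded_genome(target: Path, genomes: list[Path], out: Path) -> Path:
    """Concatenate every genome FASTA except `target` into `out` (step 2)."""
    with out.open("w") as handle:
        for fasta in genomes:
            if fasta == target:
                continue  # the target genome G is left out of G'
            text = fasta.read_text()
            # Make sure each file ends with a newline so records do not merge.
            handle.write(text if text.endswith("\n") else text + "\n")
    return out


def primers_to_reads(pairs, fwd_path: Path, rev_path: Path) -> None:
    """Write primer pairs as two mate FASTA files (step 4).

    `pairs` is an iterable of (name, forward_seq, reverse_seq) tuples.
    Both mates share the same record name and the same order in each file.
    """
    with fwd_path.open("w") as fwd, rev_path.open("w") as rev:
        for name, left, right in pairs:
            fwd.write(f">{name}\n{left}\n")
            rev.write(f">{name}\n{right}\n")
```

The two files written by `primers_to_reads` could then be passed to bowtie2 as FASTA mates (`-f -1 forward.fa -2 reverse.fa`), once against the index of G and once against the index of G’.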

./img/primergen_illustration.png
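The filtering in step 6 can be sketched as a comparison of the two bowtie2 SAM outputs. This is one possible reading of the filter, not necessarily what the pipeline does: it assumes a primer pair counts as specific when it aligns concordantly (SAM flag bit 0x2) to the target genome and has no concordant alignment to the excluded genome. The function names are hypothetical.

```python
def concordant_pairs(sam_lines):
    """Return names of read pairs with a concordant alignment (flag bit 0x2)."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, flag = fields[0], int(fields[1])
        if flag & 0x2:  # each mate is aligned as part of a proper pair
            names.add(name)
    return names


def specific_pairs(target_sam, excluded_sam):
    """Pairs concordant on the target genome but not on the excluded genome."""
    return concordant_pairs(target_sam) - concordant_pairs(excluded_sam)
```

Both arguments can be any iterable of SAM lines, for example open file handles on the alignments against G and G’.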

Installation and setup notes

  • Set up conda with python 3.11 and snakemake 8.11.3. My initial choice of python 3.11 is affecting a lot of downstream decisions, because most conda packages still require python<3.11.
  • Installed the snakemake executor plugin for slurm so that job submission to slurm works.
  • I am using two custom conda environments.
    • bakta has weird dependency issues with python 3.11
      • bakta also has weird issues currently with its reliance on amrfinderplus. I have had to comment out the parts of the bakta code which handle this.
    • bowtie2 was easier to install in a separate environment for some reason.
  • I am using bakta because I have not been able to install prokka with my current setup.
  • Finally, this pipeline is set up to use slurm for parallelizing tasks.
  • Applied this patch to get a local env working: https://github.com/snakemake/snakemake/compare/main…ShogoAkiyama:snakemake:conda-url-env-bug
  • bakta performance is affected by a lot of network IO (oschwengers/bakta#282). Typical runs I’ve observed are ~30 minutes.

[2024-12-07 Sat]

  • For gene finding I am using (meta)prodigal for viruses, and genemark-et for eukaryotic (fungal) genomes.
    • The output of these two tools does not get filtered in any way, though I could possibly rely on VAPID for viral annotations.
    • It appears that viral genomes typically have polyprotein-encoding CDSs, so the total number of CDSs might end up being low.

Usage

  1. Create a directory, say RUNNAME.
  2. Modify config.yaml to point run_path to RUNNAME.
  3. In RUNNAME/genome/ copy the individual genome FASTA files that are the targets for primer generation.
  4. If using slurm, modify the supplied sbatch file to reflect the partition parameters, and run using sbatch run_primergen.sbatch
  5. If running from the commandline, use something like
    snakemake --snakefile primergen.snakefile -p\
     --rerun-incomplete\
     --cores 4\
     --use-conda\
     --configfile config.yaml\
     -j 100\
     --keep-incomplete --latency-wait 60
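Steps 1 and 3 above can also be scripted. The helper below is a hypothetical convenience, not part of the repository: it only creates RUNNAME/genome/ and copies the given FASTA files into it. Pointing run_path in config.yaml at the run directory (step 2) still has to be done by hand.

```python
from pathlib import Path
import shutil


def prepare_run(run_path: Path, fasta_files: list[Path]) -> list[Path]:
    """Create run_path/genome/ and copy the target genome FASTA files into it."""
    genome_dir = run_path / "genome"
    genome_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for fasta in fasta_files:
        # Keep the original file name so each genome stays identifiable.
        destination = genome_dir / fasta.name
        shutil.copy(fasta, destination)
        copied.append(destination)
    return copied
```

For example, `prepare_run(Path("RUNNAME"), [Path("g1.fasta"), Path("g2.fasta")])` would leave both files under RUNNAME/genome/.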
        
