Given a set of genome FASTA files, generate primer candidates (for qPCR) that are specific to each genome.
- Focus on coding sequences - first annotate each genome to extract protein coding sequences.
- For every genome G, concatenate the FASTA files of all the other genomes to make an “excluded” genome G’. Primers specific to G should not bind to G’.
- Index each genome and its cognate excluded genome using bowtie2.
- Propose primer candidates with parameters suitable for qPCR using primer3. Format the forward and reverse primers as “reads”.
- Align these primer “read” files to both the genome and the excluded genome.
- Filter out primer pairs with non-specific binding. I am using bowtie2 with a default alignment range of [0,5000]. This is meant to filter out short concordant alignments.
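
The steps above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the pipeline's actual code: the function names, the `pair{i}` read naming, and the rule that a pair is kept only if it aligns concordantly to its own genome and not to the excluded genome are all hypothetical. The bowtie2 options used (`-f`, `-x`, `-1`, `-2`, `-I`, `-X`, `-S`) exist, but mapping the [0,5000] range onto `-I`/`-X` is my assumption.

```python
def excluded_sets(genomes):
    """Map each genome G to the list of all other genomes (its excluded set G')."""
    return {g: [other for other in genomes if other != g] for g in genomes}


def primer_pairs_to_reads(pairs):
    """Format (forward, reverse) primer pairs as two FASTA "read" files.

    Both mates of pair i share the hypothetical name pair{i}, so a
    paired-end aligner can match them up by position in the two files.
    """
    fwd = "".join(f">pair{i}\n{f}\n" for i, (f, _) in enumerate(pairs))
    rev = "".join(f">pair{i}\n{r}\n" for i, (_, r) in enumerate(pairs))
    return fwd, rev


def bowtie2_command(index, fwd_fasta, rev_fasta, out_sam, min_len=0, max_len=5000):
    """Build a paired-end bowtie2 call as an argument list.

    -f marks the input as FASTA; -I/-X bound the fragment length that
    counts as a concordant alignment (assumed to be the [0,5000] range).
    """
    return [
        "bowtie2", "-f", "-x", index,
        "-1", fwd_fasta, "-2", rev_fasta,
        "-I", str(min_len), "-X", str(max_len),
        "-S", out_sam,
    ]


def concordant_pairs(sam_text):
    """Return names of read pairs whose SAM flag has the 0x2 (proper pair) bit set."""
    names = set()
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):  # skip blank and header lines
            continue
        fields = line.split("\t")
        if int(fields[1]) & 0x2:
            names.add(fields[0])
    return names


def specific_pairs(target_sam, excluded_sam):
    """Keep pairs that are concordant on the target but not on the excluded genome."""
    return concordant_pairs(target_sam) - concordant_pairs(excluded_sam)
```

The argument list from `bowtie2_command` could be passed to `subprocess.run`, once against the genome index and once against the excluded-genome index, and the two SAM outputs then fed to `specific_pairs`.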
- Set up conda with Python 3.11 and snakemake 8.11.3. My initial choice of Python 3.11 is affecting a lot of downstream decisions, because most conda packages still require python<3.11.
- Installed the snakemake executor plugin for slurm to work
- I am using two custom conda environments.
  - bakta has weird dependency issues with Python 3.11. bakta also currently has weird issues with its reliance on amrfinderplus. I have had to comment out the parts of the bakta code which handle this.
  - bowtie2 was easier to install in a separate environment for some reason.
- I am using bakta because I have not been able to install prokka with my current setup.
- Finally, this pipeline is set up to use slurm for parallelizing tasks.
- Applied this patch to get a local env working: https://github.com/snakemake/snakemake/compare/main…ShogoAkiyama:snakemake:conda-url-env-bug
- bakta performance is affected by a lot of network IO (oschwengers/bakta#282). Typical runs I’ve observed are ~30 minutes.
[2024-12-07 Sat]
- For gene finding I am using (meta)prodigal for viruses, and genemark-et for eukaryotic (fungal) genomes.
- The latter two don’t get filtered in any way, though I could possibly rely on VAPID for viral annotations.
- It appears that viral genomes typically have polyprotein-encoding CDSs, so the total number of CDSs might end up being low.
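
The choice of gene finder per genome type could be captured in a small lookup, sketched below. The genome-type labels and the function name are hypothetical, and pairing bakta with bacterial genomes is my assumption. The notes only say that bakta is used in place of prokka.

```python
# Hypothetical genome-type labels; the tool names are the ones in the notes.
GENE_FINDERS = {
    "bacterial": "bakta",        # assumption: bakta annotates the bacterial genomes
    "viral": "prodigal (meta)",  # (meta)prodigal for viruses
    "eukaryotic": "genemark-et", # genemark-et for eukaryotic (fungal) genomes
}


def pick_gene_finder(genome_type):
    """Return the gene finder for a genome type, or fail loudly if none is set."""
    try:
        return GENE_FINDERS[genome_type]
    except KeyError:
        raise ValueError(f"no gene finder configured for {genome_type!r}") from None
```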
- Create a directory, say RUNNAME.
- Modify config.yaml to point run_path to RUNNAME.
- In RUNNAME/genome/ copy the individual genome FASTA files that are the targets for primer generation.
- If using slurm, modify the supplied sbatch file to reflect the partition parameters, and run using sbatch run_primergen.sbath
- If running from the commandline, use something like

      snakemake --snakefile primergen.snakefile -p \
        --rerun-incomplete \
        --cores 4 \
        --use-conda \
        --configfile config.yaml \
        -j 100 \
        --keep-incomplete --latency-wait 60
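
The directory setup in the steps above could also be scripted. The following is a minimal sketch, assuming only the RUNNAME/genome/ layout described here. The helper name `prepare_run` is hypothetical, and it does not edit config.yaml, so run_path still has to be pointed at the run directory by hand.

```python
import shutil
from pathlib import Path


def prepare_run(run_dir, fasta_files):
    """Create run_dir/genome/ and copy the target genome FASTA files into it.

    Returns the list of copied file paths.
    """
    genome_dir = Path(run_dir) / "genome"
    genome_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for fasta in fasta_files:
        dest = genome_dir / Path(fasta).name
        shutil.copy(fasta, dest)  # copy file contents, keeping the file name
        copied.append(dest)
    return copied
```

For example, `prepare_run("RUNNAME", ["a.fasta", "b.fasta"])` would leave copies of both files in RUNNAME/genome/.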