hsgweon/pipits

License: GPL v3

PIPITS 🍄

An automated pipeline for analyses of fungal internal transcribed spacer (ITS) sequences from the Illumina sequencing platform (Gweon et al., 2015)

Shown to perform better than QIIME2 - See this paper

1. Latest Update

UPDATE (29 April 2025) - PIPITS 4.0

Significant changes.

  • The latest UNITE 10.0 (version 19.02.2025) has been added and is the default DB version.
  • The default classifier is now SINTAX. It is much quicker and gives results very comparable to those of RDP Classifier. This change was made because the UNITE database has grown massively.
  • Changes to how PIPITS runs. See below.

2. Synopsis

PIPITS is an automated pipeline designed for analyzing fungal Internal Transcribed Spacer (ITS) sequences generated on the Illumina sequencing platform. It streamlines the process from raw reads to taxonomically assigned OTU tables.

Key features:

  • Automated processing including read joining, quality filtering, ITS extraction, OTU clustering, chimera removal, and taxonomic assignment.
  • Utilizes tools like ITSx, VSEARCH (SINTAX), and RDP Classifier.
  • Automatically downloads the most recent UNITE fungal database.
  • Consists of just 4 main commands for ease of use.

3. Requirements

  • Operating System: POSIX-compatible system (e.g., Linux, macOS). PIPITS does not run on Windows.
  • Memory: Minimum 16 GB RAM, preferably 32 GB or more, especially with recent large database versions.
  • Software: Conda package manager.
  • Python: Python 3.10.
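
The memory requirement can be checked before installing. The following is a minimal sketch, assuming a Linux system that exposes /proc/meminfo (macOS does not); the function names are hypothetical and not part of PIPITS.

```shell
# Print total RAM in whole GiB, read from a Linux-style meminfo file.
# Assumption: /proc/meminfo exists (Linux only); a different file can be
# passed as the first argument.
total_ram_gib() {
    meminfo="${1:-/proc/meminfo}"
    # MemTotal is reported in kB; convert to GiB and round to the nearest integer.
    awk '/^MemTotal:/ { printf "%.0f\n", $2 / 1024 / 1024 }' "$meminfo"
}

# Succeed only if the machine meets the 16 GB minimum stated above.
has_enough_ram() {
    ram="$(total_ram_gib "$@")"
    [ -n "$ram" ] && [ "$ram" -ge 16 ]
}

# Usage (Linux):
#   has_enough_ram || echo "Warning: less than 16 GB RAM detected" >&2
```
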

4. Installing PIPITS

4.1. Install

It is recommended that you run PIPITS in a conda environment so that its dependencies are contained in this "sandbox". This means that you don't interfere with your existing system and you don't need admin rights. Don't worry, it's easy: just type the following command.

EXPLANATION: install PIPITS and its dependencies and create a Conda environment (here the environment is named "pipits_env", but you can choose any name you wish). PIPITS is compatible with Python 3 only, so add "python=3.10" as below:

conda create -n pipits_env --channel bioconda --channel conda-forge --channel defaults python=3.10 pipits

4.2. Test PIPITS

PIPITS is divided into three consecutive parts:

  1. Prepping raw sequences: join, convert, quality filter etc.
  2. Fungal ITS extraction [OPTIONAL]: remove conserved regions
  3. Process the reads to produce an OTU abundance table and the taxonomic assignment table for downstream analysis

Let's test it with a very small test dataset to ensure everything is set up correctly.

EXPLANATION: Download & extract a test dataset

wget https://sourceforge.net/projects/pipits/files/PIPITS_TESTDATA/pipits_test.tar.gz -O pipits_test.tar.gz
tar xvfz pipits_test.tar.gz

EXPLANATION: Get into the Conda environment you've just created, and run PIPITS.

cd pipits_test
conda activate pipits_env
pipits_createreadpairslist -i rawdata -o readpairslist.txt
pipits_prep -i rawdata -o out_prep -l readpairslist.txt
pipits_funits -i out_prep/prepped.fasta -o out_funits -x ITS2 -v -r
pipits_process -i out_funits/ITS.fasta -o out_process -v -r

Some rare setups (e.g., installation in user-level folders of dated server distributions) cause pipits_process to fail while converting to biom format. The issue can be solved by updating the fresh installation from within the environment: conda update pipits.
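
If one of the test commands above is reported as "not found", a quick preflight check can show which of the four main commands is missing from the activated environment. This is a small sketch using only the standard `command -v` builtin; the helper name `missing_commands` is hypothetical.

```shell
# Print every command name given as an argument that is NOT found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Usage, inside the activated environment:
#   missing_commands pipits_createreadpairslist pipits_prep pipits_funits pipits_process
# No output means all four commands are available.
```
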

5. Running PIPITS

5.1. Sequence Preparation

Illumina reads are generally provided as demultiplexed FASTQ files, where the Illumina software (BASESPACE) splits the reads into separate files, one for each barcode. Make sure the files are in a single directory (e.g. rawdata).

EXPLANATION: pipits_createreadpairslist generates a tab-delimited text file for all read-pairs from the directory containing your raw sequences

pipits_createreadpairslist -i rawdata -o readpairslist.txt
Note
  1. The command produces a tab-delimited file with three columns denoting forward and reverse read filenames and sample IDs for the pairs
  2. Prior to running the command, you need to ensure that the raw data are either uncompressed (“.fastq”) or compressed with bz2 or gz (“.fastq.bz2”, “.fastq.gz”). Sample IDs are taken from the characters preceding the first underscore in each filename
  3. After running pipits_createreadpairslist, check the resulting file ("readpairslist.txt") to confirm that the correct filenames and the desired sample IDs are listed. No duplicate sample IDs are allowed
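
Because duplicate sample IDs are not allowed, it can help to preview the IDs before running the command. The sketch below assumes the rule described in the notes (the sample ID is everything before the first underscore of the filename); the function names and the "_R1" filename pattern in the usage comment are hypothetical, not part of PIPITS.

```shell
# Derive a sample ID from a read filename: everything before the first underscore.
sample_id_of() {
    name="$(basename "$1")"
    printf '%s\n' "${name%%_*}"
}

# Read sample IDs (one per line) on stdin and print those that occur more than once.
duplicate_ids() {
    sort | uniq -d
}

# Usage: list sample IDs that would clash among the forward reads in rawdata/.
# Assumption: forward-read filenames contain "_R1".
#   for f in rawdata/*_R1*; do sample_id_of "$f"; done | duplicate_ids
```

An empty result means every forward-read file maps to a distinct sample ID.
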

EXPLANATION: Once we have the list file ("readpairslist.txt"), we can then begin to "prepare" the sequences:

pipits_prep -i rawdata -o out_prep -l readpairslist.txt
Note
  1. Read-pairs are joined by examining the overlapping regions of sequences
  2. The resulting assembled reads are then quality filtered
  3. The header of each read is then relabelled with an index number followed by a Sample ID
  4. The resulting files are converted into FASTA and merged into a single file to produce the final output file "prepped.fasta" in the output directory
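
A simple sanity check after this step is to count how many sequences survived joining and quality filtering. The sketch below counts FASTA header lines with standard `grep`; the helper name `count_fasta_records` is hypothetical.

```shell
# Count the records in a FASTA file by counting header lines (those starting with ">").
count_fasta_records() {
    # grep -c exits non-zero when the count is 0, so ignore its exit status.
    grep -c '^>' "$1" || true
}

# Usage:
#   count_fasta_records out_prep/prepped.fasta
```
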

5.2. [OPTIONAL] Fungal ITS extraction

The output from pipits_prep is the input for this step. You must also tell the script which ITS subregion (ITS1 or ITS2) to extract. This step can take a very long time, depending on the number of sequences and your system. Removing the conserved regions has been shown to improve taxonomic resolution, but the step is optional: you can skip it and proceed directly to the next step (pipits_process).

EXPLANATION: the input file (indicated with "-i") is the resulting file from the previous step

pipits_funits -i out_prep/prepped.fasta -o out_funits -x ITS2
Note
  1. The selected subregion is extracted with ITSx and, where necessary, sequences are re-orientated to the 5’ to 3’ direction. It is worth noting that ITSx uses HMMER3 (Mistry et al., 2013) to compare input sequences against a set of models built from a number of different subregions of ITS sequences found in various organisms. This makes ITSx an ideal tool both for extracting the desired ITS subregions and for filtering for specific groups of organisms. It also means that while PIPITS has been created with the analysis of fungal amplicons in mind, it could be adapted for the analysis of other organism groups where ITS is used as a marker by changing the ITSx settings and reference databases
  2. Having extracted the subregion, sequences are re-inflated to reflect their original abundances. To date, the longest sequenceable reads from the Illumina technology are 300 bp x 2 which is not sufficient to sequence both ITS1 and ITS2 and to have an overlapping region to join them. For this reason the program supports only a single subregion extraction mode
  3. PIPITS will include those sequences that do not have any conserved region detected. This is so that ALL sequences are taken into account.

5.3. Process sequences

EXPLANATION: This is the final step involving clustering and assigning of taxonomy.

(if you have run pipits_funits)
pipits_process -i out_funits/ITS.fasta -o out_process

(if you have NOT run pipits_funits)
pipits_process -i out_prep/prepped.fasta -o out_process
Note
  1. Input sequences are dereplicated
  2. Short (< 100 bp) and unique (singleton) sequences are removed
  3. The sequences are clustered at 97% PID
  4. The resulting representative sequences for each cluster are subjected to chimera detection and removal
  5. The input sequences are mapped onto the chimera-free representative sequences at 97% PID
  6. The representatives are taxonomically assigned against the UNITE fungal ITS reference dataset, using SINTAX by default (since PIPITS 3.1) or RDP Classifier
  7. The results are translated into two types of OTU abundance tables:
    • “OTU abundance table”: an OTU is defined as a cluster of reads at a user-defined identity threshold (typically 97% sequence identity), motivated by the expectation that such clusters correspond approximately to species.
    • “phylotype abundance table”: an OTU is defined as a cluster of sequences binned into the same taxonomic assignment.
  8. If you have memory issues, try increasing the maximum memory with "--Xmx". For example, "--Xmx 4G".
  9. Once everything has finished, you can leave the Conda environment by typing
conda deactivate
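
The choice between the two pipits_process invocations above (with or without pipits_funits) can be scripted. The following is a minimal sketch that assumes the output directory names used in this README (out_prep and out_funits); the helper name `process_input` is hypothetical.

```shell
# Print the FASTA file that pipits_process should read, given a working
# directory laid out as in the commands above (out_funits/ and out_prep/).
process_input() {
    dir="${1:-.}"
    if [ -s "$dir/out_funits/ITS.fasta" ]; then
        printf '%s\n' "$dir/out_funits/ITS.fasta"      # pipits_funits was run
    elif [ -s "$dir/out_prep/prepped.fasta" ]; then
        printf '%s\n' "$dir/out_prep/prepped.fasta"    # pipits_funits was skipped
    else
        echo "no input found: run pipits_prep first" >&2
        return 1
    fi
}

# Usage (inside the Conda environment, before deactivating it):
#   input="$(process_input)" && pipits_process -i "$input" -o out_process
```
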

6. Misc

You can tweak parameters and there are several options for each of the above steps. To view them, type "-h" after each command.

pipits_prep -h

7. Citation

Please cite:

Hyun S. Gweon, Anna Oliver, Joanne Taylor, Tim Booth, Melanie Gibbs, Daniel S. Read, Robert I. Griffiths and Karsten Schonrogge, PIPITS: an automated pipeline for analyses of fungal internal transcribed spacer sequences from the Illumina sequencing platform, Methods in Ecology and Evolution, DOI: 10.1111/2041-210X.12399

8. Update History

UPDATE (29 April 2025) - PIPITS 4.0

Significant changes.

  • The latest UNITE 10.0 (version 19.02.2025) has been added and is the default DB version.
  • The default classifier is now SINTAX. It is much quicker and gives results very comparable to those of RDP Classifier. This change was made because the UNITE database has grown massively.
  • Slight changes to how PIPITS runs. See below.
UPDATE (18 August 2024) - PIPITS 3.1

Significant changes.

  • UNITE 10.0 has been added and is the default DB version until further update.
  • The default classifier is now SINTAX. It is much quicker and gives results very comparable to those of RDP Classifier. This change was made because the UNITE database has grown massively.
UPDATE (11 August 2023) - PIPITS 3.0
  • Just a slight change in the installation instruction, namely from python=3.6 to python=3.8 to avoid "SyntaxError: invalid syntax"
UPDATE (27 November 2022) - PIPITS 3.0

Some significant changes!

  • PIPITS now classifies sequences against UNITE 9.0 (205,888 fungi & 326,300 Eukaryotes - see below).
  • The database now includes non-fungi (i.e. Eukaryotes) to ensure that the infamous OTUs with a mere "k__Fungi" could be better classified. With the inclusion, you will now see OTUs classified as "k__Fungi", "k__Viridiplantae" or "k__unidentified". Do note that depending on your choice of primers, you may pick up sometimes quite a lot of plant ITS sequences (no primers are perfectly specific for fungi).
  • However, because of the significant increase in the size of the database, PIPITS now requires at least 16GB of RAM (preferably more e.g. 32GB). This may not suit those who used to enjoy running PIPITS on their laptop. Sorry... time has moved on!
  • Also the increase in the size of the database meant that RDP Classifier can take a very long time to process the data. For this reason, you now have an option to run SINTAX (VSEARCH) to assign sequences. This is remarkably quick!
  • If you find that RDP Classifier is taking too long, please use "--taxassignmentmethod sin" to just run SINTAX (VSEARCH). That said, a SINTAX confidence threshold of 0.85 does not equate to 0.85 in RDP Classifier, though in my experience the differences are small. Do note that SINTAX is a non-Bayesian taxonomic classifier.
  • I will look to incorporate other classifiers such as CONSTAX in the future!
UPDATE (15 February 2022) - PIPITS 2.8
  • UNITE 8.3 added. PIPITS now classifies sequences against UNITE 8.3 (98,090 sequences)
UPDATE (28 April 2020) - PIPITS 2.7
  • WARCUP phylotype table bug fixed. It now produces a correctly aggregated table (it used to aggregate at the Family level, but now it aggregates at the Species level)
UPDATE (26 April 2020) - PIPITS 2.6
  • BIOM to phylotype table bug fixed. After BIOM (one of the dependencies) was upgraded, phylotype table inadvertently got filled with normalised values. This now has been remedied, and it's now back to the previous behaviour. For those who just want to convert OTU tables to phylotype tables without re-running PIPITS again, please update PIPITS, and (within pipits_env) then:
pipits_phylotype_biom -i otu_table.biom -o phylotype_table.txt -l 6
UPDATE (19 Feb 2020) - PIPITS 2.5
  • New UNITE DB (released on 2020-02-04). PIPITS will now download the new UNITE DB. A few minor bugs have also been fixed.
UPDATE (28 May 2019) - PIPITS 2.4
  • BIOM files are now in the HDF5 format. OTU tables in BIOM format are now in HDF5 rather than JSON format. OTU tables in HDF5 BIOM are supported by PHYLOSEQ and QIIME2.
UPDATE (22 April 2019) - PIPITS 2.3
  • PIPITS_PROCESS automatically downloads the UNITE database (the most recent version), so there is no need to meddle with environment variables anymore. Just run the commands and it will take care of the database. You can still use an older database with the --unite option (see help with -h).
  • PIPITS_FUNITS exploits multiple CPUs. It's an experimental feature, so do use it with care. You can use multiple CPUs with the usual -t NUMBER_OF_CPUS option.
  • Update PIPITS with conda update --channel bioconda --channel conda-forge --channel defaults pipits then check you have version 2.3 installed by: conda list pipits
