Roarytutorial
Abstract
A description of the genetic makeup of a species based on a single genome is often insufficient because it ignores the
variability in gene repertoire among multiple strains. The estimation of the pangenome of a species is a solution to this
issue as it provides an overview of genes that are shared by all strains and genes that are present in only some of the
genomes. These different sets of genes can then be analyzed functionally to explore correlations with unique phenotypes
species (represented by an open pangenome), (ii) providing information on shared and unique traits of strains within a species (exemplified by core and accessory genes), and, more recently, (iii) using it to identify species boundaries (represented by a high frequency of core genes).

These large-scale applications of a pangenome necessitate fast and accurate software that can analyze and produce results for tens or hundreds of lineages in a reasonable amount of computational time. One such software is Roary (Page et al. 2015), a Linux-native software that takes as inputs GFF3 (General Feature Format version 3) files (easily obtainable from NCBI) and outputs a series of files with statistics on genes shared by all or most lineages (core and soft core genes) or only by some genomes (accessory genes, further subdivided into shell and cloud genes). This software is complemented by Python scripts and other software that produce a graphical view of the results.

Although other software is available for pangenome reconstructions, such as PGAP, PanX, get_homologues, and Pantools (Zhao et al. 2012; Contreras-Moreira and Vinuesa 2013; Sheikhizadeh et al. 2016; Ding et al. 2018), we found Roary to be the simplest and most flexible to use and,

Protocol

Step 1: Installation of Roary
Roary is a Linux-native software that can be installed on Linux, MacOSX, and Windows machines in a variety of ways. In this section, we will provide a series of commands that will allow you to install Roary in a Linux environment (see Step 5 for installation in different operating systems) (we show commands to be typed with a different font). The easiest way to run Roary is to install it in a Linux environment using the package manager “conda,” which is part of the Anaconda distribution. This will also work in a MacOSX environment and the Linux Subsystem in Windows with very minor modifications (see Step 5).

The first step is to download Anaconda (https://www.anaconda.com/distribution/; last accessed December 9, 2019) for the appropriate operating system and select the most recent version of Python that is supported and updated regularly (currently it is Python 3.7) (e.g., for Linux: Anaconda3-2019.03-Linux-x86_64.sh). Open a terminal window and type bash ~/Downloads/Anaconda3-2019.03-Linux-x86_64.sh (if the file was downloaded in a different directory, change ~/Downloads to the correct location). Press
© The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
Mol. Biol. Evol. 37(3):933–939 doi:10.1093/molbev/msz284 Advance Access publication December 17, 2019
Sitto and Battistuzzi . doi:10.1093/molbev/msz284                                                                                MBE
Enter to start the installation and space bar to visualize the license agreement. You will be prompted to accept the default location for installation by pressing Enter (or change the installation location), and the installation will start (it can take a minute or so to start seeing progress on the screen). Finally, answer “yes” to initialize Anaconda3 by running conda init and, at the end, you will see “Thank you for installing Anaconda!” Enter the command source ~/.bashrc for the installation to take effect. These instructions can also be found at https://docs.anaconda.com/anaconda/install/linux/; last accessed December 9, 2019. To test the installation, type in the Linux terminal conda -V and it will return the version of conda you just installed. Once conda has been installed correctly, the next step is to create an environment in which Roary will run. This can be achieved with the following command at the command prompt (shown in Linux as $): conda create --name Roaryenv (note that you can use any name for the environment instead of Roaryenv). In order to work within this environment, you will need to activate it (this step will need to be repeated every time you open a new terminal window): source activate Roaryenv.

Next, install Roary in your newly created environment with the following five commands (four “conda config” commands followed by “conda install”):

      conda config --add channels r
      conda config --add channels defaults
      conda config --add channels conda-forge
      conda config --add channels bioconda
      conda install roary

To check whether installation is successful, type roary -h to visualize the list of parameters Roary uses (fig. 1). The location in which Roary is now installed does not have to be the one that will hold your input and output files. We suggest creating a separate directory in which to upload the input files and where the output files will be saved.

Step 2: Input Files
The format of the input files for Roary is GFF3 (General Feature Format version 3). This format specifies a series of fields in a fixed order, which must be followed strictly for Roary to accept the input file (see https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md; last accessed December 9, 2019 for a description of the format). There are two primary ways to obtain GFF3 files: from the NCBI website or from the software Prokka by converting .fna files into GFF3 (see Step 5). An easy way to obtain the input files without additional software installation is to download genome *.gbff files from NCBI and then run the bp_genbank2gff3.pl script. This is a Perl script that is installed along with Roary and that can be found in the Roary conda environment in the directory “bin.” It is also available through BioPerl (https://bioperl.org/INSTALL.html; last accessed December 9, 2019) and can be easily run in the terminal window. Note that for this script to work, Perl needs to be installed in the system you are using (https://www.activestate.com/products/activeperl/downloads/; last accessed December 9, 2019). For example, let us say that you are interested in estimating the pangenome of three strains of Bifidobacterium animalis: A6, KLDS2.0603, and RH. From the Genome function in NCBI (https://www.ncbi.nlm.nih.gov/genome; last accessed December 9, 2019) you can browse by organism and search for B. animalis. The individual assemblies can
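As a rough illustration of the strict GFF3 layout mentioned in Step 2, the following sketch counts feature lines that do not have the nine tab-separated columns defined by the GFF3 specification. The function name check_gff3_columns and the example records are hypothetical, and passing this check does not guarantee that Roary will accept a file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count feature lines that lack the nine
# tab-separated GFF3 columns. Directive/comment lines (starting with #)
# and blank lines are skipped; scanning stops at the ##FASTA directive.
check_gff3_columns() {
  awk -F '\t' '
    /^##FASTA/   { exit }     # sequence section follows; stop scanning
    /^#/ || /^$/ { next }     # skip directives, comments, blank lines
    NF != 9      { bad++ }    # a feature line must have 9 columns
    END          { print bad + 0 }
  ' "$1"
}

# Made-up example file: one complete feature line, one truncated line.
tmp_gff="$(mktemp)"
printf '##gff-version 3\n' > "$tmp_gff"
printf 'contig1\tProkka\tCDS\t1\t300\t.\t+\t0\tID=gene_1\n' >> "$tmp_gff"
printf 'contig1\tProkka\tCDS\t400\t900\n' >> "$tmp_gff"
printf '##FASTA\n>contig1\nATGC\n' >> "$tmp_gff"

bad_lines="$(check_gff3_columns "$tmp_gff")"
echo "feature lines without 9 columns: $bad_lines"
rm -f "$tmp_gff"
```

A count of zero for every input file is a reasonable minimum to aim for before starting Roary.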
Estimating Pangenomes with Roary . doi:10.1093/molbev/msz284                                                                    MBE
be visualized by selecting “Prokaryotes.” After having identified the strains of interest, select the GenBank FTP site on the right-hand side and download the *.gbff.gz file for each of them (fig. 2).

Next, move all the downloaded gbff files into a single directory (if you have used a Windows machine to download the files, upload them into the Linux machine) and, from terminal, issue the command perl bp_genbank2gff3.pl *.gbff.gz.

If you are using the perl script within the Roary environment, you will need to specify the path to the script (e.g., perl /home/Roaryenv/bin/bp_genbank2gff3.pl). To identify the path of this perl script, use the command which bp_genbank2gff3.pl. If your current working directory is not the same as the one where the gbff files are, either navigate into that directory and use the above command or add the path of the directory before the “*” (e.g., perl /home/Roaryenv/bin/bp_genbank2gff3.pl /home/Roary/Inputs/*.gbff.gz). This command will create as many output files as there are input files, all with the extension *.gff. These will be the input files for Roary.

Step 3: Parameters and Commands
Roary can be run very easily with a single short command: roary *.gff (remember to activate the Roary environment [source activate Roaryenv] every time you open a new terminal window).

This command will run Roary with default parameters (see below) from within a directory that contains all the gff3-converted files obtained from Step 2. All output files generated will be located in this same directory, which could make downstream analyses more difficult. To specify an output directory, add the option -f to the command: roary -f output_dir *.gff (where output_dir is user-defined).

Options in Roary fall broadly into three categories: file access, analysis settings, and visualization. The “file access” settings are the least likely to need modification. They include those that allow users to manipulate the location of inputs/outputs and the location (path) of the software that Roary depends on. Roary requires mcl, blastp, mcxdeblast, and makeblastdb, which are installed along with Roary within the conda environment. However, users can point to a different location for this software, if preferred. Additionally, users can provide directory names for outputs (option -f).

The “analysis settings” parameters allow users to refine the sensitivity of the analysis itself to identify core and accessory genes. These are most likely the parameters that users will want to modify to explore the robustness of the results to variations. For computational speed, the -p option will allow users to select the number of threads to use during the computation. Many new computers are multicore with multiple threads per core, so selecting >1 (e.g., roary -f output_dir -p 10 *.gff) for this parameter is likely to speed up the analysis. For the pangenome calculation, the two most important parameters are the threshold (in percentage) of isolates required to define a core gene (-cd: default is 99%) and the minimum percentage identity for sequence comparisons performed by BlastP (-i: default is 95%). Decreasing the threshold of isolates will increase the number of core genes identified, and increasing the minimum identity will partition the genes into more and smaller clusters.

Finally, to visualize results, Roary has a series of options. The standard option, which requires no additions to the previous command, will produce a series of text outputs (see Step 4). If the user desires an additional graphical output, the option -r can be added to produce plots using R (this option will need R and ggplot2 to be installed). Note that the graphs can also be obtained after the results have already been produced because Roary will output R formatted files in addition to text files. Finally, one of the most useful parameters for visualization is the possibility of creating alignments from core genes (options -e and -n). Such files are potentially important for downstream analyses including phylogenetic tree reconstruction and SNP identification. Additional visualization tools are provided as separate scripts and packages (e.g., roary_plots.py) that can be found on the main Roary website (https://sanger-pathogens.github.io/Roary/; last accessed December 9, 2019).

uniquely present in one set of strains and not others. This kind of analysis can be done by calling query_pan_genome -a difference --input_set_one 1.gff,2.gff --input_set_two 3.gff,4.gff -g clustered_proteins (where the *.gff files are the names of the genomes of interest in two subsets). Finally, the same query_pan_genome function can be used to output genes that are unique, shared by all, or shared by some of the strains (e.g., query_pan_genome -a union -g clustered_proteins *.gff).

A good description of all the output files created by Roary is available in the supplementary material of the Roary publication (Page et al. 2015) and, in a less detailed way, on the
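The commands from Steps 3 and 4 can be assembled and inspected before they are run. The sketch below only builds and prints the command lines. The function names (build_roary_cmd, join_by_comma), the strain file names, and the chosen values (4 threads, a 90% core threshold) are illustrative assumptions rather than recommendations, and the input sets are assumed to be comma-separated lists without spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assemble a Roary call from the Step 3 options.
# Arguments: output dir (-f), threads (-p), core threshold in % (-cd),
# minimum BlastP identity in % (-i), then the GFF3 input files.
build_roary_cmd() {
  local out_dir="$1" threads="$2" core_pct="$3" identity="$4"
  shift 4
  printf 'roary -f %s -p %s -cd %s -i %s' \
    "$out_dir" "$threads" "$core_pct" "$identity"
  printf ' %s' "$@"
  printf '\n'
}

# Hypothetical helper: join file names with commas, the form assumed
# here for --input_set_one and --input_set_two of query_pan_genome.
join_by_comma() {
  local IFS=','
  printf '%s\n' "$*"
}

# Step 3: the defaults named in the text are -cd 99 and -i 95; the core
# threshold is lowered to 90 here as an example of a robustness check.
roary_cmd="$(build_roary_cmd output_dir 4 90 95 A6.gff KLDS2.0603.gff RH.gff)"

# Step 4: genes present in set one (1.gff, 2.gff) but not in set two.
set_one="$(join_by_comma 1.gff 2.gff)"
set_two="$(join_by_comma 3.gff 4.gff)"
query_cmd="query_pan_genome -a difference --input_set_one $set_one --input_set_two $set_two -g clustered_proteins"

echo "$roary_cmd"
echo "$query_cmd"
```

Printing the commands first makes it easy to compare runs that differ in a single parameter; a printed line can then be pasted at the prompt with the Roary environment active.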
File → “Import appliance” and select the VM (*.ova file) you downloaded. To start the VM, click on the green arrow icon and a new window will open showing the VM desktop. On the left-hand side, click on the terminal window icon (fig. 3) and type sudo apt-get install virtualbox-guest-utils (the password is manager).

To be able to use Roary within the VM, you will follow the Linux installation instructions. However, this requires that files are shared between the host (Windows) and the VM. To achieve this, a shared directory has to be created and used to exchange files. Within the Windows machine, go to the Anaconda website and download the Linux version as shown in Step 1. Save this file in a directory you will share with the VM. Then, switch to the VM, select Devices → “Shared folders” → “Shared folder settings” and click on the “Add folder” icon on the right-hand side. Provide the path of the location of the Anaconda installer, assign a name to the shared folder (e.g., RoaryVM) and a path where it will be mounted (e.g., /mnt/share/), and check “auto mount” and “make permanent” to ensure that the folder will be recognized upon restart of the VM. Then, in the VM terminal, type sudo mkdir /mnt/share/ (the password is again manager) and then sudo mount -t vboxsf RoaryVM /mnt/share/. If the shared folder is not visible, repeat the mounting command.

The contents of the shared directory are now visible from the VM (ls /mnt/share/) and can be used to proceed with a normal Roary installation for Linux. Input and output files for Roary can be exchanged through the shared folder if the path is provided at the command line (e.g., roary -f /mnt/share/RoaryVM/output /mnt/share/RoaryVM/input/*.gff).

Prokka to Create Input Files
An alternative to converting gbff files into input files for Roary is to use Prokka. This is particularly useful when gbff files are not already available, as may be the case for sequencing projects that are in progress. First, using terminal in Linux (or in MacOSX or Windows) type conda install -c conda-forge -c bioconda prokka. To check whether Prokka was installed correctly, type prokka -h and the menu options of Prokka will be listed.

Next, download *.genomic.fna.gz files from NCBI for the strains of interest, extract them, and upload these uncompressed files into the Linux/MacOSX/Windows machine. In the terminal window type: prokka --kingdom Bacteria --outdir prokka_GCA_XXXXX --genus YYYYY --locustag GCA_XXXXX GCA_XXXXX_ASMZZZZZ_genomic.fna where XXXXX is the genome and ZZZZZ is the assembly number of one of the strains and YYYYY is the genus of the same strain (e.g., for one of the three B. animalis strains mentioned in Step 2: prokka --kingdom Bacteria --outdir prokka_GCA_000816205 --genus Bifidobacterium --locustag GCA_000816205 GCA_000816205.1_ASM81620v1_genomic.fna). Repeat for all the strains (each strain will take a few minutes to process). Each run will produce multiple output files, one of which is in the GFF3 format required by Roary.

Applications of a Pangenome

and Gelfand 2018). Defining prokaryotic species boundaries is a long-standing issue that, for now, has been approached using DNA similarity thresholds (e.g., average nucleotide identity measures; Jain et al. 2018). However, a pangenome approach has the advantage of adding an evolutionary perspective by considering not only identity (-i parameter in Roary) but also orthology/paralogy and gene flow (Bobay and Ochman 2017; Moldovan and Gelfand 2018).

Finally, pangenome results can be used to investigate the correlation between the spread of some genes and the traits they encode. A corollary software, Scoary (Brynildsrud et al. 2016), is available to work with Roary’s outputs to identify
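For the Prokka step described earlier, repeating the command for every strain can be scripted. The sketch below prints one Prokka command per genome file without running it. The function names (accession_from_fna, prokka_cmd) and the GENUS variable are hypothetical, and it assumes that file names follow the NCBI-style pattern of the example (accession, a dot, then version and assembly name) and that all strains share one genus.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the accession prefix used for --outdir and
# --locustag from an NCBI-style file name, e.g.
# GCA_000816205.1_ASM81620v1_genomic.fna -> GCA_000816205
accession_from_fna() {
  local base="${1##*/}"          # drop any leading directory
  printf '%s\n' "${base%%.*}"    # keep the text before the first dot
}

# Genus shared by all strains in this example (set by the user).
GENUS="Bifidobacterium"

# Hypothetical helper: print the Prokka command for one genome file.
prokka_cmd() {
  local fna="$1" acc
  acc="$(accession_from_fna "$fna")"
  printf 'prokka --kingdom Bacteria --outdir prokka_%s --genus %s --locustag %s %s\n' \
    "$acc" "$GENUS" "$acc" "$fna"
}

# List the genome files here; only the strain from the text is shown.
for fna in GCA_000816205.1_ASM81620v1_genomic.fna; do
  prokka_cmd "$fna"
done
```

Replacing the explicit list with *_genomic.fna would cover every uncompressed genome file in the current directory.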
Eggertsson HP, Jonsson H, Kristmundsdottir S, Hjartarson E, Kehr B, Masson G, Zink F, Hjorleifsson KE, Aslaug J, Adalbjorg J, et al. 2017. Graphtyper enables population-scale genotyping using pangenome graphs. Nat Genet. 49(11):1654–1660.
Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. 2018. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun. 9(1):5114.
Locey KJ, Lennon JT. 2016. Scaling laws predict global microbial diversity. Proc Natl Acad Sci U S A. 113(21):5970–5975.
McInerney JO, McNally A, O’Connell MJ. 2017. Why prokaryotes have pangenomes. Nat Microbiol. 2(4):17040.
Moldovan MA, Gelfand MS. 2018. Pangenomic definition of prokaryotic species and the phylogenetic structure of Prochlorococcus spp. Front Microbiol. 9:428.
Muzzi A, Donati C. 2011. Population genetics and evolution of the pan-genome of Streptococcus pneumoniae. Int J Med Microbiol.
scale prokaryote pan genome analysis. Bioinformatics 31(22):3691–3693.
Rodriguez-Valera F, Ussery DW. 2012. Is the pan-genome also a pan-selectome? F1000Res. 1:16.
Sheikhizadeh S, Schranz ME, Akdel M, de Ridder D, Smit S. 2016. PanTools: representation, storage and exploration of pan-genomic data. Bioinformatics 32(17):i487–i493.
Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, et al. 2005. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome.” Proc Natl Acad Sci U S A. 102(39):13950–13955.
Tettelin H, Riley D, Cattuto C, Medini D. 2008. Comparative genomics: the bacterial pan-genome. Curr Opin Microbiol. 11(5):472–477.
Vernikos G, Medini D, Riley DR, Tettelin H. 2015. Ten years of pan-