MODULE-4
TOPIC: BIOINFORMATICS RESOURCES: NCBI, EBI, EXPASY, RCSB
           Selected resources
   Broad Institute of Harvard and MIT
   DNA Databank of Japan (DDBJ)
   The European Bioinformatics Institute (EBI)
   ExPASy Bioinformatics Resource Portal
   National Center for Biotechnology Information (NCBI)
   Ingenuity Pathway Analysis (IPA)
   IPA is a web-based software application for the analysis, integration, and interpretation of data derived
    from 'omics experiments. Access is provided to Tufts University and Tufts Medical Center researchers.
   Oncomine
   Cancer microarray database with more than 700 independent data sets and a set of analysis functions
    that compute gene expression signatures, clusters and gene-set modules. Users must register for
    username and password at site.
   Find More Resources
   Nucleic Acids Research Database Issue
   Each year, the journal Nucleic Acids Research devotes an issue to descriptions of new molecular biology
    databases and updates of previously reviewed molecular biology databases.
   Nucleic Acids Research Database Summary Papers
   Searchable collection of databases that have been described in the journal, Nucleic Acids Research.
    Databases are organized by category, and each entry has a description of the database and/or a link
    to the review in Nucleic Acids Research.
   Nucleic Acids Research Web Server Issue
   Each year, the journal Nucleic Acids Research dedicates an issue to reports on web-based software
    resources for analysis and visualization of molecular biology data.
   Online Bioinformatics Resources Collection (OBRC)
   The OBRC, created and maintained by the University of Pittsburgh Health Sciences Library System,
    contains brief descriptions of and links for more than 2,400 bioinformatics databases and tools.
NCBI (https://www.ncbi.nlm.nih.gov/)
   The National Center for Biotechnology Information (NCBI) is part of the United States
    National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH).
   It is approved and funded by the government of the United States. The NCBI is located
    in Bethesda, Maryland and was founded in 1988.
   The NCBI houses a series of databases relevant to biotechnology and biomedicine and is
    an important resource for bioinformatics tools and services. Major databases
    include GenBank for DNA sequences and PubMed, a bibliographic database for
    biomedical literature.
   Other databases include the NCBI Epigenomics database. All these databases are
    available online through the Entrez search engine.
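Entrez can also be queried programmatically through the NCBI E-utilities. The sketch below only builds an ESearch request URL and does not contact the server; the helper name `build_esearch_url` and the example search term are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=5):
    """Return an ESearch URL asking Entrez for record IDs that match `term`.

    `db` is an Entrez database name (for example "pubmed" or "nucleotide"),
    `retmax` caps the number of IDs returned. Hypothetical helper.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


url = build_esearch_url("pubmed", "bioinformatics")
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) returns an XML list of matching record identifiers.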
•   A comprehensive website for biologists including:
   biology-related databases,
   tools for viewing and analyzing data,
   automated systems for storing and retrieving data.
•   NCBI, along with EBI and CIB-DDBJ, forms the International Nucleotide Sequence Database
    Collaboration (INSDC), which acts as the chief working unit and information centre. The
    collaboration has 3 member databases:
•   GenBank
•   European Molecular Biology Laboratory (EMBL) nucleotide sequence database
•   DNA Data Bank of Japan (DDBJ)
   A Science "Primer" gives access to general
    definitions and introductory information
    regarding the branches of science included
    in bioinformatics.
   Many bioinformatics terms are defined in this
    section in a clear and basic manner,
    making this Primer an excellent first resource.
   "Databases and Tools" is a
    complete and well-ordered listing of
    accessible information.
    EMBL's European Bioinformatics Institute (EMBL-EBI)
    https://www.ebi.ac.uk/
   EMBL-EBI makes the world’s public biological
    data freely available to the scientific community
    via a range of services and tools, performs basic
    research and provides professional training in
    bioinformatics.
   It is part of the European Molecular Biology
    Laboratory (EMBL), an international, innovative
    and interdisciplinary research organization
    funded by over 20 member states, prospect
    and associate member states.
   It is situated on the Wellcome Genome Campus in
    Hinxton, Cambridge, UK, one of the world’s
    largest concentrations of scientific and
    technical expertise in genomics.
What they do….
   provide freely available data and bioinformatics services to the scientific
    community.
   contribute to the advancement of biology through investigator-driven
    research.
   provide advanced bioinformatics training to scientists at all levels.
   help disseminate cutting-edge technologies to industry.
   support the coordination of biological data provision throughout Europe.
   The European Nucleotide Archive and the protein sequence
    resource UniProt (then known as Swiss-Prot–TrEMBL) were the original
    EMBL-EBI databases. Since then, the EMBL-EBI has played a major part
    in the bioinformatics revolution.
Tools & Data Resources
 Clustal Omega
    Multiple sequence alignment of DNA or protein sequences. Clustal Omega
     replaces the older ClustalW alignment tools
 InterProScan
    InterProScan searches sequences against InterPro's predictive protein
     signatures.
 BLAST [protein]
    Fast local similarity search tool for protein sequence databases.
 BLAST [nucleotide]
    Fast local similarity search tool for nucleotide sequence databases
 HMMER
    Fast sensitive protein homology searches using profile hidden Markov
     models (HMMs) for querying against both sequence and HMM target
     databases
Ensembl
   Genome browser, API and database, providing access to reference genome annotation.
UniProt
   A comprehensive resource for protein sequence and functional annotation
PDBe
   The European resource for the collection, organisation and dissemination of 3D structural data (from
    PDB and EMDB) on biological macromolecules and their complexes.
Europe PMC
   A database to search the worldwide life sciences literature.
Expression Atlas
   An added-value database that shows which genes/proteins are expressed under which conditions,
    and how expression differs between conditions.
ChEMBL
   An open data resource of binding, functional and ADMET bioactivity data.
Browse by type
    DNA & RNA
    Gene Expression
    Proteins
    Structures
    Systems
    Chemical biology
    Ontologies
    Literature
    Cross domain
     ExPASy SIB(https://www.expasy.org/)
     Swiss Bioinformatics resource portal
About Expasy
    Expasy is the bioinformatics resource portal of the SIB Swiss Institute of Bioinformatics.
    It is an extensible and integrative portal which provides access to over 160 databases and
     software tools, developed by SIB Groups and supporting a range of life science and clinical
     research domains, from genomics, proteomics and structural biology, to evolution and
     phylogeny, systems biology and medical chemistry.
The Expasy search engine
Expasy allows you to seamlessly
1)    query in parallel a subset of SIB databases through a single search, and to
2)    surface related information and knowledge from the complete set of >160 resources on the
      portal. Expasy provides information that is automatically aligned with the most recent release
      of each resource, thereby ensuring up-to-date information.
Some history
   Expasy was created in August 1993 - the dawn of the internet
    era. At that time, it was referred to as 'ExPASy, the Expert Protein
    Analysis System' as proteins were its primary focus. It was the first
    life science website - and among the 150 very first websites in the
    world!
   In June 2011, it became the SIB ExPASy Bioinformatics Resources
    Portal: a diverse catalogue of bioinformatics resources
    developed by SIB Groups.
   The current version of Expasy was released in July 2020 following
    a massive user study and taking into account design, user
    experience and architecture aspects: we thank all participants
    for their help in shaping Expasy 3.0!
    RCSB-PDB(https://www.rcsb.org/)
   The Protein Data Bank (PDB) was established as the 1st open access digital data
    resource in all of biology and medicine. It is today a leading global
    resource for experimental data central to scientific discovery.
   Through an internet information portal and downloadable data archive, the PDB provides
    access to 3D structure data for large biological molecules (proteins, DNA, and RNA).
    These are the molecules of life, found in all organisms on the planet.
   Knowing the 3D structure of a biological macromolecule is essential for understanding its
    role in human and animal health and disease, its function in plants and food and energy
    production, and its importance to other topics related to global prosperity and
    sustainability.
A Structural View of Biology
   This resource is powered by the Protein Data Bank archive: information about the 3D
    shapes of proteins, nucleic acids, and complex assemblies that helps students and
    researchers understand all aspects of biomedicine and agriculture, from protein synthesis
    to health and disease.
   As a member of the wwPDB, the RCSB PDB curates and annotates PDB data.
   The RCSB PDB builds upon the data by creating tools and resources for research and
    education in molecular biology, structural biology, computational biology, and beyond.
              MODULE-4
TOPIC: Databases, classifications and file formats
         What is a database?
• Databases are convenient systems to
  properly store, search and retrieve any
  type of data.
• A database helps to easily handle and share
  large amounts of data and supports large-
  scale analysis through easy access and data
  updating.
   What is a biological database?
• Biological databases are libraries of life sciences
  information, collected from scientific
  experiments, published literature, high-throughput
  experiment technology and
  computational analysis.
• They contain information from genomics,
  proteomics and microarray gene expression.
 What is expected from a database
• Sequence, functional, structural information,
  related bibliography
• Well Structured and Indexed information
• Well cross-referenced (with other databases)
• Periodically updated
• Tools for analysis and visualization
                Database Architecture
Layer                 Examples
Information system    Google, Entrez, SRS
Query system          Your search keywords
Storage system        Oracle, MySQL, PC binary files, Unix text files, bookshelves
Data                  GenBank flat file, PDB file, interaction record, title of a book, book
     Biological Databases- Types and Importance
   One of the hallmarks of modern genomic research is the generation of
    enormous amounts of raw sequence data.
   As the volume of genomic data grows, sophisticated computational
    methodologies are required to manage the data deluge.
   Thus, the very first challenge in the genomics era is to store and
    handle the staggering volume of information through the establishment
    and use of computer databases.
   A biological database is a large, organized body of persistent data,
    usually associated with computerized software designed to update,
    query, and retrieve components of the data stored within the system.
   A simple database might be a single file containing many records, each
    of which includes the same set of information.
   The chief objective of the development of a database is to organize
    data in a set of structured records to enable easy retrieval of
    information.
               Types of Biological Databases
Based on their contents, biological databases can be roughly divided into
two categories:
1. Primary databases
  Primary databases are also called archival databases.
   They are populated with experimentally derived data such as
    nucleotide sequence, protein sequence or macromolecular structure.
   Experimental results are submitted directly into the database by
    researchers, and the data are essentially archival in nature.
   Once given a database accession number, the data in primary
    databases are never changed: they form part of the scientific record.
   Examples: ENA, GenBank and DDBJ (nucleotide sequence)
   ArrayExpress Archive and GEO (functional genomics data)
   Protein Data Bank (PDB; coordinates of three-dimensional
    macromolecular structures)
2. Secondary databases
  Secondary databases comprise data derived from the results of
   analysing primary data.
   Secondary databases often draw upon information from numerous
    sources, including other databases (primary and secondary),
    controlled vocabularies and the scientific literature.
   They are highly curated, often using a complex combination of
    computational algorithms and manual analysis and interpretation to
    derive new knowledge from the public record of science.
•   Examples
   InterPro (protein families, motifs and domains)
   UniProt Knowledgebase (sequence and functional information on
    proteins)
   Ensembl (variation, function, regulation and more layered onto whole
    genome sequences)
• Many data resources have both primary
and secondary characteristics. For
example, UniProt accepts primary sequences
derived from peptide sequencing experiments.
However, UniProt also infers peptide sequences
from genomic information, and it provides a wealth
of additional information, some derived from
automated annotation (TrEMBL), and even more
from careful manual analysis (Swiss-Prot).
• Specialized databases are those
that cater to a particular research interest. For
example, FlyBase, the HIV sequence database, and the
Ribosomal Database Project are databases that
specialize in a particular organism or a particular
type of data.
                  GenBank (Genetic Sequence Databank)
• GenBank® is the genetic sequence database at the National Center for
  Biotechnology Information (NCBI).
• It was established in the year 1982 and is now maintained by the National Center for
  Biotechnology Information (NCBI).
• DNA sequences can be submitted to GenBank using several different methods.
• It contains publicly available nucleotide sequences for more than 240 000 named
  organisms, obtained primarily through submissions from individual laboratories and
  batch submissions from large-scale sequencing projects.
• It has a flat file structure, that is, an ASCII text file, readable and downloadable by
  both humans and computers.
• There are two main ways of making batch sequence submissions to GenBank: NCBI’s
  Barcode Submission Tool (BarSTool) and Sequin.
   EMBL
• The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution
  supported by 22 member states, four prospect and two associate member states.
• EMBL was created in 1974 and is an intergovernmental organisation funded by public
  research money from its member states.
• The Laboratory operates from five sites: the main laboratory in Heidelberg, and
  outstations in Hinxton (the European Bioinformatics Institute (EBI), in England), Grenoble (France),
  Hamburg (Germany), and Monterotondo (near Rome).
• EMBL groups and laboratories perform basic research in molecular biology and
  molecular medicine as well as training for scientists, students and visitors.
• Israel is the only Asian state that has full membership.
• The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is maintained at
  the European Bioinformatics Institute (EBI).
• It incorporates and distributes nucleotide sequences from public sources.
• The database is a part of an international collaboration with DDBJ (Japan) and GenBank
  (USA).
• Data are exchanged between the collaborating databases on a daily
  basis.
• The web-based tool, Webin, is the preferred system for individual submission
  of nucleotide sequences, including Third Party Annotation (TPA) and
  alignment data.
• Automatic submission procedures are used for submission of data from large-scale
  genome sequencing.
• The latest data collection can be accessed via FTP, email and WWW
  interfaces.
• The EBI's Sequence Retrieval System (SRS) integrates and links the main
  nucleotide and protein databases as well as many other specialist molecular
  biology databases.
• For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) are
  available that allow external users to compare their own sequences against
  the data in the EMBL Nucleotide Sequence Database and other databases.
• All available resources can be accessed via the EBI home page at
  http://www.ebi.ac.uk.
           DDBJ(DNA Data Bank of Japan,
             https://www.ddbj.nig.ac.jp/)
• DDBJ Center collects nucleotide sequence data as a member of
  INSDC (International Nucleotide Sequence Database
  Collaboration) and provides freely available nucleotide sequence
  data and a supercomputer system to support research activities
  in life science.
• Currently, DDBJ Center is in operation at the Research
  Organization of Information and Systems, National Institute
  of Genetics (NIG) in Mishima, Japan, with the endorsement
  of MEXT, the Japanese Ministry of Education, Culture, Sports,
  Science and Technology.
• DDBJ Center is reviewed and advised by its own advisory
  board, DNA Database Advisory Committee (an outside
  committee of NIG), and also by the advisory board to
  INSDC, International Advisory Committee.
      UniProt
• UniProt is a freely accessible database of protein
  sequence and functional information, many entries being
  derived from genome sequencing projects.
• It contains a large amount of information about the
  biological function of proteins derived from the research
  literature.
• It is maintained by the UniProt consortium, which
  consists of several European bioinformatics organisations
  and a foundation from Washington, DC, United States.
• The UniProt consortium comprises the European
  Bioinformatics Institute (EBI), the Swiss Institute of
  Bioinformatics (SIB), and the Protein Information
  Resource (PIR).
             Organization of UniProt databases
• UniProtKB
• UniProt Knowledgebase (UniProtKB) is a protein database
  partially curated by experts, consisting of two sections:
  UniProtKB/Swiss-Prot (containing reviewed, manually
  annotated entries) and UniProtKB/TrEMBL (containing
  unreviewed, automatically annotated entries).
• As of 19 March 2014, release "2014_03" of
  UniProtKB/Swiss-Prot contains 542,782 sequence entries
  (comprising 193,019,802 amino acids abstracted from
  226,896 references) and release "2014_03" of
  UniProtKB/TrEMBL contains 54,247,468 sequence entries
  (comprising 17,207,833,179 amino acids)
               UniProtKB/Swiss-Prot
• UniProtKB/Swiss-Prot is a manually annotated, non-
  redundant protein sequence database.
• It combines information extracted from scientific
  literature and biocurator-evaluated computational
  analysis.
• The aim of UniProtKB/Swiss-Prot is to provide all known
  relevant information about a particular protein.
• Annotation is regularly reviewed to keep up with current
  scientific findings. The manual annotation of an entry
  involves detailed analysis of the protein sequence and of
  the scientific literature.
                  UniProtKB/TrEMBL
• UniProtKB/TrEMBL contains high-quality computationally
  analyzed records, which are enriched with automatic
  annotation.
• It was introduced in response to increased dataflow resulting
  from genome projects, as the time- and labour-consuming
  manual annotation process of UniProtKB/Swiss-Prot could
  not be broadened to include all available protein sequences.
• The translations of annotated coding sequences in the EMBL-
  Bank/GenBank/DDBJ nucleotide sequence database are
  automatically processed and entered in UniProtKB/TrEMBL.
• UniProtKB/TrEMBL also contains sequences from PDB, and
  from gene prediction, including Ensembl, RefSeq and CCDS.
              The Protein Information Resource (PIR)
• The Protein Information Resource (PIR) produces the largest,
  most comprehensive, annotated protein sequence database in the
  public domain.
• The PIR-International Protein Sequence Database is maintained in collaboration
  with the Munich Information Center for Protein Sequences
  (MIPS) and the Japan International Protein Information Database
  (JIPID).
• The expanded PIR WWW site allows sequence similarity and text
  searching of the Protein Sequence Database and auxiliary
  databases.
• Several new web-based search engines combine searches of
  sequence similarity and database annotation to facilitate the
  analysis and functional identification of proteins.
• New capabilities for searching the PIR
  sequence databases include annotation-sorted
  search, domain search, combined global and
  domain search, and interactive text searches.
• The PIR-International databases and search
  tools are accessible on the PIR WWW site at
  http://pir.georgetown.edu and at the MIPS
  WWW site at http://www.mips.biochem.mpg.de.
The database has the following distinguishing features.
• It is a comprehensive, annotated, and non-redundant protein sequence database,
  containing over 142 000 sequences as of September 1999. Included are
  sequences from the completely sequenced genomes of 16 prokaryotes, six
  archaebacteria, 17 viruses and phages, >100 eukaryote organelles
  and Saccharomyces cerevisiae.
• The collection is well organized with >99% of entries classified by protein
  family and >57% classified by protein superfamily.
• PSD annotation includes concurrent cross-references to other sequence,
  structure, genomic and citation databases, including the public nucleic acid
  sequence databases ENTREZ, MEDLINE, PDB, GDB, OMIM, FlyBase,
  MIPS/Yeast, SGD/Yeast, MIPS/Arabidopsis and TIGR.
• The PIR is the only sequence database to provide context cross-references
  between its own database entries.
   PIR-International sequence and auxiliary databases
Database    Description                                                            Information
PSD         Annotated and classified protein sequences                             http://pir.georgetown.edu/pirwww/dbinfo/textpsd.html
PATCHX      Sequences not yet in the PIR-International PSD                         http://pir.georgetown.edu/pirwww/dbinfo/patchx.html
ARCHIVE     Sequences as originally reported in a publication or submission        http://pir.georgetown.edu/pirwww/dbinfo/archive.html
NRL_3D      Sequences from three-dimensional structure database PDB                http://pir.georgetown.edu/pirwww/dbinfo/nrl3d.html
FAMBASE     Representative sequences from each protein family                      http://pir.georgetown.edu/pirwww/dbinfo/fambase.html
PIR-ALN     Sequence alignments of superfamilies, families and homology domains    http://pir.georgetown.edu/pirwww/dbinfo/piraln.html
RESID       Post-translational modifications with PSD feature information          http://pir.georgetown.edu/pirwww/dbinfo/resid.html
ProClass    Non-redundant sequences organized according to superfamilies and motifs http://pir.georgetown.edu/gfserver/proclass.html
ProtFam     Sequence alignments of superfamilies                                   http://www.mips.biochem.mpg.de/proj/protfam/protfam
PIR - https://proteininformationresource.org/
FILE FORMATS
                   Flat file
• A flat-file database is a database stored
  in a file called a flat file. Records follow a
  uniform format, and there are no
  structures for indexing or recognizing
  relationships between records. The file is
  simple. A flat file can be a plain text file, or
  a binary file. Relationships can be inferred
  from the data in the database, but the
  database format itself does not make
  those relationships explicit.
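As a minimal sketch of the idea, the snippet below treats a small tab-separated text as a flat file. Every record has the same fields, and because the file carries no index, a query has to read the records one by one. The accessions, field names and the `scan` helper are hypothetical and chosen only for illustration.

```python
import io

# Hypothetical flat file: one record per line, tab-separated fields in a fixed order.
FLAT_FILE = (
    "ACC0001\tHomo sapiens\tATGGCC\n"
    "ACC0002\tMus musculus\tATGAAA\n"
    "ACC0003\tHomo sapiens\tATGTTT\n"
)
FIELDS = ("accession", "organism", "sequence")


def scan(handle, field, value):
    """Return every record whose `field` equals `value`.

    There is no index, so the whole file is read line by line.
    """
    matches = []
    for line in handle:
        record = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
        if record[field] == value:
            matches.append(record)
    return matches


human = scan(io.StringIO(FLAT_FILE), "organism", "Homo sapiens")
print([record["accession"] for record in human])
```

Two records share the organism "Homo sapiens", but the file itself does not say so; the relationship is only inferred by comparing field values during the scan.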
      Flat File Storage Data Formats
• When GenBank, EMBL and DDBJ formed a
  collaboration (1986), sequence databases had
  moved to a defined flat file format with a shared
  feature table format and annotation standards.
• The flat file formats from the sequence databases
  are still used to access and display sequence and
  annotation. They are also convenient for storage of
  local copies.
           GenBank flat file
• The GenBank format allows for the storage
  of information in addition to a DNA/protein
  sequence.
The screen grab shows various details: the first section includes the entry’s
LOCUS, DEFINITION, ACCESSION and VERSION, and the final detail, denoted by ORIGIN,
is the actual sequence. These five elements are
the essential parts of the GenBank format.
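The five elements can be pulled out of a flat file with a few lines of code. This is a rough sketch, not a full GenBank parser: the record in `SAMPLE` is a made-up toy entry, the function name `parse_genbank` is hypothetical, and continuation lines (for example a DEFINITION that wraps onto a second line) are ignored for brevity.

```python
# Toy record invented for illustration; it is not a real GenBank entry.
SAMPLE = """\
LOCUS       TOY00001                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical toy record for illustration.
ACCESSION   TOY00001
VERSION     TOY00001.1
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""

KEYWORDS = ("LOCUS", "DEFINITION", "ACCESSION", "VERSION")


def parse_genbank(text):
    """Collect the LOCUS, DEFINITION, ACCESSION, VERSION and ORIGIN parts of one entry."""
    entry = {}
    sequence = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines mix position numbers, spaces and bases; keep the letters.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True
        else:
            parts = line.split(None, 1)
            # Keywords start in the first column; indented lines are skipped here.
            if parts and not line[0].isspace() and parts[0] in KEYWORDS:
                entry[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    entry["ORIGIN"] = "".join(sequence)
    return entry


entry = parse_genbank(SAMPLE)
print(entry["ACCESSION"], entry["VERSION"], len(entry["ORIGIN"]))
```

For real work an established library parser is a safer choice, since actual entries also carry FEATURES, REFERENCE and other sections that this sketch skips.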
                      FASTA format
• In bioinformatics and biochemistry, the FASTA format is a text-based
  format for representing either nucleotide sequences or
  amino acid (protein) sequences, in which nucleotides or amino
  acids are represented using single-letter codes. The format also
  allows for sequence names and comments to precede the
  sequences.
• A sequence in FASTA format begins with a single-line
  description, followed by lines of sequence data.
• The description line is distinguished from the sequence data by a
  greater-than (">") symbol in the first column.
• It is recommended that all lines of text be shorter than 80
  characters in length.
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDT
REMPFHVTKQESKPVQMMCMNNSFNVATLPAEKM
KILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTW
TNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGM
TDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIE
MAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTI
VYFGRYWSP
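The rules above (a ">" description line followed by one or more sequence lines) are enough to read a FASTA file. The following is a minimal sketch; the function name `parse_fasta` and the two toy records in `SAMPLE` are assumptions made for illustration.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input invented for illustration.
SAMPLE = """\
>seq1 hypothetical toy protein
MKTAYIAK
QRQISFVK
>seq2 hypothetical toy nucleotide
ATGGCC
"""

records = parse_fasta(SAMPLE)
for description, sequence in records:
    print(description, len(sequence))
```

Sequence lines are joined without separators, so the line length used in the file does not affect the parsed sequence.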
Filename extension
      There is no standard filename extension for a text file containing FASTA
      formatted sequences. The table below shows each extension and its respective
      meaning.
Extension    Meaning                             Notes
fasta, fa    generic FASTA                       Any generic FASTA file. See below for other common FASTA file extensions.
fna          FASTA nucleic acid                  Used generically to specify nucleic acids.
ffn          FASTA nucleotide of gene regions    Contains coding regions for a genome.
faa          FASTA amino acid                    Contains amino acid sequences. A multiple protein FASTA file can have the more specific extension mpfa.
frn          FASTA non-coding RNA                Contains non-coding RNA regions for a genome, in DNA alphabet, e.g. tRNA, rRNA.
              Protein Data Bank (PDB) flat file
• The Protein Data Bank (pdb) file format is a textual file format describing the
  three-dimensional structures of molecules held in the Protein Data Bank.
• The pdb format accordingly provides for description and annotation of protein
  and nucleic acid structures including atomic coordinates, secondary structure
  assignments, as well as atomic connectivity.
• In addition experimental metadata are stored. PDB format is the legacy file
  format for the Protein Data Bank which now keeps data on biological
  macromolecules in the newer mmCIF file format.
• The PDB file format was invented in 1976 as a human-readable file that would
  allow researchers to exchange protein coordinates through a database system.
• Its fixed-column width format is limited to 80 columns, which was based on
  the width of the computer punch cards that were previously used to exchange
  the coordinates.
• Through the years the file format has undergone many changes and revisions.
• HEADER, TITLE and AUTHOR records
provide information about the researchers who defined the
structure; numerous other types of records are available to provide
other types of information.
• REMARK records
can contain free-form annotation, but they also accommodate
standardized information; for example, the REMARK 350 BIOMT
records describe how to compute the coordinates of the
experimentally observed multimer from those of the explicitly
specified ones of a single repeating unit.
• SEQRES records
give the sequences of the peptide chains (for example, three chains named A, B and C);
these can be very short but usually span multiple lines.
• ATOM records
describe the coordinates of the atoms that are part of
the protein. For example, an ATOM line may
describe the alpha-N atom of the first residue of
peptide chain A, a proline residue; the first
three floating point numbers are its x, y and z
coordinates and are in units of Ångströms. The next
three columns are the occupancy, temperature factor,
and the element name, respectively.
• HETATM records
describe coordinates of hetero-atoms, that is those
atoms which are not part of the protein molecule.
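Because the PDB format is fixed-column, an ATOM record is read by character position rather than by splitting on spaces. The sketch below assumes the standard column layout (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54, occupancy in 55-60, temperature factor in 61-66, element in 77-78). The helpers `format_atom` and `parse_atom` and the coordinate values are illustrative only.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z, occ, temp, element):
    """Build one fixed-column ATOM line (illustrative helper, not a library function)."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{temp:6.2f}          {element:>2}"
    )


def parse_atom(line):
    """Slice an ATOM line by column position (Python slices are 0-based)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# Illustrative values: backbone nitrogen of a proline, residue 1 of chain A.
line = format_atom(1, " N", "PRO", "A", 1, 8.316, 21.206, 21.530, 1.00, 17.44, "N")
atom = parse_atom(line)
print(line)
print(atom["name"], atom["res_name"], atom["chain"], atom["x"], atom["y"], atom["z"])
```

Slicing by position matters because adjacent fields may touch with no space between them, for instance when a coordinate fills its full eight-character field.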
                   Protein Information Resource
                            (PIR format)
PIR format description
•A sequence in PIR format consists of:
      o One line starting with
            • a ">" (greater-than) sign, followed by
            • a two-letter code describing the sequence type (P1, F1, DL, DC, RL, RC, or XX),
              followed by
            • a semicolon, followed by
            • the sequence identification code (the database ID-code).
      o One line containing a textual description of the sequence.
      o One or more lines containing the sequence itself. The end of the sequence is
          marked by a "*" (asterisk) character.
      o Optionally, this can be followed by one or more lines describing the sequence.
          Software that is supposed to read only the sequence should ignore these.
•A file in PIR format may comprise more than one sequence.
•The PIR format is also often referred to as the NBRF format.
PIR format example (for a sequence without a known structure)
>P1;test
sequence
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLM
NTTVTTGLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLT
VTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWCHFPSN
WKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETA
NLWFNCHGEFFYCKMDWFLNYLNNLTVDADHNECKNTSGT
KSGNKRAPGPCVQRTYVACHIRSVIIWLETISKKTYAPPREGHL
ECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLV
EITPIGFAPTEVRRYTGGHERQKRVPFV*
   PIR format example (for sequences with a known structure)
>P1;1bbha-
structure:1bbha-
AGLSPEEQIETRQAGYEFMGWNMGKIKANLEGEYNAAQVEAAANVIAAIANSGMGALYGPG
TDKNVGDVKTRVKPEFFQNMEDVGKIAREFVGAANTLAEVAATGEAEAVKTAFGDVGAACKS
CHEKYRAK-*
>P1;1cpq--
structure:1cpq--
--ADTKEVLEAREAYFKSLGGSMKAMTGVAKA-
DAEAAKVEAAKLEKILATDVAPLFPAGTSSTDLPG-
QTEAKAAIWANMDDFGAKGKAMHEAGGAVIAAANAGDGAAFGAALQKLGGTCKACHDDY
REED*
>P1;256bb-
structure:256bb-
---------ADLEDNMETLNDNLKVIEKAD----NAAQVKDALTKMRAAALD-AQKATPPKLE---------
DKSP-DSPEMKDFRHGFDILVGQIDDALKLANEGKVKEAQAAAEQLKTTRNAYHQKYR---*
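A minimal reader for the PIR layout described above might look like the following sketch. It assumes each entry starts with a ">" line containing a semicolon, is followed by one description line, and that the sequence ends at the first "*". The function name `parse_pir` and the dictionary keys are illustrative choices.

```python
def parse_pir(text):
    """Parse PIR/NBRF-formatted text into a list of record dictionaries."""
    lines = [ln.rstrip() for ln in text.splitlines()]
    records = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and ";" in line:
            # Header: ">" + two-letter type code + ";" + identification code
            seq_type, seq_id = line[1:].split(";", 1)
            description = lines[i + 1] if i + 1 < len(lines) else ""
            i += 2
            chunks = []
            # Collect sequence lines up to the terminating asterisk
            while i < len(lines) and not lines[i].startswith(">"):
                chunk = lines[i].strip()
                i += 1
                if "*" in chunk:
                    chunks.append(chunk.split("*", 1)[0])
                    break
                chunks.append(chunk)
            # Skip optional trailing comment lines before the next entry
            while i < len(lines) and not lines[i].startswith(">"):
                i += 1
            records.append({
                "type": seq_type,
                "id": seq_id,
                "description": description,
                "sequence": "".join(chunks),
            })
        else:
            i += 1
    return records
```

Gap characters ("-") in aligned entries are kept as part of the sequence string, so a caller that needs the ungapped sequence can remove them afterwards.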
               Protein structure file
• A PSF file, also called a protein structure file,
  contains all of the molecule-specific information
  needed to apply a particular force field to a
  molecular system.
• The PSF file contains six main sections of
  interest: atoms, bonds, angles, dihedrals,
  impropers (dihedral force terms used to maintain
  planarity), and cross-terms. The following is
  taken from a PSF file for ubiquitin. First is the
  title and atom records:
PSF CMAP
   6 !NTITLE
REMARKS original generated structure x-plor psf file
REMARKS 2 patches were applied to the molecule.
REMARKS topology top_all27_prot_lipid.inp
REMARKS segment U { first NTER; last CTER; auto angles dihedrals }
REMARKS defaultpatch NTER U:1
REMARKS defaultpatch CTER U:76
  1231 !NATOM
 1 U 1 MET N NH3 -0.300000            14.0070        0
 2 U 1 MET HT1 HC 0.330000             1.0080        0
 3 U 1 MET HT2 HC 0.330000             1.0080        0
 4 U 1 MET HT3 HC 0.330000             1.0080        0
 5 U 1 MET CA CT1 0.210000            12.0110        0
 6 U 1 MET HA HB 0.100000             1.0080         0
 7 U 1 MET CB CT2 -0.180000           12.0110        0
The fields in the atom section are atom ID, segment name, residue ID, residue name, atom name,
atom type, charge, mass, and an unused 0.
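The nine fields listed above can be pulled out of an atom line by splitting on whitespace. This is a small sketch that assumes whitespace-separated fields as in the excerpt; the names `PsfAtom` and `parse_psf_atom` are illustrative.

```python
from typing import NamedTuple


class PsfAtom(NamedTuple):
    atom_id: int
    segment: str
    residue_id: int
    residue_name: str
    atom_name: str
    atom_type: str
    charge: float
    mass: float
    unused: int


def parse_psf_atom(line):
    """Parse one line of the !NATOM section into a PsfAtom tuple."""
    f = line.split()
    return PsfAtom(int(f[0]), f[1], int(f[2]), f[3], f[4], f[5],
                   float(f[6]), float(f[7]), int(f[8]))
```

Applied to the first atom line of the excerpt, this yields atom 1 of segment U, residue 1 (MET), atom name N, type NH3, charge -0.3 and mass 14.007.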
      Module 4
Topic: Modular Nature of Proteins
Introduction - Domain
o A protein domain is a region of the protein's polypeptide chain that is self-
  stabilizing and that folds independently from the rest.
o Each domain forms a compact folded three-dimensional structure.
o Many proteins consist of several domains. One domain may appear in a variety of
  different proteins.
o Molecular evolution uses domains as building blocks and these may be
  recombined in different arrangements to create proteins with different functions.
o In general, domains vary in length from about 50 to about 250 amino acids.
   Introduction – Domain contd….
 The shortest domains, such as zinc fingers, are stabilized by metal ions
  or disulfide bridges. Domains often form functional units, such as the
  calcium-binding EF hand domain of calmodulin.
 Because they are independently stable, domains can be "swapped"
  by genetic engineering between one protein and another to make chimeric
  proteins.
• Proteins are composed of evolutionary units called domains.
• A domain can either have an independent function or contribute to the function of a
  multidomain protein in cooperation with other domains.
• Once a domain has duplicated, it can evolve a new or
  modified function.
• Based on sequence, structural and functional evidence, domains are grouped into
  superfamilies.
Background
 The concept of the domain was first proposed in 1973 by Wetlaufer after X-ray crystallographic
   studies of hen lysozyme and papain and by limited proteolysis studies of immunoglobulins.
 Wetlaufer defined domains as stable units of protein structure that could fold autonomously.
 In the past domains have been described as units of:
•compact structure
•function and evolution
•folding.
Domain swapping
Domain swapping is a mechanism for forming oligomeric
 assemblies.
 In domain swapping, a secondary or tertiary element of a
 monomeric protein is replaced by the same element of
 another protein.
Domain swapping can range from secondary structure
 elements to whole structural domains.
It also represents a model of evolution for functional
 adaptation by oligomerisation, e.g. oligomeric enzymes that
 have their active site at subunit interfaces
Role of domains
• Acquiring new structures and functions by combining
  domains
   New domain combinations
☺ Formation of new domain combinations is an important mechanism
 in protein evolution.
☺ Proteins contain several thousand different combinations of two
 superfamilies.
☺ Duplication is one of the main sources for creation of new proteins.
☺ After duplication, a domain can evolve a new or modified function either by
 sequence divergence or by combining with other domains to form
 a multidomain protein with a new series of domains.
☺ Multidomain proteins form by duplication and recombination, which
 also shapes the geometry and functional relationships of their domains.
☺ Supradomains are two- or three-domain combinations that occur
 in different domain architectures with different N- and C-terminal
 neighbours.
Overview of different aspects of multidomain proteins (figure):
{1} Domains belonging to the same superfamily are represented as
    rectangles of the same colour.
{2} Supradomains are two- or three-domain combinations that occur
    in different domain architectures.
{3} Domain combinations form different geometries with
    different functions.
These domains forms a
A few domain superfamilies are highly versatile and have neighbouring domains from
many superfamilies.
* Each superfamily has its own features.
* Some superfamilies are highly versatile, some are highly abundant and some
are both.
* Which combinations are selected depends on the structure and function of the
domains and domain combinations.
Contd…
  • Important examples of the reuse of particular domains
    come from signal transduction.
       . The SH3 and SH2 domains in signal transduction.
       . The combination and addition of several domains
  determine the versatility of the protein.
To have the same function, the sequential order of
domains is conserved.
* If the same domain combination is observed in two different proteins, they
  are closely related to each other phylogenetically.
* Such domain architectures have evolved from the same ancestor.
* E.g. the Rossmann fold
* Proteins sharing the same series of domains tend to have the same
  function.
  *The total number of defined domains is relatively small and is growing
  only slowly. For example, the Pfam domain database defines about
  18,000 domains in its current version (version 32).
  * On the other hand, the number of known unique domain
  arrangements - defined by the linear order of domains in an amino acid
  sequence is much larger and growing rapidly .
  * Accordingly, rearrangements of existing domains can help explain the
  vast protein diversity we observe in nature
Geometry of
  domain
  combinations
~ The sequential order of domains is largely conserved.
~ The geometry of Rossmann domains and their
partner domains is conserved within the same superfamily.
~ The structure of a protein of unknown structure can be modelled on
homologous polypeptide(s) of known structure.
* E.g. yeast ribosome and exosome
~ The more similar the domain sequences, the more conserved
the interaction of the protein domains.
Functional relationships
of domains in multi-
domain proteins!
Domain-centric scheme emphasises domain function.
In this domain-centric functional classification scheme, domains are
classified into several categories
1.catalytic activity,
2.cofactor binding,
3.responsibility for subcellular localisation,
4.protein–protein interaction etc..
Two principles
1. A domain can perform the same function, but in different protein contexts
(i.e. with different partner domains). E.g. sensory, regulatory and enzymatic
domains.
2. Some domains modify their function according to the partner
domain. E.g. the WHD domain (Winged Helix Domain)
             Module 4
Topic: Optimal Alignment Methods,
       Sequence Alignment
Introduction
 Fundamental building blocks are linear
  sequences
• The heart of bioinformatics analysis is sequence
  comparison
• Gene repository in NCBI
 Pairwise sequence alignment
Evolutionary basis
 molecular sequences undergo random
  changes
 traces of evolution may still remain in certain
  portions of the sequences to allow
  identification of the common ancestry
 Functional and structural roles tend to be
  preserved
 patterns of conservation and variation can be
  identified
   evolutionary relationships between
    sequences helps to characterize the function
    of unknown sequences
   • Characterization into families or domains or
    motifs
   insertions or deletions or mutations
   Sequence homology vs sequence similarity
   Sequence similarity vs sequence identity
Sequencing a genome
 Shotgun sequencing
 Accurate to 650 nucleotides
 Sequence alignment used to stitch the whole
  length
 Sequence assembly
Sequence comparison
 Sequence similarity can provide clues about
  function and evolutionary relationships
 Algorithms used to search in massive
  databases
 Two types
 global and local
Global
 Generally similar over entire length
 best possible alignment across the entire
  length
Local
• Local regions with the highest level of
  similarity
• Conserved patterns in DNA or protein
  sequences
• Motifs
• Protein domains
Pairwise Sequence
    Alignment
Sequences
   DNA/RNA sequences
    – strings composed of an alphabet of 4 letters
   Protein sequences
    – alphabet of 20 letters
A Quantitative Measure of Sequence
Similarity
 To compare the nucleotides or amino acids
  that appear at corresponding positions in two
  or more sequences, we must first assign
  those correspondences.
 Sequence alignment is the identification of
  residue-residue correspondences.
Orthologous and paralogous
 Orthologous sequences differ because they are
  found in different species (a speciation event)
 Paralogous sequences differ due to a gene
  duplication event
 Sequences may be both orthologous and
  paralogous
       Pairwise Alignment
 The alignment of two sequences (DNA or
  protein) is a relatively straightforward
  computational problem.
  – There are lots of possible alignments.
 Two sequences can always be aligned.
 Sequence alignments have to be scored.
 Often there is more than one solution with the
  same score.
Methods of Alignment
 By hand - slide sequences on two lines of a word
  processor
 Dot plot
  – with windows
 Rigorous mathematical approach
  – Dynamic programming (slow, optimal)
 Heuristic methods (fast, approximate)
  – BLAST and FASTA
     • Word matching and hash tables.
Applications
 The basic tool of bioinformatics
 Sequence similarity is an indicator of
  homology
 Database queries
    – Determining the function of a newly discovered
      genetic sequence
   Annotation of genomes
    – Involving assignment of structure and function to
      as many genes as possible
Dot Plot – Example (1)
   • Let's consider a dot plot between sperm
    whale and human myoglobins
Sperm whale myoglobin
GLSDGEWQLV LNVWGKVEAD   IPGHGQEVLI   RLFKGHPETL
EKFDKFKHLK SEDEMKASED   LKKHGATVLT   ALGGILKKKG
HHEAEIKPLA QSHATKHKIP   VKYLEFISEC   IIQVLQSKHP
GDFGADAQGA MNKALELFRK   DMASNYKELG   FQG
human myoglobin
VLSEGEWQLV LHVWAKVEAD   VAGHGQDILI   RLFKSHPETL
EKFDRFKHLK TEAEMKASED   LKKHGVTVLT   ALGAILKKKG
HHEAELKPLA QSHATKHKIP   IKYLEFISEA   IIHVLHSRHP
GDFGADAQGA MNKALELFRK   DIAAKYKELG   YQG
Dot Plot – Example (2)
                            Diagonal lines
                             of dots show
                             similarities
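A dot plot can be computed directly from two sequences. The sketch below places a dot at position (i, j) whenever a window starting at residue i of the first sequence and residue j of the second contains at least `min_matches` identical residues. The function name, the window size and the threshold are illustrative choices, not values from the slides.

```python
def dot_plot(seq_a, seq_b, window=1, min_matches=1):
    """Return the set of (i, j) window start positions that count as dots."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical residues inside the two windows
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a across the top, seq_b down the side."""
    rows = ["  " + seq_a]
    for j, residue in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        rows.append(residue + " " + row)
    return "\n".join(rows)


whale = "GLSDGEWQLV"   # first ten residues of sperm whale myoglobin
human = "VLSEGEWQLV"   # first ten residues of human myoglobin
print(render(whale, human, dot_plot(whale, human, window=3, min_matches=3)))
```

With a window of three identical residues, the shared stretch GEWQLV shows up as a short diagonal run of dots; a window of 1 would also mark every isolated identical pair and make the plot noisier.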
Dot Plots and Sequence Alignments
• An alignment can reflect the evolutionary
  relationship between two or more homologs.
 Three kinds of changes can occur at any
  given position within a sequence
    – Mutation
    – Insertion
    – Deletion
Many Possibilities
   An uninformative alignment:   -------gctgaacg
                                 ctataatc-------
   An alignment without gaps:    gctgaacg
                                 ctataatc
   An alignment with gaps:       gctga-a--cg
                                 --ct-ataatc
   And another:                  gctg-aa-cg
                                 -ctataatc-
An Introduction to Bioinformatics Algorithms   www.bioalgorithms.info
Percent Sequence Identity
• The extent to which two nucleotide or amino
  acid sequences are invariant
               A C C T G A G – A G
               A C G T G – G C A G
   Position 3 (C/G) is a mismatch; positions 6 and 8 are indels.
   7 of 10 aligned positions match: 70% identical
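The 70% figure can be checked with a few lines of code. This sketch counts identical, non-gap columns and divides by the full alignment length. Other conventions for the denominator exist (for example the length of the shorter sequence), so the choice here is an assumption.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != gap)
    return 100.0 * matches / len(aligned_a)


print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))  # 7 of 10 columns match
```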
Affine Gap – Example (1)
 +2 for a match
 -2 for a gap
 -1 for a mismatch
Gap Penalties
   • Linear gap penalty
    – cost of a gap of length n grows linearly with a single per-residue
      gap penalty gi
       • f(g) = –n · gi
   • Affine gap penalty
    – cost of a gap depends on an initial gap-open penalty (gi) and
      a subsequent gap-extension penalty (ge)
    – based on the fact that a single biological mutational event
      can insert or delete more than one residue
       • f(g) = –[gi + (n – 1) · ge]
• Currently there is no widely accepted theory for selecting gap costs;
  they are generally guided by trial and error.
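The two gap models can be compared numerically. In this sketch the penalties are passed as positive numbers and the functions return the (negative) score contribution of a gap of length n; the function names are illustrative.

```python
def linear_gap_score(n, gap_penalty):
    """Score of a gap of length n when every gapped residue costs the same."""
    return -n * gap_penalty


def affine_gap_score(n, gap_open, gap_extend):
    """Score of a gap of length n: open once, then extend (n - 1) times."""
    return -(gap_open + (n - 1) * gap_extend)


for n in (1, 2, 5):
    print(n, linear_gap_score(n, 2), affine_gap_score(n, 2, 1))
```

With a gap-open penalty of 2 and a gap-extension penalty of 1, a single long gap is penalised less under the affine model than under a linear model charging 2 per residue, which reflects the idea that one mutational event can insert or delete several residues.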
Affine Gap – Example (2)
+2 for a match
-1 for a mismatch
a gap open score of –2
a gap extension score of -1.
Find the Best
   Need a way to examine all possible
    alignments systematically
   Compute a score reflecting the quality of
    each possible alignment
   To identify the alignment with the optimal
    score.
   Several different alignments may give the
    same best score.
   • Many different scoring schemes exist
Scoring Matrices for Nucleotide
Sequence
   • A mild penalty for transitions
    – A ↔ G
    – C ↔ T
   • A severe penalty for transversions
    – A ↔ C
    – A ↔ T
    – G ↔ C
    – G ↔ T
   Transition/transversion matrix:
             a    g    t    c
        a   20   10    5    5
        g   10   20    5    5
        t    5    5   20   10
        c    5    5   10   20
Scoring Matrices for Amino Acid
Sequence
   Based on observed chemical/physical
    similarity
    – Residue hydrophobicity, charge, and size
    – Genetic code
   Based on observed substitution frequencies
Widely Used Substitution Matrices
– Empirically Derived
   PAM: Point Accepted Mutations
    – The PAM family (Dayhoff) is based on evolutionary
      distance. The matrices were derived from closely related
      sequences and the mutations seen in them.
   • BLOSUM: BLOcks SUbstitution Matrix
    – The BLOSUM family (Henikoff and Henikoff) was derived
      from more distantly related sequences. The number of
      the matrix is the percent identity.
   • A scoring system is a set of values quantifying the likelihood of one
    residue being substituted by another in an alignment.
   • It is also known as a substitution matrix.
   • The scoring matrix for nucleotides is relatively simple.
         • A positive value or a high score is given for a match and a
          negative value or a low score is given for a mismatch.
   • Scoring matrices for amino acids are more complicated
    because scoring has to reflect the physicochemical properties
    of amino acid residues.
• Identity matrix
• Transition-transversion matrix
Transitions: substitutions in which a purine (A/G) is replaced by
   another purine (A/G) or a pyrimidine (C/T) is replaced by
   another pyrimidine (C/T).
Transversions: substitutions in which a purine (A/G) is replaced by
   a pyrimidine (C/T) or vice versa.
   Match score:         +1
   Mismatch score:      +0
   Gap penalty:         –1
   ACGTCTGATACGCCGTATAGTCTATCT
       ||||| |||   || ||||||||
   ----CTGATTCGC---ATCGTCTATCT
   Matches: 18 × (+1)
   Mismatches: 2 × 0
   Gaps: 7 × (–1)
   Score = 18 + 0 – 7 = +11
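The score above can be reproduced by walking along the two aligned rows column by column. This is a sketch with a linear gap penalty; the function name and the default values simply mirror the scheme shown (match +1, mismatch 0, gap -1).

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap=-1):
    """Return (matches, mismatches, gaps, score) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    n_match = n_mismatch = n_gap = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            n_gap += 1          # a residue aligned against a gap
        elif x == y:
            n_match += 1
        else:
            n_mismatch += 1
    score = n_match * match + n_mismatch * mismatch + n_gap * gap
    return n_match, n_mismatch, n_gap, score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # (18, 2, 7, 11)
```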
PAM - point accepted mutation based on
  global alignment [evolutionary model]
BLOSUM - Block substitutions based
 on local alignments [similarity among
 conserved sequences]
   • First given by Dayhoff, who compiled alignments of 71
    groups of very closely related protein sequences.
   • PAM: Point Accepted Mutation.
   • PAM matrices were derived based on the evolutionary
    divergence between protein sequences.
   • Construction of the PAM1 matrix involves alignment of full-length
    sequences and subsequent construction of phylogenetic
    trees using the parsimony principle.
   • Ancestral sequence information is used to count the number of
    substitutions along each branch of the tree.
   • Positive scores in the matrix denote substitutions occurring
    more frequently than expected by chance (evolutionarily conserved
    replacements).
   • Negative scores correspond to substitutions that occur less
    frequently than expected.
   • One PAM unit is defined as 1% amino acid change, or one mutation per
    100 residues.
   • Increasing PAM numbers correspond to increasing PAM
    units and thus to greater evolutionary distances between protein sequences.
• Constructed based on the phylogenetic
  relationships prior to scoring mutations
• Difficulty of determining ancestral
  relationships among sequences
• Based on a small set of closely related
  proteins
   • BLOSUM is a series of block amino acid substitution matrices.
   • The matrices are derived from direct observation of every
    possible amino acid substitution in multiple sequence
    alignments.
   • A conserved sequence pattern is also called a block.
   • Blocks are ungapped alignments of less than 60 amino acids in
    length.
   • The number in a BLOSUM matrix name is the percent identity of the
    sequences selected for construction of the matrix.
   • BLOSUM62 indicates that the sequences selected for
    constructing the matrix share, on average, 62% identical amino acids.
   • The BLOSUM score for a particular residue pair is derived
    from the log ratio of the observed substitution frequency to
    the frequency expected by chance for that residue pair.
   • The lower the BLOSUM number, the more divergent the sequences
    the matrix is suited to.
     Log-odds = log ( chance to see the pair in homologous proteins /
                      chance to see the pair in unrelated proteins by chance )
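The log-odds idea can be illustrated with a one-line function. The sketch takes the observed frequency of a residue pair in alignments of homologous proteins and the frequency expected by chance. The base-2 logarithm and the example frequencies are assumptions for illustration; published matrices also scale and round the result.

```python
import math


def log_odds(p_observed, p_expected, base=2.0):
    """Log ratio of the observed pair frequency to the chance expectation."""
    return math.log(p_observed / p_expected, base)


# Hypothetical frequencies: a pair seen twice as often as chance predicts
print(log_odds(0.02, 0.01))   # positive: substitution is favoured
# A pair seen half as often as chance predicts
print(log_odds(0.005, 0.01))  # negative: substitution is disfavoured
```

A score of zero means the pair occurs exactly as often as expected for unrelated sequences.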
   PAM
    › Based on a mutational model of evolution (Markov process)
    › PAM1 is based on sequences of 85% similarity
    › Designed to track the evolutionary origins
   BLOSUM
    › Based on the multiple alignment of blocks
    › Good to be used to compare distant sequences
    › Designed to find proteins' conserved domains
Measure of Sequence Divergence –
PAM
 1 PAM = 1 percent accepted mutation
 Two sequences 1 PAM apart have 99%
  identical residues.
 Given amount of evolutionary time, how likely
  one amino acid is to mutate to another.
 Collecting statistics from pairs of sequences
  as closely related as 1 PAM to produce the
  1PAM substitution matrix.
For More Widely Divergent
Sequences
 Matrices representing larger evolutionary
  distances may be derived from the PAM1
  matrix by matrix multiplication.
 PAM250:
    – Corresponding to ~20% identity
    – The lowest sequence similarity for which we can
      hope to produce a correct alignment
        PAM               0    30    80   110   200   250
        % identity      100    75    50    40    25    20
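Deriving a higher-numbered matrix from the 1 PAM matrix by repeated multiplication can be shown on a toy example. The sketch below uses a hypothetical two-letter alphabet in which each letter has a 1% chance of changing per PAM unit; real PAM matrices are 20 × 20 and use Dayhoff's observed mutation probabilities.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy mutation probability matrix for 1 PAM unit (hypothetical values)
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam250 = mat_pow(pam1, 250)
print(pam250[0][0])  # probability that a letter is unchanged after 250 units
```

Even in this toy model the probability of seeing the original letter after 250 units is far above zero, because positions can mutate more than once and revert. The same effect is why 250 PAM units still corresponds to a detectable level of identity.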
                                            PAM 250
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V    B    Z
A    2   -2    0    0   -2    0    0    1   -1   -1   -2   -1   -1   -3    1    1    1   -6   -3    0    2    1
R   -2    6    0   -1   -4    1   -1   -3    2   -2   -3    3    0   -4    0    0   -1    2   -4   -2    1    2
N    0    0    2    2   -4    1    1    0    2   -2   -3    1   -2   -3    0    1    0   -4   -2   -2    4    3
D    0   -1    2    4   -5    2    3    1    1   -2   -4    0   -3   -6   -1    0    0   -7   -4   -2    5    4
C   -2   -4   -4   -5   12   -5   -5   -3   -3   -2   -6   -5   -5   -4   -3    0   -2   -8    0   -2   -3   -4
Q    0    1    1    2   -5    4    2   -1    3   -2   -2    1   -1   -5    0   -1   -1   -5   -4   -2    3    5
E    0   -1    1    3   -5    2    4    0    1   -2   -3    0   -2   -5   -1    0    0   -7   -4   -2    4    5
G    1   -3    0    1   -3   -1    0    5   -2   -3   -4   -2   -3   -5    0    1    0   -7   -5   -1    2    1
H   -1    2    2    1   -3    3    1   -2    6   -2   -2    0   -2   -2    0   -1   -1   -3    0   -2    3    3
I   -1   -2   -2   -2   -2   -2   -2   -3   -2    5    2   -2    2    1   -2   -1    0   -5   -1    4   -1   -1
L   -2   -3   -3   -4   -6   -2   -3   -4   -2    2    6   -3    4    2   -3   -3   -2   -2   -1    2   -2   -1
K   -1    3    1    0   -5    1    0   -2    0   -2   -3    5    0   -5   -1    0    0   -3   -4   -2    2    2
M   -1    0   -2   -3   -5   -1   -2   -3   -2    2    4    0    6    0   -2   -2   -1   -4   -2    2   -1    0
F   -3   -4   -3   -6   -4   -5   -5   -5   -2    1    2   -5    0    9   -5   -3   -3    0    7   -1   -3   -4
P    1    0    0   -1   -3    0   -1    0    0   -2   -3   -1   -2   -5    6    1    0   -6   -5   -1    1    1
S    1    0    1    0    0   -1    0    1   -1   -1   -3    0   -2   -3    1    2    1   -2   -3   -1    2    1
T    1   -1    0    0   -2   -1    0    0   -1    0   -2    0   -1   -3    0    1    3   -5   -3    0    2    1
W   -6    2   -4   -7   -8   -5   -7   -7   -3   -5   -2   -3   -4    0   -6   -2   -5   17    0   -6   -4   -4
Y   -3   -4   -2   -4    0   -4   -4   -5    0   -1   -1   -4   -2    7   -5   -3   -3    0   10   -2   -2   -3
V    0   -2   -2   -2   -2   -2   -2   -1   -2    4    2   -2    2   -1   -1   -1    0   -6   -2    4    0    0
B    2    1    4    5   -3    3    4    2    3   -1   -2    2   -1   -3    1    2    2   -4   -2    0    6    5
Z    1    2    3    4   -4    5    5    1    3   -1   -1    2    0   -4    1    1    1   -4   -3    0    5    6
BLOSUM Matrices
   • PAM matrices were based on only a small
    number of observed substitutions (~1500)
   • BLOSUM matrices perform best in identifying distant
    relationships
   • Derived from the BLOCKS database (BLOSUM = BLOcks SUbstitution
    Matrix)
   • Blocks are regions of closely related proteins alignable
    without gaps
   • BLOSUM62 ≈ PAM150
   • BLOSUM50 ≈ PAM250
Scoring/Substitution Matrices
   BLOSUM62
The Blosum50 Scoring Matrix
Examples of Scoring Scheme
   For DNA sequences, CLUSTAL-W recommends
    use of the identity matrix for substitution
    –   +1 for a match
    –   0 for a mismatch
    –   Penalty 10 for gap open
    –   Penalty 0.1 for gap extension by one residue
   For protein sequences
    – BLOSUM 62 matrix for substitution
    – Penalty 11 for gap open
    – Penalty 1 for gap extension by one residue
Dynamic Programming
 General algorithmic development technique
 Reuses the results of previous computations
    – Store intermediate results in a table for reuse
   Look up in table for earlier result to build from
Global vs. Local Alignment
Global Alignment
   Needleman-Wunsch 1970
   Idea: Build up optimal alignment from optimal
    alignments of subsequences
Three Steps of Dynamic
Programming
   • A simple scoring scheme is assumed where
    – Si,j = 1 if the residues at positions i and j match (match score); otherwise
    – Si,j = 0 (mismatch score)
    – w = 0 (gap penalty)
   Three steps in dynamic programming
    – Initialization
    – Matrix fill (scoring)
    – Traceback (alignment)
 Initialization Step
    This example assumes there is no gap
     opening or gap extension penalty
   Sequence 1 (across the top of the matrix): GAATTCAGTTA
   Sequence 2 (down the side of the matrix):  GGATCGA
   3 cases:
             G-   -G   G
             -G   G-   G
Matrix Fill Step
Traceback Step
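The three steps (initialization, matrix fill, traceback) can be sketched in a short Needleman-Wunsch implementation with a linear gap penalty. The defaults follow the simple scheme assumed above (match 1, mismatch 0, gap 0). The function name and the tie-breaking order in the traceback are choices made here, so other equally good alignments may exist.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=0):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # Initialization: first row and column hold cumulative gap scores
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    # Matrix fill: each cell is the best of diagonal, up and left moves
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Traceback: walk from the bottom-right cell back to the origin
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
            continue
        out_a.append("-"); out_b.append(b[j - 1])
        j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(score)
print(top)
print(bottom)
```

The table F has (n + 1) × (m + 1) cells, so time and memory both grow with the product of the sequence lengths, which is why this rigorous method is slower than heuristic searches.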
Z-score
Z = (Xt - Xs) / s
Xt = average score with the real sequences
Xs = average of the distribution of scores with random sequences
s = standard deviation of the distribution of scores with random sequences
Accuracy of the alignment:
  Z<3 not significant
  3<Z<6 putatively significant
  6<Z<10 possibly significant
  Z>10 significant
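A Z-score of this kind can be computed from the score of the real alignment and a list of scores obtained with randomized sequences. The sketch assumes Z is measured as the real score minus the random mean, in units of the random standard deviation, so that larger positive values are more significant. The random scores in the example are made-up numbers.

```python
import statistics


def z_score(real_score, random_scores):
    """How many standard deviations the real score lies above the random mean."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)   # sample standard deviation
    return (real_score - mean) / sd


def interpret(z):
    """Map a Z-score onto the significance bands listed above."""
    if z < 3:
        return "not significant"
    if z < 6:
        return "putatively significant"
    if z < 10:
        return "possibly significant"
    return "significant"


z = z_score(50, [10, 20, 30])   # hypothetical scores
print(z, interpret(z))
```

In practice the random scores would come from aligning shuffled versions of one of the sequences, which keeps length and composition fixed.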
P-value
   The probability that the observed match could
    have happened by chance
                             Optimal local alignment scores
                             for pairs of random amino acid
                             sequences of the same length
                             follow an extreme-value
                             distribution
                             P(S < x) = exp[−K exp(−λx)]
                             P(S ≥ x) = 1 − exp[−K exp(−λx)]
    A p-value of 0.01 means that 1 in 100 matches giving this
    score are to unrelated sequences.
E-value
 The expected number of pairs with score at
  least S is given by the E-value for the score S
                E = K m n exp(−λS)
• Here m and n are the lengths of the query and of the database, so the
  E-value takes into account the size of the
  database being scanned.
 The parameters K and lambda can be
  thought of simply as natural scales for the
  search space size and the scoring system
  respectively.
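The E-value formula is easy to evaluate once K and lambda are known. In this sketch the parameter values (K = 0.041, lambda = 0.267), the query length and the database length are illustrative assumptions; real values depend on the scoring matrix and gap costs in use. The conversion P = 1 − exp(−E) gives the probability of at least one chance hit with that score or better.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value_from_e(e):
    """Probability of at least one chance alignment, given its E-value."""
    return 1.0 - math.exp(-e)


# Hypothetical search: 250-residue query against a 50-million-residue database
for s in (40, 80, 120):
    e = e_value(s, m=250, n=50_000_000, K=0.041, lam=0.267)
    print(s, e, p_value_from_e(e))
```

For small E the two numbers are nearly equal, which is why very low E-values are often read directly as probabilities.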
Comparison of the Performance
   Compare the performance (execution time) of
    the three programs
    – SSEARCH
    – FASTA
    – BLAST
  Module 4
Topic: BLAST
                INTRODUCTION
    An important goal of genomics and proteomics is to determine
    if a particular sequence is like another sequence. This is
    accomplished by comparing the new sequence with sequences
    that have already been reported and stored in a database.
   This process is principally one that uses alignment procedures
    to uncover the “like” sequence in the database.
   • The alignment process will uncover those regions that are
    identical or closely similar and those regions with little (or
    no) similarity.
   Two alignment types are used: global and local.
                             BLAST
   BLAST stands for Basic Local Alignment Search Tool
   BLAST was developed by Stephen Altschul, Warren
    Gish, Webb Miller, Eugene Myers, and David J. Lipman at
    NCBI in 1990.
   It is a local alignment tool.
   It helps to find regions of local similarity between sequences.
   • It is a program that compares nucleotide or protein sequences to sequence
    databases and calculates the statistical significance of matches.
   BLAST can be used to infer functional and evolutionary
    relationships between sequences as well as help identify members of
    gene families.
NCBI HOMEPAGE
NCBI-BLAST HOMEPAGE
     TYPES OF BLAST
   Query type             Programs
   Amino acid sequence    blastp, tblastn
   DNA sequence           blastn, blastx, tblastx
          STEPS
1. Specifying a sequence of interest
2. Selecting a BLAST program
3. Selecting a database
4. Selecting optional parameters
5. Selecting formatting parameters
                          PROCESS
   The first step of the BLAST algorithm is to break the query
    into short words of a specific length.
   • For example, twelve amino acids near the amino terminal of the
    Arabidopsis thaliana protein phosphoglucomutase sequence are:
                             NYLENFVQATFN
   • This sequence is broken down into three-character words by
    selecting the first three amino acid characters, then sliding the
    window forward one residue at a time.
             NYL YLE LEN ENF NFV FVQ VQA QAT ATF TFN
   These words are then compared against a sequence in a
    database.
   For example, word match with rabbit muscle phosphoglucomutase:
    Query ENF
    Subject SSTNYAENTIQSIISTVEPAQR
   • This search is performed for all words. Those words whose T
    value is greater than 18 are used as seeds to extend the
    alignment.
   For every pair of sequences (query and target) that have a
    word or words in common, BLAST extends the alignment in
    both directions to find alignments that score greater (are more
    similar) until the alignment score decreases in value.
   For example, consider the following alignment between the A. thaliana
    and rabbit muscle phosphoglucomutase:
    Query NYLENFVQATFNALTAEKV
          NY ENF+Q +       + +      +
    Subject NYAENTIQSIISTVEPAQR
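The extension step can be sketched as follows. This is a simplified illustration, not the real BLAST implementation: it scores with a plain match/mismatch scheme instead of a substitution matrix such as BLOSUM62, allows no gaps, and stops extending once the running score has fallen `x_drop` below the best score seen. The function name, the scores and the `x_drop` value are all assumptions chosen for the example.

```python
def extend_hit(query, subject, q_pos, s_pos, word_size,
               match=1, mismatch=-1, x_drop=2):
    """Extend an exact word hit in both directions without gaps.

    Returns (best_score, q_start, q_end), with q_end exclusive.
    """
    def pair_score(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    best = sum(pair_score(query[q_pos + k], subject[s_pos + k])
               for k in range(word_size))
    q_start, q_end = q_pos, q_pos + word_size

    # Extend to the right, remembering the best-scoring end point.
    running, i, j = best, q_pos + word_size, s_pos + word_size
    while i < len(query) and j < len(subject):
        running += pair_score(query[i], subject[j])
        i, j = i + 1, j + 1
        if running > best:
            best, q_end = running, i
        elif best - running >= x_drop:
            break

    # Extend to the left from the seed start in the same way.
    running, i, j = best, q_pos - 1, s_pos - 1
    while i >= 0 and j >= 0:
        running += pair_score(query[i], subject[j])
        if running > best:
            best, q_start = running, i
        elif best - running >= x_drop:
            break
        i, j = i - 1, j - 1

    return best, q_start, q_end


# Toy DNA example: the seed "ACG" sits at position 2 in both sequences.
score, start, end = extend_hit("GGACGTTT", "CCACGTAA", 2, 2, 3)
print(score, "GGACGTTT"[start:end])
```

The segment returned is the highest-scoring ungapped stretch around the seed, which corresponds to the idea of a high-scoring segment pair.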
   Once this alignment process is completed for a query and each
    subject sequence in the database, a report is generated. This
    report provides a list of those alignments (default size of 50)
    with a value greater than the S cutoff value.
   Each alignment whose score is above the cutoff is called a
    High-scoring Segment Pair (HSP).
   For each alignment reported, an Expect (E) value is given.
                 BLAST OUTPUT
    The blast output is basically displayed in three ways or
     formats.
A.    Graphical display: shows where the query is similar to other
      sequences.
B.    Hit list: number of sequences similar to query, ranked by
      similarity.
C.    Alignment: every alignment between the query and the
      reported hits.
A. GRAPHICAL DISPLAY
          •    The query sequence is at the top,
              with a colour key for alignment
              scores.
          •    Each bar represents the portion
              of another sequence that is
              similar to your query sequence:
              Red bars: highest-scoring, most similar
              sequences (score 200 or more).
              Pink bars: good matches (80-200).
              Green bars: moderate matches (50-80).
              Blue bars: poor matches (40-50).
              Black bars: lowest-scoring hits (below 40).
                                 B. HIT LIST
   1 - This portion of each description links to the sequence record for a particular hit.
   2 - Score or bit score is a value calculated from the number of gaps and substitutions
    associated with each aligned sequence. The higher the score, the more significant the
    alignment.
   3 - E Value (Expect Value) is the number of alignments with an equal or better score
    expected to occur in the database search by chance. The smaller the E Value, the more
    significant the alignment.
   4 - These links provide the user with direct access from BLAST results to related
    entries in other databases. 'L' links to LocusLink records and 'S' links to structure
    records in NCBI's Molecular Modeling Database (MMDB).
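The link between the score and the E value in the hit list can be made concrete with the standard Karlin-Altschul formulas: E = K * m * n * exp(-lambda * S) for a raw score S, and bit score S' = (lambda * S - ln K) / ln 2, so that E = m * n * 2^(-S'). The sketch below only illustrates these formulas; the values used for `lam`, `k` and the sequence lengths are hypothetical, since the real parameters depend on the scoring system and database.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance hits scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)


# Hypothetical statistical parameters and search-space sizes.
lam, k = 0.3, 0.1
bits = bit_score(100, lam, k)
print(round(bits, 1), expect_value(bits, query_len=250, db_len=1_000_000))
```

Each additional bit of score halves the E value, which is why higher scores and smaller E values go together in the hit list.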
C. ALIGNMENT
               APPLICATIONS
•   BLAST can be used for
    several purposes. These
    include:
   Identifying Species
   Establishing Phylogeny
   DNA Mapping
   Locating Domains
MULTIPLE SEQUENCE
  ALIGNMENT
              Module 4
Topic: Motifs and Patterns, PROSITE,
   Hidden Markov Models (HMMs)
                    Motifs
• Defined as a nucleotide or amino acid
 sequence pattern that is widespread and
 is associated with a biological function.
  –   A sequence motif is not the same as a structural motif.
  –   A sequence motif residing in the coding
      region may encode a structural motif.
  –   Non-coding nucleotide motifs may have a
      regulatory role, for example as recognition
      sites for DNA-binding proteins.
    Motifs, profiles and patterns
• Conserved region of a DNA or protein –
  Motif
• Qualitative expression of a motif – Pattern
  – Regular Expression
  – C[TA]TTG{X}
• Quantitative expression of a motif –
  Profile
  – Position Specific Scoring Matrices (PSSMs)
  – Weight matrices
Motifs/Patterns
N{P}[ST]{P}
[FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY]
[] -> or (probability information is lost)
{} -> not
() -> repeated
^ -> beginning
x -> any residue
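One way to see how such patterns work is to translate them into ordinary regular expressions. The sketch below handles only the simplified notation listed on this slide ([] for alternatives, {} for exclusion, () for repetition, ^ for the beginning, lowercase x for any residue); it is not a full parser for the official PROSITE syntax, and the helper names `pattern_to_regex` and `find_matches` are made up for this example.

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Translate the slide's motif notation into a Python regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "{":                       # {P} -> any residue except P
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j] + "]")
            i = j + 1
        elif ch == "(":                     # (3) -> repeat previous element 3 times
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j + 1
        elif ch == "x":                     # x -> any residue
            out.append(".")
            i += 1
        else:                               # literals, [..] and ^ pass through unchanged
            out.append(ch)
            i += 1
    return "".join(out)


def find_matches(pattern: str, sequence: str):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(pattern_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


print(pattern_to_regex("N{P}[ST]{P}"))
print(find_matches("N{P}[ST]{P}", "AANGSAANPSA"))
```

Because a pattern either matches or does not, it carries no information about how often each residue occurs at a position; that is the information a profile keeps.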
                        Profiles
• Quantitative representation.
• More useful for training
  dataset.
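  Aligned sequences:
  TCTAGAAGATGGCAGTGGCGAAGA
  TCTAGAAAATGACAGTGGCGAAGA
  TCTAGAAAATGGCAGTAGCGAAGA
  TCTACTAAATGATAGTAGCGAAGA

  Frequency (%) of each base at the first eight positions:
  Position    1    2    3    4    5    6    7    8
  A           0    0    0  100    0   75  100   75
  C           0  100    0    0   25    0    0    0
  G           0    0    0    0   75    0    0   25
  T         100    0  100    0    0   25    0    0
  Positions 9-11 (ATG) are identical in all four sequences.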
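A profile of this kind can be computed directly from the aligned sequences. The sketch below counts each base per column and converts the counts to percentages; it uses the four sequences from this slide, with the fourth written without internal spaces. The function name `position_frequencies` is an assumption for this example, and real profile tools usually go further (pseudocounts, log-odds scores).

```python
from collections import Counter


def position_frequencies(seqs, alphabet="ACGT"):
    """Return {base: [percent at column 1, column 2, ...]} for aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    table = {base: [] for base in alphabet}
    for column in zip(*seqs):               # walk the alignment column by column
        counts = Counter(column)
        for base in alphabet:
            table[base].append(100 * counts[base] / len(seqs))
    return table


seqs = [
    "TCTAGAAGATGGCAGTGGCGAAGA",
    "TCTAGAAAATGACAGTGGCGAAGA",
    "TCTAGAAAATGGCAGTAGCGAAGA",
    "TCTACTAAATGATAGTAGCGAAGA",
]
profile = position_frequencies(seqs)
for base in "ACGT":
    print(base, profile[base][:8])
```

Unlike a regular-expression pattern, the table records how strongly each base is preferred at each position.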
   De novo prediction of Motifs
• MEME, EXTREME, AlignACE, Amadeus,
  CisModule, FIRE, Gibbs Motif Sampler,
  PhyloGibbs, SeSiMCMC, ChIPMunk,
  Weeder, SCOPE, MotifVoter and
  MProfiler
MEME (Multiple Expectation Maximization
 for Motif Elicitation)
                                                  Figure 3.
                                                  Resources
MacIsaac KD, Fraenkel E (2006) Practical Strategies for Discovering Regulatory DNA Sequence Motifs. PLoS Comput
Biol 2(4): e36. doi:10.1371/journal.pcbi.0020036
http://journals.plos.org/ploscompbiol/article?id=info:doi/10.1371/journal.pcbi.0020036
MRLSFVPLLQLSRLVVSTQHSTKMSTVYRTCKMNEIALSLLAPTQPLDADQ
GVMSPMASSDQ
TTSIGDFRFLRTHHDKEERGLLVTSLTKGLAETSFPYR
YTSMCATICSITHSRADAAPAKQAH
What is Pattern Recognition?
  • A technique to identify interesting patterns of events, such as amino acids,
  nucleotides or gene expression levels, that appear a number of times in a
  particular set of data.
Pattern Recognition in Molecular Biology
• Human Genome Project
• Protein analysis
• Gene Expression & DNA Microarray Analysis
• Drug Discovery
 Pattern Discovery in Proteins
• Three main steps
     - Proteins related to a query sequence are found by searching the database for
       similar sequences.
     - Sequences revealed by this initial screen are then used as query sequences to
       search for other family members.
     - This process is repeated until no new family members are found.
Tandem Repeats
• These are two or more contiguous, approximate copies of a pattern of nucleotides.
• These duplications occur as a result of mutational events in which an original segment
  of DNA, the pattern, is converted into a sequence of individual copies.
• They have been linked to a number of different diseases.
• They might play a role in gene regulation and in the development of immune system
  cells.
Types of Patterns
   Deterministic
      Matches a given string or not.
   Probabilistic
     Each sequence is assigned the probability that
     it was generated by a model.
     The higher the probability, the better the
     match between sequence and pattern.
                    PROSITE
PROSITE is a protein database. It consists of entries describing
protein families, domains and functional sites, as well as the
amino acid patterns, signatures and profiles in them. The entries
are manually curated by a team at the Swiss Institute of
Bioinformatics and tightly integrated into Swiss-Prot protein annotation.
PROSITE was created in 1988 by Amos Bairoch, who directed the
group for more than 20 years. Since July 2009 the director of the
PROSITE, Swiss-Prot and Vital-IT groups is Ioannis
Xenarios.
PROSITE's uses include identifying possible functions of newly
discovered proteins and analysing known proteins for
previously undetermined activity. Properties from well-studied
genes can be propagated to biologically related organisms, and
for different or poorly known genes biochemical functions can be
predicted from similarities.
PROSITE offers tools for protein sequence analysis and motif
detection. It is part of the ExPASy proteomics analysis servers.
        HMM-BASED TOOLS
• GENSCAN (Burge 1997)
• FGENESH (Solovyev 1997)
• HMMgene (Krogh 1997)
• GENIE (Kulp 1996)
• GENMARK (Borodovsky & McIninch 1993)
• VEIL (Henderson, Salzberg, & Fasman 1997)
         Module 4
Topic: Phylogenetic analysis
           What is Phylogenetic Tree?
• A branching diagram
• Showing the inferred evolutionary relationships among
  various biological species
• Based upon similarities and differences in their physical or
  genetic characteristics
• Each node with descendants represents the inferred most
  recent common ancestor of the descendants
                              History
•    Early representations of "branching"
    phylogenetic trees include a "paleontological
    chart" showing the geological relationships
    among plants and animals in the
    book Elementary Geology, by Edward Hitchcock
    in 1840.
• Charles Darwin in 1859 also produced one of the
  first illustrations and crucially popularized the
  notion of an evolutionary "tree" in his seminal
  book On the Origin of Species.
What does this tree look like?
 What do the lines represent?
PHYLOGENETIC TREE
Phylogeny is the evolutionary history of a
 kind of organism.
In phylogenetic studies, the most convenient
 way to study the evolutionary relationships
 among a group of organisms is to
 draw a phylogenetic tree.
DEFINITION – A phylogenetic tree is a
 two-dimensional graph showing evolutionary
 relationships between organisms, or genes
 from various organisms.
Characteristics:
Nodes can be internal or external.
Each internal node represents the last common
 ancestor of the two lineages descending from it.
External nodes (also termed terminal nodes or
 leaves) represent the tips of the tree.
Nodes correspond to species, organisms or
 sequences.
Similarly, branches can be internal or external.
Internal branches or internodes connect two
 nodes, whereas external branches connect a tip
 and a node.
The branches of a phylogenetic tree can be either:
              - Scaled
              - Unscaled
In scaled branches, the lengths are
 proportional to the amount of evolutionary change.
 Example - phylogram.
In unscaled branches, the branch length is
 not proportional to the number of changes.
  Example - cladogram.
When constructing phylogenetic trees, researchers identify
 homologous features that are shared by some species
 but not by others.
This allows them to group species based on their shared
 characteristics.
Historically, comparisons of morphological similarities and
 differences have been used to construct evolutionary trees.
In this approach, species that share certain characteristics
 (i.e., homologous traits) tend to be placed closer together on
 the tree.
In 1963, Linus Pauling and Emile Zuckerkandl were the first
 to suggest the use of molecular data to establish
 evolutionary relationships.
 When comparing homologous genes in different species,
  the DNA sequences from closely related species are more
   similar to each other than are the sequences from
    distantly related species.
Phylogenetic tree based on homology
  Phylogenetic trees are now based on homology, which
   refers to similarities among various species that occur
   because the species are derived from a common
   ancestor.
   Attributes that are the result of homology are said to
   be homologous.
  Phylogenetic tree reconstruction
 Phylogenetic trees are constructed:
   - To reconstruct the evolutionary past.
   - To develop an understanding of when and
     which speciation events may have occurred to
     give rise to the organisms observed today.
A phylogenetic analysis consists of four steps:
 SEQUENCE ALIGNMENT: Sequence
     alignment is the essential preliminary to
     tree reconstruction. The data used in the
     reconstruction of a DNA-based phylogenetic
     tree are obtained by comparing nucleotide
     sequences. These comparisons are made by
     aligning the sequences so that nucleotide
     differences can be scored.
 DETERMINING THE SUBSTITUTION
  MODEL
 TREE BUILDING
 TREE EVALUATION
     Construction of phylogenetic tree
Two types of method:
1. Character-based methods
   A. Maximum parsimony
   B. Maximum likelihood
2. Distance-based methods
      Character based method :
These methods are also called discrete
  methods and are based directly on the
  sequence characters rather than on pairwise
  distances.
The two most popular character-based
  methods are:
1. MAXIMUM PARSIMONY
2. MAXIMUM LIKELIHOOD
              Maximum parsimony
Parsimony is one of the pioneering
 methods of phylogeny construction.
Parsimony groups taxa together in a way that
 minimizes the number of changes.
It assumes that the best hypothesis is the one
 that requires the fewest
 evolutionary changes; hence it is also called
 the minimum evolution method.
It also states that the preferred hypothesis is
 the one that is simplest.
EXAMPLE: If two species possess a tail, then
 there are two hypotheses:
The first assumes that a tail arose once during
 evolution and that both species have descended
 from a common ancestor with a tail.
The second assumes that tails arose
 twice during evolution and that the tails in the
 two species are not due to descent from a
 common ancestor.
The first hypothesis is the simpler one and is
 therefore preferred.
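The idea of counting changes can be sketched with Fitch's small-parsimony algorithm, a standard way to find the minimum number of changes a single character needs on a given tree. The slide does not name this algorithm; it is used here only as an illustration. The trees are written as nested tuples, and the species names A to D are hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for a rooted binary tree.

    `tree` is either a leaf name (str) or a pair (left_subtree, right_subtree).
    `states` maps each leaf name to its observed character state.
    """
    if isinstance(tree, str):                 # leaf: its state is known, no change yet
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:                                # children agree: no extra change needed
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical data: species A and B have a tail, C and D do not.
states = {"A": "tail", "B": "tail", "C": "no tail", "D": "no tail"}
tree_1 = (("A", "B"), ("C", "D"))   # tailed species grouped together
tree_2 = (("A", "C"), ("B", "D"))   # tailed species separated

print(fitch(tree_1, states)[1], fitch(tree_2, states)[1])
```

Grouping the two tailed species needs one change, while separating them needs two, so parsimony prefers the first tree, in line with the tail example above.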
    Maximum likelihood approach
The maximum likelihood method presents
 an additional opportunity to evaluate trees
 with variations in mutation rates in
 different lineages.
The method can be used to explore
 relationships among more diverse sequences
 and conditions that are not well handled by
 maximum parsimony methods.
     Distance based method :
Distance methods are based on the amount of
 dissimilarity (distance) between two aligned
 sequences.
Such methods remain important when using
 fossil data to build phylogenies for extinct
 species; for living species it is more common
 to use DNA sequences from the two species.
These methods assume that all sequences involved
 are homologous and that tree branches are
 additive, meaning the distance between
 two taxa equals the sum of all branch
 lengths connecting them.
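A simple distance of this kind is the p-distance: the proportion of aligned positions at which two sequences differ. The sketch below computes it for every pair of sequences, skipping positions where either sequence has a gap. The sequence names and data are hypothetical, and real analyses usually apply a substitution model to correct for multiple changes at the same site.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences (gaps skipped)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned: dict) -> dict:
    """Return {(name1, name2): p-distance} for every pair of sequences."""
    return {(m, n): p_distance(aligned[m], aligned[n])
            for m, n in combinations(aligned, 2)}


# Hypothetical aligned sequences.
aligned = {"sp1": "ACGTACGT", "sp2": "ACGTACGA", "sp3": "ACGTTCGA"}
for pair, d in distance_matrix(aligned).items():
    print(pair, d)
```

A tree-building method such as neighbour joining or UPGMA would then take this matrix as its input.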
Limitations Of Phylogenetic tree
           - Limitations
1. A tree may not represent the true
   evolutionary history accurately
2. The data used can be noisy
3. Basing a tree on a single type
   of character is problematic
4. Homoplasy would be unlikely
   from natural selection
5. Branch length does not necessarily
   represent the time passed
         - Fields of study
1.   Cladistics
2.   Comparative phylogenetics
3.   Computational phylogenetics
4.   Evolutionary taxonomy
5.   Evolutionary biology
6.   Phylogenetics
Applications:
• Find out the evolutionary history of organisms.
• Measure phylogenetic diversity using
  phylogenetic trees.
• Search for natural products.
• Trace the evolutionary histories of infectious
  bacteria and viruses.
• Find out what trends taxa have undergone in their
  history.
• Guide the search for new species.
• Find out how species spread geographically
  during their evolution.
• Tell us when and where taxa originated.
       Module 4
Topic: Clustal, PHYLIP &
    Bootstrapping
Clustal Omega
• Purpose: Clustal Omega is a widely used tool for performing multiple
  sequence alignments (MSA). It aligns protein or nucleotide sequences to
  identify conserved regions and evolutionary relationships.
• How It Works:
  - Uses a guide tree to align sequences progressively.
  - Employs Hidden Markov Models (HMMs) for greater accuracy in large
    datasets.
• Applications:
  - Comparative genomics.
  - Identifying functional domains.
  - Evolutionary and phylogenetic studies.
• Advantages: Fast and scalable, handling thousands of sequences efficiently.
   Phylogenetic Analysis using PHYLIP - Unrooted trees
Theory :
• PHYLIP is a complete phylogenetic analysis package which was
  developed by Joseph Felsenstein at the University of Washington.
• PHYLIP is used to find the evolutionary relationships between
  different organisms. Some of the methods available in this
  package are maximum parsimony method, distance matrix
  and likelihood methods.
• The data is presented to the program from a text file, which is
  prepared by the user using a common text editor or word
  processor. Some sequence analysis programs, such
  as ClustalW, can write data files in PHYLIP format.
• Most of the programs look for the input file called "infile"; if
  they do not find this file, they ask the user to type in the
  name of the data file.
• Phylogenetic analysis: Analyzes the evolutionary
  relationships between different organisms. This analysis
  helps to find out the changes that occurred in organisms
  during evolution.
• Bootstrapping: A way to test the reliability of a dataset.
• Query: The input given by the user. This can be
  either a protein or nucleotide sequence.
• Rooted tree: A tree which has a special node, called the root,
  as its main node. A tree without a root is treated as a
  free tree.
• Tree topology: Tree topology refers to the branching arrangement of a
  phylogenetic tree.
PHYLIP file format :
• The first line of the input file gives the number of
  sequences and the number of characters (nucleotides or amino acids) in each.
• Each sequence name is 10 characters long. Spaces can be
  added to the end of shorter names to pad them to this
  length.
• Gaps can be represented as '-'.
• Missing data can be represented as '?'.
• Spaces within the aligned sequences are allowed, usually
  after every 10 bases.
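The layout described above can be illustrated with a short Python sketch that writes aligned sequences in a sequential PHYLIP-style file: a header line with the number of sequences and sites, then one line per sequence with a 10-character name followed by the data in blocks of 10. The function name `to_phylip` and the example sequences are hypothetical; names longer than 10 characters are simply truncated here, which could produce duplicates in real data.

```python
def to_phylip(aligned: dict[str, str]) -> str:
    """Format aligned sequences as sequential PHYLIP text."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()

    lines = [f"{len(aligned)} {n_sites}"]          # header: sequences, sites
    for name, seq in aligned.items():
        label = name[:10].ljust(10)                # name padded/truncated to 10 characters
        blocks = " ".join(seq[i:i + 10] for i in range(0, n_sites, 10))
        lines.append(label + blocks)
    return "\n".join(lines) + "\n"


# Hypothetical alignment with a gap ('-') and a missing character ('?').
example = {"seqA": "ACGTACGTACGT", "seq_B_long_name": "ACG-ACGTAC?T"}
print(to_phylip(example), end="")
```

Saving this text under the file name "infile" in the working directory is what most PHYLIP programs expect by default.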
Methods involved in PHYLIP:
1. Maximum parsimony method
2. Distance method
3. Maximum likelihood method
• Maximum parsimony method: It is a character-based method
  which infers a phylogenetic tree by minimizing the total number of
  evolutionary steps or total tree length for a given set of data. It is
  also referred to as sequence based tree reconstruction method.
• Distance methods: Evolutionary distances are calculated for all pairs of
  operational taxonomic units, and a tree is built in which the distances
  between the operational taxonomic units match these distances as closely as possible.
• Maximum likelihood method: Uses a model of sequence
  evolution to find the tree that gives the highest likelihood of the
  observed data.
Programs used in PHYLIP :
• The following are some of the programs available in the PHYLIP package.
• Dnapars: Estimates the phylogeny from nucleic acid sequences using the parsimony method.
• Dnamove: An interactive program for constructing phylogenies from nucleic acid sequences
  using the parsimony method.
• Dnapenny: Finds the most parsimonious phylogenies for nucleic acid sequences using the branch and
  bound method.
• Dnacomp: Estimates the phylogeny of nucleic acid sequences using the compatibility criterion, which
  searches for the largest number of sites that could have uniquely evolved on the same tree.
• Dnainvar: For nucleic acid sequences, computes phylogenetic invariants that test alternative tree
  topologies. The program also tabulates the frequencies of occurrence of the different nucleotide patterns.
• Dnaml: Estimates phylogenies from nucleotide sequences by the maximum likelihood method without
  assuming a molecular clock. A molecular clock assumes a constant rate of change, which allows the
  timing of evolutionary events to be calculated.
• Dnamlk: Estimates the phylogeny using the maximum likelihood method, assuming a molecular clock.
  Bootstrap analysis
• It involves resampling one's own data, with replacement, to create a series of
  bootstrap samples of the same size as the original data.
• In the case of nucleic acid (amino acid) sequences, the resampled data are the
  nucleotide (amino acid) positions of the alignment, while the statistical significance of a
  specific cluster is given by the fraction of trees, based on the resampled data,
  containing that cluster.
• Bootstrapping can be considered a two-step process comprising the
  generation of (many) new data sets from the original set and the
  computation of a number that gives the proportion of times that a particular
  branch (e.g., a taxon) appeared in the tree. That number is commonly
  referred to as the bootstrap value.
Bootstrapping
• Purpose: A statistical technique to measure the confidence of phylogenetic
  tree branches.
• How It Works:
  - Resamples the original data with replacement
    to create multiple datasets.
  - Builds a tree for each resampled dataset.
  - Computes the frequency (bootstrap value) of branches across all trees.
• Applications:
  - Validating phylogenetic trees.
  - Ensuring reliability of inferred evolutionary relationships.
• Interpreting Bootstrap Values:
  - Higher values (e.g., >70%) indicate strong
    support for a branch.
  - Lower values suggest weak or uncertain relationships.
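The bootstrap procedure can be sketched in Python as follows. Alignment columns are resampled with replacement, a grouping is inferred from each resampled alignment, and the percentage of replicates containing each group is reported. To keep the example self-contained, the "tree building" step is replaced by a toy function, `closest_pair`, which just reports the two most similar sequences; in practice a real tree-building program would be run at that point. All function names and the example data are assumptions made for this illustration.

```python
import random
from collections import Counter


def bootstrap_alignment(aligned, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(aligned.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in aligned.items()}


def closest_pair(sample):
    """Toy stand-in for tree building: the pair of sequences with fewest differences."""
    names = sorted(sample)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    best = min(pairs, key=lambda p: sum(x != y for x, y in zip(sample[p[0]], sample[p[1]])))
    return [frozenset(best)]


def bootstrap_support(aligned, build_clades, replicates=100, seed=0):
    """Return {clade: percent of replicates in which the clade was recovered}."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        counts.update(build_clades(bootstrap_alignment(aligned, rng)))
    return {clade: 100 * n / replicates for clade, n in counts.items()}


# Hypothetical alignment: A and B are identical, C is very different.
aligned = {"A": "AAAAAAAA", "B": "AAAAAAAA", "C": "CCCCCCCC"}
support = bootstrap_support(aligned, closest_pair, replicates=50)
print(support)
```

With real data the support values vary between groups, and the >70% guideline above is applied to decide which branches are well supported.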