Biological Databases
Dr. Upeksha Ganegoda
    Department of Computational Mathematics
1
    Outline
     Different types of biological networks
     Database Structures
     Biological database types based on content
2
    Biological Networks
     Protein-protein interaction network
     Metabolic network
     Gene regulatory network
     RNA network
3
    Protein-protein interaction network
     A protein can interact with another protein to build a
      protein complex or to activate it. A protein-protein
      interaction (PPI) network shows which proteins interact
      with each other and how.
     Each node represents a protein; each arc represents the
      interaction between two proteins.
     Can use different types of graph algorithms to identify:
         Protein complexes
         Protein functions
         Protein Hubs
         etc
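     As a rough illustration of the graph algorithms listed above, the
     sketch below treats a hub as a protein with many interaction
     partners (high degree) and uses connected components as a very
     crude stand-in for complexes. Both choices are simplifying
     assumptions, real complex detection usually looks for densely
     connected subgraphs, and the protein names and the min_degree
     threshold are hypothetical.

```python
from collections import defaultdict


def build_ppi(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def find_hubs(adj, min_degree):
    """Return proteins with at least min_degree partners, sorted by name."""
    return sorted(p for p, nbrs in adj.items() if len(nbrs) >= min_degree)


def connected_components(adj):
    """Group proteins into connected components with a depth-first search."""
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.add(node)
            stack.extend(adj[node] - seen)
        components.append(sorted(comp))
    return components


# Hypothetical toy network; the protein names are placeholders.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P5", "P6")]
ppi = build_ppi(edges)
hubs = find_hubs(ppi, min_degree=3)
groups = connected_components(ppi)
print(hubs)    # ['P1']
print(groups)  # [['P1', 'P2', 'P3', 'P4'], ['P5', 'P6']]
```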
4
    Protein Hub?
    Protein complexes?
5
    Metabolic network
     Metabolic networks give in-depth insight into the molecular
      mechanisms of a particular organism. They link the
      genome to molecular physiology and are the most
      comprehensive of all biological networks.
      Ex: Databases such as the Kyoto Encyclopedia of Genes and
      Genomes (KEGG) and the Biochemical, Genetic and
      Genomic knowledge base (BiGG) contain the metabolic
      networks of a wide range of species.
6
    Gene-regulatory network
     It is a common type of regulatory network.
     A gene regulatory network consists of DNA segments in a cell
      that interact with each other indirectly (through their
      RNA and protein expression products) and with other
      substances in the cell to control the expression levels of
      mRNA and proteins.
7
    RNA networks
     RNA networks show RNA-RNA or RNA-DNA interactions. To
      study the role of microRNAs in disease, researchers can
      construct microRNA-gene networks using predicted
      microRNA targets available in public databases such as
      TargetScan, PicTar, microRNA.org, miRBase and miRDB.
8
    Representation as a network
    A network is a graph G = (V, E, w), where V is the set of nodes
    (e.g. proteins), E is the set of interactions and w gives the
    weight of each interaction.
    A network can be constructed as:
       Directed graph. Ex: gene regulatory network
       Undirected graph. Ex: PPI network
    [Figure: a directed graph on nodes r1, r2, g1, g2 and an undirected graph on protein nodes labelled p1]
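    One possible way to hold G = (V, E, w) in memory is an adjacency
    map from each node to its neighbours and edge weights. The sketch
    below builds one directed and one undirected graph this way. The
    function name make_graph, the node names p2 and p3, and all edge
    weights are illustrative assumptions, not values from the slide.

```python
def make_graph(vertices, weighted_edges, directed):
    """Return an adjacency map {u: {v: w}} for G = (V, E, w)."""
    adj = {v: {} for v in vertices}
    for u, v, w in weighted_edges:
        adj[u][v] = w
        if not directed:
            adj[v][u] = w  # undirected: store the edge in both directions
    return adj


# Directed example: regulators r1, r2 act on genes g1, g2 (weights are made up).
grn = make_graph(
    ["r1", "r2", "g1", "g2"],
    [("r1", "g1", 1.0), ("r1", "g2", 0.5), ("r2", "g2", 0.8)],
    directed=True,
)

# Undirected example: interactions among three proteins (weights are made up).
ppi = make_graph(
    ["p1", "p2", "p3"],
    [("p1", "p2", 0.9), ("p2", "p3", 0.4)],
    directed=False,
)

print(grn["r1"])  # {'g1': 1.0, 'g2': 0.5}
print(grn["g1"])  # {} because the regulation edge points only from r1 to g1
print(ppi["p2"])  # {'p1': 0.9, 'p3': 0.4}
```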
9
       Main functions of biological databases
      Make biological data available to scientists.
       As much as possible of a particular type of information should
       be available in one single place (book, site, database). Published
       data may be difficult to find or access, and collecting it from
       the literature is very time-consuming. Moreover, not all data is
       published explicitly in an article (e.g. genome sequences).
      Make biological data available in computer-readable form.
       Since analysis of biological data almost always involves
       computers, having the data in computer-readable form (rather
       than printed on paper) is a necessary first step.
10
         What is a database?
      How can data be stored...
       Flat-file format, with fields separated by some delimiter
       Nancy|Dengler|Botany|University of Toronto|25 Willocks St, Toronto, ON. M5S 3B2
       Peter|Lewis|Dept. of Biochemistry|Uni. Toronto|1 King’s College Circle, Toronto, ON. M5S 1A8
       John|Coleman|Department of Botany|University of Toronto|25 Willcocks St, Toronto, ON. M5S 3B2
       John|Coleman|Dept. of Biology|York University|4700 Keele St, Toronto, ON. M3J 1P3
       These data could also be stored in a spreadsheet
       What are the problems with this sort of database?
       Relational Databases offer a solution...
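       To make the problems concrete, the sketch below reads the four
       records above with Python's csv module and lists the distinct
       spellings it finds. The field names in FIELDS are assumptions
       chosen for illustration; the flat file itself carries no
       column names.

```python
import csv
import io

# Column names are an assumption; the flat file has no header row.
FIELDS = ["first_name", "last_name", "department", "institution", "address"]

# The four records from the slide, exactly as written.
FLAT_FILE = """\
Nancy|Dengler|Botany|University of Toronto|25 Willocks St, Toronto, ON. M5S 3B2
Peter|Lewis|Dept. of Biochemistry|Uni. Toronto|1 King’s College Circle, Toronto, ON. M5S 1A8
John|Coleman|Department of Botany|University of Toronto|25 Willcocks St, Toronto, ON. M5S 3B2
John|Coleman|Dept. of Biology|York University|4700 Keele St, Toronto, ON. M3J 1P3
"""


def read_records(text):
    """Parse pipe-delimited lines into a list of dictionaries."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    return [dict(zip(FIELDS, row)) for row in reader if row]


records = read_records(FLAT_FILE)
institutions = sorted({r["institution"] for r in records})
departments = sorted({r["department"] for r in records})
print(institutions)
print(departments)
```

       The same university appears under two spellings and the same
       department under two names, so a simple text match cannot tell
       which records belong together.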
11
     Database structures
      Flat files
      Relational
      Object oriented
12
        Relational database
      Nancy|Dengler|Botany|University of Toronto|25 Willocks St, Toronto, ON. M5S 3B2
      Peter|Lewis|Dept. of Biochemistry|Uni. Toronto|1 King’s College Circle, Toronto, ON. M5S 1A8
      John|Coleman|Department of Botany|University of Toronto|25 Willcocks St, Toronto, ON. M5S 3B2
      John|Coleman|Dept. of Biology|York University|4700 Keele St, Toronto, ON. M3J 1P3
      A relational database consists of relations (tables) containing attributes (fields or
      columns). Each row in a table is known as a tuple or a record. Information should be
      ‘normalized’ so that it is non-redundant. This means that every row should be unique,
      although this ideal is not always observed.
      Table 'Professors'
      Professor_id   First_name   Last_name   Contact_id
      1              Nancy        Dengler     1
      2              Peter        Lewis       2
      3              John         Coleman     1
      4              John         Coleman     3

      Table 'Contacts'
      Contact_id   Institution             Department              Address
      1            University of Toronto   Dept. of Botany         25 Willocks St, Toronto, ON. M5S 3B2
      2            Uni. Toronto            Dept. of Biochemistry   1 King’s College Circle, Toronto, ON. M5S 1A8
      3            York University         Dept. of Biology        4700 Keele St, Toronto, ON. M3J 1P3
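      The two tables can be tried out with Python's built-in sqlite3
      module. This is a minimal sketch: the lowercase table and column
      names and the NOT NULL and REFERENCES constraints are assumptions
      added for illustration. The join shows how each professor is
      linked to a contact row through Contact_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (
    contact_id  INTEGER PRIMARY KEY,
    institution TEXT NOT NULL,
    department  TEXT NOT NULL,
    address     TEXT NOT NULL
);
CREATE TABLE professors (
    professor_id INTEGER PRIMARY KEY,
    first_name   TEXT NOT NULL,
    last_name    TEXT NOT NULL,
    contact_id   INTEGER NOT NULL REFERENCES contacts(contact_id)
);
""")

conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?, ?)",
    [
        (1, "University of Toronto", "Dept. of Botany",
         "25 Willocks St, Toronto, ON. M5S 3B2"),
        (2, "Uni. Toronto", "Dept. of Biochemistry",
         "1 King’s College Circle, Toronto, ON. M5S 1A8"),
        (3, "York University", "Dept. of Biology",
         "4700 Keele St, Toronto, ON. M3J 1P3"),
    ],
)
conn.executemany(
    "INSERT INTO professors VALUES (?, ?, ?, ?)",
    [
        (1, "Nancy", "Dengler", 1),
        (2, "Peter", "Lewis", 2),
        (3, "John", "Coleman", 1),
        (4, "John", "Coleman", 3),
    ],
)

# Join each professor to the shared contact row it points to.
rows = conn.execute("""
    SELECT p.first_name, p.last_name, c.institution, c.department
    FROM professors AS p
    JOIN contacts AS c ON c.contact_id = p.contact_id
    ORDER BY p.professor_id
""").fetchall()

for row in rows:
    print(row)
```

      Because professors 1 and 3 share contact row 1, the Botany
      address is stored once and can be corrected in one place.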
13
       Different Database Types
      Primary databases
       Contain original biological data. Ex. Raw nucleic acid sequence data from
       GenBank, the EMBL database, and the DNA Data Bank of Japan (DDBJ).
      Secondary databases
        Contain computationally processed or manually curated information based
        on original information from primary databases. Ex. SWISS-PROT, TrEMBL
        (contains translated nucleic acid sequences), PIR (contains annotated protein
        sequences).
      Specialized databases
         Cater to a particular research interest. Ex. FlyBase, WormBase,
          AceDB, and TAIR.
15
     Pitfalls of biological databases
      Overreliance on sequence information without understanding
       the reliability of the information.
      High level of redundancy
      Annotations of genes can occasionally be false or incomplete.
16
     Accession codes, identifiers
      Many of the biological databases (GenBank, UNIPROT etc.)
       have two (or more!) different ways of identifying a given
       entry:
       • Identifier
       • Accession code (or number)
17
      Identifier
       An identifier ("locus" in GenBank, "entry name" in UNIPROT) is a
       string of letters and digits that is understandable in some meaningful
       way by a human.
     Identifiers are not as stable as accession numbers, mainly because
     curators modify them if the presumed function of the protein is found
     to be something else.
     UNIPROT: B5YME7_THAPS
     GenBank: XM_002295694
     An identifier can change. For example, the database curators may decide
     that the identifier for an entry is no longer appropriate. This happens
     only rarely.
18
      Accession code (number)
      An accession code (or number) is a number (with a few
      characters in front) that uniquely identifies an entry. It is often
      assigned arbitrarily. For example, the accession code for
      B5YME7_THAPS in UNIPROT is B5YME7.
      In the case of GenBank, the accession code for the human
      BRCA2 gene sequence is XM_002295694.
19
       Versions and GI Numbers
     In 1992, NCBI began assigning a unique number to each submitted
     sequence: the GenInfo Identifier (GI) number. The same accession number
     may be associated with a different GI if a newer or corrected sequence is
     submitted.
     Records typically contain the Accession.Version identifier, such as
     XM_002295694.1, in the VERSION line of the record. This identifier is
     mapped to its unique corresponding GI number, which is the “primary key”
     of GenBank.
     To specify a sequence exactly in GenBank, use either its GI or
     Accession.Version. To retrieve the most up-to-date sequence, use the
     accession number without version.
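     The Accession.Version form described above can be taken apart
     with a few lines of Python. This is a minimal sketch; the helper
     name split_accession_version is hypothetical and it only handles
     the accession[.version] pattern shown on this slide.

```python
def split_accession_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is returned as an int, or None when no version is given.
    """
    accession, dot, version = identifier.partition(".")
    if not dot:
        return accession, None
    if not version.isdigit():
        raise ValueError(f"version must be an integer: {identifier!r}")
    return accession, int(version)


print(split_accession_version("XM_002295694.1"))  # ('XM_002295694', 1)
print(split_accession_version("XM_002295694"))    # ('XM_002295694', None)
```

     The first form pins one exact sequence; the second leaves the
     version open, which corresponds to asking for the latest one.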
20
          GenBank Flatfile Format (GBFF)
      The GenBank flatfile format (GBFF) describes a nucleotide sequence record, for example that of a specific
        gene. It contains all of the information associated with the sequence, as well as the
        sequence itself.
        The GBFF has 3 parts: the header, the features, and the sequence itself.
        [Figure: the first line of a record, annotated with identifier, length, source type, taxonomic group and NCBI entry date]
23
     GenBank flatfile format - Header
     DEFINITION: The biology of the molecule in a sentence.
     ACCESSION: Accession code(s)
     VERSION: Accession.Version number; GI number
     KEYWORDS: Keywords as defined by the submitters
24
     SOURCE: Contains the organism name.
     ORGANISM: Contains complete taxonomic information from the NCBI taxonomy server.
     REFERENCE: Details of a publication about the sequence.
     COMMENT: Contains miscellaneous information and revision details.
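     The header fields can be pulled out of a flatfile with a small
     column-based parser. This is a sketch under stated assumptions:
     it relies on the GenBank layout in which the keyword occupies the
     first 12 columns and continuation lines leave them blank, it
     keeps only the last value of a repeated keyword such as
     REFERENCE, and the sample record is invented purely for
     illustration.

```python
# Invented sample header; none of these values come from a real record.
SAMPLE_HEADER = """\
LOCUS       EXAMPLE01                 24 bp    mRNA    linear   PLN 01-JAN-2000
DEFINITION  Hypothetical example record, invented for illustration
            only.
ACCESSION   EXAMPLE01
VERSION     EXAMPLE01.1
KEYWORDS    .
SOURCE      Example organism
  ORGANISM  Example organism
            Placeholder; lineage; terms.
FEATURES             Location/Qualifiers
"""


def parse_header(record_text):
    """Collect header fields (everything before FEATURES) into a dict."""
    fields = {}
    current = None
    for line in record_text.splitlines():
        if line.startswith("FEATURES"):
            break  # the header ends where the feature table begins
        key, value = line[:12].strip(), line[12:].strip()
        if key:
            current = key
            fields[current] = value
        elif current is not None and value:
            # Blank keyword columns mean the previous field continues.
            fields[current] += " " + value
    return fields


header = parse_header(SAMPLE_HEADER)
print(header["DEFINITION"])
print(header["VERSION"])
print(header["ORGANISM"])
```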
25
     GenBank Flatfile Format – Features
     A direct representation of the biological information in the
     record.
       The Source feature must be present in all GenBank records. It
       contains information about where the molecule comes from
       (/organism="Homo sapiens") and, potentially, map, chromosome
       and tissue type information.
        In some records the CDS (coding sequence) feature is present:
26
     GenBank Flatfile Format – Sequence
      The last part of the GenBank flat file record is the sequence
       itself, which follows the ORIGIN keyword and ends with a // line.
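      Reading the sequence part back into a plain string is mostly a
      matter of dropping the position numbers and spaces. The sketch
      below assumes the usual layout, in which the sequence lines sit
      between an ORIGIN line and a terminating // line; the sample
      bases are invented for illustration.

```python
# Invented sample; the bases do not come from a real record.
SAMPLE_SEQUENCE = """\
ORIGIN
        1 atggcgtacg ttagcctagg
       21 ctaa
//
"""


def extract_sequence(record_text):
    """Return the bases between ORIGIN and // as one lowercase string."""
    in_origin = False
    parts = []
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break  # end-of-record marker
        elif in_origin:
            # Skip the leading position number, keep the blocks of bases.
            parts.extend(tok for tok in line.split() if not tok.isdigit())
    return "".join(parts).lower()


sequence = extract_sequence(SAMPLE_SEQUENCE)
print(sequence)       # atggcgtacgttagcctaggctaa
print(len(sequence))  # 24
```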
28
     Nucleotide Databases – Growth of GenBank
      from http://www.ncbi.nlm.nih.gov/genbank/statistics
29
     Other facilities in NCBI database
30
     Disease details
31
     Gene Details
32
     Gene expression details….
33