Introduction to
Bioinformatics
          &
Biological Databases
     Ishtiaq Ahmad
    IIB GCU Lahore
      What Is Bioinformatics?
• Bioinformatics is the unified discipline formed
  from the combination of biology, computer
  science, and information technology.
• "The mathematical, statistical and computing
  methods that aim to solve biological problems
  using DNA and amino acid sequences and
  related information." – Fredj Tekaia
         A Molecular Alphabet
• Macromolecules are polymers of monomers
• All monomers belong to the same general class,
  but there are several types with distinct and well-
  defined characteristics
• Many monomers can be joined to form a single,
  large macromolecule; the ordering of monomers
  in the macromolecule encodes information, just
  like the letters of an alphabet
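The alphabet analogy can be made concrete with a short Python sketch. The two alphabets are the standard one-letter codes for the DNA bases and the 20 common amino acids; the function name uses_alphabet is an illustrative assumption, not part of any library.

```python
# One-letter codes for the monomers of two kinds of macromolecule.
DNA_ALPHABET = set("ACGT")                      # four nucleotide bases
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # 20 common amino acids


def uses_alphabet(sequence: str, alphabet: set) -> bool:
    """Return True if every monomer in the sequence belongs to the alphabet."""
    return all(monomer in alphabet for monomer in sequence.upper())


print(uses_alphabet("GATTACA", DNA_ALPHABET))   # True
print(uses_alphabet("MKVLAW", DNA_ALPHABET))    # False

# Same monomers, different order: a different "word" with different information.
print("GATTACA" == "ACATTAG")                   # False
```

The last line is the point of the analogy: the information lies in the order of the monomers, not just in which monomers are present.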
        Other related Fields:
       Computational Biology
• The study and application of computing
  methods for biological data
• Primarily concerned with the computation
  of data related to evolutionary, population
  and theoretical biology aspects.
          Related Fields:
         Medical Informatics
• The study and application of computing
  methods to improve communication,
  understanding, and management of
  medical data
• Generally concerned with how the data is
  manipulated rather than the data itself.
            Related Fields:
           Cheminformatics
• The study and application of computing
  methods, along with chemical and
  biological technology, for drug design and
  development
            Related Fields:
              Genomics
• Analysis and comparison of the entire
  genome of a single species or of multiple
  species
• A genome is the set of all genes
  possessed by an organism
• Genomics existed before any genomes
  were completely sequenced, but in a very
  primitive state
            Related Fields:
             Proteomics
• Study of how the genome is expressed in
  proteins, and of how these proteins
  function and interact
• Concerned with the actual states of
  specific cells, rather than the potential
  states described by the genome
           Related Fields:
         Pharmacogenomics
• The application of genomic methods to
  identify drug targets
• For example, searching entire genomes
  for potential drug receptors, or studying
  gene expression patterns in tumors
           Related Fields:
          Pharmacogenetics
• The use of genomic methods to determine
  what causes variations in individual
  response to drug treatments
• The goal is to identify drugs that may
  only be effective for subsets of patients, or
  to tailor drugs for specific individuals or
  groups
    History of Bioinformatics
• Genetics
• Computers and Computer Science
• Bioinformatics
        History of Genetics
• Gregor Mendel
• Chromosomes
• DNA
      History of Chromosomes
•   Walter Flemming
•   August Weismann
•   Theodor Boveri
•   Walter S. Sutton
•   Thomas Hunt Morgan
History of Computers
                           Computer Timeline
•   ~1000BC The abacus
•   1621 The slide rule invented
•   1625 Wilhelm Schickard's mechanical calculator
•   1822 Charles Babbage's Difference Engine
•   1926 First patent for a semiconductor transistor
•   1937 Alan Turing invents the Turing Machine
•   1939 Atanasoff-Berry Computer created at Iowa State
     – the world's first electronic digital computer
•   1939 to 1944 Howard Aiken's Harvard Mark I (the IBM ASCC)
•   1940 Konrad Zuse's Z2 uses telephone relays instead of mechanical logic
    circuits
•   1943 Colossus - British vacuum tube computer
•   1944 Grace Hopper, Mark I Programmer (Harvard Mark I)
•   1945 First Computer "Bug", Vannevar Bush "As we may think"
                        Computer Timeline (cont.)
•   1948 to 1951 The first commercial computer – UNIVAC
•   1952 G.W.A. Dummer conceives integrated circuits
•   1954 FORTRAN language developed by John Backus (IBM)
•   1955 First disk storage (IBM)
•   1958 First integrated circuit
•   1963 Mouse invented by Douglas Engelbart
•   1964 BASIC (Beginner's All-purpose Symbolic Instruction Code) written at Dartmouth
    College by mathematicians John George Kemeny and Tom Kurtz as a teaching tool for undergraduates
•   1969 UNIX OS developed by Kenneth Thompson
•   1970 First static and dynamic RAMs
•   1971 First microprocessor: the 4004
•   1972 C language created by Dennis Ritchie
•   1975 Microsoft founded by Bill Gates and Paul Allen
•   1976 Apple I and Apple II microcomputers released
•   1981 First IBM PC with DOS
•   1985 Microsoft Windows introduced
•   1985 C++ language introduced
•   1992 Pentium processor
•   1993 First PDA
•   1994 JAVA introduced by James Gosling
•   2000 C# language introduced
          Putting it all Together
• Bioinformatics is where findings in genetics and
  advances in computing meet: computers help
  advance genetics.
• Depending on the definition of bioinformatics used, or
  the source, it is anywhere from 30 to 55 years
  old
   – Bioinformatics-like studies were being performed in
     the ’60s, long before the field was given a name
      • Sometimes called “molecular evolution”
   – The term bioinformatics was first published in 1991
              Genomics
• Classic Genomics
• Post Genomic era
  – Comparative Genomics
  – Functional Genomics
  – Structural Genomics
         What is Genomics?
• Genome
  – complete set of genetic instructions for
    making an organism
• Genomics
  – any attempt to analyze or compare the entire
    genetic complement of a species
  – Early genomics was mostly recording genome
    sequences
             History of Genomics
• 1995
   – Haemophilus influenzae genome sequenced (flu bacteria, 1.8 Mb)
• 1996
   – Saccharomyces cerevisiae (baker's yeast, 12.1 Mb)
• 1997
   – E. coli (4.7 Mb)
• 2000
   – Pseudomonas aeruginosa (6.3 Mb)
   – A. thaliana genome (100 Mb)
   – D. melanogaster genome (180 Mb)
           2001 The Big One
• The Human Genome sequence is
  published
  – 3 Gb
             What next?
• Post Genomic era
  – Comparative Genomics
  – Functional Genomics
  – Structural Genomics
       Comparative Genomics
• The management and analysis of the
  millions of data points that result from
  genomics
   – Sorting out the mess
Comparative genomics involves the management
and analysis of vast amounts of data resulting from
genomics studies.
         Functional Genomics
• Other, more direct, large-scale ways of
  identifying gene functions and
  associations
  – For example, yeast two-hybrid methods
Functional genomics aims to directly identify the
functions and associations of genes within a
genome. It involves large-scale methods for
studying gene functions, interactions, and
regulatory mechanisms.
         Structural Genomics
• Emphasizes high-throughput, whole-genome
  analysis.
  – Current state and future plans of
    structural genomics efforts around the world
  – Possible benefits of this research
 Structural genomics emphasizes high-throughput
 analysis of the 3D structures of biomolecules, such
 as proteins and nucleic acids, at a genome-wide
 scale. It seeks to determine the structures of all the
 proteins encoded by an organism's genome.
       What Is Proteomics?
• Proteomics is the study of the proteome—
  the “PROTEin complement of the
  genOME”
• More specifically, "the qualitative and
  quantitative comparison of proteomes
  under different conditions to further
  unravel biological processes"
      What Makes Proteomics
            Important?
• A cell’s DNA—its genome—describes a
  blueprint for the cell’s potential, all the
  possible forms that it could conceivably
  take. It does not describe the cell’s actual,
  current form, in the same way that the
  source code of a computer program does
  not tell us what input a particular user is
  currently giving his copy of that program.
      What Makes Proteomics
            Important?
• All cells in an organism contain the same DNA.
• This DNA encodes every possible cell type in
  that organism—muscle, bone, nerve, skin, etc.
• If we want to know about the type and state of a
  particular cell, the DNA does not help us, in the
  same way that knowing what language a
  computer program was written in tells us nothing
  about what the program does.
        Biological Databases
• Biological databases are collections of
  biological data, organized and annotated in a
  form that can be reused for research.
• The data can come from highly sophisticated
  experiments, published literature, or
  computational analyses related to taxonomy,
  phylogeny, genomics, proteomics, microarray
  gene expression, etc.
Basic Components of Biological
    Database Architecture
 Biological database design, development
 and management are basic areas of
 bioinformatics. They require the following:
• Relational database management system
  (RDBMS) programs from computer science.
• Information retrieval systems from digital
  libraries.
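The relational (RDBMS) component can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the schema of any real biological database: the table names, column names, and the record DEMO0001 are all made up for the example.

```python
import sqlite3

# In-memory database; every table and column name here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organism (
        organism_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE sequence (
        accession   TEXT PRIMARY KEY,
        organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
        molecule    TEXT NOT NULL,   -- 'DNA', 'RNA' or 'protein'
        residues    TEXT NOT NULL
    );
""")

# One made-up record, linked to its organism by a foreign key.
conn.execute("INSERT INTO organism VALUES (1, 'Escherichia coli')")
conn.execute("INSERT INTO sequence VALUES ('DEMO0001', 1, 'DNA', 'ATGGCGTAA')")


def sequences_for(organism_name):
    """Retrieve all sequence records for one organism by joining the two tables."""
    rows = conn.execute(
        "SELECT s.accession, s.molecule, s.residues "
        "FROM sequence AS s JOIN organism AS o USING (organism_id) "
        "WHERE o.name = ?",
        (organism_name,),
    )
    return rows.fetchall()


print(sequences_for("Escherichia coli"))
```

Keeping organisms and sequences in separate tables, joined by a key, is what makes the design relational: each fact is stored once and retrieved by query.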
Information in Biological Databases
    The information contained in different
    biological databases may be
•   A gene or protein sequence (SwissProt,
    GenBank, etc.)
•   Descriptions in text form.
•   Ontological classification
•   Citation record
•   Tables
        Data Formats of Biological
               Databases
    The majority contain semi-structured
    data in the form of text descriptions
•   Tabular data.
•   Tab- or space-delimited data records.
•   XML (Extensible Markup Language) data format.
•   Cross-references to other databases.
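As a rough illustration of two of these formats, the sketch below reads a tab-delimited record and an XML entry with Python's standard library. The sample data, field names, and tag names (entry, organism, dbReference) are assumptions made for the example and do not follow any particular database's real format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample data; field and tag names are assumptions.
TAB_TEXT = "accession\torganism\tlength\nDEMO0001\tEscherichia coli\t9\n"
XML_TEXT = """
<entry accession="DEMO0001">
  <organism>Escherichia coli</organism>
  <dbReference db="OtherDB" id="X123"/>
</entry>
"""


def parse_tab(text):
    """Read tab-delimited records into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def parse_xml(text):
    """Pull the accession, organism and cross-references out of one XML entry."""
    entry = ET.fromstring(text.strip())
    return {
        "accession": entry.get("accession"),
        "organism": entry.findtext("organism"),
        # Each dbReference element points at a record in another database.
        "cross_refs": [(ref.get("db"), ref.get("id"))
                       for ref in entry.findall("dbReference")],
    }


print(parse_tab(TAB_TEXT))
print(parse_xml(XML_TEXT))
```

Note that the tab-delimited reader returns every field as text, so a numeric column such as length still has to be converted by the caller.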
 Primary Sequence Databases
• Genome sequence
  - Nucleotide sequence of gene(s)
  - DNA and RNA
• Proteome sequence
  - Amino acid sequence of proteins
    expressed or derived from the gene
    sequences
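How an amino acid sequence is derived from a gene sequence can be sketched as follows. The codon table is deliberately a small subset of the standard genetic code, chosen for the example; a real tool would carry all 64 codons. The function name translate and the use of "X" for codons missing from the subset are assumptions of this sketch.

```python
# A small subset of the standard genetic code (DNA codons).
CODON_TABLE = {
    "ATG": "M",  # methionine, also the usual start codon
    "GCG": "A",  # alanine
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TTT": "F",  # phenylalanine
    "TAA": "*", "TAG": "*", "TGA": "*",  # stop codons
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence, reading three bases at a time."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3].upper()
        amino_acid = CODON_TABLE.get(codon, "X")  # X: codon not in this subset
        if amino_acid == "*":                     # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCGAAATGGTAA"))  # MAKW
```

Any trailing bases that do not fill a complete codon are ignored.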
       Genome Databases
• Collect, organize, annotate, analyze and
  manage the whole genome sequences of
  one or more organisms.
• Examples: Corn, a database of the maize genome;
  Ensembl, a database of human, mouse, other
  vertebrate and eukaryotic genomes
• These databases are publicly accessible
      Important Genome Databases
•   Corn: Maize genome www.maizegdb.org
•   ERIC (Enteropathogen Resource Integration Center): Enteropathogen genomes www.ericbrc.org
•   National Microbial Pathogen Data Resource www.nmpdr.org
•   JGI (Joint Genome Institute) Genomes: Eukaryotic and microbial genomes
    http://genome.jgi.doe.gov/
•   MGI (Mouse Genome Informatics): Mouse genome www.informatics.jax.org
•   WormBase: C. elegans genome
•   FlyBase: Genome of the fruit fly
•   Saccharomyces Genome Database: Genome of the yeast model organism
•   Ensembl: Human, mouse, other vertebrate and eukaryotic genome
    database www.ensembl.org
•   TAIR (The Arabidopsis Information Resource): Arabidopsis http://arabidopsis.org
Nucleotide (Gene) Sequence Databases
 • DDBJ: DNA Data Bank of Japan
   http://www.ddbj.nig.ac.jp/Welcome-e.html
 • EMBL Nucleotide DB: European Molecular Biology
   Laboratory http://www.ebi.ac.uk/embl/index.html
 • GenBank: National Center for Biotechnology Information (NCBI)
   www.ncbi.nlm.nih.gov/genbank
Protein Sequence Databases
Protein sequence databases store the
sequence of each protein together with
annotations that give general and specific
details about its properties and features.
List of Protein Sequence Databases
• UniProt: http://www.ebi.ac.uk/, http://expasy.org
• PIR: http://www-nbrf.georgetown.edu/pir/searchdb.html
• SwissProt: http://expasy.org
• PROSITE: Database of Protein Families and Domains
  www.expasy.org/prosite
• DIP: Database of Interacting Proteins sequences and
  structures http://dip.doe-mbi.ucla.edu/
• Pfam: Protein families database of alignments and
  HMMs http://www.sanger.ac.uk/Software/Pfam
• ProDom: Comprehensive set of Protein Domain Families
  http://protein.toulouse.inra.fr/prodom/current/html/home.php
   Protein Structure Databases
• Protein Data Bank (PDB) www.rcsb.org
• CATH (Class, Architecture, Topology,
  Homologous super-family): Protein structure
  classification www.cathdb.info
• SCOP: Structural Classification of Proteins
  http://scop.mrc-lmb.cam.ac.uk/scop/
• PDBe: www.ebi.ac.uk/pdbe/
• SWISS-MODEL: A server and collection of
  protein structures from PDB acting as templates
  http://swissmodel.expasy.org//SWISS-MODEL.html
• ModBase: A database of comparative structure
  models of proteins http://salilab.org/modbase
Protein-Protein Interaction Databases
• STRING: A database of experimental &
  predicted protein-protein interactions
  http://string.embl.de/
• DIP: Database of Interacting Proteins
  sequences and structures
  http://dip.doe-mbi.ucla.edu/
• BIND: Biomolecular Interaction Network
  Database www.bind.ca
 Metabolic Pathway Databases
• BioCyc: A collection of 3563
  Pathway/Genome Databases
  with tools for understanding their
  data http://biocyc.org/
• KEGG: Kyoto Encyclopedia of Genes and Genomes
• MANET Molecular Ancestry Networks
• Reactome
Microarray-Gene Expression Databases
 •   ArrayExpress (EBI)
 •   Gene Expression Omnibus (NCBI)
 •   maxd (Univ. of Manchester)
 •   SMD (Stanford University)
 •   GPX (Scottish Centre for Genomic
     Technology and Informatics)
Mathematical Model Databases
• CellML: http://www.cellml.org/models
• BioModels: http://www.ebi.ac.uk/biomodels/
     PCR Primer Databases
• PathoOligoDB: A free QPCR oligo
  database for pathogens
        Meta-Databases
A meta-database is a platform that draws on
several source databases and presents
their data in a new, simpler and unified
form, or gathers the information about a
particular gene or protein together with its
implication in a specific disease, etc.
Entrez is one example of a meta-database.
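The idea of a unified view over several source databases can be sketched in a few lines of Python. The three dictionaries stand in for separate source databases and are entirely hypothetical, as are the gene symbols GENE1 and GENE2 and the function name build_unified_view. This is not how Entrez or any other real meta-database is implemented.

```python
from collections import defaultdict

# Hypothetical stand-ins for three source databases, each keyed by gene symbol.
SEQUENCE_DB = {"GENE1": {"length_bp": 1200}}
PATHWAY_DB = {"GENE1": {"pathways": ["glycolysis"]}}
DISEASE_DB = {"GENE1": {"diseases": ["example disorder"]},
              "GENE2": {"diseases": []}}


def build_unified_view(sources):
    """Merge per-source records into one entry per gene, tagged by source."""
    unified = defaultdict(dict)
    for source_name, records in sources.items():
        for gene, fields in records.items():
            unified[gene][source_name] = fields
    return dict(unified)


view = build_unified_view({
    "sequence": SEQUENCE_DB,
    "pathway": PATHWAY_DB,
    "disease": DISEASE_DB,
})
print(view["GENE1"])
```

Tagging each block of fields with its source name keeps the origin of every piece of information visible in the merged entry.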
       Major Meta Databases
•   Entrez
•   euGenes
•   GeneCards
•   SOURCE
•   mGen
•   Bioinformatic Harvester
•   MetaBase
Questions and Answers