Bls 211 (Bioinformatics) - 1
AND TECHNOLOGY,
DEPARTMENT OF BIOLOGICAL SCIENCES
COURSE CODE: BLS 211 (INTRODUCTION TO BIOINFORMATICS)
LECTURER IN CHARGE: MR. I.I AJAYI
COURSE OUTLINES
INTRODUCTION TO BASIC MOLECULAR BIOLOGY, GENOMICS,
BIOLOGICAL DATABASES, AND HIGH-THROUGHPUT DATA SOURCES.
Bioinformatics refers to the field concerned with the collection and storage of biological
information, while Computational Biology refers to the aspect of developing algorithms and
statistical models necessary to analyze biological data through the aid of computers.
Or
Bioinformatics is the development and application of computational tools in managing all kinds
of biological data, whereas computational biology is more confined to the theoretical
development of algorithms used for bioinformatics.
Aims of bioinformatics
To organize data in a way that allows researchers to access existing information and to submit
new entries as they are produced, e.g. the Protein Data Bank for 3D macromolecular structures.
To develop tools and resources that aid in the analysis of data, such as FASTA (8) and
PSI-BLAST.
To use these tools to analyze the data and interpret the results in a biologically meaningful
manner.
Research areas in bioinformatics
Genomics: Genomics is the study of an organism's genome. The term was coined by Thomas H.
Roderick in 1986.
Proteomics: Proteomics is the study of the proteome, the entire set of proteins expressed
in a cell.
Biological database: Biological databases are libraries of life sciences information, collected
from scientific experiments, published literature, high-throughput experiment technology, and
computational analyses. They contain information from research areas including genomics,
proteomics, metabolomics, microarray gene expression, and phylogenetics.
Biological Data Mining: Biological Data mining is the discovery of useful knowledge from
biological databases. Data mining employs algorithms and techniques from statistics, machine
learning, artificial intelligence, databases and data warehousing etc. Some of the most popular
tasks are classification, clustering, association and sequence analysis, and regression.
Systems biology: Systems biology studies biological systems by systematically perturbing them
(biologically, genetically, or chemically); monitoring the gene, protein, and informational
pathway responses; integrating these data; and ultimately, formulating mathematical models
that describe the structure of the system and its response to individual perturbations.
Biological databases can be broadly classified into sequence, structure and functional
databases.
Sequence databases store nucleic acid and protein sequences.
Structure databases store solved structures of RNA and proteins.
Functional databases provide information on the physiological role of gene products,
for example enzyme activities, mutant phenotypes, or biological pathways.
Applications of bioinformatics
Areas of sequence analysis:
i. Sequence alignment
ii. Sequence database searching
iii. Motif and pattern discovery
iv. Gene and promoter finding
v. Reconstruction of evolutionary relationships
vi. Genome assembly and comparison
Areas of structure analysis:
vii. Protein and nucleic acid structure analysis
viii. Structure comparison, classification, and prediction
Areas of functional analysis:
ix. Gene expression profiling
x. Protein–protein interaction prediction
xi. Protein subcellular localization prediction
xii. Metabolic pathway reconstruction and simulation
Major classes of biological databases
i. Primary databases
ii. Secondary databases
iii. Specialized databases.
Primary databases contain original biological data. They are archives of raw sequence or
structural data submitted by the scientific community. ENA, GenBank, DDBJ and the Protein
Data Bank (PDB) are examples of primary databases.
Specialized databases are those that cater to a particular research interest. For example,
FlyBase, the HIV Sequence Database, and the Ribosomal Database Project are databases that
specialize in a particular organism or a particular type of data.
Databases act as a storehouse of biological information.
I. Biological databases are used to store and organize data in such a way that
information can be retrieved easily via a variety of search criteria.
II. They allow knowledge discovery, which refers to the identification of connections
between pieces of information that were not known when the information was first
entered. This facilitates the discovery of new biological insights from raw data.
III. Secondary databases have become the molecular biologist’s reference library over
the past decade or so, providing a wealth of information on just about any gene or
gene product that has been investigated by the research community.
IV. They help to handle cases where many users want to access the same entries of data.
V. They allow the indexing of data.
VI. They help to remove redundancy of data.
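As a minimal sketch of indexing and redundancy removal, the Python snippet below indexes a few sequence records by accession and groups records that share an identical sequence. The record layout, the accession strings and the function names `build_index` and `non_redundant` are illustrative assumptions and do not correspond to the interface of any real database.

```python
from collections import defaultdict


def build_index(records):
    """Index (accession, sequence) records and group identical sequences.

    Returns two dictionaries:
      by_accession: accession -> sequence (fast lookup by identifier)
      by_sequence:  sequence  -> list of accessions sharing that sequence
    """
    by_accession = {}
    by_sequence = defaultdict(list)
    for accession, sequence in records:
        sequence = sequence.upper()  # treat upper and lower case as the same
        by_accession[accession] = sequence
        by_sequence[sequence].append(accession)
    return by_accession, dict(by_sequence)


def non_redundant(by_sequence):
    """Keep one representative accession per distinct sequence."""
    return {accessions[0]: seq for seq, accessions in by_sequence.items()}


# Hypothetical toy records; the accessions are made up for illustration.
records = [
    ("SEQ001", "ATGGCCATTG"),
    ("SEQ002", "atggccattg"),  # same sequence as SEQ001, different case
    ("SEQ003", "ATGTTTAAAC"),
]
by_accession, by_sequence = build_index(records)
print(by_accession["SEQ003"])      # retrieval by accession
print(non_redundant(by_sequence))  # one entry per distinct sequence
```

Real databases use persistent indexes and more careful redundancy criteria (for example, similarity thresholds rather than exact identity); the dictionaries here only illustrate the idea.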
SELECTED TOPICS IN BIOINFORMATICS
Bioinformatics is driven by the outpouring of massive genomics data. The best place to find
updated online resources is the two special issues published annually by Nucleic Acids
Research: the January issue is on databases and the July issue is on Web servers of online
analysis tools. The Bioinformatics Links Directory is a good entry point with curated links to
most such molecular resources, tools and databases. The NCBI nucleotide database is
the best testimony to the ongoing genomic revolution. NCBI, EBI and DDBJ are the
3 major information databases in the world; they exchange their sequence data on a daily basis
to ensure that the basic sequence information stored in their ‘primary databases’ is equivalent.
PDB is the biggest repository for 3D bio-macromolecule structure data. In addition to basic
sequence information, there are web resources for the ever-changing gene nomenclature
(HUGO Gene Symbol Database, GO and GeneCards), for metabolic network information (KEGG),
and for promoters and transcription factors (EPD and TRANSFAC, respectively).
Sequence analyses
There are many algorithms in sequence analyses. They may be grouped by molecule type (DNA,
RNA or protein) or by the number of sequences analyzed (single, pairwise or multiple). Here
we only give some typical and yet important examples in each category.
Microarrays typically contain thousands of ‘spots’, each holding many copies of a different
probe molecule. Molecules that were taken from a sample of interest (say tumor cells) and are
capable of hybridizing to the probe molecules (cDNA or RNA) are marked with a fluorescent
marker and ‘washed over’ the microarray. Hence, their relative abundance in the sample can
be inferred from the luminescence of the spot. Since abundance is relative, a control sample
with a contrasting fluorophore is invariably used. By choosing the set of probes, one can
assemble a microarray to measure a variety of genetic patterns. The uses of microarrays have
expanded from gene-expression profiling (which genes are over/under-expressed in the sample
relative to the control) and now include comparing whole genomes (e.g. normal vs. tumor),
identifying alternate RNA transcripts, locating genes that have been methylated (turned off by
having had methyl groups attached), detecting protein modifications and interactions, and
many forms of genotyping. New types of microarrays are still appearing.
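The comparison of sample and control fluorescence can be sketched as a log2 ratio per spot. The Python snippet below is a minimal illustration only: the intensity values are made up, the intensity floor and the two-fold threshold are common rules of thumb assumed here rather than values taken from these notes, and real analyses add background correction and normalization.

```python
import math


def log2_ratios(sample, control, floor=1.0):
    """Return log2(sample / control) for each spot.

    Intensities below `floor` are raised to `floor` so that the ratio and
    its logarithm stay defined; the floor value is an arbitrary assumption.
    """
    if len(sample) != len(control):
        raise ValueError("sample and control must have the same number of spots")
    return [
        math.log2(max(s, floor) / max(c, floor))
        for s, c in zip(sample, control)
    ]


def call_expression(ratio, threshold=1.0):
    """Label a spot with a two-fold rule (|log2 ratio| >= 1)."""
    if ratio >= threshold:
        return "over-expressed"
    if ratio <= -threshold:
        return "under-expressed"
    return "unchanged"


# Hypothetical intensities for four spots (sample channel, control channel).
sample = [800.0, 200.0, 410.0, 50.0]
control = [200.0, 800.0, 400.0, 50.0]
for ratio in log2_ratios(sample, control):
    print(round(ratio, 2), call_expression(ratio))
```

Working with log ratios makes over- and under-expression symmetric: a four-fold increase gives +2 and a four-fold decrease gives -2.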
Systems biology
Biologists have elucidated the complete gene sequences of several model organisms
and provided a general understanding of the molecular machinery involved in gene
expression. The next logical step is to understand how all the components interact with
each other in order to model complex biological systems. It is envisioned that only with
this ‘systems view’ will we improve the accuracy of our diagnostic and therapeutic
endeavors. The field of systems biology emerged at the turn of this century and aims to
merge our piecemeal knowledge into comprehensive models of the whole dynamic of
these systems. The challenge is daunting; considering the potential of serum
proteomics, Weston and Hood warn: “In addition to the immense repertoire of proteins
present, the dynamic range of these proteins is on the order of 10^9, with serum albumin
being most abundant (30-50 mg/mL) and low-level proteins such as interleukin-6
present at 0-5 pg/mL... Identifying proteins at each end of this spectrum in a single
experiment is not feasible with current technologies.” “Further complicating the study
of the human plasma proteome are temporal and spatial dynamics. The turnover of
some proteins is several fold faster than others, and the protein content of the arteries
may differ substantially from that of the veins, or the capillary proteome may be specific
to its location, etc.”
The goal of gene and protein networks research is to quantitatively
understand how different genes and their regulating proteins are grouped together in
genetic circuits, and how stochastic fluctuations influence gene expression in these
complex systems. For example, Thattai and van Oudenaarden focus on the importance
of noise in the expression of genes by using both experimental and theoretical
approaches. They investigated the bistability that arises from a positive feedback loop
in the lactose utilization network of E. coli. In its simplest form, the network may be
modeled as a single positive feedback loop: lactose uptake induces the synthesis of
lactose permease, which in turn promotes the further uptake of lactose. Because of this
bistability, the response of a single cell to an external inducer depends on whether the
cell had been induced recently, a phenomenon known as hysteresis. The question is
how the gene network architecture helps cells remember their history for more than 100
cell generations.
The field is still new, but the reader may find tutorials and pointers to
emerging modeling efforts at the web sites of three major systems biology
organizations: Europe, USA and Japan.
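Bistability and hysteresis in a single positive feedback loop can be illustrated with a toy model. The Python sketch below is not the model used by Thattai and van Oudenaarden; it is a generic one-variable equation, dx/dt = inducer + x^2/(1 + x^2) - decay*x, with arbitrary parameter values chosen only so that two stable states exist. The names `rate` and `steady_state` are hypothetical.

```python
def rate(x, inducer, decay=0.4):
    """dx/dt for a toy positive feedback loop.

    x       : permease level (arbitrary units)
    inducer : constant production driven by the external inducer
    The x*x / (1 + x*x) term is the self-reinforcing feedback.
    """
    return inducer + x * x / (1.0 + x * x) - decay * x


def steady_state(x0, inducer, dt=0.01, steps=50000):
    """Integrate dx/dt with the forward Euler method and return the final x."""
    x = x0
    for _ in range(steps):
        x += dt * rate(x, inducer)
    return x


# Sweep the inducer up and then back down, carrying the cell state along.
state = 0.0
history = []
for inducer in [0.02, 0.10, 0.02]:
    state = steady_state(state, inducer)
    history.append((inducer, round(state, 2)))
print(history)
```

The first and last steps use the same inducer level, yet the cell settles in a low state before induction and stays in a high state afterwards. This dependence on history is the hysteresis described above.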
Clinical genotyping
A long-standing dream has been the possibility that predisposition to disease and therapy
response may be predictable from a person’s genome. The well-known link between mutations
in the BRCA1/2 genes and breast cancer predisposition [48], and the more recent link between a
mutation in the EGFR gene and response to the drug Iressa [56], are just two examples of the
results that have encouraged the enthusiasm. The approach proposed is the ‘association study’,
in which the genomes are sequenced from a group of people known to be in a phenotypic
group (e.g. prone to disease, responsive to therapy) and from those not in this group. The
strength of association between a proposed genetic pattern and the phenotypic trait is measured
by a simple chi-squared-type statistic. Operational questions that need to be addressed to
advance this paradigm include: how are gene sequences measured, how are candidate genetic
patterns selected for testing, what statistical safeguards are needed to minimize false positives
and negatives, and how to obtain biological validation. Sequencing an entire genome is both
expensive [66] and largely unnecessary, because 99.9% of the human genome is common
to us all.
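The chi-squared-type measure of association can be sketched for the simplest case, a 2x2 table of cases and controls with and without a genetic pattern. The Python snippet below computes the Pearson chi-squared statistic without continuity correction; the counts are hypothetical and the function name `chi_squared_2x2` is an assumption made for illustration.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (1 degree of freedom, no continuity
    correction) for the 2x2 table

                     pattern present   pattern absent
        cases               a                 b
        controls            c                 d
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain at least one count")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts: 30 of 100 cases and 10 of 100 controls carry the pattern.
statistic = chi_squared_2x2(30, 70, 10, 90)
print(statistic)

# 3.841 is the standard 5% critical value of chi-squared with 1 degree of freedom.
print("associated at the 5% level:", statistic > 3.841)
```

When many candidate patterns are tested, a single 5% threshold produces many false positives, which is one reason the statistical safeguards mentioned above are needed.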
Hence, interest has focused on Single Nucleotide Polymorphisms (SNPs), or differences of one
nucleotide at one locus. Accumulating knowledge of SNPs in the human genome is available
from the NCBI (dbSNP [19]), which currently contains over ten million reference SNPs, about
half validated. The level of interest may also be gauged by observing that a recent study lists 30
companies with SNP-technology offerings. With so many SNPs, there appear to be too many
potential genetic variants to make genome-wide association studies practical. Schemes for
reducing the numbers of candidates include focusing only on SNPs in protein coding regions
and on ‘non-synonymous’ SNPs (i.e. SNPs that alter the amino acid). Such approaches depend
on the common variant/common-disease (CVCD) hypothesis. Additional help for this paradigm
may come as knowledge accumulates about which non-synonymous SNPs are most likely to
produce deleterious protein alterations, and as algorithms that exploit this knowledge are
developed. This is referred to as the ‘direct’ approach and is expected to yield results for
single-gene disorders.
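The distinction between synonymous and non-synonymous SNPs can be sketched by translating a codon before and after a single-base change. The Python snippet below uses the standard genetic code; the function name `classify_snp` and the example codons are illustrative assumptions, and the sketch ignores reading-frame detection, strand and splice sites.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed for codons in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position, new_base):
    """Classify a single-base change inside a codon.

    codon    : reference codon, e.g. "GAG"
    position : 0, 1 or 2, the position within the codon that changes
    new_base : the variant base
    Returns (kind, amino acid before, amino acid after).
    """
    codon = codon.upper()
    new_base = new_base.upper()
    variant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[variant]
    kind = "synonymous" if before == after else "non-synonymous"
    return kind, before, after


print(classify_snp("GAA", 2, "G"))  # GAA -> GAG, both encode glutamate (E)
print(classify_snp("GAG", 1, "T"))  # GAG -> GTG, glutamate (E) to valine (V)
```

A filter of this kind keeps only the changes labelled non-synonymous as candidates for association testing.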
An ‘indirect’ approach involves defining haplotypes.
These are sets of SNPs at different loci located in close proximity on the same chromosome;
they tend to be inherited as a unit. That is, they exhibit ‘linkage disequilibrium’ (LD). Hence,
the haplotype and not the individual SNPs is proposed as the effective unit of genotype
characterization, greatly reducing the combinatorics. Identifying haplotypes poses
experimental and bioinformatic challenges. Some propose family studies as a way to identify
haplotypes related to diseases and their LD, wherein parents and offspring in families with
disease prevalence are carefully studied. In contrast, population studies involve collecting
genotypes from a suitable sample from, say, different ethnic groups, and applying pattern
discovery algorithms to locate suitable haplotypes. The HapMap project is an international
collaborative project to collect data on about 270 individuals in five population groups and
information on about 600,000 SNPs, and to make it publicly available. Unsupervised learning
algorithms for inferring haplotypes include the following. The Clark algorithm begins with one
or more homozygous individuals (or individuals heterozygous at no more than one locus, a
requirement that is a problem for some datasets) and builds its initial haplotype set. It then
adds the heterozygous individuals and extends the set only as needed to cover them (a
parsimony criterion). Some genotypes may be left unassigned to haplotypes in some datasets.
Expectation-Maximization (EM) algorithms (e.g. Excoffier et al.) make an initial guess at
haplotype frequencies and iteratively converge (with reasonable probability) so that all
genotypes are assigned. EM algorithms can be computationally challenged by large datasets.
Bayesian approaches have been reported to perform better than the previous two classes, but
all these approaches may fail to exploit some genetic alterations.
Two additional bioinformatics challenges involving haplotypes are the search for haplotype
blocks (larger SNP regions that still may satisfy LD criteria) and the location of minimal sets
of SNPs that may serve to identify the different genotypes (called tagSNPs). Good haplotype
blocks would further reduce the combinatorics of genotype candidates that need to be
considered, and tagSNPs would reduce the amount of DNA that needs to be genotyped in new
individuals.
For a discussion of algorithms for tagSNP identification and the issues related to.
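The EM idea described above can be sketched for the smallest non-trivial case: two biallelic SNPs, where only the double heterozygote has an ambiguous haplotype pair. The Python snippet below is a minimal illustration under that assumption, not the implementation of any published program; the genotype coding (0, 1 or 2 copies of allele 1 at each locus), the uniform starting guess and the function name `em_haplotype_frequencies` are all assumptions.

```python
def em_haplotype_frequencies(genotypes, iterations=50):
    """Estimate two-locus haplotype frequencies with the EM algorithm.

    genotypes: list of (g1, g2), where each g is the number of copies
    (0, 1 or 2) of allele 1 at that locus. Haplotypes are tuples (a1, a2).
    Only the double heterozygote (1, 1) is ambiguous: it is either
    (0,0)+(1,1) or (0,1)+(1,0).
    """
    if not genotypes:
        raise ValueError("at least one genotype is required")
    haplotypes = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freq = {h: 0.25 for h in haplotypes}  # initial guess: uniform
    total = 2 * len(genotypes)            # two haplotypes per individual

    for _ in range(iterations):
        counts = {h: 0.0 for h in haplotypes}
        for g1, g2 in genotypes:
            if (g1, g2) == (1, 1):
                # E-step: split the ambiguous genotype between its two
                # possible haplotype pairs using the current frequencies.
                cis = freq[(0, 0)] * freq[(1, 1)]
                trans = freq[(0, 1)] * freq[(1, 0)]
                p_cis = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts[(0, 0)] += p_cis
                counts[(1, 1)] += p_cis
                counts[(0, 1)] += 1.0 - p_cis
                counts[(1, 0)] += 1.0 - p_cis
            else:
                # Unambiguous: at most one locus is heterozygous.
                first = (g1 // 2, g2 // 2)
                second = ((g1 + 1) // 2, (g2 + 1) // 2)
                counts[first] += 1.0
                counts[second] += 1.0
        # M-step: new frequencies are the expected haplotype proportions.
        freq = {h: counts[h] / total for h in haplotypes}
    return freq


# Hypothetical sample: 4 individuals (0,0), 4 individuals (2,2), 2 double hets.
genotypes = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
freq = em_haplotype_frequencies(genotypes)
print({h: round(f, 4) for h, f in freq.items()})
```

With more loci the number of possible haplotype pairs per genotype grows quickly, which is one reason EM approaches become computationally demanding on large datasets.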
For cDNA microarrays, methods to deal with spatial biases have been recently proposed. For
mass spectroscopy, normalization usually includes at least total ion current normalization to
correct for differences in overall spectrum intensity. More controversial is within-spectrum
normalization [60], wherein the selected measurements are linearly scaled to [0,1] in order to
preserve only the relative protein abundances. Another issue with MS data is the choice between
peak identification (which requires specifying a noise cutoff) and binning (merging adjacent
intensities to reflect machine precision). A major bioinformatics issue in this emerging field is
how to cope with datasets that are measurement-rich but case-poor. One traditional
approach is to reduce the number of measurements, either by filtering out those that fail
to meet some specified criterion of ‘signal’ (e.g. a signal-to-noise cutoff and/or a cutoff on the
likelihood that the measurement means differ between the two groups), or by using
principal component analysis (PCA). One difficulty with PCA is that results may be difficult
to interpret biologically. An alternative is sometimes called a ‘wrapper’ approach, in
which the space of possible measurement subsets is searched using some form of gradient
descent or evolutionary search algorithm, and the worth of any proposed subset is
evaluated by inducing a classifier and testing its classification accuracy. A risk with the
filtering approach is the possibility of missing patterns that include measurements that are not
strongly discriminating by themselves. The risk with the wrapper approaches is the possibility of
discovering patterns that overexploit chance variance in the small samples (overfitting). One
method strongly recommended to avoid overfitting is cross-validation. Unfortunately, the
scope for cross-validation is severely hampered by the small sample sizes. Michaels et al. have
shown how sensitive the discovered patterns are to the specific set of learning cases used.
Another issue involves whether or not to use correlated measurements in a classifier.
Arguments based upon Vapnik’s approach to structural risk minimization dictate the use of
the smallest measurement sets that do the job.
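The two normalization steps mentioned above can be sketched in a few lines of Python. The spectrum values are hypothetical and the function names `tic_normalize` and `scale_unit_interval` are assumptions; real pipelines typically apply baseline correction and peak handling before such steps.

```python
def tic_normalize(spectrum):
    """Divide every intensity by the total ion current (sum of intensities)."""
    total = sum(spectrum)
    if total <= 0:
        raise ValueError("total ion current must be positive")
    return [value / total for value in spectrum]


def scale_unit_interval(values):
    """Linearly rescale values to [0, 1]; a flat input maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# Hypothetical intensities from one spectrum.
spectrum = [10.0, 30.0, 40.0, 20.0]
print(tic_normalize(spectrum))        # intensities now sum to 1
print(scale_unit_interval(spectrum))  # smallest value -> 0, largest -> 1
```

Total ion current normalization makes spectra of different overall intensity comparable, whereas scaling to [0,1] discards absolute intensity and keeps only relative abundance within a spectrum, which is why the text calls it more controversial.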
The use of modeling in medicine has a long history. Prognostic models have been developed
from early ‘illness scores’ initially devised by experts to try to predict disease outcomes. Later
these models used regression methods that required increasing amounts of data. These models
may be considered ‘static.’ Dynamic models have been used in epidemiology for a long time
and in the modeling of physiological systems [52] such as the cardiovascular system. But there
is a new opportunity just emerging in this era of molecular medicine: the building of systems
biology models that capture the dynamics of disease at the molecular/cellular level and applying
them to medical diagnosis and/or prognosis. In distinction from the ‘static pattern recognition’
problem mentioned above, this approach is a ‘dynamic pattern recognition’ task. As such, it
requires a series of vectors of measurements taken across the time course of disease. Without
losing sight of the challenges already mentioned, Weston and Hood also opine that networks
have key nodal points where therapy/intervention can effectively be focused. While there are
no concrete clinical applications yet, the promise is clear.
Personalized medicine
The aim of personalized medicine is to find the right therapy for individual patients based on
their genotype, environment and lifestyle. A tantalizing example is the Iressa story [56]. The
drug works miraculously for about 10% of the patients with advanced non-small cell lung
cancer, those with a mutation of the epidermal growth factor receptor (EGFR) gene. This dream
obviously depends on the maturation of much that has been covered above. In a broad sense, it
includes development of genomics-based personalized medicines, predisposition testing,
preventive medicine, combination of diagnostics with therapeutics, and monitoring of therapy.
But an additional bioinformatics challenge, not mentioned above, will be Clinical Decision
Support Systems (CDSS) able to distill the voluminous and complex data into actionable clinical
recommendations, whether they are preventive, diagnostic, or therapeutic [41]. CDSS involves
linking two types of information: patient-specific and knowledge-based.
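The linking of patient-specific and knowledge-based information can be sketched as a simple lookup. The Python snippet below is a toy illustration only: the knowledge base has a single hypothetical entry paraphrasing the EGFR/Iressa example, the patient record is made up, and nothing here is clinical guidance or the design of any real CDSS.

```python
# Hypothetical knowledge base: (gene, finding) -> note for the clinician.
# The single entry paraphrases the EGFR/Iressa example in the text; it is
# illustrative only and not clinical guidance.
KNOWLEDGE_BASE = {
    ("EGFR", "mutated"): "Tumor may respond to Iressa; consider specialist review.",
}


def recommendations(patient):
    """Return knowledge-base notes that match a patient's genotype findings."""
    notes = []
    for gene, finding in sorted(patient.get("genotype", {}).items()):
        note = KNOWLEDGE_BASE.get((gene, finding))
        if note is not None:
            notes.append(f"{gene} {finding}: {note}")
    return notes


# Hypothetical patient record combining an identifier with genotype findings.
patient = {"id": "P-001", "genotype": {"EGFR": "mutated", "BRCA1": "wild-type"}}
for line in recommendations(patient):
    print(line)
```

A real system would also weigh diagnosis, history, environment and lifestyle, and would draw its rules from curated, versioned clinical knowledge rather than a hard-coded dictionary.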
Personal information related to the patient history is documented in patient records. Some
personal medical documents, which are already in use to various extents in different countries,
include the personal emergency card, the mother-child record, and the vaccination certificate.
A promising source of personal medical information is the data stored in the electronic patient
record combined with the genomic information from genotyping and from particular molecular
diagnostic tests. Molecular imaging enables visualization of cellular and molecular processes
that may be used to infer information about the genomic and proteomic profiles.
As a result, the bioinformatic analysis of genomic and proteomic profiles may be valuable to
assist the interpretation of images using molecular probes. Molecular diagnostics and
molecular imaging can provide the two aspects of the disease: molecular diagnostics can
provide the information of the exact mutation of a particular gene and classify the exact type
of cancer, while molecular imaging can target the very same type of cells with that particular
mutation in order to provide diagnostic information and disease staging.
Current methods in bioinformatics have been used for immediate impact in diseases that are at
the top of the killer list: heart disease and cancer. However, these technologies may also enable
non-invasive and inexpensive first indicators that a regular person is becoming a patient.
Nutritional genomics studies the genome-wide influences of nutrition, with far-reaching
potential in the prevention of nutrition-related disease. Nutrition is not like pharmacology or
toxicology, where the drug acts upon a single receptor/target and dose-related pathological
effects are induced, with correspondingly strong transcriptomic changes. Our daily food
consumption consists of complex mixtures of many possibly bioactive chemical compounds,
chronically administered in varying composition, and with a multitude of biological reactions
based on our genotype.
http://bioinformatics.ubc.ca/resources/links_directory/.
http://www.ncbi.nlm.nih.gov/.
http://www.ebi.ac.uk/.
http://www.ddbj.nig.ac.jp/.
http://www.rcsb.org/pdb/
http://www.gene.ucl.ac.uk/nomenclature/.
http://www.genome.ad.jp/kegg/.
http://www.epd.isb-sib.ch/.
http://www.gene-regulation.de/.
http://www.systemsbiology.org/.
http://nutrigenomics.ucdavis.edu/bioinformatics.htm.