Bls 211 (Bioinformatics) - 1

The document outlines the course BLS 211 (Introduction to Bioinformatics) at Bamidele Olumilua University, detailing the distinctions between bioinformatics and computational biology, and the aims and applications of bioinformatics. It covers various research areas including genomics, proteomics, and computer-aided drug design, as well as the classification and functions of biological databases. Additionally, it discusses sequence analysis, microarray analysis, and the importance of data collection and dissemination in bioinformatics.

BAMIDELE OLUMILUA UNIVERSITY OF EDUCATION, SCIENCE AND TECHNOLOGY
DEPARTMENT OF BIOLOGICAL SCIENCES
COURSE CODE: BLS 211 (INTRODUCTION TO BIOINFORMATICS)
LECTURER IN CHARGE: MR. I.I AJAYI
COURSE OUTLINES
INTRODUCTION TO BASIC MOLECULAR BIOLOGY, GENOMICS,
BIOLOGICAL DATABASES, AND HIGH-THROUGHPUT DATA SOURCES.

Difference between bioinformatics and computational biology

Bioinformatics refers to the field concerned with the collection and storage of biological information, while computational biology refers to the development of the algorithms and statistical models needed to analyze biological data with the aid of computers.

Or

Bioinformatics is the development and application of computational tools for managing all kinds of biological data, whereas computational biology is more confined to the theoretical development of the algorithms used in bioinformatics.

Or

Bioinformatics differs from the related field of computational biology. Bioinformatics is limited to sequence, structural, and functional analysis of genes and genomes and their corresponding products, and is often considered computational molecular biology, while computational biology encompasses all biological areas that involve computation.

Aims of bioinformatics

To organize data in a way that allows researchers to access existing information and to submit new entries as they are produced, e.g. the Protein Data Bank for 3D macromolecular structures.

To develop tools and resources that aid in the analysis of data, such as FASTA and PSI-BLAST.

To use these tools to analyze the data and interpret the results in a biologically meaningful manner.
Research areas in bioinformatics

Genomics: Genomics is the study of an organism's genome. The term was coined by Thomas H. Roderick in 1987.

Proteomics: The study of the proteome. The proteome is the entire set of proteins expressed in a cell.

Computer-Aided Drug Design: Computer-Aided Drug Design (CADD) is a specialized discipline that uses computational methods to simulate drug-receptor interactions. CADD methods depend heavily on bioinformatics tools, applications, and databases.

Biological databases: Biological databases are libraries of life sciences information, collected
from scientific experiments, published literature, high-throughput experiment technology, and
computational analyses. They contain information from research areas including genomics,
proteomics, metabolomics, microarray gene expression, and phylogenetics.

Biological Data Mining: Biological data mining is the discovery of useful knowledge from biological databases. Data mining employs algorithms and techniques from statistics, machine learning, artificial intelligence, databases, and data warehousing. Some of the most popular tasks are classification, clustering, association and sequence analysis, and regression.

Microarray informatics: Microarray technology is a powerful tool for monitoring the expression, or expression changes, of hundreds or thousands of genes in a single experiment.

Molecular Phylogenetics: Molecular phylogenetics is the study of organisms at the molecular level to gather information about the phylogenetic relationships between different organisms.

Systems biology: Systems biology studies biological systems by systematically perturbing them (biologically, genetically, or chemically); monitoring the gene, protein, and informational pathway responses; integrating these data; and ultimately formulating mathematical models that describe the structure of the system and its response to individual perturbations.

Agro-informatics: The term 'agri-informatics' was coined to cover broadly all types of development and application of computerized, information-technology-based solutions for gathering, managing, and analyzing data produced by agricultural systems, and for developing models and forecasting systems.

Applications of bioinformatics

i. Development of Varietal Information Systems: Under the Plant Variety Protection (PVP) Act, terms such as extant variety, candidate variety, reference variety, example variety, and farmer's variety are frequently used.
ii. Plant Genetic Resources Databases: These cover gene pools, genetic stocks, and germplasm, together with plant characters such as highly heritable morphological characters, yield-contributing characters, quality characters, resistance to biotic and abiotic stresses, and characters of agronomic value.
iii. Biometrical Analysis: Simple measures of variability such as mean, standard deviation, standard error, and coefficient of variation; correlations; path coefficients; discriminant function analysis; stability analysis; and diallel, partial diallel, line x tester, triallel, quadriallel, biparental, and triple test cross analyses.
iv. Storage and Retrieval of Data: Data can be easily stored on various storage devices such as hard disks, compact disks, pen drives, data cards, etc. Storing data on computers requires less space and is much safer than storing data in paper registers and files.
v. Studies on Plant Modelling: Computers are useful tools for studies on plant modelling; such model plants can then be developed through hybridization and directional selection.
vi. Pedigree Analysis: Computer aided studies are useful in pedigree analysis of various
cultivars and hybrids. Information about the parentage of cultivars and hybrids can be
retrieved any time and can be used in planning plant breeding programs especially in
the selection of parents for use in hybridization programs.
vii. Preparation of Reports: After biometrical analysis of data, results are interpreted and
various types of reports or documents are prepared.
viii. Updating of Information: In plant breeding and genetics, results of multi-seasonal and
long term experiments require continuous updating.
ix. Diagrammatic Representation: Inclusion of diagrams makes the reports, research
papers, articles, bulletins, etc. more attractive, informative and easily understandable.
x. Planning of Breeding Programs: Plant breeders have to plan various breeding programs
such as sowing plans of various breeding experiments, selfing and crossing plans,
breeder seed production plan, hybrid seed production plan, germplasm collectio n,
conservation, evaluation, distribution, utilization and documentation plan, screening
plan of breeding material against biotic and abiotic stresses, selection, quality
evaluation and multi-location testing plans.
BIOLOGICAL DATABASES

Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics.

Biological databases can be broadly classified into sequence, structure, and functional databases.

Nucleic acid and protein sequences are stored in sequence databases.

Structure databases store solved structures of RNA and proteins.

Functional databases provide information on the physiological role of gene products, for example enzyme activities, mutant phenotypes, or biological pathways.

Areas of Sequence Analysis

i. Sequence alignment
ii. Sequence database searching
iii. Motif and pattern discovery
iv. Gene and promoter finding
v. Reconstruction of evolutionary relationships
vi. Genome assembly and comparison

Areas of Structure Analysis

i. Protein and nucleic acid structure analysis
ii. Comparison, classification, and prediction

Areas of Functional Analysis

i. Gene expression profiling
ii. Protein-protein interaction prediction
iii. Protein subcellular localization prediction
iv. Metabolic pathway reconstruction and simulation
Major classes of biological databases

i. Primary databases
ii. Secondary databases
iii. Specialized databases

Primary databases contain original biological data. They are archives of raw sequence or structural data submitted by the scientific community. ENA, GenBank, DDBJ, and the Protein Data Bank (PDB) are examples of primary databases.

Secondary databases contain computationally processed or manually curated information, based on original information from primary databases. Translated protein sequence databases containing functional annotation belong to this category. Examples are SWISS-Prot and the Protein Information Resource (PIR, successor of Margaret Dayhoff's Atlas of Protein Sequence and Structure), InterPro (protein families, motifs, and domains), the UniProt Knowledgebase (sequence and functional information on proteins), and Ensembl (variation, function, regulation, and more, layered onto whole-genome sequences).

Specialized databases are those that cater to a particular research interest. For example, FlyBase, the HIV sequence database, and the Ribosomal Database Project are databases that specialize in a particular organism or a particular type of data.

Databases act as a storehouse of biological information:

I. Biological Databases are used to store and organize data in such a way that
information can be retrieved easily via a variety of search criteria.
II. It allows knowledge discovery, which refers to the identification of connections
between pieces of information that were not known when the information was first
entered. This facilitates the discovery of new biological insights from raw data.
III. Secondary databases have become the molecular biologist’s reference library over
the past decade or so, providing a wealth of information on just about any gene or
gene product that has been investigated by the research community.
IV. It helps to solve cases where many users want to access the same entries of data.
V. Allows the indexing of data.
VI. It helps to remove redundancy of data.
SELECTED TOPICS IN BIOINFORMATICS

Data collection and dissemination.

Bioinformatics is driven by the outpouring of massive genomics data. The best place to find updated online resources is the two special issues published annually by Nucleic Acids Research: the January issue covers databases and the July issue covers Web servers of online analysis tools. The Bioinformatics Links Directory is a good entry point, with curated links to most such molecular resources, tools, and databases. The NCBI nucleotide database is the best testimony to the ongoing genomic revolution. Together with EBI and DDBJ, these are the three major information databases in the world; they exchange their sequence data on a daily basis to ensure that the basic sequence information stored in their 'primary databases' is equivalent. The PDB is the biggest repository for 3D biomacromolecular structure data. In addition to basic sequence information, there are web resources for the ever-changing gene nomenclature (the HUGO Gene Symbol Database, GO, and GeneCards), for metabolic network information (KEGG), and for promoters and transcription factors (EPD and TRANSFAC, respectively).

Sequence analyses

There are many algorithms in sequence analysis. They may be grouped by molecule type (DNA, RNA, or protein) or by the number of sequences involved (single, pairwise, or multiple). Here we give only some typical yet important examples in each category.

 Comparing sequences: Pairwise comparison and similarity searches: Starting with a molecular sequence, one of the first questions everyone asks is 'is it similar or related to a known sequence?' The basic tool is similarity comparison/alignment; it has three components: a similarity (or distance) measure that gives a score to a pair of aligned sequences, an objective function to be optimized, and an algorithm to obtain the optimal alignment. A scoring matrix provides a numerical value (penalty) to apply to each mismatch, be it a deletion, insertion, or substitution. In addition, protein sequence mismatches must account for the similarities and differences of each possible amino acid pair (PAM and BLOSUM matrices). The Needleman-Wunsch global alignment algorithm (searching for the best alignment of the two entire sequences from beginning to end) and the Smith-Waterman local alignment algorithm (searching for the best subsequence alignment) are the best-known rigorous (most sensitive) algorithms, both based on dynamic programming. Since rigorous search is costly, faster approximate algorithms are most often used (e.g. FASTA, BLAST, BLAT).
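The dynamic-programming idea behind Needleman-Wunsch can be sketched in a few lines of Python. This is a minimal illustration with assumed toy scores (match +1, mismatch -1, gap -1) standing in for the PAM or BLOSUM substitution matrices and affine gap penalties that real tools use:

```python
# Minimal Needleman-Wunsch global alignment sketch (toy scoring assumptions).

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i-1] == b[j-1] else mismatch
            F[i][j] = max(F[i-1][j-1] + s,    # (mis)match
                          F[i-1][j] + gap,    # gap in b
                          F[i][j-1] + gap)    # gap in a
    # Traceback from the bottom-right corner to recover one optimal alignment.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch):
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i-1][j] + gap:
            out_a.append(a[i-1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j-1]); j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("ACGT", "ACT"))   # -> (2, 'ACGT', 'AC-T')
```

Smith-Waterman differs mainly in clamping cell scores at zero and tracing back from the maximum cell, which yields the best local (subsequence) alignment instead.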
Multi-sequence alignment and phylogenetic trees: Aligning multiple sequences is of interest for exploring which nucleic acid or protein sequences are most preserved by evolution, thus suggesting critical functions, and may be used to infer the evolutionary distances among species. The optimal alignment of a set of sequences may not contain the optimal pairwise alignments. ClustalW is the most commonly used program. It uses a progressive method (hierarchical clustering by pairwise alignments) and weights each sequence to reduce redundancy. The more recently developed T-Coffee is similar to ClustalW, but it compares segments across the entire sequence set; it can combine sequences and structures, evaluate alignments, or integrate several different alignments. Although ClustalW can be used to build phylogenetic trees, Phylip and PAUP are much more accurate, powerful, and versatile.
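The hierarchical-clustering idea behind a progressive aligner's guide tree can be sketched with UPGMA (average linkage) on a toy distance matrix. The species names and distances below are invented for illustration; dedicated phylogenetics packages such as Phylip implement far more careful methods:

```python
# Minimal UPGMA (average-linkage) tree-building sketch on invented distances.

def upgma(dist):
    """dist: {frozenset({a, b}): distance}. Returns a nested-tuple tree."""
    clusters = {name: 1 for pair in dist for name in pair}  # cluster -> leaf count
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)          # closest pair of clusters
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        for c in list(clusters):                # average-linkage distance update
            d = (dist.pop(frozenset({a, c})) * size_a +
                 dist.pop(frozenset({b, c})) * size_b) / (size_a + size_b)
            dist[frozenset({merged, c})] = d
        del dist[frozenset({a, b})]
        clusters[merged] = size_a + size_b
    return next(iter(clusters))

d = {frozenset({"human", "chimp"}): 2.0,
     frozenset({"human", "mouse"}): 8.0,
     frozenset({"chimp", "mouse"}): 8.0}
print(upgma(d))   # groups human with chimp before joining mouse
```

UPGMA implicitly assumes a constant evolutionary rate (a molecular clock), which is one reason the dedicated tools mentioned above are preferred for serious phylogenetic inference.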
 Analyzing DNA sequences. Finding protein-coding genes: In bacterial DNA, each protein is encoded by a contiguous fragment called an open reading frame (ORF, beginning with a start codon and ending with a stop codon). In eukaryotes, especially in vertebrates, the coding region is split into several fragments called exons, and the intervening fragments are called introns. Finding eukaryotic protein-coding genes is essentially predicting these exon-intron structures. Almost every possible statistical pattern recognition and machine learning algorithm has been applied to this difficult problem.

Identification of promoters and transcription factor binding site (TFBS) motifs: In order to study gene regulation and better interpret microarray expression data, promoter prediction methods that find differences between sets of known promoter and non-promoter sequences have become important. These include quadratic discriminant analysis, machine learning approaches (FirstEF), artificial neural networks (DPF, the relevance vector machine), and Monte Carlo sampling (Eponine). Because of the lack of protein-coding signatures, current promoter predictions are much less reliable than protein-coding region predictions, except for CpG-island genes. Once regulatory regions, such as promoters, are obtained, finding TFBS motifs within these regions may proceed either by enumeration or by alignment to find the enriched motifs.
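The bacterial ORF definition above (start codon to in-frame stop codon) translates directly into a simple scan. This sketch checks only the forward strand in its three reading frames; a real gene finder would also scan the reverse complement and, for eukaryotes, model exon-intron structure:

```python
# Forward-strand ORF scan illustrating the start-to-stop codon definition above.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, orf) tuples for forward-strand ORFs in all 3 frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i+3] == "ATG":                    # start codon
                j = i + 3
                while j + 3 <= len(seq):
                    if seq[j:j+3] in STOP_CODONS:      # in-frame stop codon
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, seq[i:j+3]))
                        break
                    j += 3
                i = j   # resume after this ORF (nested starts are skipped)
            else:
                i += 3
    return orfs

print(find_orfs("CCATGAAATGATAG"))   # -> [(2, 11, 'ATGAAATGA')]
```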
Microarray analysis

Microarrays typically contain thousands of 'spots', each holding many copies of a different probe molecule. Molecules that were taken from a sample of interest (say, tumor cells) and are capable of hybridizing to the probe molecules (cDNA or RNA) are marked with a fluorescent marker and 'washed over' the microarray. Hence, their relative abundance in the sample can be inferred from the luminescence of the spot. Since abundance is relative, a control sample with a contrasting fluorophore is invariably used. By choosing the set of probes, one can assemble a microarray to measure a variety of genetic patterns. The uses of microarrays have expanded from gene-expression profiling (which genes are over- or under-expressed in the sample relative to the control) and now include comparing whole genomes (e.g. normal vs. tumor), identifying alternative RNA transcripts, locating genes that have been methylated (turned off by having had methyl groups attached), detecting protein modifications and interactions, and many forms of genotyping. New types of microarrays are still appearing.

 Expression microarray analysis: Expression microarrays are used to measure mRNA abundance for large numbers of genes. The low-level computational tasks, such as experimental design and preprocessing (image analysis and normalization), aim to reduce uncontrollable sample variations, which may depend on the specific type of microarray. Many data analysis packages can be found in the open-source Bioconductor software repository.

Normalization: Normalization, a critical step in data preprocessing, removes unwanted variance from the data by exploiting and enforcing known or assumed invariances. Common approaches include:
1) rescaling by the median of all or 'housekeeping' genes, or by spike-in RNA controls;
2) explicit one-parameter (log) or two-parameter (asinh) transformation;
3) local regression smoothing (LOESS); and
4) quantile normalization.

Exploratory analysis: Since the number of genes (measurements) generally far exceeds the number of observations (cases), substantial variable reduction (e.g. filtering out low-varying genes) is usually done before any machine learning or statistical algorithms are applied. Exploratory analysis aims to find patterns in the data; common methods include clustering (genes, cases, or both), Principal Component Analysis (PCA), and Multi-Dimensional Scaling (MDS). Bayesian Networks (BN) have also been used to describe interactions between genes.

Identifying differentially expressed genes (DEGs): The most common task of microarray studies is to identify genes that are differentially regulated across different classes of samples; examples are finding the genes affected by a treatment, or finding marker genes that discriminate cancer from normal tissue. Statistical tests include the t-test and permutation test for two groups and the ANOVA/F-test for multiple groups. To correct for multiple testing, the q-value, which specifies the smallest False Discovery Rate (FDR) at which a gene is called significant, is often used instead of the conventional p-value. There are also several emerging nonparametric approaches, such as Empirical Bayes (EB), the Significance Analysis of Microarrays (SAM) method, and the Mixture Model Method (MMM), that seem even more powerful (see, e.g., Pan for a performance comparison). This is an active research area; a plethora of well-established and new methods are being applied, and a consensus best practice has yet to emerge.
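A two-group permutation test, as mentioned above, can be sketched for a single gene. The 'tumor'/'normal' log2-expression values below are invented for illustration; a real study would repeat this per gene and apply multiple-testing correction across thousands of genes:

```python
# Exact permutation test for one gene (invented data, illustration only).
import itertools
import statistics

def t_stat(x, y):
    """Welch-style two-sample t statistic."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / (vx / len(x) + vy / len(y)) ** 0.5

def permutation_p(x, y):
    """Exact permutation p-value for |t| under the null of no group difference."""
    pooled = x + y
    t_obs = abs(t_stat(x, y))
    hits = total = 0
    for idx in itertools.combinations(range(len(pooled)), len(x)):
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += 1
        if abs(t_stat(g1, g2)) >= t_obs - 1e-12:   # tolerance for float ties
            hits += 1
    return hits / total

tumor = [8.1, 8.4, 8.0, 8.3]    # hypothetical log2 expression, one gene
normal = [6.9, 7.1, 7.0, 7.2]
print(round(permutation_p(tumor, normal), 4))   # -> 0.0286 (2 of 70 splits)
```

Because the two groups separate cleanly, only the observed labeling and its mirror reach the observed |t|, giving p = 2/70.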
 Genomic microarray analysis: Most of the human genome (about 98%) does not encode protein, so the gene expression microarrays of the previous section necessarily examine a small fraction of it; hence the interest in approaches capable of providing a genome-wide view. The major application of genomic microarrays is the localization of DNA-binding proteins or the detection of DNA copy-number changes, although genomic tiling arrays have also been used to detect novel RNA transcripts.

Identification of protein binding sites in chromatin DNA: ChIP-chip is the most popular method for localizing chromatin DNA-binding proteins in vivo. Combining microarray data (either expression or localization data) with promoter analysis for TFBS motif identification is becoming a powerful extension of the methods described above. If positive and negative gene sets extracted from microarray data are available, then motif discovery turns into a classification problem: identify the motifs that best discriminate the two gene sets. If continuous scores are available, then the problem turns into a regression problem: identify the motifs that best correlate with these scores. Such analyses are very useful in Gene Regulatory Network (GRN) and Cis-Regulatory Module (CRM) studies.
Identification of amplifications and deletions in the human genome

One of the important applications of genomic arrays in cancer is the detection of amplifications (potential oncogene loci) and deletions (potential tumor suppressor gene loci). ArrayCGH (comparative genomic hybridization) and ROMA (representational oligonucleotide microarray analysis) are two emerging technologies capable of yielding a genome-wide picture of DNA copy number. The bioinformatics needs include schemes for reducing noise and ways to visualize the enormous amounts of information and focus in on what is of biological significance. Other types of arrays, such as alternative splicing arrays, protein-binding microarrays, protein microarrays, tissue/cell arrays, and microRNA arrays, are also used. The translation of microarray-based results to clinical applications challenges the technology at all levels. These challenges include robust probe design, uniform sample preparation, and increased reproducibility of array measurements, as well as advanced data analysis tools for new computational challenges. Recent advances in genomic sciences and array technologies are accelerating the translation of microarrays to clinical applications and offer enormous potential for improved health care in cancer and other human diseases.

Systems biology

Biologists have elucidated the complete gene sequences of several model organisms and provided a general understanding of the molecular machinery involved in gene expression. The next logical step is to understand how all the components interact with each other, in order to model complex biological systems. It is envisioned that only with this 'systems view' will we improve the accuracy of our diagnostic and therapeutic endeavors. The field of systems biology emerged at the turn of this century and aims to merge our piecemeal knowledge into comprehensive models of the whole dynamics of these systems. The challenge is daunting; considering the potential of serum proteomics, Weston and Hood warn: "In addition to the immense repertoire of proteins present, the dynamic range of these proteins is on the order of 10^9, with serum albumin being most abundant (30-50 mg/mL) and low-level proteins such as interleukin-6 present at 0-5 pg/mL... Identifying proteins at each end of this spectrum in a single experiment is not feasible with current technologies." "Further complicating the study of the human plasma proteome are temporal and spatial dynamics. The turnover of some proteins is several-fold faster than that of others, and the protein content of the arteries may differ substantially from that of the veins, or the capillary proteome may be specific to its location, etc."

The goal of gene and protein network research is to understand quantitatively how different genes and their regulating proteins are grouped together in genetic circuits, and how stochastic fluctuations influence gene expression in these complex systems. For example, Thattai and van Oudenaarden focus on the importance of noise in the expression of genes, using both experimental and theoretical approaches. They investigated the bistability that arises from a positive feedback loop in the lactose utilization network of E. coli. In its simplest form, the network may be modeled as a single positive feedback loop: lactose uptake induces the synthesis of lactose permease, which in turn promotes the further uptake of lactose. Because of this bistability, the response of a single cell to an external inducer depends on whether the cell has been induced recently, a phenomenon known as hysteresis. The question is how the gene network architecture helps cells remember their history for more than 100 cell generations. The field is still new, but the reader may find tutorials and pointers to emerging modeling efforts at the web sites of the three major systems biology organizations in Europe, the USA, and Japan.
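The single positive feedback loop described above can be illustrated with a toy rate equation. The Hill-type production term and all parameter values here are assumptions chosen only to produce bistability, not a calibrated model of the lac system:

```python
# Toy bistability/hysteresis demo for a positive-feedback loop (all parameter
# values are illustrative assumptions, not measured lac-operon constants).

def simulate(x0, inducer, steps=20000, dt=0.01):
    """Euler-integrate dx/dt = basal + inducer * x^2 / (1 + x^2) - x."""
    x = x0
    for _ in range(steps):
        x += dt * (0.05 + inducer * x * x / (1.0 + x * x) - x)
    return x

# Identical inducer level, different histories -> different steady states,
# i.e. the cell 'remembers' whether it was recently induced (hysteresis).
low_history = simulate(x0=0.0, inducer=2.0)    # never induced: stays low
high_history = simulate(x0=3.0, inducer=2.0)   # recently induced: stays high
print(round(low_history, 2), round(high_history, 2))
```

The sigmoidal (cooperative) production term is what creates two stable fixed points separated by an unstable one; with a linear production term the loop would have a single steady state and no memory.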

Clinical genotyping

A long-standing dream has been the possibility that predisposition to disease and therapy response may be predictable from a person's genome. The well-known link between mutations in the BRCA1/2 genes and breast cancer predisposition, and the more recent link between a mutation in the EGFR gene and response to the drug Iressa, are just two examples of the results that have encouraged this enthusiasm. The approach proposed is the 'association study', in which genomes are sequenced from a group of people known to be in a phenotypic group (e.g. prone to a disease, responsive to a therapy) and from those not in this group. The strength of association between a proposed genetic pattern and the phenotypic trait is measured by a simple chi-squared type statistic. Operational questions that need to be addressed to advance this paradigm include: how gene sequences are measured, how candidate genetic patterns are selected for testing, what statistical safeguards are needed to minimize false positives and negatives, and how to obtain biological validation. Sequencing an entire genome is both expensive and largely unnecessary, because 99.9% of the human genome is common to us all.
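The chi-squared type statistic mentioned above can be computed from a 2x2 case/control table. The genotype counts below are invented for illustration:

```python
# Pearson chi-squared for a 2x2 case/control association table (invented counts).

def chi_squared_2x2(a, b, c, d):
    """Chi-squared for the table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    # expected counts under independence, from row and column totals
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    return sum((o - e) ** 2 / e for o, e in zip([a, b, c, d], expected))

# carriers vs non-carriers of a candidate allele among cases and controls
stat = chi_squared_2x2(60, 40,    # cases:    60 carriers, 40 non-carriers
                       35, 65)    # controls: 35 carriers, 65 non-carriers
print(round(stat, 3))   # -> 12.531, well above the 1-df 5% critical value 3.841
```

A value this far above the critical value would flag the candidate pattern for follow-up, though a genome-wide study must still correct for the enormous number of such tests, as discussed below.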

Hence, interest has focused on Single Nucleotide Polymorphisms (SNPs): differences of one nucleotide at one locus. Accumulated knowledge of SNPs in the human genome is available from the NCBI (dbSNP), which currently contains over ten million reference SNPs, about half of them validated. The level of interest may also be gauged by observing that a recent study lists 30 companies with SNP-technology offerings. Hence, there appear to be too many potential genetic variants to make genome-wide association studies practical. Schemes for reducing the number of candidates include focusing only on SNPs in protein-coding regions, and on 'non-synonymous' SNPs (i.e. SNPs that alter the amino acid). Such approaches depend on the common-variant/common-disease (CVCD) hypothesis. Additional help for this paradigm may come as knowledge accumulates of which non-synonymous SNPs are most likely to produce deleterious protein alterations, and algorithms that exploit this knowledge are being developed. This is referred to as the 'direct' approach and is expected to yield results for single-gene disorders. An 'indirect' approach involves defining haplotypes.

These are sets of SNPs at different loci located in close proximity on the same chromosome; they tend to be inherited as a unit, that is, they exhibit 'linkage disequilibrium' (LD). Hence, the haplotype, and not the individual SNP, is proposed as the effective unit of genotype characterization, greatly reducing the combinatorics. Identifying haplotypes poses experimental and bioinformatic challenges. Some propose family studies as a way to identify haplotypes related to diseases and their LD, wherein parents and offspring in families with disease prevalence are carefully studied. In contrast, population studies involve collecting genotypes from a suitable sample, say from different ethnic groups, and applying pattern discovery algorithms to locate suitable haplotypes. The HapMap project is an international collaborative project to collect data on about 270 individuals in five population groups, with information on about 600,000 SNPs, and make it publicly available. Unsupervised learning algorithms for inferring haplotypes include the Clark algorithm, which begins with one or more homozygous individuals (or individuals heterozygous at no more than one locus, a problem for some datasets) and builds an initial haplotype set; it then adds the heterozygous individuals and extends the set only as needed to cover them (a parsimony criterion). Some genotypes may be left unassigned to haplotypes in some datasets. Expectation Maximization (EM) algorithms (e.g. Excoffier et al.) make an initial guess at haplotype frequencies and iteratively converge (with reasonable probability) so that all genotypes are assigned; EM algorithms can be computationally challenged by large datasets. Bayesian approaches have been reported to perform better than the previous two classes, but all these approaches may fail to exploit some genetic alterations.

Two additional bioinformatics challenges involving haplotypes are the search for haplotype blocks (larger SNP regions that still satisfy LD criteria) and the location of minimal sets of SNPs that serve to identify the different genotypes (called tagSNPs). Good haplotype blocks would further reduce the combinatorics of the genotype candidates that need to be considered, and tagSNPs would reduce the amount of DNA that needs to be sequenced for new individuals. Algorithms for tagSNP identification, and the issues related to them, are discussed elsewhere.

Clearly, genotype-disease association discovery faces many challenges. Substantial population samples and careful matching of controls may be needed, as the haplotypes discovered in stratified samples often exhibit substantial differences; genotypes that are meaningful and practical will take work to identify. DNA sequencing measurements are still costly. There appear to be significant opportunities for improved algorithms; for example, complex diseases may resist current approaches, calling for more sophisticated pattern discovery methods. Algorithms for the 'static pattern discovery' paradigm discussed in the following section probably apply; genetic algorithms have barely been applied in this domain so far.

For cDNA microarrays, methods to deal with spatial biases have recently been proposed. For
mass spectroscopy, normalization usually at least includes total ion current normalization to
correct for differences in overall spectrum intensity. More controversial is within-spectrum
normalization, wherein the selected measurements are linearly scaled to [0,1] in order to
preserve only the relative protein abundances. Another issue with MS data is the choice
between peak identification (which requires specifying a noise cutoff) and binning (merging
adjacent intensities to reflect machine precision).

A major bioinformatics issue in this emerging field is how to cope with datasets that are
measurement-rich but case-poor. One traditional approach is to reduce the number of
measurements, either by filtering out those that fail to meet some specified criterion of
‘signal’ (e.g. a signal-to-noise cutoff, and/or a cutoff on the likelihood that the measurement
means differ between the two groups), or by using principal components analysis (PCA). One
difficulty with PCA is that the results may be difficult to interpret biologically. An alternative,
sometimes called a ‘wrapper’ approach, searches the space of possible measurement subsets
using some form of gradient descent or evolutionary search algorithm, wherein the worth of
any proposed subset is evaluated by inducing a classifier and testing its classification accuracy.
A risk with the filtering approach is the possibility of missing patterns that include
measurements that are not strongly discriminating by themselves. The risk with wrapper
approaches is the possibility of discovering patterns that overexploit chance variance in the
small samples (overfitting). One method strongly recommended to avoid overfitting is
cross-validation; unfortunately, its scope is severely hampered by the small sample sizes.
Michaels et al. have shown how sensitive the discovered patterns are to the specific set of
learning cases used. Another issue is whether or not to use correlated measurements in a
classifier. Arguments based on Vapnik’s approach to structured risk minimization dictate the
use of the smallest measurement sets that do the job.
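Total ion current normalization and a crude signal-to-noise filter of the kind described above can each be sketched in a few lines. The sample spectrum, noise floor, and cutoff below are fabricated for illustration.

```python
def tic_normalize(spectrum):
    """Divide every intensity by the total ion current (the spectrum's sum),
    correcting for differences in overall spectrum intensity."""
    total = sum(spectrum)
    return [x / total for x in spectrum]

def snr_filter(spectrum, noise_floor, cutoff):
    """Keep the indices of intensities exceeding cutoff * noise_floor
    (a crude signal-to-noise criterion)."""
    return [i for i, v in enumerate(spectrum) if v > cutoff * noise_floor]

spec = [10.0, 250.0, 5.0, 400.0, 40.0]
norm = tic_normalize(spec)
print(round(sum(norm), 6))                              # 1.0 after normalization
print(snr_filter(spec, noise_floor=10.0, cutoff=3.0))   # indices of strong peaks: [1, 3, 4]
```

Filtering like this is fast but, as noted above, risks discarding measurements that are only jointly discriminating; wrapper searches avoid that risk at the cost of overfitting.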

Another informatics approach is an ‘ensemble’ approach, wherein multiple classifiers are
derived and the final decision comes from some form of voting scheme (e.g. a weighted sum)
among them. Another key decision required in a wrapper approach is the choice of classifier.
The arguments from risk minimization for using the simplest effective classifier entail
assumptions about the homogeneity of the disease classes that in some cases are clearly
unsupportable. In the end, what is an investigator to do? We conjecture that none of the early
studies, that have done so much to show the potential and stir excitement, will be shown to
have located the best diagnostics. We believe the way forward will be found by a community-
wide effort that involves incremental improvements in the measurement devices, careful
bioinformatics that lead to new hypotheses about disease mechanisms, and larger studies that
include more of the inherent disease variability (and the natural interpersonal variance) and
that exploit better screening to reduce the variability that can be controlled.
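The weighted-sum voting scheme mentioned above can be caricatured with a tiny ensemble; the threshold classifiers and weights here are made up purely to show the mechanics.

```python
def weighted_vote(classifiers, weights, sample):
    """Combine binary classifiers (each returning +1 or -1) by a weighted sum;
    the ensemble predicts +1 ('disease') when the sum is positive."""
    score = sum(w * clf(sample) for clf, w in zip(classifiers, weights))
    return 1 if score > 0 else -1

# Three toy threshold classifiers over a two-measurement sample.
clf_a = lambda x: 1 if x[0] > 0.5 else -1
clf_b = lambda x: 1 if x[1] > 0.2 else -1
clf_c = lambda x: 1 if x[0] + x[1] > 1.0 else -1

ensemble = [clf_a, clf_b, clf_c]
weights = [0.6, 0.2, 0.2]  # e.g. proportional to each classifier's estimated accuracy

print(weighted_vote(ensemble, weights, [0.9, 0.05]))  # clf_a outweighs the two dissenters: 1
```

Weighting lets a reliable classifier override weaker ones, which is exactly the behavior the risk-minimization argument in the text cautions must be justified by the data.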

Prognostic disease models

The use of modeling in medicine has a long history. Prognostic models developed from early
‘illness scores’, initially devised by experts to try to predict disease outcomes; later models
used regression methods that required increasing amounts of data. These models may be
considered ‘static’. Dynamic models have long been used in epidemiology and in the
modeling of physiological systems such as the cardiovascular system. But a new opportunity
is just emerging in this era of molecular medicine: building systems biology models that
capture the dynamics of disease at the molecular/cellular level and applying them to medical
diagnosis and/or prognosis. In distinction from the ‘static pattern recognition’ problem
mentioned above, this approach is a ‘dynamic pattern recognition’ task. As such, it requires a
series of vectors of measurements taken across the time course of disease. Without losing
sight of the challenges already mentioned, Weston and Hood also opine that networks have
key nodal points where therapy/intervention can effectively be focused. While there are no
concrete clinical applications yet, the promise is clear.
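The dynamic epidemiological models alluded to above can be illustrated with a minimal SIR (susceptible-infected-recovered) model integrated by forward-Euler steps; the rates, step size, and initial conditions are arbitrary illustrative values, not fitted to any disease.

```python
def sir_step(s, i, r, beta, gamma, dt):
    """One forward-Euler step of the classic SIR epidemic model
    (s, i, r are population fractions; beta = infection rate, gamma = recovery rate)."""
    ds = -beta * s * i
    di = beta * s * i - gamma * i
    dr = gamma * i
    return s + ds * dt, i + di * dt, r + dr * dt

def simulate(days, beta=0.3, gamma=0.1, dt=0.1):
    s, i, r = 0.99, 0.01, 0.0  # start with 1% of the population infected
    for _ in range(int(days / dt)):
        s, i, r = sir_step(s, i, r, beta, gamma, dt)
    return s, i, r

s, i, r = simulate(160)
print(round(s + i + r, 6))  # the population fractions are conserved: 1.0
```

A molecular-level systems biology model replaces these three compartments with concentrations of transcripts, proteins, and metabolites, but the mathematical machinery (coupled differential equations integrated over the disease time course) is the same in spirit.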

Personalized medicine
The aim of personalized medicine is to find the right therapy for individual patients based on
their genotype, environment and lifestyle. A tantalizing example is the Iressa story: the drug
works remarkably well for about 10% of patients with advanced non-small cell lung cancer,
namely those with a mutation of the epidermal growth factor receptor (EGFR) gene. This
dream obviously depends on the maturation of much that has been covered above. In a broad
sense, it includes the development of genomics-based personalized medicines, predisposition
testing, preventive medicine, the combination of diagnostics with therapeutics, and the
monitoring of therapy. An additional bioinformatics challenge, not mentioned above, will be
Clinical Decision Support Systems (CDSS) able to distill voluminous and complex data into
actionable clinical recommendations, whether preventive, diagnostic, or therapeutic. CDSS
involves linking two types of information: patient-specific and knowledge-based.
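The linking of patient-specific records with a knowledge base can be caricatured as a tiny rule engine; the rules, record fields, and thresholds below are invented examples for illustration, not clinical guidance.

```python
# Knowledge base: (description, predicate over the patient record, advice).
RULES = [
    ("EGFR mutation present",
     lambda p: "EGFR" in p.get("mutations", []),
     "consider EGFR-targeted therapy"),
    ("elevated fasting glucose",
     lambda p: p.get("fasting_glucose_mg_dl", 0) > 125,
     "screen for diabetes"),
]

def recommend(patient):
    """Return the advice of every knowledge-base rule whose predicate
    matches the patient-specific record."""
    return [advice for _, pred, advice in RULES if pred(patient)]

patient = {"mutations": ["EGFR"], "fasting_glucose_mg_dl": 100}
print(recommend(patient))  # ['consider EGFR-targeted therapy']
```

Real CDSS knowledge bases are curated, versioned, and evidence-graded, but the core operation, evaluating knowledge-based rules against a patient-specific record, is as shown.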

Personal information related to the patient history is documented in patient records. Some
personal medical documents, which are already in use to various extents in different countries,
include the personal emergency card, the mother-child record, and the vaccination certificate.
A promising source of personal medical information is the data stored in the electronic patient
record combined with the genomic information from genotyping and from particular molecular
diagnostic tests. Molecular imaging enables visualization of cellular and molecular processes
that may be used to infer information about the genomic and proteomic profiles.

As a result, bioinformatic analysis of genomic and proteomic profiles may be valuable in
assisting the interpretation of images obtained using molecular probes. Molecular diagnostics
and molecular imaging provide two complementary views of a disease: molecular diagnostics
can identify the exact mutation of a particular gene and classify the exact type of cancer,
while molecular imaging can target the very cells carrying that mutation in order to provide
diagnostic information and disease staging.

Health and wellness monitoring

Current methods in bioinformatics have had immediate impact on the diseases at the top of
the killer list: heart disease and cancer. However, these technologies may also enable
non-invasive and inexpensive first indicators that an ordinary person is becoming a patient.
Nutritional genomics studies the genome-wide influences of nutrition, with far-reaching
potential in the prevention of nutrition-related disease. Nutrition is not like pharmacology or
toxicology, where a drug acts upon a single receptor/target and dose-related pathological
effects are induced, with correspondingly strong effects on transcriptomic changes. Our daily
food consumption consists of complex mixtures of many possibly bioactive chemical
compounds, chronically administered in varying composition, and eliciting a multitude of
biological reactions that depend on our genotype.

The role of bioinformatics in nutrigenomics is multifold: to create nutrigenomic databases; to
set up special ontologies using available resources; to set up and track laboratory samples
being tested and their results; pattern recognition, classification, and data mining; and the
simulation of complex interactions between genomes, nutrition, and health disparities. A key
objective is the development of tools to identify selective and sensitive multi-parameter
(pathway-supported) biomarkers of prevention (transcriptomic and metabolic profiles or
fingerprints) based on the perturbation of homeostasis. Nutrigenomics research will have a
profound impact on our understanding of the relationship between genotype and
environment. The nutritional supplement and functional food industries will continue robust
growth in response to advances in nutritional genomics research and its applications.

POLYMERASE CHAIN REACTION (PCR)


REVISION QUESTIONS

1. What are the 3 components of a DNA nucleotide?
2. What is the difference between RNA and DNA?
3. Draw a chart comprising the genetic code.
4. What are the 4 bases that make up the 4 nucleotides found in DNA?
5. Which part of the DNA double helix is covalently bonded, and which is hydrogen-
bonded?
6. Outline three (3) major types of the double helix.
7. What is bioinformatics, and what is the difference between bioinformatics and
computational biology?
8. Outline 3 aims of bioinformatics.
9. List and briefly explain five (5) research areas in bioinformatics.
10. What are biological databases?
11. Enumerate 3 classes of biological databases.
12. List 5 applications and 5 limitations of bioinformatics.
13. List and briefly explain 10 techniques in animal molecular biology.
REFERENCES

http://bioinformatics.ubc.ca/resources/links_directory/
http://www.ncbi.nlm.nih.gov/
http://www.ebi.ac.uk/
http://www.ddbj.nig.ac.jp/
http://www.rcsb.org/pdb/
http://www.gene.ucl.ac.uk/nomenclature/
http://www.genome.ad.jp/kegg/
http://www.epd.isb-sib.ch/
http://www.gene-regulation.de/
http://www.systemsbiology.org/
http://nutrigenomics.ucdavis.edu/bioinformatics.htm
