Training Problem
Another tricky problem is how to create an HMM in the first place, given a
particular set of related training sequences. The amino-acid emission
distribution in each state and all state-to-state transition probabilities
must be estimated from the training sequences. This is done using the
Baum-Welch algorithm, an expectation-maximization procedure built on the
forward-backward algorithm.
The algorithm proceeds by making an initial guess of the parameters (which may
well be entirely wrong) and then refining it: it assesses how well the current
model fits the training data and adjusts the parameters to reduce the errors.
In this sense, it behaves like a form of gradient descent, searching for a
minimum of an error measure (equivalently, a local maximum of the likelihood).
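To make this concrete, here is a minimal NumPy sketch of one Baum-Welch re-estimation step for a discrete-emission HMM. The names (A for transitions, B for emissions, pi for the initial distribution) and the per-step scaling scheme are our own choices for illustration, not anything prescribed here:

```python
import numpy as np

def forward_backward(obs, A, B, pi):
    """Scaled forward (alpha) and backward (beta) tables for one sequence.
    Per-step scaling avoids numerical underflow on long sequences."""
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N)); beta = np.zeros((T, N)); scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    return alpha, beta, scale  # log-likelihood = sum(log(scale))

def baum_welch_step(obs, A, B, pi):
    """One EM re-estimation step; returns updated (A, B, pi)."""
    obs = np.asarray(obs)
    T, N, M = len(obs), A.shape[0], B.shape[1]
    alpha, beta, scale = forward_backward(obs, A, B, pi)
    # gamma[t, i] = P(state i at time t | obs); normalize row-wise.
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    # xi accumulates expected transition counts i -> j.
    xi = np.zeros((N, N))
    for t in range(T - 1):
        x = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi += x / x.sum()
    newA = xi / gamma[:-1].sum(axis=0)[:, None]
    newB = np.zeros((N, M))
    for k in range(M):
        newB[:, k] = gamma[obs == k].sum(axis=0)
    newB /= gamma.sum(axis=0)[:, None]
    return newA, newB, gamma[0]
```

Each call to `baum_welch_step` is one EM iteration; iterating never decreases the likelihood of the training sequence, though it may stop at a local maximum.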
1. Applications of HMMs
In this section, we look in greater depth at specific problems in
computational biology and examine the role of HMMs.
1.1. Gene finding and prediction
We introduce here the gene-prediction HMMs that can be used to predict the
structure of the gene. Our objective is to find the coding and non-coding regions of
an unlabeled string of DNA nucleotides.
The motivation behind this is to
- assist in the annotation of genomic data produced by genome-sequencing methods
- gain insight into the mechanisms involved in transcription, splicing and other processes
As shown in the diagram above, a string of DNA nucleotides containing a gene will
have separate regions:
- Introns – non-coding regions within a gene
- Exons – coding regions
These regions are separated by functional sites:
- Start and stop codons
- Splice sites – acceptors and donors
During transcription and splicing, the introns are removed and only the exons
remain to encode the protein sequence, as depicted below.
Many problems in biological sequence analysis have a grammatical structure, and
HMMs are very useful for modeling grammars. The input to such an HMM is the
genomic DNA sequence, and the output, in the simplest case, is a parse of the
DNA sequence into exons and introns.
Shown below is a simple model for unspliced genes that recognizes the start
codon, stop codon (only one of the three possible stop codons is shown) and the
coding/non-coding regions. This model has been trained on a set of gene
data.
Given such a model, how can we predict genes in a sequence of anonymous
DNA? We simply use the Viterbi algorithm to find the most probable path through
the model.
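For example, gene prediction with the Viterbi algorithm can be sketched on a hypothetical two-state model, coding versus non-coding; every probability below is made up for illustration, not a trained value:

```python
import math

# Hypothetical two-state gene model: "N" (non-coding) and "C" (coding).
states = ["N", "C"]
start = {"N": 0.9, "C": 0.1}
trans = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit = {"N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},   # uniform background
        "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}}   # GC-rich coding

def viterbi(seq):
    """Most probable state path for seq, computed in log space."""
    V = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    back = []
    for x in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            ptr[s] = prev
            col[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][x])
        V.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))
```

A GC-rich stretch would mostly be labeled "C", an AT-rich one "N"; the self-transition probabilities of 0.9 discourage rapid switching between the two labels.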
1.2. Protein Profile HMMs
As we have seen earlier, protein structural similarities make it possible to
create a statistical model of a protein family, which is called a profile. The
idea is that, given a single target amino-acid sequence of unknown structure,
we want to infer the structure of the resulting protein.
The profile HMM is built by analyzing the distribution of amino acids in a
training set of related proteins. This HMM naturally models position-dependent
gap penalties.
The basic topology of a profile HMM is shown above, with match states,
insertion states and deletion states. Each position, or module, in the model
has three states. A state shown as a rectangular box is a match state
that models the distribution of letters in the corresponding column of an
alignment. A state shown by a diamond-shaped box models insertions of random
letters between two alignment positions, and a state shown by a circle models a
deletion, corresponding to a gap in an alignment. States of neighboring positions
are connected, as shown by lines. For each of these lines there is an
associated 'transition probability', which is the probability of going from one
state to the other.
The match state represents a consensus amino acid for this position in the protein
family. The delete state is a non-emitting state, and represents skipping this
consensus position in the multiple alignment. Finally, the insert state models the
insertion of any number of residues after this consensus position.
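As an illustration of how the match-state emission distributions might be estimated, here is a small sketch that counts residues per column of a toy gapped alignment; the pseudocount smoothing and the four-letter alphabet are our own simplifications (a real protein profile uses the 20 amino acids):

```python
from collections import Counter

ALPHABET = "ACGT"  # toy alphabet for the sketch

def match_emissions(alignment, pseudocount=1.0):
    """Estimate per-column match-state emission probabilities from a gapped
    multiple alignment (equal-length strings, '-' marking a deletion/gap)."""
    ncols = len(alignment[0])
    profile = []
    for j in range(ncols):
        # Count residues in column j, skipping gaps (those feed delete states).
        counts = Counter(seq[j] for seq in alignment if seq[j] != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return profile
```

The pseudocounts keep every residue's probability non-zero, so sequences containing a residue unseen in the training columns are not assigned probability zero outright.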
A repository of protein profile HMMs can be found in the PFAM Database
(http://pfam.wustl.edu).
By building a profile from a family of proteins (or DNA sequences), a profile
HMM can be used to search a database for other members of the family. As we
have seen before in the section on HMM problems, profile HMMs can also be used
for the following:
Scoring a sequence
We calculate the probability of a sequence given a profile by simply
multiplying the emission and transition probabilities along the path.
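A sketch of this scoring, done in log space so the products become sums; the match/insert states and all probabilities here are hypothetical, chosen only to make the example concrete:

```python
import math

# Hypothetical toy model: "M" = match state, "I" = insert state.
start = {"M": 0.9, "I": 0.1}
trans = {"M": {"M": 0.6, "I": 0.4}, "I": {"M": 0.5, "I": 0.5}}
emit = {"M": {"A": 0.7, "G": 0.3}, "I": {"A": 0.5, "G": 0.5}}

def score_path(seq, path):
    """log P(seq, path): log start probability plus the sum of log transition
    and log emission probabilities along the given state path."""
    logp = math.log(start[path[0]]) + math.log(emit[path[0]][seq[0]])
    for i in range(1, len(seq)):
        logp += math.log(trans[path[i - 1]][path[i]])
        logp += math.log(emit[path[i]][seq[i]])
    return logp
```

For instance, `score_path("AG", "MM")` is just log(0.9 · 0.7 · 0.6 · 0.3): start in M, emit A, transition M→M, emit G.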
Classifying sequences in a database
Given an HMM for a protein family and some unknown sequences, we try to find a
path through the model where the new sequence fits, i.e. we 'align' the
sequence to the model. An alignment to the model is an assignment of states to
each residue in the sequence. There are many such alignments, and the Viterbi
algorithm is used to find the most probable one, together with the probability
of the sequence for that alignment.
Creating multiple sequence alignments
HMMs can be used to automatically create a multiple alignment from a group of
unaligned sequences. By taking a close look at the alignment, we can see the
history of evolution. One great advantage of HMMs is that they can be estimated
from sequences without having to align the sequences first. The sequences used
to estimate or train the model are called the training sequences, and any
reserved sequences used to evaluate the model are called the test sequences.
The model estimation is done with the Baum-Welch algorithm, which uses the
forward-backward algorithm internally. It is an iterative algorithm that
maximizes the likelihood of the training sequences.
1.3. Prediction of protein secondary structure using HMMs
Prediction of secondary structure is needed for the prediction of protein
function. As an alternative to direct X-ray analysis, an HMM can be used to
- Analyze the amino-acid sequences of proteins
- Learn secondary structures such as helix, sheet and turn
- Predict the secondary structures of sequences
The method is to train four HMMs, one each for helix, sheet, turn and other, on
training sequences, using the Baum-Welch method. The helix HMM, for example,
then produces helix-like sequences with high probability. These trained HMMs
can be used to predict the secondary structure of a test sequence: the forward
algorithm is used to compute the probability of each HMM outputting the test
sequence, and the sequence is assigned the secondary structure whose HMM gives
the highest probability.
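The classification step can be sketched as follows. The single-state "models" and the two-letter alphabet are stand-ins for real trained secondary-structure HMMs, and this unscaled forward recursion would underflow on long sequences (the scaled variant fixes that):

```python
import math

def forward_logprob(seq, start, trans, emit):
    """log P(seq | model): the forward algorithm, summing over all state paths.
    Unscaled, so only suitable for short sequences."""
    states = list(start)
    f = {s: start[s] * emit[s][seq[0]] for s in states}
    for x in seq[1:]:
        f = {s: sum(f[p] * trans[p][s] for p in states) * emit[s][x]
             for s in states}
    return math.log(sum(f.values()))

def classify(seq, models):
    """Assign seq the structure whose HMM gives it the highest probability."""
    return max(models, key=lambda name: forward_logprob(seq, *models[name]))

# Hypothetical single-state models over a two-letter alphabet, standing in for
# the four trained secondary-structure HMMs (helix, sheet, turn, other).
models = {
    "helix": ({"s": 1.0}, {"s": {"s": 1.0}}, {"s": {"h": 0.9, "e": 0.1}}),
    "sheet": ({"s": 1.0}, {"s": {"s": 1.0}}, {"s": {"h": 0.1, "e": 0.9}}),
}
```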
2. HMM implementation
Two publicly available HMM software packages are:
HMMER - http://hmmer.wustl.edu/
SAM system - http://www.cse.ucsc.edu/research/compbio/sam.html
3. Advantages of HMMs
HMMs can accommodate variable-length sequences.
Because most biological data has variable-length properties, machine-learning
techniques that require a fixed-length input, such as neural networks or
support vector machines, are less successful in biological sequence analysis.
HMMs allow position-dependent gap penalties.
HMMs treat insertions and deletions in a statistical manner that depends on
position.
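One way to see where position-dependent gap penalties come from: in a profile HMM, the cost of opening a deletion at a column is effectively the negative log of that column's match-to-delete transition probability, so columns that saw frequent deletions in training become cheap to skip. A tiny sketch with made-up numbers:

```python
import math

# Hypothetical per-column match-to-delete transition probabilities.
# Column 3 (index 2) tolerated many deletions in the training alignment.
match_to_delete = [0.01, 0.01, 0.30, 0.01]

# Position-specific gap-open cost, in negative-log-probability units.
gap_open_cost = [-math.log(p) for p in match_to_delete]
```

A classical alignment scheme charges the same gap penalty everywhere; here the penalty at the deletion-tolerant column is far smaller than at the conserved ones.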
4. Limitations of HMMs
Linear model
Because the model is linear, it cannot capture higher-order correlations among
amino acids.
Markov chain assumption of independent events
The probabilities of the states are assumed to be independent of each other,
which is not true in biology. For example, P(x) must be independent of P(y),
and vice versa.
Standard Machine Learning Problems
In the training problem, we need to watch out for local maxima, so the model
may not converge to a truly optimal parameter set for a given training set.
Secondly, since the model is only as good as its training set, training may
lead to over-fitting.
5. Open areas for research in HMMs in biology
Integration of structural information into profile HMMs.
Despite the almost obvious application of using structural information on a
member of a protein family, when such information exists, to improve the
parameterization of the HMM, this has been extremely hard to achieve in
practice.
Model architecture
The architectures of HMMs have largely been chosen to be the simplest
architectures that can fit the observed data. Is this the best architecture to use?
Can one use protein structure knowledge to make better architecture decisions, or,
in limited regions, to learn the architecture directly from the data? Will these
implied architectures have implications for our structural understanding?
Biological mechanism
In gene prediction, HMMs may be getting close to replicating the accuracy of
the biological machinery (the HMMs have the additional task of finding the gene
in its genomic DNA context, which is not handled by the biological machinery
that processes the RNA). What constraints does our statistical model place on
the biological mechanism? In particular, can we conceive of a biological
mechanism that could use the same information as the HMM?
6. References
1. L. R. Rabiner and B. H. Juang, An Introduction to Hidden Markov Models, IEEE
ASSP Magazine, January 1986, pp. 1-16.
2. K. Asai, S. Hayamizu and H. Handa, Prediction of protein secondary structures
by hidden Markov models, Computer Applications in the Biosciences (CABIOS),
Vol 9, No 2, 1993, pp. 141-146.
3. Krogh, A., M. Brown, I.S. Mian, K. Sjolander, and D. Haussler. Hidden Markov
Models in Computational Biology: Applications to Protein Modeling, J. Mol.
Biol., Vol. 235, pp.1501-1531, 1994.
4. S. Eddy. Profile hidden Markov models. Bioinformatics, 14:755--763, 1998.
5. L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications
in Speech Recognition , Proceedings of the IEEE, 77 , no. 2, 257--285, February
1989.
6. Baldi, P., Chauvin, Y., Hunkapiller, T. and McClure, M.A. (1993) Hidden Markov
Models in Molecular Biology: New Algorithms and Applications, In Advances in
Neural Information Processing Systems 5, Eds. S.J. Hanson, J.D. Cowan and C.
Lee Giles, (Morgan Kaufmann) pp 747-754
7. Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M.A. (1994) Hidden Markov
Models of Biological Primary Sequence Information, Proceedings of the
National Academy of Science, USA 91: 1059-1063.
8. David Kulp, David Haussler, Martin G. Reese, and Frank H. Eeckman, A
generalized hidden Markov model for the recognition of human genes in DNA,
Proceedings of the Fourth International Conference on Intelligent Systems for
Molecular Biology (Menlo Park, CA), AAAI Press, 1996.
9. R. Hughey and A. Krogh. Hidden Markov models for sequence analysis:
extension and analysis of the basic method. Computer Applications in the
Biosciences, 12:95-107, 1996.
http://www.cse.ucsc.edu/research/compbio/html_format_papers/hughkrogh96/c
abios.html
10. Henderson, J., Salzberg, S. and Fasman, K. Finding genes in human DNA with a
hidden Markov model. Journal of Computational Biology 1997, 4, 127-141.
11. An Introduction to Hidden Markov Models for Biological Sequences by A.
Krogh. In S. L. Salzberg et al., eds., Computational Methods in Molecular
Biology, 45-63. Elsevier, 1998.
12. E. Birney. Hidden Markov models in biological sequence analysis. Deep
computing for the life sciences, Volume 45, Numbers 3/4, 2001.
http://www.research.ibm.com/journal/rd/453/birney.html
13. HMMER - http://hmmer.wustl.edu/
14. SAM system - http://www.cse.ucsc.edu/research/compbio/sam.html