Training Problem
Another tricky problem is how to create an HMM in the first place, given a
particular set of related training sequences. The amino-acid emission
distribution in each state and all state-to-state transition probabilities
must be estimated from the training sequences. This is done using the
Baum-Welch algorithm, an expectation-maximization procedure built on the
forward-backward algorithm.
The algorithm proceeds by making an initial guess of the parameters (which may
well be entirely wrong) and then refining it: it assesses how well the current
model fits the training data and adjusts the parameters to reduce the errors.
In this sense, it behaves like a form of gradient descent, searching for a
minimum of an error measure (equivalently, a local maximum of the likelihood).
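To make this concrete, here is a minimal NumPy sketch of one Baum-Welch re-estimation step for a discrete-emission HMM. The names (A for transitions, B for emissions, pi for the initial distribution) and the per-step scaling scheme are our own choices for illustration, not anything prescribed here:

```python
import numpy as np

def forward_backward(obs, A, B, pi):
    """Scaled forward (alpha) and backward (beta) tables for one sequence.
    Per-step scaling avoids numerical underflow on long sequences."""
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N)); beta = np.zeros((T, N)); scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    return alpha, beta, scale  # log-likelihood = sum(log(scale))

def baum_welch_step(obs, A, B, pi):
    """One EM re-estimation step; returns updated (A, B, pi)."""
    obs = np.asarray(obs)
    T, N, M = len(obs), A.shape[0], B.shape[1]
    alpha, beta, scale = forward_backward(obs, A, B, pi)
    # gamma[t, i] = P(state i at time t | obs); normalize row-wise.
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    # xi accumulates expected transition counts i -> j.
    xi = np.zeros((N, N))
    for t in range(T - 1):
        x = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi += x / x.sum()
    newA = xi / gamma[:-1].sum(axis=0)[:, None]
    newB = np.zeros((N, M))
    for k in range(M):
        newB[:, k] = gamma[obs == k].sum(axis=0)
    newB /= gamma.sum(axis=0)[:, None]
    return newA, newB, gamma[0]
```

Each call to `baum_welch_step` is one EM iteration; iterating never decreases the likelihood of the training sequence, though it may stop at a local maximum.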
1. Applications of HMMs
In this section, we look in greater depth at specific problems in
computational biology and examine the role of HMMs.
1.1. Gene finding and prediction
We introduce here the gene-prediction HMMs that can be used to predict the
structure of the gene. Our objective is to find the coding and non-coding regions of
an unlabeled string of DNA nucleotides.
The motivation behind this is to
- assist in the annotation of genomic data produced by genome-sequencing methods
- gain insight into the mechanisms involved in transcription, splicing and other processes
As shown in the diagram above, a string of DNA nucleotides containing a gene will
have separate regions:
- Introns – non-coding regions within a gene
- Exons – coding regions
These regions are separated by functional sites:
- Start and stop codons
- Splice sites – acceptors and donors
During transcription and splicing, the introns are removed and only the exons
remain to encode the protein sequence, as depicted below.
Many problems in biological sequence analysis have a grammatical structure, and
HMMs are very useful for modeling grammars. The input to such an HMM is the
genomic DNA sequence, and the output, in the simplest case, is a parse of the
DNA sequence into exons and introns.
Shown below is a simple model for unspliced genes that recognizes the start
codon, stop codon (only one of the three possible stop codons is shown) and the
coding/non-coding regions. This model has been trained on a set of gene
data.
Given such a model, how can we predict genes in a sequence of anonymous
DNA? We simply use the Viterbi algorithm to find the most probable path through
the model.
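For example, gene prediction with the Viterbi algorithm can be sketched on a hypothetical two-state model, coding versus non-coding; every probability below is made up for illustration, not a trained value:

```python
import math

# Hypothetical two-state gene model: "N" (non-coding) and "C" (coding).
states = ["N", "C"]
start = {"N": 0.9, "C": 0.1}
trans = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit = {"N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},   # uniform background
        "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}}   # GC-rich coding

def viterbi(seq):
    """Most probable state path for seq, computed in log space."""
    V = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    back = []
    for x in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            ptr[s] = prev
            col[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][x])
        V.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))
```

A GC-rich stretch would mostly be labeled "C", an AT-rich one "N"; the self-transition probabilities of 0.9 discourage rapid switching between the two labels.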
1.2. Protein Profile HMMs
As we have seen earlier, protein structural similarities make it possible to
create a statistical model of a protein family, which is called a profile. The
idea is that, given a single target amino-acid sequence of unknown structure,
we want to infer the structure of the resulting protein.
The profile HMM is built by analyzing the distribution of amino acids in a
training set of related proteins. This HMM naturally models position-dependent
gap penalties.
The basic topology of a profile HMM is shown above, with match states,
insertion states and deletion states. Each position, or module, in the model
has three states. A state shown as a rectangular box is a match state
that models the distribution of letters in the corresponding column of an
alignment. A state shown by a diamond-shaped box models insertions of random
letters between two alignment positions, and a state shown by a circle models a
deletion, corresponding to a gap in an alignment. States of neighboring positions
are connected, as shown by lines. For each of these lines there is an
associated 'transition probability', which is the probability of going from one
state to the other.
The match state represents a consensus amino acid for this position in the protein
family. The delete state is a non-emitting state, and represents skipping this
consensus position in the multiple alignment. Finally, the insert state models the
insertion of any number of residues after this consensus position.
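As an illustration of how the match-state emission distributions might be estimated, here is a small sketch that counts residues per column of a toy gapped alignment; the pseudocount smoothing and the four-letter alphabet are our own simplifications (a real protein profile uses the 20 amino acids):

```python
from collections import Counter

ALPHABET = "ACGT"  # toy alphabet for the sketch

def match_emissions(alignment, pseudocount=1.0):
    """Estimate per-column match-state emission probabilities from a gapped
    multiple alignment (equal-length strings, '-' marking a deletion/gap)."""
    ncols = len(alignment[0])
    profile = []
    for j in range(ncols):
        # Count residues in column j, skipping gaps (those feed delete states).
        counts = Counter(seq[j] for seq in alignment if seq[j] != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return profile
```

The pseudocounts keep every residue's probability non-zero, so sequences containing a residue unseen in the training columns are not assigned probability zero outright.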
A repository of protein profile HMMs can be found in the PFAM Database
(http://pfam.wustl.edu).
By building a profile from a family of proteins (or DNA sequences), a profile
HMM can be used to search a database for other members of the family. As we
have seen before in the section on HMM problems, profile HMMs can also be used
for the following:
Scoring a sequence
We calculate the probability of a sequence given a profile by simply
multiplying the emission and transition probabilities along the path.
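A sketch of this scoring, done in log space so the products become sums; the match/insert states and all probabilities here are hypothetical, chosen only to make the example concrete:

```python
import math

# Hypothetical toy model: "M" = match state, "I" = insert state.
start = {"M": 0.9, "I": 0.1}
trans = {"M": {"M": 0.6, "I": 0.4}, "I": {"M": 0.5, "I": 0.5}}
emit = {"M": {"A": 0.7, "G": 0.3}, "I": {"A": 0.5, "G": 0.5}}

def score_path(seq, path):
    """log P(seq, path): log start probability plus the sum of log transition
    and log emission probabilities along the given state path."""
    logp = math.log(start[path[0]]) + math.log(emit[path[0]][seq[0]])
    for i in range(1, len(seq)):
        logp += math.log(trans[path[i - 1]][path[i]])
        logp += math.log(emit[path[i]][seq[i]])
    return logp
```

For instance, `score_path("AG", "MM")` is just log(0.9 · 0.7 · 0.6 · 0.3): start in M, emit A, transition M→M, emit G.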
Classifying sequences in a database
Given an HMM for a protein family and some unknown sequences, we try to find a
path through the model where the new sequence fits, i.e. we 'align' the
sequence to the model. An alignment to the model is an assignment of states to
each residue in the sequence. There are many such alignments, and the Viterbi
algorithm is used to find the most probable one, together with the probability
of the sequence for that alignment.
Creating multiple sequence alignments
HMMs can be used to automatically create a multiple alignment from a group of
unaligned sequences. By taking a close look at the alignment, we can see the
history of evolution. One great advantage of HMMs is that they can be estimated
from sequences without having to align the sequences first. The sequences used
to estimate or train the model are called the training sequences, and any
reserved sequences used to evaluate the model are called the test sequences.
The model estimation is done with the Baum-Welch algorithm, which uses the
forward-backward algorithm internally. It is an iterative algorithm that
maximizes the likelihood of the training sequences.
1.3. Prediction of protein secondary structure using HMMs
Prediction of secondary structure is needed for the prediction of protein
function. As an alternative to direct X-ray analysis, an HMM can be used to
- Analyze the amino-acid sequences of proteins
- Learn secondary structures such as helix, sheet and turn
- Predict the secondary structures of sequences
The method is to train four HMMs, one each for helix, sheet, turn and other, on
training sequences, using the Baum-Welch method. The helix HMM, for example,
then produces helix-like sequences with high probability. These trained HMMs
can be used to predict the secondary structure of a test sequence: the forward
algorithm is used to compute the probability of each HMM outputting the test
sequence, and the sequence is assigned the secondary structure whose HMM gives
the highest probability.
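The classification step can be sketched as follows. The single-state "models" and the two-letter alphabet are stand-ins for real trained secondary-structure HMMs, and this unscaled forward recursion would underflow on long sequences (the scaled variant fixes that):

```python
import math

def forward_logprob(seq, start, trans, emit):
    """log P(seq | model): the forward algorithm, summing over all state paths.
    Unscaled, so only suitable for short sequences."""
    states = list(start)
    f = {s: start[s] * emit[s][seq[0]] for s in states}
    for x in seq[1:]:
        f = {s: sum(f[p] * trans[p][s] for p in states) * emit[s][x]
             for s in states}
    return math.log(sum(f.values()))

def classify(seq, models):
    """Assign seq the structure whose HMM gives it the highest probability."""
    return max(models, key=lambda name: forward_logprob(seq, *models[name]))

# Hypothetical single-state models over a two-letter alphabet, standing in for
# the four trained secondary-structure HMMs (helix, sheet, turn, other).
models = {
    "helix": ({"s": 1.0}, {"s": {"s": 1.0}}, {"s": {"h": 0.9, "e": 0.1}}),
    "sheet": ({"s": 1.0}, {"s": {"s": 1.0}}, {"s": {"h": 0.1, "e": 0.9}}),
}
```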
2. HMM implementation
Two publicly available HMM software packages are:
HMMER - http://hmmer.wustl.edu/
SAM system - http://www.cse.ucsc.edu/research/compbio/sam.html
3. Advantages of HMMs
HMMs can accommodate variable-length sequences.
Because most biological data has variable-length properties, machine-learning
techniques that require a fixed-length input, such as neural networks or
support vector machines, are less successful in biological sequence analysis.
HMMs allow position-dependent gap penalties.
HMMs treat insertions and deletions in a statistical manner that depends on
position.
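One way to see where position-dependent gap penalties come from: in a profile HMM, the cost of opening a deletion at a column is effectively the negative log of that column's match-to-delete transition probability, so columns that saw frequent deletions in training become cheap to skip. A tiny sketch with made-up numbers:

```python
import math

# Hypothetical per-column match-to-delete transition probabilities.
# Column 3 (index 2) tolerated many deletions in the training alignment.
match_to_delete = [0.01, 0.01, 0.30, 0.01]

# Position-specific gap-open cost, in negative-log-probability units.
gap_open_cost = [-math.log(p) for p in match_to_delete]
```

A classical alignment scheme charges the same gap penalty everywhere; here the penalty at the deletion-tolerant column is far smaller than at the conserved ones.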
4. Limitations of HMMs
Linear model
Because the model is linear, it cannot capture higher-order correlations among
amino acids.
Markov chain assumption of independent events
The probabilities of the states are assumed to be independent of each other,
which is not true in biology. For example, P(x) must be independent of P(y),
and vice versa.
Standard Machine Learning Problems
In the training problem, we need to watch out for local maxima, so the model
may not converge to a truly optimal parameter set for a given training set.
Secondly, since the model is only as good as its training set, training may
lead to over-fitting.
5. Open areas for research in HMMs in biology
Integration of structural information into profile HMMs.
Despite the almost obvious application of using structural information on a
member of a protein family, when such information exists, to improve the
parameterization of the HMM, this has been extremely hard to achieve in
practice.
Model architecture
The architectures of HMMs have largely been chosen to be the simplest
architectures that can fit the observed data. Is this the best architecture to use?
Can one use protein structure knowledge to make better architecture decisions, or,
in limited regions, to learn the architecture directly from the data? Will these
implied architectures have implications for our structural understanding?
Biological mechanism
In gene prediction, HMMs may be getting close to replicating the accuracy of
the biological machinery (the HMMs have the additional task of finding the gene
in its genomic DNA context, which is not handled by the biological machinery
that processes the RNA). What constraints does our statistical model place on
the biological mechanism? In particular, can we conceive of a biological
mechanism that could use the same information as the HMM?
6. References
1. L. R. Rabiner and B. H. Juang, An Introduction to Hidden Markov Models, IEEE
ASSP Magazine, January 1986, pp. 1-16.
2. K. Asai, S. Hayamizu and H. Handa, Prediction of protein secondary structures
by hidden Markov models, Computer Applications in the Biosciences (CABIOS),
Vol 9, No 2, 1993, pp. 141-146.
3. Krogh, A., M. Brown, I.S. Mian, K. Sjolander, and D. Haussler. Hidden Markov
Models in Computational Biology: Applications to Protein Modeling, J. Mol.
Biol., Vol. 235, pp.1501-1531, 1994.
4. S. Eddy. Profile hidden Markov models. Bioinformatics, 14:755--763, 1998.
5. L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications
in Speech Recognition , Proceedings of the IEEE, 77 , no. 2, 257--285, February
1989.
6. Baldi, P., Chauvin, Y., Hunkapiller, T. and McClure, M.A. (1993) Hidden Markov
Models in Molecular Biology: New Algorithms and Applications, In Advances in
Neural Information Processing Systems 5, Eds. S.J. Hanson, J.D. Cowan and C.
Lee Giles, (Morgan Kaufmann) pp 747-754
7. Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M.A. (1994) Hidden Markov
Models of Biological Primary Sequence Information, Proceedings of the
National Academy of Science, USA 91: 1059-1063.
8. David Kulp, David Haussler, Martin G. Reese, and Frank H. Eeckman, A
generalized hidden Markov model for the recognition of human genes in DNA,
Proceedings of the Fourth International Conference on Intelligent Systems for
Molecular Biology (Menlo Park, CA), AAAI Press, 1996.
9. R. Hughey and A. Krogh. Hidden Markov models for sequence analysis:
extension and analysis of the basic method. Computer Applications in the
Biosciences, 12:95-107, 1996.
http://www.cse.ucsc.edu/research/compbio/html_format_papers/hughkrogh96/c
abios.html
10. Henderson, J., Salzberg, S. and Fasman, K. Finding genes in human DNA with a
hidden Markov model. Journal of Computational Biology 1997, 4, 127-141.
11. An Introduction to Hidden Markov Models for Biological Sequences by A.
Krogh. In S. L. Salzberg et al., eds., Computational Methods in Molecular
Biology, 45-63. Elsevier, 1998.
12. E. Birney. Hidden Markov models in biological sequence analysis. Deep
computing for the life sciences, Volume 45, Numbers 3/4, 2001.
http://www.research.ibm.com/journal/rd/453/birney.html
13. HMMER - http://hmmer.wustl.edu/
14. SAM system - http://www.cse.ucsc.edu/research/compbio/sam.html