doi:10.1016/S0022-2836(03)00182-7
J. Mol. Biol. (2003) 327, 735–744

Reliability Measures for Membrane Protein Topology Prediction Algorithms
Karin Melén1, Anders Krogh2 and Gunnar von Heijne1*
1 Department of Biochemistry and Biophysics, Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
2 Department of Molecular Biology, Bioinformatics Centre, University of Copenhagen, Universitetsparken 15, DK-2100 Copenhagen, Denmark
*Corresponding author

We have developed reliability scores for five widely used membrane protein topology prediction methods, and have applied them both on a test set of 92 bacterial plasma membrane proteins with experimentally determined topologies and on all predicted helix bundle membrane proteins in three fully sequenced genomes: Escherichia coli, Saccharomyces cerevisiae and Caenorhabditis elegans. We show that the reliability scores work well for the TMHMM and MEMSAT methods, and that they allow the probability that the predicted topology is correct to be estimated for any protein. We further show that the available test set is biased towards high-scoring proteins when compared to the genome-wide data sets, and provide estimates for the expected prediction accuracy of TMHMM across the three genomes. Finally, we show that the performance of TMHMM is considerably better when limited experimental information (such as the in/out location of a protein’s C terminus) is available, and estimate that at least ten percentage points in overall accuracy in whole-genome predictions can be gained in this way.

© 2003 Elsevier Science Ltd. All rights reserved

Keywords: membrane protein; topology prediction; bioinformatics

Abbreviations used: ORF, open reading frame.
E-mail address of the corresponding author: gunnar@dbb.su.se
0022-2836/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved

Introduction

It is estimated that some 20–25% of all open reading frames (ORFs) in fully sequenced genomes encode integral membrane proteins.1 Strikingly, however, considerably less than 1% of all 3D protein structures deposited in the Protein Data Bank2 are of membrane proteins. Theoretical structure prediction methods are thus of particular importance for membrane proteins. Most current methods in this field do not deal with predicting the 3D structure, but rather try to predict the most likely topology of the protein, i.e. the in/out location of the N and C termini relative to the membrane, and the number and positions of the membrane-spanning regions. Topology information can be generated experimentally by different approaches such as gene fusion, proteolytic digestion in situ, antibody binding, and chemical modification. A good topology model is a necessary prerequisite for experimental structure–function studies and can be used as a starting point for attempts to model the 3D structure.

From a structural point of view, there are two major groups of integral membrane proteins: the helix bundle proteins, in which one or several α-helices span the membrane, and the β-barrel proteins, in which eight or more anti-parallel transmembrane β-strands form a closed barrel. The β-barrel membrane proteins have so far been found only in the outer membranes of Gram-negative bacteria, mitochondria, and chloroplasts, whereas the α-helical membrane proteins are present in all types of membranes. Here, we consider only methods for predicting the topology of helix bundle membrane proteins.

The best current topology prediction methods are claimed to predict the correct topology for some 70–85% of all proteins, although, as will be shown below, this is an overestimate. Rather, we estimate an overall prediction accuracy of 55–60% correctly predicted topologies when entire proteomes are analyzed. Importantly, none of the most widely used methods (except PHD, see below) provides any estimate of the reliability of a given prediction, i.e. some measure of whether the topology of a particular protein is more or less likely to be correct than average.

In this study, we have tried to construct useful reliability scores for five widely used topology prediction methods: TMHMM,1 HMMTOP,3 MEMSAT,4 PHD5 and TopPred.6 The goal has been to use these scores to compare performance characteristics on a test set of proteins with
experimentally determined topologies with performance characteristics on three complete genomes, Escherichia coli, Saccharomyces cerevisiae and Caenorhabditis elegans, and to assess to what extent limited, easily obtainable experimental topology information can be used to improve the theoretical predictions.

Results

Construction of reliability scores

Judging by published bench-marking studies, TMHMM, HMMTOP and MEMSAT seem to have the best overall performance characteristics of the available topology prediction programs.7,8 Two less well performing but widely used methods, PHD and TopPred, have been included for comparison. Each method is described below in some detail, together with a discussion of the reliability scores that we have constructed from the raw output from each program.

TMHMM

TMHMM is based on a hidden Markov model with seven types of states (helix core, helix caps on either side of the membrane, short loop on cytoplasmic side/inside, short and long loop on non-cytoplasmic side/outside, and a globular domain state). Each type of state has a probability distribution over the 20 amino acids that has been estimated from membrane proteins with experimentally known topologies. TMHMM outputs the most probable topology of the protein given the model. The output is a labelled sequence of the three classes i (inside or cytoplasmic), h (helix) and o (outside or extra-cytoplasmic) that obeys the “biological grammar” that a helix must be followed by a loop and that inside and outside loops must alternate. Posterior probabilities for being in the three classes (p(i), p(h), and p(o)) are calculated for every residue in the sequence. We have constructed three different reliability scores (S1–S3) for TMHMM (see Methods).

S1: The mean posterior probability of the labelled sequence. A high mean posterior probability indicates that most of the residues have a high probability for their assigned classes and thus that the overall prediction might be considered reliable. The posterior probability values for each residue are calculated as described.1 A possible shortcoming of this score is that a small region with low probabilities embedded in a long sequence with generally high scores will not greatly affect S1, even though it indicates an uncertainty in the prediction.

S2: The minimum posterior probability in the sequence of labelled residues. A low S2 score indicates that there is at least one part of the protein where the prediction is doubtful. Since the probability values close to the borders between different classes often are low, even though the exact point of transition between one class and another generally makes no difference to the overall topology, we mask out a small number of residues (three, five, seven, nine) on each side of each border before locating the minimum probability value. For the score evaluation presented below, we masked out nine residues at each side of each border; the results are essentially the same in the whole interval three to nine masked residues (data not shown).

S3: The quotient p(best topology)/p(all possible topologies), calculated after a masking step as described below. The two probability values are included in the standard TMHMM output, where p(best topology) is calculated with the N-best algorithm and p(all possible topologies) is calculated with the forward algorithm, as described.1 A quotient close to 1 implies that the best path through the model (i.e. the predicted topology) is much more probable than all alternative paths (i.e. all other topologies). TMHMM can generate a list of several high-scoring paths where the top ones frequently have very similar topologies (corresponding to shifts of one or a few residues at the borders between different classes that do not change the overall topology). Since the exact borders between the classes are not generally known even for the experimentally determined topologies, it is reasonable to mask out some residues (we have used ten) on each side of a class border and consider all topologies compatible with the “best” topology after masking as the same prediction. We thus sum the probabilities for all paths that give the same topology prediction after masking as the best path before dividing by p(all possible paths) as obtained from the raw output.

HMMTOP

HMMTOP is a hidden Markov model with five states (inside loop, inside helix tail, helix, outside helix tail and outside loop). For a given amino acid sequence it finds the most probable path through the model. Instead of taking into account only the absolute amino acid composition in the separate parts of the protein, it searches for the combination of states that gives the highest difference in the amino acid distributions. The idea is that a switch in the topology should be reflected in a large amino acid distribution change (maximum divergence). In the raw output, numbers are given for the entropy of the best path (i.e. the most probable topology) and the entropy of the whole model. We have used the difference in entropy (i.e. entropy of best path − entropy of model) as a measure of the reliability. The smaller the difference, the better the best path represents the whole model, and the more likely to be correct the predicted topology should be.
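
The three TMHMM scores S1–S3 described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-residue posterior probabilities of the assigned classes and the list of high-scoring labellings with their probabilities are assumed to have been parsed from the TMHMM output already, all labellings are assumed to have the same length, and the function names and input layout are hypothetical rather than part of TMHMM.

```python
from typing import List, Sequence, Tuple


def unmasked_positions(labels: str, mask: int) -> List[int]:
    """Positions left after masking `mask` residues on each side of
    every class border (a border lies between residues b-1 and b)."""
    n = len(labels)
    keep = [True] * n
    for b in range(1, n):
        if labels[b] != labels[b - 1]:
            for j in range(max(0, b - mask), min(n, b + mask)):
                keep[j] = False
    return [i for i in range(n) if keep[i]]


def s1_score(posteriors: Sequence[float]) -> float:
    """S1: mean posterior probability of the assigned class."""
    return sum(posteriors) / len(posteriors)


def s2_score(labels: str, posteriors: Sequence[float], mask: int = 9) -> float:
    """S2: minimum posterior probability outside the masked border regions."""
    kept = unmasked_positions(labels, mask)
    # Assumption (not from the paper): if masking removes every residue,
    # fall back to the minimum over all residues.
    if not kept:
        kept = list(range(len(labels)))
    return min(posteriors[i] for i in kept)


def s3_score(paths: Sequence[Tuple[str, float]], p_all: float, mask: int = 10) -> float:
    """S3: summed probability of all paths that agree with the best path
    after masking, divided by p(all possible topologies)."""
    best_labels, _ = max(paths, key=lambda path: path[1])
    kept = unmasked_positions(best_labels, mask)
    compatible = sum(
        prob
        for labels, prob in paths
        if all(labels[i] == best_labels[i] for i in kept)
    )
    return compatible / p_all
```

With a toy labelling such as "iiiiihhhhhooooo" and a mask of two residues, a path whose helix border is shifted by one residue counts as the same topology in S3, whereas a path with inverted orientation does not.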

Figure 1. Relation between test set cumulative coverage and the fraction of correct topology predictions for five different prediction methods over a set of 92 prokaryotic membrane proteins with experimentally determined topologies. TMHMM S3 score, filled squares; MEMSAT, open squares; HMMTOP, open circles; PHDhtm (web version, multi-sequence mode), filled circles; PHDhtm (single-sequence mode), filled triangles; TopPred, open triangles (for TopPred, many sequences did not generate more than one topology. For those cases no reliability score could be calculated, which explains the total TopPred coverage of only 36%).
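
A curve of the kind described in this caption can be computed as sketched below. The sketch assumes each prediction has been reduced to a (reliability score, is-correct) pair; the function name and its parameters are hypothetical. The optional `n_total` argument covers cases where some proteins have no score at all, so that cumulative coverage stays below 100%, and `higher_is_better=False` would suit a score where smaller values indicate greater reliability.

```python
from typing import List, Optional, Tuple


def accuracy_vs_coverage(
    scored: List[Tuple[float, bool]],
    n_total: Optional[int] = None,
    higher_is_better: bool = True,
) -> List[Tuple[float, float]]:
    """Rank predictions by reliability score and return, for each rank,
    (cumulative coverage, fraction of correct predictions so far)."""
    total = len(scored) if n_total is None else n_total
    ranked = sorted(scored, key=lambda item: item[0], reverse=higher_is_better)
    points = []
    n_correct = 0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        n_correct += int(is_correct)
        points.append((rank / total, n_correct / rank))
    return points
```

Plotting the second element of each pair against the first gives prediction accuracy as a function of cumulative coverage of the test set.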

MEMSAT

MEMSAT is based on a model with five structural states (inside loop, inside helix end, helix middle, outside helix end, outside loop). Each state is associated with a statistical table (log likelihoods) of the frequency of the 20 amino acids. The tables have been constructed from membrane proteins of known topologies and treat single- and multispanning membrane proteins separately. A dynamic programming algorithm solves the problem of finding the optimal state assignments for the query sequence. The algorithm computes scores for all possible topologies starting with one helix, and then increases the number of helices one at a time until the scores become too low. The output produces a list of topologies representing all possible numbers of TM helices (in both orientations) and their scores. The topology with the highest score is the final prediction. To assess the reliability, we have calculated the difference in scores between the best and the second best prediction. If the difference is high, the top-scoring topology should be more likely to be correct.

PHD

PHD is a general tool for predicting secondary structure of proteins, and the PHDhtm routine is the part handling membrane proteins. It is designed to use information from homologous proteins. The first step in the method is a BLAST search9 against the SWISSPROT database.10 A multiple sequence alignment of the hits is constructed and a neural network then estimates the preference for each residue to be in a transmembrane helix or in a loop. The highest-scoring putative transmembrane segment is used in a second step to decide whether the protein is a helix bundle integral membrane protein. The third step is a dynamic programming algorithm that finds the optimal number and locations of transmembrane regions (the model). Finally, the overall orientation of the protein in the membrane is predicted by applying the “positive-inside” rule.11,12

PHDhtm is the only method in our study that automatically provides some sort of reliability measure. In the output, there is one reliability index for the model (i.e. for the number and locations of the transmembrane regions) that is based on a comparison between the two highest-scoring models, and a second reliability index for the orientation that is proportional to the charge difference between the outside and inside parts of the protein. Both indices range from 0 (low) to 9 (high). However, the two indices are not combined into a single reliability score for the overall topology. We have evaluated both the two existing indices and the mean value of the two indices as reliability scores.

Because the other four methods only use information in a single query sequence (and not information from homologous sequences) we decided to run PHDhtm in single-sequence mode for the main analysis. However, we have also used the multi-sequence mode for comparison.

TopPred

TopPred was the first topology prediction method that combined hydrophobicity analysis and the positive-inside rule. It first calculates a standard hydrophobicity profile for the query protein. Peaks above an upper cut-off (i.e. regions rich in hydrophobic residues) are considered to be confident transmembrane helix predictions whereas peaks between the upper and a lower cut-off are regarded as putative transmembrane helices. Consequently, several topologies can be constructed with or without the putative helix/helices. Out of these possible topologies, the one with the largest difference in the number of positively charged amino acids between the two sides of the membrane is given as the best prediction. We have calculated a reliability score as the difference between the charge-difference values for the two top-scoring topologies. If no putative helices are identified from the hydrophobicity plot, only one topology is predicted, and thus no reliability score can be calculated in such cases.
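
The MEMSAT and TopPred scores above share one pattern, the gap between the best and the second-best candidate, while the combined PHDhtm score is a plain mean of two indices. A minimal sketch follows, assuming the candidate scores (MEMSAT topology scores or TopPred charge-difference values) have already been extracted from the program output; the function names are hypothetical.

```python
from typing import Optional, Sequence


def top_two_gap(candidate_scores: Sequence[float]) -> Optional[float]:
    """Difference between the best and second-best candidate score.
    Returns None when fewer than two topologies were generated, in
    which case no reliability score can be calculated."""
    if len(candidate_scores) < 2:
        return None
    best, second = sorted(candidate_scores, reverse=True)[:2]
    return best - second


def phd_mean_index(model_index: int, orientation_index: int) -> float:
    """Mean of the two PHDhtm reliability indices (each from 0 to 9)."""
    return (model_index + orientation_index) / 2
```

Returning None rather than zero keeps proteins without a second topology out of the ranking, which is consistent with the reduced TopPred coverage noted for Figure 1.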

Reliability scores correlate with prediction accuracy

The five methods and their corresponding reliability scores described above were evaluated over a previously collected test set (see Methods) composed of 92 prokaryotic helix bundle membrane proteins with experimentally determined topologies. For each method and score, the 92 topology predictions were ranked from high to low scores. The results are summarized in Figure 1 in the form of a plot of prediction accuracy versus cumulative coverage of the test set.

As is clear from this Figure, TMHMM and MEMSAT have the best prediction characteristics according to this test (for TMHMM, only the S3 score is shown, as the S1 and S2 scores yield essentially the same results). For both methods, ~50% of the predictions have reliability scores corresponding to a prediction accuracy of ~90%, and ~70% of the proteins have scores corresponding to a prediction accuracy of ~80%. If the entire test set is considered (100% coverage), the prediction accuracy is 65–70%.

For HMMTOP, PHDhtm, and TopPred, our definitions of reliability scores do not seem very useful. We repeated the PHDhtm analysis by running the web version in multi-sequence mode, which improved the overall accuracy on the whole test set from 51% to 63%, but did not improve the discrimination between good and bad predictions based on the reliability score. The two individual reliability indices given by PHDhtm were no better than the mean reliability score shown in the Figure (data not shown).

Interestingly, the top-scoring proteins are, to a significant extent, different for the two best methods, TMHMM and MEMSAT. By simply combining the two scores as shown in Figure 2 (TMHMM score S3 > 0.7 and/or MEMSAT score > 4) we reach a prediction accuracy of ~95% for the ~60% top-scoring proteins in the test set. However, this apparent improvement needs to be confirmed on a larger data set. A more elaborate scheme for combining different topology prediction methods has been presented,8 and it is possible that one can find “optimized” combinations of reliability scores that perform better than the individual scores discussed here.

Figure 2. TMHMM S3 and MEMSAT scores for 92 test set proteins. Open circles, both predictions correct; filled circles, both predictions false; open squares, TMHMM prediction correct, MEMSAT prediction false; filled squares, TMHMM prediction false, MEMSAT prediction correct.

Proteins with known topology constitute a biased set compared to full-size proteomes

The development and evaluation of topology prediction methods is, to some extent, limited by the available experimental data, and a significant fraction of the proteins in our test set have been used in the original construction of the different prediction methods. This has made it difficult to obtain realistic estimates of the expected performance characteristics when the methods are applied to previously uncharacterized proteins, and different authors come to different conclusions on this point.7,8 From a couple of recent studies,13,14 it is clear, however, that the available test sets of proteins with experimentally determined topologies are biased, although the extent of the bias is unknown.

The reliability scores constructed here make it possible to address this question using whole-genome data. We have therefore calculated the TMHMM S3 score distributions for the predicted helix bundle membrane protein proteomes of one prokaryotic, E. coli15 and two eukaryotic, S. cerevisiae16 and C. elegans,17 organisms, and have compared these distributions to the distributions obtained for the test set.

As TMHMM has been shown to be able to discriminate between soluble and integral membrane proteins with very great accuracy,1 the three membrane protein proteomes were defined as all ORFs for which TMHMM predicts at least two transmembrane helices. Predicted single-spanning proteins were not included, since cleavable signal peptides are often predicted as transmembrane helices, thus erroneously identifying many secreted proteins as single-spanning membrane proteins. Even so, an unknown proportion of the membrane proteins identified in this way will contain cleavable signal peptides, in contrast to the test set proteins, which all lack cleavable signal
peptides. This may reduce the S3 scores slightly for some of the predicted proteins, but we consider it unlikely that this is enough to explain the differences between the proteome sets and the test set reported below.

Figure 3. TMHMM S3 score distributions. The fraction of all predicted membrane proteins with two or more TM helices in each genome or in the test set (76 proteins) and for each score interval is shown.

The results are presented in Figure 3, where the percentages of membrane proteins are plotted for different score intervals. To be able to compare the score distributions for the three proteomes with the test set, we removed all single-spanning sequences in the test set, ending up with 76 sequences and a TMHMM accuracy for this reduced set of 63%. The most striking result is that there is a much larger fraction of high-scoring proteins in the test set compared to the three proteomes, and thus that the overall prediction accuracy of ~66% reported in Figure 1 is a clear overestimate. To obtain a more realistic estimate, we first derived an empirical relation between the prediction accuracy and the S3 score by dividing the 92 test set predictions, ranked from high to low scores, into four equal-size groups and then plotting the average prediction accuracy in each group against the mean score for that group, Figure 4(A). The accuracy/score relation is reasonably well described by the straight line A = 80 × S3 + 20. Using this relation, we calculated the expected A-values for all proteins in the respective membrane protein proteomes, which are plotted against the cumulative coverage in Figure 4(B). As a control, we plotted the real mean accuracy and the calculated accuracy (A) for the test set; the two latter curves agree well and we thus conclude that the expected accuracy A is a reasonable representation of the real data. The mean prediction accuracies estimated in this way for the whole proteomes (56% for E. coli, 53% for S. cerevisiae and 59% for C. elegans) are significantly lower than the ~66% obtained for the test set, suggesting that the widely quoted prediction accuracies of 70–85% are serious overestimates.

There are several possible explanations for the test set bias. First, even though jack-knife procedures were used in the development of the prediction methods, there are many subtle ways in which the methods may have been overtrained. It is quite likely that the proteins for which experimental topologies have been reported have some characteristics such as unusually hydrophobic transmembrane segments that simultaneously simplify both experimental mapping and prediction. There are many families of membrane proteins for which no experimental topology is available and which have thus not been seen by the prediction methods.

Looking more carefully at the results for the individual genomes (Figure 3), it is interesting to note that S. cerevisiae has a particularly large fraction of low-scoring proteins, while C. elegans and E. coli have more similar score distributions. We did not expect C. elegans to have the greatest predicted accuracy, since it is a eukaryote and the relationship A = 80 × S3 + 20 was derived from prokaryotic proteins. However, we suspected that the family of 7TM-receptors, known to be exceptionally large in C. elegans,18 might have contributed to the results. We therefore identified all C. elegans proteins predicted to have seven transmembrane helices and an extracellular N terminus (985 out of a total of 4059) and analyzed the 7TM and non-7TM sets separately. The 7TM set was found to have a score distribution similar to that of the test set, whereas the score distribution for the remaining C. elegans membrane proteins almost coincided with that of E. coli (data not shown). Finally, the combination of the TMHMM S3 and MEMSAT scores discussed above (Figure 2) gave the following coverages for the three proteomes: 45% for E. coli, 46% for S. cerevisiae and 56% for C. elegans,
which should be compared to the 60% coverage of the test set.

Figure 4. Expected performance of TMHMM over all predicted membrane proteins with two or more TM helices in each genome. (A) Mean fraction of correctly predicted proteins versus the mean TMHMM S3 score for each quartile of the test set of 92 proteins. The least-squares fit is given by A = 80 × S3 + 20, where A is the expected accuracy (i.e. the probability that a prediction with a given S3 score is correct). (B) Estimated relation between cumulative coverage and the fraction of correct topology predictions for the test set of 92 proteins and for all predicted membrane proteins with two or more TM helices in each genome. Test set (original data), open circles; test set (calculated data), filled circles; C. elegans, filled triangles; E. coli, open squares; S. cerevisiae, filled squares.

Inclusion of limited experimental information: a strategy for large-scale topology mapping

Given the rather low estimates for the expected mean prediction accuracy over full-size proteomes discussed above, it is clear that topology predictions, in general, provide only a rough guide to the true topology of a protein. On the other hand, the reliability scores presented here can be used to reduce considerably the necessary experimental work required to reach a satisfactory level of prediction accuracy.

We have shown that limited experimental information such as a determination of the in/out location of the C terminus of a protein can be used in conjunction with topology prediction to rapidly provide a very reliable topology model, at least in certain cases.19 With the introduction of reliability scores, it is now possible to extend this strategy to entire proteomes. The basic TMHMM algorithm allows one to fix the class-assignment for any position in the sequence by setting the probability for a position to belong to a certain class to 1.0 a priori. If the C-terminal residue of each protein in the test set is assigned to its experimentally known class, the relation between accuracy and coverage becomes much more favourable and the overall mean accuracy increases from 66% to 77% (Figure 5(B)). Similarly, if the N terminus is fixed, the overall mean accuracy increases to 79%, and if both termini are fixed it reaches 88% (data not shown). Again, there is an approximately linear relationship between the accuracy and the S3 score; with a fixed C terminus, the relation is Ac = 70 × S3c + 30 (data not shown).

Figure 5. Influence of experimental information on TMHMM performance. (A) Relation between increase in S3 score for the test set of 92 proteins with the C-terminal residue fixed to its experimentally known location and the value of p(last aa); ΔS3c = −0.57 × p(last aa) + 0.57. (B) Relation between cumulative coverage and fraction of correct predictions. Observed accuracy for the test set with fixed C-terminal locations, filled circles; and with fixed N-terminal locations, open circles. Expected accuracy, Acp, for the three genomes assuming that the C-terminal location is known: C. elegans, filled triangles; E. coli, open squares; S. cerevisiae, filled squares.

Finally, we tried to estimate how much the prediction accuracy across the E. coli, S. cerevisiae and C. elegans membrane protein proteomes would improve if the location of each protein’s C terminus was known. To this end, we used the test set to measure the difference in S3 score, ΔS3c, between the score obtained with the C terminus fixed and the score obtained in the absence of any experimental information (ΔS3c = S3c − S3) and plotted ΔS3c versus the probability value for the location of the C-terminal residue obtained in the absence of experimental information, p(last aa), i.e. the probability value for the assigned class of the last amino acid in the sequence (Figure 5(A)). Although the data are rather scattered, there is a linear trend described by ΔS3c = −0.57 × p(last aa) + 0.57. In other words, the smaller the value of p(last aa), the larger is the mean increase in the S3 score when the C-terminal residue is assigned to its known class. This expression was used for estimating the increase in S3 score for all proteins in the three proteomes from which the estimated S3c scores can be calculated; S3cp = S3 + ΔS3c, assuming that the C-terminal location is known. The expected accuracy, Acp, was then calculated from the expression for Ac above. The results are shown in Figure 5(B). The estimated increase in overall accuracy for the proteomes is from 56% to 67% for E. coli, from 53% to 67% for S. cerevisiae, and from 59% to 71% for C. elegans. It should be emphasised that these numbers are only rough estimates, but they nevertheless suggest that
prediction performance would improve significantly if C-terminal mapping data were available.

Generally applicable methods for determining the location of the C-terminal end of a protein on the basis of either reporter fusions or engineered acceptor sites for N-linked glycosylation exist for E. coli, S. cerevisiae and mammalian proteins,20–22 and we have shown that such methods can be used on a relatively large scale (our unpublished work). On the basis of TMHMM-predictions,1 we have estimated that the membrane protein proteome of E. coli consists of 769 proteins with two or more transmembrane helices, and that of S. cerevisiae of 847 such proteins. The results presented above suggest that highly reliable topology models for a majority of these proteins should be obtainable from a simple experimental determination of the C-terminal location.

Discussion

Membrane protein topology prediction is an important area in contemporary bioinformatics, and provides a useful starting point for experimental studies of membrane proteins. While the overall performance of different topology prediction methods has been much discussed lately,7,8,13 essentially no work has been done trying to estimate the reliability of individual predictions. Here, we have constructed simple reliability scores for five widely used methods, TMHMM, HMMTOP, MEMSAT, PHDhtm and TopPred, and have applied them to a test set of 92 prokaryotic proteins with experimentally determined topologies and to the full-size membrane protein proteomes from E. coli, S. cerevisiae and C. elegans.

For TMHMM and MEMSAT, there is a good correlation between the reliability scores we have defined and the expected accuracy of a prediction. For both methods, ~50% of the predictions have reliability scores corresponding to a prediction accuracy of ~90%, and ~70% of the proteins have scores corresponding to a prediction accuracy of ~80% over the test set. For the remaining three methods, we were unable to derive useful reliability scores.

We have further used the TMHMM reliability score to assess the degree of bias in the test set as compared to the predicted membrane protein proteomes of E. coli, S. cerevisiae and C. elegans. In conformity with the results of two recent studies,13,14 we find that the test set is biased towards high-scoring proteins, and we estimate that only some 53–59% of all predicted topologies for these proteomes are correct, compared to 63% for the test set.

Finally, we have tried to estimate the expected improvement in prediction accuracy if the in/out location of the C terminus of every protein in a proteome was known from experimental data, since relatively rapid methods for such determinations are now available. For all three proteomes, we find that TMHMM will predict the correct topology for ~70% of all membrane proteins, given that the C-terminal location is known. Again, the likelihood that a given prediction is correct can be estimated from the reliability score.

In summary, we describe new reliability scores for TMHMM and MEMSAT, two of the currently best-performing topology prediction methods, that make it possible to estimate the likelihood that a given prediction is correct and that can be used in conjunction with limited experimental information to provide high-quality topology models for entire proteomes.

Methods

Prediction methods

TMHMM2.0,1 HMMTOP2.0,3,23 MEMSAT version 1.8,4 PHDhtm version 1998.015 and TopPred version 1.06 were used in single-sequence mode and with default parameter settings. PHDhtm was also run in its multiple sequence alignment mode on the website.†

Definition of reliability scores

TMHMM S1 = (p1(label) + p2(label) + ··· + pN(label))/N

where N is the sequence length and pi(label) is the posterior probability for the assigned class (label = i, o or h) for residue i.

TMHMM S2 = min[p1(label), p2(label), …, pN(label)]

TMHMM S3 = p(best topology)/p(all possible topologies)

To calculate p(best topology) we first identify all high-scoring predictions that are compatible with the highest-scoring one by masking ten residues on either side of each class border. All predictions that have the same class assignments as the highest-scoring one after masking are considered as being the same, and p(best topology) is the summed probabilities (as given by TMHMM) for these predictions. These individual probabilities as well as p(all possible topologies) are calculated as described.1

HMMTOP: score = entropy(best path) − entropy(model)

MEMSAT: score = score(best topology)
set when only proteins with two or more trans-
membrane helices are considered (or 66% for the 2 scoreðsecond best topologyÞ
whole test set). The reliability scores make it
possible to estimate the likelihood that a given pre- PHDhtm : score ¼ ððindexðmodelÞ þ indexðorientationÞÞ=2
diction is correct, allowing experimental topology
mapping efforts to be focused on proteins with
low reliability scores. † http://cubic.bioc.columbia.edu/predictprotein/
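As a minimal sketch of the per-residue TMHMM scores and the best-minus-second-best scores defined in this section, the Python below computes S1 as a mean, S2 as a minimum, and a MEMSAT- or TopPred-style margin. The function names and the input representation (one posterior probability for the assigned label per residue; one score per candidate topology) are illustrative assumptions, not the interface of any of the prediction programs.

```python
from typing import Sequence


def tmhmm_s1(label_posteriors: Sequence[float]) -> float:
    """S1: mean posterior probability of the assigned label over all N residues."""
    if not label_posteriors:
        raise ValueError("need at least one residue")
    return sum(label_posteriors) / len(label_posteriors)


def tmhmm_s2(label_posteriors: Sequence[float]) -> float:
    """S2: lowest posterior probability of the assigned label at any residue."""
    if not label_posteriors:
        raise ValueError("need at least one residue")
    return min(label_posteriors)


def margin_score(candidate_scores: Sequence[float]) -> float:
    """Best score minus second-best score (hypothetical helper for the
    MEMSAT- and TopPred-style difference scores)."""
    if len(candidate_scores) < 2:
        raise ValueError("need at least two candidate topologies")
    ranked = sorted(candidate_scores, reverse=True)
    return ranked[0] - ranked[1]


# Hypothetical posteriors for a four-residue toy sequence.
posteriors = [0.9, 0.8, 1.0, 0.5]
s1 = tmhmm_s1(posteriors)
s2 = tmhmm_s2(posteriors)
margin = margin_score([12.0, 7.5, 9.0])
```

S1 summarises the overall confidence along the sequence, whereas S2 is dominated by the single least certain residue, so the two can rank the same set of predictions quite differently.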
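The masking step behind TMHMM S3 can also be illustrated with a short Python sketch. It assumes that each candidate topology is available as a per-residue label string (i, o or h) with an associated probability, and that p(all possible topologies) is supplied by the caller; in the paper both quantities come from TMHMM itself. Masking only around the class borders of the highest-scoring prediction is one reading of the description, and the names `masked_positions` and `tmhmm_s3` are hypothetical.

```python
from typing import List, Set, Tuple


def masked_positions(labels: str, flank: int = 10) -> Set[int]:
    """Indices within `flank` residues on either side of each class border."""
    masked: Set[int] = set()
    for k in range(1, len(labels)):
        if labels[k] != labels[k - 1]:  # border between residues k-1 and k
            masked.update(range(max(0, k - flank), min(len(labels), k + flank)))
    return masked


def tmhmm_s3(candidates: List[Tuple[str, float]], p_all: float,
             flank: int = 10) -> float:
    """Summed probability of all candidates that agree with the best one
    outside the masked border regions, divided by p_all."""
    best_labels, _ = max(candidates, key=lambda c: c[1])
    masked = masked_positions(best_labels, flank)
    keep = [k for k in range(len(best_labels)) if k not in masked]
    p_best = sum(
        prob for labels, prob in candidates
        if len(labels) == len(best_labels)
        and all(labels[k] == best_labels[k] for k in keep)
    )
    return p_best / p_all


# Toy example: the second candidate only shifts a helix border by three
# residues, so it is merged with the best one; the third is inverted.
best = "i" * 30 + "h" * 30 + "o" * 30
shifted = "i" * 33 + "h" * 27 + "o" * 30
inverted = "o" * 30 + "h" * 30 + "i" * 30
s3 = tmhmm_s3([(best, 0.5), (shifted, 0.2), (inverted, 0.1)], p_all=1.0)
```

Merging predictions that differ only near class borders keeps small uncertainties in helix end positions from lowering the score of an otherwise well-supported topology.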

$$\text{TopPred: score} = \Delta\,\text{positive charges}(\text{best topology}) - \Delta\,\text{positive charges}(\text{second best topology})$$

Definition of correct predictions

A predicted topology is considered correct if it has the correct number of transmembrane segments and the correct location of the N terminus.

Data sets

The test set used is a collection of 92 prokaryotic helix bundle membrane proteins with experimentally known topologies.24 We selected proteins belonging to “trust levels” A, B and C, but excluded level C proteins with only partial topologies. We removed all sequences that were annotated to contain an N-terminal signal or a pro-peptide.

The highest level of sequence identity (as determined by ClustalW25 alignments) between any two proteins in the test set was 59%, and 71 sequences had less than 30% mutual identity as determined by the Hobohm 2 algorithm.26

For the proteome analysis, all predicted ORFs from three fully sequenced genomes, E. coli†, S. cerevisiae‡ and C. elegans§, were downloaded.

To extract the membrane proteins, TMHMM was run on all ORFs in the respective genomes and all proteins with two or more predicted transmembrane segments were retained. Proteins with a single predicted transmembrane segment were not included, since a considerable but unknown fraction of these segments are cleavable signal peptides rather than transmembrane helices.1 The numbers of proteins analyzed were 749 for E. coli, 847 for S. cerevisiae and 4059 for C. elegans.

† http://bmb.med.miami.edu/EcoGene/EcoWeb/
‡ ftp://genome-ftp.stanford.edu/pub/yeast/yeast_ORFs/
§ ftp://ftp.sanger.ac.uk/pub/wormbase/

Acknowledgements

This work was supported by a grant from the Swedish Knowledge Foundation via the Research School of Medical Bioinformatics and AstraZeneca to K.M., and by grants from the Foundation for Strategic Research and the Swedish Research Council to G.v.H.

References

1. Krogh, A., Larsson, B., von Heijne, G. & Sonnhammer, E. (2001). Predicting transmembrane protein topology with a hidden Markov model. Application to complete genomes. J. Mol. Biol. 305, 567–580.
2. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H. et al. (2000). The Protein Data Bank. Nucl. Acids Res. 28, 235–242.
3. Tusnady, G. E. & Simon, I. (1998). Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J. Mol. Biol. 283, 489–506.
4. Jones, D. T., Taylor, W. R. & Thornton, J. M. (1994). A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry, 33, 3038–3049.
5. Rost, B., Fariselli, P. & Casadio, R. (1996). Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci. 5, 1704–1718.
6. von Heijne, G. (1992). Membrane protein structure prediction—hydrophobicity analysis and the positive-inside rule. J. Mol. Biol. 225, 487–494.
7. Möller, S., Croning, M. & Apweiler, R. (2001). Evaluations of methods for the predictive evaluation of membrane spanning regions. Bioinformatics, 17, 646–653.
8. Ikeda, M., Arai, M., Lao, D. & Shimizu, T. (2001). Transmembrane topology prediction methods: a re-assessment and improvement by a consensus method using a data-set of experimentally characterized transmembrane topologies. In Silico Biol. 2, 1–15.
9. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
10. O’Donovan, C., Martin, M. J., Gattiker, A., Gasteiger, E., Bairoch, A. & Apweiler, R. (2002). High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief. Bioinformat. 3, 275–284.
11. von Heijne, G. (1986). The distribution of positively charged residues in bacterial inner membrane proteins correlates with the trans-membrane topology. EMBO J. 5, 3021–3027.
12. von Heijne, G. (1989). Control of topology and mode of assembly of a polytopic membrane protein by positively charged residues. Nature, 341, 456–458.
13. Käll, L. & Sonnhammer, E. (2002). Reliability of transmembrane predictions in whole-genome data. FEBS Letters, 532, 415–418.
14. Nilsson, J., Persson, B. & von Heijne, G. (2002). Prediction of partial membrane protein topologies using a consensus approach. Protein Sci. 11, 2974–2980.
15. Blattner, F. R., Plunkett, G., Bloch, C. A., Perna, N. T., Burland, V., Riley, M. et al. (1997). The complete genome sequence of Escherichia coli K-12. Science, 277, 1453–1462.
16. Goffeau, A., Aert, R., Agostini-Carbone, M., Ahmed, A., Aigle, M., Alberghina, L. et al. (1997). The yeast genome directory. Nature, suppl. 387, 1–105.
17. Stein, L., Sternberg, P., Durbin, R., Thierry-Mieg, J. & Spieth, J. (2001). WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucl. Acids Res. 29, 82–86.
18. Bargmann, C. (1998). Neurobiology of the Caenorhabditis elegans genome. Science, 282, 2028–2033.
19. Drew, D., Sjöstrand, D., Nilsson, J., Urbig, T., Chin, C. N., de Gier, J. W. & von Heijne, G. (2002). Rapid topology mapping of Escherichia coli inner-membrane proteins by prediction and PhoA/GFP fusion analysis. Proc. Natl Acad. Sci. USA, 99, 2690–2695.
20. Manoil, C. (1991). Analysis of membrane protein topology using alkaline phosphatase and β-galactosidase gene fusions. Methods Cell. Biol. 34, 61–75.

21. Deak, R. & Wolf, D. (2001). Membrane topology and function of Der3/Hrd1p as a ubiquitin-protein ligase (E3) involved in endoplasmic reticulum degradation. J. Biol. Chem. 276, 10663–10669.
22. Popov, M., Tam, L. Y., Li, J. & Reithmeier, R. A. F. (1997). Mapping the ends of transmembrane segments in a polytopic membrane protein—scanning N-glycosylation mutagenesis of extracytosolic loops in the anion exchanger, Band 3. J. Biol. Chem. 272, 18325–18332.
23. Tusnady, G. E. & Simon, I. (2001). The HMMTOP transmembrane topology prediction server. Bioinformatics, 17, 849–850.
24. Möller, S., Kriventseva, E. & Apweiler, R. (2000). A collection of well-characterised integral membrane proteins. Bioinformatics, 16, 1159–1160.
25. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22, 4673–4680.
26. Hobohm, U., Scharf, M., Schneider, R. & Sander, C. (1992). Selection of representative protein data sets. Protein Sci. 1, 409–417.

Edited by F. E. Cohen

(Received 19 November 2002; received in revised form 30 January 2003; accepted 31 January 2003)
