0022-2836/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved
736 Topology Prediction
experimentally determined topologies with performance characteristics on three complete genomes, Escherichia coli, Saccharomyces cerevisiae and Caenorhabditis elegans, and to assess to what extent limited, easily obtainable experimental topology information can be used to improve the theoretical predictions.

Results

Construction of reliability scores

Judging by published bench-marking studies, TMHMM, HMMTOP and MEMSAT seem to have the best overall performance characteristics of the available topology prediction programs.7,8 Two less well performing but widely used methods, PHD and TopPred, have been included for comparison. Each method is described below in some detail, together with a discussion of the reliability scores that we have constructed from the raw output from each program.

TMHMM

TMHMM is based on a hidden Markov model with seven types of states (helix core, helix caps on either side of the membrane, short loop on cytoplasmic side/inside, short and long loop on non-cytoplasmic side/outside, and a globular domain state). Each type of state has a probability distribution over the 20 amino acids that have been estimated from membrane proteins with experimentally known topologies. TMHMM outputs the most probable topology of the protein given the model. The output is a labelled sequence of the three classes i (inside or cytoplasmic), h (helix) and o (outside or extra-cytoplasmic) that obeys the “biological grammar” that a helix must be followed by a loop and that inside and outside loops must alternate. Posterior probabilities for being in the three classes (p(i), p(h), and p(o)) are calculated for every residue in the sequence. We have constructed three different reliability scores (S1–S3) for TMHMM (see Methods).

S1: The mean posterior probability of the labelled sequence. A high mean posterior probability indicates that most of the residues have a high probability for their assigned classes and thus that the overall prediction might be considered reliable. The posterior probability values for each residue are calculated as described.1 A possible shortcoming of this score is that a small region with low probabilities embedded in a long sequence with generally high scores will not greatly affect S1, even though it indicates an uncertainty in the prediction.

S2: The minimum posterior probability in the sequence of labelled residues. A low S2 score indicates that there is at least one part of the protein where the prediction is doubtful. Since the probability values close to the borders between different classes often are low, even though the exact point of transition between one class and another generally makes no difference to the overall topology, we mask out a small number of residues (three, five, seven, nine) on each side of each border before locating the minimum probability value. For the score evaluation presented below, we masked out nine residues at each side of each border; the results are essentially the same in the whole interval three to nine masked residues (data not shown).

S3: The quotient p(best topology)/p(all possible topologies), calculated after a masking step as described below. The two probability values are included in the standard TMHMM output, where p(best topology) is calculated with the N-best algorithm and p(all possible topologies) is calculated with the forward algorithm, as described.1 A quotient close to 1 implies that the best path through the model (i.e. the predicted topology) is much more probable than all alternative paths (i.e. all other topologies). TMHMM can generate a list of several high-scoring paths where the top ones frequently have very similar topologies (corresponding to shifts of one or a few residues at the borders between different classes that do not change the overall topology). Since the exact borders between the classes are not generally known even for the experimentally determined topologies, it is reasonable to mask out some residues (we have used ten) on each side of a class border and consider all topologies compatible with the “best” topology after masking as the same prediction. We thus sum the probabilities for all paths that give the same topology prediction after masking as the best path before dividing by p(all possible paths) as obtained from the raw output.

HMMTOP

HMMTOP is a hidden Markov model with five states (inside loop, inside helix tail, helix, outside helix tail and outside loop). For a given amino acid sequence it finds the most probable path through the model. Instead of taking into account only the absolute amino acid composition in the separate parts of the protein, it searches for the combination of states that gives the highest difference in the amino acid distributions. The idea is that a switch in the topology should be reflected in a large amino acid distribution change (maximum divergence). In the raw output, numbers are given for the entropy of the best path (i.e. the most probable topology) and the entropy of the whole model. We have used the difference in entropy (i.e. entropy of best path − entropy of model) as a measure of the reliability. The smaller the difference, the better the best path represents the whole model, and the more likely to be correct the predicted topology should be.
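The three TMHMM scores described above (S1 to S3) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The input layout (a label string over i/h/o, one dictionary of posterior probabilities per residue, and an N-best list of (labels, probability) pairs) and all function names are assumptions made for illustration. The text does not say whose borders define the mask in S3; the sketch assumes the borders of the best path.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def border_mask(labels: str, width: int) -> List[bool]:
    """Mark residues lying within `width` positions of a class border."""
    n = len(labels)
    masked = [False] * n
    for k in range(1, n):
        if labels[k] != labels[k - 1]:  # border lies between k-1 and k
            for j in range(max(0, k - width), min(n, k + width)):
                masked[j] = True
    return masked


def s1(labels: str, post: Sequence[Dict[str, float]]) -> float:
    """S1: mean posterior probability of the assigned class."""
    return sum(post[k][c] for k, c in enumerate(labels)) / len(labels)


def s2(labels: str, post: Sequence[Dict[str, float]],
       width: int = 9) -> Optional[float]:
    """S2: minimum posterior probability outside the masked border regions."""
    masked = border_mask(labels, width)
    kept = [post[k][c] for k, c in enumerate(labels) if not masked[k]]
    return min(kept) if kept else None  # None if every residue is masked


def s3(paths: Sequence[Tuple[str, float]], p_all: float,
       width: int = 10) -> float:
    """S3: summed probability of paths matching the best path after masking,
    divided by p(all possible paths). `paths` is ordered best first."""
    best_labels = paths[0][0]
    # Assumption: the mask is derived from the borders of the best path only.
    masked = border_mask(best_labels, width)

    def compatible(labels: str) -> bool:
        return all(a == b
                   for a, b, m in zip(labels, best_labels, masked) if not m)

    return sum(p for labels, p in paths if compatible(labels)) / p_all
```

The default widths (nine for S2, ten for S3) follow the values quoted in the text; a path that differs from the best one only inside a masked region is counted as the same prediction.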
PHD

PHD is a general tool for predicting secondary structure of proteins, and the PHDhtm routine is the part handling membrane proteins. It is designed to use information from homologous proteins. The first step in the method is a BLAST search9 against the SWISSPROT database.10 A multiple sequence alignment of the hits is constructed and a neural network then estimates the preference for each residue to be in a transmembrane helix or in a loop. The highest-scoring putative transmembrane segment is used in a second step to decide whether the protein is a helix bundle integral membrane protein. The third step is a dynamic program algorithm that finds the optimal number and locations of trans-

TopPred

TopPred was the first topology prediction method that combined hydrophobicity analysis and the positive-inside rule. It first calculates a standard hydrophobicity profile for the query protein. Peaks above an upper cut-off (i.e. regions rich in hydrophobic residues) are considered to be confident transmembrane helix predictions whereas peaks between the upper and a lower cut-off are regarded as putative transmembrane helices. Consequently, several topologies can be constructed with or without the putative helix/helices. Out of these possible topologies, the one with the largest difference in the number of positively charged amino acids between the two
peptides. This may reduce the S3 scores slightly for some of the predicted proteins, but we consider it unlikely that this is enough to explain the differences between the proteome sets and the test set reported below.

The results are presented in Figure 3, where the percentages of membrane proteins are plotted for different score intervals. To be able to compare the score distributions for the three proteomes with the test set, we removed all single-spanning sequences in the test set, ending up with 76 sequences and a TMHMM accuracy for this reduced set of 63%. The most striking result is that there is a much larger fraction of high-scoring proteins in the test set compared to the three proteomes, and thus that the overall prediction accuracy of ~66% reported in Figure 1 is a clear overestimate. To obtain a more realistic estimate, we first derived an empirical relation between the prediction accuracy and the S3 score by dividing the 92 test set predictions, ranked from high to low scores, into four equal-size groups and then plotting the average prediction accuracy in each group against the mean score for that group, Figure 4(A). The accuracy/score relation is reasonably well described by the straight line A = 80 × S3 + 20. Using this relation, we calculated the expected A-values for all proteins in the respective membrane protein proteomes, which is plotted against the cumulative coverage in Figure 4(B). As a control, we plotted the real mean accuracy and the calculated accuracy (A) for the test set; the two latter curves agree well and we thus conclude that the expected accuracy A is a reasonable representation of the real data. The mean prediction accuracies estimated in this way for the whole proteomes (56% for E. coli, 53% for S. cerevisiae and 59% for C. elegans) are significantly lower than the ~66% obtained for the test set, suggesting that the widely quoted prediction accuracies of 70–85% are serious overestimates.

There are several possible explanations for the test set bias. First, even though jack-knife procedures were used in the development of the prediction methods, there are many subtle ways in which the methods may have been overtrained. It is quite likely that the proteins for which experimental topologies have been reported have some characteristics such as unusually hydrophobic transmembrane segments that simultaneously simplify both experimental mapping and prediction. There are many families of membrane proteins for which no experimental topology is available and which have thus not been seen by the prediction methods.

Looking more carefully at the results for the individual genomes (Figure 3), it is interesting to note that S. cerevisiae has a particularly large fraction of low-scoring proteins, while C. elegans and E. coli have more similar score distributions. We did not expect C. elegans to have the greatest predicted accuracy, since it is a eukaryote and the relationship A = 80 × S3 + 20 was derived from prokaryotic proteins. However, we suspected that the family of 7TM-receptors, known to be exceptionally large in C. elegans,18 might have contributed to the results. We therefore identified all C. elegans proteins predicted to have seven transmembrane helices and an extracellular N terminus (985 out of totally 4059) and analyzed the 7TM and non-7TM sets separately. The 7TM set was found to have a score distribution similar to that of the test set, whereas the score distribution for the remaining C. elegans membrane proteins almost coincided with that of E. coli (data not shown). Finally, the combination of the TMHMM S3 and MEMSAT scores discussed above (Figure 2), gave the following coverages for the three proteomes: 45% for E. coli, 46% for S. cerevisiae and 56% for C. elegans,
which should be compared to the 60% coverage of the test set.

Inclusion of limited experimental information: a strategy for large-scale topology mapping

Given the rather low estimates for the expected mean prediction accuracy over full-size proteomes discussed above, it is clear that topology predictions, in general, provide only a rough guide to the true topology of a protein. On the other hand, the reliability scores presented here can be used to reduce considerably the necessary experimental work required to reach a satisfactory level of prediction accuracy.

We have shown that limited experimental information such as a determination of the in/out location of the C terminus of a protein can be used in conjunction with topology prediction to rapidly provide a very reliable topology model, at least in certain cases.19 With the introduction of reliability scores, it is now possible to extend this strategy to entire proteomes. The basic TMHMM algorithm allows one to fix the class-assignment for any position in the sequence by setting the probability for a position to belong to a certain class to 1.0 a priori. If the C-terminal residue of each protein in the test set is assigned to its experimentally known class, the relation between accuracy and coverage becomes much more favourable and the overall mean accuracy increases from 66% to 77% (Figure 5(B)). Similarly, if the N terminus is fixed, the overall mean accuracy increases to 79%, and if both termini are fixed it reaches 88% (data not shown). Again, there is an approximately linear relationship between the accuracy and the S3 score; with a fixed C terminus, the relation is Ac = 70 × S3c + 30 (data not shown).

Finally, we tried to estimate how much the prediction accuracy across the E. coli, S. cerevisiae and
C. elegans membrane protein proteomes would improve if the location of each protein’s C terminus was known. To this end, we used the test set to measure the difference in S3 score, ΔS3c, between the score obtained with the C terminus fixed and the score obtained in the absence of any experimental information (ΔS3c = S3c − S3) and plotted ΔS3c versus the probability value for the location of the C-terminal residue obtained in the absence of experimental information, p(last aa), i.e. the probability value for the assigned class of the last amino acid in the sequence (Figure 5(A)). Although the data are rather scattered, there is a linear trend described by ΔS3c = −0.57 × p(last aa) + 0.57. In other words, the smaller the value of p(last aa), the larger is the mean increase in the S3 score when the C-terminal residue is assigned to its known class. This expression was used for estimating the increase in S3 score for all proteins in the three proteomes from which the estimated S3c scores can be calculated; S3cp = S3 + ΔS3c, assuming that the C-terminal location is known. The expected accuracy, Acp, was then calculated from the expression for Ac above. The results are shown in Figure 5(B). The estimated increase in overall accuracy for the proteomes is from 56% to 67% for E. coli, from 53% to 67% for S. cerevisiae, and from 59% to 71% for C. elegans. It should be emphasised that these numbers are only rough estimates, but they nevertheless suggest that
prediction performance would improve significantly if C-terminal mapping data were available.

Generally applicable methods for determining the location of the C-terminal end of a protein on the basis of either reporter fusions or engineered acceptor sites for N-linked glycosylation exist for E. coli, S. cerevisiae and mammalian proteins,20–22 and we have shown that such methods can be used on a relatively large scale (our unpublished work). On the basis of TMHMM-predictions,1 we have estimated that the membrane protein proteome of E. coli consists of 769 proteins with two or more transmembrane helices, and that of S. cerevisiae of 847 such proteins. The results presented above suggest that highly reliable topology models for a majority of these proteins should be obtainable from a simple experimental determination of the C-terminal location.

Discussion

Membrane protein topology prediction is an important area in contemporary bioinformatics, and provides a useful starting point for experimental studies of membrane proteins. While the overall performance of different topology prediction methods has been much discussed lately,7,8,13 essentially no work has been done trying to estimate the reliability of individual predictions. Here, we have constructed simple reliability scores for five widely used methods, TMHMM, HMMTOP, MEMSAT, PHDhtm and TopPred, and have applied them to a test set of 92 prokaryotic proteins with experimentally determined topologies and to the full-size membrane protein proteomes from E. coli, S. cerevisiae and C. elegans.

For TMHMM and MEMSAT, there is a good correlation between the reliability scores we have defined and the expected accuracy of a prediction. For both methods, ~50% of the predictions have reliability scores corresponding to a prediction accuracy of ~90%, and ~70% of the proteins have scores corresponding to a prediction accuracy of ~80% over the test set. For the remaining three methods, we were unable to derive useful reliability scores.

We have further used the TMHMM reliability score to assess the degree of bias in the test set as compared to the predicted membrane protein proteomes of E. coli, S. cerevisiae and C. elegans. In conformity with the results of two recent studies,13,14 we find that the test set is biased towards high-scoring proteins, and we estimate that only some 53–59% of all predicted topologies for these proteomes are correct, compared to 63% for the test set when only proteins with two or more transmembrane helices are considered (or 66% for the whole test set). The reliability scores make it possible to estimate the likelihood that a given prediction is correct, allowing experimental topology mapping efforts to be focused on proteins with low reliability scores.

Finally, we have tried to estimate the expected improvement in prediction accuracy if the in/out location of the C terminus of every protein in a proteome was known from experimental data, since relatively rapid methods for such determinations are now available. For all three proteomes, we find that TMHMM will predict the correct topology for ~70% of all membrane proteins, given that the C-terminal location is known. Again, the likelihood that a given prediction is correct can be estimated from the reliability score.

In summary, we describe new reliability scores for TMHMM and MEMSAT, two of the currently best-performing topology prediction methods, that make it possible to estimate the likelihood that a given prediction is correct and that can be used in conjunction with limited experimental information to provide high-quality topology models for entire proteomes.

Methods

Prediction methods

TMHMM2.0,1 HMMTOP2.0,3,23 MEMSAT version 1.8,4 PHDhtm version 1998.015 and TopPred version 1.06 were used in single-sequence mode and with default parameter settings. PHDhtm was also run in its multiple sequence alignment mode on the website.†

Definition of reliability scores

TMHMM S1 = (p1(label) + p2(label) + ... + pN(label))/N

where N is the sequence length and pi(label) is the posterior probability for the assigned class (label = i, o or h) for residue i.

TMHMM S2 = min[p1(label), p2(label), …, pN(label)]

TMHMM S3 = p(best topology)/p(all possible topologies)

To calculate p(best topology) we first identify all high-scoring predictions that are compatible with the highest-scoring one by masking ten residues on either side of each class border. All predictions that have the same class assignments as the highest-scoring one after masking are considered as being the same, and p(best topology) is the summed probabilities (as given by TMHMM) for these predictions. These individual probabilities as well as p(all possible topologies) are calculated as described.1

HMMTOP: score = entropy(best path) − entropy(model)

MEMSAT: score = score(best topology) − score(second best topology)

PHDhtm: score = (index(model) + index(orientation))/2

† http://cubic.bioc.columbia.edu/predictprotein/
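As a worked illustration of how the TMHMM S3 score feeds the accuracy estimates reported in Results, the linear relations quoted there can be chained as in the Python sketch below. The coefficients are the ones given in the text. The function names are hypothetical, and capping the estimated score at 1 is an assumption added here (S3 is a quotient of probabilities), not a step described in the paper.

```python
from typing import Iterable, Tuple


def expected_accuracy(s3: float) -> float:
    """Expected accuracy in percent without experimental data: A = 80 x S3 + 20."""
    return 80.0 * s3 + 20.0


def delta_s3c(p_last: float) -> float:
    """Mean gain in S3 when the C terminus is fixed:
    dS3c = -0.57 x p(last aa) + 0.57."""
    return -0.57 * p_last + 0.57


def expected_accuracy_c_fixed(s3: float, p_last: float) -> float:
    """Expected accuracy in percent with a known C-terminal location:
    S3cp = S3 + dS3c, then Acp = 70 x S3cp + 30."""
    # Assumption: cap the estimated score at 1.0 so the result stays <= 100%.
    s3cp = min(1.0, s3 + delta_s3c(p_last))
    return 70.0 * s3cp + 30.0


def mean_accuracies(proteins: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Mean expected accuracy over (S3, p(last aa)) pairs, without and with
    the C terminus fixed."""
    pairs = list(proteins)
    plain = sum(expected_accuracy(s) for s, _ in pairs) / len(pairs)
    fixed = sum(expected_accuracy_c_fixed(s, p) for s, p in pairs) / len(pairs)
    return plain, fixed
```

For a protein with S3 = 0.5 and p(last aa) = 0.5, the sketch gives an expected accuracy of 60% without experimental data and about 85% once the C-terminal location is taken as known.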
21. Deak, R. & Wolf, D. (2001). Membrane topology and function of Der3/Hrd1p as a ubiquitin-protein ligase (E3) involved in endoplasmic reticulum degradation. J. Biol. Chem. 276, 10663–10669.
22. Popov, M., Tam, L. Y., Li, J. & Reithmeier, R. A. F. (1997). Mapping the ends of transmembrane segments in a polytopic membrane protein: scanning N-glycosylation mutagenesis of extracytosolic loops in the anion exchanger, Band 3. J. Biol. Chem. 272, 18325–18332.
23. Tusnady, G. E. & Simon, I. (2001). The HMMTOP transmembrane topology prediction server. Bioinformatics, 17, 849–850.
24. Möller, S., Kriventseva, E. & Apweiler, R. (2000). A collection of well-characterised integral membrane proteins. Bioinformatics, 16, 1159–1160.
25. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22, 4673–4680.
26. Hobohm, U., Scharf, M., Schneider, R. & Sander, C. (1992). Selection of representative protein data sets. Protein Sci. 1, 409–417.
Edited by F. E. Cohen
(Received 19 November 2002; received in revised form 30 January 2003; accepted 31 January 2003)