Gardiner 2001
Gardiner 2001
ABSTRACT A genetic algorithm (GA) for pro- validity of the complementary shape hypothesis. Betts and
tein–protein docking is described, in which the Sternberg8investigated the conformational changes that
proteins are represented by dot surfaces calculated occurred on complex formation for a total of 30 complexes.
using the Connolly program. The GA is used to move These investigators concluded that, for the enzyme–
the surface of one protein relative to the other to inhibitor complexes that comprised most of their data,
locate the area of greatest surface complementarity recognition could be “treated as lock and key to a first
between the two. Surface dots are deemed comple- approximation.” In a study of specific interprotein interac-
mentary if their normals are opposed, their Con- tions (e.g., nonpolar buried surface area, polar buried
nolly shape type is complementary, and their hydro- surface area, number of hydrogen bonds), Norel et al.9
gen bonding or hydrophobic potential is fulfilled. found no correlation between any of these and the likeli-
Overlap of the protein interiors is penalized. The GA hood that a putative docking was correct. Instead, their
is tested on 34 large protein–protein complexes finding confirmed the “long held notion of the importance
where one or both proteins has been crystallized of molecular shape complementarity in the binding and
separately. Parameters are established for which 30 hence in docking.” Both proteins are usually treated as
of the complexes have at least one near-native solu- rigid bodies, since to allow flexibility may be computation-
tion ranked in the top 100. We have also successfully ally intractable. Many shape-based docking algorithms
reassembled a 1,400-residue heptamer based on the have been proposed.10 –15 Some include other information
top-ranking GA solution obtained when docking (e.g., hydrophobicity, electrostatics) either in conjunction
two bound subunits. Proteins 2001;44:44 –56. with shape matching16 or as a subsequent filter.17 Several
© 2001 Wiley-Liss, Inc.
geometric docking programs use the Connolly molecular
surface representation.18 Three are of particular interest,
Key words: shape complementarity; macromolecu- as they have some similarity to the method we describe
lar docking; protein–protein interac- below.
tion; molecular recognition; protein– Connolly19 described a method for selecting “critical”
protein docking; genetic algorithm points of the Connolly surface. The idea of the critical
points is to preserve the important features (local maxima
INTRODUCTION
and minima) of the protein surface, while reducing the
Protein–protein recognition represents a fundamental computational intractability associated with processing
aspect of biological function. However, although protein very large numbers of surface points. These critical points
structures are now routinely determined by methods such (termed “knobs” and “holes”) represent points that would
as X-ray crystallography, it is much more difficult to be expected to match (knobs on one protein with holes on
ascertain the structure of protein complexes, and only another and vice versa) in a successful docking. Connolly
about 100 of these are currently deposited in the Protein required four pairs of critical points to match in order to
Data Bank (PDB).1 Protein docking studies are therefore define a rigid body translation and this proved too restric-
important as an aid to our understanding of the ways in tive in one of the two complexes he attempted to dock.
which proteins associate.2 Methods for protein ligand However, Norel et al.20 showed that pairs of two critical
docking, such as DOCK,3 FLOG,4 FLEXX,5 and GOLD,6 points plus the normals to the surface at these points are
are widely used in drug-discovery programs. However, sufficient to dock the example at which Connolly’s algo-
because of the complexity of the problem, protein–protein rithm failed. In further work, the same group docked 26
docking is still largely at the theoretical stage. Many
solutions have been proposed but, although very promis-
ing results have been obtained in some cases,7 there is still Grant sponsor: Engineering and Physical Sciences Research Coun-
cil; Grant sponsor: Biotechnology and Biological Sciences Research
considerable scope for the development of methodology Council.
and exploration of alternative or ancillary approaches. *Correspondence to: Eleanor Gardiner, Department of Information
Protein docking methods are generally based on the idea Studies, Sheffield University, Western Bank, Sheffield S10 2TN,
of complementarity between the interacting molecules. United Kingdom. E-mail: e.gardiner@sheffield.ac.uk
This complementarity may be geometric or electrostatic, or Received 10 November 2000; Accepted 23 February 2001
both. Recently there have been two investigations into the Published online 00 Month 2001
“large” protein–protein complexes from the PDB using the protein docking,33 nuclear magnetic resonance (NMR),
whole of the surface of both proteins and without using any and X-ray crystallographic data analysis.34 At Sheffield,
chemical information such as hydrogen bonding poten- we have used GAs successfully for many different applica-
tial.21 The dockings usually took only a few minutes of tions, including database searching,35 protein folding,36
CPU time and generated a manageable number of solu- protein ligand docking and database screening,37 identify-
tions with the correct solution ranked high (in 22 cases, ing features common to sets of ligands,38 selection of
ranked 1). In further work, they applied the same algo- compounds for combinatorial libraries,39 and protein sur-
rithm to 19 complexes using data taken from the native face comparison,40 inter alia. In this article, we present a
protein structures.9 Recently Hou et al.22 used a genetic genetic algorithm for protein docking based on the surface
algorithm in combination with a tabu search to match dots comparison method of Poirrette et al.,40 but which we have
of the Connolly surface in the initial global search stage of modified in order to capture the regions of greatest comple-
their docking. They also required the surface normals to mentarity between the two proteins that are being docked.
match; the fitness score was then calculated according to We demonstrate that, for a large number and variety of
the amount of buried surface area, weighted by a penalty complexes, using mainly native data, our GA is able to
for steric clash, proportional to the number of overlapping produce highly ranked putative complexes that resemble
atoms. This is similar in form to our fitness function the crystallographically determined complex.
(detailed under Materials and Methods), except that we
take account of matching dots rather than buried surface MATERIALS AND METHODS
area and calculate overlap in a different manner. They The basic method is based on that of Poirrette et al.,40
followed this by performing local energy calculations for who used a GA for surface comparison. A GA was used to
the top ranked solutions. Most of the computation in the generate rotations of the smaller (query) protein relative
global search is done by the tabu search element of their to the larger (target) protein surface, which was held
hybrid algorithm. The genetic element is used to diversify static. Both proteins were treated as rigid bodies. We
the search. The local search is then also performed by a provide brief details, paying particular attention to the
GA; it attempts to minimize the interaction energy (van modifications made for protein docking.
der Waals energy, electrostatic energy, and hydrogen- We first generated the proteins’ Connolly dot surfaces
bond energy) between the participating molecules. They using the program MS,18,41 with a probe radius of 1.5 Å
report good results on a test set of seven complexes (five and a dot density of 1 dot/Å2. In addition to three-
bound and two native). dimensional (3D) coordinates for each dot, this program
Most early attempts at protein–protein docking involved calculates the direction of the normal to the surface and
only reassembling complexes using coordinates taken from also the type of surface (1 ⫽ convex, 2 ⫽ saddle, 3 ⫽
the complex. More recently there have been several stud- concave). We modified the program slightly so that each
ies using native proteins.9,16,23–25 There have also been dot was also labeled with the hydrogen bonding potential
two protein–protein docking challenges.7,26. of the nearest atom. The classification we adopted was as
Most docking studies concentrate on enzyme–inhibitor follows: -H-donor, H-acceptor, H-donor/acceptor, and non-
complexes (most commonly serine protease–inhibitor com- H-bonding. All oxygen atoms were labeled as H-acceptor
plexes) and antibody–antigen complexes. A recent compari- with the exception of the OH of tyrosine, the OG of serine
son of the protein–protein interactions in these two differ- and the OG1 of threonine, all of which were labeled as
ent types of complex concluded that, in general, a protease– H-donor/acceptors. For simplicity, all nitrogen atoms were
inhibitor interface was more static and hence more easily labeled as H-donors, including the histidine nitrogens
predicted than an antibody–antigen interface.27 This re- ND1 and NE2 whose individual protonation states are
sult is generally borne out by docking studies which have difficult to assign. All other atoms were classified as
used unbound proteins.23 non-H-bonding.40
In our previous work on docking, we used techniques Both proteins were translated so that their centers of
from graph theory to predict maximum sets of possible gravity were positioned at the origin of space. This origin
hydrogen bonds and used these to superpose the proteins was also the center of a 3D grid placed around the target
to form a complex.28 While this method was able to protein; it was sufficiently large that any query dot could
reassemble complexes, it proved less successful when contact any target dot, and the entire query molecule still
docking native proteins, and here we have considered the remain within the grid. Each grid box was a 2-Å cube. Dots
use of a genetic algorithm (GA) for this purpose. contained within the same grid box were given the same grid
GAs have been shown to be of use in many areas of coordinate, which was that of the corner of the grid box
bioinformatics. A GA is a search algorithm based on nearest to the origin. This “coarsening” of the surface
accepted theories of biological evolution and natural selec- representation allowed for some tolerance in the matching
tion. It operates on a population of solutions, applying the of the protein surfaces, which was necessary to accommo-
principle of “survival of the fittest.” After a number of date movements of the proteins from the native to the
generations, a good, although not necessarily optimal, bound conformation, and also it enabled the implementa-
solution is obtained. Applications of GAs to structural tion of a very fast routine for dot matching, thus increasing
biology include investigation of protein folding,29 ligand both the effectiveness and the efficiency of the program.
binding to receptor sites,30 protein structure and sequence Each chromosome, corresponding to a potential docking,
comparison,31,32 protein secondary structure prediction,30 consisted of six integer elements, corresponding to the six
46 E.J. GARDINER ET AL.
degrees of freedom necessary to define the movement of stringent a complementarity requirement was unsuccess-
one rigid body relative to another. The first three elements ful in that “fit” solutions did not correspond to correct
corresponded to rotations of 0 –359° about the x, y, and z orientations of the query protein. Thus we have adopted a
axes, respectively, and the remaining three to translations principle of “not similar” matching where necessary. So,
(measured in tenths of an Ångstrom) along the original x, for a pair of target and query dots, a match was declared if
y, and z axes. An initial population of chromosomes was (1) their grid coordinates were the same (thus, they occupy
generated at random, each was applied to the query the same region of space); (2) their surface normals were
protein dots (the target remained fixed), and the fitness of nearly opposite (the angle between them should be close to
the resulting complex was calculated (as described below). 180°—the actual value was a parameter); (3) their Con-
The GA was of a type known as steady-state-with-no- nolly shape type was not the same, unless they were type
duplicates, meaning that a fixed proportion (in our case 2. (i.e., 1 matched 2 or 3, 3 matched 1, or 2 and 2 matched
5%) of the chromosome population was replaced after each anything); or (4) their hydrogen bonding potential was
generation and that duplicate chromosomes were not satisfied, i.e., H-donor matched H-acceptor or H-donor/
allowed.42,43 The chromosomes replaced were those that acceptor, H-acceptor matched H-donor or H-donor/accep-
were least fit in the population. New chromosomes were tor, H-donor/acceptor matched H-donor, H-acceptor or
generated by applying the genetic operators mutation H-donor/acceptor, non-H-bonding matched non-H-bond-
and/or crossover to parents chosen by roulette wheel ing.
selection. Mutation took one parent chromosome and We also wished to apply a penalty for overlap of the
produced one child. In standard mutation, a position was protein interiors whilst allowing for slight surface clashes.
chosen at random along the parent and randomly mutated We did this by defining a set of “interior” points of the
to another value (keeping within the range 0 –359 if a target protein. These were grid points that were near to a
rotation was being mutated). Small-creep mutation oper- target protein atom and that were further than a user
ated in the same way except that the replacement value defined “thickness” parameter from any surface dot. In all
was close to the original (within 5° for a rotation and the tests described in this report, a thickness of 2 Å was
within 2 Å for a translation). When mutation was the used. The coordinates of the interior points were hashed to
chosen genetic operator, standard or small-creep muta- integer values and stored in a sorted list. As their creation
tions were chosen at random (with equal likelihood). took 5–30 min on a Silicon Graphics 270-MHz O2 worksta-
Crossover took two parents and produced two children, in tion with R12000 processor (depending on the size of the
one of two ways. In single-point crossover a position along target), a set of interior points for a particular thickness
the chromosome was chosen at random and all elements was only created once and was stored on file for subse-
subsequent to the chosen point were then swapped over quent GA runs. For a possible orientation of the query with
between the two chromosomes. In exchange-crossover, a respect to the target, we counted the number of matching
point was again chosen at random and just that element dots and the number of clashes between query dots and
was swapped between the two chromosomes. When cross- target interior points. The penalty was then given by
over was the chosen genetic operator, single-point or
exchange-crossover were again chosen at random (with J* number of clashes if any dots matched
Penalty ⫽ 100,000 if no dots matched
equal likelihood). After a predetermined number of genera-
tions (1,000 for all the tests reported here) the GA was where J, the penalty multiplier, was an input parameter.
presumed to have converged and the fittest population The fitness of the chromosome was then given by
member was saved as a solution.
We have also applied a technique called niche restric- Fitness ⫽ number of matches ⫺ penalty
tion,44 which we use to force the GA to explore different
We recognize that this is not an even-handed approach.
regions of solution space. After the first run through, an
For example, a long side-chain may protrude from the
area (niche) around the fittest solution was removed from
target. A single side-chain is “thin” (i.e., by our definition,
the solution space. The area removed was a sphere of
it has no interior points) and thus, if it penetrates the
radius 2 Å centered on the center of gravity of the fittest
query, this will not be penalized. However, for the query
solution. The GA was then restarted as before, except that
protein, such a side-chain’s penetration of the target would
no chromosomes were allowed to place the center of
attract a penalty, as it would clash with the target interior
gravity of the query protein within the restricted niche.
points. We have taken this pragmatic approach to mini-
After the second runthrough, a second niche was defined
mize computation (and hence time) required for docking.
and so on. In our case, we obtained 5 niches per GA run,
Calculating the new positions of the query points is one of
each niche corresponding to a potential solution.
the most time-consuming parts of the GA. If interior points
Fitness Function were to be defined for the query, their position would also
have to be recalculated for each alteration of each chromo-
The GA thus described is similar to the surface match- some, which would be a further time burden. Some results
ing algorithm of Poirrette et al.40 The critical difference in of this “lopsidedness” is discussed later.
the new method lies in the fitness function since for protein The GA may be used with entire protein surfaces or one
docking we are looking for complementary rather than or both proteins may be just a site. If a site was used,
similar surfaces, with minimal overlap of the protein surface dots were generated for only that part of the
interiors. However, during initial testing we found that too protein surface. However, if that protein was the larger of
PROTEIN DOCKING 47
Parameter Value
Population size 400
No. of runs 1,000
Percentage of population replaced each iteration (%) 5
Mutation/crossover rate (%) 50
Selection pressure 1.1
Niche sphere diameter (Å) 2
No. of niches 5
Complex (PDB Code) Normal anglea Jb Top rankc Top 10 hitsd Total hitse RMSDf Best RMSDg
1cho ⫾30° 3.0 6 2 4 1.47 1.47
⫾20° 1.5 4 1 5 0.77 0.77
⫾20° 3.0 2 4 5 1.05 0.93
⫾15° 1.0 1 7 9 1.69 0.70
2sni ⫾30° 3.0 5 1 1 1.63 1.63
⫾20° 1.5 1 5 8 1.60 1.32
⫾20° 3.0 2 3 6 2.99 0.76
⫾15° 1.0 1 7 10 1.40 1.36
3hfl ⫾30° 3.0 0
⫾20° 1.5 25 0 2 1.63 1.63
⫾20° 3.0 34 0 4 0.94 0.94
⫾15° 1.0 1 3 5 2.00 2.00
*Each row corresponds to 20 runs of the GA, each producing 5 niches giving 100 solutions per row.
a
180° normal angle is the angle allowed between the surface normals for a permitted match.
b
Penalty multiplier.
c
Rank (by fitness score, of 100) of the highest ranked solution which resembled the complex.
d
Number of complex-resembling hits in the 10 top ranked solutions.
e
Number of complex-resembling hits (of 100).
f
Root-mean-square deviation of the inhibitor C␣ atoms of the most highly ranked predicted complex when its enzyme is
superposed on the Protein Data Bank (PDB) complex enzyme.
g
Predicted complex closest to the actual complex. Predicted complexes were deemed to resemble the complex if their RMSD
was ⬍3 Å.
solved by X-ray crystallography to a resolution of ⱖ2.8 Å, were 3hhr (human growth hormone–receptor complex)
and each of which has at least one of its component and 2btf (-actin–profilin complex), in which all residues
proteins crystallized separately to a similarly high resolu- within 6 Å of a query atom were chosen (the shape of these
tion.8 In order to test the GA on a wide variety of targets indicating that most target atoms are within 8 Å of
complexes, we have attempted to dock each of these a query atom). Clearly, this procedure would be impossible
complexes using whatever native data was available. We when predictively docking two proteins, the structure of
also included the well-known barnase– barstar complex whose complex was unknown. However, when two pro-
(PDB code 1brs), another antibody–antigen complex (PDB teins are known to associate, there is frequently some
code 3hfm) and the -lactamase–inhibitor complex, which biochemical evidence (e.g., from mutagenesis studies or
was the subject of a recent docking challenge.25 To our prediction studies) to indicate the likely position of a
knowledge, this is the largest docking study of native binding site residue, so it is reasonable to assume some
proteins to date. Table IIIA–C contains details of these 34 knowledge of the general region in which binding occurs.
test complexes. The results of these native dockings are given in Table
For our first attempt at these native dockings, we used IVA–D. As suggested by Gabb et al.,16 we have assessed
the entire surfaces of both proteins for the enzyme–
the quality of these dockings by calculating the RMSD of
inhibitor complexes. For the antibody–antigen complexes,
the C␣ interface atoms after the C␣ atoms of both proteins
we used sites of the target antibody as these proteins are
of the predicted complex are superposed on those of the
very large and the binding sites of Fabs are well known to
crystal complex.16 (Any atom within 10 Å of the opposite
be in the complementarity determining regions (CDR). As
protein is considered an interface atom.) This value is a
in the bound test, for the antibody–antigen dockings we
good measure of the correctness of the predicted complex
used only those residues of the target antibody protein
which were within 8 Å of the known antigen-binding site. as it removes the distorting effect that small errors at the
This gave target sites that were still big enough to give the interface may have on the position of atoms distant from
GA plenty of scope for finding incorrect orientations. We that interface. For each complex, we also created a “refer-
also take the view that protein docking should be a ence structure” by separately superposing the C␣ atoms of
pragmatic exercise—all available biochemical information the unbound proteins onto their bound counterparts. We
can be used if necessary. then calculated the RMSD between the C␣ interface atoms
For the “other” complexes, we first attempted to dock of the reference structure and the crystallized complex.
using the whole of both proteins. Given the mixed results This base RMSD value is a good approximation to the best
of this strategy (see below), we later chose the active site that could be expected of a rigid-body docking process. We
for the larger protein of each complex and repeated the GA consider a ”correct“ docking to be one in which the C␣
runs with these sites. These sites were chosen by looking interface RMSD minus base RMSD is ⬍4 Å, and dockings
at the known structure of the crystallized complex and are included in Table IV only if this criterion was satisfied.
selecting any residue of the target protein, which had an All timings are in CPU minutes for a Silicon Graphics
atom within 8 Å of a query protein atom. The exceptions 270-MHz O2 workstation with R12000 processor.
PROTEIN DOCKING 49
Enzyme–Inhibitor Complexes one set of parameters. Figure 2a shows the C␣ trace of the
Table IVA shows that the GA performed extremely well best predicted docking of the Kallikrein A (PDB code
on these complexes. In every case, for at least one parame- 2pka)–pancreatic trypsin inhibitor (PDB code 1bpi) super-
ter combination, a complex that was close to the PDB posed onto the actual complex (PDB code 2kai). The
complex was found. This means that, although there are predicted complex is shown in lime green (enzyme) and red
literally millions of ways in which one protein may be (inhibitor) and the actual complex in blue (enzyme) and
oriented with respect to the other, the GA produced a list of turquoise (inhibitor).
400 such orientations, which always contained at least one The two complexes, 1cho and 2sni, which gave similarly
(and in most cases many) approximately correct predic- high-ranking hits when docked using bound coordinates
tion. For 13 of the 19 complexes, correct hits were found for gave quite different results when docked using native
every parameter combination. For 11 of the complexes, at coordinates. The worst top rank for the native 1cho
least one good hit was found ranked in the top 5 for at least docking was 5, whilst for 2sni the best native docking rank
50 E.J. GARDINER ET AL.
TABLE IV. Results of Enzyme–Inhibitor, Antibody–Antigen, Other Complex, and Other Complex (Site)
Normal angle ⫾20° J 1.5 Normal angle ⫾15° J 1.0 Normal angle ⫾20° J 3.0 Normal angle ⫾30° J 3.0
Total
Complexa Tdotsb Qdotsc Based Ranke RMSDf Nhitsg Ranke RMSDf Nhitsg Ranke RMSDf Nhitsg Ranke RMSDf Nhitsg timeh
A. Enzyme/Inhibitor Results
B. Antibody–Antigen Results
was 14. However, this could be explained by the fact that the enzyme, with the error being in the orientation of the
the inhibitor for the native 1cho docking was taken from inhibitor within the site. This reflects the more deeply
the complex, whereas for 2sni, both proteins were native. concave nature of the enzyme active sites compared with
The majority of the incorrect orientations predicted by the the remainder of the enzyme surface.
GA, and certainly most of the high-ranking incorrect The run times given are the total times for 20 runs of the
predictions, did position the inhibitor in the active site of GA, producing 100 solutions. The time needed to generate
PROTEIN DOCKING 51
(selected as described above). Even so, the dockings still case it was only 1 from 400 possible solutions. However,
generally involved more surface dots than for the enzyme– when attention was restricted to docking a site of one
inhibitor cases described above and thus provided a more protein against the whole of the other protein, its perfor-
severe test, the results of which are given in Table IVB. mance was, as was hoped, much improved (Table IVD).
The bound docking tests suggested that the GA did not The GA found at least one hit for each complex. For four of
perform quite as well on antibody–antigen complexes, and the complexes—1atn, 1gla, 1mda, and TEM1-BLIP—it
the native test confirmed this. Nevertheless, in 6 of 8 cases, found several hits for most of the parameter combinations,
the GA did find at least one hit in the top 400, the two while for two complexes—1gla and 1spb— only one hit was
exceptions being the Fab–lysozyme complex 1vfb and the found. It was pleasing to find that the GA was able to find
Fab–neuraminidase complex 1nmb. In fact, in the case of solutions ranked in the top 10 for the TEM-1/BLIP com-
1nmb, which (with 1nca) has the largest query protein of plex, as our previous attempts (using clique-detection
any complex considered (neuraminidase has 389 residues), methods) had completely failed for this complex.28 Figure
hits were found that somewhat resembled the crystallized 2c shows the predicted TEM-1/BLI P complex with the
complex but whose RMSD (of the order of 7 Å) were too lowest RMSD, superposed on the actual complex. A direct
great too allow them to be considered correct. comparison with the docking challenge result is difficult as
The complexity of the docking was also reflected in the we are docking with the benefit of hindsight, whereas the
longer run times. The run time is generally dependent on challenge participants did not know the answer. However,
the size of the query protein. Hence, all the runs with although all the participants ranked near-native com-
lysozyme (which has 129 residues) as the query took about plexes top in their submissions, some did use their judg-
half an h for one run and thus to obtain 400 solutions took ment to “overrule” the ranking of their scoring proce-
about 40 h. Neuraminidase is a very large query protein dure,25 so the fact that our best hit was ranked 8 by the GA
(389 residues) and the two neuraminidase complexes took does not mean that we would have failed the docking
about 120 h for 400 solutions. Figure 2b shows the C␣ trace challenge. However, in this example, it is notable that the
of the predicted Fab–neuraminidase (PDB codes 1nca, group of Abagyan and Totrov and that of Sternberg and
7nn9) complex, with the lowest RMSD, superposed on the Jackson scored the correct solution top in their respective
actual 1nca complex. For clarity, only the part of the Fab ranking systems.7
that contacts the neuraminidase is shown, with its light
chain in lime green and its heavy chain in turquoise. This Performance of the GA
is the same for the predicted and actual complexes, since
The GA was run with four combinations of the normal
the Fab coordinates used in docking were taken from the
angle/penalty multiplier parameters. Although there was
complex. The predicted neuraminidase is shown in red and
no consistently best performer for the enzyme–inhibitor
the complexed in blue.
complexes, overall it did seem that the 180°⫾ 30° with
Other Complexes penalty multiplier J ⫽ 3.0 combination was the most
satisfactory. Table V, which summarizes the results for all
These complexes provided a very stiff test indeed for the complexes for this parameter combination, shows that for
GA. To the best of our knowledge, most have not previously 30 of the 34 complexes, the GA found at least one solution
been used in theoretical docking studies. Exceptions are ranked in the top 100, when these parameters were
the TEM-1/BLIP complex, which was the subject of a applied. For 21 of these cases, more than one good solution
docking challenge,7 and the human growth hormone– was found, and for 15 cases, a good solution was found
growth hormone receptor complex (3hhr), which has re- ranked in the top 20. In 6 cases, this parameter combina-
cently been studied.45 Although human growth hormone is tion produced a docking that resembled the complex which
itself a complex, consisting of two chains that are identical was ranked one or two. However, using all four parameter
in sequence, but different in conformation, we have fol- combinations, at least one good solution among 400 was
lowed Betts and Sternberg8 in treating the hormone as a found for all but two of the test complexes, although this
single preassembled unit; this we have attempted to dock was at the expense of having more false hits.
with its receptor. All the complexes are very large, and the We have been quite generous in our definition of what
dockings involved many surface dots. Another particular constituted a hit as we wished to assess the utility of the
problem was that, in some cases, the conformation of the GA in docking as wide a range of complexes as possible.
native protein had some significant differences to that of The interface RMSD of our “hits” ranged from 0.77 Å
the complex. For example, the RMSD between native (enzyme–inhibitor complex, 2tec) to 6.78 Å (human growth
methylamine dehydrogenase (2bbk) and that in the 1mda hormone–receptor complex, 3hhr), although in the case of
complex with amicyanin is 6 Å— considering just those 3hhr, the base interface RMSD was 2.92 Å. There are,
residues 4 Å from the interface gives an RMSD of 2.8 Å. however, different “grades” of nonhit. For example, the
This clearly poses a problem for a rigid body docking fittest docking of the enzyme–inhibitor complex 2sic with
procedure. an RMSD of ⬍4 Å is ranked 25 (Table V). In fact, 8 of the
In view of these difficulties, the GA performed better top 10 fittest solutions had RMSD values of ⬍5 Å. Because
than might have been expected, even when docking entire we chose a cutoff RMSD value of 4 Å (relaxed by the base
proteins (Table IVC). For three of the complexes—3hhr, RMSD), none of these were classed as hits although they
1spb (subtilisin–subtilisin prosegment complex), and clearly resembled the crystallized complex more closely
TEM1-BLIP—it found a correct docking, although in each than a prediction with RMSD of 20 Å.
PROTEIN DOCKING 53
TABLE V. Docking All Complexes Using Normal ⴞ30° and minutes and about 2 h, respectively. The “correct” hits
J 3.0 produced by the GA also tend to have slightly worse RMSD
than those of these other groups.
Complex Rank RMSD Nhits
1brb 5 4.08 11 Docking Subunits
1cgi 34 5.04 8
2kai 4 4.39 8 We are also interested in docking protein subunits to
2ptc 1 4.38 6 form oligomers, as some proteins, such as bacterial toxins,
2sic 25 3.91 5 are monomeric in solution but associate to form oligomeric
2sni 51 2.05 2 pores (e.g., aerolysin, hemolysin E). The intersubunit
1acb 2 2.32 1
interactions that occur when dimers (or other such com-
1brc 28 2.63 5
1cho 1 1.79 9 plexes) form are thought to be largely driven by hydropho-
1cse 36 2.40 8 bic forces.46 As our GA modeled these interactions (requir-
1ppe 2 0.96 12 ing non H-bonding atoms to match with similar), we
1sbn 4 2.49 14 wished to see how well it performed in this context. A
1stf 37 3.00 1 difficulty with this kind of docking is that subunits are not
1tab None often able to be crystallized separately from the complete
1tgs None protein. Thus, docking tests are usually performed using
2tec 18 1.27 4 bound coordinates. If the GA were ever to be used to
4htc 2 1.91 2
predict the structure of an oligomer, the minimum test it
1udi 13 2.23 3
must pass is to reassemble crystallized oligomers. This
1brs 1 3.14 2
1mlca 55 3.17 2 problem has been studied by Cummings et al.,47 who
1vfba None reassembled diubiquitin from the subunits of the crystal-
1ncaa 74 1.19 1 lized dimer and also by Hendrix et al.,45 who, as part of a
1nmba None three-protein docking of human growth hormone and its
1igca 20 4.60 4 receptor (which is composed of two monomers, each of
1jela 100 3.58 1 which interacts with the hormone in a different fashion),
3hfla 83 2.50 1 reassembled the entire complex using the bound coordi-
3hfma 50 3.38 1 nates. We successfully assembled this complex (PDB code
1atna 90 3.98 1
3hhr, Table IVC,D) by treating the receptor as a single
1glaa 40 4.35 2
1spba 73 1.42 1
protein (a dimer). To the best of our knowledge, the
2btfa 16 1.90 2 reassembly of a unit larger than a trimer has not yet been
3hhr 91 6.78 1 attempted.
1mdaa 1 5.05 18 We have tested this use of the GA on three different
temblipa 8 2.81 4 proteins, HIV-1 protease (PDB code 4hpv), proteasome
a
Indicates that only part of the target protein was used.
activator reg-alpha (PDB code 1avo) and diubuquitin (PDB
code 1aar). HIV-1 protease is a dimer with two identical
chains, A and B, each of 99 residues. We separated the
PDB file into two files and docked the two chains, using A
The GA performed well in comparison with two other as the target. The proteasome activator is a heptamer with
recent large docking studies that have used unbound 14 chains, 7 pairs that form a circular helical assembly
proteins. In their study, Ritchie and Kemp24 docked 18 with a central hole. Each subunit consists of two chains of
enzyme–inhibitor and antibody–antigen complexes and 59 and 139 residues, respectively. In this case, we docked
found solutions ranking in the top 100 in only 4 cases, two identical copies of the A,B-chain subunit. This is a
when their dockings were constrained to search only the slightly more stringent test than that imposed on 4hpv, as
receptor-binding site. They required some knowledge of the RMSD between the A,B-chain subunit and the adja-
the ligand-binding site to generate more highly ranked cent C,D-chain subunit is 0.2 Å. Diubiquitin is a dimer
solutions. However, they successfully docked the Fab consisting of two distinct copies of the ubiquitin monomer;
fragment–antigen complex 1vfb, which the GA was unable the C␣ RMSD between the two chains is 1.4 Å. We docked
to dock using any of our parameter combinations. Our two identical copies of the A chain which provides a more
results are also comparable to those of Norel et al., 9 who severe test than docking the A chain against the B chain.
docked 19 unbound complexes. In 15 cases, they found a The results, which were very encouraging, are shown in
solution in the top 100, and 5 of these were in the top 10. Table VI. For the 4hpv dimer, the top 15 solutions for
However, their test cases only involved 8 different com- normal angle ⫾20° and J ⫽ 1.5 were all within 2.6 Å of the
plexes—the multiple results came from using the same actual complex. The 1aar and 1avo results, whilst not as
proteins crystallized to different resolution or in different outstanding as 4hpv, were still very good with each
crystal forms. Thus, although large, their test set was not parameter set providing multiple hits within the top ten
as extensive as ours. In general, the GA is somewhat ranked solutions. This time, taking all three complexes
slower than comparable docking methods, with one hun- into consideration, the parameter combination of normal
dred orientations taking 3–33 h. For example, the methods angle ⫾20° and J ⫽ 1.5 did seem significantly better than
of Norel et al.,9 and Ritchie and Kemp24 took only a few the other three since a correct hit was ranked top for every
54 E.J. GARDINER ET AL.
Best
Complex (PDB code) Normal anglea Jb Top Rankc Top 10 hitsd Total hitse RMSDf RMSDg
4hpv ⫾30° 3.0 1 10 18 1.99 1.52
⫾20° 1.5 1 10 20 2.00 0.98
⫾20° 3.0 1 10 13 1.86 1.72
⫾15° 1.0 1 10 27 2.79 1.38
1avo ⫾30° 3.0 4 4 7 2.31 1.27
⫾20° 1.5 1 4 6 1.42 1.42
⫾20° 3.0 2 2 3 3.53 3.53
⫾15° 1.0 4 2 6 1.81 1.81
1aar ⫾30° 3.0 4 3 5 2.66 2.47
⫾20° 1.5 1 5 8 2.33 2.17
⫾20° 3.0 1 3 4 2.32 2.32
⫾15° 1.0 2 6 7 1.92 1.87
*Each row corresponds to 20 runs of the GA, each producing 5 niches giving 100 solutions per row.
a
180° normal angle is the angle allowed between the surface normals for a permitted match.
b
Penalty multiplier.
c
Rank (by fitness score, of 100) of the highest ranked solution which resembled the complex.
d
Number of complex-resembling hits in the 10 top ranked solutions.
e
Number of complex-resembling hits (of 100).
f
Root-mean-square deviation of the query C␣ atoms of the predicted complex when the target monomer is superposed upon
the monomer of the crystallized dimer.
g
Predicted complex closest to the actual complex. Predicted complexes were deemed to resemble the complex if this RMSD
was ⬍4 Å.
complex. However, the docking was done using bound assembly (e.g., a trans-membrane pore) is expected to be
coordinates and we anticipate that native docking might composed of a number of subunits, a further constraint is
again require looser angle matching constraints. imposed on the docking by the fact that when a number
When attempting to predict the structure of an oligomer, (which may be known) of the subunits are docked together,
one needs to assemble multiple copies of a subunit. We the last must also dock with the first. From a list of 100
have done this for the proteasome activator reg-alpha, dockings produced by the GA, only a few, if any, will satisfy
1avo. Figure 3a,b shows views of a heptamer, based on the this condition.
predicted 1avo dimer with the lowest RMSD to the actual Diubiquitin is unusual, in that the structure of the
dimer, superposed onto the actual heptamer. Naturally, monomer (ubiquitin) has also been determined alone. As a
the predicted assembly becomes slightly further out of step final test, we also attempted to form diubiquitin by docking
with the crystallized complex with each additional sub- two copies of ubiquitin. This was also attempted by
unit. Figure 3c shows just the first three docked subunits. Cummings et al.,47 who reported success when they used
Even so the overall RMSD between all 1,400 C␣ atoms of only part of the monomer as their docking target. They
our assembly and the crystallized complex is only 4.2 Å. also found it necessary to delete the C-terminal residue
The case of 1avo illustrates the lack of even-handedness Gly 76 and to truncate the flexible residue Arg 42 by
of the treatment of the query and target proteins. As both deleting all atoms beyond the CB. They give biochemical
are the same, we anticipated that the GA would produce a justifications for all their modifications. Our results proved
similar number of dockings with the query bound “before” very similar to theirs. The GA gave no correct hits when
as “after” the target (Fig. 4a). In practice, however, most the entire surface of the monomer was used for both query
correct solutions (and all the correct solutions which rank and target. However, when the target protein was re-
highly) have the query after the target (in position 1, see stricted to only those residues which are within 8 Å of the
Fig. 4a). The reason for this seemed to be that there was a other monomer (in the crystallized dimer) and the same
helix (residues 1–31), which stuck out from the monomer deletion of Gly 76 and alteration of Arg 42 performed, the
(Fig. 4b). This helix was “thin” (had few interior points GA found 6 correct hits using the parameter combination
defined) in the target and so when the GA approached of normal angle ⫾30° and J ⫽ 3. The fittest correct hit
position 1 this did not cause a clash with query dots (with an interface RMSD of 3.9 Å to the crystal dimer) was
because the query has no interior points. However, when ranked.4
the GA approached position 2 and overlap of the query
helix with the target interior occurred, this caused a CONCLUSIONS
penalized clash (Fig. 4c). This type of situation may be a We have developed a GA for protein–protein docking,
reason for the failure of the GA in the unbound cases based on matching complementary Connolly surfaces. We
where few or no solutions were found. have tested the GA first on three complexes using bound
The 1avo assembly also illustrates a situation in which coordinates and then on 34 complexes (some very large)
the GA may be of immediate practical use. When a circular using unbound coordinates and it has performed very well.
PROTEIN DOCKING 55
ACKNOWLEDGMENTS
The authors thank Professor M.G. James for supplying
us with the coordinates of TEM-1 and BLIP and the
Biotechnology and Biological Sciences Research Council
and Tripos, Inc., for computing resources. The Krebs
Institute is a BBSRC designated Biomolecular Sciences
Centre and is a member of the BBSRC North of England
Structural Biology Centre.
REFERENCES
1. Berman HM, Westbrook J, Feng Z, Gililand G, Bhat TN, Weissig,
Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids
Res 2000;28:235–242.
2. Hart TN, Read RJ. Multiple-start Monte Carlo docking of flexible
ligands. In: Merz K Jr, Le Grand S, editors. The protein folding
problem and tertiary structure prediction. Boston: Birkhauser;
1994. p 77–108.
3. Kuntz ID, Blaney JM, Oatley SJ, Langridge R, Perrin TE. A
geometric approach to macromolecule-ligand interactions. J Mol
Biol 1982;161:269 –288.
4. Miller MD, Kearsley SK, Underwood DJ, Sheridan RP. Flog—a
system to select quasi-flexible ligands complementary to a recep-
tor of known 3-dimensional Structure. J Computer Aided Mol
Design 1994;8:153–174.
Fig. 4. The GA prefers one of two alternate binding sites for 1avo The 5. Rarey M, Wefing S, Lengauer T. Placement of medium-sized
two binding sites are shown by chalk lines at positions 1 and 2 of the target molecular fragments into active sites of proteins. J Computer
monomer. a: As 1avo is a heptamer, the query protein will bind in position Aided Mol Design 1996;10:41–54.
1 and position 2. However, almost all correct dockings found by the GA 6. Jones G, Willett P, Glen RC, Leach AR, Taylor R. Development
were in position 1. b: 1avo monomer showing residues 1–31 protruding. and validation of a genetic algorithm for flexible docking. J Mol
c: The target protein has interior points defined. If the query helix (black Biol 1997;267:727–748.
line) penetrates the target (position 2), a penalized clash with interior 7. Strynadka NCJ, Eisenstein M, Katchalski-Katzir E, Shoichet BK,
points occurs. If the target helix penetrates the query (position 1) no such Kuntz ID, Abagyan R, Totrov M, Janin J, Cherfils J, Zimmerman
clash occurs. Although no such clash occurs in the actual binding of the F, Olson A, Duncan B, Rao M, Jackson R, Sternberg M, James
two monomers, this clash drives the GA away from the solution at position 2. MNG. Molecular docking programs successfully predict the bind-
ing of a beta-lactamase inhibitory protein to TEM-1 beta-
lactamase. Nature Struct Biol 1996;3:233–239.
When proteins dock, many factors have an influence. We 8. Betts MJ, Sternberg MJE. An analysis of conformational changes
have modeled only two, the surface shape and the “chemi- on protein–protein association: implications for predictive dock-
ing. Protein Eng 1999;12:271–283.
cal shape.” Even so, these have been sufficient, in most 9. Norel R, Petrey D, Wolfson HJ, Nussinov R. Examination of shape
cases, to find orientations near to the crystallized complex. complementarity in docking of unbound proteins. Proteins 1999;36:
We have established a set of parameters for which the GA 307–317.
10. Helmer-Citterich M, Tramontano A. Puzzle: a new method for
found solutions ranked in the top 100 for 30 of the 34 automated protein docking based on surface shape complementar-
complexes. The GA has also been used to dock subunits, ity. J Mol Biol 1994;235:1021–1031.
including successfully reassembling a heptamer from bound 11. Jiang F, Kim SH. Soft docking—matching of molecular-surface
components. cubes. J Mol Biol 1991;219:79 –102.
12. Katchalski-Katzir E, Shariv I, Eisenstein M, Friesem AA, Aflalo
A great strength of the GA is that the solutions produced C, Vakser IA. Molecular-surface recognition— determination of
are already ranked which eliminates the need for a further geometric fit between proteins and their ligands by correlation
screening/ranking step in the docking process. Whilst it is techniques. Proc Natl Acad Sci USA 1992;89:2195–2199.
13. Lenhof HP. New contact measures for the protein docking prob-
not true that the fittest solution was necessarily correct, a lem. Proceedings of the 1st Annual Conference on Computational
list of 100 putative orientations, of which at least one is Molecular Biology, RECOMB 1997; 182–191.
probably close to the crystallographic complex, is a very 14. Meyer M, Wilson P, Schomberg D. Hydrogen bonding and molecu-
good start, although there is still room for improvement. lar surface shape complementarity as a basis for protein docking.
J Mol Biol 1996;264:199 –210.
The fitness function at present is very simple. We intend 15. Shoichet BK, Kuntz ID. Protein docking and complementarity. J
to try to enhance it by including some electrostatic ele- Mol Biol 1991;221:327–346.
56 E.J. GARDINER ET AL.
16. Gabb HA, Jackson RM, Sternberg MJE. Modelling protein dock- 32. Notredame C, Higgins DG. SAGA: sequence alignment by genetic
ing using shape complementarity, electrostatics and biochemical algorithm. Nucleic Acids Res 1996;24:1515–1524.
information. J Mol Biol 1997;272:106 –120. 33. Verkhivker GM, Rejto PA, Gehlhaar DK, Freer ST. Exploring the
17. Jackson RM, Sternberg MJE. A continuum model for protein– energy landscapes of molecular recognition by a genetic algo-
protein interactions—application to the docking problem. J Mol rithm: analysis of the requirements for robust docking of HIV-1
Biol 1995;250:258 –275. protease and FKRP-12 complexes. Proteins 1996;25:342–353.
18. Connolly ML. Solvent-accessible surfaces of proteins and nucleic 34. Pearlman DA. Automated detection of problem restraints in NMR
acids. Science 1983;221:708 –713. data sets using the FINGAR genetic algorithm method. J Biomol
19. Connolly ML. Shape complementarity at the hemoglobin a1b1 NMR 1999;13:325–335.
subunit interface. Biopolymers 1986;25:1229 –1247. 35. Wild DJ, Willett P. Similarity searching in files of three-
20. Norel R, Lin SL, Wolfson HJ, Nussinov R. Shape complementarity dimensional chemical structures. Alignment of molecular electro-
at protein–protein interfaces. Biopolymers 1994;34:933–940. static potential fields with a genetic algorithm. J Chem Inform
21. Norel R, Lin SL, Wolfson HJ, Nussinov R. Molecular-surface Comput Sci 1996;36:159 –167.
complementarity at protein–protein interfaces—the critical role 36. Bayley MJ, Jones G, Willett P, Williamson MP. GENFOLD: a
played by surface normals at well placed, sparse, points in genetic algorithm for folding protein structures using NMR re-
docking. J Mol Biol 1995;252:263–273. straints. Protein Sci 1998;7:491– 499.
22. Hou TJ, Wang JM, Chen LR, Xu XJ. Automated docking of 37. Jones G, Willett P, Glen RC, Leach AR, Taylor R. Further
peptides and proteins by using a genetic algorithm combined with development of a genetic algorithm for ligand docking and its
a tabu search. Protein Eng 1999;12:639 – 647. application to screening combinatorial libraries. In: Parrill AL,
23. Jackson RM, Gabb HA, Sternberg MJE. Rapid refinement of Reddy MR editors. Rational drug design: novel methodology and
protein interfaces incorporating solvation: application to the practical applications. ACS Symposium Series. Washington, DC:
docking problem. J Mol Biol 1998;276:265–285. American Chemical Society; 1999. Vol 719, p 271–291.
24. Ritchie DW, Kemp GJL. Protein docking using spherical polar 38. Holliday JD, Willett P. Using a genetic algorithm to identify
Fourier correlations. Proteins 2000;39:178 –194. common structural features in sets of ligands. J Mol Graph Model
25. Palma PN, Krippahl L, Wampler JE, Moura JJG. BiGGER: a new 1997;15:221–232.
(soft) docking algorithm for predicting protein interactions. Pro- 39. Gillet VJ, Willett P, Bradshaw J, Green DVS. Selecting combinato-
teins 2000;39:372–384. rial libraries to optimize diversity and physical properties. J Chem
26. Dixon JS. Evaluation of the CASP2 docking section. Proteins Inform Comput Sci 1999;39:169 –177.
1997; 1(suppl):198 –204. 40. Poirrette AR, Artymiuk PJ, Rice DW, Willett P. Comparison of
27. Jackson RM. Comparison of protein–protein interactions in serine protein surfaces using a genetic algorithm. J Comput-Aided Mol
protease–inhibitor and antibody–antigen complexes: implications Design 1997;11:557–569.
for the protein docking problem. Protein Sci 1999;8:603– 613. 41. Connolly ML. Analytical molecular-surface calculation. J Appl
28. Gardiner EJ, Willett P, Artymiuk PJ. Graph-theoretic techniques Crystallogr 1983;16:548 –558.
for macromolecular docking. J Chem Inform Comput Sci 2000;40: 42. Goldberg DE. Genetic algorithms in search, optimisation and
273–279. machine learning. Reading, PA: Addison-Wesley; 1989. 421 p.
29. Pedersen JT, Moult J. Protein folding simulations with genetic 43. Davis L. Handbook of genetic algorithms. New York: Van Nostrand-
algorithms and a detailed molecular description. J Mol Biol Reinhold; 1991. 385p.
1997;269:240 –259. 44. Beasley D, Bull DR, Martin RR. A sequential niche technique for
30. Vivarelli F, Giusti G, Villani M., Campanini R, Fariselli P, multimodal function optimization. Evol Comput 1993;1:101–125.
Compiani M, Casadio R. Lgann—a parallel system combining a 45. Hendrix DK, Klien TE, Kuntz ID. Macromolecular docking of a
local genetic algorithm and neural networks for the prediction of three-body system: the recognition of human growth hormone by
secondary structure of proteins. Computer Applications in the its receptor. Protein Sci 1999;8:1010 –1022.
Biosciences 1995;11:253–260. 46. Janin J, Miller S, Chothia C. Surface, subunit interfaces and
31. May ACW, Johnson MS. Improved genetic algorithm-based pro- interior of oligomeric proteins. J Mol Biol 1988;204:155–164.
tein structure comparisons: pairwise and multiple superpositions. 47. Cummings MD, Hart TN, Read RJ. Monte-Carlo docking with
Protein Eng 1995;8:873– 882. ubiquitin. Protein Sci 1995;4:885– 899.