J. Med. Chem.
1996, 39, 2887-2893
                                                        Articles
The Properties of Known Drugs. 1. Molecular Frameworks
Guy W. Bemis* and Mark A. Murcko
Vertex Pharmaceuticals, 130 Waverly Street, Cambridge, Massachusetts 02139-4242
Received April 19, 1996X
         In order to better understand the common features present in drug molecules, we use shape
         description methods to analyze a database of commercially available drugs and prepare a list
         of common drug shapes. A useful way of organizing this structural data is to group the atoms
         of each drug molecule into ring, linker, framework, and side chain atoms. On the basis of the
         two-dimensional molecular structures (without regard to atom type, hybridization, and bond
         order), there are 1179 different frameworks among the 5120 compounds analyzed. However,
         the shapes of half of the drugs in the database are described by the 32 most frequently occurring
         frameworks. This suggests that the diversity of shapes in the set of known drugs is extremely
         low. In our second method of analysis, in which atom type, hybridization, and bond order are
         considered, more diversity is seen; there are 2506 different frameworks among the 5120
         compounds in the database, and the most frequently occurring 42 frameworks account for only
         one-fourth of the drugs. We discuss the possible interpretations of these findings and the way
         they may be used to guide future drug discovery research.
Introduction
  The drug design process is largely driven by the
instincts, intuition, and experiences of pharmaceutical
research scientists. It is often instructive to attempt
to “capture” these experiences by analyzing the histori-
cal record, i.e., successful drug design projects of the
past. The inferences drawn from this analysis can play
an important role in shaping our thinking on current
and future projects. For this reason, we would like to
analyze the structures of a large number of drugs, the
ultimate product of a successful drug design effort.
There is a wealth of information implicitly encoded in
the two-dimensional and three-dimensional structures             Figure 1. Graph representation of molecules.
of molecules that are currently sold as drugs. This
includes toxicity, stability (both chemical and meta-            Methods
bolic), synthetic accessibility, starting material costs,           The current version of the CMC database (v. 94.1) includes
and the like. Our goal for this paper is to begin to             more than 6700 compounds. However, many of these do not
deconvolute this information in order to apply it to the         meet our criteria for various reasons, e.g., imaging agents,
                                                                 dental resins, and veterinary compounds. Thus, our first task
design of new drugs.                                             was to identify and remove these compounds. We eliminated
  There are several computational tools available for            all compounds for which no therapeutic activity class was
this analysis: substructure searching using one of               given, as well as compounds which fell into any of the following
several commercially available software packages (e.g.           classes: radiopaque agents, contrast agents, solvents, anes-
                                                                 thetics, disinfectants, topicals, local agents, spermicides, wet-
Merlin, ISIS, Unity),1-3 automated ring searching using          ting agents, flavoring agents, pharmaceutical aids, surgical
one of several published algorithms,4-8 and shape                aids, dental, surfactants, sunscreens, ultraviolet screens,
descriptor methods.9-12 We use shape descriptor meth-            emetics, preservatives, aerosol propellants, chelators, kera-
ods because they are easily implemented and are flexible         tolytics, insecticides, astringents, herbicides, laxatives, sweet-
enough to allow the analysis to be performed in an               eners, dental caries prophylactics, adhesives, dentistry, phar-
automated way.                                                   maceutic aids, veterinary, buffers, scabicides, and ecto-
                                                                 parasiticides. After this process, the CMC database had 5120
  We analyze the Comprehensive Medicinal Chemistry               remaining entries.15
(CMC) database13 which contains two-dimensional and                 Our analysis of the structures in the CMC database has
predicted three-dimensional structures and important             been carried out on two levels, using atomic properties and
biochemical properties for known drugs. The CMC                  graph properties. Atomic properties include such information
                                                                 as element type, atomic hybridization, and atomic charge.
database has been developed from Pergamon’s Com-                 Graph properties of molecules are the connectivity properties
prehensive Medicinal Chemistry series.14                         of the atoms representing a molecule, that is, the information
                                                                 that may be derived from a molecular structure by considering
  * To whom correspondence should be addressed. E-mail:          each atom to be a vertex and each bond to be an edge on a
bemis@vpharm.com.                                                graph.16 The graph for a particular molecule may be consid-
  X Abstract published in Advance ACS Abstracts, July 1, 1996.   ered an archetype for each instance of that molecular shape.
                        S0022-2623(96)00292-0 CCC: $12.00        © 1996 American Chemical Society
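As a rough illustration of the vertex-and-edge view described in Methods, the following Python sketch strips element labels and bond orders from three six-membered rings and shows that they reduce to a single graph. The helper names (`ring`, `archetype`) and the tuple layout are assumptions made for this example and are not the authors' implementation. The comparison also assumes that atoms are numbered the same way around each ring, so no graph-isomorphism test is needed.

```python
def ring(elements, orders):
    """Hypothetical helper: build a monocycle from element symbols and bond orders."""
    n = len(elements)
    atoms = dict(enumerate(elements))                         # atom index -> element
    bonds = [(i, (i + 1) % n, orders[i]) for i in range(n)]   # (atom, atom, bond order)
    return atoms, bonds


def archetype(atoms, bonds):
    """Keep connectivity only: element labels and bond orders are ignored."""
    return frozenset(frozenset((i, j)) for i, j, _order in bonds)


# Kekule-style bond orders; purely illustrative inputs.
benzene = ring("CCCCCC", [2, 1, 2, 1, 2, 1])
pyridine = ring("NCCCCC", [2, 1, 2, 1, 2, 1])
cyclohexane = ring("CCCCCC", [1, 1, 1, 1, 1, 1])

# All three molecules collapse to the same six-vertex cycle.
same_graph = archetype(*benzene) == archetype(*pyridine) == archetype(*cyclohexane)
print(same_graph)
```

The labelled molecules remain distinct objects; only the connectivity-only view coincides, which is the sense in which one archetype stands for many instances.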
2888 Journal of Medicinal Chemistry, 1996, Vol. 39, No. 15                                                     Bemis and Murcko
Figure 2. Graph representation of a typical drug molecule.
                                                                   Figure 4. Hierarchical description of molecules.
                                                                   thioridazine molecule consists of two ring systems: a six-ring
                                                                   and three linearly fused six-rings. Together these rings and
                                                                   linkers define the framework of this molecule. The concept of
                                                                   a framework is central to our paper, and provides an important
                                                                   distinction between our present work and work done previ-
                                                                   ously.6
                                                                      We can now classify molecules and their constituent atom
                                                                   groupings into a hierarchy as shown in Figure 4. This
                                                                   classification scheme is very useful for analyzing the structures
                                                                   of drug molecules for several reasons. First, well-represented
                                                                   frameworks can be identified, and emphasis can be placed on
                                                                   these for new drug discovery. Second, linkers and ring systems
                                                                   can be identified for potential use in a combinatorial-type
                                                                   approach to compound library generation. Third, compound
                                                                   libraries may be evaluated for their relationship to the shapes
                                                                   of known drugs. In other words, we can evaluate how well
                                                                   the diversity space of a library overlaps with our representa-
                                                                   tion of drug-space.
                                                                      We begin our analysis by identifying side chain atoms,
Figure 3. Distinguishing between ring systems, linkers, and        which is done as follows. Each atom bonded to only one other
side chains.                                                       atom is identified as a side chain atom and removed from the
That is, for a molecule such as pyridine (Figure 1a), the          molecule. This process is repeated until either the molecule
molecular graph or archetype is the graph with six vertices        disappears (acyclic molecules) or until each atom is bonded to
(Figure 1b). The same archetype represents molecules such          at least two other atoms. The remaining atoms are identified
as benzene, cyclohexane, and pyran, among many others              as the framework atoms. The next step in our analysis is the
(Figure 1c). Thus the structures of molecules can be readily       identification of atoms within the framework that are in rings
analyzed in terms of a hierarchy in which molecular arche-         (or cycles in the graph) using a depth-first search.17 Any atom
types are at the top, and individual molecules are at the bottom   not part of a ring is identified as a linker atom. This process
(Figure 1).                                                        follows the hierarchy shown in Figure 4.
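The side chain, framework, ring, and linker assignment described in Methods can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: atoms left with no neighbors are removed along with single-neighbor atoms (so that acyclic molecules vanish), ring atoms are found with a simple remove-one-bond connectivity test rather than the depth-first search the paper cites, and the 17-atom toy molecule (two six-rings, a two-atom linker, and two side chains) is hypothetical.

```python
from collections import deque


def build_graph(bonds):
    """Adjacency sets from a list of (atom, atom) bonds."""
    graph = {}
    for a, b in bonds:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def prune_side_chains(graph):
    """Repeatedly strip atoms with at most one neighbor; return (framework, side atoms)."""
    g = {a: set(nbrs) for a, nbrs in graph.items()}
    side = set()
    leaves = [a for a, nbrs in g.items() if len(nbrs) <= 1]
    while leaves:
        for a in leaves:
            for nb in g.pop(a):
                if nb in g:
                    g[nb].discard(a)
            side.add(a)
        leaves = [a for a, nbrs in g.items() if len(nbrs) <= 1]
    return g, side


def still_connected(g, start, goal, skipped_bond):
    """Breadth-first search from start to goal that ignores one bond."""
    seen, queue = {start}, deque([start])
    while queue:
        a = queue.popleft()
        if a == goal:
            return True
        for nb in g[a]:
            if {a, nb} != skipped_bond and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return False


def ring_atoms(framework):
    """A bond lies in a cycle if its ends stay connected without it."""
    rings = set()
    for a in framework:
        for b in framework[a]:
            if a < b and still_connected(framework, a, b, {a, b}):
                rings.update((a, b))
    return rings


def classify(bonds):
    framework, side = prune_side_chains(build_graph(bonds))
    rings = ring_atoms(framework)
    return {"ring": rings, "linker": set(framework) - rings, "side_chain": side}


def six_ring(offset):
    return [(offset + i, offset + (i + 1) % 6) for i in range(6)]


# Hypothetical molecule: rings 0-5 and 6-11, linker 12-13, side chains 14 and 15-16.
TOY_BONDS = six_ring(0) + six_ring(6) + [(0, 12), (12, 13), (13, 6), (3, 14), (9, 15), (15, 16)]
parts = classify(TOY_BONDS)
```

Because pruning leaves every framework atom with at least two neighbors, any framework atom that is not in a cycle must lie on a path between ring systems, which is why the linker set can be taken as the framework minus the ring atoms.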
   When analyzing drug molecules, one is faced with a slightly        The molecular frameworks obtained in this manner were
more complicated set of graphs than in the simple example          grouped into clusters of identical shape description. Our
shown in Figure 1. To demonstrate this point, we might             analysis has been carried out in two ways: we have conducted
consider the antipsychotic thioridazine, which is shown along      both a purely graph theoretical analysis and an analysis which
with its graph representation or archetype in Figure 2. We         also considers atomic properties. Both methods follow es-
can now pick out structural elements which can be used to          sentially the same formal procedure with the only difference
further order groups of atoms within a molecular graph. We         being the shape descriptor used. For the graph analysis we
may dissect any molecule into four units: ring, framework,         used two-dimensional triangle shape descriptors12 and for the
linker, and side chains. We adopted the following definitions      analysis including atomic properties we used topological
to aid our analysis.                                               torsions.11 For computation of topological torsions, we found
   Ring Systems. We define ring systems to be cycles within        it necessary to retain the π electrons associated with frame-
the graph representation of molecules and cycles sharing an        work atoms when side chains were removed. For example,
edge (a connection between two atoms or a bond). For               cyclohexanone would have the sp2 oxygen tagged as a side
example, benzene, naphthalene, and anthracene are all single       chain atom, and the sp2 carbon tagged as having two associ-
ring systems. Treating cycles this way makes sense from a          ated pi electrons. On the basis of the topological torsion
chemical structural point of view. As an approximation, the        representation, the cyclohexanone framework would therefore
cycles and fused cycles in a molecule represent rigid units in     have a different shape description than the cyclohexane
which many degrees of freedom are removed from a collection        framework. The cyclohexanone framework is therefore rep-
of atoms.                                                          resented with two dots next to the sp2 carbon to indicate the
   Linker Atoms. Atoms that are on the direct path connect-        associated electrons. We have used this notation in Charts 2
ing two ring systems are defined as linker atoms. As can be        and 3.
seen in Figure 3, thioridazine has a two-atom linker connecting
the two ring systems. Molecules such as biphenyl have a zero-      Results
atom linker; the six-membered rings are different ring sys-
tems.                                                                 First we summarize the results of the graph theory
   Side Chain Atoms. Any nonring, nonlinker atoms are              (archetype) analysis and then the atomic property
defined as side chain atoms. Figure 3 shows that thioridazine      (instance) analysis. Finally, we discuss the relationship
has two side chains: a single-atom side chain attached to the
six-ring and a two-atom side chain attached to the fused
                                                                   between the two kinds of analysis.
tricyclic ring system.                                                From the graph theory analysis, there are 1179
   Framework. The framework is defined as the union of ring        different frameworks among the 5120 compounds ana-
systems and linkers in a molecule. As shown in Figure 3, the       lyzed. Of these frameworks, 783 (66%) are unique, i.e.,
Chart 1. Graph Frameworks for Compounds in the CMC Database as Classified by Connectivity Triangles (Numbers
Indicate Frequency of Occurrence)
they are found in only one drug molecule. Chart 1           type, hybridization, and bond order) are considered.
shows graph frameworks for compounds in the CMC             Somewhat more diversity is seen; there are 2506 dif-
database as classified by connectivity triangles. We        ferent frameworks among the 5120 compounds in the
have shown only frameworks that exist in at least 20        database. Again, a large majority of these frameworks
drugs. This set of 32 frameworks accounts for 50% of        (1908, or 76%) are unique. Chart 2 shows atomic prop-
the 5120 total drug molecules. Clearly the six-ring is      erty-based drug frameworks (drug instances) that occur
the most commonly used framework for these drugs.           in the CMC at least 10 times. Naturally, because this
Acyclic molecules (those with no framework) account for     classification scheme accounts for hybridization and
306 (6%) of the molecules we examined.                      bond order, one would expect a more diverse set of
  Our second method of analysis uses topological tor-       frameworks to be required to represent the drug data-
sions11 for classification. Several atom properties (atom   base. Even so, this set of 41 frameworks accounts for
Chart 2. Atomic Frameworks for Compounds in the CMC Database as Classified by Topological Torsions (Numbers
Indicate Frequency of Occurrence)
Chart 3. All Six-Membered Rings Found in the CMC            for over half of known drugs (as defined by our subset
Database (Numbers Indicate Frequency of Occurrence)         of the CMC database).
                                                               A problem sometimes encountered when using mo-
                                                            lecular shape descriptors is multiple representations,
                                                            cases where different shapes are represented by identi-
                                                            cal shape descriptions. There are a number of ways to
                                                            deal with this problem, such as adding more detail to
                                                            the shape descriptor or using multiple shape descrip-
                                                            tors.6 For small data sets such as the CMC, perhaps
                                                            the simplest solution is to look through groups of
                                                            molecules with identical shape descriptions and pick out
                                                            cases of multiple representation. This is the method
                                                            we used. An example of multiple representation is
                                                            found in the topological torsion shape description of
                                                            these two molecules:
                                                              We found two examples of the B molecular framework
                                                            grouped with 30 examples of the type A framework so
                                                            we assigned them to separate clusters.
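The inspection step described here, looking through groups that share a shape description for cases of multiple representation, can be sketched as below. The descriptor used, a sorted degree sequence, is a deliberately weak stand-in chosen for this example; it is not the topological torsion or triangle descriptor used in the paper. The two bicyclic graphs and all function names are hypothetical, and the framework names stand in for the structure identities a chemist would recognize on inspection.

```python
from collections import Counter, defaultdict


def degree_sequence(bonds):
    """Toy descriptor (an assumption): atom degrees sorted in descending order."""
    degree = Counter()
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    return tuple(sorted(degree.values(), reverse=True))


def collisions(frameworks, descriptor):
    """Group framework names by descriptor; keep groups holding more than one name."""
    clusters = defaultdict(set)
    for name, bonds in frameworks:
        clusters[descriptor(bonds)].add(name)
    return {key: names for key, names in clusters.items() if len(names) > 1}


# Two hypothetical six-atom bicyclic graphs with different connectivity:
# FUSED has adjacent bridgeheads (0 and 1); BRIDGED does not.
FUSED = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 4), (4, 5), (5, 0)]
BRIDGED = [(0, 2), (2, 3), (3, 1), (0, 4), (4, 1), (0, 5), (5, 1)]

flagged = collisions([("fused", FUSED), ("bridged", BRIDGED)], degree_sequence)
```

The sketch only surfaces candidate groups. Deciding that two members of a group are different shapes, and assigning them to separate clusters, remains a manual step as in the text.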
                                                              Finally, we should note that as a control, a partial
                                                            analysis was also performed on the complete CMC
                                                            database (approximately 6700 compounds), and the
                                                            results were substantially the same.
                                                            Discussion
                                                               This is our first attempt at classifying the shapes of
                                                            drug molecules, and our goal is to provide a “high-level
                                                            overview” of the gross structural features of these
                                                            molecules. Accordingly, for purposes of this research,
                                                            we have deliberately defined “shape” in simple terms.
                                                            The first classification scheme ignores such important
                                                            features as the details of substituents on rings, chain
                                                            branching, bond order, atom types, stereochemistry, and
                                                            three-dimensional conformation. The second classifica-
                                                            tion method does account for bond order and atom types.
1235 (24%) of the 5120 molecules we examined. Clearly          There is no reason to believe that the set of 5120
benzene is the most commonly used framework for these       molecules in our database represents all the possible
drugs.                                                      shapes that a drug may take. However, it is instructive
   It is instructive to understand the relationship be-     to examine the universe of known drugs to see what
tween the graph theory frameworks, which can be             patterns may exist. Once these patterns have been
viewed as providing a “high-level” or “generic” classifi-   deduced, the drug designer may apply them in various
cation scheme, and the atom property-based frame-           ways. For example, one might attempt to bias a de novo
works, which further subdivide classes of frameworks        design program or a combinatorial chemistry effort to
based on their chemical properties. As an example, we       produce a set of molecules which either contains or does
may consider the atomic property based framework for        not contain these patterns.
the most popular graph theory based framework, the             The reader must bear in mind that “shape” in this
six-ring. Chart 3 shows the set of six-ring atomic frame-   work refers to the two-dimensional topological graph of
works that accounts for the 606 six-ring scaffolds found    the molecules. While three-dimensional shape is par-
in our filtered version of the CMC database. Over-          tially encoded in the two-dimensional graph of a mol-
whelmingly, the most common six-ring atomic frame-          ecule, we expect that the three-dimensional conforma-
work is benzene. Of the drug molecules we considered,       tions of drugs with the same topological shape will not
8.5% (433 out of 5120) have benzene as their molecular      all be similar, although certain conformations would be
framework.                                                  expected to appear more frequently than others.
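The proportions quoted in the Results can be re-derived from the counts given in the text, as in the short Python check below. The counts (5120, 1179, 783, 2506, 1908, 1235, 433, 306) are taken from the paper. The helper `frameworks_needed` and the list `TOY_COUNTS` are assumptions added for illustration: the per-framework frequencies of Charts 1 and 2 are not reproduced in the text, so the coverage calculation is shown on made-up numbers only.

```python
TOTAL_DRUGS = 5120  # compounds remaining after filtering (from the text)


def percent(part, whole, digits=0):
    """Percentage of `whole` represented by `part`, rounded to `digits` places."""
    return round(100.0 * part / whole, digits)


reported = {
    "acyclic molecules": percent(306, TOTAL_DRUGS),
    "unique graph frameworks": percent(783, 1179),
    "unique atomic frameworks": percent(1908, 2506),
    "most common atomic frameworks": percent(1235, TOTAL_DRUGS),
    "benzene framework": percent(433, TOTAL_DRUGS, 1),
}


def frameworks_needed(counts, fraction):
    """Smallest number of most-frequent frameworks covering `fraction` of all molecules."""
    target = fraction * sum(counts)
    covered = 0
    for n, count in enumerate(sorted(counts, reverse=True), start=1):
        covered += count
        if covered >= target:
            return n
    return len(counts)


TOY_COUNTS = [40, 25, 10, 5, 5, 5, 5, 5]  # hypothetical frequencies, not Chart 1 data
```

Applied to real per-framework frequencies, `frameworks_needed(counts, 0.5)` would give the kind of figure the paper reports for the graph frameworks (32 frameworks covering half of the drugs).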
   Chart 1 can be further broken down (by inspection)          Of course, the preferences we have identified for
into rings and linkers. The linkers present are chains      certain shapes do not necessarily reveal some funda-
with zero to seven nodes shown in Chart 4 (rings and        mental truth about drugs, receptors, metabolism, or
linkers). Rings have a dashed line showing points where     toxicity. Instead, they may reflect the constraints imposed
linkers can potentially be attached. By using this set      by the scientists who have produced these drugs.
of 14 rings (with eight potential attachment points) and    Constraints due to synthetic or patent considerations,
eight linkers, we can derive the molecular frameworks       cost, or a general conservatism (i.e., a tendency to make
Chart 4. Graph Representations of the Rings and Linkers for the Most Common Drug Frameworks Found in Chart 1a
  a Linkers are depicted with open valences on each end; the number of nodes in each linker is given to the left. Rings are depicted with
dashed lines indicating possible points of attachment for linkers.
new compounds which are structurally similar to known                  composed of a particular framework divided by the
compounds) all may be reflected in these findings.                     number of drugs made from that framework.
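The "pharmacological promiscuity" ratio defined here reduces to a one-line calculation, sketched below in Python. The function name `promiscuity` is an assumption made for this example. The input is the paper's own biphenyl example (12 distinct CMC therapeutic classes across 16 drugs), in which therapeutic classes stand in for pharmacological targets because the exact targets are not known.

```python
def promiscuity(therapeutic_classes, n_drugs):
    """Distinct classes acted upon, divided by the number of drugs sharing the framework."""
    if n_drugs <= 0:
        raise ValueError("n_drugs must be positive")
    return len(set(therapeutic_classes)) / n_drugs


# Therapeutic classes the CMC lists for drugs with the biphenyl framework (from the text).
BIPHENYL_CLASSES = [
    "antiamebic", "antifungal", "antiinfective", "antihypercholesteremic",
    "antihyperlipoproteinemic", "fasciolicide", "antirheumatic", "analgesic",
    "anti-inflammatory", "antithrombotic", "uricosuric", "antiarrhythmic",
]
BIPHENYL_DRUGS = 16

biphenyl_score = promiscuity(BIPHENYL_CLASSES, BIPHENYL_DRUGS)
```

Taking the set of classes before counting means a class listed for several drugs is counted once, which matches the paper's use of "distinct" classes.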
   However, half of the known drugs fall into only 32                     As an example, the biphenyl molecular framework
shape categories. The drugs which possess these topo-                  (Chart 2) occurs in 16 drugs in our database. The
logical shapes (Chart 1) are quite different in polarity,              CMC lists the following distinct therapeutic classes for
conformation, hydrogen-bonding potential, and other                    these drugs: antiamebic, antifungal, antiinfective, anti-
properties; they bind to different classes of receptor; and            hypercholesteremic, antihyperlipoproteinemic, fascioli-
they serve different pharmacological needs. And yet,                   cide, antirheumatic, analgesic, anti-inflammatory, anti-
they all have the same topological shape.                              thrombotic, uricosuric, and antiarrhythmic. The pharma-
   In part, the results in Chart 1 stem from the simplic-              cological promiscuity parameter for this molecular
ity of our classification scheme, but it also may reflect              framework is therefore 12/16 or 0.75.
some of the properties which are beneficial for producing                 This parameter would be extremely useful for several
drugs. For example, if we consider the set of 32                       purposes such as choosing a scaffold upon which to begin
frameworks in Chart 1, we see that most (23) contain                   a combinatorial design effort. Unfortunately, the exact
at least two six-rings linked or fused together. We also               pharmacological target for each drug is not known, and
see that only three of these frameworks have more than                 often multiple therapeutic categories are listed for
five rotatable bonds.                                                  drugs, so this analysis would require either dealing with
   A “pharmacological promiscuity” parameter could be                  a very restricted subset of drugs or grouping together
provided for each of our frameworks. This was sug-                     similar low-level pharmacological targets.
gested to us by one external reviewer and several                         It is intriguing to consider ways in which our analysis
internal reviewers. This parameter would be defined                    might be used to direct a de novo design effort. For
by the ratio of targets to frameworks, that is, the                    example, on the basis of the above-mentioned observa-
number of pharmacological targets acted upon by drugs                  tion that two six-membered rings are a common motif,
one might begin a de novo exercise by docking two                           (9) Concepts and Applications of Molecular Similarity; Johnson, M.
benzene rings into the active site using shape-based                            A., Maggiora, G. M., Eds.; John Wiley & Sons, Inc.: New York,
                                                                                1990.
methods that ignore electrostatics.18,19 Next, one could                   (10) Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs
link or fuse these rings into a single ligand using one of                      as Molecular Features in Structure-Activity Studies: Definition
several algorithms,20-23 placing special emphasis on                            and Applications. J. Chem. Inf. Comput. Sci. 1985, 25, 82-85.
                                                                           (11) Nilakantan, R.; Bauman, N.; Dixon, J. S.; Venkataraghavan, R.
scaffolds found in Chart 1 (directed linking). Finally,                         Topological Torsion: A New Molecular Descriptor for SAR
one could assign atom types for the ligand based on                             Applications. Comparison with Other Descriptors. J. Chem. Inf.
electrostatic complementarity with the active site,19                           Comput. Sci. 1987, 27, 82-85.
placing special emphasis on the atomic distributions                       (12) Bemis, G. W.; Kuntz, I. D. A fast and efficient method for 2D
                                                                                and 3D molecular shape description. J. Comput.-Aided Mol. Des.
found for the scaffolds found in Charts 2 and 3 (directed                       1992, 6, 607-628.
atom assignment). Some minimization would likely be                        (13) Comprehensive Medicinal Chemistry (CMC-3D) Release 94.1 is
needed as different atomic hybridizations are overlaid                          available from MDL Information Systems Inc., San Leandro, CA.
on the initial benzene fragments.                                          (14) Comprehensive Medicinal Chemistry, Vol. 6; Hansch, C., Sammes,
                                                                                P. G., Taylor, J. B., Series Eds.; Pergamon: Oxford, 1990.
   Many other approaches also are possible. For ex-                        (15) A similar process of removing compounds from the CMC has
ample, one might attempt to utilize the frameworks                              been carried out as part of an analysis of the molecular weights
found in Chart 2. These could be used as seed struc-                            of known drugs: Kim, E. E.; Baker, C. T.; Dwyer, M. D.; Murcko,
                                                                                M. A.; Rao, B. G.; Tung, R. D.; Navia, M. A. Crystal Structure
tures for de novo structure generation by random                                of HIV-1 Protease in Complex with VX-478, a Potent and Orally
combination of fragments24,25 and linkers such as those                         Bioavailable Inhibitor of the Enzyme. J. Am. Chem. Soc. 1995,
in the ILIAD database.23 Finally, our collection of “rings                      117, 1181-1182.
and linkers” in Chart 4 might be used in conjunction                       (16) For a good introduction to molecules as graphs, see: Hansen,
                                                                                P. J.; Jurs, P. C. Chemical Applications of Graph Theory. J.
with fragment perception algorithms26 and similarity                            Chem. Ed. 1988, 65, 574-580.
methods27 to select compounds for synthesis and testing                    (17) Cormen, T. H.; Leiserson, C. E.; Rivest, R. L. Introduction to
from a combinatorial library or compound collection                             Algorithms; MIT Press: Cambridge, 1990; pp 477-485.
                                                                           (18) Kuntz, I. D.; Blaney, J. M.; Oatley, S. J.; Langridge, R.; Ferrin,
database.
                                                                                T. E. A geometric approach to macromolecule-ligand interactions.
   Future research in the area of “drug database mining”                        J. Mol. Biol. 1982, 161, 269-288.
will focus on other properties of known drugs including                    (19) Meng, E. C.; Shoichet, B. K.; Kuntz, I. D. Automated docking
their flexibility, log P, solubility, and a more detailed                       with grid-based energy evaluation. J. Comput. Chem. 1992, 13,
                                                                                505-524.
shape description that includes such features as charge                    (20) Roe, D. C.; Kuntz, I. D. BUILDER v.2: Improving the chemistry
and hydrogen bonding potential.                                                 of a de novo design strategy. J. Comput.-Aided Mol. Des. 1995,
                                                                                9, 269-282.
  Acknowledgment. We would like to thank Ajay,                             (21) Lewis, R. A.; Roe, D. C.; Huang, C.; Ferrin, T. E.; Langridge, R.;
Chris Baker, Joshua Boger, Chris Lepre, Roger Tung,                             Kuntz, I. D. Automated site-directed drug design using molec-
                                                                                ular lattices. J. Mol. Graph. 1992, 10, 66-78.
Pat Walters, Keith Wilson, and Bob Zelle for their                         (22) Lauri, G.; Bartlett, P. A. CAVEAT: a Program to Facilitate the
suggestions and their careful reading of the manuscript.                        Design of Organic Molecules. J. Comput.-Aided Mol. Des. 1994,
We thank Scott Thomas for help with processing the                              8, 51-66.
CMC database.                                                              (23) Gillet, V. J.; Newell, W.; Mata, P.; Myatt, G.; Sike, S.; Zsoldos,
                                                                                Z.; Johnson, A. P. SPROUT: Recent Developments in the De
                                                                                Novo Design of Molecules. J. Chem. Inf. Comput. Sci. 1994, 34,
References                                                                      207-217.
  (1) Available from DAYLIGHT Chemical Information                         (24) Nilakantan, R.; Bauman, N.; Venkataraghavan, R. A Method
      Systems, Inc., Irvine, CA.                                                for Automatic Generation of Novel Chemical Structures and Its
  (2) Available from MDL Information Systems, Inc., San Leandro,                Potential Applications to Drug Discovery. J. Chem. Inf. Comput.
      CA.                                                                       Sci. 1991, 31, 527-530.
  (3) Available from Tripos, Inc., St. Louis, MO.                          (25) Pearlman, D. A.; Murcko, M. A. CONCERTS: Dynamic connec-
  (4) Klingebiel, U.; Specht, K. Automatic Generation of the Chemical           tion of fragments as an approach to de novo ligand design. J.
      Ringcode from a Connectivity Chart, J. Chem. Inf. Comput. Sci.            Med. Chem. 1996, 39, 1651-1663.
      1980, 20, 113-116.                                                   (26) Barakat, M. T.; Dean, P. M. The Atom Assignment Problem in
  (5) Randic, M. Ring ID Numbers, J. Chem. Inf. Comput. Sci. 1988,
                                                                                Automated De Novo Drug Design. 2. A Method for Molecular
      28, 142-147.
  (6) Nilakantan, R.; Bauman, N.; Haraki, K.; Venkataraghavan, R.               Graph and Fragment Perception. J. Comput.-Aided Mol. Des.
      A Ring-Based Chemical Structural Query System: Use of a                   1995, 9, 351-358.
      Novel Ring-Complexity Heuristic. J. Chem. Inf. Comput. Sci.          (27) Willett, P. Algorithms for the Calculation of Similarity in
      1990, 30, 65-68.                                                          Chemical Structure Databases. In Concepts and Applications of
  (7) Domokos, L. Beilstein Ring Search System. 1. General Design.              Molecular Similarity; Johnson, M. A., Maggiora, G. M., Eds.;
      J. Chem. Inf. Comput. Sci. 1993, 33, 663-667.                             John Wiley & Sons, Inc.: New York, 1990; pp 43-64.
  (8) Fan, B. T.; Panaye, A.; Doucet, J.-P.; Barbu, A. Ring Perception.
      A New Algorithm for Directly Finding the Smallest Set of
      Smallest Rings from a Connection Table. J. Chem. Inf. Comput.
      Sci. 1993, 33, 657-662.