0% found this document useful (0 votes)
20 views72 pages

Adv QSAR

Uploaded by

rdsatyadev
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
20 views72 pages

Adv QSAR

Uploaded by

rdsatyadev
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 72

Advanced QSAR

Genetic algorithms
Advanced Methods
• Neural Networks
• Genetic/Evolutionary Algorithms
• Monte Carlo
• Alternate descriptors
– Reduced graphs
– Molecular connectivity indices
– Indicator variables (0 or 1)
• Combinatorics (e.g. multiple substituent sites)
QSAR Prediction Models

The descriptors obtained after feature optimization are used to develop prediction
models. They have used three different approaches to develop QSAR prediction models
– linear MLR, non-linear DT, and ANN as discussed below. These belong to different
classes and would help us in developing the robust QSAR prediction models for this
dataset.

Multiple Linear Regression (MLR)—

The MLR models serve as the basis for a number of multivariate methods. They
establish a quantitative relationship between a group of predictor variables (X) and a
response Y as shown in Equation (4). This relationship is useful for understanding which
predictors have the greatest effect and the direction of the effect.

Here Y is an n × 1 vector of observations, X is an n × p matrix of regressors, β is a p × 1


vector of parameters, and ∈ is an n × 1 vector of normally distributed noise. The aim of
this regression method is to estimate the β8 = (βl,…, βp), by using
Decision Trees (DT)—

The DTs are widely used machine learning methods used in pharmaceutical industry for
predicting the quantitative structure-activity relationships. They are known for their
predictive ability and are easy to interpret. DT approximates the discrete-valued
functions, is robust to noisy data and capable of learning disjunctive expressions.

DTs approach a classification or regression problem in divide-and-conquer fashion. The


decision trees, can classify both categorical and numerical data, but the output attribute
must be categorical. There are no a priori assumptions about the nature of the data, but
the multiple output attributes are not allowed. There are various classes of decision trees
based on the process of construction, pruning methodology and their application.

The classification and regression trees (CART) and random forests (RF) are more widely
used in drug design as well as QSAR. The model tree (i.e., M5 decision tree) is mainly
used for developing models to predict the values. Unlike other decision trees (e.g., CART),
the M5 decision trees store multivariate linear models at their leaf nodes. So, they are
analogous to piecewise linear functions, and hence are considered as nonlinear in nature.
Hybrid-Genetic Algorithm based Descriptor Optimization and
QSAR Models for Predicting the Biological Activity of Tipranavir
Analogs for HIV Protease Inhibition
A. Srinivas Reddy, Sunil Kumar, and Rajni Garg
J Mol Graph Model. 2010 June ; 28(8): 852–862. doi:10.1016/j.jmgm.2010.03.005.
GA-PLS Method for QSAR.

The algorithm of the GA-PLS method is implemented as follows.

Step 1. The descriptors are generated using one of the available methods.

Step 2. All applicable descriptors are enumerated arbitrarily, and this enumeration
is maintained throughout the whole analysis. A population of 100 different random
combinations of these descriptors is generated. In order to apply GA methodology,
each combination is considered as a parent. Each parent represents a binary string
of digits, either one or zero; the length of each string is the same and is equal to the
total number of descriptors (indices). The value of one implies that the
corresponding descriptor is included for the parent, and zero means
that the descriptor is excluded.

Step 3. Using each parent combination of descriptors, a QSAR equation is generated


for the whole dataset using the PLS algorithm; thus for each parent an initial value of
q2 is obtained.
The [1 - (n - 1)(1 –q2)/(n - c)] value, where q2 is cross-validated r2, n is the number
of compounds, and c is the optimal number of components, is then used as the
fitting function to guide GA.
M5 decision trees are constructed by first using a decision tree induction algorithm to
build the initial tree, and then a multivariate linear regression model is constructed for
each node of the tree. Each linear model is simplified by eliminating the parameters (by
using a greedy search method) to minimize its estimated error. Then pruning of the tree
is performed by examining each non-leaf node of the model tree to assign the linear
model with lower estimated error to leaf nodes. Lastly smoothing process is applied to
improve the prediction accuracy of the tree.

The advantage of the M5 tree over CART is that it is much smaller than CART, the
decision is clear and the regression functions do not normally involve too many
variables. We have used the model tree program implemented in Weka [42]. Various
parameters and measures used for evaluating the fitness of the DT models.
Artificial Neural Network (ANN)—
The ANN is used to identify correlated patterns between the input and target values and
can subsequently predict outcomes from fresh inputs. The ANNs generally consist of a
number of interconnected processing elements or neurons. The descriptors obtained by
feature selection are used to build the inputs of the ANN. Each input is associated with
some weights (w) and biases (b) depending on the relative importance of the input.
We used a 3-layer ANN with back propagation for fitness evaluation of the population (in
hybrid GA-ANN) as well as for the QSAR prediction model development. We investigated
the performance of different learning schemes (e.g., Bayesian [48], scaled conjugate
gradient , gradient descent, etc.), and observed that the Bayesian regularization (BR)
learning showed the best performance for feature selection as well as biological activity
prediction of this dataset.
We used the Bayesian learning algorithm by MacKay, and Foresee and Hagan ,
implemented in Matlab ANN toolbox. The training function in Bayesian learning
algorithm updates the weight and bias values according to Levenberg-Marquardt
optimization procedure. It minimizes a combination of sum of squared errors generated by
the outputs (ED) and sum of squares of weights that reflect the connections of network
(Ew), and then determines the correct combination in order to produce a network that
generalizes well. Thus the performance index modification (F) involves taking into account
the sum of squares of the network weights (equation 5). Then optimization technique is
used to minimize F.
Here α and β are black box parameters and do not represent the momentum factor and
learning rat
Over fitting and Model Validation—

To avoid over fitting, they used the 10-fold crossvalidation method to estimate the fitness
of the model and compare all three types of Q SAR models (i.e., MLR, DT and ANN based
models). In 10-fold cross validation method, the data is partitioned into 10 sets of size
n/10 each. Among them, nine sets are used for training and the remaining one set is used
for testing. The procedure was repeated 10-times, and average accuracy was computed.
Over fitting in ANN is also avoided by using the BR training algorithm. The oversized
networks which can cause over fitting can also be avoided by pruning away unnecessary
neurons or starting with a low number of neurons and then gradually increasing as
necessary while watching the generalization performance [54]. Reducing the number of
inputs is also beneficial, which was done with the feature optimization as discussed earlier
in this section.
For model validation, various measures such as correlation coefficient (R) (Equation (6a)),
root mean squared error (RMSE) (Equation (6b)) and cross validated R-square (R2
cv) (Equation (6c)) were used. These measures are computed from the difference between
the experimental values (Yi) and the predicted value (Ŷi) for the ith compound as discussed
below.

The linear correlation coefficient (R), measures the strength and the direction of a linear
relationship between (Yi) and (Ŷi). Here n is the number of pairs of data.
The root mean squared error (RMSE) and cross validated R-square (Rcv 2) values give
the variance of predicted activity from the experimental activity. They are good
measures to compare the fitness of the models and are defined below.
The algorithm of the GA-PLS method (contd)

Step 4. Two parents are selected randomly based on the roulette wheel selection
method (i. e., high fitting parents are more likely to be selected).

Step 5. The population is evolved by performing a crossover between two randomly


selected parents which produces two offsprings.

Step 6. Each offspring is subjected to a random single-point mutation, i.e. randomly


selected one (or zero) is changed to zero (or one).

Step 7. The fitness of each offspring is evaluated as described above


(cf. Step 4).

Step 8. If the resulting offsprings are characterized by a higher value of fitness


function, then they replace parents; otherwise, parents are kept.

Step 9. Steps 4 - 7 are repeated until a predefined maximum number of crossovers


are reached.
Generation of QSAR Models.

The 28 bradykinin potentiating pentapeptides were included in the training set to


generate QSAR equations using the GA-PLS method.

The two most active compounds, VEWAK and VKWAP, were excluded from the
training set.

The calculated log RAI values compared favorably with the experimental data (data
not shown).

The equations correctly predicted them to have higher activities compared to


activities of compounds in the training set (the log RAI values of 1.79, 1.48, and 1.47
were obtained for VEWAK using ISA-ECI, Z1-Z2-Z3, and topological indices,
respectively, and the log RAI values of 1.80, 1.74, and 1.95 were obtained for VKWAP
using ISA-ECI, Z1-Z2-Z3, and topological indices as descriptors, respectively).

The statistics obtained from the PLS regression analyses and the GA-PLS method
applied to the training set using ISA-ECI, Z1-Z2-Z3, and topological indices
are shown in Table 1.
In order to test the reliability of the prediction using pre-constructed QSAR
equations with these descriptors, we incorporated the modified “degree of fit”
condition.

According to this condition, if RSD of dependent variables of a virtual peptide is less


than the RSD of the X matrix of the training set, the predicted values are considered
to be reliable. If this condition is not met the log RAI of the virtual peptide is not
predicted or set to a low log RAI number to avoid selecting it.

The condition does not allow the Focus-2D program to over-extrapolate. Since the
number of peptides in the training set is very small compared to theoretical number
of different pentapeptides (3.2 million), the extrapolation of QSAR relationship
should be done very carefully in small increments, and the “degree of fit” condition
implemented here allows us to do this.

The RSD values (of the X matrix of the training set) of 0.886, 0.818, and 0.195 were
obtained for ISA-ECI, Z1-Z2-Z3, and topological indices description methods,
respectively and used to test the reliability of the prediction (Table 1).
Int. J. Mol. Sci. 2008, 9, 1961-1976; DOI:
10.3390/ijms9101961

In the autocorrelation vectors calculated by ADRIANA.Code, the hydrogen atoms


were included.
2D molecular autocorrelation vectors [32] for physicochemical atomic properties
were calculated for each molecule by using the following equation:

where A(d) is the topological autocorrelation coefficient referring to atom pairs i, j which
are separated by d bonds. pi is an atomic property, e.g. the σ charge on atom i. Thus, for
each compound, a series of coefficients for different topological distances d, a so-called
autocorrelation vector is obtained; Seven distances from distance of d=0 to d=6 were
considered. Seven atomic properties are represented by pi: σ charge (SigChg) [33-34], π
charge (PiChg) [35], total charges (TotChg), σ electronegativity (SigEN), π
electronegativity (PiEN), lone-pair electronegativity (LpEN) and atomic polarizability
(Apolariz) [36].
For example, ethanol (Figure 1) has three pairs of atoms that are separated by four
bonds: H1-H4, H2-H4 and H3-H4. Thus, the corresponding autocorrelation for the
topological distance four computes to:
Support Vector Machine (SVM) Analysis

The Libsvm program was used to build SVM models [22]. This software is based on the
function of classification. After some improvement, it can also be applied to the
regression problem well. More introductions and implementations about Libsvm can
be found in their website [44-45]. The Libsvm regression was realized by the ε-
Support Vector Regression (ε-SVR) with a radial basis function (RBF) kernel function.

The ε-SVR algorithm is a generalization of the better known support vector


classification algorithm to the regression case. Given n training vectors xi and a vector
y Rn ∈ such that i y R ∈ , we want to find an estimate for the function y = f(x) which is
optimal from a structural risk minimization viewpoint. According to ε-SVR, this
estimate is:

where b is a bias term and k(xi,xj) is a special function called the kernel. The
coefficients ai and ai* are the solutions of the quadratic problem:
Performance of the models
• Parameters that include sensitivity, specificity, accuracy, and MCC.
• These parameters can be represented by following equations:

Where TP and TN are correctly predicted positive and negative examples; FP and FN are falsely
predicted positive and negative examples. MCC is a Matthew's correlation coefficient, which
consider over and under prediction;

MCC 1 is regarded as a perfect prediction, whereas 0 is regarded as random prediction.


parameters C and ε can be chosen by the user. The “penalty parameter” C may be as high
as infinity, while usual values for ε are 0.1 or 000.1.

The kernel function is used to convert the data into a higher-dimensional space in order to
account for nonlinearities in the estimate function. A commonly used kernel is the Radial
Basis Function (RBF) kernel:

The parameter γ is selected by the user.


According to the program guide, two necessary steps had to be taken in advance: the
scaling of input data and searching for best parameters. The input data (the descriptors
selected by genetic-PLS) was compressed into [0.1, 0.9] through the formula:

where x was the original value, and x* is the scaled value. xmin and xmax are the
corresponding minimum and maximum values of the descriptor variable, respectively.
There are three parameters to adjust the efficiency of Libsvm program: C, γ and ε. An
autosearching program named “grid regression” was adopted. It could search for best
parameters C, γ and ε through a leave-k-out cross validation method. Meanwhile,
overfitting of training set could be prevented. Here a leave-25%-out cross validation was
carried out. Manual searches were then performed around the leave-25%-out cross
validation results to select the best parameters.
Partial Least Square (PLS) Models

Partial least square analysis was carried out with six selected ADRIANA.Code descriptors, six
selected Cerius2 descriptors and nine combined descriptors to build Model 1A, Model 2A
and Model 3A, respectively. 380 compounds in the training set were used to build models,
172 compounds in the test set were used to predict human intestinal absorption (HIA).
The equations were obtained as follows:
Hybrid-Genetic Algorithm based Descriptor Optimization and QSAR Models for
Predicting the Biological Activity of Tipranavir Analogs for HIV Protease Inhibition
A. Srinivas Reddy, Sunil Kumar, and Rajni Garg
J Mol Graph Model. 2010 June ; 28(8): 852–862. doi:10.1016/j.jmgm.2010.03.005.

identifying a small subset of descriptors, which represent the total set of descriptors, is
important for developing a good QSAR prediction model. We have used
the hybrid-GA optimization technique, where GA is used for searching the descriptor
subspace, whereas the MLR, CFS, DT or ANN is used for fitness evaluation.

GA is governed by biological evolution rules and can investigate several possible


solutions simultaneously, each of which explores different regions in the descriptor
space. Fitness of each solution is evaluated by one of the above mentioned techniques.

The MLR and DT are the linear fitness functions whereas ANN is a non-linear function.
The use of ANN can handle the optimization of non-linear descriptor space more
efficiently [19].
The major steps employed in the hybrid-GA scheme are

The first step in the hybrid GA is to create a population of N individuals (feature subsets).
Each individual encodes the same number of randomly chosen descriptors, and the
fitness of each individual in this generation is determined. The compounds in the dataset
were divided into a training (66%) and test set (34%). The test set is not used during
training but serves to test the predictive ability of final models. Here, the root mean
squared error (RMSE) is taken as the fitness measure. Next, a fraction of children of the
next generation is produced by crossover and the rest by mutation from the parents on
the basis of their scaled fitness scores. The new offspring contains characteristics from
one or both parents, and is evaluated for fitness. The cycle continues for a
predetermined number of generations, or until the results do not change
continuously for a specified number of generations.

The GA-MLR, GA-DT and GA-CFS techniques are implemented using WEKA program, and
the GA-ANN is implemented in MATLAB. The values of various hybrid-GA
parameters and measures are discussed in Section III.A. Since the MLR, DT and ANN are
also used for developing the QSAR models, they are discussed in the next section,
whereas the GA-CFS is discussed below.
GA-CFS—As shown in Figure 5, the correlation-based feature selection (CFS)
algorithm evaluates each feature subset by considering the individual
predictive ability of each feature along with the degree of redundancy
between them, and returns a numeric measure that guides the search.

The CFS fitness function takes into account the usefulness of individual
features for predicting the activity along with the level of inter-correlation to
give the goodness of feature subsets. It has wide range of applications for
feature selection including QSAR.

CFS is a simple filter algorithm that ranks feature subsets according to a


correlation based heuristic evaluation function (see Equation (1)). The bias of
the evaluation function is toward subsets that contain features that are highly
correlated with the class and yet uncorrelated with each other.
Descriptor Generation

Descriptors were generated using the chemical modeling program


Cerius2 (Accelrys Inc, San Diego, CA) and
QSARIS (MDL Information Systems Inc, San Liandro, CA).

Descriptors were divided into one of several physical property categories:


conformational, electronic, informational, molecular shape analysis (MSA), quantum
mechanical, receptor, spatial, structural, thermodynamic, or topological.

The same descriptors were used for the generation of both the self-organizing
map (SOM) and genetic functional approximation (GFA) algorithm models.

Self-organizing Maps With Descriptor Selection

In order to nonlinearly relate bioactivity and molecular features, a set of


computational models were developed using a custom-built stepwise descriptor
selection algorithm coupled to a supervised SOM.

30 The algorithm was developed using Matlab 6.1 with the SOM Toolbox 2.0
The SOM-Stepwise Feature Reduction Algorithm

A SOM-oriented variation of a forward entry step-wise regression algorithm was


employed to select a small number of independent variables from which the SOM
model was constructed.

The algorithm used is identical to the forward entry stepwise method with the following
variations:

1. the entry statistic was defi ned to be the percentage of training set molecules
correctly classifi ed by the model, any increase of which resulted in the inclusion of
the descriptors being tested;

2. the entry statistic was calculated by constructing a SOM using a supervised learning
algorithm; and

3. the descriptors to be entered into the model were chosen in pairs, 1 from each of 2
randomized lists of the full set of descriptors.

Algorithm training was conducted iteratively until some convergence criteria were met.
Genetic Functional Activity Algorithms

GFAs as described by Rogers and Hopfinger 20 were implemented using Cerius


2 software (Version 4.6, Accelrys Inc, San Diego, CA).

GFAs were conducted using static 100 member populations; linear, quadratic, and
offset quadratic basis functions;

the mutation probability for both additions and deletions was set at 0.2%.

Lack-of-fit criterion was set at 1.0, and initial equation length was set to 3.

Models were evolved over 50 000 to 80 000 generations using standard


convergence criterion. \\]

The AAPS Journal 2005; 7 (3) Article 68


Self-organizing Maps

U-matrix maps represent the descriptor-based relationships of the training set of


molecules by the SOM and are constructed by projecting all selected descriptor ’ s
contributions to the mapping.

D-matrix maps indicate the classifi cation mapping of the SOM, as learned from the
corresponding training sets. All 3 maps illustrate strong clustering.

The color level in all the maps (see Figures 2-4 ) describes the relative clustering in
the data; the darker areas indicate small distances, while the lighter areas indicate
larger distances.

The colored hexagons indicate the points on the map corresponding to training
molecules and are colored by the affinity classification and labeled with the
corresponding training molecule names
Figure 1 shows the schematic diagram of targeted combinatorial library design
using FOCUS-2D which consists of description, evaluation, and optimization steps.
Molconn-X program was used to generate topological descriptors (indices) for
pentapeptides. These descriptors have been developed by Kier and Hall on the basis
of chemical graph theory.

Programs implemented in FOCUS-2D as well as genetic algorithms-partial least


squares (GA-PLS) routine for QSAR developed earlier were written in C programming
language. The descriptor variables were autoscaled prior to PLS and GA-PLS9
calculations. All calculations were done on the IBM RS6000 workstation (Model 340).
e-LEA3D: a computational-aided drug design web server
Dominique Douguet
Nucleic Acids Research, 2010, Vol. 38, Web Server issue W615–W621
e-LEA3D web server integrates three complementary tools to perform computer-aided
drug design based on molecular fragments. In drug discovery projects, there is a
considerable interest in identifying novel and diverse molecular scaffolds to
enhance chances of success. The de novo drug design tool is used to invent new ligands to
optimize a user-specified scoring function. The composite scoring function includes both
structure- and ligand-based evaluations. The de novo approach is an alternative to a blind
virtual screening of large compound collections.

A heuristic based on a genetic algorithm rapidly finds which fragments or combination of


fragments fit a QSAR model or the binding site of a protein. While the approach is ideally
suited for scaffold-hopping, this module also allows a scan for possible substituents to a
user-specified scaffold.
The second tool offers a traditional virtual screening and filtering of an uploaded library of
compounds.
The third module addresses the combinatorial library design that is based on a user-drawn
scaffold and reactants coming, for example, from a chemical supplier.

The e-LEA3D server is available at: http://bioinfo.ipmc.cnrs.fr/lea.html.


Monte Carlo
Protein design automation BASSIL I. DAHIYK

Protein Science (1996), 5:895-903. C


The PDA side-chain selection algorithm requires as input a back-bone structure
defining the desired fold.

Using a rotamer description of side chains, an optimal sequence for a backbone


can be found by screening all possible sequences of rotamers, where each backbone
position can be occupied by each amino acid in all its possible rotameric states.

The discrete nature of rotamer sets allows a simple calculation of the number of
rotamer sequences to be tested. A backbone of length n with m possible rotamers
per position will have mn possible rotamer sequences.

The size of the search space grows exponentially with sequence length, which, for
typical values of n and m, render intractable an exhaustive search. This
combinatorial “explosion” is the primary obstacle to be overcome in the simulation
phase of PDA.
Simulation algorithm

We use an extension of the Dead-End Elimination (DEE) theorem (Desmet et al.,


1992, 1994; Goldstein, 1994) to solve the combinatorial search problem.

The DEE theorem is the basis for a very fast discrete search algorithm that was
designed to pack protein side chains on a fixed backbone with a known sequence.

Side chains are described by rotamers and an atomistic force field is used to
score rotamer arrangements. The DEE theorem guarantees that, if the
algorithm converges, the global optimum packing is found. The DEE method is
readily extended to our inverse folding design paradigm by simply releasing the
constraint that a position is limited to the rotamers of a single amino acid.

This extension of DEE greatly increases the number of rotamers at each


position and requires a significantly modified implementation to ensure
convergence (B.I. Dahiy & S.L. Mayo). The guarantee that only the
global optimum will be found is still valid.

The initial scoring function for sequence arrangements used in the search was an
atomic van der Waals potential. The van der Waals potential reflects excluded
volume and steric packing interactions, which are important determinants of the
specific three-dimensional arrangement of protein side chains.
Following DEE optimization, a rank-ordered list of sequences is generated by a
Monte Carlo search in the neighborhood of the DEE solution. This list of
sequences is necessary because of possible differences between the theoretical
and actual potential surfaces. Starting at the DEE solution, random positions are
changed to other rotamers, and the new sequence energy is calculated. If the
new sequence meets the Boltzmann criteria for acceptance, it is used as the
starting point for another jump (Metropolis et al., 1953).

After a predetermined number of jumps, the best scoring sequences are output
as a rank- ordered list. Starting at the global optimum is critical for the Monte
Carlo routine to find high-scoring sequences and to avoid searching low-scoring
regions of sequence space.

Hence, the DEE algorithm and the Monte Carlo search are both critical for
providing candidate sequences for experimental testing.
Genetic function approximation (GFA) was performed in the CERIUS2 simulation
package version 1.6 (Biosym/Molecular Simulations, San Diego, California).

An initial population of 300 equations was generated consisting of random


combinations of three properties. Only linear terms were used and initial co-
efficients were determined by least-squares regression for each set of properties.
Redundant equations were eliminated and 10,000 generations of random
crossover mutations were performed.

If a mutant had a better score than the worst equation in the population, the
mutant replaced the worst equation. Also, mutation operators that add or remove
terms had a 50% probability of being applied each generation, but these
mutations were only accepted if the score was improved. No equation with greater
than three terms was allowed. Equations were scored during evolution using the
lack of fit (LOF) parameter, a scaled least-square error (LSE) measure that penalizes
equations with more terms and hence resists overfitting. LOF is defined as:

where c is the number of terms in the equation and M is the number of data
points. Correlated properties cannot be differentiated by QSAR techniques and
only create redundancy in the derived relations.

You might also like