0% found this document useful (0 votes)

20 views72 pages

Adv QSAR

Uploaded by

rdsatyadev

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

20 views72 pages

Adv QSAR

Uploaded by

rdsatyadev

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 72

Advanced QSAR

Genetic algorithms
Advanced Methods
• Neural Networks
• Genetic/Evolutionary Algorithms
• Monte Carlo
• Alternate descriptors
– Reduced graphs
– Molecular connectivity indices
– Indicator variables (0 or 1)
• Combinatorics (e.g. multiple substituent sites)
QSAR Prediction Models

The descriptors obtained after feature optimization are used to develop prediction
models. They have used three different approaches to develop QSAR prediction models
– linear MLR, non-linear DT, and ANN as discussed below. These belong to different
classes and would help us in developing the robust QSAR prediction models for this
dataset.

Multiple Linear Regression (MLR)—

The MLR models serve as the basis for a number of multivariate methods. They
establish a quantitative relationship between a group of predictor variables (X) and a
response Y as shown in Equation (4). This relationship is useful for understanding which
predictors have the greatest effect and the direction of the effect.

Here Y is an n × 1 vector of observations, X is an n × p matrix of regressors, β is a p × 1

vector of parameters, and ∈ is an n × 1 vector of normally distributed noise. The aim of
this regression method is to estimate the β8 = (βl,…, βp), by using
Decision Trees (DT)—

The DTs are widely used machine learning methods used in pharmaceutical industry for
predicting the quantitative structure-activity relationships. They are known for their
predictive ability and are easy to interpret. DT approximates the discrete-valued
functions, is robust to noisy data and capable of learning disjunctive expressions.

DTs approach a classification or regression problem in divide-and-conquer fashion. The

decision trees, can classify both categorical and numerical data, but the output attribute
must be categorical. There are no a priori assumptions about the nature of the data, but
the multiple output attributes are not allowed. There are various classes of decision trees
based on the process of construction, pruning methodology and their application.

The classification and regression trees (CART) and random forests (RF) are more widely
used in drug design as well as QSAR. The model tree (i.e., M5 decision tree) is mainly
used for developing models to predict the values. Unlike other decision trees (e.g., CART),
the M5 decision trees store multivariate linear models at their leaf nodes. So, they are
analogous to piecewise linear functions, and hence are considered as nonlinear in nature.
Hybrid-Genetic Algorithm based Descriptor Optimization and
QSAR Models for Predicting the Biological Activity of Tipranavir
Analogs for HIV Protease Inhibition
A. Srinivas Reddy, Sunil Kumar, and Rajni Garg
J Mol Graph Model. 2010 June ; 28(8): 852–862. doi:10.1016/j.jmgm.2010.03.005.
GA-PLS Method for QSAR.

The algorithm of the GA-PLS method is implemented as follows.

Step 1. The descriptors are generated using one of the available methods.

Step 2. All applicable descriptors are enumerated arbitrarily, and this enumeration
is maintained throughout the whole analysis. A population of 100 different random
combinations of these descriptors is generated. In order to apply GA methodology,
each combination is considered as a parent. Each parent represents a binary string
of digits, either one or zero; the length of each string is the same and is equal to the
total number of descriptors (indices). The value of one implies that the
corresponding descriptor is included for the parent, and zero means
that the descriptor is excluded.

Step 3. Using each parent combination of descriptors, a QSAR equation is generated

for the whole dataset using the PLS algorithm; thus for each parent an initial value of
q2 is obtained.
The [1 - (n - 1)(1 –q2)/(n - c)] value, where q2 is cross-validated r2, n is the number
of compounds, and c is the optimal number of components, is then used as the
fitting function to guide GA.
M5 decision trees are constructed by first using a decision tree induction algorithm to
build the initial tree, and then a multivariate linear regression model is constructed for
each node of the tree. Each linear model is simplified by eliminating the parameters (by
using a greedy search method) to minimize its estimated error. Then pruning of the tree
is performed by examining each non-leaf node of the model tree to assign the linear
model with lower estimated error to leaf nodes. Lastly smoothing process is applied to
improve the prediction accuracy of the tree.

The advantage of the M5 tree over CART is that it is much smaller than CART, the
decision is clear and the regression functions do not normally involve too many
variables. We have used the model tree program implemented in Weka [42]. Various
parameters and measures used for evaluating the fitness of the DT models.
Artificial Neural Network (ANN)—
The ANN is used to identify correlated patterns between the input and target values and
can subsequently predict outcomes from fresh inputs. The ANNs generally consist of a
number of interconnected processing elements or neurons. The descriptors obtained by
feature selection are used to build the inputs of the ANN. Each input is associated with
some weights (w) and biases (b) depending on the relative importance of the input.
We used a 3-layer ANN with back propagation for fitness evaluation of the population (in
hybrid GA-ANN) as well as for the QSAR prediction model development. We investigated
the performance of different learning schemes (e.g., Bayesian [48], scaled conjugate
gradient , gradient descent, etc.), and observed that the Bayesian regularization (BR)
learning showed the best performance for feature selection as well as biological activity
prediction of this dataset.
We used the Bayesian learning algorithm by MacKay, and Foresee and Hagan ,
implemented in Matlab ANN toolbox. The training function in Bayesian learning
algorithm updates the weight and bias values according to Levenberg-Marquardt
optimization procedure. It minimizes a combination of sum of squared errors generated by
the outputs (ED) and sum of squares of weights that reflect the connections of network
(Ew), and then determines the correct combination in order to produce a network that
generalizes well. Thus the performance index modification (F) involves taking into account
the sum of squares of the network weights (equation 5). Then optimization technique is
used to minimize F.
Here α and β are black box parameters and do not represent the momentum factor and
learning rat
Over fitting and Model Validation—

To avoid over fitting, they used the 10-fold crossvalidation method to estimate the fitness
of the model and compare all three types of Q SAR models (i.e., MLR, DT and ANN based
models). In 10-fold cross validation method, the data is partitioned into 10 sets of size
n/10 each. Among them, nine sets are used for training and the remaining one set is used
for testing. The procedure was repeated 10-times, and average accuracy was computed.
Over fitting in ANN is also avoided by using the BR training algorithm. The oversized
networks which can cause over fitting can also be avoided by pruning away unnecessary
neurons or starting with a low number of neurons and then gradually increasing as
necessary while watching the generalization performance [54]. Reducing the number of
inputs is also beneficial, which was done with the feature optimization as discussed earlier
in this section.
For model validation, various measures such as correlation coefficient (R) (Equation (6a)),
root mean squared error (RMSE) (Equation (6b)) and cross validated R-square (R2
cv) (Equation (6c)) were used. These measures are computed from the difference between
the experimental values (Yi) and the predicted value (Ŷi) for the ith compound as discussed
below.

The linear correlation coefficient (R), measures the strength and the direction of a linear
relationship between (Yi) and (Ŷi). Here n is the number of pairs of data.
The root mean squared error (RMSE) and cross validated R-square (Rcv 2) values give
the variance of predicted activity from the experimental activity. They are good
measures to compare the fitness of the models and are defined below.
The algorithm of the GA-PLS method (contd)

Step 4. Two parents are selected randomly based on the roulette wheel selection
method (i. e., high fitting parents are more likely to be selected).

Step 5. The population is evolved by performing a crossover between two randomly

selected parents which produces two offsprings.

Step 6. Each offspring is subjected to a random single-point mutation, i.e. randomly

selected one (or zero) is changed to zero (or one).

Step 7. The fitness of each offspring is evaluated as described above

(cf. Step 4).

Step 8. If the resulting offsprings are characterized by a higher value of fitness

function, then they replace parents; otherwise, parents are kept.

Step 9. Steps 4 - 7 are repeated until a predefined maximum number of crossovers

are reached.
Generation of QSAR Models.

The 28 bradykinin potentiating pentapeptides were included in the training set to

generate QSAR equations using the GA-PLS method.

The two most active compounds, VEWAK and VKWAP, were excluded from the
training set.

The calculated log RAI values compared favorably with the experimental data (data
not shown).

The equations correctly predicted them to have higher activities compared to

activities of compounds in the training set (the log RAI values of 1.79, 1.48, and 1.47
were obtained for VEWAK using ISA-ECI, Z1-Z2-Z3, and topological indices,
respectively, and the log RAI values of 1.80, 1.74, and 1.95 were obtained for VKWAP
using ISA-ECI, Z1-Z2-Z3, and topological indices as descriptors, respectively).

The statistics obtained from the PLS regression analyses and the GA-PLS method
applied to the training set using ISA-ECI, Z1-Z2-Z3, and topological indices
are shown in Table 1.
In order to test the reliability of the prediction using pre-constructed QSAR
equations with these descriptors, we incorporated the modified “degree of fit”
condition.

According to this condition, if RSD of dependent variables of a virtual peptide is less

than the RSD of the X matrix of the training set, the predicted values are considered
to be reliable. If this condition is not met the log RAI of the virtual peptide is not
predicted or set to a low log RAI number to avoid selecting it.

The condition does not allow the Focus-2D program to over-extrapolate. Since the
number of peptides in the training set is very small compared to theoretical number
of different pentapeptides (3.2 million), the extrapolation of QSAR relationship
should be done very carefully in small increments, and the “degree of fit” condition
implemented here allows us to do this.

The RSD values (of the X matrix of the training set) of 0.886, 0.818, and 0.195 were
obtained for ISA-ECI, Z1-Z2-Z3, and topological indices description methods,
respectively and used to test the reliability of the prediction (Table 1).
Int. J. Mol. Sci. 2008, 9, 1961-1976; DOI:
10.3390/ijms9101961

In the autocorrelation vectors calculated by ADRIANA.Code, the hydrogen atoms

were included.
2D molecular autocorrelation vectors [32] for physicochemical atomic properties
were calculated for each molecule by using the following equation:

where A(d) is the topological autocorrelation coefficient referring to atom pairs i, j which
are separated by d bonds. pi is an atomic property, e.g. the σ charge on atom i. Thus, for
each compound, a series of coefficients for different topological distances d, a so-called
autocorrelation vector is obtained; Seven distances from distance of d=0 to d=6 were
considered. Seven atomic properties are represented by pi: σ charge (SigChg) [33-34], π
charge (PiChg) [35], total charges (TotChg), σ electronegativity (SigEN), π
electronegativity (PiEN), lone-pair electronegativity (LpEN) and atomic polarizability
(Apolariz) [36].
For example, ethanol (Figure 1) has three pairs of atoms that are separated by four
bonds: H1-H4, H2-H4 and H3-H4. Thus, the corresponding autocorrelation for the
topological distance four computes to:
Support Vector Machine (SVM) Analysis

The Libsvm program was used to build SVM models [22]. This software is based on the
function of classification. After some improvement, it can also be applied to the
regression problem well. More introductions and implementations about Libsvm can
be found in their website [44-45]. The Libsvm regression was realized by the ε-
Support Vector Regression (ε-SVR) with a radial basis function (RBF) kernel function.

The ε-SVR algorithm is a generalization of the better known support vector

classification algorithm to the regression case. Given n training vectors xi and a vector
y Rn ∈ such that i y R ∈ , we want to find an estimate for the function y = f(x) which is
optimal from a structural risk minimization viewpoint. According to ε-SVR, this
estimate is:

where b is a bias term and k(xi,xj) is a special function called the kernel. The
coefficients ai and ai* are the solutions of the quadratic problem:
Performance of the models
• Parameters that include sensitivity, specificity, accuracy, and MCC.
• These parameters can be represented by following equations:

Where TP and TN are correctly predicted positive and negative examples; FP and FN are falsely
predicted positive and negative examples. MCC is a Matthew's correlation coefficient, which
consider over and under prediction;

MCC 1 is regarded as a perfect prediction, whereas 0 is regarded as random prediction.

parameters C and ε can be chosen by the user. The “penalty parameter” C may be as high
as infinity, while usual values for ε are 0.1 or 000.1.

The kernel function is used to convert the data into a higher-dimensional space in order to
account for nonlinearities in the estimate function. A commonly used kernel is the Radial
Basis Function (RBF) kernel:

The parameter γ is selected by the user.

According to the program guide, two necessary steps had to be taken in advance: the
scaling of input data and searching for best parameters. The input data (the descriptors
selected by genetic-PLS) was compressed into [0.1, 0.9] through the formula:

where x was the original value, and x* is the scaled value. xmin and xmax are the
corresponding minimum and maximum values of the descriptor variable, respectively.
There are three parameters to adjust the efficiency of Libsvm program: C, γ and ε. An
autosearching program named “grid regression” was adopted. It could search for best
parameters C, γ and ε through a leave-k-out cross validation method. Meanwhile,
overfitting of training set could be prevented. Here a leave-25%-out cross validation was
carried out. Manual searches were then performed around the leave-25%-out cross
validation results to select the best parameters.
Partial Least Square (PLS) Models

Partial least square analysis was carried out with six selected ADRIANA.Code descriptors, six
selected Cerius2 descriptors and nine combined descriptors to build Model 1A, Model 2A
and Model 3A, respectively. 380 compounds in the training set were used to build models,
172 compounds in the test set were used to predict human intestinal absorption (HIA).
The equations were obtained as follows:
Hybrid-Genetic Algorithm based Descriptor Optimization and QSAR Models for
Predicting the Biological Activity of Tipranavir Analogs for HIV Protease Inhibition
A. Srinivas Reddy, Sunil Kumar, and Rajni Garg
J Mol Graph Model. 2010 June ; 28(8): 852–862. doi:10.1016/j.jmgm.2010.03.005.

identifying a small subset of descriptors, which represent the total set of descriptors, is
important for developing a good QSAR prediction model. We have used
the hybrid-GA optimization technique, where GA is used for searching the descriptor
subspace, whereas the MLR, CFS, DT or ANN is used for fitness evaluation.

GA is governed by biological evolution rules and can investigate several possible

solutions simultaneously, each of which explores different regions in the descriptor
space. Fitness of each solution is evaluated by one of the above mentioned techniques.

The MLR and DT are the linear fitness functions whereas ANN is a non-linear function.
The use of ANN can handle the optimization of non-linear descriptor space more
efficiently [19].
The major steps employed in the hybrid-GA scheme are

The first step in the hybrid GA is to create a population of N individuals (feature subsets).
Each individual encodes the same number of randomly chosen descriptors, and the
fitness of each individual in this generation is determined. The compounds in the dataset
were divided into a training (66%) and test set (34%). The test set is not used during
training but serves to test the predictive ability of final models. Here, the root mean
squared error (RMSE) is taken as the fitness measure. Next, a fraction of children of the
next generation is produced by crossover and the rest by mutation from the parents on
the basis of their scaled fitness scores. The new offspring contains characteristics from
one or both parents, and is evaluated for fitness. The cycle continues for a
predetermined number of generations, or until the results do not change
continuously for a specified number of generations.

The GA-MLR, GA-DT and GA-CFS techniques are implemented using WEKA program, and
the GA-ANN is implemented in MATLAB. The values of various hybrid-GA
parameters and measures are discussed in Section III.A. Since the MLR, DT and ANN are
also used for developing the QSAR models, they are discussed in the next section,
whereas the GA-CFS is discussed below.
GA-CFS—As shown in Figure 5, the correlation-based feature selection (CFS)
algorithm evaluates each feature subset by considering the individual
predictive ability of each feature along with the degree of redundancy
between them, and returns a numeric measure that guides the search.

The CFS fitness function takes into account the usefulness of individual
features for predicting the activity along with the level of inter-correlation to
give the goodness of feature subsets. It has wide range of applications for
feature selection including QSAR.

CFS is a simple filter algorithm that ranks feature subsets according to a

correlation based heuristic evaluation function (see Equation (1)). The bias of
the evaluation function is toward subsets that contain features that are highly
correlated with the class and yet uncorrelated with each other.
Descriptor Generation

Descriptors were generated using the chemical modeling program

Cerius2 (Accelrys Inc, San Diego, CA) and
QSARIS (MDL Information Systems Inc, San Liandro, CA).

Descriptors were divided into one of several physical property categories:

conformational, electronic, informational, molecular shape analysis (MSA), quantum
mechanical, receptor, spatial, structural, thermodynamic, or topological.

The same descriptors were used for the generation of both the self-organizing
map (SOM) and genetic functional approximation (GFA) algorithm models.

Self-organizing Maps With Descriptor Selection

In order to nonlinearly relate bioactivity and molecular features, a set of

computational models were developed using a custom-built stepwise descriptor
selection algorithm coupled to a supervised SOM.

30 The algorithm was developed using Matlab 6.1 with the SOM Toolbox 2.0
The SOM-Stepwise Feature Reduction Algorithm

A SOM-oriented variation of a forward entry step-wise regression algorithm was

employed to select a small number of independent variables from which the SOM
model was constructed.

The algorithm used is identical to the forward entry stepwise method with the following
variations:

1. the entry statistic was deﬁ ned to be the percentage of training set molecules
correctly classiﬁ ed by the model, any increase of which resulted in the inclusion of
the descriptors being tested;

2. the entry statistic was calculated by constructing a SOM using a supervised learning
algorithm; and

3. the descriptors to be entered into the model were chosen in pairs, 1 from each of 2
randomized lists of the full set of descriptors.

Algorithm training was conducted iteratively until some convergence criteria were met.
Genetic Functional Activity Algorithms

GFAs as described by Rogers and Hopﬁnger 20 were implemented using Cerius

2 software (Version 4.6, Accelrys Inc, San Diego, CA).

GFAs were conducted using static 100 member populations; linear, quadratic, and
offset quadratic basis functions;

the mutation probability for both additions and deletions was set at 0.2%.

Lack-of-ﬁt criterion was set at 1.0, and initial equation length was set to 3.

Models were evolved over 50 000 to 80 000 generations using standard

convergence criterion. \\]

The AAPS Journal 2005; 7 (3) Article 68

Self-organizing Maps

U-matrix maps represent the descriptor-based relationships of the training set of

molecules by the SOM and are constructed by projecting all selected descriptor ’ s
contributions to the mapping.

D-matrix maps indicate the classiﬁ cation mapping of the SOM, as learned from the
corresponding training sets. All 3 maps illustrate strong clustering.

The color level in all the maps (see Figures 2-4 ) describes the relative clustering in
the data; the darker areas indicate small distances, while the lighter areas indicate
larger distances.

The colored hexagons indicate the points on the map corresponding to training
molecules and are colored by the afﬁnity classiﬁcation and labeled with the
corresponding training molecule names
Figure 1 shows the schematic diagram of targeted combinatorial library design
using FOCUS-2D which consists of description, evaluation, and optimization steps.
Molconn-X program was used to generate topological descriptors (indices) for
pentapeptides. These descriptors have been developed by Kier and Hall on the basis
of chemical graph theory.

Programs implemented in FOCUS-2D as well as genetic algorithms-partial least

squares (GA-PLS) routine for QSAR developed earlier were written in C programming
language. The descriptor variables were autoscaled prior to PLS and GA-PLS9
calculations. All calculations were done on the IBM RS6000 workstation (Model 340).
e-LEA3D: a computational-aided drug design web server
Dominique Douguet
Nucleic Acids Research, 2010, Vol. 38, Web Server issue W615–W621
e-LEA3D web server integrates three complementary tools to perform computer-aided
drug design based on molecular fragments. In drug discovery projects, there is a
considerable interest in identifying novel and diverse molecular scaffolds to
enhance chances of success. The de novo drug design tool is used to invent new ligands to
optimize a user-specified scoring function. The composite scoring function includes both
structure- and ligand-based evaluations. The de novo approach is an alternative to a blind
virtual screening of large compound collections.

A heuristic based on a genetic algorithm rapidly finds which fragments or combination of

fragments fit a QSAR model or the binding site of a protein. While the approach is ideally
suited for scaffold-hopping, this module also allows a scan for possible substituents to a
user-specified scaffold.
The second tool offers a traditional virtual screening and filtering of an uploaded library of
compounds.
The third module addresses the combinatorial library design that is based on a user-drawn
scaffold and reactants coming, for example, from a chemical supplier.

The e-LEA3D server is available at: http://bioinfo.ipmc.cnrs.fr/lea.html.

Monte Carlo
Protein design automation BASSIL I. DAHIYK

Protein Science (1996), 5:895-903. C

The PDA side-chain selection algorithm requires as input a back-bone structure
defining the desired fold.

Using a rotamer description of side chains, an optimal sequence for a backbone

can be found by screening all possible sequences of rotamers, where each backbone
position can be occupied by each amino acid in all its possible rotameric states.

The discrete nature of rotamer sets allows a simple calculation of the number of
rotamer sequences to be tested. A backbone of length n with m possible rotamers
per position will have mn possible rotamer sequences.

The size of the search space grows exponentially with sequence length, which, for
typical values of n and m, render intractable an exhaustive search. This
combinatorial “explosion” is the primary obstacle to be overcome in the simulation
phase of PDA.
Simulation algorithm

We use an extension of the Dead-End Elimination (DEE) theorem (Desmet et al.,

1992, 1994; Goldstein, 1994) to solve the combinatorial search problem.

The DEE theorem is the basis for a very fast discrete search algorithm that was
designed to pack protein side chains on a fixed backbone with a known sequence.

Side chains are described by rotamers and an atomistic force field is used to
score rotamer arrangements. The DEE theorem guarantees that, if the
algorithm converges, the global optimum packing is found. The DEE method is
readily extended to our inverse folding design paradigm by simply releasing the
constraint that a position is limited to the rotamers of a single amino acid.

This extension of DEE greatly increases the number of rotamers at each

position and requires a significantly modified implementation to ensure
convergence (B.I. Dahiy & S.L. Mayo). The guarantee that only the
global optimum will be found is still valid.

The initial scoring function for sequence arrangements used in the search was an
atomic van der Waals potential. The van der Waals potential reflects excluded
volume and steric packing interactions, which are important determinants of the
specific three-dimensional arrangement of protein side chains.
Following DEE optimization, a rank-ordered list of sequences is generated by a
Monte Carlo search in the neighborhood of the DEE solution. This list of
sequences is necessary because of possible differences between the theoretical
and actual potential surfaces. Starting at the DEE solution, random positions are
changed to other rotamers, and the new sequence energy is calculated. If the
new sequence meets the Boltzmann criteria for acceptance, it is used as the
starting point for another jump (Metropolis et al., 1953).

After a predetermined number of jumps, the best scoring sequences are output
as a rank- ordered list. Starting at the global optimum is critical for the Monte
Carlo routine to find high-scoring sequences and to avoid searching low-scoring
regions of sequence space.

Hence, the DEE algorithm and the Monte Carlo search are both critical for
providing candidate sequences for experimental testing.
Genetic function approximation (GFA) was performed in the CERIUS2 simulation
package version 1.6 (Biosym/Molecular Simulations, San Diego, California).

An initial population of 300 equations was generated consisting of random

combinations of three properties. Only linear terms were used and initial co-
efficients were determined by least-squares regression for each set of properties.
Redundant equations were eliminated and 10,000 generations of random
crossover mutations were performed.

If a mutant had a better score than the worst equation in the population, the
mutant replaced the worst equation. Also, mutation operators that add or remove
terms had a 50% probability of being applied each generation, but these
mutations were only accepted if the score was improved. No equation with greater
than three terms was allowed. Equations were scored during evolution using the
lack of fit (LOF) parameter, a scaled least-square error (LSE) measure that penalizes
equations with more terms and hence resists overfitting. LOF is defined as:

where c is the number of terms in the equation and M is the number of data
points. Correlated properties cannot be differentiated by QSAR techniques and
only create redundancy in the derived relations.

Article Metode Numerik - English
No ratings yet
Article Metode Numerik - English
12 pages
Draft Xai
No ratings yet
Draft Xai
16 pages
T6 Moleculardescriptors Variable Selection
No ratings yet
T6 Moleculardescriptors Variable Selection
11 pages
Multivariate Linear QSPR/QSAR Models: Rigorous Evaluation of Variable Selection For PLS
No ratings yet
Multivariate Linear QSPR/QSAR Models: Rigorous Evaluation of Variable Selection For PLS
10 pages
Machine Learning Project
No ratings yet
Machine Learning Project
12 pages
Unit II
No ratings yet
Unit II
34 pages
PySiRC Supplementary Information
No ratings yet
PySiRC Supplementary Information
8 pages
A Comparative Study On Mushrooms Classification
No ratings yet
A Comparative Study On Mushrooms Classification
8 pages
IV Ai & Ds Al3451 ML Unit5
No ratings yet
IV Ai & Ds Al3451 ML Unit5
23 pages
A Performance Comparison of Modern Stati
No ratings yet
A Performance Comparison of Modern Stati
12 pages
Mars 05
No ratings yet
Mars 05
28 pages
CDT B1 Lab06 MondayWeek2
No ratings yet
CDT B1 Lab06 MondayWeek2
6 pages
Lecture 16
No ratings yet
Lecture 16
5 pages
A Detailed Analysis of The Supervised Machine Learning Algorithms
No ratings yet
A Detailed Analysis of The Supervised Machine Learning Algorithms
5 pages
Unit 5 Updated
No ratings yet
Unit 5 Updated
21 pages
Decision Tree DT
No ratings yet
Decision Tree DT
20 pages
Machine Learning Ensemble Guide
No ratings yet
Machine Learning Ensemble Guide
26 pages
A Sampling of Various Other Learning Methods
No ratings yet
A Sampling of Various Other Learning Methods
34 pages
Main Algorithms Used in Machine Learning Lecture Notes
No ratings yet
Main Algorithms Used in Machine Learning Lecture Notes
26 pages
Tài Liệu Việt Nam -Anh Bảo
No ratings yet
Tài Liệu Việt Nam -Anh Bảo
4 pages
Rf&DTfratello 2018
No ratings yet
Rf&DTfratello 2018
10 pages
Decision Trees
No ratings yet
Decision Trees
34 pages
Chapter 4
No ratings yet
Chapter 4
14 pages
ML Unit 2 Final - III Yr
No ratings yet
ML Unit 2 Final - III Yr
72 pages
7 Selectia Trasaturilor
No ratings yet
7 Selectia Trasaturilor
54 pages
Unit-7 ML
No ratings yet
Unit-7 ML
11 pages
Module 3
No ratings yet
Module 3
64 pages
Roumen,+Bolboaca Biomath
No ratings yet
Roumen,+Bolboaca Biomath
11 pages
Online13 Koch
No ratings yet
Online13 Koch
19 pages
Doc-20250117-Wa0014. 20250117 193235 0000
No ratings yet
Doc-20250117-Wa0014. 20250117 193235 0000
22 pages
TTNT 09 Learning From Examples
No ratings yet
TTNT 09 Learning From Examples
58 pages
Decision Tree Classifiers With Ga Based Feature Selection
No ratings yet
Decision Tree Classifiers With Ga Based Feature Selection
10 pages
Edusar
No ratings yet
Edusar
4 pages
Intro to Regression & Decision Trees
No ratings yet
Intro to Regression & Decision Trees
11 pages
Feature Selection Engineering
No ratings yet
Feature Selection Engineering
72 pages
Chapter 5 2018 2019
No ratings yet
Chapter 5 2018 2019
5 pages
QSAR-Co 1.0.0 User Manual
No ratings yet
QSAR-Co 1.0.0 User Manual
29 pages
Decision Tree
No ratings yet
Decision Tree
15 pages
MLunit 2 Mynotes
No ratings yet
MLunit 2 Mynotes
15 pages
Computational Statistics Assignement1
No ratings yet
Computational Statistics Assignement1
4 pages
Big Data Mid Term
No ratings yet
Big Data Mid Term
14 pages
Discovering New Knowledge - Data Mining
No ratings yet
Discovering New Knowledge - Data Mining
55 pages
This Dataset Comes From An Original (Non-Machine-Learning) Study and Received in December 1995
No ratings yet
This Dataset Comes From An Original (Non-Machine-Learning) Study and Received in December 1995
4 pages
FDS 16 Regression Tree
No ratings yet
FDS 16 Regression Tree
27 pages
Sat - 149.Pdf - Prediction of Bigmart Sales Using Machine Learning Algorihms
No ratings yet
Sat - 149.Pdf - Prediction of Bigmart Sales Using Machine Learning Algorihms
11 pages
ML Unit@4
No ratings yet
ML Unit@4
70 pages
Unit-Ii Chapter-3 Beyond Binary Classification Handling More Than Two Classes
No ratings yet
Unit-Ii Chapter-3 Beyond Binary Classification Handling More Than Two Classes
16 pages
Chapter 5 Learning Deterministic Models
No ratings yet
Chapter 5 Learning Deterministic Models
28 pages
Embedded Methods: Isabelle Guyon André Elisseeff
No ratings yet
Embedded Methods: Isabelle Guyon André Elisseeff
12 pages
Interpretable ML
No ratings yet
Interpretable ML
13 pages
Ml-Unit Iii-1
No ratings yet
Ml-Unit Iii-1
46 pages
ml2 PDF
No ratings yet
ml2 PDF
5 pages
Heart Disease Prediction
No ratings yet
Heart Disease Prediction
27 pages
Intro to Exploratory Data Analysis
No ratings yet
Intro to Exploratory Data Analysis
17 pages
Application of Cart Algorithm in Hepatitis Diseaseas Diagnosis
No ratings yet
Application of Cart Algorithm in Hepatitis Diseaseas Diagnosis
5 pages
Ia1 ML Scheme Common To Is, Ai, Cs
No ratings yet
Ia1 ML Scheme Common To Is, Ai, Cs
10 pages
Unit V - Big Data Programming
No ratings yet
Unit V - Big Data Programming
22 pages
The Stability of Feature Selection and Class
No ratings yet
The Stability of Feature Selection and Class
6 pages
Chemometrics and Intelligent Laboratory Systems: 10.1016/j.chemolab.2016.07.011
No ratings yet
Chemometrics and Intelligent Laboratory Systems: 10.1016/j.chemolab.2016.07.011
26 pages
CD Alert Scrub Typhus
No ratings yet
CD Alert Scrub Typhus
8 pages
Nominall Roll
No ratings yet
Nominall Roll
92 pages
CD Alert Avian Influenza H5N1
No ratings yet
CD Alert Avian Influenza H5N1
8 pages
Biophysj00094 0071
No ratings yet
Biophysj00094 0071
7 pages
3 Rdunitii
No ratings yet
3 Rdunitii
12 pages
Node 9
No ratings yet
Node 9
11 pages
Johnson, Stephen M - Characterological Transformation, The Hard Work Miracle
100% (3)
Johnson, Stephen M - Characterological Transformation, The Hard Work Miracle
328 pages
Portable Fire Extinguishers - : Part 7: Characteristics, Performance Requirements and Test Methods
No ratings yet
Portable Fire Extinguishers - : Part 7: Characteristics, Performance Requirements and Test Methods
56 pages
Physics 1 M-1 AK PDF
No ratings yet
Physics 1 M-1 AK PDF
3 pages
Cke900 Spec
No ratings yet
Cke900 Spec
4 pages
Centrifugal Pump Inventory List
No ratings yet
Centrifugal Pump Inventory List
7 pages
Chloewilsonsmresume
No ratings yet
Chloewilsonsmresume
1 page
Hydraulics Lab: Drag & Energy Dissipation
100% (1)
Hydraulics Lab: Drag & Energy Dissipation
16 pages
DP13 Tge SHD RM3 303 THM 00 6001 00
No ratings yet
DP13 Tge SHD RM3 303 THM 00 6001 00
1 page
Course Title: Business Mathematics-I Course Code: Qam102 Credit Units: 03 Course Level: Ug L T P/S SW/F W Total Credit Units
No ratings yet
Course Title: Business Mathematics-I Course Code: Qam102 Credit Units: 03 Course Level: Ug L T P/S SW/F W Total Credit Units
4 pages
Physics - Horizontal Projectile Motion NOTES
No ratings yet
Physics - Horizontal Projectile Motion NOTES
19 pages
Skill Assessment: Tableau
No ratings yet
Skill Assessment: Tableau
1 page
Panasonic LC-RD1217P
No ratings yet
Panasonic LC-RD1217P
2 pages
Thermal Diffusivity of Saliva and Tears Obtainded by Thermal Lens Spectros
No ratings yet
Thermal Diffusivity of Saliva and Tears Obtainded by Thermal Lens Spectros
2 pages
Kindergarten DLL MELC Q3
No ratings yet
Kindergarten DLL MELC Q3
7 pages
Mechanical Engineering Corrosion Guide
100% (1)
Mechanical Engineering Corrosion Guide
6 pages
Supply Chain Analysis For Inverter Air Conditioners
100% (1)
Supply Chain Analysis For Inverter Air Conditioners
110 pages
BioK QQ 1.2 QP
No ratings yet
BioK QQ 1.2 QP
5 pages
5 Paragraph Essay Examples Middle School
100% (2)
5 Paragraph Essay Examples Middle School
6 pages
DeepETA - How Uber Predicts Arrival Times Using Deep Learning
No ratings yet
DeepETA - How Uber Predicts Arrival Times Using Deep Learning
18 pages
16 Marks Metrology and Measurments
No ratings yet
16 Marks Metrology and Measurments
2 pages
Brutal Hadcore?.dsc
No ratings yet
Brutal Hadcore?.dsc
3 pages
Fuzzy Logic for Tech Enthusiasts
No ratings yet
Fuzzy Logic for Tech Enthusiasts
12 pages
Interface Designer V5 EN
No ratings yet
Interface Designer V5 EN
140 pages
Bank Valve Loader 416e Lms
No ratings yet
Bank Valve Loader 416e Lms
4 pages
MAXHUB XBoard V7 Series Datasheet V1.3 (20240619)
No ratings yet
MAXHUB XBoard V7 Series Datasheet V1.3 (20240619)
4 pages
3G Missing Neighbors
No ratings yet
3G Missing Neighbors
3 pages
Kajian Viktimologi Terhadap Anak Sebagai Korban Tindak Pidana Kesusilaan Mendeley Fix
No ratings yet
Kajian Viktimologi Terhadap Anak Sebagai Korban Tindak Pidana Kesusilaan Mendeley Fix
20 pages
Maths Question Paper Maths
No ratings yet
Maths Question Paper Maths
6 pages
Sap Material Master Data (End User)
No ratings yet
Sap Material Master Data (End User)
13 pages
Mammut Sleep Well Pt1 E
100% (1)
Mammut Sleep Well Pt1 E
39 pages

Adv QSAR

Uploaded by

Adv QSAR

Uploaded by

Advanced QSAR

Multiple Linear Regression (MLR)—

Here Y is an n × 1 vector of observations, X is an n × p matrix of regressors, β is a p × 1

DTs approach a classification or regression problem in divide-and-conquer fashion. The

The algorithm of the GA-PLS method is implemented as follows.

Step 3. Using each parent combination of descriptors, a QSAR equation is generated

Step 5. The population is evolved by performing a crossover between two randomly

Step 6. Each offspring is subjected to a random single-point mutation, i.e. randomly

Step 7. The fitness of each offspring is evaluated as described above

Step 8. If the resulting offsprings are characterized by a higher value of fitness

Step 9. Steps 4 - 7 are repeated until a predefined maximum number of crossovers

The 28 bradykinin potentiating pentapeptides were included in the training set to

The equations correctly predicted them to have higher activities compared to

According to this condition, if RSD of dependent variables of a virtual peptide is less

In the autocorrelation vectors calculated by ADRIANA.Code, the hydrogen atoms

The ε-SVR algorithm is a generalization of the better known support vector

MCC 1 is regarded as a perfect prediction, whereas 0 is regarded as random prediction.

The parameter γ is selected by the user.

GA is governed by biological evolution rules and can investigate several possible

CFS is a simple filter algorithm that ranks feature subsets according to a

Descriptors were generated using the chemical modeling program

Descriptors were divided into one of several physical property categories:

Self-organizing Maps With Descriptor Selection

In order to nonlinearly relate bioactivity and molecular features, a set of

A SOM-oriented variation of a forward entry step-wise regression algorithm was

GFAs as described by Rogers and Hopﬁnger 20 were implemented using Cerius

Models were evolved over 50 000 to 80 000 generations using standard

The AAPS Journal 2005; 7 (3) Article 68

U-matrix maps represent the descriptor-based relationships of the training set of

Programs implemented in FOCUS-2D as well as genetic algorithms-partial least

A heuristic based on a genetic algorithm rapidly finds which fragments or combination of

The e-LEA3D server is available at: http://bioinfo.ipmc.cnrs.fr/lea.html.

Protein Science (1996), 5:895-903. C

Using a rotamer description of side chains, an optimal sequence for a backbone

We use an extension of the Dead-End Elimination (DEE) theorem (Desmet et al.,

This extension of DEE greatly increases the number of rotamers at each

An initial population of 300 equations was generated consisting of random

You might also like