Sorting Variables by Using Informative Vectors As A Strategy For Feature Selection in Multivariate Regression
Sorting Variables by Using Informative Vectors As A Strategy For Feature Selection in Multivariate Regression
Received: 30 May 2008, Revised: 1 August 2008, Accepted: 8 August 2008, Published online in Wiley InterScience: 29 October 2008
Keywords: variable selection; informative vectors; OPS; partial least squares; chemometrics
     J. Chemometrics 2009; 23: 32–48                      Copyright ß 2008 John Wiley & Sons, Ltd.
Sorting variables feature selection in multivariate regression
efficient for all types of variables. For example, in chemistry,          been found in the literature about algorithm automation,
diverse analytical techniques provide different types of predictors       definition of cut off criterion and vectors combination.
and each one presents peculiar and well-defined characteristics.             Hoskuldsson and Reinikainen [12,13,21] have proposed
   It is a usual practice among chemometricians to visualize              informative vectors and strategies for variable selection.
the plots of informative or prognostic vectors to identify                However, these authors did not compare or combine them with
desirable features from multivariate data. The informative                other well known informative vectors such as regression and/or
vectors are those obtained from some mathematical data                    correlation vectors.
treatment using the predictors and/or dependent variables.                   In this work, a new informative vector is proposed and a new
These vectors make the standard behavior of the variables more            strategy for automatic multiple variable selection/elimination is
visible.                                                                  presented. The methodology developed is based upon several
   In multivariate calibration, the elements of an informative            informative vectors and their combinations, introducing a simple
vector that have high absolute values are intuitively connected           and intuitive automatic procedure for variable selection.
with the regions from original data that improve the predictions.
In fact, if the visualized vector contains the needed information,
the feature selection can be performed from this vector. The              2.     THEORY
informative vector can be generated from different types of
response variables, because the information they contain is               2.1.   Notation
inherent to variables, no matter their nature.                            Scalars are defined as italic lower case characters (a, b and c),
   Several authors [14–16] advocate the use of the regression             vectors are typed in bold lower case characters (a, b and c) and
vector as a potential tool to select variables in multivariate            matrices as bold upper case characters (A, B and C). Matrix
calibration. Variables with low regression coefficients would not         elements are represented by corresponding italic lower case
contribute significantly for prediction and, hence, should be             characters with row and column index subscripts (xij is an
eliminated.                                                               element of X). In some cases, matrices will be written explicitly as
   Other informative vectors that could be used for variable              X (I  J) to emphasize their dimensions (I rows and J columns).
selection are those that, in some way, relate responsive variables        The identity matrix is represented as I with its proper dimensions
with the dependent variable as, for example, the correlation              indicated.
coefficients that compose the correlation vector [3,12,17,18].               Superscripts t, þ and 1 represent transpose, pseudo-inverse
Usually, a poorly correlated variable is not informative and can          and inverse operations, respectively. The symbol ^, e.g. ŷ,
be excluded. However, one must pay special attention to such              represents an estimated matrix, vector or scalar.
types of vectors since they bear univariate and not multivariate
information. Besides the regression and correlation vectors, other
                                                                          2.2.   Method: ordered predictors selection (OPS1)
types of vectors occurring in the literature [14,19,20] have the
same goal. The strategy of using informative vectors is simple,           In general, the essence of the method is to obtain a vector
intuitive and, in general, leads directly to the interpretation of the    (informative vector) that contains information about the location
selected variables and also to better predictions of unknowns.            of the best response variables for prediction (see Figure 1A). Such
   However, in spite of the successful but relatively rare use of         vectors can be obtained directly from calculations performed
informative or prognostic vectors for selecting variables, little has     with response and dependent variables, or from combinations of
   Figure 1. Variable selection steps using the OPS method. This figure is available in color online at www.interscience.wiley.com/journal/cem
                                                                                                                                                 33
J. Chemometrics 2009; 23: 32–48              Copyright ß 2008 John Wiley & Sons, Ltd.               www.interscience.wiley.com/journal/cem
                                                                                                         R. F. Teófilo, J. P. A. Martins, M. M. C. Ferreira
     different vectors obtained with the same purpose. Obviously, the        and SIMPLS, considers the y information during computations.
     vector length must be equal to the number of response variables         The PLSBdg algorithm can be summarized as follows [27,29].
     and each position in the vector must be aligned to the
                                                                             (1) initialize the algorithm for                    the    first   component,
     corresponding response. In the second step (Figure 1, B), the
                                                                                 v1 ¼ Xt y=kXt yk; a1 u1 ¼ Xv1
     original response variables (X matrix columns) are differentiated
                                                                             (2) for i ¼ 2, . . . , h components
     according to the corresponding absolute values of the
     informative vector elements obtained previously in step A. The              2:1: g i1 vi ¼ X t ui1  ai1 vi1
     higher the absolute value, the more important the response                  2:2: ai ui ¼ Xvi  g i1 ui1
     variable, which enables their sorting in descending order of
                                                                             with,
     magnitude in the third step (Figure 1C).
                                                                             Vh ¼ ðvi ; :::; vh Þ, Uh    ¼ ðui ; :::; uh Þ and Rh
        In the fourth step (Figure 1D), multivariate regression models
     are built and evaluated using a cross validation strategy. An initial         0                                    1
                                                                                      a1 g 1
     subset of variables (window) is selected to build and evaluate the            B         a2 g 2      0            C
     first model. Then, this matrix is expanded by the addition of a               B                                  C
                                                                                   B              ..     ..           C
                                                                                ¼B                   .      .         C
     fixed number of variables (increment) and a new model is built                B                                  C
                                                                                   @         0           ak1 g k1   A
     and evaluated. New increments are added until all or some
     percentage of variables are taken into account. Quality                                                  ak1
     parameters of the models are obtained for every evaluation
                                                                                It can be proved that
     and stored for future comparison.
        In the last step (Figure 1E), the evaluated variable sets (initial                          XVh ¼ Uh Rh ! Rh ¼ Uth XVh :                         (3)
     window and its extensions) are compared using the quality
     parameters calculated during validations. The model with the
     best quality parameters should contain variables with the best             The bidiagonal matrices are analogous but not identical
     prediction capability and so these are the selected variables.          to those derived by singular values decomposition (SVD)
                                                                             and, therefore, the PCR regression method is slightly different
                                                                             from PLS.
     2.3.   The OPS-PLS algorithm
                                                                                An important conceptual detail is that the scores calculated by
     Specifically, the algorithm used in this work consists of the           the bidiagonalization algorithm are different by a normalization
     following steps:                                                        factor from those produced in NIPALS and SIMPLS algorithms
                                                                             [30]. Ergon [31] and Pell et al. [30] have shown that there is a
       (i) Obtaining informative vectors or their combinations from X
                                                                             difference between the reconstructed matrix obtained by the
           and y;                                                                                ^ h ¼ Uh Rh Vt ) and those reconstructed by
                                                                             PLSBdg method ( X                h
      (ii) Building PLS regression models;
                                                                             NIPALS and SIMPLS methods, becoming smaller with higher
     (iii) Calculating quality parameters by leave-N-out cross vali-
                                                                             numbers of latent variables.
           dation and
                                                                                In this work, one of the informative vectors proposed makes
     (iv) Comparing the quality parameters for the obtained models.
                                                                             use of the reconstructed matrix and, therefore, PLSBdg is
        The algorithm has many distinguishing features: (1) it is            fundamental to obtain a consistent reconstructed matrix.
     computationally efficient when compared to other variable
     selection algorithms (e.g. genetic algorithm); (2) it may be            2.4.     The informative vectors
     completely automated with different informative vectors and
                                                                             2.4.1.    Regression vector (Reg.).
     their combinations; (3) it can be adapted for variable selection
     when treating multiway data sets [22] and also (4) it can be used       The regression vector obtained when multivariate regression is
     in methods of variable selection by intervals (e.g. iPLS) or in         performed can be defined as the expected change in the response,
     discriminant analysis [23].                                             per unit change in the variable, if all the variables and responses
        The PLS method to obtain all informative vectors and the             are linearly related [32]. In this work, the regression vector was
     bidiagonal algorithm for PLS1, were used in this work.                  considered as an informative vector for variable selection.
        Manne [24] has shown that PLS1 is equivalent to an algorithm           Once the matrices U, V and R are computed with h
     developed by Golub and Kahan [25] for matrix bidiagonalization.         components truncated in R, according to Equation 3, the
     Matrix bidiagonalization is a useful decomposition often                regression vector can be estimated directly by solving the least
     employed as a fast initialization in algorithms for singular values     squares problem as shown in Equation 4.
     decomposition [26].
        This method considers that any matrix X(I  J) can be written                 y ¼ Xb      !        y ¼ Uh Rh Vth b      !      ^ ¼ Vh R1 Ut y
                                                                                                                                       b                 (4)
                                                                                                                                               h   h
     as:
                                                                                Although the regression vectors obtained by NIPALS or PLSBdg
                                  X ¼ URVt                             (2)   are the same [29,30], the reconstructed X matrix from the
                                                                             bidiagonal method with h components is more consistent to
     where U(I  J) and V(I  J) are matrices with orthonormal               calculate the residual vector. Besides, the PLSBdg algorithm is
     columns, i.e. they satisfy Ut U ¼ Vt V ¼ I, and R(J  J) is a           significantly more efficient computationally when compared to
     bidiagonal matrix.                                                      NIPALS algorithm.
       Several papers in the literature describe the interesting relation       When using the regression vector for variable selection in the
     between PLS1 and bidiagonal decomposition [24,27–30]. The               OPS method, the first question asked is about the number of
     bidiagonal decomposition algorithm (PLSBdg), similar to NIPALS          components, h, to be used when obtaining this informative
34
     www.interscience.wiley.com/journal/cem                  Copyright ß 2008 John Wiley & Sons, Ltd.                        J. Chemometrics 2009; 23: 32–48
Sorting variables feature selection in multivariate regression
vector. Firstly, it is proposed in this work to build and validate a       component carries information from important columns of X into
PLS model, from which h ¼ hMod is determined. But maybe hMod               ^ h and the sum of squared residuals of the corresponding
                                                                           X
cannot generate a regression vector sufficiently informative for           columns in Eh approaches zero. So, the columns in Eh with low
variable selection. To find the best informative vector, a study was       values of squared sum indicate the more effective variables for
then performed on the full data set by increasing the number of            the regression. The informative vector proposed in this case is the
components in the model, starting from h ¼ hMod and carrying               inverse of the sum of squared residuals defined as q, according to
out the variable selection using the OPS algorithm. By varying the         Equation 6
h value, different informative vectors are generated, from which
                                                                                                       ^h                             1
the best component number (h ¼ hOPS) is selected. Therefore,                                  Eh ¼ X  X      !            qj ¼                (6)
two optimum numbers of components are employed in this                                                                              etj ej
work, one representing the component number for model
building (hMod.) and the other representing the component                  where ej is the j-th column of Eh.
number employed to generate the best informative vector in OPS
method (hOPS).                                                             2.4.4.   Covariance procedures vector (CovProc)
  The algorithm employed to study the regression vector                    Reinikainen and Hoskuldsson [21] have proposed a vector,
consists of the following steps.                                           named CovProc, to rank and then select the variables. Its
  for h ¼ hMod to n components                                             calculation is based on ranking the variables according to their
  generate the regression vector for variable selection with               covariance and selecting ‘an optimal’ number of variables to use
  hOPS ¼ h;                                                                in sequential dynamic systems. The H-principle suggests using
  run the OPS algorithm using the previously generated vector              XtyytX as the measure of strength between X and y. The method
  (use hMod for model building);                                           CovProc suggests sorting the variables according to weights
  store the minimum RMSECV obtained in the selection for all h;            derived from XtyytX and judging the subset selection based on
  end                                                                      expanding X according to the sorting obtained. The informative
  plot the component numbers versus RMSECV.                                vector is the diagonal of XtyytX.
J. Chemometrics 2009; 23: 32–48               Copyright ß 2008 John Wiley & Sons, Ltd.               www.interscience.wiley.com/journal/cem
                                                                                                               R. F. Teófilo, J. P. A. Martins, M. M. C. Ferreira
       According to Ferre et al. [35] and Bro et al. [37], the NAS vector                    where ŷ and ŷ are the scalar and vector of estimated values,
     can be calculated using the following equation                                       respectively, y is a scalar of mean values in y and Im is the number
                                                                                          of samples. When internal validation (cross validation—CV) is
                          xnas
                           i   ¼ ^yi ðbt Þþ ¼ ð^yi =bt bÞb                         (9)    applied, Im is the number of samples in the calibration set
                                                                                          (training), and the error and correlation coefficients are named
       Since the NAS vector is obtained for each sample, an average of                    RMSECV and Rcv, respectively. For external validation (a new set of
     the columns of the matrix that contains all vectors is a good                        samples), Im is the number of predicting samples (P) and in this
     estimate for an informative vector to be employed in the OPS                         case, the error and correlation coefficients are named RMSEP and
     method.                                                                              Rp, respectively.
       As in the regression vector, the NAS vector was also built using
     the component number h ¼ hOPS.
     2.4.8.   Vectors’ combinations                                                       The first data set was composed by NIR spectra of diesel samples
                                                                                          measured at the Southwest Research Institute (SWRI) in a project
     Besides the vectors Reg, Corr, SqRes, CovProc, VIP, NAS and StN,                     sponsored by the US Army. The data set was taken from the
     their combination also can be used to search the most predictive                     Eigenvector Research homepage at http://www.eigenvector.com.
     variables. This combination is obtained by performing the                            The following physical properties were modeled: bp50, boiling
     product of the absolute value of each element in one vector times                    point at 50% recovery/8C (ASTM D 86); CN, cetane number
     the corresponding element in the other vector. Before doing that,                    (equivalent to octane number for gasoline, ASTM D 613); d4052,
     the vectors are normalized. In the present work, pairs of these                      density, g mL1, 158C, (ASTM D 4052); freeze, freezing tempera-
     vectors were investigated.                                                           ture of the fuel/8C; Total, total aromatics, mass % (ASTM D 5186)
       To make the graphical representation straightforward, an                           and Visc., viscosity, cSt/408C. In this work, the data was split
     abbreviation for vectors pairs was used, i.e. Reg-Corr (RC);                         according to that on the web site without high leverage samples.
     Reg-SqRes (RS); Reg-CovProc (RCP); Reg-VIP (RV); Reg-NAS (RN);                       Also, the spectra obtained were preprocessed by taking the first
     Reg-StN (RStN) and so on for other pairs.                                            derivative. The number of samples in the training set/test set for
                                                                                          bp50, CN, d4052, freeze, Total and Visc. were 113/113; 113/112;
     2.4.9.   Model evaluation                                                            122/121; 116/115; 118/118 and 116/116, respectively.
     The quality of the models is assessed by the root mean square                           Further information about this data can be found at http://www.
     error (RMSE) calculated according to Equation 12 and the                             idrc-chambersburg.org/shootout_2002.htm or in Reference [39].
     correlation coefficient R given by Equation 13,
                                   vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
                                   uP
                                   u Im                                                   3.1.2.    Raman data set
                                   u ðyi  ^yi Þ2
                                   t
                           RMSE ¼ i¼1                         (12)                        This data set from literature is available at http://www.models.
                                              Im                                          kvl.dk/research/data/ and presented by Dyrby et al. [40]. The data
                                                                                          set consisted of Raman scattering for 120 samples using 3401
                                    P
                                    Im                                                    wave numbers in the range of 200–3600 cm1 and using the
                                    ð^yi  y  ^Þðyi  yÞ                                  average of 64 scans for each sample. The dependent variable
                         R ¼ si¼1
                              ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi          (13)     referred to the amount of active substance (containing a C –   –N
                               P
                               Im
                                                                                          group) in Escitalopram1 tablets in %w/w units. More exper-
                                    ð^yi  y   ^Þ2 ðyi  yÞ2
                                    i¼1                                                   imental details can be obtained in the literature [40].
                                                                                             Second derivative pre-treatment was applied to the samples
                                                                                          before building the model.
36
     www.interscience.wiley.com/journal/cem                               Copyright ß 2008 John Wiley & Sons, Ltd.              J. Chemometrics 2009; 23: 32–48
Sorting variables feature selection in multivariate regression
3.1.3.   Fluorescence data set                                        randomly generated. The data set was split into 10 samples for
                                                                      the training set and 10 samples for external validation.
This data set was designed by Bro et al. [41] for the study of
several topics in fluorescence spectroscopy and can be found at
http://www.models.kvl.dk/research/data/. The selected analytes        3.2.   Programs
have very similar excitation and emission spectra. Consequently,      All data analyses were performed using home-built functions
the calibration problem is rather complicated.                        written for Matlab 7 (MathWorks, Natick, USA). The OPS1 Toolbox
   Six different analytes were used: catechol (CATE), hydro-          routines, implemented in MATLAB 7, were registered and are
quinone (HYDR), indole (INDO), resorcinol (RESO), L-tryptophane       available on the Internet at http://lqta.iqm.unicamp.br
(TRYP) and DL-tyrosine (TYRO). The set used in this work
contained 404 mixtures of 2–4 fluorophores. The concentration
ranges of the fluorophores in the samples were: CATE                  4.     RESULTS AND DISCUSSION
(0–87 mmol L1), HYDR (0–22 mmol L1), INDO (0–5.46 mmol L1),
RESO (0–39.96 mmol L1), TRYP (0–7.44 mmol L1) and TYRO              4.1.   NIR data set
(0–12.14 mmol L1). Only the first replicate measurement of each
                                                                      Recent advances in process instrumentation (using NIR spec-
sample was used in the present work.
                                                                      troscopy, in particular) and chemometric methods have led to the
   The scan settings emission wavelengths were 230–500 nm
                                                                      popularization of NIR spectroscopy.
(recorded every 2 nm) and excitation wavelengths 230–320 nm
                                                                         For oil fractions and diesel fuels, the NIR spectroscopic region
(recorded every 5 nm). For further experimental details see
                                                                      (750–1550 nm) is especially attractive because most absorption
Reference [41].
                                                                      bands observed in this region arise from overtones and
   Prior to analysis, part of the recorded data was removed
                                                                      combination bands of carbon-hydrogen (C-H) stretching
(Raman and Rayleigh scattering) and the region removed was
                                                                      vibrations of hydrocarbon molecules.
interpolated using the algorithm from Bahram et al. [42] available
                                                                         Figure 2 presents the minimum RMSECV obtained by
at http://www.models.kvl.dk/source/. This procedure was carried
                                                                      leave-eleven-out cross validation. Besides the full data set (first
out in order to avoid the presence of any scattering effects.
                                                                      bar from bottom and named Full), when no variables were
Besides, the excitation wavelengths 230–240 and 300–320 nm
                                                                      excluded, results for the seven informative vectors and their
were excluded, together with the emission wavelengths 230–296
                                                                      combinations introduced in the previous session are shown.
and 422–500 nm. The 3D array generated for each sample with
                                                                      Vertical dot lines indicate the RMSECV value for full data set (with
dimensions 19  136 (excitation  emission) was cut down to
                                                                      no variable selection) and for the minimum RMSECV obtained by
11  62, corresponding to the 11 emission spectra recorded from
                                                                      the OPS method. A significant decrease in RMSECV values
each experiment. An unfolding was performed to apply PLS
                                                                      occurred when other informative vectors were combined with
regression. Thus a total of 682 variables were investigated for
                                                                      Reg. The subset of variables selected by the following informative
each sample.
                                                                      vectors: Corr, SqRes, CovProc, VIP and StN presented reasonable
                                                                      improvement in the prediction quality of the models principally
3.1.4.   GC data set                                                  when combined with Reg (RC, RS, RCP, RV and RStN), but the
                                                                      results are not still comparable to Reg and NAS vectors and their
The data GC was selected from the examples available in               combination RN.
Pirouette software [43]. This set is formed by 35 responses              It was observed that when the regression vector was used as
consisting of peak areas for a set of gas chromatographic runs on     an informative vector in the OPS algorithm, the number of
fuel samples and 3 dependent variables made up from                   components to build this informative vector played a primordial
measurements of the following physical properties: flash point,       role to obtain good results. The study performed by increasing
specific gravity and freeze point. A total of 16 samples were         the number of components to build the informative vector while
measured and one was detected as outlier and removed. Due to          keeping constant the number of components (hMod.) to build the
the small data set, the OPS method was carried out using all          model was fundamental. It was found that the optimum number
samples and after feature selection the training set and test set     of components (hOPS) to build a regression vector with the
were split into 11 and 4 samples, respectively.                       purpose of variable selection was, in most cases, significantly
                                                                      higher than hMod.
                                                                         Typical variations of RMSECV as the number of components is
3.1.5.   Data set QSAR                                                increased are presented in Figure 3. Three replicates were carried
The data set QSAR is from our research group [44] and available at    out and the mean value of minimum errors (RMSECV) and its
http://pcserever.iqm.unicamp.br/marcia//hiv1qsardata.html.           standard deviation bar (error bar) are shown for each selection.
The 14 molecular descriptors for 48 HIV-1 protease inhibitors         The number of components for minimum RMSECV indicates
were generated on the basis of 1D and 2D formulas and the             h ¼ hOPS. Figure 3 indicates that the number of components with
dependent variable was the in vitro inhibition activity [45],         minimum error is significantly higher than those for building the
pIC50 ¼ log IC50. The data set was split into 32 compounds for       model. In Figure 2, for comparison, each result was obtained with
the training set and the other 16 were for external validation.       hMod equal to 9 for Visc. and 10 for Total. This number of
                                                                      components for each specific physical property was kept
                                                                      constant for the window and each added increment.
3.1.6.   Simulated data set
                                                                         It seems that when an excessive number of components is
The simulated data set was proposed by one of reviewers of this       included to calculate the informative vector, more information
paper and consisted of 20 mixtures simulated by using UV-type         from each variable and its real contribution to the model seems
spectra from 4 analytes and their respective concentrations           to be better represented. This is an empirical observation and it
                                                                                                                                             37
J. Chemometrics 2009; 23: 32–48           Copyright ß 2008 John Wiley & Sons, Ltd.            www.interscience.wiley.com/journal/cem
                                                                                                          R. F. Teófilo, J. P. A. Martins, M. M. C. Ferreira
     Figure 2. Minimum RMSECV obtained for the seven informative vectors and their combinations used in the OPS algorithm. The standard deviation of
     three replicates is represented by a horizontal error bars. These results were obtained for two physical properties Visc. (A) and Total (B). The full bar
     RMSECV obtained (right hand dotted line) were 0.124 and 0.632 for Visc. and Total, respectively. The best results (left hand dotted line) obtained were
     0.090 and 0.477 for Visc. and Total, respectively. Other good values are indicated inside the bars for comparison.
     depends upon the data structure. More statistical studies are                     Figure 4 illustrates both regression vectors applied in different
     necessary, however these are out of the scope of the present                   situations in this context. The absolute values of regression
     work.                                                                          coefficients are larger for hOPS than for hMod.
       NAS vector, which is similar to the regression vector for inverse               Figure 5 shows the OPS plots for detection of the best points
     calibration, was also very efficient to improve the prediction                 for selection/elimination of variables. When the variables are
     when using h ¼ hOPS. The vectors SqRes and VIP, which are also                 arranged into descending order as indicated by the informative
     dependent upon the number of components, did not show                          vector and the first window is selected, a decrease in RMSECV and
     better results when using h ¼ hOPS, as obtained for the                        increase in Rcv is expected when expanding this first variable
     regression and NAS vectors. Thus, regression and NAS vectors                   subset by adding new variables. When the optimal number of
     were built with a number of components equal to hOPS while                     variables is reached, the error increases and the Rcv decreases
     SqRes and VIP vectors were built with a number of components                   with the inclusion of noninformative variables. The region where
     equal to hMod.                                                                 values of minimum RMSECV and maximum Rcv occur is the
     Figure 3. Plots of RMSECV versus the number of components for building the regression vector used as informative vector. The bars are the standard
     deviations of three replicates. Three physical properties are considered: (A) BP50 with hOPS ¼ 18; (B) D4052 with hOPS ¼ 12 and (C) Freeze with
     hOPS ¼ 11
38
     www.interscience.wiley.com/journal/cem                      Copyright ß 2008 John Wiley & Sons, Ltd.                   J. Chemometrics 2009; 23: 32–48
Sorting variables feature selection in multivariate regression
Figure 4. Regression vectors built with different numbers of components, hMod (black solid line) and hOPS (red doted line). (A) Physical property BP50
with hMod ¼ 8 and hOPS ¼ 18; (B) physical property D4052 with hMod ¼ 6 and hOPS ¼ 12. This figure is available in color online at www.
interscience.wiley.com/journal/cem
Figure 5. Typical OPS plots. The vertical arrows indicate the set of variables with better prediction ability. The doted circles indicate the optimum
regions. The horizontal dashed lines indicate the statistical parameters using the full data set. These results were obtained for the physical properties
(A) Visc. and (B) Total. For comparison, each result was obtained with the maximum number of hMod equal to 9 for Visc. and 10 for Total. The maximum
number of components was fixed in the window and each added increment.
optimum (vertical arrows with dotted circles) for the selection/               the species present in the sample [40]. Comparing the raw
elimination of variables. The horizontal dashed lines in the plots             spectrum from the pure active substance containing the cyanide
indicate the RMSECV or Rcv for full data.                                      group (C –– N) [40], with that from the investigated tablets, it is
   For all physical properties of this data set, the results obtained          visible that the cyanide group is easily detected in this region. The
when using the variables indicated by the optimum regions are                  distinct peak at 2233 nm seen in the tablet spectrum originated
significantly better than those obtained when all variables (full              from the C –  – N group. Several others peaks from the active
data set) were used. These results suggest that the proposed                   substance can be seen in the tablet spectrum (e.g. at 1614 and
methodology is in fact very efficient for variable selection.                  3075 nm), but none are as specific as the cyanide peak. The tablet
   Table I shows results obtained for the models built with all                spectra contain several peaks from the excipients including three
variables (full model) and selected variables. A meaningful                    intense peaks from titanium dioxide in the coating material (395,
reduction of the number of variables with improvement in                       511 and 634 nm) and several peaks from cellulose (1095, 1121,
statistical parameters was obtained. Note that the vectors Reg                 1380 and 2900 nm). There is a strong fluorescence background
and NAS are important vectors either alone or in combination.                  originating from an unknown residual in the primary micro-
   The selected variables are shown in Figure 6 where well                     crystalline cellulose excipient. Besides, cellulose contains trace
defined regions are clearly observed. Notice that baseline regions             amounts of organic substances that can be highly fluorescent. To
were not selected and peaks with rather clear physical meaning                 eliminate this fluorescence background the second derivative of
were selected. This is a great advantage of the OPS method                     the spectra was calculated.
compared to the other methods.                                                    Figure 7A shows the trend in RMSECV with increasing the
                                                                               number of components to generate the informative vectors
                                                                               (regression vector and NAS, SqRes and VIP). The leave-twelve-out
                                                                               cross-validation method was used to calculate the RMSECV. The
4.2.   Raman data set
                                                                               value hOPS ¼ 9 was used to obtain the informative vectors and
Raman is a rapidly developing spectroscopic technique that,                    hMod ¼ 3 for model building. The vectors Reg, NAS and the
unlike infrared spectroscopy (IR), does not require special                    combination RS, RN and SN, which are shown in Figure 7B yielded
sampling techniques. It gives a measure of the weak inelastic                  the best results (RMSECV ffi 0.47). Hence, these vectors could be
scattering created by interaction between the incoming light and               used to select variables. The vertical lines in Figure 7B indicate the
                                                                                                                                                            39
J. Chemometrics 2009; 23: 32–48                Copyright ß 2008 John Wiley & Sons, Ltd.                   www.interscience.wiley.com/journal/cem
                                                                                                          R. F. Teófilo, J. P. A. Martins, M. M. C. Ferreira
Table I. Statistical results obtained from NIR data set for all physical properties
Full model OPS model Full model OPS model Full model OPS model
BP50 CN D4052
     Figure 6. Selected variables from the original spectra for each physical property. (A) BP50, (B) CN, (C) D4052, (D) Freeze, (E) Total and (F) Visc. All
     calculations were performed on the first derivative spectra. This figure is available in color online at www.interscience.wiley.com/journal/cem
40
     www.interscience.wiley.com/journal/cem                      Copyright ß 2008 John Wiley & Sons, Ltd.                  J. Chemometrics 2009; 23: 32–48
Sorting variables feature selection in multivariate regression
Figure 7. (A) Typical decrease of RMSECV with an increasing number of components to build the regression vector used as informative vector;
(B) minimum RMSECV obtained for various informative vectors used in the OPS algorithm and for the full data set (other good values are indicated inside
the bars for comparison); (C) OPS plot indicating the optimum region for selection/elimination of variables (dotted ellipses). The vertical arrow indicates
the set of variables (left of the arrow) with better prediction ability. The horizontal dashed lines indicate the statistical parameters using the full data
set. The vertical and horizontal error bars are, respectively, the standard deviations of three replicates in the graphs A and B.
maximum RMSECV (0.685, right vertical line) and the minimum                        After variable selection, the data set was split into 85 samples
RMSECV (0.463, left vertical line). The combination Reg.-SqRes                  for modeling and 35 samples for external validation.
showed the least RMSECV value and so it was selected as the                        The statistical parameters for the final model are shown in
informative vector. Note that in Figure 7B the error of the SqRes               Table II. The results for the full model (no variables excluded) are
vector is very close to that of the full data set. However, when                included for comparison. Approximately 18% of total variables
SqRes was combined with Reg., (RS), a significant decrease in                   were selected and a meaningful improvement in the prediction
RMSECV was observed. This result justifies the study of the                     capacity was obtained. The selected variables are marked in the
informative vector combinations.                                                spectra of Figure 8. Note that selected regions are coherent with
   Figure 7C shows the OPS plot indicating the optimum region                   peaks obtained from the active substance. Several variables were
for selection/elimination of variables (vertical arrow and ellipses).           selected in the region around of 2233 nm originating from the
From 3401 original variables, 595 were selected with significant
improvement of the prediction ability of the model. The first
window and increment values in the OPS algorithm were 300 and
5, respectively.
 Table II. Statistical results for active substance for data set
 Raman
Activity
 Vector                           —                         Reg.-SqRes
 hOPS                             —                              9
 hMod.                                            3
 nVars                          3401                             595
 RMSECV                          0.69                           0.44
 Rcv                            0.800                           0.924           Figure 8. Selected variables for the active substance containing the
 RMSEP                           0.66                           0.49            cyanide group (C –– N). Variable selection was performed on the second
 Rp                             0.792                           0.890           derivative spectra. This figure is available in color online at www.
                                                                                interscience.wiley.com/journal/cem
                                                                                                                                                               41
J. Chemometrics 2009; 23: 32–48                 Copyright ß 2008 John Wiley & Sons, Ltd.                    www.interscience.wiley.com/journal/cem
                                                                                                           R. F. Teófilo, J. P. A. Martins, M. M. C. Ferreira
     Figure 9. (A and B) Pure excitation and emission spectra for the six fluorophores. The regions inside the vertical dotted lines in plots A and B were used
     for data analysis. (C) Representative landscape spectrum from one mixture of fluorophores. This figure is available in color online at www.interscience.
     wiley.com/journal/cem
     cyanide group. The spread of selected variables along the spectra              the plots with a wide gap between values for the full data set
     is justified due to the presence of several other peaks in the tablet          (horizontal dashed lines) and optimum region for selection/
     spectrum derived from the active substance [40] and also to the                elimination of variables (indicated by dotted circles).
     complexity of samples. Dyrby et al. [40] have tried to select                     Table III shows significant improvement in the prediction
     variables by using the iPLS method, but no improvement was                     ability of models with a few variables, except for HYDR. An
     obtained. The use of a genetic algorithm for variable selection in             interesting fact observed was that the vector Corr has potential as
     this data set is also not appropriate because it is time consuming             an informative vector for variable selection, especially when
     due to the high number of variables. On the other hand, the OPS                combined with the Reg vector. However, the vector Reg is once
     method was computationally efficient, and its high potential to                more the best informative vector for variable selection.
     select interpretative variables has been shown again in this                      Figure 12 shows the variables selected by the OPS method.
     example.                                                                       Note that specific peaks were selected, where each peak
                                                                                    corresponded to the emission spectra of a given excitation
                                                                                    wavelength. Within each peak selected specific wavelengths
                                                                                    were chosen. In all cases an interpretation of selected variables is
     4.3.   Fluorescence data set                                                   possible when comparing with the respective excitation and
     Fluorescence spectroscopy has been used in many scientific                     emission wavelengths from pure spectra (see Figure 9).
     fields such as chemistry, medicine, environmental and food                        The variables selected for CATE are approximately located in
     science. However, the fluorescence signal can be rather complex                the center of excitation band where most of the peaks are
     and, therefore, the analysis may become complicated owing to                   superimposed and at short wavelengths from emission spectra
     interferences, scattering, overlapping signals, etc. [46].                     (Figures 12A and 12Az). For RESO (Figures 12B and 12Bz) and
        When the fluorescence spectrum of a sample is measured at                   TYRO (Figures 12C and 12Cz), similar spectral regions are
     several emission and excitation wavelengths, the data analysis                 expected. It is very clear that the selected variables are situated
     can be carried out by using multi-way methods, or applying                     mainly in the central part of excitation spectra and at short
     two-way methods on the unfolded matrices. In this work the data                wavelengths of the emission band. Finally, for TRYP, though the
     was unfolded and the PLS regression method employed. Such                      position in excitation and emission spectra are dislocated to
     types of data contain several peaks with different intensities and             longer wavelengths (see pure spectra in Figures 9A and B), the
     with a relatively symmetric behavior. Besides, the excitation and              selected variables follow exactly the same tendency, where the
     emission bands are highly overlapped as shown in Figure 9. These               center to right part of excitation band and final part of
     two factors make this data set very complicated for the selection              emission spectra were indicated by the OPS method
     of interpretable variables from multiple peaks.                                (Figures 12F and 12Fz). For other phenols (not shown), the
        Plots for determination of the number of components, hOPS, to               behavior follows the expected trends.
     build the informative regression vectors that will be used for                    This conclusion reinforces the high performance of the OPS
     variable selection are shown in Figure 10. As expected, hOPS                   method in selecting meaningful variables.
     values [12 for CATE (Figure 10A), 9 for INDO (Figure 10B), 9 for
     TRYP (Figure 10C) and 10 for TYRO (Figure 10D)], were higher
     than hMod, which corresponds to the first point in each plot
                                                                                    4.4.   GC data set
     (hMod ¼ 6, except for HYDR and INDO where hMod ¼ 5).
        The OPS plots in Figure 11 show distinct trends of RMSECV for               A constant issue in industry is the analysis of finished products for
     different analytes. However, for all analytes, a few variables were            quality control and chromatography has shown to be an
     selected and the improvements were significant, as observed in                 extremely versatile technique in this context. The data set used
42
     www.interscience.wiley.com/journal/cem                       Copyright ß 2008 John Wiley & Sons, Ltd.                   J. Chemometrics 2009; 23: 32–48
Sorting variables feature selection in multivariate regression
Figure 10. Decrease of RMSECV with increasing number of components to build the regression vector used as an informative vector. The error bars are
the standard deviations of three replicates. This behavior is illustrated for (A) CATE, (B) INDO, (C) TRYP and (D) TYRO.
here was extracted from PirouetteTM package and contains peak                   Table IV shows the improvement in the models obtained by the
areas from gas chromatography applied to fuel. This example is a              selected peak areas using the OPS method. The variables selected
challenge to improve the prediction power by variable selection.              by importance in decreasing order were: flash point [12, 16 and
In this case, the OPS algorithm was applied to the autoscaled data            32]; specific gravity [8, 31 32, 27, 26, 2, 28 and 25]; freezing point
using leave-one-out cross-validation.                                         [18, 29, 24 and 16].
Figure 11. OPS plots for (A) CATE, (B) INDO, (C) TRYP and (D) TYRO. The horizontal dashed lines show statistical parameters for the full data set. The
arrows indicate the set of variables (left from the arrow) with better prediction ability. The dotted circles indicate the optimum region for selection/
elimination of variables. For comparison, each result was obtained with a fixed number hMod equal to 6 for the majority of analytes, except HYDR and
INDO where hMod ¼ 5. The cross validation method used was leave-twenty five-out.
                                                                                                                                                           43
J. Chemometrics 2009; 23: 32–48                Copyright ß 2008 John Wiley & Sons, Ltd.                  www.interscience.wiley.com/journal/cem
                                                                                                          R. F. Teófilo, J. P. A. Martins, M. M. C. Ferreira
Table III. Statistical results for all analytes from the fluorescence data set
Full model OPS model Full model OPS model Full model OPS model
     4.5.   QSAR data set                                                         of view. Thus, variable selection is essential to obtain a well
                                                                                  interpretable model.
     Variable selection is essential in QSAR studies. A good and useful              The OPS algorithm was applied to the autoscaled data using
     QSAR model for prediction and interpretation has low errors and              leave-one-out cross-validation and the results can be seen in
     must be understandable from the chemical and biological points               Table V.
     Figure 12. Selected variables for the four phenolic compounds. Each peak corresponds to one of the 11 emission spectra recorded. (A) CATE, (Az) zoom
     of the selected CATE emission peaks, (B) RESO, (Bz) zoom of the selected RESO spectral region, (C) TYRO, (Cz) zoom of the selectd TYRO emission peaks
     and (D) TRYP, (Dz) zoom of the selected TRYP emission peaks.
44
     www.interscience.wiley.com/journal/cem                     Copyright ß 2008 John Wiley & Sons, Ltd.                   J. Chemometrics 2009; 23: 32–48
Sorting variables feature selection in multivariate regression
Table IV. Statistical results from different models* of physical properties obtained from GC peak areas
Full model OPS model Full model OPS model Full model OPS model
Figure 13. Plots for model validation. Chance correlation plot (A) and leave-N-out crossvalidation (N varied from 1 to 5) plots with N versus RMSECV
(B) and Q2 (C).
                                                                                                                                                       45
J. Chemometrics 2009; 23: 32–48              Copyright ß 2008 John Wiley & Sons, Ltd.                  www.interscience.wiley.com/journal/cem
                                                                                                         R. F. Teófilo, J. P. A. Martins, M. M. C. Ferreira
     Figure 14. Selected variables for the four-component simulated data. (A) analyte 1, (B) analyte 2, (C) analyte 3 and (D) analyte 4. The black solid line
     indicates the analyte being studied.
46
     www.interscience.wiley.com/journal/cem                      Copyright ß 2008 John Wiley & Sons, Ltd.                  J. Chemometrics 2009; 23: 32–48
Sorting variables feature selection in multivariate regression
combination with other vectors was necessary for obtaining                   3. Forina M, Lanteri S, Oliveros M, Millan CP. Selection of useful pre-
better results. This makes clear the necessity of studying other                dictors in multivariate calibration. Anal. Bioanal. Chem. 2004; 380:
related vectors and the combinations among themselves.                          397–418.
                                                                             4. Xu L, Schechter I. Wavelength selection for simultaneous spectro-
Figure 14 shows the selected regions and the pure spectrum                      scopic analysis: experimental and theoretical study. Anal. Chem. 1996;
of the four components. The densest regions for analyte 1 are in                68: 2392–2400.
the middle and far right side of the pure spectrum and for                   5. Nadler B, Coifman RR. The prediction error in CLS and PLS: the
analyte 3, two informative regions one in the beginning and                     importance of feature selection prior to multivariate calibration.
another at the end were selected, indicating that in both cases                 J. Chemometr. 2005; 19: 107–118.
                                                                             6. Ferreira MMC, Multivariate QSAR. J. Braz. Chem. Soc. 2002; 13:
the selected variables are indeed informative. For these two                    742–753.
analytes, the informative vector was composed by a combination               7. Jouanrimbaud D, Walczak B, Massart DL, Last IR, Prebble KA. Com-
including the regression vector. Variable selection was also very               parison of multivariate methods based on latent vectors and methods
effective for analyte 2. In the case of analyte 4, the best results             based on wavelength selection for the analysis of near-infrared
                                                                                spectroscopic data. Anal. Chim. Acta 1995; 304: 285–295.
were given when using the regression vector per se and hOPS ¼ 7              8. Henrique CM, Teófilo RF, Sabino L, Ferreira MMC, Cereda MP. Classi-
and hMOD ¼ 4. For this data set, although there are several                     fication of cassava starch films by physicochemical properties and
variables selected in the far left side, the results can be considered          water vapor permeability quantification by FTIR and PLS. J. Food Sci.
reasonable for interpretation, since the relevant variables were                2007; 72: E184–E189.
also selected.                                                               9. Jiang JH, Berry RJ, Siesler HW, Ozaki Y. Wavelength interval selection in
                                                                                multicomponent spectral analysis by moving window partial
                                                                                least-squares regression with applications to mid-infrared and near-
                                                                                infrared spectroscopic data. Anal. Chem. 2002; 74: 3555–
                                                                                3565.
5.    CONCLUSIONS                                                           10. Centner V, Massart DL, deNoord OE, De Jong S, Vandeginste BM,
                                                                                Sterna C. Elimination of uninformative variables for multivariate
The application of the OPS method enables variable selection in                 calibration. Anal. Chem. 1996; 68: 3851–3858.
analysis of multivariate data sets with abilities to                        11. Herrero A, Ortiz MC. Qualitative and quantitative aspects of the
                                                                                application of genetic algorithm-based variable selection in polaro-
(1) improve the model’s prediction power;                                       graphy and stripping voltammetry. Anal. Chim. Acta 1999; 378:
(2) improve the interpretability of the selected variables;                     245–259.
(3) reduce significantly the number of variables in the final               12. Hoskuldsson A. Variable and subset selection in PLS regression.
                                                                                Chemometr. Intell. Lab. Syst. 2001; 55: 23–38.
    model;                                                                  13. Hoskuldsson A. Analysis of latent structures in linear models.
(4) be useful for different data set types and                                  J. Chemometr. 2003; 17: 630–645.
(5) be a feature selection method, simple and effective.                    14. Chong IG, Jun CH. Performance of some variable selection methods
                                                                                when multicollinearity is present. Chemometr. Intell. Lab. Syst. 2005;
                                                                                78: 103–112.
  Due to the criteria presented to select variables, this                   15. Helland IS. On the structure of partial least squares regression.
methodology has shown to be robust and avoid overfitting                        Commun. Stat.: Simul. Comput. 1988; 17: 581–607.
and chance correlation.                                                     16. Williams RP, Swinkels AJ, Maeder M. Identification and application of a
  Among all vectors investigated, the regression vector Reg.                    prognostic vector for use in multivariate calibration and prediction.
                                                                                Chemometr. Intell. Lab. Syst. 1992; 15: 185–193.
and the correlated vector NAS are the most promising to sort                17. Brown PJ. Wavelength selection in multicomponent near-infrared
the best variables. Moreover, a new criterion to choose the                     calibration. J. Chemometr. 1992; 6: 151–161.
number of components to define the Reg. and NAS vectors is                  18. Miller CE. The use of chemometric techniques in process analytical
presented. Applying this criterion, the number of components                    method development and operation. Chemometr. Intell. Lab. Syst.
selected to define Reg. and NAS is higher than that used to build               1995; 30: 11–22.
                                                                            19. Wold S, Johansson E, Cocchi M. 3D QSAR. In Drug Design: Theory,
the model.                                                                      Methods, and Applications, Escom T (ed.). Holland: Leiden, 1993;
  The other vectors, VIP, CovProc, SqRes, StN and their                         523–550.
combinations, although having lower performance with respect                20. Dodds SA, Heath WP. Construction of an online reduced-spectrum NIR
to Reg. and NAS vectors in the examples presented, could be                     calibration model from full-spectrum data. Chemometr. Intell. Lab.
more appropriate for some other specific data types. Besides,                   Syst. 2005; 76: 37–43.v
                                                                            21. Reinikainen SP, Hoskuldsson A. COVPROC method: strategy in mod-
other vectors can be introduced in the future.                                  eling dynamic systems. J. Chemometr. 2003; 17: 130–139.
  Although OPS was used in tandem with PLS regression, it can               22. Teófilo RF. Chemometric Methods in the Electrochemical Studies of
be combined with other regression methods.                                      Phenols on Boron-Doped Diamond Films. PhD Thesis, Universidade
                                                                                Estadual de Campinas, Campinas, 2007; 1–109.
                                                                            23. Ribeiro JS, Teófilo RF, Martins JPA, Ferreira MMC. Melhorias na Análise
                                                                                Discriminatória de Cafés Comerciais
                                                                                                              1
                                                                                                                        Brasileiros Aplicando o Algoritmo
                    Acknowledgements                                            de Seleção de Variáveis OPS Sobre Dados Cromatográficos. 148 ENQA,
                                                                                João Pessoa, PB, 2007, Abstract 104.
The authors acknowledge CNPq for financial support and Dr Carol             24. Manne R. Analysis of 2 partial-least-squares algorithms for multi-
H. Collins for English revision. They thank Dr. Rudolf Kiralj for his           variate calibration. Chemometr. Intell. Lab. Syst. 1987; 2: 187–197.
significant suggestions and contributions.                                  25. Golub GH, Kahan W. Calculating the singular values and pseudo-
                                                                                inverse of a matrix. SIAM J. Num. Anal. Ser. B. 1965; 2: 205–224.
                                                                            26. Golub GH, Van Loan CF. Matrix Computation, (2nd edn). Johns Hopkins
                                                                                University Press: Baltimore, 1989; 498–499.
REFERENCES                                                                  27. Lorber A, Kowalski BR. A note on the use of the partial least-squares
                                                                                method for multivariate calibration. Appl. Spectrosc. 1988; 42:
 1. Martens H, Naes T. Multivariate Calibration. Wiley: New York, 1929;         1572–1574.
    523–550.                                                                28. Phatak A. Evaluation of Some Multivariate Methods and Their Appli-
 2. Fairchild SZ, Kalivas JH. PCR eigenvector selection based on corre-         cations in Chemical Engineering, PhD Thesis, University of Waterloo,
    lation relative standard deviations. J. Chemometr. 2001; 15: 615–625.       Ontario, 1993; 1–58.
                                                                                                                                                            47
J. Chemometrics 2009; 23: 32–48               Copyright ß 2008 John Wiley & Sons, Ltd.                  www.interscience.wiley.com/journal/cem
                                                                                                          R. F. Teófilo, J. P. A. Martins, M. M. C. Ferreira
     29. Elden L. Partial least-squares vs. Lanczos bidiagonalization—I:           40. Dyrby M, Engelsen SB, Norgaard L, Bruhn M, Lundsberg-Nielsen L.
         analysis of a projection method for multiple regression. Comput. Stat.        Chemometric quantitation of the active substance (containing CN)
         Data Anal. 2004; 46: 11–31.                                                   in a pharmaceutical tablet using near-infrared (NIR) transmittance
     30. Pell RJ, Ramos LS, Manne R. The model space in partial least square           and NIR FT-Raman spectra. Appl. Spectrosc. 2002; 56: 579–
         regression. J. Chemometr. 2007; 21: 165–172.                                  585.
     31. Ergon R. PLS score-loading correspondence and a bi-orthogonal             41. Bro R, Rinnan A, Faber NM. Standard error of prediction for multilinear
         factorization. J. Chemometr. 2002; 16: 368–373.                               PLS—2: practical implementation in fluorescence spectroscopy. Che-
     32. Mosteller F, Tukey JW. Data Analysis and Regression: A Second Course in       mometr. Intell. Lab. Syst. 2005; 75: 69–76.
         Statistics. Addison Wesley: London, 1977; 299–331.                        42. Bahram M, Bro R, Stedmon C, Afkhami A. Handling of Rayleigh and
     33. Montgomery DC, Runger GC. Applied Statistics and Probability for              Raman scatter for PARAFAC modeling of fluorescence data using
         Engineers, (3nd edn). Wiley: New York, 2003; 506–555.                         interpolation. J. Chemometr. 2006; 20: 99–105.
     34. Wold S, Johansson E, Cocchi M. PLS-partial least squares projections      43. Pirouette 4.0, Infometrix, Inc., Bothwell, WA, 2007.
         to latent structures. In 3D QSAR in Drug Design, Kubinyi H (ed.). ESCOM   44. Kiralj R, Ferreira MMC. A priori molecular descriptors in QSAR: a case of
         Science Publishers: Leiden, 1993; 523–548.                                    HIV-1 protease inhibitors. I. The chemometric approach. J. Mol. Graph.
     35. Ferre J, Faber NM. Net analyte signal calculation for multivariate            2003; 21: 435–448.
         calibration. Chemometr. Intell. Lab. Syst. 2003; 69: 123–136.             45. Holloway M, Wai J, Halgren T, Fitzgerald P, Vacca J, Dorsey B, Levin R,
     36. Faber NM. Efficient computation of net analyte signal vector in inverse       Thompson W, Chen L, Desolms S, Gaffin N, Ghosh A, Giuliani E,
         multivariate calibration models. Anal. Chem. 1998; 70: 5108–5110.             Graham S, Guare J, Hungate R, Lyle T, Sanders W. A priori prediction
     37. Bro R, Andersen CM. Theory of net analyte signal vectors in inverse           of activity for HIV-1 protease inhibitors employing energy minimiz-
         regression. J. Chemometr. 2003; 17: 646–652.                                  ation in the active site. J. Med. Chem. 1995; 38: 305–317.
     38. Kennard RW, Stone LA. Computer aided design of experiments.               46. Andersen CM, Bro R. Practical aspects of PARAFAC modeling of
         Technometrics 1969; 11: 137–148.                                              fluorescence excitation-emission data. J. Chemometr. 2003; 17:
     39. Soyemi OO, Busch MA, Busch KW. Multivariate analysis of near-                 200–215.
         infrared spectra using the G-programming language. J. Chem Inf.           47. Wold S, Eriksson L. Statistical Validation of QSAR Results. VCH: Wein-
         Comput. Sci. 2000; 40: 1093–1100.                                             heim, 1995; 309–318.
48
www.interscience.wiley.com/journal/cem Copyright ß 2008 John Wiley & Sons, Ltd. J. Chemometrics 2009; 23: 32–48