Neural Processing Letters (2006) 23:89–101 © Springer 2006
DOI 10.1007/s11063-005-3500-3
Multi-Classification by Using Tri-Class SVM
CECILIO ANGULO¹,*, FRANCISCO J. RUIZ¹, LUIS GONZÁLEZ² and JUAN ANTONIO ORTEGA³
¹ Grup de Recerca en Enginyeria del Coneixement, Universitat Politècnica de Catalunya,
Av. Víctor Balaguer s/n, 08800 Vilanova i la Geltrú, Spain. e-mail: cecilio.angulo@upc.edu
² Departamento de Economía Aplicada I, Universidad de Sevilla, Avenida Ramón y Cajal 1,
41018 Sevilla, Spain
³ Escuela Técnica Superior de Ingeniería Informática, Universidad de Sevilla, Avenida Reina
Mercedes s/n, 41012 Sevilla, Spain
* Corresponding author.
Abstract. The standard form for dealing with multi-class classification problems when bi-
classifiers are used is to consider a two-phase (decomposition, reconstruction) training
scheme. The most popular decomposition procedures are pairwise coupling (one versus one,
1-v-1), which considers a learning machine for each pair of classes, and the one-versus-all
scheme (one versus all, 1-v-r), which takes into consideration each class versus the remain-
ing classes. In this article a 1-v-1 tri-class Support Vector Machine (SVM) is presented. The
expansion of the architecture of this machine into three categories specifically addresses the
decomposition problem of how to prevent the loss of information which occurs in the usual
1-v-1 training procedure. The proposed machine, by means of a third class, allows all the
information to be incorporated into the remaining training patterns when a multi-class prob-
lem is considered in the form of a 1-v-1 decomposition. Three general structures are pre-
sented where each improves some features from the precedent structure. In order to deal
with multi-classification problems, it is demonstrated that the final machine proposed allows
ordinal regression as a form of decomposition procedure. Examples and experimental results
are presented which illustrate the performance of the new tri-class SV machine.
Key words. bi-classifier, multi-classification, ordinal regression, Support Vector Machine
Abbreviations. 1-v-1 – one versus one; all versus all; pairwise coupling; 1-v-r – one ver-
sus the rest; one versus all; s.t. – subject to; SV – Support Vector; SVM – Support Vector
Machine
1. Introduction
Support Vector Machines (SVMs) are learning machines which implement the
structural risk minimization inductive principle to obtain good generalization on
a limited number of learning patterns. This theory was originally developed by
Vapnik on the basis of a separable binary classification problem with signed out-
puts ±1 [21].
The SVM presents good theoretical properties and behaviour in problems of
binary classification [9]. There are several papers which generalize the original
bi-class approach to multi-classification problems [16, 17, 1] through different algo-
rithms, such as 1-v-r SVM or 1-v-1 SVM (see [15] for a comparison of SVM
multi-class methods). In this work, problems with more than two classes are considered;
hence the original bi-class SVM is extended to a
more general tri-class SVM approach. The proposed final tri-class machine is
presented in a three-stage procedure: first the original idea of a third class is
introduced which was developed by Angulo and Català [3, 2]; secondly a more spe-
cific machine, as proposed by Angulo and González [5] is presented; finally, the
proposed novel tri-class SVM is explained, which implies a huge computational
cost reduction with respect to the former proposals, and a meeting point for both
classification and ordinal regression techniques.
The rest of the article is organized as follows: in Section 2, the standard SVM
classification learning paradigm is briefly presented in order to introduce some
notation. Section 3 is devoted to a short introduction about SVMs for multi-
classification. In Section 4, the 1-v-1 tri-class SV Machine is presented, and its
faster computational counterpart is derived in Section 5. Examples and experi-
mental results are displayed in Section 6 to illustrate its behaviour and strengths.
Finally, some conclusions are drawn and future research suggested.
2. Bi-Class SV Machine Learning
The SV Machine is an implementation of a more general regularization principle
known as the large margin principle. Let
Z = {(x1 , y1 ), . . . , (xn , yn )} = {z1 , . . . , zn } ∈ (X × Y)n (1)
be a training set, where X is the input space and
Y = {θ1 , θ2 } = {−1, +1} (2)
the output space. Let
φ : X → F ⊆ Rd (3)
be a feature mapping, with φ = (φ1 , . . . , φd ), for the usual ‘kernel trick’. F is
named feature space. Let
x := φ(x) ∈ F (4)
be the representation of x ∈ X. A binary linear classifier,
f_w(x) = ⟨φ(x), w⟩ + b = ⟨x, w⟩ + b (5)
is sought in the space F, with fw : X → F → R, b ∈ R, and where outputs are
obtained by thresholding the classifier, hw (x) = sign(fw (x)). According to [12], the
classifier w with the largest geometrical margin on a given training sample Z can
be written as
w_SVM := arg max_{w∈F} (1/‖w‖) · min_{z_i∈Z} y_i ⟨x_i, w⟩. (6)
One practical method of dealing with the problem is to minimize the norm w in
(6) with the geometrical margin fixed to unity
min_{w∈F} (1/2)‖w‖²
s.t. y_i ⟨x_i, w⟩ ≥ 1, z_i ∈ Z. (7)
The solution can be expressed in the form
w_SVM = Σ_i α_i y_i x_i;   f_{w_SVM}(x) = Σ_i α_i y_i k(x_i, x), (8)
where k(x, x′) = ⟨φ(x), φ(x′)⟩ = ⟨x, x′⟩ is the kernel function, and only a few α_i are
non-zero; those associated with the so-called support vectors.
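As a minimal illustration of (8), the following Python sketch evaluates the dual-form decision function with a Gaussian kernel; the support vectors, multipliers and bias below are placeholder values, not the solution of a trained machine.

    import numpy as np

    # Sketch of the bi-class SV decision function in dual form,
    # f(x) = sum_i alpha_i * y_i * k(x_i, x) + b, with a Gaussian kernel.
    # The alphas below are illustrative placeholders, not a trained solution.

    def gaussian_kernel(xi, xj, sigma=1.0):
        return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

    def decision_function(x, support_vectors, alphas, labels, b=0.0, sigma=1.0):
        return sum(a * y * gaussian_kernel(xi, x, sigma)
                   for a, y, xi in zip(alphas, labels, support_vectors)) + b

    support_vectors = np.array([[0.0, 1.0], [1.0, 0.0]])
    alphas = np.array([0.5, 0.5])          # placeholder multipliers
    labels = np.array([+1, -1])

    x_new = np.array([0.2, 0.8])
    print(np.sign(decision_function(x_new, support_vectors, alphas, labels)))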
3. SV Machine for Multi-Classification
Let Z be a training set. Now, a set of possible labels {θ_1, . . . , θ_ℓ}, with ℓ > 2, will
be considered. Subsets Z_k ⊆ Z, defined as
Z_k = {z_i = (x_i, y_i): y_i = θ_k} (9)
generate a partition of Z, and n_k = #Z_k, hence n = n_1 + n_2 + · · · + n_ℓ. If I_k is defined
as the set of indexes i where z_i ∈ Z_k, it follows that,
∪_{i∈I_k} {(x_i, y_i)} = Z_k. (10)
A very common decomposition procedure for multi-classification when SVMs
are considered is 1-v-1 SVM: a first decomposition phase generates several learning
machines in parallel, whereby each machine takes only two classes into consid-
eration. A reconstruction scheme then allows the calculation of the overall out-
put by merging outputs from the decomposition phase. In this approach, ℓ(ℓ−1)/2
binary classifiers are trained to generate hyperplanes f_kh, 1 ≤ k < h ≤ ℓ, by separat-
ing training vectors Z_k with label θ_k from training vectors in class θ_h, Z_h. If f_kh
discriminates without error then sign(f_kh(x_i)) = 1 for z_i ∈ Z_k, and sign(f_kh(x_i)) =
−1 for z_i ∈ Z_h. Remaining training vectors Z \ {Z_k ∪ Z_h} are not considered in the
optimization problem. Hence, for a new entry x, the numeric output from each
machine f_kh(x) is interpreted as,

(f_kh(x)) = θ_k if sign(f_kh(x)) = 1;   θ_h if sign(f_kh(x)) = −1. (11)
In the reconstruction phase, the label distribution generated by the trained
machines in the parallel decomposition is considered through a merging scheme.
The 1-v-1 multi-classification approach is usually preferred to the 1-v-r scheme
[16] because it takes less training time, despite studies such as [19]. Moreover,
according to [15] it would be difficult to say which one gives better accuracy. The
main drawback for this approach is that only data from two classes is consid-
ered for the training of each machine, therefore output variance is high and any
information from the rest of the classes is ignored.
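As an illustration of this decomposition–reconstruction scheme, the following Python sketch trains one binary classifier per pair of classes and merges their outputs by majority voting; scikit-learn's binary SVC stands in for the generic bi-classifier, and the synthetic data and names are illustrative only.

    import numpy as np
    from itertools import combinations
    from sklearn.svm import SVC

    # 1-v-1 decomposition: one binary SVM per pair of classes, trained only on
    # the patterns of those two classes; reconstruction by majority voting.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in range(3)])
    y = np.repeat([0, 1, 2], 20)

    machines = {}
    for k, h in combinations(np.unique(y), 2):
        mask = (y == k) | (y == h)              # remaining classes are discarded
        machines[(k, h)] = SVC(kernel='rbf').fit(X[mask], y[mask])

    def predict(x):
        votes = [clf.predict(x.reshape(1, -1))[0] for clf in machines.values()]
        return np.bincount(votes).argmax()      # majority vote

    print(predict(np.array([1.0, 1.0])))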
If a hyperplane f_kh must classify an input x_i with i ∉ I_k ∪ I_h, only the output
f_kh(x_i) = 0 will be translated into a correct interpretation. The natural improve-
ment to be analysed is to force every training input belonging to a class other
than θ_k and θ_h to be contained in the separating hyperplane f_kh(x) = 0.
3.1. the k-svcr machine. a first approach
In [2], the first tri-class procedure, the K-SVCR machine, was presented where
remaining training vectors are forced to be encapsulated in a δ-tube, 0 δ < 1,
along the separation hyperplane. Parameter δ allows the creation of a slack zone
(a ‘tube’) around the hyperplane where remaining training vectors are covered. The
separating hyperplane must solve the optimization problem,
min_{w∈F} (1/2)‖w‖² + C_1 · Σ_i ξ_i + C_2 · Σ_j (ϕ_j + ϕ_j^*)

s.t. y_i ⟨w, x_i⟩ ≥ 1 − ξ_i, z_i ∈ Z_{1,3}
     −δ − ϕ_j^* ≤ ⟨w, x_j⟩ ≤ δ + ϕ_j, z_j ∈ Z_2        (12)
     ξ_i ≥ 0, z_i ∈ Z_{1,3}
     ϕ_j, ϕ_j^* ≥ 0, z_j ∈ Z_2,
where Z_{1,3} are the patterns belonging to the classes labelled as {−1, +1} and Z_2
are those labelled with 0. The solution has a similar form to (8), where the α_i are the
multipliers associated with the problem, such that Σ_i α_i = 0. For a new entry x, the
numeric output from the machine f_w(x) is interpreted as

(f_w(x)) = 1 if f_w(x) > δ;   −1 if f_w(x) < −δ;   0 if |f_w(x)| ≤ δ. (13)
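A minimal sketch of the three-way interpretation (13), assuming the numeric output f_w(x) and the parameter δ are already available:

    def ksvcr_output(f_value, delta):
        """Three-way interpretation of the K-SVCR numeric output, following (13)."""
        if f_value > delta:
            return +1
        elif f_value < -delta:
            return -1
        return 0     # pattern falls inside the delta-tube: 'remaining' class

    for f in (0.9, -0.4, 0.05):
        print(ksvcr_output(f, delta=0.1))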
This approach has demonstrated good results on standard 'benchmarks' [2];
however, for the general case, many parameters must be selected,¹ such
as: (i) k, the kernel function; (ii) C_1, the weight associated with the sum of errors in the
two discriminated classes; (iii) C_2, the weight associated with the sum of errors in the
remaining classes; (iv) δ, the insensitivity parameter.
1 An extended study can be found in [10].
3.2. robust decomposition – reconstruction procedure
The K-SVCR machine improves on standard algorithms that treat 2-class classification
problems during the decomposition phase of a general multi-class scheme: the learning is
focused on 2 classes, but all the available information in the patterns is used.
Now, a second theoretical advantage of the 'third-class approach' will be
stated: the robustness of the reconstruction procedure [6]. To make this assertion
evident, a definition is needed.
DEFINITION 1. Let x ∈ X be an entry having a known output, θ_m. Let

ε_rob(x, F) = #f_m^err / L_m

be the rate between the number of classifiers concerning class θ_m that produce a
wrong output, #f_m^err, and the total number of classifiers concerned with class
θ_m, L_m, with the final multi-class architecture output still being correct, F(x) = θ_m. The
robustness parameter

ε_rob(F) = min_{x∈X} ε_rob(x, F)
determines that a general decomposition and reconstruction multi-class architec-
ture A1 is more robust than A2 if
ε_rob^1 = min_{F∈A_1} ε_rob^1(F) > min_{F∈A_2} ε_rob^2(F) = ε_rob^2, (14)
where superscripts refer to the global architecture being considered.
Basically, the robustness parameter specifies, for the worst case, how many clas-
sifiers concerned with the class of the entry could be wrong while the multi-class
architecture output is still correct.
The following Proposition can now be stated [6].
PROPOSITION 2. If K is the number of classes in consideration, the multi-class
architecture based on a three-class machine, like the K-SVCR machine, with a voting
reconstruction scheme F has a robustness parameter

ε_rob = 2(K − 2) / (K(K − 1)).
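A quick numeric illustration of this robustness parameter (the values of K below are arbitrary):

    # Numeric check of the robustness parameter of Proposition 2,
    # eps_rob = 2(K - 2) / (K(K - 1)), for a few class counts K.
    for K in (3, 4, 5, 10):
        eps = 2 * (K - 2) / (K * (K - 1))
        print(f"K = {K:2d}  ->  eps_rob = {eps:.3f}")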
In a similar way the following Proposition can be demonstrated [6].
PROPOSITION 3. A standard multi-class architecture based on 1-v-r 2-class classi-
fiers decomposition and a voting reconstruction scheme has a robustness parameter
εrob = 0.
A standard multi-class architecture based on 1-v-1 2-class classifiers decomposition
and voting reconstruction scheme has a robustness parameter
εrob = 0.
A ‘pairwise’ multi-class architecture [16] based on 1-v-1 2-class classifiers decom-
position and ‘pairwise’ voting reconstruction scheme has a robustness parameter
εrob = 0.
A DAGSVM architecture [18] has a robustness parameter
εrob = 0.
4. 1-v-1 Tri-Class SVM. A Second Approach
The number of tuning parameters can be reduced if the margin to be maximized
in (7) is that defined between the patterns assigned with output {−1, +1}, and the
entries labelled with 0, which are the remaining patterns. In this case, the width
of the ‘decision tube’ along the decision hyperplane where 0-labelled patterns are
allocated is not considered ‘a priori’ and the δ parameter is eliminated. A classifier
with this characteristic must accomplish
w_SV3 := arg max_{w∈F} (1/‖w‖) · [ min_{z_i∈Z_{1,3}} y_i ⟨x_i, w⟩ − max_{z_i∈Z_2} |⟨x_i, w⟩| ]. (15)
When ‖w‖ is minimized while the rest of the product is fixed to the unitary dis-
tance, (15) can be translated into the more manageable²

min_{w∈F} (1/2)‖w‖²
s.t. y_i ⟨x_i, w⟩ ≥ 1 + ⟨x_j, w⟩, z_i ∈ Z_{1,3}; z_j ∈ Z_2. (16)
This optimization problem is consistent with the standard formulation since if
all the 0-labelled training patterns are exactly on the decision hyperplane, (i.e. no
incorrect interpretation is possible), or these patterns are not considered in the
problem, then the novel machine would be similar to the 1-v-1 SVM machine.
Restrictions can be relaxed to allow some degree of noise on the ±1-labelled
training patterns by using ‘slack’ variables
ξ_i = 1 + max_{z_j∈Z_2} ⟨x_j, w⟩ − y_i ⟨x_i, w⟩ ≥ 0, z_i ∈ Z_{1,3} (17)
2 Constraints are slightly stricter than (15).
and restrictions in (16) can be manipulated to obtain the optimization problem [5]
min_{w∈F} (1/2)‖w‖² + C Σ_i ξ_i

s.t. y_i ⟨x_i − x_j, w⟩ − 1 + ξ_i ≥ 0, z_i ∈ Z_{1,3}; z_j ∈ Z_2        (18)
     ξ_i ≥ 0, z_i ∈ Z_{1,3}.
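To make the structure of (18) explicit, the following sketch only builds the constraint coefficients for a toy linear case: one row is generated for every pair (z_i ∈ Z_{1,3}, z_j ∈ Z_2). Data and sizes are illustrative; no solver is invoked.

    import numpy as np

    # Constraints of (18) are indexed by pairs: y_i <x_i - x_j, w> >= 1 - xi_i,
    # one inequality per (z_i in Z_13, z_j in Z_2). Only the coefficient rows of
    # w are assembled here, for a linear kernel.
    X13 = np.array([[2.0, 0.0], [0.0, 2.0]]);  y13 = np.array([+1, -1])   # classes +/-1
    X2  = np.array([[1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])                  # 'third' class

    rows = []
    for xi, yi in zip(X13, y13):
        for xj in X2:
            rows.append(yi * (xi - xj))       # coefficient of w for the pair (i, j)
    A = np.array(rows)                        # shape: (len(X13) * len(X2), d)
    print(A.shape)                            # number of rows grows with the product of class sizes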
Applying Lagrange multipliers to the original optimization problem yields

L = (1/2)‖w‖² + C Σ_i ξ_i + Σ_{ij} α_{ij} (1 − ξ_i − y_i ⟨x_i − x_j, w⟩) − Σ_i μ_i ξ_i (19)

with

0 ≤ Σ_j α_{ij} ≤ C, z_i ∈ Z_{1,3};   w = Σ_{ij} y_i α_{ij} (x_i − x_j). (20)
The dual problem is therefore,
max_α Σ_{ij} α_{ij} − (1/2) Σ_{ij} Σ_{kl} y_i y_k α_{ij} α_{kl} ⟨x_i − x_j, x_k − x_l⟩

s.t. 0 ≤ Σ_j α_{ij} ≤ C        (21)
     α_{ij}, α_{kl} ≥ 0, z_i, z_k ∈ Z_{1,3}; z_j, z_l ∈ Z_2
and the solution function can be written,
f_w(x) = Σ_{ij} α_{ij} y_i [ k(x_i, x) − k(x_j, x) ]. (22)

For a new entry x, the output is interpreted in accordance with (13), where

δ = max_{z_j∈Z_2} f_w(x_j) = max_{z_j∈Z_2} ⟨w, x_j⟩. (23)
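A minimal sketch of the decision rule (22)–(23), with placeholder multipliers rather than a trained solution; the pairs and data below are purely illustrative.

    import numpy as np

    # Evaluate f_w(x) as in (22), obtain delta from the 0-labelled patterns as
    # in (23), and interpret the output following (13).
    def k(a, b, sigma=1.0):
        return np.exp(-np.linalg.norm(a - b) ** 2 / (2 * sigma ** 2))

    def f(x, pairs, alphas):
        # pairs: list of (x_i, y_i, x_j) with z_i in Z_13 and z_j in Z_2
        return sum(a * yi * (k(xi, x) - k(xj, x))
                   for a, (xi, yi, xj) in zip(alphas, pairs))

    pairs  = [(np.array([2.0, 0.0]), +1, np.array([1.0, 1.0])),
              (np.array([0.0, 2.0]), -1, np.array([1.0, 1.0]))]
    alphas = [0.7, 0.7]                                  # placeholder values
    Z2     = [np.array([1.0, 1.0]), np.array([0.9, 1.1])]

    delta = max(f(xj, pairs, alphas) for xj in Z2)       # delta from (23)
    fx = f(np.array([1.8, 0.2]), pairs, alphas)
    print(+1 if fx > delta else -1 if fx < -delta else 0)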
In Figure 1, the behaviour of the 1-v-1 tri-class machine is illustrated by using
a simple linearly separable problem with a Gaussian kernel. Support vectors (SVs)
are those patterns with associated non-null parameters, i.e. a non-null row or column in the
parameter matrix. As expected, the number of support vectors is limited and they
lie on the margin. Solid lines indicate the δ-tube for the 'remaining vectors' belong-
ing to the 'third class' and the dotted line represents the separating hyperplane. It
must be noted that the obtained values of δ (0 < δ < 1) are very low: in this example 0.1126, 0.1750
and 0.2159.
Figure 1. Results of the 1-v-1 tri-class machine applied to a simple separable problem with 45 patterns: (a) training data; (b) class 1 vs 2, 9 SVs; (c) class 1 vs 3, 9 SVs; (d) class 2 vs 3, 8 SVs.
5. 1-v-1 Tri-Class SVM Revised. A Third Approach
By means of a tri-class scheme, both the K-SVCR and the 1-v-1 tri-class SVM
allow the incorporation of all the information contained in the training patterns
when a multi-class problem is considered. For the 1-v-1 tri-class SVM, information
from ‘remaining patterns’ is captured in a δ-tube, where δ is an optimal param-
eter which is automatically obtained by maximizing the margin between classes.
However, this automatic tuning of the parameter leads to a computationally more
expensive optimization problem. This computational effort must be reduced.
By observing the nature of the constraints in the optimization problem (18), an
almost direct relation with respect to ordinal regression problems could be investi-
gated. In this sense, Shashua and Levin [20] have recently developed a fixed mar-
gin strategy to deal with ordinal regression problems by means of large margin
algorithms such as SVMs. This strategy considers all the classes at once, but with-
out squaring the size of the training data. Hence, the procedure seeks parallel
hyperplanes by separating consecutive classes through the optimization problem
min_{w∈F; b_j∈R} (1/2)‖w‖² + C Σ_j Σ_i (ξ_i^j + ξ_i^{*j+1})

s.t. ⟨x_i, w⟩ − b_j ≤ −1 + ξ_i^j, z_i ∈ Z_j
     ⟨x_i, w⟩ − b_j ≥ 1 − ξ_i^{*j+1}, z_i ∈ Z_{j+1}        (24)
     ξ_i^j, ξ_i^{*j+1} ≥ 0

where j = 1, . . . , ℓ − 1.
When comparing the 1-v-1 tri-class approach (18) and the formulation in (24),
it follows that (18) can be obtained from (24) when the number of categories to
be considered is three, ℓ = 3, if the constraints which share the same bias b_j are sub-
tracted, and a double value for the margin is considered. Hence,
min_{w∈F; b_1,b_2∈R} (1/2)‖w‖² + C Σ_i (ξ_i^1 + ξ_i^2 + ξ_i^{*2} + ξ_i^{*3})

s.t. ⟨x_i, w⟩ − b_1 ≤ −1 + ξ_i^1, z_i ∈ Z_1
     ⟨x_i, w⟩ − b_1 ≥ 1 − ξ_i^{*2}, z_i ∈ Z_2
     ⟨x_i, w⟩ − b_2 ≤ −1 + ξ_i^2, z_i ∈ Z_2        (25)
     ⟨x_i, w⟩ − b_2 ≥ 1 − ξ_i^{*3}, z_i ∈ Z_3
     ξ_i^1, ξ_i^2, ξ_i^{*2}, ξ_i^{*3} ≥ 0
leads to

min_{w∈F; b_1,b_2∈R} (1/2)‖w‖² + C Σ_i (ξ_i^1 + ξ_i^2 + ξ_i^{*2} + ξ_i^{*3})

s.t. ⟨x_j − x_i, w⟩ ≥ 2 − ξ_i^1 − ξ_j^{*2}, z_i ∈ Z_1; z_j ∈ Z_2        (26)
     ⟨x_i − x_j, w⟩ ≥ 2 − ξ_j^2 − ξ_i^{*3}, z_i ∈ Z_3; z_j ∈ Z_2
     ξ_i^1 + ξ_j^{*2} ≥ 0,  ξ_j^2 + ξ_i^{*3} ≥ 0

which is the same problem as (18) but with a double margin.
Indeed, it has been demonstrated that this ordinal regression approach can be
used in a similar way to the tri-class SVM in the decomposition – reconstruction
multi-classification procedure established in previous sections, by separating all the
patterns into three ensembles, labelled {−1, 0, 1}.
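A sketch of this relabelling step, with an illustrative helper function and toy labels:

    import numpy as np

    # For each pair of classes (k, h), patterns of class k are relabelled +1,
    # those of class h -> -1, and all remaining patterns -> 0, so that no
    # training information is discarded by the tri-class machine.
    def tri_class_labels(y, k, h):
        out = np.zeros_like(y)
        out[y == k] = +1
        out[y == h] = -1
        return out

    y = np.array([0, 0, 1, 1, 2, 2])
    print(tri_class_labels(y, k=0, h=1))     # -> [ 1  1 -1 -1  0  0]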
The size of the optimization problem associated with the 1-v-1 tri-class machine
has been drastically reduced. Hence, if a multi-classification problem of ℓ classes
is considered, where each class has the same number of patterns, (i.e. n patterns
for the classes labelled ±1 and (ℓ−2)n patterns for the 0-labelled class), the first opti-
mization problem has to fulfil a number of restrictions of O(n²), while the new
version has an order of O(n). When all the necessary 1-v-1 tri-class machines are
considered in the multi-classification scheme, ℓ(ℓ−1)/2, then the global number of
constraints is O(ℓ²n).
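An illustrative count of the inequality constraints of one tri-class machine (non-negativity rows ignored) under the equal-class-size assumption above:

    # Rough counts for one tri-class machine when the +/-1 classes hold n
    # patterns each and the 0-labelled class holds the (l - 2)n remaining ones.
    def constraints_pairwise(n, l):
        return 2 * n * (l - 2) * n        # formulation (18): one row per (i, j) pair

    def constraints_ordinal(n, l):
        return 2 * n + 2 * (l - 2) * n    # formulation (25): one or two rows per pattern

    for n in (10, 100, 1000):
        print(n, constraints_pairwise(n, l=5), constraints_ordinal(n, l=5))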
In Figure 2, the performance of the novel machine is shown when it is applied
with a Gaussian kernel to a non-linearly separable multi-class problem. Classifiers
are combined by a majority voting scheme to produce the final multi-class classi-
fication. It can be observed that a small band between classes remains unclassified,
since the outputs from the parallel decomposition phase assign this zone to differ-
ent classes.
6. Experimental Results
In this section, experimental results are presented for several problems from the
usual UCI Repository of machine learning databases [7]. A summary of the
characteristics of the selected datasets (Iris, Wine, Glass, Vowel, Vehicle and DNA)
is given in Table 1. The DNA dataset contains separate training and testing data.

Figure 2. Results of the 1-v-1 tri-class machine applied to a simple separable problem with 45 patterns.

Table 1. Characteristics of the selected datasets from the UCI repository.

Dataset    Patterns      Classes   Features
Iris       150           3         4
Wine       178           3         13
Glass      214           6         9
Vowel      528           11        10
Vehicle    846           4         18
DNA        2000 (1186)   3         180
The results have been obtained by following the experimental framework which
was proposed by [15] and was continued in [1], but with some modifications intro-
duced to incorporate the suggestions in [14] and [22]. Hence, training data have
not been scaled for their inclusion in [−1, +1], but have been normalized, (that is,
mean zero and standard deviation one), in order to avoid problems with outliers.
Test data are normalized accordingly.
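A minimal sketch of this normalization step with toy values; the training statistics are reused on the test data:

    import numpy as np

    # Standardize training data to zero mean and unit standard deviation per
    # feature, then apply the same (training) transform to the test data.
    X_train = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 3.4]])
    X_test  = np.array([[5.8, 2.7]])

    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    X_train_n = (X_train - mu) / sd
    X_test_n  = (X_test - mu) / sd
    print(X_test_n)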
The algorithms considered are the standard 1-v-1 and 1-v-r formulation and the
1-v-1 Tri-Class SVM in its final revised form for multi-classification. Their perfor-
mance, (in the form of accuracy rate), has been evaluated on models using the
Gaussian kernel,

k(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) ) (27)

therefore two hyperparameters must be set: the regularization term C and the
width of the kernel σ. This space is explored on a two-dimensional grid with the
following values: C = {2^12, 2^11, . . . , 2^−2} and γ = {2^4, 2^3, . . . , 2^−10}, where γ = 1/(2σ²).
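The following sketch reproduces only this model-selection protocol, using scikit-learn's standard 1-v-1 SVC on the Iris data as a stand-in for the machines compared here; the grid matches the ranges above, with γ = 1/(2σ²).

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.datasets import load_iris

    # Explore the (C, gamma) grid with the Gaussian (RBF) kernel and ten-fold
    # cross-validation on normalized data, as described in the text.
    X, y = load_iris(return_X_y=True)
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    grid = {'C': [2.0 ** e for e in range(-2, 13)],
            'gamma': [2.0 ** e for e in range(-10, 5)]}
    search = GridSearchCV(SVC(kernel='rbf'), grid, cv=10).fit(X, y)
    print(search.best_params_, round(search.best_score_, 4))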
Table 2. A comparison of the best accuracy rates using the RBF
kernel.
Dataset CV 1-v-1 (C,γ ) 1-v-r (C,γ ) Tri-class (C,γ )
Iris 30 96.73 (20 , 23 ) 96.00 (26 , 24 ) 95.49 (28 , 2−2 )
Wine 25 98.39 (211 , 23 ) 97.86 (22 , 23 ) 97.06 (27 , 23 )
Glass 10 70.91 (23 , 21 ) 71.11 (29 , 24 ) 71.81 (2−1 , 2−7 )
Vowel 10 98.95 (23 , 20 ) 98.48 (23 , 2−1 ) 99.36 (23 , 20 )
Vehicle 3 84.17 (28 , 24 ) 86.21 (28 , 24 ) 88.18 (26 , 22 )
DNA – 95.45 (23 , 2−5 ) 95.78 (21 , 2−6 ) 95.86 (22 , 2−7 )
The criterion used to estimate the generalization accuracy is a ten-fold cross-validation
on the whole training data, except for the DNA dataset. This procedure is repeated
between 3 and 30 times, according to the size of the dataset, in order to ensure
good statistical behaviour. The optimization algorithm used is the exact quadratic
program solver provided by Matlab, except for the Vowel and DNA datasets, for which an
iterative solver has been employed [8]. The best cross-validation mean rate among
the several pairs (C, γ) is reported in Table 2.
It can be observed that similar performance results are obtained by all three
approaches, although slight differences can be appreciated.
7. Conclusions and Future Work
In this paper, a new kernel machine has been designed to solve multi-
classification problems. Initially, it has been proved that, by means of a tri-class
scheme, the machine allows the incorporation of all the information contained in
the training patterns when a multi-class problem is considered. Information from
‘remaining patterns’ is captured in a δ-tube, where δ is an optimal parameter which
can be automatically obtained by maximizing the margin between classes.
The new formulation with automatic tuning of the parameter δ is very time-consuming,
since comparisons between patterns of different classes must be made, similar to an ordinal regres-
sion procedure. An algorithm in [20] avoids making comparisons between classes
when a preference learning task is performed, which speeds up the computation
time considerably. However, all the hyperplanes considered must be parallel, hence
the explanation power of the machine is reduced, and the use of the machine is
restricted to ordinal regression. Our approach is an improvement on the machine
in [20], since the hyperplanes need not be parallel, which improves their
explanation power, and our approach can therefore be used for multi-classification
tasks.
By observing the constraints in the optimization problem, a more direct exten-
sion to ordinal regression problems is under investigation. A first natural choice
would be to use a 1-v-1 tri-class SVM to solve preference learning problems, in
the same way as the K-SVCR machine was adapted for this purpose in [4],
in accordance with the approach presented in [13]. However, it is still necessary to
use constraints on the differences between patterns of different classes.
When hyperplanes are merged to obtain the final multi-class solution, only
signed outputs are considered in the voting scheme, so ties between classes are con-
sidered as errors. A research line already initiated is the probabilistic interpretation of the
outputs in accordance with their value [11].
Acknowledgements
This study was partially supported by Junta de Andalucía grant ACPAI-2003/014,
and Spanish MCyT grant TIC2002-04371-C02-01.
References
1. Anguita, D., Ridella, S. and Sterpi, D.: A New Method for Multiclass Support Vector
Machines. In: Proceedings of the IEEE IJCNN2004. Budapest (Hungary), 2004.
2. Angulo, C.: Learning with Kernel Machines into a Multi-Class Environment. Doctoral
thesis, Technical University of Catalonia. In Spanish, 2001.
3. Angulo, C. and Català, A.: A Multi-class Support Vector Machine. Lecture Notes in
Computer Science, 1810 (2000), 55–64.
4. Angulo, C. and Català, A.: Ordinal regression with K-SVCR machines. In: J. Mira and
A. Prieto (eds.), Proceedings of IWANN 2001, Part I, Vol. 2084 of Lecture Notes in
Computer Science. pp. 661–668, 2001.
5. Angulo, C. and González, L.: 1-v-1 tri-class SV machine. In: Proceedings of the 11th
European Symposium on Artificial Neural Networks. Bruges (Belgium), pp. 355–360, 2003.
6. Angulo, C., Parra, X. and Català, A.: K-SVCR. A support vector machine for multi-
class classification. Neurocomputing, 55(1–2), (2003) 57–77.
7. Blake, C. and Merz, C.: UCI Repository of Machine Learning Databases, 1998.
8. Canu, S., Grandvalet, Y. and Rakotomamonjy, A.: SVM and Kernel Methods Matlab
Toolbox. Perception Systèmes et Information. INSA de Rouen, Rouen, France, 2003.
9. Cristianini, N. and Shawe-Taylor, J.: An Introduction to Support Vector Machines and
other Kernel-based Learning Methods. Cambridge University press, 2000.
10. González, L.: Discriminative analysis using kernel vector machines support. The similar-
ity kernel function. Doctoral thesis, University of Seville. In Spanish, 2002.
11. González, L., Angulo, C., Velasco, F. and Vílchez, M.: Máquina K-SVCR con salidas
probabilísticas (K-SVCR machine with probabilistic outputs). Inteligencia Artificial. Revi-
sta Iberoamericana de IA, (17) (2002), 72–82. In Spanish.
12. Herbrich, R.: Learning Kernel Classifiers. Theory and Algorithms. The MIT Press, 2002.
13. Herbrich, R., Graepel, T. and Obermayer, K.: Advances in Large Margin Classifiers,
Chapt. Large Margin Rank Boundaries for Ordinal Regression, pp. 115–132. Cambridge,
MA: MIT Press, 2000.
14. Hsu, C.-W., Chang, C.-C. and Lin, C.-J.: A practical guide to support vector classifica-
tion. Technical report, Department of Computer Science and Information Engineering,
National Taiwan University, 2003.
15. Hsu, C.-W. and Lin, C.-J.: A Comparison of methods for multiclass support vector
machine. IEEE Transactions on Neural Networks, 13(2), (2002) 415–425.
16. Kressel, U.: Pairwise classification and support vector machine. In B. Schölkopf, C.
Burgues and A. Smola (eds.) Advances in Kernel Methods: Support Vector Learning,
pp. 255–268, Cambridge, MA: MIT Press, 1999.
17. Mayoraz, E. and Alpaydin, E.: Support vector machines for multi-class classification. In:
J. Mira and J. V. Sánchez-Andrés (eds.), Proceedings of IWANN 1999, Part II, Vol. 1607
of Lecture Notes in Computer Science, 1999.
18. Platt, J., Cristianini, N. and Shawe-Taylor, J.: Large margin DAGs for multiclass classifi-
cation. Neural Information Processing Systems, 12 (2000).
19. Rifkin, R. and Klautau, A.: In defense of one-vs-all classification. Journal of Machine
Learning Research, 5 (2004), 101–141.
20. Shashua, A. and Levin, A.: Taxonomy of large margin principle algorithms for ordinal
regression problems. Neural Information Processing Systems, 16 (2002).
21. Vapnik, V.: Statistical Learning Theory. John Wiley & Sons, Inc., 1998.
22. Vert, J.-P., Tsuda, K. and Schölkopf, B.: Kernel Methods in Computational Biology,
Chapt. A Primer on Kernel Methods, pp. 35–70. The MIT Press, 2004.