Logistic classification with cross-entropy loss
Julián D. Arias Londoño
August 3, 2020
1 Definition
Logistic regression (LG) is one of the basic models studied in Statistics and Machine Learning to solve two-class classification problems. The intuition behind this model is to find a polynomial function capable of splitting the feature space into two parts, i.e. the polynomial function plays the role of the decision boundary between the two classes. The aim of the training algorithm is to find the model's parameters (the polynomial's weights) such that, as far as possible, each part of the space contains samples from only one of the classes. Figure 1 shows a scatter plot of a two-class toy problem and the boundary function.
Figure 1: Scatter plot for a two-class problem and a linear decision function
Formally, given a dataset $D = \{(x_j, y_j)\}_{j=1}^{N}$, where $x_j$ is a feature vector representing a sample $j$, and $y_j$ is its corresponding target output, which can take one of two possible values $\{0, 1\}$, the aim is to build a function able to predict whether a new sample belongs to class 0 or 1. The LG model is composed of a polynomial function wrapped by a logistic function; it can be expressed as:

$$g(w^T x) = \frac{1}{1 + \exp(-w^T x)} \qquad (1)$$
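As a concrete illustration, Eq. (1) can be evaluated for a whole dataset at once. The following is a minimal sketch in Python/NumPy; the names logistic and predict_proba are chosen here for illustration, and a bias column of ones is assumed to be part of the feature matrix:

import numpy as np

def logistic(z):
    # Logistic (sigmoid) function of Eq. (1): 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    # Probability of class 1 for each row of X, i.e. g(w^T x_j).
    # X is an (N, d) matrix whose rows are the feature vectors x_j;
    # a bias column of ones is assumed to have been appended to X.
    return logistic(X @ w)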
The logistic function is chosen because it is a differentiable approximation of the sign function, and thus it can be used with gradient-based optimization methods.
Figure 2 shows a graphic representation of the logistic function.
Figure 2: Logistic function
In order to train the model, we need to define a loss function that can be used for optimization purposes. Typically, the LG model uses the well-known cross-entropy function as its cost function. Taking into account that the logistic function provides a value in the interval [0, 1], it can be interpreted as the probability of belonging to class 1. Therefore, we can use the Maximum Likelihood criterion applied to a Bernoulli distribution to derive the function we want to optimize. Assuming the samples are i.i.d., the log-likelihood function can be estimated as:
$$\begin{aligned}
\arg\max_{w} L &= \log \prod_{j=1}^{N} p_j^{y_j} (1 - p_j)^{(1 - y_j)} \\
&= \log \prod_{j=1}^{N} \left(g(w^T x_j)\right)^{y_j} \left(1 - g(w^T x_j)\right)^{(1 - y_j)} \\
&= \sum_{j=1}^{N} \log \left(g(w^T x_j)\right)^{y_j} + \log \left(1 - g(w^T x_j)\right)^{(1 - y_j)} \\
&= \sum_{j=1}^{N} y_j \log g(w^T x_j) + (1 - y_j) \log \left(1 - g(w^T x_j)\right) \qquad (2)
\end{aligned}$$
For the sake of numerical stability, and in order to use a minimization algorithm instead of a maximization one, the final cross-entropy loss function is given by:
$$J(w) = -\frac{1}{N} \sum_{j=1}^{N} y_j \log g(w^T x_j) + (1 - y_j) \log \left(1 - g(w^T x_j)\right) \qquad (3)$$
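As a quick illustration, Eq. (3) can be computed with a short NumPy sketch, reusing predict_proba from the sketch above; the clipping constant 1e-12 is an assumption added here purely to avoid log(0):

def cross_entropy_loss(w, X, y):
    # Cross-entropy loss J(w) of Eq. (3), averaged over the N samples.
    p = predict_proba(w, X)                # g(w^T x_j) for every sample
    p = np.clip(p, 1e-12, 1.0 - 1e-12)     # avoid log(0) for numerical stability
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))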
One of the most common optimization algorithms used for LG is Gradient Descent, which is based on iteratively applying the following rule:

$$w(\tau) = w(\tau - 1) - \eta \nabla J(w) \qquad (4)$$
In order to apply the former rule, the gradient of J(w) must be estimated. The first step to get the gradient is to estimate the derivative of the logistic function:
$$\begin{aligned}
\nabla_w g(w^T x) &= \nabla_w \frac{1}{1 + \exp(-w^T x)} \\
&= \frac{\exp(-w^T x)\, x}{\left(1 + \exp(-w^T x)\right)^2} \\
&= \frac{\exp(-w^T x)}{1 + \exp(-w^T x)} \cdot \frac{x}{1 + \exp(-w^T x)} \\
&= g(w^T x)\left(1 - g(w^T x)\right) x \qquad (5)
\end{aligned}$$
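The scalar form of the identity in Eq. (5), g'(z) = g(z)(1 - g(z)), can be sanity-checked against a finite-difference approximation. This is only a small sketch; the step size 1e-6 and the evaluation points are arbitrary choices:

# Finite-difference check of g'(z) = g(z) * (1 - g(z)) at a few points.
eps = 1e-6
for z in (-2.0, 0.0, 3.0):
    numeric = (logistic(z + eps) - logistic(z - eps)) / (2.0 * eps)
    analytic = logistic(z) * (1.0 - logistic(z))
    print(z, numeric, analytic)  # both values should agree to ~6 decimals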
Based on the former result, it is quite easy to estimate the gradient of the cross-entropy function as:

$$\begin{aligned}
\nabla_w J(w) &= \nabla_w \left( -\frac{1}{N} \sum_{j=1}^{N} y_j \log g(w^T x_j) + (1 - y_j) \log \left(1 - g(w^T x_j)\right) \right) \\
&= -\frac{1}{N} \sum_{j=1}^{N} y_j \frac{\nabla_w g(w^T x_j)}{g(w^T x_j)} + (1 - y_j) \frac{-\nabla_w g(w^T x_j)}{1 - g(w^T x_j)} \qquad (6)
\end{aligned}$$
By replacing Eq. (5) into Eq. (6) we obtain:

$$\begin{aligned}
\nabla_w J(w) &= -\frac{1}{N} \sum_{j=1}^{N} y_j \left(1 - g(w^T x_j)\right) x_j - (1 - y_j)\, g(w^T x_j)\, x_j \\
&= \frac{1}{N} \sum_{j=1}^{N} \left( g(w^T x_j) - y_j \right) x_j \qquad (7)
\end{aligned}$$
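Putting Eqs. (4) and (7) together, a basic batch gradient-descent training loop could look as follows. This is a sketch only, reusing the helpers defined above; the learning rate, number of iterations, and the synthetic data in the usage example are arbitrary assumptions:

def gradient(w, X, y):
    # Gradient of J(w), Eq. (7): (1/N) * sum_j (g(w^T x_j) - y_j) * x_j.
    return X.T @ (predict_proba(w, X) - y) / X.shape[0]

def fit_logistic(X, y, eta=0.1, n_iters=1000):
    # Batch gradient descent, Eq. (4): w <- w - eta * grad J(w).
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w = w - eta * gradient(w, X, y)
    return w

# Toy usage on synthetic data (for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
X = np.hstack([np.ones((200, 1)), X])   # prepend the bias column of ones
w = fit_logistic(X, y)
print("training accuracy:", np.mean((predict_proba(w, X) >= 0.5) == y))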
The expression in Eq. (7) is similar to the one obtained for multiple linear regression using the least-squares error cost function, which in turn can be derived from the maximum likelihood criterion applied to a normal distribution instead [1].
References
[1] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.