SEBASTIAN RASCHKA
Introduction to
Artificial Neural Networks
and Deep Learning
with Applications in Python
DRAFT
Last updated: May 25, 2018
This book will be available at http://leanpub.com/ann-and-deeplearning.
Please visit https://github.com/rasbt/deep-learning-book for more
information, supporting material, and code examples.
© 2016-2018 Sebastian Raschka
Contents
A Mathematical Notation Reference
A.1 Sets and Intervals
A.2 Sequences
A.3 Functions
A.4 Linear Algebra
A.5 Calculus
A.6 Probability and Statistics
A.7 Numbers
A.8 Approximation
A.9 Logic
Website
Please visit the GitHub repository to download the code examples
accompanying this book and other supplementary material.
If you like the content, please consider supporting the work by buying
a copy of the book on Leanpub. Also, I would appreciate hearing your
opinion and feedback about the book, and if you have any questions about
the contents, please don’t hesitate to get in touch with me via
mail@sebastianraschka.com. Happy learning!
Sebastian Raschka
About the Author
Sebastian Raschka received his doctorate from Michigan State University,
where he developed novel computational methods in the field of
computational biology. In summer 2018, he joined the University of
Wisconsin–Madison as Assistant Professor of Statistics. Among other
things, his research activities include the development of new deep
learning architectures to solve problems in the field of biometrics.
Among his other works is his book "Python Machine Learning," a
bestselling title at Packt and on Amazon.com, which received the ACM Best
of Computing award in 2016 and was translated into many different
languages, including German, Korean, Italian, traditional Chinese,
simplified Chinese, Russian, Polish, and Japanese.
Sebastian is also an avid open-source contributor and likes to contribute
to the scientific Python ecosystem in his free time. If you would like to
find out more about what Sebastian is currently up to or would like to get
in touch, you can find his personal website at https://sebastianraschka.com.
Acknowledgements
I would like to give my special thanks to the readers who provided
feedback, caught various typos and errors, and offered suggestions for
clarifying my writing.
• Appendix A: Artem Sobolev, Ryan Sun
• Appendix B: Brett Miller, Ryan Sun
• Appendix D: Marcel Blattner, Ignacio Campabadal, Ryan Sun, Denis
Parra Santander
• Appendix F: Guillermo Monecchi, Ged Ridgway, Ryan Sun, Patric
Hindenberger
• Appendix H: Brett Miller, Ryan Sun, Nicolas Palopoli, Kevin Zakka
Appendix A
Mathematical Notation
Reference
This appendix provides a brief overview of the mathematical notation used
throughout this book. The following appendices describe most of the
corresponding concepts in more detail, and additional information is
provided in the context of the applications in the main chapters.
A.1 Sets and Intervals
ℤ set of integers, {..., −2, −1, 0, 1, 2, ...}
ℕ set of natural numbers, {0, 1, 2, 3, ...}
ℕ⁺ set of natural numbers excluding zero, {1, 2, 3, ...}
ℝ set of real numbers
∈ element of symbol; for example, x ∈ A translates to "x is an element of set A"
∉ not an element of symbol
∅ null set, empty set
A∪B union of two sets, A and B
A∩B intersection of two sets, A and B
A⊆B A is a subset of B or included in B
A∆B symmetric difference between two sets A and B
|A| cardinality of a set A (number of elements in a set A)
(a, b) open interval from a to b, excluding a and b
[a, b] closed interval from a to b, including a and b
[a, b) half-open interval from a to b, including a but not b
(a, b] half-open interval from a to b, including b but not a
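As a rough illustration, most of the set notation above maps onto Python's built-in `set` type. The sets `A` and `B` and the helper `in_half_open` below are made up for this sketch and are not taken from the book's code.

```python
# Illustrative sets; the names A and B mirror the notation above.
A = {1, 2, 3, 4}
B = {3, 4, 5}

union = A | B                # A ∪ B
intersection = A & B         # A ∩ B
sym_diff = A ^ B             # A ∆ B: elements in exactly one of the two sets
cardinality = len(A)         # |A|
is_member = 2 in A           # 2 ∈ A
is_not_member = 5 not in A   # 5 ∉ A
is_subset = {1, 2} <= A      # {1, 2} ⊆ A


def in_half_open(x, a, b):
    """Return True if x lies in the half-open interval [a, b)."""
    return a <= x < b


print(union, intersection, sym_diff, cardinality)
```

Intervals have no built-in counterpart in Python, so a chained comparison such as `a <= x < b` is one simple way to test membership.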
A.2 Sequences
$\sum_{i=1}^{n} x_i$ summation of an indexed variable $x_i$, defined as
    $\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$
$\prod_{i=1}^{n} x_i$ product over an indexed variable $x_i$, defined as
    $\prod_{i=1}^{n} x_i = x_1 \cdot x_2 \cdot \ldots \cdot x_n$
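A minimal sketch of the two definitions above, using only the Python standard library. The list `x` holds made-up values, and `math.prod` assumes Python 3.8 or newer.

```python
import math

x = [2, 3, 4]            # x_1, x_2, x_3 (illustrative values)

total = sum(x)           # x_1 + x_2 + x_3
product = math.prod(x)   # x_1 * x_2 * x_3

print(total, product)
```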
A.3 Functions
f :A→B function f with domain A and codomain B
(g ◦ f )(x) composition of two functions g and f; alternative form: g[f (x)]
$f^{-1}(x)$ inverse of a function f , such that f (y) = x if $f^{-1}(x) = y$
|x| absolute value of x; for example, | − 2| = 2
$\log_b$ base-b logarithm
log natural logarithm (base-e logarithm)
n! n-factorial, where 0! = 1 and n! = n(n − 1)(n − 2) · · · 2 · 1 for n > 0
$\binom{n}{k}$ binomial coefficient ("n choose k"); $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ for $0 \le k \le n$
arg max f (x) the x value that makes f (x) as large as possible
arg min f (x) the x value that makes f (x) as small as possible
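The function notation above can be tried out with a few standard-library calls. This is only a sketch: the functions `f`, `g`, `h`, the helper `compose`, and the candidate values are hypothetical, and arg max / arg min are taken over a small finite list rather than a continuous domain.

```python
import math


def compose(g, f):
    """Return the composition g ∘ f, i.e. the function x -> g(f(x))."""
    return lambda x: g(f(x))


def f(x):
    return x + 1


def g(x):
    return 2 * x


g_of_f = compose(g, f)            # (g ∘ f)(x) = 2 * (x + 1)

n_factorial = math.factorial(5)   # 5! = 5 * 4 * 3 * 2 * 1
n_choose_k = math.comb(5, 2)      # "5 choose 2" (Python 3.8+)
natural_log = math.log(math.e)    # math.log is the natural logarithm
log_base_2 = math.log(8, 2)       # base-b logarithm with b = 2


def h(x):
    return -(x - 2) ** 2          # largest at x = 2


candidates = [0, 1, 2, 3]
arg_max = max(candidates, key=h)  # the x in candidates that maximizes h(x)
arg_min = min(candidates, key=h)  # the x in candidates that minimizes h(x)
```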
A.4 Linear Algebra
$x$ scalar (lower-case italics notation)
$\mathbf{x}$ column vector (lower-case bold notation) or n × 1-matrix
$\mathbf{a} \cdot \mathbf{b}$ dot product of two vectors, a and b;
    if a and b are n × 1-matrices, also written as $\mathbf{a}^T \mathbf{b}$;
    $\mathbf{a} \cdot \mathbf{b} = \mathbf{a}^T \mathbf{b} = \sum_i a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$
$\mathbf{X}$ m × n-matrix (upper-case bold notation)
$X$ 3D-tensor (upper-case italics notation)
$\mathbb{R}^n$ real coordinate space, written as a column vector with length n
    $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$
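$\mathbf{x}^T$ transpose of an n × 1-matrix
    $\mathbf{x}^T = \begin{bmatrix} x_1 & x_2 & \ldots & x_n \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}^T$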
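$\|\mathbf{x}\|_p$ $L^p$ norm, vector p-norm,
    $\|\mathbf{x}\|_p = \left( |x_1|^p + |x_2|^p + \cdots + |x_n|^p \right)^{1/p}$
$\|\mathbf{x}\|_\infty$ $L^\infty$ norm, max norm; largest absolute value among the elements of a vector
    $\|\mathbf{x}\|_\infty = \max_i |x_i|$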
$\|\mathbf{x}\|$ vector norm, $L^2$-norm, $\|\mathbf{x}\| = \|\mathbf{x}\|_2$
    $\|\mathbf{x}\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$
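$\mathbf{A}_{i,:}$ ith row of matrix A
$\mathbf{A}_{:,j}$ jth column of matrix A
$\mathbf{A}^T$ transpose of a matrix, matrix element $A_{i,j}$ becomes $A^T_{j,i}$
    for example, $\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}^T = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}$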
$\mathbf{I}_n$ n × n identity matrix
    $\mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$
$\mathbf{A}^{-1}$ inverse of a matrix A, such that $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$
tr A trace of a matrix A (sum of the diagonal elements)
    $\mathrm{tr}\,\mathbf{A} = \sum_{i=1}^{n} A_{i,i}$
det A determinant of a matrix A
$\mathrm{diag}(a_1, a_2, ..., a_n)$ diagonal matrix, matrix whose diagonal elements
    have the values $a_1, a_2, ..., a_n$ and whose other elements are all zero
$\mathbf{A} \odot \mathbf{B}$ Hadamard product, element-wise matrix multiplication
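Several of the linear algebra definitions above can be written out directly from their formulas. The sketch below uses plain Python lists so that it stays self-contained; in practice, an array library would usually be the more convenient choice. All function names here are hypothetical helpers introduced for illustration.

```python
def dot(a, b):
    """Dot product a · b = a_1*b_1 + ... + a_n*b_n."""
    return sum(ai * bi for ai, bi in zip(a, b))


def p_norm(x, p):
    """Vector p-norm: (|x_1|^p + ... + |x_n|^p)^(1/p)."""
    return sum(abs(xi) ** p for xi in x) ** (1 / p)


def max_norm(x):
    """Max norm: largest absolute value among the elements of x."""
    return max(abs(xi) for xi in x)


def transpose(A):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def trace(A):
    """Sum of the diagonal elements of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))


def hadamard(A, B):
    """Element-wise product of two matrices of equal shape."""
    return [[a * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(A, B)]


print(dot([1, 2, 3], [4, 5, 6]))
print(transpose([[1, 2], [3, 4], [5, 6]]))
```

The `transpose` call reproduces the 3 × 2 example given in the table above.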
A.5 Calculus
$\lim_{x \to a} f(x)$ limit of f (x) as x approaches a
$\lim_{x \to a^-} f(x)$ limit of f (x) as x approaches a from the left
$\lim_{x \to a^+} f(x)$ limit of f (x) as x approaches a from the right
$\frac{df}{dx}$ derivative of f
$\frac{d^n f}{dx^n}$ n-th derivative of f
$\frac{\partial f}{\partial x}$ partial derivative of f (x, y, ...)
    with respect to variable x, where x is a scalar
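$\nabla f$ gradient of a function $f : \mathbb{R}^n \to \mathbb{R}$
    $\nabla f(x_1, x_2, ..., x_n) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$
$\Delta f$ Laplacian of a function $f : \mathbb{R}^n \to \mathbb{R}$
    $\Delta f = \sum_{i=1}^{n} \frac{\partial^2 f}{\partial x_i^2}$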
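$Hf$ Hessian of a function $f : \mathbb{R}^n \to \mathbb{R}$
    $Hf = \begin{bmatrix}
    \frac{\partial^2 f}{\partial x_1 \partial x_1} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
    \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n \partial x_n}
    \end{bmatrix}$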
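$\frac{\partial f_j}{\partial x_i}$ partial derivative of component function $f_j$ with respect to the
    variable $x_i$, where $f : \mathbb{R}^n \to \mathbb{R}^m$, such that
    $f(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) \\ f_2(\mathbf{x}) \\ \vdots \\ f_m(\mathbf{x}) \end{bmatrix}, \quad
    \frac{\partial f}{\partial x_i} = \begin{bmatrix} \frac{\partial f_1}{\partial x_i} \\ \frac{\partial f_2}{\partial x_i} \\ \vdots \\ \frac{\partial f_m}{\partial x_i} \end{bmatrix}$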
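$Df$ Jacobian matrix of f.
    $Df = \begin{bmatrix}
    \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\
    \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n}
    \end{bmatrix}$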
$\int f(x)\,dx$ indefinite integral of f (derivative of F ) with $f : \mathbb{R} \to \mathbb{R}$
$\int_a^b f(x)\,dx$ definite integral of f (derivative of F ) with $f : \mathbb{R} \to \mathbb{R}$
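To make the gradient notation above more concrete, the following sketch approximates ∇f numerically with central finite differences and compares it against a gradient worked out by hand. The function `f`, the helper `numerical_gradient`, and the step size `h` are assumptions chosen for illustration; this is a way of checking a derivative, not the way gradients are typically computed in deep learning libraries.

```python
def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of f: R^n -> R at the point x.

    Each partial derivative is estimated with a central difference:
    (f(x + h*e_i) - f(x - h*e_i)) / (2*h), where e_i is the ith unit vector.
    """
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def f(x):
    # f(x1, x2) = x1^2 + 3*x1*x2
    # analytic gradient: [2*x1 + 3*x2, 3*x1]
    return x[0] ** 2 + 3 * x[0] * x[1]


grad = numerical_gradient(f, [1.0, 2.0])
print(grad)  # close to the analytic gradient [8.0, 3.0]
```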
A.6 Probability and Statistics
P (A ∩ B) probability that both events A and B occur
P (A ∪ B) probability that event A or event B (or both) occurs
P (A | B) conditional probability of A given B
E(X), $\mu_X$ expected value (mean) of a random variable X
    $E(X) = \sum_{i=1}^{\infty} p_i x_i$ for a discrete random variable X
    with values $x_1, x_2, \ldots$ and probabilities $p_1, p_2, \ldots$.
    $E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$ for a continuous random variable and
    probability density function f (x).
$\bar{X}$ sample average of numerical data $X_1, ..., X_n$
    $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
var(X), $\sigma_X^2$ variance of a random variable X
    $\mathrm{var}(X) = E\left[(X - \mu_X)^2\right] = E(X^2) - E(X)^2$
$s_X^2$ sample variance of numerical data $X_1, ..., X_n$
    $s_X^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$
std(X), $\sigma_X$ standard deviation of a random variable, square root of the variance
$s_X$ sample standard deviation, the square root of the sample variance $s_X^2$
cov(X, Y ) covariance of two random variables X and Y
    $\mathrm{cov}(X, Y) = E[(X - E(X))(Y - E(Y))] = E(XY) - E(X)E(Y)$
$s_{XY}$ sample covariance of numerical data $X_1, ..., X_n$ and $Y_1, ..., Y_n$
    $s_{XY} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$
corr(X, Y ) correlation coefficient of two random variables X and Y ,
    $\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$
H(X) entropy of a random variable X
    discrete: $H(X) = -\sum_{x} P(X = x) \log_b P(X = x)$
    continuous: $H(X) = -\int_{-\infty}^{\infty} f(x) \log_b f(x)\,dx$
PMF probability mass function of a discrete random variable, f (x) = P (X = x)
CDF cumulative distribution function of a continuous random variable,
F (x) = P (X ≤ x)
PDF probability density function of a continuous random variable,
    $P(X \in [a, b]) = \int_a^b f(x)\,dx$
X∼D random variable X has a distribution D
θ̂ estimator of a parameter θ
$\mathcal{N}(x, \mu, \sigma^2)$ normal (Gaussian) distribution over x with mean µ and variance $\sigma^2$
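The sample statistics and the discrete entropy defined above translate almost line by line into code. The sketch below follows the 1/n normalization used in this appendix (the same convention as `statistics.pvariance` in the standard library, as opposed to the n − 1 version in `statistics.variance`). The data in `X` and `Y` are made up, and all helper names are hypothetical.

```python
import math


def mean(data):
    """Sample average: (1/n) * sum of X_i."""
    return sum(data) / len(data)


def variance(data):
    """Sample variance with 1/n normalization, as in the table above."""
    m = mean(data)
    return sum((x - m) ** 2 for x in data) / len(data)


def covariance(xs, ys):
    """Sample covariance with 1/n normalization."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))


def entropy(probs, base=2):
    """Discrete entropy: -sum of p * log_b(p), skipping zero probabilities."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


X = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative data
Y = [2 * x + 1 for x in X]                      # perfectly linear in X

print(mean(X), variance(X), math.sqrt(variance(X)))
print(covariance(X, Y), correlation(X, Y))
print(entropy([0.5, 0.5]))  # entropy of a fair coin flip, in bits
```

Because `Y` is an exact linear function of `X` with positive slope, the correlation coefficient should come out as 1 up to floating-point rounding.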
A.7 Numbers
e Euler’s number, mathematical constant approximated by 2.71828
π "pi", mathematical constant approximated by 3.14159
∞ infinity symbol
$1.234 \times 10^5$ or 1.234E05 scientific notation for 123,400
< less than sign, for example, x < 10 means that x is smaller than 10
≪ much less than sign
> greater than sign, for example, x > 10 means that x is larger than 10
≫ much greater than sign
A.8 Approximation
≈ approximate equality, for instance, e ≈ 2.71828 is the approximation
of Euler’s number
f (x) ∼ g(x) symbol to assert that the ratio of two functions approaches 1
    $\lim_{x \to 0} \frac{f(x)}{g(x)} = 1$, if x is small
    $\lim_{x \to \infty} \frac{f(x)}{g(x)} = 1$, if x is large
f (x) ∝ g(x) the two functions f (x) and g(x) are proportional to each other
$T(n) \in O(n^2)$ big-O notation; the running time T (n) of an algorithm is asymptotically
    bounded above by $n^2$ (up to a constant factor); the algorithm has an order of $n^2$ time complexity
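The ∼ notation above can be illustrated numerically by watching a ratio approach 1. The two function pairs below, sin(x) versus x for small x and n² + n versus n² for large n, are standard textbook examples picked for this sketch and do not come from the book.

```python
import math

# sin(x) ∼ x as x -> 0: the ratio sin(x) / x approaches 1 for small x
ratios_small = [math.sin(x) / x for x in (1e-1, 1e-2, 1e-3)]

# n^2 + n ∼ n^2 as n -> infinity: the ratio approaches 1 for large n
ratios_large = [(n ** 2 + n) / n ** 2 for n in (10, 1000, 100000)]

print(ratios_small)
print(ratios_large)
```

The second pair also hints at why n² + n is in O(n²): for large n, the lower-order term n contributes a vanishing fraction of the total.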
A.9 Logic
⇒ implication operator
for example, A ⇒ B translates to "A implies B"
or "if A then B" (or "A only if B")
⇔ equivalence operator (if and only if (iff))
for example, A ⇔ B translates to "A if and only if B"
or "if A then B and if B then A"
∧ logical conjunction, and
for example, A ∧ B means "A and B"
∨ logical (inclusive) disjunction, or
for example, A ∨ B means "A or B"
¬ negation, not
for example, ¬A means "not A" or
"if A is true then ¬A is false" and vice versa
∀ universal quantifier, means for all
for example, "∀x ∈ R, x > 1"
translates to "for all real numbers x, x is greater than one"
∃ existential quantifier, means there exists
for example, "∃x ∈ A, f (x)"
translates to "there is an element in set A for which the predicate f (x) holds true"
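The logical operators above have close counterparts in Python's boolean operators and the built-ins `all` and `any`. In this sketch, `implies` and `iff` are hypothetical helper functions (Python has no dedicated implication operator), and the quantifiers are evaluated over a small finite list rather than over all real numbers.

```python
from itertools import product


def implies(a, b):
    """A ⇒ B is false only when A is true and B is false."""
    return (not a) or b


def iff(a, b):
    """A ⇔ B is true when A and B have the same truth value."""
    return a == b


# Truth table rows: (A, B, A ∧ B, A ∨ B, ¬A, A ⇒ B, A ⇔ B)
truth_table = [
    (a, b, a and b, a or b, not a, implies(a, b), iff(a, b))
    for a, b in product([True, False], repeat=2)
]

values = [2, 3, 5]                          # illustrative finite set
for_all = all(x > 1 for x in values)        # ∀x ∈ values, x > 1
exists = any(x % 2 == 0 for x in values)    # ∃x ∈ values, x is even

for row in truth_table:
    print(row)
```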
Abbreviations and Terms
CNN [Convolutional Neural Network]