Example Notation for Deep Learning
Ian Goodfellow
Yoshua Bengio
Aaron Courville
Contents
Notation
1 Commentary
1.1 Examples
Bibliography
Index
Notation
This section provides a concise reference describing notation used throughout this
document. If you are unfamiliar with any of the corresponding mathematical
concepts, Goodfellow et al. (2016) describe most of these ideas in chapters 2–4.
Numbers and Arrays
a A scalar (integer or real)
a A vector
A A matrix
A A tensor
I_n Identity matrix with n rows and n columns
I Identity matrix with dimensionality implied by context
e^{(i)} Standard basis vector [0, . . . , 0, 1, 0, . . . , 0] with a 1 at position i
diag(a) A square, diagonal matrix with diagonal entries given by a
a A scalar random variable
a A vector-valued random variable
A A matrix-valued random variable
Sets and Graphs
A A set
R The set of real numbers
{0, 1} The set containing 0 and 1
{0, 1, . . . , n} The set of all integers between 0 and n
[a, b] The real interval including a and b
(a, b] The real interval excluding a but including b
A \ B Set subtraction, i.e., the set containing the elements of A that are not in B
G A graph
Pa_G(x_i) The parents of x_i in G
Indexing
a_i Element i of vector a, with indexing starting at 1
a_{−i} All elements of vector a except for element i
A_{i,j} Element i, j of matrix A
A_{i,:} Row i of matrix A
A_{:,i} Column i of matrix A
A_{i,j,k} Element (i, j, k) of a 3-D tensor A
A_{:,:,i} 2-D slice of a 3-D tensor
a_i Element i of the random vector a
Linear Algebra Operations
A^⊤ Transpose of matrix A
A^+ Moore-Penrose pseudoinverse of A
A ⊙ B Element-wise (Hadamard) product of A and B
det(A) Determinant of A
Calculus
dy/dx Derivative of y with respect to x
∂y/∂x Partial derivative of y with respect to x
∇_x y Gradient of y with respect to x
∇_X y Matrix derivatives of y with respect to X
∇_X y Tensor containing derivatives of y with respect to X
∂f/∂x Jacobian matrix J ∈ R^{m×n} of f : R^n → R^m
∇²_x f(x) or H(f)(x) The Hessian matrix of f at input point x
∫ f(x) dx Definite integral over the entire domain of x
∫_S f(x) dx Definite integral with respect to x over the set S
Probability and Information Theory
a⊥b The random variables a and b are independent
a⊥b | c They are conditionally independent given c
P (a) A probability distribution over a discrete variable
p(a) A probability distribution over a continuous variable, or over a variable whose type has not been specified
a∼P Random variable a has distribution P
E_{x∼P}[f(x)] or E f(x) Expectation of f(x) with respect to P(x)
Var(f (x)) Variance of f (x) under P (x)
Cov(f (x), g(x)) Covariance of f (x) and g(x) under P (x)
H(x) Shannon entropy of the random variable x
D_KL(P ‖ Q) Kullback-Leibler divergence of P and Q
N(x; µ, Σ) Gaussian distribution over x with mean µ and covariance Σ
Functions
f :A→B The function f with domain A and range B
f ◦g Composition of the functions f and g
f(x; θ) A function of x parametrized by θ. (Sometimes we write f(x) and omit the argument θ to lighten notation)
log x Natural logarithm of x
σ(x) Logistic sigmoid, 1 / (1 + exp(−x))
ζ(x) Softplus, log(1 + exp(x))
||x||_p L^p norm of x
||x|| L^2 norm of x
x+ Positive part of x, i.e., max(0, x)
1_condition is 1 if the condition is true, 0 otherwise
Sometimes we use a function f whose argument is a scalar but apply it to a
vector, matrix, or tensor: f (x), f (X), or f (X). This denotes the application of f
to the array element-wise. For example, if C = σ(X), then Ci,j,k = σ(Xi,j,k ) for all
valid values of i, j and k.
Datasets and Distributions
p_data The data generating distribution
p̂_data The empirical distribution defined by the training set
X A set of training examples
x^{(i)} The i-th example (input) from a dataset
y^{(i)} or y^{(i)} The target associated with x^{(i)} for supervised learning
X The m × n matrix with input example x^{(i)} in row X_{i,:}
Chapter 1
Commentary
This document is an example of how to use the accompanying files, together with some
commentary on them. The files are math_commands.tex and notation.tex. The
file math_commands.tex includes several useful LaTeX macros, and notation.tex
defines a notation page that could be used at the front of any publication.
We developed these files while writing Goodfellow et al. (2016). We release
these files for anyone to use freely, in order to help establish some standard notation
in the deep learning community.
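As a rough sketch of how the two files might be wired into a document (the package
list below is an assumption; check the top of math_commands.tex for what it actually
requires):

\documentclass{book}
\usepackage{amsmath,amssymb}    % assumed math packages
\usepackage{makeidx}\makeindex  % only if you want an index
\input{math_commands.tex}       % macro definitions used throughout

\begin{document}
\input{notation.tex}            % typesets the notation pages up front
% ... chapters go here ...
\printindex                     % only if you used \index entries
\end{document}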
1.1 Examples
We include this section as an example of some LaTeX commands and the macros
we created for the book.
Citations that support a sentence without actually being used in the sentence
should appear at the end of the sentence using citep:
Inventors have long dreamed of creating machines that think. This
desire dates back to at least the time of ancient Greece. The mythical
figures Pygmalion, Daedalus, and Hephaestus may all be interpreted
as legendary inventors, and Galatea, Talos, and Pandora may all be
regarded as artificial life (Ovid and Martin, 2004; Sparkes, 1996; Tandy,
1997).
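In LaTeX source, the final sentence of that passage might look roughly like this
(the bibliography keys here are hypothetical):

... and Galatea, Talos, and Pandora may all be regarded as artificial
life \citep{ovid2004, sparkes1996, tandy1997}.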
When the authors of a document, or the document itself, serve as a noun in the
sentence, use the citet command:
Mitchell (1997) provides a succinct definition of machine learning: “A
computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P , if its performance
at tasks in T , as measured by P , improves with experience E.”
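A sketch of the corresponding source (again with a hypothetical bibliography key):

\citet{mitchell1997} provides a succinct definition of machine learning:
``A computer program is said to learn from experience $E$ with respect to
some class of tasks $T$ and performance measure $P$, if its performance at
tasks in $T$, as measured by $P$, improves with experience $E$.''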
When introducing a new term, use the newterm macro to highlight it. If
there is a corresponding acronym, put the acronym in parentheses afterward. If
your document includes an index, also use the index command.
Today, artificial intelligence (AI) is a thriving field with many prac-
tical applications and active research topics.
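One plausible way to write that sentence, assuming newterm takes the term as its
only argument and pairing it with the standard \index command:

Today, \newterm{artificial intelligence}\index{Artificial intelligence} (AI)
is a thriving field with many practical applications and active research
topics.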
Sometimes you may want to make many entries in the index that all point to a
canonical index entry:
One of the simplest and most common kinds of parameter norm penalty
is the squared L2 parameter norm penalty commonly known as weight
decay. In other academic communities, L2 regularization is also known
as ridge regression or Tikhonov regularization.
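With the standard \index command, those cross-references could be written along
these lines (the |see{...} syntax is ordinary makeidx usage, not something defined
by these files):

... is the squared $L^2$ parameter norm penalty commonly known as
\newterm{weight decay}\index{Weight decay}. In other academic communities,
$L^2$ regularization is also known as \newterm{ridge regression}%
\index{Ridge regression|see{weight decay}} or \newterm{Tikhonov
regularization}\index{Tikhonov regularization|see{weight decay}}.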
To refer to a figure, use either figref or Figref depending on whether you
want to capitalize the resulting word in the sentence.
See figure 1.1 for an example of how to include graphics in your
document. Figure 1.1 shows how to include graphics in your document.
Similarly, you can refer to different sections of the book using partref, Partref,
secref, Secref, etc.
You are currently reading section 1.1.
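The sentences above might be produced with something like the following, assuming
each macro takes a label as its argument (the label names here are hypothetical):

See \figref{dl_venn} for an example of how to include graphics in your
document. \Figref{dl_venn} shows how to include graphics in your document.

You are currently reading \secref{examples}.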
Acknowledgments
We thank Catherine Olsson and Úlfar Erlingsson for proofreading and review of
this manuscript.
[Figure 1.1: a diagram of the categories AI, Machine learning, Representation learning, and Deep learning, annotated with examples: Knowledge bases, Logistic regression, Shallow autoencoders, and MLPs.]
Figure 1.1: An example of a figure. The figure is a PDF displayed without being rescaled
within LaTeX. The PDF was created at the right size to fit on the page, with the fonts at
the size they should be displayed. The fonts in the figure are from the Computer Modern
family so they match the fonts used by LaTeX.
Bibliography
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York.
Ovid and Martin, C. (2004). Metamorphoses. W.W. Norton.
Sparkes, B. (1996). The Red and the Black: Studies in Greek Pottery. Routledge.
Tandy, D. W. (1997). Works and Days: A Translation and Commentary for the Social
Sciences. University of California Press.
Index
Artificial intelligence, 2
Conditional independence, iv
Covariance, iv
Derivative, iv
Determinant, iii
Element-wise product, see Hadamard product
Graph, iii
Hadamard product, iii
Hessian matrix, iv
Independence, iv
Integral, iv
Jacobian matrix, iv
Kullback-Leibler divergence, iv
Matrix, ii, iii
Norm, v
Ridge regression, see weight decay
Scalar, ii, iii
Set, iii
Shannon entropy, iv
Sigmoid, v
Softplus, v
Tensor, ii, iii
Tikhonov regularization, see weight decay
Transpose, iii
Variance, iv
Vector, ii, iii
Weight decay, 2