Lecture2 2015


STA 414/2104: Machine Learning

Russ Salakhutdinov
Department of Computer Science
Department of Statistics
rsalakhu@cs.toronto.edu
http://www.cs.toronto.edu/~rsalakhu/

Lecture 2
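Linear Least Squares
From last class: Minimize the sum of the squares of the errors between
the predictions $y(\mathbf{x}_n, \mathbf{w})$ for each data point $\mathbf{x}_n$ and the corresponding
real-valued targets $t_n$.

Loss function: sum-of-squared error function:

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n^T \mathbf{w} - t_n)^2 = \frac{1}{2} (X\mathbf{w} - \mathbf{t})^T (X\mathbf{w} - \mathbf{t})$

Source: Wikipedia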
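Linear Least Squares
If $X^T X$ is nonsingular, then the unique solution is given by:

$\mathbf{w}^* = (X^T X)^{-1} X^T \mathbf{t}$

where $\mathbf{w}^*$ is the optimal vector of weights, $\mathbf{t}$ is the vector of target values, and
the design matrix $X$ has one input vector per row.

Source: Wikipedia

• At an arbitrary input $\mathbf{x}$, the prediction is $y(\mathbf{x}, \mathbf{w}^*) = \mathbf{x}^T \mathbf{w}^*$.

• The entire model is characterized by d+1 parameters w*.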
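Example: Polynomial Curve Fitting
Consider observing a training set consisting of N 1-dimensional observations
$\mathbf{x} = (x_1, x_2, \ldots, x_N)^T$, together with corresponding real-valued targets
$\mathbf{t} = (t_1, t_2, \ldots, t_N)^T$.

Goal: Fit the data using a polynomial function of the form:

$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M = \sum_{j=0}^{M} w_j x^j$

Note: the polynomial function is a nonlinear function of x, but it is a linear
function of the coefficients w → Linear Models.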
Example: Polynomial Curve Fitting
• As for the least squares example: we can minimize the sum of the
squares of the errors between the predictions $y(x_n, \mathbf{w})$ for each data
point $x_n$ and the corresponding target values $t_n$.

Loss function: sum-of-squared error function:

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2$

• Similar to the linear least squares: minimizing the sum-of-squared error
function has a unique solution w*.
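As a rough illustration of the least-squares fit, the sketch below builds the polynomial design matrix and solves the normal equations $(X^T X)\mathbf{w} = X^T \mathbf{t}$ with plain Gaussian elimination. The function names, the toy data, and the choice of a hand-written solver (to stay within the standard library) are assumptions made for this example, not part of the lecture; it also assumes $X^T X$ is nonsingular.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_polynomial(xs, ts, degree):
    """Return the weights minimizing the sum-of-squared error."""
    # Design matrix: one row (1, x, x^2, ..., x^M) per data point.
    design = [[x ** j for j in range(degree + 1)] for x in xs]
    k = degree + 1
    # Normal equations: (X^T X) w = X^T t.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xtt = [sum(row[i] * t for row, t in zip(design, ts)) for i in range(k)]
    return solve_linear_system(xtx, xtt)


def predict(w, x):
    """Evaluate the fitted polynomial at a new input x."""
    return sum(w_j * x ** j for j, w_j in enumerate(w))


# Hypothetical noise-free data from t = 1 + 2x - 0.5x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ts = [1.0 + 2.0 * x - 0.5 * x ** 2 for x in xs]
w_star = fit_polynomial(xs, ts, degree=2)
print([round(w, 6) for w in w_star])
```

With noise-free quadratic data the recovered weights should match the generating coefficients up to rounding error. For larger problems a numerically stabler solver (for example a QR-based least-squares routine) would normally be preferred over the normal equations.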
Probabilistic Perspective
• So far we saw that polynomial curve fitting can be expressed in terms
of error minimization. We now view it from a probabilistic perspective.
• Suppose that our model arose from a statistical model:

$t = y(x, \mathbf{w}) + \epsilon$

where ε is a random error having a Gaussian distribution with zero
mean, and is independent of x.
Thus we have:

$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(x, \mathbf{w}), \beta^{-1})$

where β is a precision parameter, corresponding to the inverse variance.

I will use probability distribution and probability density interchangeably. It
should be obvious from the context.
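Maximum Likelihood
If the data are assumed to be independently and identically
distributed (i.i.d. assumption), the likelihood function takes the form:

$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, \mathbf{w}), \beta^{-1})$

It is often convenient to maximize the log of the likelihood function:

$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} (y(x_n, \mathbf{w}) - t_n)^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)$

• Maximizing the log-likelihood with respect to w (under the assumption of
Gaussian noise) is equivalent to minimizing the sum-of-squared error
function.
• Determine $\mathbf{w}_{ML}$ by maximizing the log-likelihood. Then maximizing
w.r.t. β:

$\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} (y(x_n, \mathbf{w}_{ML}) - t_n)^2$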
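Under the Gaussian noise model, the maximum likelihood precision is the reciprocal of the mean squared residual of the fitted model. The snippet below is a minimal sketch of that calculation; the function name and the toy predictions and targets are made up for illustration.

```python
def noise_precision_ml(predictions, targets):
    """Return the ML precision: the inverse of the mean squared residual."""
    n = len(targets)
    mean_sq_residual = sum(
        (y - t) ** 2 for y, t in zip(predictions, targets)
    ) / n
    return 1.0 / mean_sq_residual


# Hypothetical model outputs y(x_n, w_ML) and observed targets t_n.
predictions = [1.0, 2.0, 3.0, 4.0]
targets = [1.5, 1.5, 3.5, 3.5]
beta_ml = noise_precision_ml(predictions, targets)
print(beta_ml)  # every residual is +/-0.5, so the variance is 0.25 and beta is 4.0
```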
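Predictive Distribution
Once we have determined the parameters w and β, we can make predictions
for new values of x:

$p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}(t \mid y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1})$

Later we will consider Bayesian linear regression.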

Bernoulli Distribution
• Consider a single binary random variable $x \in \{0, 1\}$. For example, x
can describe the outcome of flipping a coin:
Coin flipping: heads = 1, tails = 0.

• The probability of x=1 will be denoted by the parameter µ, so that:

$p(x = 1 \mid \mu) = \mu, \quad 0 \le \mu \le 1$

• The probability distribution, known as the Bernoulli distribution, can be
written as:

$\mathrm{Bern}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}$
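Parameter Estimation
• Suppose we observed a dataset $\mathcal{D} = \{x_1, \ldots, x_N\}$.
• We can construct the likelihood function, which is a function of µ:

$p(\mathcal{D} \mid \mu) = \prod_{n=1}^{N} \mu^{x_n} (1 - \mu)^{1 - x_n}$

• Equivalently, we can maximize the log of the likelihood function:

$\ln p(\mathcal{D} \mid \mu) = \sum_{n=1}^{N} \{ x_n \ln \mu + (1 - x_n) \ln(1 - \mu) \}$

• Note that the likelihood function depends on the N observations $x_n$ only
through the sum $\sum_n x_n$, which is a sufficient statistic.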
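Parameter Estimation
• Suppose we observed a dataset $\mathcal{D} = \{x_1, \ldots, x_N\}$.

• Setting the derivative of the log-likelihood function w.r.t. µ to zero, we
obtain:

$\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n = \frac{m}{N}$

where m is the number of heads.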
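The estimate is simply the fraction of heads. As a small sanity check, the sketch below computes it for a made-up sequence of flips and compares the Bernoulli log-likelihood at the estimate against two other candidate values; the data and function names are illustrative assumptions.

```python
import math


def bernoulli_mle(xs):
    """ML estimate of mu: the number of ones divided by N."""
    return sum(xs) / len(xs)


def bernoulli_log_likelihood(xs, mu):
    """Sum over n of x_n ln(mu) + (1 - x_n) ln(1 - mu)."""
    return sum(x * math.log(mu) + (1 - x) * math.log(1 - mu) for x in xs)


flips = [1, 0, 0, 1, 1, 0, 1, 1]  # hypothetical data: 5 heads out of 8
mu_ml = bernoulli_mle(flips)
print(mu_ml)  # 0.625

# The log-likelihood at mu_ml should not be beaten by other candidates.
for candidate in (0.5, mu_ml, 0.7):
    print(candidate, bernoulli_log_likelihood(flips, candidate))
```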

Binomial Distribution
• We can also work out the distribution of the number m of observations
of x=1 (e.g. the number of heads).

• The probability of observing m heads given N coin flips and a
parameter µ is given by:

$\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1 - \mu)^{N - m}$

• The mean and variance can be easily derived as:

$\mathbb{E}[m] = N\mu, \quad \mathrm{var}[m] = N\mu(1 - \mu)$

Example
• Histogram plot of the Binomial distribution as a function of m for N=10
and µ = 0.25.
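In place of the histogram, the following sketch tabulates the Binomial probabilities for N=10 and µ=0.25 as a text bar chart and checks the mean and variance numerically against Nµ and Nµ(1-µ). The helper name and the bar scaling are assumptions made for this example.

```python
import math


def binomial_pmf(m, n, mu):
    """Probability of m successes in n Bernoulli(mu) trials."""
    return math.comb(n, m) * mu ** m * (1 - mu) ** (n - m)


n, mu = 10, 0.25
pmf = [binomial_pmf(m, n, mu) for m in range(n + 1)]

# Text histogram: one '#' per 1% of probability mass (rounded down).
for m, p in enumerate(pmf):
    print(f"{m:2d} {p:.4f} {'#' * int(100 * p)}")

mean = sum(m * p for m, p in enumerate(pmf))
variance = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))
print(mean, variance)  # close to N*mu = 2.5 and N*mu*(1-mu) = 1.875
```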
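Beta Distribution
• We can define a distribution over $\mu \in [0, 1]$ (e.g. it can be used as a prior
over the parameter µ of the Bernoulli distribution):

$\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \mu^{a-1} (1 - \mu)^{b-1}$

where the gamma function is defined as:

$\Gamma(x) = \int_0^\infty u^{x-1} e^{-u} \, du$

and ensures that the Beta distribution is normalized.

Beta Distribution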
Multinomial Variables
• Consider a random variable that can take on one of K possible mutually
exclusive states (e.g. the roll of a die).
• We will use the so-called 1-of-K encoding scheme.

• If a random variable can take on K=6 states, and a particular
observation of the variable corresponds to the state $x_3 = 1$, then x will be
represented as:

1-of-K coding scheme: $\mathbf{x} = (0, 0, 1, 0, 0, 0)^T$

• If we denote the probability of $x_k = 1$ by the parameter $\mu_k$, then the
distribution over x is defined as:

$p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}, \quad \mu_k \ge 0, \quad \sum_{k=1}^{K} \mu_k = 1$
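Multinomial Variables
• The Multinomial distribution can be viewed as a generalization of the Bernoulli
distribution to more than two outcomes.

• It is easy to see that the distribution is normalized:

$\sum_{\mathbf{x}} p(\mathbf{x} \mid \boldsymbol{\mu}) = \sum_{k=1}^{K} \mu_k = 1$

and

$\mathbb{E}[\mathbf{x} \mid \boldsymbol{\mu}] = \sum_{\mathbf{x}} p(\mathbf{x} \mid \boldsymbol{\mu}) \, \mathbf{x} = (\mu_1, \ldots, \mu_K)^T = \boldsymbol{\mu}$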
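Maximum Likelihood Estimation
• Suppose we observed a dataset $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$.
• We can construct the likelihood function, which is a function of µ:

$p(\mathcal{D} \mid \boldsymbol{\mu}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{\sum_n x_{nk}} = \prod_{k=1}^{K} \mu_k^{m_k}$

• Note that the likelihood function depends on the N data points only
through the following K quantities:

$m_k = \sum_{n=1}^{N} x_{nk}, \quad k = 1, \ldots, K$

which represent the number of observations of $x_k = 1$.

• These are called the sufficient statistics for this distribution.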

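Maximum Likelihood Estimation

• To find a maximum likelihood solution for µ, we need to maximize the
log-likelihood taking into account the constraint that $\sum_k \mu_k = 1$.

• Forming the Lagrangian:

$\sum_{k=1}^{K} m_k \ln \mu_k + \lambda \left( \sum_{k=1}^{K} \mu_k - 1 \right)$

Setting the derivative w.r.t. $\mu_k$ to zero gives $\mu_k = -m_k / \lambda$, and the constraint
then gives $\lambda = -N$, so that

$\mu_k^{ML} = \frac{m_k}{N}$

which is the fraction of observations for which $x_k = 1$.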
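A minimal sketch of this estimator on 1-of-K encoded data: sum the columns to get the counts for each state, then divide by N. The die rolls, the zero-based state indices, and the helper names are assumptions for illustration only.

```python
def one_hot(state, k):
    """1-of-K encoding of a state index in 0..k-1."""
    return [1 if j == state else 0 for j in range(k)]


def multinomial_mle(one_hot_rows):
    """Return the counts m_k and the ML estimates m_k / N."""
    n = len(one_hot_rows)
    k = len(one_hot_rows[0])
    counts = [sum(row[j] for row in one_hot_rows) for j in range(k)]
    return counts, [m_j / n for m_j in counts]


# Hypothetical die rolls, recorded as zero-based states 0..5.
rolls = [2, 0, 2, 5, 1, 2, 0, 3]
data = [one_hot(r, 6) for r in rolls]
counts, mu_ml = multinomial_mle(data)
print(counts)  # [2, 1, 3, 1, 0, 1]
print(mu_ml)   # [0.25, 0.125, 0.375, 0.125, 0.0, 0.125]
```

Note that a state never observed (state 4 here) gets an estimate of exactly zero, which is one motivation for placing a prior over µ.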

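Multinomial Distribution
• We can construct the joint distribution of the quantities $\{m_1, m_2, \ldots, m_K\}$
given the parameters µ and the total number N of observations:

$\mathrm{Mult}(m_1, m_2, \ldots, m_K \mid \boldsymbol{\mu}, N) = \binom{N}{m_1 \, m_2 \ldots m_K} \prod_{k=1}^{K} \mu_k^{m_k}$

• The normalization coefficient is the number of ways of partitioning N
objects into K groups of size $m_1, m_2, \ldots, m_K$:

$\binom{N}{m_1 \, m_2 \ldots m_K} = \frac{N!}{m_1! \, m_2! \cdots m_K!}$

• Note that $\sum_{k=1}^{K} m_k = N$.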
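Dirichlet Distribution
• Consider a distribution over $\mu_k$, subject to the constraints:

$0 \le \mu_k \le 1, \quad \sum_{k=1}^{K} \mu_k = 1$

• The Dirichlet distribution is defined as:

$\mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \quad \alpha_0 = \sum_{k=1}^{K} \alpha_k$

where $\alpha_1, \ldots, \alpha_K$ are the parameters of the
distribution, and $\Gamma(x)$ is the gamma function.

• The Dirichlet distribution is confined to a simplex as a consequence of
the constraints.
Dirichlet Distribution
• Plots of the Dirichlet distribution over three variables.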
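Gaussian Univariate Distribution
• In the case of a single variable x, the Gaussian distribution takes the form:

$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}$

which is governed by two parameters:

- µ (mean)
- σ² (variance)

• The Gaussian distribution satisfies:

$\mathcal{N}(x \mid \mu, \sigma^2) > 0, \quad \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2) \, dx = 1$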

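Multivariate Gaussian Distribution
• For a D-dimensional vector x, the Gaussian distribution takes the form:

$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$

which is governed by two parameters:

- µ is a D-dimensional mean vector.
- Σ is a D by D covariance matrix.

and |Σ| denotes the determinant of Σ.

• Note that the covariance matrix is a symmetric positive definite
matrix.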
Central Limit Theorem
• The distribution of the sum of N i.i.d. random variables becomes
increasingly Gaussian as N grows.
• Consider N variables, each of which has a uniform distribution over the
interval [0,1].
• Let us look at the distribution over the mean:

$\bar{x} = \frac{x_1 + x_2 + \ldots + x_N}{N}$

• As N increases, the distribution tends towards a Gaussian distribution.
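One simple way to see this numerically is to simulate the mean of N uniform variables and measure how much probability mass lies within one standard deviation of 0.5. For a Gaussian that fraction is about 0.683, whereas for a single uniform variable it is about 0.577. The sketch below is a simulation under assumed settings (seed, sample count, and the values of N are arbitrary choices), not a proof.

```python
import math
import random


def sample_means(n, num_samples, rng):
    """Draw num_samples means, each over n Uniform[0, 1] variables."""
    return [sum(rng.random() for _ in range(n)) / n
            for _ in range(num_samples)]


def fraction_within_one_std(means, n):
    """Fraction of sample means within one theoretical std of 0.5."""
    std = math.sqrt(1.0 / (12.0 * n))  # variance of Uniform[0, 1] is 1/12
    return sum(abs(m - 0.5) <= std for m in means) / len(means)


rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
results = {}
for n in (1, 2, 10):
    means = sample_means(n, 20000, rng)
    results[n] = fraction_within_one_std(means, n)
    print(n, round(results[n], 3))
```

The printed fraction should move from roughly 0.58 at N=1 towards the Gaussian value of roughly 0.68 at N=10.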

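Geometry of the Gaussian Distribution
• For a D-dimensional vector x, the Gaussian distribution takes the form:

$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$

• Let us analyze the functional dependence of the Gaussian on x through
the quadratic form:

$\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$

• Here Δ is known as the Mahalanobis distance.

• The Gaussian distribution will be constant on
surfaces in x-space for which Δ is constant.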
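To make the distance concrete, here is a small two-dimensional sketch that inverts a 2 by 2 covariance matrix by hand and evaluates the quadratic form. The covariance values and test points are invented for the example; the two points have different Euclidean distances from the mean but the same Mahalanobis distance, so they lie on the same density contour.

```python
import math


def mahalanobis_2d(x, mu, sigma):
    """Mahalanobis distance for a 2-D point and a 2x2 covariance."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(dx[i] * inv[i][j] * dx[j]
               for i in range(2) for j in range(2))
    return math.sqrt(quad)


mu = [0.0, 0.0]
sigma = [[4.0, 0.0], [0.0, 1.0]]  # variance 4 along x1, 1 along x2
d1 = mahalanobis_2d([2.0, 0.0], mu, sigma)
d2 = mahalanobis_2d([0.0, 1.0], mu, sigma)
print(d1, d2)  # both equal 1.0
```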
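Geometry of the Gaussian Distribution
• For a D-dimensional vector x, the Gaussian distribution takes the form:

$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$

• Consider the eigenvalue equation for the covariance matrix:

$\Sigma \mathbf{u}_i = \lambda_i \mathbf{u}_i, \quad i = 1, \ldots, D$

• The covariance can be expressed in terms of its eigenvectors:

$\Sigma = \sum_{i=1}^{D} \lambda_i \mathbf{u}_i \mathbf{u}_i^T$

• The inverse of the covariance:

$\Sigma^{-1} = \sum_{i=1}^{D} \frac{1}{\lambda_i} \mathbf{u}_i \mathbf{u}_i^T$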

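Geometry of the Gaussian Distribution
• For a D-dimensional vector x, the Gaussian distribution takes the form:

$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$

• Remember:

$\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}), \quad \Sigma^{-1} = \sum_{i=1}^{D} \frac{1}{\lambda_i} \mathbf{u}_i \mathbf{u}_i^T$

• Hence:

$\Delta^2 = \sum_{i=1}^{D} \frac{y_i^2}{\lambda_i}, \quad y_i = \mathbf{u}_i^T (\mathbf{x} - \boldsymbol{\mu})$

• We can interpret $\{y_i\}$ as a new coordinate system defined by the
orthonormal vectors $\mathbf{u}_i$ that are shifted and rotated with respect to the original coordinates.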
Geometry of the Gaussian Distribution

• Red curve: surface of constant probability density.

• The axes are defined by the eigenvectors $\mathbf{u}_i$ of the
covariance matrix with corresponding eigenvalues.
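Moments of the Gaussian Distribution
• The expectation of x under the Gaussian distribution:

$\mathbb{E}[\mathbf{x}] = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \int \exp\left\{ -\frac{1}{2} \mathbf{z}^T \Sigma^{-1} \mathbf{z} \right\} (\mathbf{z} + \boldsymbol{\mu}) \, d\mathbf{z} = \boldsymbol{\mu}$

where $\mathbf{z} = \mathbf{x} - \boldsymbol{\mu}$. The term in z in the factor (z+µ) will vanish by symmetry.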
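Moments of the Gaussian Distribution
• The second order moments of the Gaussian distribution:

$\mathbb{E}[\mathbf{x}\mathbf{x}^T] = \boldsymbol{\mu}\boldsymbol{\mu}^T + \Sigma$

• The covariance is given by:

$\mathrm{cov}[\mathbf{x}] = \mathbb{E}\left[ (\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^T \right] = \Sigma$

• Because the parameter matrix Σ governs the covariance of x under the
Gaussian distribution, it is called the covariance matrix.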
Moments of the Gaussian Distribution
• Contours of constant probability density:

- Covariance matrix is of general form.
- Diagonal, axis-aligned covariance matrix.
- Spherical (proportional to identity) covariance matrix.
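Partitioned Gaussian Distribution
• Consider a D-dimensional Gaussian distribution: $p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma)$.
• Let us partition x into two disjoint subsets $\mathbf{x}_a$ and $\mathbf{x}_b$:

$\mathbf{x} = \begin{pmatrix} \mathbf{x}_a \\ \mathbf{x}_b \end{pmatrix}, \quad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}$

• In many situations, it will be more convenient to work with the
precision matrix (inverse of the covariance matrix):

$\Lambda = \Sigma^{-1}, \quad \Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}$

• Note that $\Lambda_{aa}$ is not given by the inverse of $\Sigma_{aa}$.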
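Conditional Distribution
• It turns out that the conditional distribution is also a Gaussian
distribution:

$p(\mathbf{x}_a \mid \mathbf{x}_b) = \mathcal{N}(\mathbf{x}_a \mid \boldsymbol{\mu}_{a|b}, \Sigma_{a|b})$

$\Sigma_{a|b} = \Lambda_{aa}^{-1} = \Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba}$ (the covariance does not depend on $\mathbf{x}_b$)

$\boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a + \Sigma_{ab} \Sigma_{bb}^{-1} (\mathbf{x}_b - \boldsymbol{\mu}_b)$ (a linear function of $\mathbf{x}_b$)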
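For a two-dimensional Gaussian the blocks are scalars, so the standard conditional mean and variance can be checked by hand. The sketch below assumes the usual partitioned-Gaussian results (conditional mean $\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b)$ and conditional variance $\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$); the numbers and function name are made up for illustration.

```python
def conditional_gaussian_1d(mu_a, mu_b, s_aa, s_ab, s_bb, x_b):
    """Mean and variance of x_a given x_b for a 2-D Gaussian."""
    mean = mu_a + s_ab / s_bb * (x_b - mu_b)  # linear in x_b
    var = s_aa - s_ab * s_ab / s_bb           # independent of x_b
    return mean, var


# Hypothetical joint parameters.
mu_a, mu_b = 0.0, 1.0
s_aa, s_ab, s_bb = 2.0, 1.0, 4.0

mean, var = conditional_gaussian_1d(mu_a, mu_b, s_aa, s_ab, s_bb, x_b=3.0)
print(mean, var)  # 0.5 1.75

# Cross-check with the precision matrix: the (a, a) entry of the
# inverse covariance is s_bb / det, and its reciprocal is the
# conditional variance (not 1 / s_aa).
det = s_aa * s_bb - s_ab ** 2
lambda_aa = s_bb / det
print(1.0 / lambda_aa, 1.0 / s_aa)
```

The cross-check also illustrates that the (a, a) block of the precision matrix is not the inverse of the (a, a) block of the covariance: here its reciprocal is about 1.75 rather than 2.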
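Marginal Distribution
• It turns out that the marginal distribution is also a Gaussian distribution:

$p(\mathbf{x}_a) = \int p(\mathbf{x}_a, \mathbf{x}_b) \, d\mathbf{x}_b = \mathcal{N}(\mathbf{x}_a \mid \boldsymbol{\mu}_a, \Sigma_{aa})$

• For a marginal distribution, the mean and covariance are most simply
expressed in terms of the partitioned covariance matrix.
Conditional and Marginal Distributions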
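Maximum Likelihood Estimation
• Suppose we observed i.i.d. data $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$.
• We can construct the log-likelihood function, which is a function of
µ and Σ:

$\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \Sigma) = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln |\Sigma| - \frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}_n - \boldsymbol{\mu})$

• Note that the likelihood function depends on the N data points only
through the following sums, which are the sufficient statistics:

$\sum_{n=1}^{N} \mathbf{x}_n, \quad \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^T$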
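Maximum Likelihood Estimation
• To find a maximum likelihood estimate of the mean, we set the
derivative of the log-likelihood function to zero:

$\frac{\partial}{\partial \boldsymbol{\mu}} \ln p(\mathbf{X} \mid \boldsymbol{\mu}, \Sigma) = \sum_{n=1}^{N} \Sigma^{-1} (\mathbf{x}_n - \boldsymbol{\mu}) = 0$

and solve to obtain:

$\boldsymbol{\mu}_{ML} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$

• Similarly, we can find the ML estimate of Σ:

$\Sigma_{ML} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{ML})(\mathbf{x}_n - \boldsymbol{\mu}_{ML})^T$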

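Maximum Likelihood Estimation
• Evaluating the expectation of the ML estimates under the true
distribution, we obtain:

$\mathbb{E}[\boldsymbol{\mu}_{ML}] = \boldsymbol{\mu}$ (unbiased estimate)

$\mathbb{E}[\Sigma_{ML}] = \frac{N-1}{N} \Sigma$ (biased estimate)

• Note that the maximum likelihood estimate of Σ is biased.

• We can correct the bias by defining a different estimator:

$\tilde{\Sigma} = \frac{1}{N-1} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{ML})(\mathbf{x}_n - \boldsymbol{\mu}_{ML})^T$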
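The bias can be seen in a quick one-dimensional simulation: averaging the ML variance estimate over many small samples from a unit-variance Gaussian gives roughly (N-1)/N, while dividing by N-1 instead of N gives roughly 1. The seed, sample size, and number of trials below are arbitrary assumptions for the sketch.

```python
import random


def variance_estimates(xs):
    """Return (ML estimate, bias-corrected estimate) of the variance."""
    n = len(xs)
    mean = sum(xs) / n
    sum_sq = sum((x - mean) ** 2 for x in xs)
    return sum_sq / n, sum_sq / (n - 1)


rng = random.Random(0)  # arbitrary fixed seed
n, trials = 5, 20000
avg_ml = avg_corrected = 0.0
for _ in range(trials):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true variance is 1
    ml, corrected = variance_estimates(sample)
    avg_ml += ml / trials
    avg_corrected += corrected / trials

# Expect roughly (N - 1) / N = 0.8 for ML and roughly 1.0 when corrected.
print(round(avg_ml, 3), round(avg_corrected, 3))
```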

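Sequential Estimation
• Sequential estimation allows data points to be processed one at a time
and then discarded. Important for on-line applications.
• Let us consider the contribution of the Nth data point $\mathbf{x}_N$:

$\boldsymbol{\mu}_{ML}^{(N)} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n = \boldsymbol{\mu}_{ML}^{(N-1)} + \frac{1}{N} \left( \mathbf{x}_N - \boldsymbol{\mu}_{ML}^{(N-1)} \right)$

Here $\boldsymbol{\mu}_{ML}^{(N-1)}$ is the old estimate, $1/N$ is the correction weight, and
$\mathbf{x}_N - \boldsymbol{\mu}_{ML}^{(N-1)}$ is the correction given $\mathbf{x}_N$.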
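The update rule (new estimate = old estimate + correction weight 1/N times the correction) can be sketched as a generator that sees each point once and keeps only the running mean. The function name and the toy stream are assumptions for illustration.

```python
def sequential_mean(stream):
    """Yield the running ML mean after each data point."""
    mu = 0.0
    for n, x in enumerate(stream, start=1):
        # old estimate + (1 / n) * (new point - old estimate)
        mu += (x - mu) / n
        yield mu


data = [2.0, 4.0, 6.0, 8.0]  # hypothetical scalar observations
estimates = list(sequential_mean(data))
print(estimates)  # [2.0, 3.0, 4.0, 5.0]
```

The final value agrees with the batch mean, but no past data points need to be stored.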
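Student's t-Distribution
• Consider Student's t-Distribution:

$p(x \mid \mu, a, b) = \int_0^\infty \mathcal{N}(x \mid \mu, \tau^{-1}) \, \mathrm{Gam}(\tau \mid a, b) \, d\tau$ (an infinite mixture of Gaussians)

$\mathrm{St}(x \mid \mu, \lambda, \nu) = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)} \left( \frac{\lambda}{\pi\nu} \right)^{1/2} \left[ 1 + \frac{\lambda (x - \mu)^2}{\nu} \right]^{-\nu/2 - 1/2}$

where $\lambda = a/b$ is sometimes called the precision parameter and $\nu = 2a$ is the degrees of freedom.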
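Student's t-Distribution
• Setting ν = 1 recovers the Cauchy distribution.
• The limit ν → ∞ corresponds to a Gaussian distribution.
Student's t-Distribution
• Robustness to outliers: Gaussian vs. t-Distribution.
Student's t-Distribution
• The multivariate extension of the t-Distribution:

$\mathrm{St}(\mathbf{x} \mid \boldsymbol{\mu}, \Lambda, \nu) = \frac{\Gamma(D/2 + \nu/2)}{\Gamma(\nu/2)} \frac{|\Lambda|^{1/2}}{(\pi\nu)^{D/2}} \left[ 1 + \frac{\Delta^2}{\nu} \right]^{-D/2 - \nu/2}$

where $\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^T \Lambda (\mathbf{x} - \boldsymbol{\mu})$.

• Properties:

$\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu}$ if $\nu > 1$, $\quad \mathrm{cov}[\mathbf{x}] = \frac{\nu}{\nu - 2} \Lambda^{-1}$ if $\nu > 2$, $\quad \mathrm{mode}[\mathbf{x}] = \boldsymbol{\mu}$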
Mixture of Gaussians
• When modeling real-world data, the Gaussian assumption may not be
appropriate.
• Consider the following example: Old Faithful Dataset

Figure panels: single Gaussian; mixture of two Gaussians.
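Mixture of Gaussians
• We can combine simple models into a complex model by defining a
superposition of K Gaussian densities of the form:

$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k)$

where $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k)$ is a component and $\pi_k$ is its mixing coefficient.
(Figure: example with K=3.)

• Note that each Gaussian component has its own mean $\boldsymbol{\mu}_k$ and
covariance $\Sigma_k$. The parameters $\pi_k$ are called mixing coefficients.
• More generally, mixture models can comprise linear combinations of
other distributions.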
Mixture of Gaussians
• Illustration of a mixture of 3 Gaussians in a 2-dimensional space:

(a) Contours of constant density of each of the mixture components,
along with the mixing coefficients.
(b) Contours of marginal probability density.
(c) A surface plot of the distribution p(x).

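Maximum Likelihood Estimation
• Given a dataset D, we can determine the model parameters $\boldsymbol{\mu}_k$, $\Sigma_k$, $\pi_k$ by
maximizing the log-likelihood function:

$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Sigma_k) \right\}$

Log of a sum: no closed form solution.

• Solution: use standard, iterative, numeric optimization methods or the
Expectation Maximization algorithm.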
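The objective itself is easy to evaluate even though it has no closed-form maximizer. The sketch below computes the mixture log-likelihood for one-dimensional data and compares a single broad Gaussian with a two-component mixture on made-up bimodal data. It only evaluates the objective at hand-picked parameters (all values here are assumptions); it does not implement EM or any optimizer.

```python
import math


def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_log_likelihood(xs, weights, means, variances):
    """Sum over data points of the log of the weighted component sum."""
    total = 0.0
    for x in xs:
        density = sum(w * gaussian_pdf(x, m, v)
                      for w, m, v in zip(weights, means, variances))
        total += math.log(density)
    return total


# Hypothetical bimodal data with clusters near -2 and 3.
xs = [-2.1, -1.9, -2.0, 3.0, 3.2, 2.8]

# One broad Gaussian versus two narrow, equally weighted components.
single = mixture_log_likelihood(xs, [1.0], [0.5], [6.5])
double = mixture_log_likelihood(xs, [0.5, 0.5], [-2.0, 3.0], [0.1, 0.1])
print(round(single, 3), round(double, 3))
```

On this data the two-component model attains a much higher log-likelihood than the single Gaussian. In practice the density would be computed in log space (log-sum-exp) to avoid underflow for points far from every component.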
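The Exponential Family
• The exponential family of distributions over x is defined to be a set of
distributions of the form:

$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \, g(\boldsymbol{\eta}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \}$

where
- η is the vector of natural parameters
- u(x) is the vector of sufficient statistics

• The function g(η) can be interpreted as the coefficient that ensures
that the distribution p(x|η) is normalized:

$g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \} \, d\mathbf{x} = 1$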
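Bernoulli Distribution
• The Bernoulli distribution is a member of the exponential family:

$p(x \mid \mu) = \mu^x (1 - \mu)^{1-x} = \exp\{ x \ln \mu + (1 - x) \ln(1 - \mu) \} = (1 - \mu) \exp\left\{ \ln\left( \frac{\mu}{1 - \mu} \right) x \right\}$

• Comparing with the general form of the exponential family:

$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \, g(\boldsymbol{\eta}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \}$

we see that

$\eta = \ln\left( \frac{\mu}{1 - \mu} \right)$

and so

$\mu = \sigma(\eta) = \frac{1}{1 + \exp(-\eta)}$ (the logistic sigmoid)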
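A short numerical check of the two mappings: the natural parameter is the log-odds of µ, and the logistic sigmoid inverts it. The sketch also verifies that the exponential-family form σ(-η) exp(ηx) reproduces the Bernoulli probabilities µ and 1-µ. Function names are assumptions for this example.

```python
import math


def logit(mu):
    """Natural parameter of the Bernoulli: the log-odds of mu."""
    return math.log(mu / (1.0 - mu))


def sigmoid(eta):
    """Logistic sigmoid, the inverse of logit."""
    return 1.0 / (1.0 + math.exp(-eta))


def bernoulli_exp_family(x, eta):
    """Bernoulli pmf written as sigmoid(-eta) * exp(eta * x)."""
    return sigmoid(-eta) * math.exp(eta * x)


for mu in (0.1, 0.5, 0.9):
    eta = logit(mu)
    print(mu, round(eta, 4), round(sigmoid(eta), 4),
          round(bernoulli_exp_family(1, eta), 4),
          round(bernoulli_exp_family(0, eta), 4))
```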
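Bernoulli Distribution
• The Bernoulli distribution is a member of the exponential family:

$p(x \mid \mu) = (1 - \mu) \exp\left\{ \ln\left( \frac{\mu}{1 - \mu} \right) x \right\}$

• The Bernoulli distribution can therefore be written as:

$p(x \mid \eta) = \sigma(-\eta) \exp(\eta x)$

where

$u(x) = x, \quad h(x) = 1, \quad g(\eta) = \sigma(-\eta)$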
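Multinomial Distribution
• The Multinomial distribution is a member of the exponential family:

$p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{M} \mu_k^{x_k} = \exp\left\{ \sum_{k=1}^{M} x_k \ln \mu_k \right\} = h(\mathbf{x}) \, g(\boldsymbol{\eta}) \exp(\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}))$

where

$\mathbf{x} = (x_1, \ldots, x_M)^T, \quad \boldsymbol{\eta} = (\eta_1, \ldots, \eta_M)^T, \quad \eta_k = \ln \mu_k$

and

$\mathbf{u}(\mathbf{x}) = \mathbf{x}, \quad h(\mathbf{x}) = 1, \quad g(\boldsymbol{\eta}) = 1$

NOTE: The parameters $\eta_k$ are not independent since the corresponding $\mu_k$ must
satisfy $\sum_{k=1}^{M} \mu_k = 1$.

• In some cases it will be convenient to remove the constraint by
expressing the distribution over the M-1 parameters.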
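Multinomial Distribution
• The Multinomial distribution is a member of the exponential family:

$p(\mathbf{x} \mid \boldsymbol{\mu}) = \exp\left\{ \sum_{k=1}^{M} x_k \ln \mu_k \right\}$

• Let $\mu_M = 1 - \sum_{k=1}^{M-1} \mu_k$.

• This leads to:

$\eta_k = \ln\left( \frac{\mu_k}{1 - \sum_{j=1}^{M-1} \mu_j} \right)$

and

$\mu_k = \frac{\exp(\eta_k)}{1 + \sum_{j=1}^{M-1} \exp(\eta_j)}$ (the softmax function)

• Here the parameters $\eta_k$ are independent.

• Note that: $0 \le \mu_k \le 1$ and $\sum_{k=1}^{M-1} \mu_k \le 1$.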
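The reduced parameterization can be checked with a round trip: map a probability vector to its M-1 natural parameters (log-ratios against the last state), then map back with the softmax form that has a 1 in the denominator. This is a sketch under that assumed parameterization; the function names and the example vector are illustrative.

```python
import math


def natural_params(mu):
    """Log-ratios ln(mu_k / mu_M) for k = 1..M-1, given all M probabilities."""
    mu_last = 1.0 - sum(mu[:-1])
    return [math.log(m / mu_last) for m in mu[:-1]]


def softmax_reduced(eta):
    """Map M-1 unconstrained parameters back to M probabilities."""
    denom = 1.0 + sum(math.exp(e) for e in eta)
    mu = [math.exp(e) / denom for e in eta]
    return mu + [1.0 - sum(mu)]  # the last state takes the remaining mass


mu = [0.2, 0.3, 0.5]  # hypothetical probabilities, M = 3
eta = natural_params(mu)
recovered = softmax_reduced(eta)
print([round(e, 4) for e in eta])
print([round(m, 4) for m in recovered])
```

Any real-valued vector of M-1 parameters maps to a valid probability vector, which is what makes this form convenient for unconstrained optimization.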
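Multinomial Distribution
• The Multinomial distribution is a member of the exponential family:

$p(\mathbf{x} \mid \boldsymbol{\mu}) = \exp\left\{ \sum_{k=1}^{M} x_k \ln \mu_k \right\}$

• The Multinomial distribution can therefore be written as:

$p(\mathbf{x} \mid \boldsymbol{\eta}) = \left( 1 + \sum_{k=1}^{M-1} \exp(\eta_k) \right)^{-1} \exp(\boldsymbol{\eta}^T \mathbf{x})$

where

$\boldsymbol{\eta} = (\eta_1, \ldots, \eta_{M-1}, 0)^T, \quad \mathbf{u}(\mathbf{x}) = \mathbf{x}, \quad h(\mathbf{x}) = 1, \quad g(\boldsymbol{\eta}) = \left( 1 + \sum_{k=1}^{M-1} \exp(\eta_k) \right)^{-1}$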
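Gaussian Distribution
• The Gaussian distribution can be written as:

$p(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} x^2 + \frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} \mu^2 \right\} = h(x) \, g(\boldsymbol{\eta}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(x) \}$

where

$\boldsymbol{\eta} = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix}, \quad \mathbf{u}(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}, \quad h(x) = (2\pi)^{-1/2}, \quad g(\boldsymbol{\eta}) = (-2\eta_2)^{1/2} \exp\left( \frac{\eta_1^2}{4\eta_2} \right)$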
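ML for the Exponential Family
• Remember the Exponential Family:

$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \, g(\boldsymbol{\eta}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \}$

• From the definition of the normalizer g(η):

$g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \} \, d\mathbf{x} = 1$

• We can take a derivative w.r.t. η:

$\nabla g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \} \, d\mathbf{x} + g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \} \, \mathbf{u}(\mathbf{x}) \, d\mathbf{x} = 0$

• Thus

$-\frac{1}{g(\boldsymbol{\eta})} \nabla g(\boldsymbol{\eta}) = g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \} \, \mathbf{u}(\mathbf{x}) \, d\mathbf{x} = \mathbb{E}[\mathbf{u}(\mathbf{x})]$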
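ML for the Exponential Family
• Remember the Exponential Family:

$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \, g(\boldsymbol{\eta}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \}$

• We can take a derivative w.r.t. η:

$-\frac{1}{g(\boldsymbol{\eta})} \nabla g(\boldsymbol{\eta}) = g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \} \, \mathbf{u}(\mathbf{x}) \, d\mathbf{x} = \mathbb{E}[\mathbf{u}(\mathbf{x})]$

• Thus

$-\nabla \ln g(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]$

• Note that the covariance of u(x) can be expressed in terms of the
second derivative of g(η), and similarly for the higher moments.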
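ML for the Exponential Family
• Suppose we observed i.i.d. data $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$.
• We can construct the log-likelihood function, which is a function of
the natural parameter η:

$\ln p(\mathbf{X} \mid \boldsymbol{\eta}) = \sum_{n=1}^{N} \ln h(\mathbf{x}_n) + N \ln g(\boldsymbol{\eta}) + \boldsymbol{\eta}^T \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)$

• Therefore we have

$-\nabla \ln g(\boldsymbol{\eta}_{ML}) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)$

where $\sum_n \mathbf{u}(\mathbf{x}_n)$ is the sufficient statistic.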
