10-601 Introduction to Machine Learning

Machine Learning Department


School of Computer Science
Carnegie Mellon University

MLE/MAP + Naïve Bayes

Matt Gormley
Lecture 17
Mar. 20, 2019
Reminders
• Homework 5: Neural Networks
– Out: Fri, Mar 1
– Due: Fri, Mar 22 at 11:59pm
• Homework 6: Learning Theory / Generative Models
– Out: Fri, Mar 22
– Due: Fri, Mar 29 at 11:59pm (1 week)

TIP: Do the readings!

• Today’s In-Class Poll
– http://p17.mlcourse.org
MLE AND MAP

Likelihood Function: One R.V.

• Suppose we have N samples D = {x(1), x(2), …, x(N)} from a random variable X

• The likelihood function:
– Case 1: X is discrete with pmf p(x|θ)
L(θ) = p(x(1)|θ) p(x(2)|θ) … p(x(N)|θ)
– Case 2: X is continuous with pdf f(x|θ)
L(θ) = f(x(1)|θ) f(x(2)|θ) … f(x(N)|θ)

• The log-likelihood function:
– Case 1: X is discrete with pmf p(x|θ)
l(θ) = log p(x(1)|θ) + … + log p(x(N)|θ)
– Case 2: X is continuous with pdf f(x|θ)
l(θ) = log f(x(1)|θ) + … + log f(x(N)|θ)

In both cases (discrete / continuous), the likelihood tells us how likely one sample is relative to another.
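As a concrete illustration (my own sketch, not part of the slides), here is how the log-likelihood of a sample could be computed for the continuous case, assuming a univariate Gaussian pdf with parameters θ = (μ, σ):

    import math

    def gaussian_log_pdf(x, mu, sigma):
        # log f(x | mu, sigma) for a univariate Gaussian
        return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

    def log_likelihood(data, mu, sigma):
        # l(theta) = log f(x(1)|theta) + … + log f(x(N)|theta)
        return sum(gaussian_log_pdf(x, mu, sigma) for x in data)

    data = [2.1, 1.9, 2.4, 2.0]                       # toy sample
    print(log_likelihood(data, mu=2.0, sigma=0.5))    # larger value = data more likely under theta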
Likelihood Function: Two R.V.s

• Suppose we have N samples D = {(x(1), y(1)), …, (x(N), y(N))} from a pair of random variables X, Y

• The conditional likelihood function:
– Case 1: Y is discrete with pmf p(y | x, θ)
L(θ) = p(y(1) | x(1), θ) … p(y(N) | x(N), θ)
– Case 2: Y is continuous with pdf f(y | x, θ)
L(θ) = f(y(1) | x(1), θ) … f(y(N) | x(N), θ)

• The joint likelihood function:
– Case 1: X and Y are discrete with pmf p(x,y|θ)
L(θ) = p(x(1), y(1)|θ) … p(x(N), y(N)|θ)
– Case 2: X and Y are continuous with pdf f(x,y|θ)
L(θ) = f(x(1), y(1)|θ) … f(x(N), y(N)|θ)
Likelihood Function: Two R.V.s (continued)

• Suppose we have N samples D = {(x(1), y(1)), …, (x(N), y(N))} from a pair of random variables X, Y

• The joint likelihood function:
– Case 1: X and Y are discrete with pmf p(x,y|θ)
L(θ) = p(x(1), y(1)|θ) … p(x(N), y(N)|θ)
– Case 2: X and Y are continuous with pdf f(x,y|θ)
L(θ) = f(x(1), y(1)|θ) … f(x(N), y(N)|θ)
– Case 3: Y is discrete with pmf p(y|β) and X is continuous with pdf f(x|y,α)
L(α, β) = f(x(1)| y(1), α) p(y(1)|β) … f(x(N)| y(N), α) p(y(N)|β)
– Case 4: Y is continuous with pdf f(y|β) and X is discrete with pmf p(x|y,α)
L(α, β) = p(x(1)| y(1), α) f(y(1)|β) … p(x(N)| y(N), α) f(y(N)|β)

Cases 3 and 4 are mixed discrete/continuous!
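To make Case 3 concrete (a sketch of my own, not from the slides), suppose Y is Bernoulli(β) and X | Y = y is Gaussian with mean μ_y and a shared standard deviation σ, so α = (μ0, μ1, σ). The joint log-likelihood is then a Gaussian log-pdf term plus a Bernoulli log-pmf term per sample:

    import math

    def gaussian_log_pdf(x, mu, sigma):
        # log f(x | mu, sigma) for a univariate Gaussian
        return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

    def joint_log_likelihood(pairs, beta, mu, sigma):
        # Case 3: y ~ Bernoulli(beta) is discrete, x | y ~ Gaussian(mu[y], sigma) is continuous.
        # log L(alpha, beta) = sum_i [ log f(x(i) | y(i), alpha) + log p(y(i) | beta) ]
        total = 0.0
        for x, y in pairs:
            total += gaussian_log_pdf(x, mu[y], sigma)
            total += math.log(beta if y == 1 else 1 - beta)
        return total

    pairs = [(2.3, 1), (0.1, 0), (1.9, 1), (-0.4, 0)]   # toy (x, y) samples
    print(joint_log_likelihood(pairs, beta=0.5, mu={0: 0.0, 1: 2.0}, sigma=1.0))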
MLE
Suppose we have data D = {x(i)}_{i=1}^N

Principle of Maximum Likelihood Estimation:
Choose the parameters that maximize the likelihood of the data.

θMLE = argmax_θ ∏_{i=1}^N p(x(i) | θ)

Maximum Likelihood Estimate (MLE)

[Figure: the likelihood L(θ) plotted for a 1D parameter and the surface L(θ1, θ2) for a 2D parameter, with the maximizer θMLE marked in each.]
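One step the slides use implicitly (a standard identity, not transcribed from the slides): because log is strictly increasing, maximizing the likelihood is equivalent to maximizing the log-likelihood, which turns the product over samples into a sum:

    \theta_{\text{MLE}}
      = \operatorname*{argmax}_{\theta} \prod_{i=1}^{N} p(x^{(i)} \mid \theta)
      = \operatorname*{argmax}_{\theta} \sum_{i=1}^{N} \log p(x^{(i)} \mid \theta)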
MLE
What does maximizing likelihood accomplish?
• There is only a finite amount of probability mass (i.e. the sum-to-one constraint)
• MLE tries to allocate as much probability mass as possible to the things we have observed…

…at the expense of the things we have not observed
Recipe for Closed-form MLE
1. Assume data was generated i.i.d. from some model
(i.e. write the generative story)
x(i) ~ p(x|θ)
2. Write log-likelihood
l(θ) = log p(x(1)|θ) + … + log p(x(N)|θ)
3. Compute partial derivatives (i.e. gradient)
∂l(θ)/∂θ1 = …
∂l(θ)/∂θ2 = …
…
∂l(θ)/∂θM = …
4. Set derivatives to zero and solve for θ
∂l(θ)/∂θm = 0 for all m ∈ {1, …, M}
θMLE = solution to system of M equations and M variables
5. Compute the second derivative and check that l(θ) is concave down at θMLE
MLE
Example: MLE of Exponential Distribution
Goal:
• pdf of Exponential(λ): f(x) = λ e^(−λx)
• Suppose Xi ~ Exponential(λ) for 1 ≤ i ≤ N.
• Find the MLE of λ for data D = {x(i)}_{i=1}^N

Steps:
• First write down the log-likelihood of the sample.
• Compute the first derivative, set it to zero, and solve for λ.
• Compute the second derivative and check that the log-likelihood is concave down at λMLE.
MLE
Example: MLE of Exponential Distribution

• First write down the log-likelihood of the sample.

l(λ) = Σ_{i=1}^N log f(x(i))               (1)
     = Σ_{i=1}^N log( λ exp(−λ x(i)) )     (2)
     = Σ_{i=1}^N [ log(λ) − λ x(i) ]       (3)
     = N log(λ) − λ Σ_{i=1}^N x(i)         (4)
MLE
Example: MLE of Exponential Distribution

• Compute the first derivative, set it to zero, and solve for λ.

dl(λ)/dλ = d/dλ [ N log(λ) − λ Σ_{i=1}^N x(i) ]   (1)
         = N/λ − Σ_{i=1}^N x(i) = 0               (2)

⇒ λMLE = N / Σ_{i=1}^N x(i)                       (3)
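A quick numerical sanity check (my own, not from the slides): λMLE = N / Σ x(i) is the reciprocal of the sample mean, and it should agree with a brute-force grid maximization of the log-likelihood derived above:

    import math
    import random

    random.seed(0)
    true_lam = 2.0
    data = [random.expovariate(true_lam) for _ in range(10000)]

    n, s = len(data), sum(data)
    lam_closed = n / s                            # closed-form MLE: N / sum(x)

    def log_lik(lam):
        # l(lambda) = N log(lambda) - lambda * sum(x)
        return n * math.log(lam) - lam * s

    grid = [0.01 * k for k in range(1, 1001)]     # candidate lambda values in (0, 10]
    lam_grid = max(grid, key=log_lik)

    print(lam_closed, lam_grid)                   # both should be close to true_lam = 2.0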
MLE
In-Class Exercise:
Show that the MLE of parameter ɸ for N samples drawn from Bernoulli(ɸ) is the sample mean, ɸMLE = (1/N) Σ_{i=1}^N x(i).

Steps to answer:
1. Write the log-likelihood of the sample
2. Compute the derivative w.r.t. ɸ
3. Set the derivative to zero and solve for ɸ
MLE
Question:
Assume we have N samples x(1), x(2), …, x(N) drawn from a Bernoulli(ɸ).
What is the log-likelihood of the data l(ɸ)?
Assume N1 = # of (x(i) = 1) and N0 = # of (x(i) = 0).

Answer:
A. l(ɸ) = N1 log(ɸ) + N0 (1 - log(ɸ))
B. l(ɸ) = N1 log(ɸ) + N0 log(1-ɸ)
C. l(ɸ) = log(ɸ)N1 + (1 - log(ɸ))N0
D. l(ɸ) = log(ɸ)N1 + log(1-ɸ)N0
E. l(ɸ) = N0 log(ɸ) + N1 (1 - log(ɸ))
F. l(ɸ) = N0 log(ɸ) + N1 log(1-ɸ)
G. l(ɸ) = log(ɸ)N0 + (1 - log(ɸ))N1
H. l(ɸ) = log(ɸ)N0 + log(1-ɸ)N1
I. l(ɸ) = the most likely answer
MLE
Question:
Assume we have N samples x(1), x(2), …, x(N) drawn from a Bernoulli(ɸ).
What is the derivative of the log-likelihood ∂l(ɸ)/∂ɸ?
Assume N1 = # of (x(i) = 1) and N0 = # of (x(i) = 0).

Answer:
A. ∂l(ɸ)/∂ɸ = ɸN1 + (1 - ɸ)N0
B. ∂l(ɸ)/∂ɸ = ɸ / N1 + (1 - ɸ) / N0
C. ∂l(ɸ)/∂ɸ = N1 / ɸ + N0 / (1 - ɸ)
D. ∂l(ɸ)/∂ɸ = log(ɸ) / N1 + log(1 - ɸ) / N0
E. ∂l(ɸ)/∂ɸ = N1 / log(ɸ) + N0 / log(1 - ɸ)
Learning from Data (Frequentist)
Whiteboard
– Optimization for MLE
– Examples: 1D and 2D optimization
– Example: MLE of Bernoulli
– Example: MLE of Categorical
– Aside: Method of Lagrange Multipliers

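For reference after the whiteboard's "MLE of Bernoulli" example (a sanity-check sketch of my own, not transcribed from the whiteboard): the closed-form result is ɸMLE = N1 / (N1 + N0), and the derivative of l(ɸ) = N1 log(ɸ) + N0 log(1 − ɸ) does vanish there:

    data = [1, 0, 1, 1, 0, 1, 1, 1]             # toy Bernoulli sample
    N1, N0 = sum(data), len(data) - sum(data)

    phi_mle = N1 / (N1 + N0)                    # closed-form MLE = fraction of ones

    # derivative of l(phi) = N1 log(phi) + N0 log(1 - phi), evaluated at phi_mle
    dl = N1 / phi_mle - N0 / (1 - phi_mle)
    print(phi_mle, dl)                          # 0.75 and 0.0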
MLE vs. MAP
Suppose we have data D = {x(i)}_{i=1}^N

Principle of Maximum Likelihood Estimation:
Choose the parameters that maximize the likelihood of the data.

θMLE = argmax_θ ∏_{i=1}^N p(x(i) | θ)          [Maximum Likelihood Estimate (MLE)]

Principle of Maximum a posteriori (MAP) Estimation:
Choose the parameters that maximize the posterior of the parameters given the data.

θMAP = argmax_θ ∏_{i=1}^N p(x(i) | θ) p(θ)     [Maximum a posteriori (MAP) estimate; the p(θ) factor is the prior]

Important! Usually the parameters are continuous, so the prior is a probability density function.
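To spell out why maximizing likelihood × prior maximizes the posterior (a standard identity, not transcribed from the slides): by Bayes' rule the posterior is p(θ | D) = p(D | θ) p(θ) / p(D), and p(D) does not depend on θ, so

    \theta_{\text{MAP}}
      = \operatorname*{argmax}_{\theta}\; p(\theta \mid D)
      = \operatorname*{argmax}_{\theta}\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)}
      = \operatorname*{argmax}_{\theta}\; \sum_{i=1}^{N} \log p(x^{(i)} \mid \theta) + \log p(\theta)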
Learning from Data (Bayesian)
Whiteboard
– maximum a posteriori (MAP) estimation
– Optimization for MAP
– Example: MAP of Bernoulli—Beta

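The "MAP of Bernoulli—Beta" example is worked on the whiteboard; as a hedged sketch of the standard closed-form result (my own code, not the course's): with a Beta(α, β) prior on ɸ, the prior acts like pseudocounts, giving ɸMAP = (N1 + α − 1) / (N + α + β − 2).

    def bernoulli_map(data, alpha=2.0, beta=2.0):
        # MAP estimate of phi for Bernoulli data under a Beta(alpha, beta) prior.
        # With alpha = beta = 1 (uniform prior) this reduces to the MLE, N1 / N.
        n1 = sum(data)
        n = len(data)
        return (n1 + alpha - 1) / (n + alpha + beta - 2)

    print(bernoulli_map([1, 1, 1, 0]))   # MLE would give 0.75; the Beta(2, 2) prior pulls it toward 0.5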
Takeaways
• One view of what ML is trying to accomplish is function approximation
• The principle of maximum likelihood estimation provides an alternate view of learning
• Synthetic data can help debug ML algorithms
• Probability distributions can be used to model real data that occurs in the world (don’t worry, we’ll make our distributions more interesting soon!)
Learning Objectives
MLE / MAP
You should be able to…
1. Recall probability basics, including but not limited to: discrete
and continuous random variables, probability mass functions,
probability density functions, events vs. random variables,
expectation and variance, joint probability distributions,
marginal probabilities, conditional probabilities, independence,
conditional independence
2. Describe common probability distributions such as the Beta,
Dirichlet, Multinomial, Categorical, Gaussian, Exponential, etc.
3. State the principle of maximum likelihood estimation and
explain what it tries to accomplish
4. State the principle of maximum a posteriori estimation and
explain why we use it
5. Derive the MLE or MAP parameters of a simple model in closed
form

NAÏVE BAYES

Naïve Bayes Outline
• Real-world Dataset
– Economist vs. Onion articles
– Document → bag-of-words → binary feature vector
• Naive Bayes: Model
– Generating synthetic "labeled documents"
– Definition of model
– Naive Bayes assumption
– Counting # of parameters with / without
NB assumption
• Naïve Bayes: Learning from Data
– Data likelihood
– MLE for Naive Bayes
– MAP for Naive Bayes
• Visualizing Gaussian Naive Bayes

Naïve Bayes
• Why are we talking about Naïve Bayes?
– It’s just another decision function that fits into
our “big picture” recipe from last time
– But it’s our first example of a Bayesian Network
and provides a clearer picture of probabilistic
learning
– Just like the other Bayes Nets we’ll see, it admits
a closed form solution for MLE and MAP
– So learning is extremely efficient (just counting)

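To preview what "learning is just counting" means (a sketch of my own, assuming the Bernoulli Naïve Bayes model defined later in these slides): every MLE parameter is a ratio of counts over the training data.

    def bernoulli_nb_mle(data):
        # data: list of (y, x) pairs, with y in {0, 1} and x a list of M binary features.
        # Assumes both classes appear at least once in the data.
        n = len(data)
        count_y = {c: sum(1 for y, _ in data if y == c) for c in (0, 1)}
        phi = count_y[1] / n                       # MLE of P(y = 1): just a count ratio
        M = len(data[0][1])
        theta = {c: [sum(x[m] for y, x in data if y == c) / count_y[c]  # MLE of P(x_m = 1 | y = c)
                     for m in range(M)]
                 for c in (0, 1)}
        return phi, theta

    phi, theta = bernoulli_nb_mle([(1, [1, 0, 1]), (0, [0, 0, 1]), (1, [1, 1, 0])])
    print(phi, theta)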
Fake News Detector
Today’s Goal: To define a generative model of emails
of two different classes (e.g. real vs. fake news)

CNN The Onion

Fake News Detector
CNN The Onion

We can pretend the natural process generating these vectors is stochastic…


Naive Bayes: Model
Whiteboard
– Document → bag-of-words → binary feature vector
– Generating synthetic "labeled documents"
– Definition of model
– Naive Bayes assumption
– Counting # of parameters with / without NB
assumption

Model 1: Bernoulli Naïve Bayes
Flip weighted coin
If HEADS, flip each red coin; if TAILS, flip each blue coin.
(Each red coin corresponds to an xm.)

y  x1  x2  x3  …  xM
0   1   0   1  …   1
1   0   1   0  …   1
1   1   1   1  …   1
0   0   0   1  …   1
0   1   0   1  …   0
1   1   0   1  …   0

We can generate data in this fashion. Though in practice we never would, since our data is given. Instead, this provides an explanation of how the data was generated (albeit a terrible one).
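A minimal sketch of this generative story in code (my own illustration with made-up parameters, not from the slides), assuming a class prior ɸ for the weighted coin and per-class head probabilities theta[y][m] for the feature coins:

    import random

    def generate_data(n, phi, theta, rng=None):
        # phi: P(y = 1) for the weighted "label" coin
        # theta[y][m]: P(x_m = 1 | y), one coin per feature m and class y
        rng = rng or random.Random(0)
        rows = []
        for _ in range(n):
            y = 1 if rng.random() < phi else 0                   # flip the weighted coin
            x = [1 if rng.random() < theta[y][m] else 0          # flip each feature coin for class y
                 for m in range(len(theta[y]))]
            rows.append((y, x))
        return rows

    theta = {0: [0.2, 0.1, 0.7], 1: [0.8, 0.6, 0.3]}             # made-up parameters, M = 3 features
    for y, x in generate_data(5, phi=0.5, theta=theta):
        print(y, x)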
