10-601 Introduction to Machine Learning
Machine Learning Department
School of Computer Science
Carnegie Mellon University
MLE/MAP
+
Naïve Bayes
Matt Gormley
Lecture 17
Mar. 20, 2019
1
Reminders
• Homework 5: Neural Networks
– Out: Fri, Mar 1
– Due: Fri, Mar 22 at 11:59pm
• Homework 6: Learning Theory / Generative
Models
– Out: Fri, Mar 22
– Due: Fri, Mar 29 at 11:59pm (1 week)
TIP: Do the readings!
• Today’s In-Class Poll
– http://p17.mlcourse.org
2
MLE AND MAP
14
Likelihood Function One R.V.
• Suppose we have N samples D = {x(1), x(2), …, x(N)} from a
random variable X
• The likelihood function:
  – Case 1: X is discrete with pmf p(x|θ)
    L(θ) = p(x(1)|θ) p(x(2)|θ) … p(x(N)|θ)
  – Case 2: X is continuous with pdf f(x|θ)
    L(θ) = f(x(1)|θ) f(x(2)|θ) … f(x(N)|θ)
• The log-likelihood function:
  – Case 1: X is discrete with pmf p(x|θ)
    l(θ) = log p(x(1)|θ) + … + log p(x(N)|θ)
  – Case 2: X is continuous with pdf f(x|θ)
    l(θ) = log f(x(1)|θ) + … + log f(x(N)|θ)
• In both cases (discrete / continuous), the likelihood tells us how likely one sample is relative to another.
17
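To make these definitions concrete, here is a minimal Python sketch (an illustration only; the Bernoulli(θ) pmf and Exponential(θ) pdf are assumed examples) that evaluates the likelihood and log-likelihood of a handful of samples:

import numpy as np

def bernoulli_pmf(x, theta):          # p(x | theta) for x in {0, 1}
    return theta**x * (1 - theta)**(1 - x)

def exponential_pdf(x, theta):        # f(x | theta) for x >= 0
    return theta * np.exp(-theta * x)

x_disc = np.array([1, 0, 1, 1, 0])        # discrete samples
x_cont = np.array([0.3, 1.2, 0.7, 2.5])   # continuous samples
theta = 0.6

L_disc  = np.prod(bernoulli_pmf(x_disc, theta))         # L(theta): product of pmf values
ll_disc = np.sum(np.log(bernoulli_pmf(x_disc, theta)))  # l(theta): sum of log pmf values
L_cont  = np.prod(exponential_pdf(x_cont, theta))       # L(theta): product of pdf values
ll_cont = np.sum(np.log(exponential_pdf(x_cont, theta)))
print(L_disc, ll_disc, L_cont, ll_cont)

The product form and the sum-of-logs form have the same maximizer; the log form is simply easier to work with, both numerically and algebraically.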
Likelihood Function Two R.V.s
• Suppose we have N samples D = {(x(1), y(1)), …, (x(N), y(N))} from a
pair of random variables X, Y
• The conditional likelihood function:
– Case 1: Y is discrete with pmf p(y | x, θ)
L(θ) = p(y(1) | x(1), θ) …p(y(N) | x(N), θ)
– Case 2: Y is continuous with pdf f(y | x, θ)
L(θ) = f(y(1) | x(1), θ) …f(y(N) | x(N), θ)
• The joint likelihood function:
– Case 1: X and Y are discrete with pmf p(x,y|θ)
L(θ) = p(x(1), y(1)|θ) … p(x(N), y(N)|θ)
– Case 2: X and Y are continuous with pdf f(x,y|θ)
L(θ) = f(x(1), y(1)|θ) … f(x(N), y(N)|θ)
18
Likelihood Function Two R.V.s
• Suppose we have N samples D = {(x(1), y(1)), …, (x(N), y(N))} from a
pair of random variables X, Y
• The joint likelihood function:
  – Case 1: X and Y are discrete with pmf p(x,y|θ)
    L(θ) = p(x(1), y(1)|θ) … p(x(N), y(N)|θ)
  – Case 2: X and Y are continuous with pdf f(x,y|θ)
    L(θ) = f(x(1), y(1)|θ) … f(x(N), y(N)|θ)
  – Case 3 (mixed discrete/continuous!): Y is discrete with pmf p(y|β) and X is continuous with pdf f(x|y,α)
    L(α, β) = f(x(1)| y(1), α) p(y(1)|β) … f(x(N)| y(N), α) p(y(N)|β)
  – Case 4 (mixed discrete/continuous!): Y is continuous with pdf f(y|β) and X is discrete with pmf p(x|y,α)
    L(α, β) = p(x(1)| y(1), α) f(y(1)|β) … p(x(N)| y(N), α) f(y(N)|β)
19
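A minimal sketch of Case 3 above, assuming Y ~ Bernoulli(β) and X | Y = y ~ Gaussian(μ_y, 1), so that α = (μ0, μ1):

import numpy as np
from scipy.stats import norm

y = np.array([1, 0, 1, 1, 0])                 # discrete labels y(i)
x = np.array([2.1, -0.3, 1.8, 2.6, 0.1])      # continuous features x(i)
beta, mu = 0.5, np.array([0.0, 2.0])          # assumed parameters: beta and alpha = (mu_0, mu_1)

# log L(alpha, beta) = sum_i [ log f(x(i) | y(i), alpha) + log p(y(i) | beta) ]
log_L = np.sum(np.log(norm.pdf(x, loc=mu[y], scale=1.0))
               + np.log(np.where(y == 1, beta, 1 - beta)))
print(log_L)

This mixed case has the same shape as the Gaussian Naive Bayes model that appears at the end of the lecture outline.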
MLE
Suppose we have data D = {x(i)}_{i=1}^N.

Principle of Maximum Likelihood Estimation:
Choose the parameters that maximize the likelihood of the data.

  θMLE = argmax_θ ∏_{i=1}^N p(x(i) | θ)          Maximum Likelihood Estimate (MLE)

  θMAP = argmax_θ ∏_{i=1}^N p(x(i) | θ) p(θ)      Maximum a posteriori (MAP) estimate

[Figure: a 1-D likelihood curve L(θ) with its maximizer θMLE marked, and a 2-D contour plot of L(θ1, θ2) with its maximizer θMLE marked.]
20
MLE
What does maximizing likelihood accomplish?
• There is only a finite amount of probability
mass (i.e. sum-to-one constraint)
• MLE tries to allocate as much probability
mass as possible to the things we have
observed…
…at the expense of the things we have not
observed
21
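A small numeric illustration of this point, assuming a three-outcome categorical variable where one outcome never shows up in the data: the MLE (the empirical frequencies) gives that outcome zero probability.

import numpy as np

data = np.array([0, 1, 0, 0, 1])            # outcome 2 is never observed
counts = np.bincount(data, minlength=3)     # [3, 2, 0]
theta_mle = counts / counts.sum()           # all probability mass goes to observed outcomes
print(theta_mle)                            # [0.6, 0.4, 0.0]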
Recipe for Closed-form MLE
1. Assume data was generated i.i.d. from some model
(i.e. write the generative story)
x(i) ~ p(x|θ)
2. Write log-likelihood
l(θ) = log p(x(1)|θ) + … + log p(x(N)|θ)
3. Compute partial derivatives (i.e. gradient)
   ∂l(θ)/∂θ1 = …
   ∂l(θ)/∂θ2 = …
   …
   ∂l(θ)/∂θM = …
4. Set derivatives to zero and solve for θ
   ∂l(θ)/∂θm = 0 for all m ∈ {1, …, M}
   θMLE = solution to system of M equations in M variables
5. Compute the second derivative and check that l(θ) is concave down
at θMLE
22
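As a sanity check on the recipe (a sketch only, assuming x(i) ~ N(θ, 1), i.e. a Gaussian with unknown mean and unit variance): steps 3–4 give the closed form θMLE = mean of the data, and a generic numerical optimizer on the negative log-likelihood agrees.

import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1.3, 0.7, 2.1, 1.0, 1.9])     # assumed data

def neg_log_lik(theta):                     # -l(theta), dropping additive constants
    return 0.5 * np.sum((x - theta) ** 2)

theta_closed = x.mean()                          # step 4: set dl/dtheta = 0 and solve
theta_numeric = minimize_scalar(neg_log_lik).x   # numerical maximizer of l(theta)
print(theta_closed, theta_numeric)               # both equal the sample mean, 1.4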
MLE
Example: MLE of Exponential Distribution
Goal:
• pdf of Exponential(λ): f(x) = λ e^(−λx)
• Suppose X_i ~ Exponential(λ) for 1 ≤ i ≤ N.
• Find the MLE λMLE for data D = {x(i)}_{i=1}^N.
Steps:
• First write down the log-likelihood of the sample.
• Compute the first derivative, set it to zero, and solve for λ.
• Compute the second derivative and check that the log-likelihood is concave down at λMLE.
23
MLE
Example: MLE of Exponential Distribution
(Recall: pdf of Exponential(λ): f(x) = λ e^(−λx); X_i ~ Exponential(λ) for 1 ≤ i ≤ N; find λMLE for D = {x(i)}_{i=1}^N.)
• First write down the log-likelihood of the sample.

  l(λ) = Σ_{i=1}^N log f(x(i))                 (1)
       = Σ_{i=1}^N log( λ exp(−λ x(i)) )       (2)
       = Σ_{i=1}^N ( log(λ) − λ x(i) )         (3)
       = N log(λ) − λ Σ_{i=1}^N x(i)           (4)
24
MLE
Example: MLE of Exponential Distribution
(Recall: pdf of Exponential(λ): f(x) = λ e^(−λx); X_i ~ Exponential(λ) for 1 ≤ i ≤ N; find λMLE for D = {x(i)}_{i=1}^N.)
• Compute the first derivative, set it to zero, and solve for λ.

  dl(λ)/dλ = d/dλ [ N log(λ) − λ Σ_{i=1}^N x(i) ]   (1)
           = N/λ − Σ_{i=1}^N x(i) = 0               (2)
  ⇒ λMLE = N / Σ_{i=1}^N x(i)                       (3)
25
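A quick numerical check of this closed form, with made-up data: λMLE = N / Σ_i x(i) matches a numerical maximizer of l(λ) = N log(λ) − λ Σ_i x(i).

import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.8, 2.3, 1.1, 0.4, 3.0])               # assumed samples
lam_closed = len(x) / x.sum()                          # closed-form MLE

neg_ll = lambda lam: -(len(x) * np.log(lam) - lam * x.sum())
lam_numeric = minimize_scalar(neg_ll, bounds=(1e-6, 100.0), method='bounded').x
print(lam_closed, lam_numeric)                         # both ≈ 0.66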
MLE
In-Class Exercise
Show that the MLE of parameter ɸ for N samples drawn from Bernoulli(ɸ) is ɸMLE = N1 / N, where N1 = # of (x(i) = 1).
Steps to answer:
1. Write the log-likelihood of the sample
2. Compute the derivative w.r.t. ɸ
3. Set the derivative to zero and solve for ɸ
26
MLE
Question:
Assume we have N samples x(1), x(2), …, x(N) drawn from a Bernoulli(ɸ).
What is the log-likelihood of the data l(ɸ)?
Assume N1 = # of (x(i) = 1) and N0 = # of (x(i) = 0).

Answer:
A. l(ɸ) = N1 log(ɸ) + N0 (1 − log(ɸ))
B. l(ɸ) = N1 log(ɸ) + N0 log(1−ɸ)
C. l(ɸ) = log(ɸ)N1 + (1 − log(ɸ))N0
D. l(ɸ) = log(ɸ)N1 + log(1−ɸ)N0
E. l(ɸ) = N0 log(ɸ) + N1 (1 − log(ɸ))
F. l(ɸ) = N0 log(ɸ) + N1 log(1−ɸ)
G. l(ɸ) = log(ɸ)N0 + (1 − log(ɸ))N1
H. l(ɸ) = log(ɸ)N0 + log(1−ɸ)N1
I. l(ɸ) = the most likely answer
27
MLE
Question:
Assume we have N samples x(1), x(2), …, x(N) drawn from a Bernoulli(ɸ).
What is the derivative of the log-likelihood ∂l(ɸ)/∂ɸ?
Assume N1 = # of (x(i) = 1) and N0 = # of (x(i) = 0).

Answer:
A. ∂l(ɸ)/∂ɸ = ɸN1 + (1 − ɸ)N0
B. ∂l(ɸ)/∂ɸ = ɸ / N1 + (1 − ɸ) / N0
C. ∂l(ɸ)/∂ɸ = N1 / ɸ + N0 / (1 − ɸ)
D. ∂l(ɸ)/∂ɸ = log(ɸ) / N1 + log(1 − ɸ) / N0
E. ∂l(ɸ)/∂ɸ = N1 / log(ɸ) + N0 / log(1 − ɸ)
28
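Tying the two polls together (a sketch with made-up data): the log-likelihood is l(ɸ) = N1 log(ɸ) + N0 log(1 − ɸ), its derivative is N1/ɸ − N0/(1 − ɸ), and that derivative is zero at ɸ = N1/N.

import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])
N1, N0 = x.sum(), len(x) - x.sum()

log_lik  = lambda phi: N1 * np.log(phi) + N0 * np.log(1 - phi)
dlog_lik = lambda phi: N1 / phi - N0 / (1 - phi)

phi_mle = N1 / len(x)
print(phi_mle, dlog_lik(phi_mle))        # the derivative is (numerically) zero at the MLE
print(log_lik(phi_mle), log_lik(0.5))    # and the MLE has higher log-likelihood than phi = 0.5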
Learning from Data (Frequentist)
Whiteboard
– Optimization for MLE
– Examples: 1D and 2D optimization
– Example: MLE of Bernoulli
– Example: MLE of Categorical
– Aside: Method of Lagrange Multipliers
29
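As a preview of where the Lagrange-multiplier aside lands for the categorical case: the constrained MLE is the vector of empirical frequencies θ_k = N_k / N. A sketch that checks this against a constrained numerical optimizer (the counts are made up):

import numpy as np
from scipy.optimize import minimize

counts = np.array([3, 5, 2])                                   # N_k for K = 3 categories
neg_ll = lambda theta: -np.sum(counts * np.log(theta))         # -l(theta)
sum_to_one = {'type': 'eq', 'fun': lambda theta: theta.sum() - 1}

res = minimize(neg_ll, x0=np.ones(3) / 3,
               bounds=[(1e-9, 1.0)] * 3, constraints=[sum_to_one])
print(counts / counts.sum())   # closed form: [0.3, 0.5, 0.2]
print(res.x)                   # numerical solution matches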
MLE vs. MAP
Suppose we have data D = {x(i)}_{i=1}^N.

Principle of Maximum Likelihood Estimation:
Choose the parameters that maximize the likelihood of the data.

  θMLE = argmax_θ ∏_{i=1}^N p(x(i) | θ)              Maximum Likelihood Estimate (MLE)

Principle of Maximum a posteriori (MAP) Estimation:
Choose the parameters that maximize the posterior of the parameters given the data.

  θMAP = argmax_θ ∏_{i=1}^N p(x(i) | θ) p(θ)          Maximum a posteriori (MAP) estimate

Here p(θ) is the prior. Important! Usually the parameters are continuous, so the prior is a probability density function.
30–32
Learning from Data (Bayesian)
Whiteboard
– maximum a posteriori (MAP) estimation
– Optimization for MAP
– Example: MAP of Bernoulli—Beta
33
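Following up on the "MAP of Bernoulli—Beta" example: with a Beta(α, β) prior on ɸ, the MAP estimate has the closed form ɸMAP = (N1 + α − 1) / (N + α + β − 2), versus the MLE ɸMLE = N1 / N. A minimal sketch with assumed hyperparameters:

import numpy as np

x = np.array([1, 1, 1, 1, 0])       # N = 5 samples, N1 = 4
N, N1 = len(x), x.sum()
alpha, beta = 2.0, 2.0              # assumed Beta prior hyperparameters

phi_mle = N1 / N                                       # 0.8
phi_map = (N1 + alpha - 1) / (N + alpha + beta - 2)    # 5/7 ≈ 0.71, pulled toward the prior mean 0.5
print(phi_mle, phi_map)

The prior acts like α − 1 pseudo-counts of ones and β − 1 pseudo-counts of zeros, which is why MAP shrinks the estimate toward the prior.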
Takeaways
• One view of what ML is trying to accomplish is
function approximation
• The principle of maximum likelihood
estimation provides an alternate view of
learning
• Synthetic data can help debug ML algorithms
• Probability distributions can be used to model
real data that occurs in the world
(don’t worry we’ll make our distributions more
interesting soon!)
34
Learning Objectives
MLE / MAP
You should be able to…
1. Recall probability basics, including but not limited to: discrete
and continuous random variables, probability mass functions,
probability density functions, events vs. random variables,
expectation and variance, joint probability distributions,
marginal probabilities, conditional probabilities, independence,
conditional independence
2. Describe common probability distributions such as the Beta,
Dirichlet, Multinomial, Categorical, Gaussian, Exponential, etc.
3. State the principle of maximum likelihood estimation and
explain what it tries to accomplish
4. State the principle of maximum a posteriori estimation and
explain why we use it
5. Derive the MLE or MAP parameters of a simple model in closed
form
35
NAÏVE BAYES
36
Naïve Bayes Outline
• Real-world Dataset
– Economist vs. Onion articles
– Document → bag-of-words → binary feature vector
• Naive Bayes: Model
– Generating synthetic "labeled documents"
– Definition of model
– Naive Bayes assumption
– Counting # of parameters with / without
NB assumption
• Naïve Bayes: Learning from Data
– Data likelihood
– MLE for Naive Bayes
– MAP for Naive Bayes
• Visualizing Gaussian Naive Bayes
37
Naïve Bayes
• Why are we talking about Naïve Bayes?
– It’s just another decision function that fits into
our “big picture” recipe from last time
– But it’s our first example of a Bayesian Network
and provides a clearer picture of probabilistic
learning
– Just like the other Bayes Nets we’ll see, it admits
a closed form solution for MLE and MAP
– So learning is extremely efficient (just counting)
38
Fake News Detector
Today’s Goal: To define a generative model of emails
of two different classes (e.g. real vs. fake news)
[Figure: example articles from CNN and The Onion]
40
Fake News Detector
[Figure: the CNN and The Onion articles represented as feature vectors]
We can pretend the natural process generating these vectors is stochastic…
41
Naive Bayes: Model
Whiteboard
– Document → bag-of-words → binary feature vector
– Generating synthetic "labeled documents"
– Definition of model
– Naive Bayes assumption
– Counting # of parameters with / without NB
assumption
42
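A minimal sketch of the "Document → bag-of-words → binary feature vector" step above, assuming a small hand-picked vocabulary:

import numpy as np

vocab = ["economy", "senate", "aliens", "report", "onion"]     # assumed vocabulary
doc = "Senate report says the economy grew; Senate leaders cheer"

words = set(doc.lower().replace(";", " ").split())
x = np.array([1 if w in words else 0 for w in vocab])          # x_m = 1 iff word m appears
print(x)   # [1 1 0 1 0]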
Model 1: Bernoulli Naïve Bayes
Flip a weighted coin. If HEADS, flip each red coin; if TAILS, flip each blue coin.
Each red coin corresponds to an xm.

  y   x1  x2  x3  …  xM
  0   1   0   1   …  1
  1   0   1   0   …  1
  1   1   1   1   …  1
  0   0   0   1   …  1
  0   1   0   1   …  0
  1   1   0   1   …  0

We can generate data in this fashion. Though in practice we never would, since our data is given. Instead, this provides an explanation of how the data was generated (albeit a terrible one).
43
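A sketch of the generative story on this slide, with made-up coin weights and assuming HEADS corresponds to y = 1 (red coins) and TAILS to y = 0 (blue coins):

import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 6                                  # number of features, number of examples
phi = 0.5                                    # P(y = 1): the weighted coin (assumed value)
theta_red  = rng.uniform(size=M)             # P(x_m = 1 | y = 1), one red coin per feature
theta_blue = rng.uniform(size=M)             # P(x_m = 1 | y = 0), one blue coin per feature

y = rng.binomial(1, phi, size=N)             # flip the weighted coin for each example
theta = np.where(y[:, None] == 1, theta_red, theta_blue)   # pick red or blue coins per row
X = rng.binomial(1, theta)                   # flip the chosen coin for each feature
print(np.column_stack([y, X]))               # rows look like the table on this slide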