Machine Learning: fundamental algorithms
The maximum likelihood, maximum a posteriori and expectation-maximisation estimation methods

The Likelihood

Imagine you have a statistical model, that is, a mathematical description of your data which depends on some parameters $\theta$. The likelihood function, usually indicated as $\mathcal{L}$, is a function of these parameters and represents the probability of observing the evidence $E$ (the observed data) given said parameters:

$$\mathcal{L} = P(E \ | \ \theta)$$

Because it is a function of the parameters given the outcome, you write

$$\mathcal{L}(\theta \ | \ E) = P(E \ | \ \theta)$$
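As a concrete toy illustration (my own example, not from the text): for a sequence of independent coin flips, the likelihood of a candidate bias $\theta$ is simply the probability of that exact sequence under $\theta$, evaluated as a function of $\theta$ for fixed data.

```python
# Toy illustration: the likelihood of fixed coin-flip data as a function of
# the bias parameter theta (assumed example data, not from the original text).
flips = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]  # 7 heads out of 10

def likelihood(theta, data):
    """L(theta | E) = P(E | theta) for independent Bernoulli flips."""
    p = 1.0
    for x in data:
        p *= theta if x == 1 else (1 - theta)
    return p

for theta in (0.3, 0.5, 0.7, 0.9):
    print(f"theta={theta}: L={likelihood(theta, flips):.6f}")
```

Note that the data stay fixed while $\theta$ varies: the likelihood is a function over parameter space, not over outcomes.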
The difference between probability and likelihood is quite subtle, in that in common language the two words are casually swapped, but they represent different things. The probability measures the outcomes observed as a function of the parameters $\theta$ of the underlying model. In reality, though, the parameters are unknown, and we go through the reverse process: estimating the parameters given the evidence we observe. For this, we use the likelihood, defined by the equality above, and we maximise it with respect to the parameters. This is exactly what the ML estimation does, as described below.
Bear in mind that the likelihood is a function of $\theta$, not of the data.

The MLE method

The Maximum Likelihood Estimation (MLE) is a procedure to find the parameters of a statistical model via the maximisation of the likelihood, so as to maximise the agreement between the model and the observed data.
The maximisation of the likelihood is usually performed via the maximisation of its logarithm, which is much more convenient to work with; because the logarithm is a monotonic function, both attain their maximum at the same point, so the procedure is legitimate.
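A quick numerical check of this point (a sketch with toy data of my own): maximising $\mathcal L$ and maximising $\log \mathcal L$ over the same grid of candidate parameters picks out the same value.

```python
# Sketch: the maximiser of L and of log L coincide, because log is monotonic.
# The Bernoulli data below are an assumed toy example.
import math

data = [1, 0, 1, 1, 0, 1, 1, 1]  # 6 successes out of 8

def lik(p):
    return p ** sum(data) * (1 - p) ** (len(data) - sum(data))

grid = [i / 1000 for i in range(1, 1000)]          # candidate values of p
best_L = max(grid, key=lik)                        # argmax of L
best_logL = max(grid, key=lambda p: math.log(lik(p)))  # argmax of log L
print(best_L, best_logL)  # the same grid point
```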

Example: a Bernoulli distribution

Refer to the page about distributions.
The likelihood function for a Bernoulli distribution ($x_i \in \{0, 1\}$) is, for parameter $p$,

$$\begin{align} \mathcal{L}(x_1, x_2, \ldots, x_n \ | \ p) &= P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \ | \ p) \\ &= p^{x_1}(1-p)^{1-x_1} \cdot \ldots \cdot p^{x_n}(1-p)^{1-x_n} \\ &= p^{\sum_i x_i}(1-p)^{\sum_i(1-x_i)} \\ &= p^{\sum_i x_i}(1-p)^{n -\sum_i x_i} \end{align}$$
so that if we take the logarithm, we get

$$\log \mathcal{L} = \sum_i x_i \log p + \Big(n - \sum_i x_i\Big) \log (1-p) \ .$$
To maximise it, we compute and nullify the first derivative:

$$\frac{d \log \mathcal{L}}{d p} = \frac{\sum_i x_i}{p} - \frac{n - \sum_i x_i}{1-p} = 0$$

which leads to

$$\sum_i x_i - p \sum_i x_i = np - p \sum_i x_i$$

and finally to

$$p = \frac{\sum_i x_i}{n}$$
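The closed-form result can be verified numerically (a sketch with assumed toy data): brute-force maximisation of the Bernoulli log-likelihood over a fine grid recovers the sample mean.

```python
# Check the closed-form Bernoulli MLE p = (sum_i x_i)/n against a brute-force
# maximisation of the log-likelihood (toy data, assumed for illustration).
import math

x = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
n, s = len(x), sum(x)

def log_lik(p):
    return s * math.log(p) + (n - s) * math.log(1 - p)

p_closed = s / n                                            # closed form: 0.6
p_grid = max((i / 10000 for i in range(1, 10000)), key=log_lik)
print(p_closed, p_grid)
```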

Example: estimating the best mean of some data

This example is taken from the PennState course in the references. Let us assume we know the weights of women are normally distributed with mean $\mu$ and standard deviation $\sigma$. A random sample of 10 women gives measurements (in pounds):

$$115, 122, 130, 124, 149, 160, 152, 138, 149, 180$$

We want to estimate $\mu$. We know

$$P(x_i ; \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{(x_i - \mu)^2}{2 \sigma^2}}$$
The likelihood is (note that the $x_i$ are independent)

$$\begin{align} \mathcal{L}(x_i \ | \ \mu, \sigma) &= P(X_1=x_1, \ldots, X_n=x_n) \\ &= \Pi_i p(x_i; \mu, \sigma) \\ &= \sigma^{-n} (2 \pi)^{-n/2} e^{- \frac{1}{2 \sigma^2} \sum_i (x_i - \mu)^2} \end{align}$$
Now, again it is easier to work with the logarithm:

$$\log \mathcal{L} = -n \log \sigma - \frac{n}{2} \log 2 \pi - \frac{1}{2 \sigma^2} \sum_i (x_i - \mu)^2$$
so that

$$\frac{d \log \mathcal{L}}{d \mu} = \frac{1}{\sigma^2} \sum_i (x_i - \mu) = 0$$

$$\sum_i x_i - n \mu = 0$$

$$\mu = \frac{\sum_i x_i}{n}$$
and so the maximum likelihood estimate of $\mu$ for this sample is $141.9$. We could do the same to estimate $\sigma$, obtaining (it can be proven through the second derivative that this is a maximum)

$$\sigma^2 = \frac{\sum_i (x_i - \mu)^2}{n}$$

The MAP method

The Maximum a Posteriori (MAP) estimation method uses the mode of the posterior distribution to estimate the unknown parameters.
From Bayes' theorem, the posterior is expressed as

$$P(\theta \ | \ x) = \frac{P(x \ | \ \theta) P(\theta)}{\int d \theta' P(x \ | \ \theta') P(\theta')} \ ,$$

with $\theta$ being the parameters of the statistical model and $x$ the observed data. The MAP method estimates $\theta$ as the value which maximises the posterior; note that the denominator is just a normalisation factor:

$$\hat{\theta}_{MAP}(x) = \arg \max_\theta P(\theta \ | \ x) = \arg \max_\theta P(x \ | \ \theta) P(\theta) \ .$$
This means exactly taking the mode of the posterior distribution.
In the case of a uniform prior, the MAP estimation is equal to the ML estimation: the prior becomes just a constant factor, so we are left with maximising the likelihood. For the computation, conjugate priors are particularly handy.
As in the case of the MLE, what we really do is maximise the logarithm of the posterior rather than the posterior itself, so we compute

$$\hat \theta_{MAP}(x) = \arg \max_{\theta} \log P(\theta \ | \ x) = \arg \max_{\theta} [\log P(x \ | \ \theta) + \log P(\theta)] \ .$$
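As a sketch of how handy conjugate priors are (my own example, with assumed Beta hyperparameters): for a Bernoulli likelihood with a Beta$(a, b)$ prior, the posterior is Beta$(a + \sum_i x_i,\ b + n - \sum_i x_i)$, whose mode gives the MAP estimate in closed form.

```python
# MAP estimate for a Bernoulli parameter with a conjugate Beta prior.
# The data and the Beta(a, b) hyperparameters are assumptions of this sketch.
x = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]   # 8 successes out of 10
n, s = len(x), sum(x)
a, b = 2, 2                          # Beta prior hyperparameters

p_mle = s / n                              # maximum likelihood estimate
p_map = (s + a - 1) / (n + a + b - 2)      # mode of the Beta posterior
print(p_mle, p_map)  # the prior pulls the estimate towards 1/2
```

With a uniform prior ($a = b = 1$) the formula reduces to $s/n$, recovering the ML estimate, consistent with the remark above.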

MAP and ML

In the last equation, if we only had the first term to maximise, we would be doing a ML estimation. The second term accounts for the presence of a prior: this is why the MAP method is considered a regularised ML, as prior knowledge is factored into the computation.
While the ML method can be seen as responding to a frequentist approach, the MAP method responds to a Bayesian approach.

The Expectation-Maximisation algorithm

The EM algorithm can be used to find the solution of MLE or MAP when some data is missing, meaning there are some latent variables which are not observed.
Let's say that for the random variable $X$ we have the observations

$$x_1, \ldots, x_n \ ,$$

which depend on parameters $\theta$, and that the goal is to find the parameter $\theta$ that maximises the likelihood, which is of the form

$$\mathcal L = \sum_z P_\theta(x, z) \ ,$$

meaning it is a sum over the latent variables $z$; this makes the problem difficult to solve analytically.
The EM algorithm updates the parameters in steps, which means it risks obtaining a local rather than a global maximum.

The E step

In the E phase (time $t$), the expected value of the log-likelihood is computed with respect to the conditional distribution of the latent variables $z$ given $x$, under the current estimate of the parameters $\theta^t$:

$$\bar{\mathcal L}(\theta \ | \ \theta^t) = \mathbb E_{z \ | \ x, \theta^t} [\log \mathcal L (\theta; x, z)]$$

This means that the log-likelihood is evaluated using the current state of the parameters.

The M step

In the M phase, we find the parameters which maximise the expected log-likelihood found in the E step:

$$\theta^{t+1} = \arg \max_{\theta} \bar{\mathcal L}(\theta \ | \ \theta^t)$$
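The two steps can be sketched on a small example (my own illustrative setup, not from the text): a one-dimensional mixture of two Gaussians with unit variances, where the latent variable $z$ is the unknown component each point came from. The E step computes the responsibilities $P(z = k \ | \ x_i, \theta^t)$; the M step re-estimates the means and mixing weights from them.

```python
# Minimal EM sketch for a two-component 1-D Gaussian mixture with unit
# variances (assumed illustrative setup; synthetic data with known means).
import random
from statistics import NormalDist

random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)] + \
       [random.gauss(5, 1) for _ in range(200)]

mu = [min(data), max(data)]   # crude initial means
w = [0.5, 0.5]                # mixing weights

for _ in range(50):
    # E step: responsibilities r[i][k] = P(z = k | x_i, current parameters)
    r = []
    for x in data:
        p = [w[k] * NormalDist(mu[k], 1).pdf(x) for k in range(2)]
        tot = sum(p)
        r.append([pk / tot for pk in p])
    # M step: re-estimate means and weights from the expected assignments
    for k in range(2):
        nk = sum(ri[k] for ri in r)
        mu[k] = sum(ri[k] * x for ri, x in zip(r, data)) / nk
        w[k] = nk / len(data)

print(sorted(mu), w)
```

With this synthetic data the estimated means should land near the true values 0 and 5; starting from a different initialisation, the algorithm may instead settle on a local maximum, as noted above.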


References

1. Some examples about MLE in this course from PennState
2. Cross Validated on the difference between Probability and Likelihood
3. An assignment on the method from Carnegie Mellon