Machine Learning: fundamental algorithms
Bayes' theorem and conjugate distributions

What it is

Bayes' theorem, formulated by the minister Thomas Bayes in the second half of the eighteenth century, relates the probability of an event to prior knowledge about it.
We'll call $E$ and $H$ respectively the evidence (what we observe, the measurement) and the hypothesis, whose probability will be affected by gathering the evidence. Bayes' theorem states that the probability of $H$ after observing $E$ is obtained by multiplying the prior probability of $H$ by the ratio of the probability of $E$ given $H$ to the probability of $E$:
$$P(H|E) = \frac{P(E|H)P(H)}{P(E)} \ .$$
Specifically,
  • $P(H|E)$ is the posterior: the probability of the hypothesis given the evidence, that is, after the data is collected;
  • $P(E|H)$ is the probability of observing the evidence given the hypothesis; seen as a function of $H$ with $E$ fixed, it is the likelihood;
  • $P(H)$ is the prior, that is, the probability of the hypothesis before gathering the evidence; it is one's estimate that $H$ is true before observing the data.
The ratio $\frac{P(E|H)}{P(E)}$ quantifies the impact of $E$ on the probability of $H$; the denominator $P(E)$ on its own is called the marginal likelihood or model evidence.
The denominator, which is the probability of observing what we observe regardless of the hypothesis, can be written as a sum (or an integral, in the continuous case) over all the possible hypotheses $H'$ of the terms $P(E|H')P(H')$, so that a rewriting of the theorem reads
$$P(H|E) = \frac{P(E|H)P(H)}{\sum_{H'} P(E|H')P(H')} \ .$$
Note that the hypothesis is encoded as a statistical model, that is, through its parameters.
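As a minimal illustration of the formula above, here is a sketch in Python (the function and the example numbers are made up for this page): the posterior over a discrete set of hypotheses is obtained by normalising the products $P(E|H')P(H')$.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem over a discrete set of hypotheses.

    priors:      dict mapping hypothesis -> P(H)
    likelihoods: dict mapping hypothesis -> P(E|H) for the observed evidence E
    Returns a dict mapping hypothesis -> P(H|E).
    """
    # Numerators P(E|H) P(H), one per hypothesis
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    # Denominator P(E) = sum over H' of P(E|H') P(H')
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}


# Two hypotheses with a 50/50 prior and different likelihoods for the evidence
print(posterior({"H1": 0.5, "H2": 0.5}, {"H1": 0.8, "H2": 0.3}))
# {'H1': 0.727..., 'H2': 0.272...}
```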
For the definition of conditional probability, refer to the dedicated page.

Proof

Figure: the intersection of two sets.
Bayes' theorem can be derived from the definition of conditional probability (see the page linked above):
$$P(A | B) = \frac{P(A \cap B)}{P(B)} \ \ \ \text{if} \ \ \ P(B) \neq 0$$
and
$$P(B | A) = \frac{P(A \cap B)}{P(A)} \ \ \ \text{if} \ \ \ P(A) \neq 0 \ .$$
Now, the intersection is commutative, so that $P(A \cap B) = P(B \cap A)$, hence
$$P(A \cap B) = P(A | B)P(B) = P(B | A)P(A) \ ,$$
which leads to
$$P(A | B) = \frac{P(B | A) P(A)}{P(B)} \ .$$

Examples

Example: flipping a (fair) coin

Since our coin is fair, the prior probability of getting heads or tails is 50%. Flipping the coin several times and recording the observed outcomes will update this degree of belief, so that if the coin were actually unfair the posterior would differ from the prior (as in the sketch below).
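A minimal sketch of this updating, assuming for illustration a small made-up grid of possible heads probabilities and a made-up sequence of flips; each observed flip reweights the hypotheses through Bayes' theorem:

```python
# Hypotheses: possible probabilities of heads, with a prior peaked on "fair"
biases = [0.3, 0.5, 0.7]
belief = {b: (0.5 if b == 0.5 else 0.25) for b in biases}  # prior

flips = ["H", "H", "T", "H", "H"]  # made-up observed outcomes

for outcome in flips:
    # Likelihood of this single flip under each hypothesis
    likelihood = {b: (b if outcome == "H" else 1 - b) for b in biases}
    joint = {b: likelihood[b] * belief[b] for b in biases}
    evidence = sum(joint.values())                        # P(E)
    belief = {b: joint[b] / evidence for b in biases}     # posterior becomes next prior

print(belief)  # probability mass shifts towards the biases that explain the flips
```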

Example: Testing for a disease

Let's assume we have a testing method developed for a disease and we know that it is correct 99% of the time in determining whether you have the disease or not.
This means that if you have the disease it will output YES with probability 99%, and if you don't have the disease it will output NO with probability 99%. This is the likelihood. Let's say we also know that the disease occurs in the general population in one out of $10^4$ people. This is the prior.
The question now is: if you take the test and it comes up positive, what is the probability that you actually have the disease?
The denominator $P(E)$ can be written as the sum of the joint probabilities of observing the evidence together with each possible event (each possible hypothesis):
  • Event A: you have the disease;
  • Event B: you don't have the disease
$$P(E) = P(E|A)P(A) + P(E|B)P(B) = \sum_{\bar H} P(E|\bar H) P(\bar H)$$
and by plugging in the numbers, the desired posterior turns out to be about 1%, as worked out below.
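Explicitly, with $A$ the hypothesis of having the disease and $E$ a positive test result, the numbers above give
$$P(A|E) = \frac{P(E|A)P(A)}{P(E|A)P(A) + P(E|B)P(B)} = \frac{0.99 \cdot 10^{-4}}{0.99 \cdot 10^{-4} + 0.01 \cdot (1 - 10^{-4})} \approx 0.0098 \ ,$$
so despite the positive result the probability of actually having the disease is only about 1%, because the disease is so rare.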

Conjugate distributions

Bayes' theorem links the prior probability to the posterior probability through the likelihood. For certain choices of prior, the posterior and the prior belong to the same family of probability distributions; such prior and posterior are then called conjugate distributions, and the prior is said to be conjugate to the likelihood. A conjugate prior is a very handy mathematical convenience, because it gives the posterior in closed form.

Example: the gaussian

The gaussian distribution is conjugate to itself with respect to a gaussian likelihood: if the likelihood is gaussian (with known variance), choosing a gaussian prior over the mean ensures that the posterior over the mean is also gaussian.
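For concreteness, the standard closed form (assuming observations $x_1, \dots, x_n \sim \mathcal{N}(\mu, \sigma^2)$ with known variance $\sigma^2$ and prior $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$): the posterior is $\mu \mid x \sim \mathcal{N}(\mu_n, \sigma_n^2)$ with
$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \ , \qquad \mu_n = \sigma_n^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_i x_i}{\sigma^2} \right) \ .$$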

Example: Bernoulli trials

A binomial likelihood with a Beta prior leads to a Beta distribution for the posterior as well.
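A minimal sketch of this update in Python (the prior parameters and counts are made up for illustration), using the standard closed form in which observing $k$ successes out of $n$ trials turns a $\text{Beta}(\alpha, \beta)$ prior into a $\text{Beta}(\alpha + k, \beta + n - k)$ posterior:

```python
from scipy.stats import beta

# Prior over the success probability of a Bernoulli/binomial process
alpha_0, beta_0 = 2.0, 2.0           # Beta(2, 2): mild belief that the coin is fair

# Made-up data: k successes out of n trials
n, k = 20, 14

# Conjugate update: the posterior is again a Beta distribution
alpha_n, beta_n = alpha_0 + k, beta_0 + (n - k)

posterior = beta(alpha_n, beta_n)
print(posterior.mean())              # posterior mean of the success probability
print(posterior.interval(0.95))      # 95% equal-tailed credible interval
```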

References

  1. Some more examples on Wikipedia