The naive assumption of the algorithm is that features are independent of each other, that is, the likelihood on the right-hand side can be factorised into the product of the likelihoods of the single features:

$$P(x_1, \ldots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c).$$

This way, we can simplify and write

$$P(c \mid x_1, \ldots, x_n) = \frac{P(c) \prod_{i=1}^{n} P(x_i \mid c)}{P(x_1, \ldots, x_n)}.$$

We apply the MAP estimation (see page) to find the class $\hat{c}$ that maximises the posterior. The denominator only gives a constant of normalisation, so the maximising value is found via

$$\hat{c} = \operatorname*{arg\,max}_{c} \, P(c) \prod_{i=1}^{n} P(x_i \mid c).$$

What this means is that the classifier assigns the class label $\hat{c}$ as the one which maximises the posterior probability.
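A minimal sketch of this decision rule (illustrative code, not from the notes; the priors and the per-feature likelihood tables are assumed to have been estimated already) works in log space to avoid numerical underflow when many small factors are multiplied; unseen feature values get a tiny floor here, while a proper fix is the smoothing discussed below:

```python
import math

def predict(x, priors, likelihoods):
    """Return the class c maximising P(c) * prod_i P(x_i | c).

    x:           tuple of feature values (x_1, ..., x_n)
    priors:      dict mapping class -> P(c)
    likelihoods: dict mapping class -> list of dicts, one per feature,
                 each mapping a feature value -> P(x_i | c)
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        # sum of logs instead of a product of probabilities, for numerical stability
        score = math.log(prior)
        for i, value in enumerate(x):
            score += math.log(likelihoods[c][i].get(value, 1e-12))  # tiny floor for unseen values
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# made-up numbers, purely to show the call pattern
priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {"spam": [{"yes": 0.8, "no": 0.2}], "ham": [{"yes": 0.1, "no": 0.9}]}
print(predict(("yes",), priors, likelihoods))   # -> "spam"
```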
If in the training data a value of feature $x_i$ never occurs together with class $c$, the estimated likelihood $P(x_i \mid c)$ equals zero: this is a problem, as a single zero factor annihilates the whole product.

A correction to remedy this problem (regularised Naive Bayes) is obtained by adding a term to the frequency used in the calculation of the likelihood, so as to have a small but non-zero probability. While in general we would calculate it as

$$P(x_i \mid c) = \frac{n_{x_i, c}}{N_c},$$

where $n_{x_i, c}$ is the number of times feature value $x_i$ appears in the samples for class $c$ in the training set and $N_c$ is the total count of occurrences of class $c$, we smooth it as

$$P(x_i \mid c) = \frac{n_{x_i, c} + \alpha}{N_c + \alpha k},$$

where $\alpha > 0$ is a chosen factor and $k$ is the number of possible values for feature $x_i$. This procedure is called Lidstone smoothing; with $\alpha = 1$ it reduces to Laplace (add-one) smoothing.
As a worked example, consider classifying whether a person is male (M) or female (F) based on height ($h$, in feet), weight ($w$, in pounds) and foot size ($f$, in inches). This is the training data we assume to have collected:
We use a Gaussian assumption, so we take the likelihood of each feature $x_i$ given class $c$ to be

$$P(x_i \mid c) = \frac{1}{\sqrt{2 \pi \sigma_{i,c}^2}} \exp\!\left( -\frac{(x_i - \mu_{i,c})^2}{2 \sigma_{i,c}^2} \right),$$

and we estimate the parameters $\mu_{i,c}$, $\sigma_{i,c}^2$ of said Gaussians via MLE (see page), obtaining:

The two classes are equiprobable because we have the same number of training points for each, so $P(M) = P(F) = 0.5$, and these are the priors for each class. Note that we could also take the priors from the population, again assuming that each gender is equiprobable.
Now, given a new sample point whose height is 6 feet, weight 130 lbs and foot size 8 inches, we want to classify its gender, so we determine which class maximises the posterior:

$$P(c \mid h, w, f) = \frac{P(c)\, P(h, w, f \mid c)}{P(h, w, f)}, \qquad c \in \{M, F\},$$

where, under the Naive Bayes assumption,

$$P(h, w, f \mid c) = P(h \mid c)\, P(w \mid c)\, P(f \mid c),$$

and the denominator $P(h, w, f)$ is just a normalising constant, so it can be ignored. Now, plugging $h = 6$ into the Gaussian estimated for the male class gives $P(h = 6 \mid M)$; in the same way we compute $P(w = 130 \mid M)$ and $P(f = 8 \mid M)$, so that in the end we obtain the unnormalised posterior $P(M)\, P(h \mid M)\, P(w \mid M)\, P(f \mid M)$. Similarly we get the unnormalised posterior for the female class, which is larger, so we predict that the sample is a female.
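A small sketch of this computation, assuming the training data is available as rows of (height, weight, foot size) for each class; the function names and the commented usage are illustrative, not from the notes:

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a Gaussian with the given mean and variance, evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_class(rows):
    """MLE of per-feature mean and (biased, 1/n) variance from the rows of one class."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    variances = [sum((x - m) ** 2 for x in col) / n
                 for col, m in zip(zip(*rows), means)]
    return means, variances

def unnormalised_posterior(x, prior, means, variances):
    """prior * prod_i P(x_i | c): the quantity compared between the two classes."""
    p = prior
    for xi, m, v in zip(x, means, variances):
        p *= gaussian_pdf(xi, m, v)
    return p

# Usage with the training table above (rows of (height, weight, foot_size)):
#   m_means, m_vars = fit_class(male_rows)
#   f_means, f_vars = fit_class(female_rows)
#   x_new = (6.0, 130.0, 8.0)
#   p_m = unnormalised_posterior(x_new, 0.5, m_means, m_vars)
#   p_f = unnormalised_posterior(x_new, 0.5, f_means, f_vars)
#   predicted = "F" if p_f > p_m else "M"
```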
Another example: classifying text

Spam filtering is a common application of the Naive Bayes classifier, and it is a case of text classification.

Some words are more frequent than others in spam e-mails (for example "Viagra" is definitely a recurring word in spam e-mails). The user manually and continuously trains the filter of their e-mail provider by indicating whether a mail is spam or not. For each word in a training mail, the filter then adjusts the probability that that word appears in a spam or in a legitimate e-mail.
Let $S$ be the event that an e-mail is spam and $w$ a word; we then compute the probability that an e-mail is spam given that it contains $w$ ($\neg S$ is the event that the mail is not spam, or "ham"):

$$P(S \mid w) = \frac{P(w \mid S)\, P(S)}{P(w \mid S)\, P(S) + P(w \mid \neg S)\, P(\neg S)}.$$

Here $P(S)$, the prior, is the probability that a message is spam in general, and $P(w \mid S)$ is the probability that $w$ appears in spam messages. An unbiased filter will assume $P(S) = P(\neg S) = 0.5$; a biased filter will assume a higher prior probability that a mail is spam. $P(w \mid S)$ is approximated by the frequency of mails containing the word $w$ among those identified as spam in the learning phase, and similarly for $P(w \mid \neg S)$.
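As a quick numerical illustration (made-up numbers, just to show how the formula behaves): if $P(w \mid S) = 0.8$, $P(w \mid \neg S) = 0.1$ and $P(S) = P(\neg S) = 0.5$, then

$$P(S \mid w) = \frac{0.8 \cdot 0.5}{0.8 \cdot 0.5 + 0.1 \cdot 0.5} = \frac{0.4}{0.45} \approx 0.89.$$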
Now, this is valid for a single word, but a functional spam classifier uses several words together with the Naive Bayes hypothesis, assuming that the presence of each word is an independent event. Note that this is a crude assumption, since in natural language word co-occurrence matters a great deal; nevertheless, it is a useful idealisation that makes the calculation tractable in the Naive Bayes fashion.

So, with more words considered and assuming the two priors are equal (see references),

$$P(S \mid w_1, \ldots, w_n) = \frac{\prod_{i=1}^{n} P(S \mid w_i)}{\prod_{i=1}^{n} P(S \mid w_i) + \prod_{i=1}^{n} \left(1 - P(S \mid w_i)\right)}.$$
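A toy sketch of such a filter (the helper names, the smoothing choice and the example mails are all made up for illustration): $P(w \mid S)$ and $P(w \mid \neg S)$ are estimated as smoothed document frequencies, and the per-word likelihoods are combined in log space, which with equal priors is equivalent to the product formula above.

```python
import math

def train(spam_mails, ham_mails, alpha=1.0):
    """Estimate P(w | S) and P(w | not S) as smoothed document frequencies.

    spam_mails, ham_mails: lists of mails, each represented as a set of words.
    alpha: Lidstone factor (k = 2 for presence/absence) so no probability is zero.
    """
    vocab = set().union(*spam_mails, *ham_mails)
    def word_probs(mails):
        n = len(mails)
        return {w: (sum(w in m for m in mails) + alpha) / (n + 2 * alpha)
                for w in vocab}
    return word_probs(spam_mails), word_probs(ham_mails)

def spam_posterior(words, p_w_spam, p_w_ham, prior_spam=0.5):
    """P(S | words) under the naive independence assumption, computed in log space."""
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for w in words:
        if w in p_w_spam:                       # skip words never seen in training
            log_spam += math.log(p_w_spam[w])
            log_ham += math.log(p_w_ham[w])
    m = max(log_spam, log_ham)                  # normalise over the two hypotheses
    return math.exp(log_spam - m) / (math.exp(log_spam - m) + math.exp(log_ham - m))

# made-up toy mails, purely to show the call pattern
spam = [{"viagra", "free", "offer"}, {"free", "money"}, {"viagra", "money"}]
ham  = [{"meeting", "project", "free"}, {"project", "report"}, {"lunch", "meeting"}]
p_s, p_h = train(spam, ham)
print(spam_posterior({"viagra", "free"}, p_s, p_h))      # > 0.5, classified as spam
print(spam_posterior({"project", "meeting"}, p_s, p_h))  # < 0.5, classified as ham
```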