For the Bayes' theorem, refer to page
The Naive Bayes is a probabilistic classifier based on (surprise surprise!) the Bayes' theorem and it uses a Maximum A Priori estimate to classify the labels.
In a nutshell, its properties are:
- It assumes independence among features used for the classification;
- It is usually used for text classification;
- Is fast;
- Requires small training data;
- The probability of the outcome classification is unreliable.
Given a target variable
(the class) and features
, by Bayes' theorem we can write
being the frequency of the class
The naive assumption of the algorithm is that features are independent of each other, that is, the likelihood at the second member can be factorised into the product of the likelihoods of single features:
This way, we can simplify and write
We apply the MAP estimation (see page) to find the
that maximises the posterior. The denominator only gives a constant of normalisation, so the maximising value is found via
What this means is that the classifier assigns the class label
as the one which maximises the posterior probability.
The different Naive Bayes classifiers differ in the assumptions for the likelihood distribution
In a Gaussian Naive Bayes, it is assumed to be a gaussian:
being the mean and the standard deviation of feature
, estimated using the Maximum Likelihood estimation.
In a Bernoulli Naive Bayes, used when features are binary, it is assumed that
In a Multinomial Naive Bayes, with feature vector
being the number of successes for variable
, the likelihood is given as
Note that the multinomial Naive Bayes classifier becomes a linear classifier when expressed in logarithmic scale:
(a constant) and
If in the training data there is no value for which feature
is determined by the class
, meaning there is no occurrences where feature and class are together, the likelihood would equal zero: this is a problem as there would be a zero in the multiplication.
A correction to remedy this problem (regularised Naive Bayes) is obtained via adding an addend in the calculation of the likelihood as a frequency so as to have a small but non-zero probability. While in general we would calculate it as
is the number of times feature
appears in the sample for class
in the training set and
is the total count of occurrences of class
, we smoothen as
is a chosen factor and
the number of possible values for feature
. This procedure is called Lidstone smoothing, with
it's the Laplace smoothing.
The problem is classifying if a person is a male (M) or a female (F) based on height (h, in feet), weight (w, in pounds) and foot size (f, in inches). This is the training data we assume to have collected:
We use a gaussian assumption, so we assume the likelihood for each feature to be
and we estimate the parameters of said gaussians via MLE (see page), obtaining:
The two classes are equiprobable because we got the same number of training points for each, so
, and these are the priors for each class. Note that we could also give the priors from the population, assuming that each gender is equiprobable.
Now, given a new sample point whose height is 6 feet, weight 130 lbs and foot size 8 inches, we want to classify its gender, so we determine which class maximises the posterior:
where, under the Naive Bayes assumption,
which is just a normalising constant so can be ignored. Now,
In the same way we compute
, so that in the end we obtain
. Similarly we get
, which is larger so we predict that the sample is a female.
This is a common application of a Naive Bayes classifier and it is a case of text classification.
Some words are more frequent than others in spam e-mails (for example "Viagra" is definitely a recurring word in spam e-mails). The user manually and continuously trains the filter of their e-mail provider by indicating whether a mail is spam or not. For all words in each training mail, the filter then adjusts the probability that it will appear in a spam or legitimate e-mail.
Let S be the event that an e-mail is spam and
a word, then we compute the probability that an e-mail is spam given that it contains
is the event that mail is not spam, or "ham"):
, the prior, is the probability that a message is spam in general, and
is the probability that
appears in spam messages.
A non biased filter will assume
, biased filters will assume higher probability for mail being spam.
is approximated by the frequency of mails containing word
and being identified as spam in the learning phase, and similarly for
Now, this is valid for a single word, but a functional spam classifier uses several words and a Naive Bayes hypothesis, assuming that the presence of each word is an independent event. Note that this is a crude assumption as in reality in natural language words co-occurrence is key. Nevertheless, it is a useful idealisation, useful for the calculation in the Naive Bayes fashion.
Given texts which can fall into categories (for example literary genres), we use
being the genre,
the words and the denominator is an irrelevant factor.
With a bag of words approach, if we have a training set
and a vocabulary
containing all the words in the documents, considering
the subset of texts in category
(fraction of samples in category
). Now, we concatenate all documents in
the number of occurrences of
(we use smoothing). The predicted category is then