AdaBoost

The gist

AdaBoost, short for adaptive boosting, is a boosting ensemble method used both in classification and in regression problems. Its authors won the Gödel Prize in 2003 for this work; the original paper is the first reference below, while a nice read on the general idea, by the same authors, is the paper in the second reference.
The idea is to fit a sequence of weak learners (a weak learner is one that is just slightly better than a random guesser) on repeatedly modified versions of the training set, then combine their predictions through a weighted majority voting system.
In the first iteration, you give the same weight $w_i = \frac{1}{n}$ to all $n$ training samples; in successive iterations the weights are modified so that the training samples which were incorrectly predicted in the previous iteration see their weights increased, while those which were correctly predicted see their weights decreased. This way, the weak learner is forced to focus on the points that are harder to predict: this is the adaptive part of the idea. The resulting combination of these weak learners will be a strong learner.
In the current literature, AdaBoost with decision trees as weak learners is considered a very strong classifier.
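As a quick practical illustration (a minimal sketch using scikit-learn on a synthetic dataset; the hyperparameters are illustrative, not from the text above):

```python
# A minimal sketch: AdaBoost as a classifier, using scikit-learn.
# The default weak learner is a depth-1 decision tree, i.e. a decision stump.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```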

The algorithm

This part follows the Wikipedia page. Let's say we have a binary classification problem with class labels $y_i \in \{-1, 1\}$ and sample points $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in X$ (the feature space).
  • Call a weak learner $h$; $h_t$ will be the one selected at iteration $t$.
  • At iteration $t$, you'll have built a combination of weak learners into a strong learner
$$H_t(x_i) = \sum_{s=1}^t \alpha_s h_s(x_i) \ ,$$
so that
$$H_t(x_i) = H_{t-1}(x_i) + \alpha_t h_t(x_i) \ .$$
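As a tiny illustration of this weighted combination (a sketch in NumPy; the arrays `alphas` and `weak_preds` are hypothetical stand-ins for the $\alpha_s$ and the $h_s(x_i)$, not names from the text), the strong classifier's prediction is the sign of the weighted sum of the weak learners' votes:

```python
import numpy as np

def boosted_predict(alphas, weak_preds):
    """Weighted majority vote: H_t(x_i) = sign(sum_s alpha_s * h_s(x_i)).

    alphas:     shape (t,)   -- learner weights alpha_1 ... alpha_t
    weak_preds: shape (t, n) -- h_s(x_i) in {-1, +1} for learner s and sample i
    """
    scores = alphas @ weak_preds   # the linear combination H_t(x_i)
    return np.sign(scores)
```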
This way, we have built a linear combination of weak learners over several iterations. The weights $\alpha_t$ of each learner remain to be attributed in the most effective way. If we consider the error $E$ on $H_t$ as the sum of the exponential losses on each data point,
$$E = \sum_{i=1}^n e^{-y_i H_t(x_i)}$$
(note that the exponent is negative when a point is correctly classified and positive when it is not, so misclassified points contribute a larger loss), then, by setting $w_i^1 = 1$ and $w_i^t = e^{-y_i H_{t-1}(x_i)}$ for $t > 1$ ($w_i^t$ acts as a weight on the error of point $i$), we can rewrite it as
$$E = \sum_{i=1}^n w_i^t e^{-y_i \alpha_t h_t(x_i)} \ .$$
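In code, the sample weights and the exponential loss above would look like this (a sketch; `y` holds the labels in $\{-1, +1\}$ and `H_prev` the values $H_{t-1}(x_i)$, names chosen here purely for illustration):

```python
import numpy as np

def exp_weights(y, H_prev):
    """w_i^t = exp(-y_i * H_{t-1}(x_i)); equal to 1 for every point at the first iteration."""
    return np.exp(-y * H_prev)

def exp_loss(y, H):
    """E = sum_i exp(-y_i * H_t(x_i)): small for confidently correct predictions, large otherwise."""
    return np.sum(np.exp(-y * H))
```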
We can now split the sum between the points which are well classified and those misclassified:
\begin{align} E &= \sum_{i: y_i = h_t(x_i)} w_i^t e^{-\alpha_t} + \sum_{i: y_i \neq h_t(x_i)} w_i^t e^{\alpha_t} \\ &= \sum_{i=1}^n w_i^t e^{-\alpha_t} + \sum_{i: y_i \neq h_t(x_i)} w_i^t \left[ e^{\alpha_t} - e^{-\alpha_t} \right] \end{align}
In the last expression above, the only part depending on the weak classifier is the second sum, so the weak classifier that minimises $E$ is the one minimising this sum, which means the one that minimises $\sum_{i: y_i \neq h_t(x_i)} w_i^t$, so the one with the lowest weighted error.
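In practice, this selection step is often realised by fitting a decision stump on the weighted samples (a sketch using scikit-learn's `DecisionTreeClassifier`, which accepts `sample_weight` in `fit`; note that tree induction minimises a weighted impurity, so it only approximately targets the weighted misclassification sum):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_weak_learner(X, y, w):
    """Fit a decision stump on weighted data and return it with its weighted error sum."""
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    weighted_error = np.sum(w[pred != y])   # sum of w_i^t over misclassified points
    return stump, weighted_error
```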
If we differentiate $E$ with respect to $\alpha_t$ and set the derivative to zero, we obtain
$$\alpha_t = \frac{1}{2} \log\left(\frac{\sum_{i: y_i = h_t(x_i)} w_i^t}{\sum_{i: y_i \neq h_t(x_i)} w_i^t}\right) \ .$$
Now, the weighted error rate of the weak classifier is
$$\epsilon_t = \frac{\sum_{i: y_i \neq h_t(x_i)} w_i^t}{\sum_{i=1}^n w_i^t} \ ,$$
so it follows that we can write the $\alpha_t$ which minimises $E$ as
$$\alpha_t = \frac{1}{2} \log\left(\frac{1-\epsilon_t}{\epsilon_t}\right) \ ,$$
which is $\frac{1}{2}$ times the negative of the logit function evaluated at $\epsilon_t$.
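For a concrete feel of this formula (a small worked example, not from the original text): a weak learner with weighted error rate $\epsilon_t = 0.1$ gets $\alpha_t = \frac{1}{2}\log 9 \approx 1.10$, one with $\epsilon_t = 0.4$ gets $\alpha_t \approx 0.20$, and one no better than random guessing ($\epsilon_t = 0.5$) gets $\alpha_t = 0$, i.e. no say in the vote:

```python
import math

def alpha_from_error(eps):
    """alpha_t = 0.5 * log((1 - eps) / eps): the better the weak learner, the larger its weight."""
    return 0.5 * math.log((1.0 - eps) / eps)

print([round(alpha_from_error(e), 2) for e in (0.1, 0.4, 0.5)])  # [1.1, 0.2, 0.0]
```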
To summarise then, at each iteration $t$ the AdaBoost algorithm consists of the following steps (a from-scratch sketch is given below):
  1. Choose the weak classifier $h_t$ which minimises the error $E$, i.e., the weighted error;
  2. Use it to compute the classifier's weighted error rate $\epsilon_t$;
  3. Use this to compute the learner weight $\alpha_t$;
  4. Use this to compute the boosted (strong) classifier $H_t$.
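Tying the four steps together, here is a from-scratch sketch of the training loop (my own illustrative NumPy/scikit-learn implementation, assuming labels in $\{-1, +1\}$ and decision stumps as weak learners; it mirrors the derivation above rather than any particular library's internals):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Train AdaBoost with decision stumps; y must be in {-1, +1}.

    Returns the weak learners and their weights alpha_t.
    """
    y = np.asarray(y)
    n = len(y)
    w = np.ones(n) / n                         # uniform initial weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        # 1. weak classifier with (approximately) the lowest weighted error
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # 2. weighted error rate epsilon_t
        eps = np.sum(w[pred != y]) / np.sum(w)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # guard against log(0) / division by zero
        # 3. learner weight alpha_t
        alpha = 0.5 * np.log((1 - eps) / eps)
        # 4. update sample weights, i.e. move from H_{t-1} to H_t
        w = w * np.exp(-alpha * y * pred)
        w = w / np.sum(w)                      # renormalisation (a common variant)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """H_T(x) = sign(sum_t alpha_t * h_t(x)): the weighted majority vote."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```

For binary labels this mirrors the discrete AdaBoost derived above; library implementations such as scikit-learn's AdaBoostClassifier add multi-class support and other refinements.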
Note that there exist several variants of the original AdaBoost.

References

  1. Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences, 55(1), 1997
  2. Y. Freund, R. E. Schapire, A short introduction to boosting, Journal of the Japanese Society for Artificial Intelligence, 14(5), 1999