The gradient boosting method, sometimes called the Gradient Boosting Machine (GBM), like other boosting methods, combines many weak learners into one strong learner in an iterative fashion and can be used for regression or classification. The weak learner is typically chosen as a decision tree, in which case the method goes by the name Gradient Tree Boosting. The method works by optimising a differentiable loss function by means of gradient descent.
The goal is to learn a model $\hat{F}(x)$ by minimising a loss function $L(y, F(x))$, which can be for instance the mean squared error averaged over the training set:

$$L = \frac{1}{n} \sum_{i=1}^n \big(y_i - F(x_i)\big)^2 \ ,$$

where the $y_i$ are the true values.
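As a tiny illustrative sketch of this averaged loss (assuming NumPy; the numbers are made up):

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])  # true values y_i
F = np.array([2.5, 0.0, 2.0, 8.0])   # model predictions F(x_i)

L = np.mean((y - F) ** 2)            # mean squared error
print(L)                             # 0.375
```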
Let's say we have a total of $M$ stages. At each stage $m$ (with $1 \leq m \leq M$), we use a weak learner $F_m$: an example of a (very) weak learner could be one that predicts the mean of all the sample points as the target.
At the following stage, the GBM improves the learner by adding an estimator $h$:

$$F_{m+1}(x) = F_m(x) + h(x) \ .$$

The point is how to choose the estimator $h$. A perfect $h$ would give $F_{m+1}(x) = F_m(x) + h(x) = y$, that is, $h(x) = y - F_m(x)$, which represents the residual. As a matter of fact, the GBM fits $h$ to the residual.
This way, like in other boosting methods, each learner learns to correct its predecessor.
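As a minimal sketch of this residual-fitting idea (assuming NumPy and scikit-learn are available; the names and data here are ours, purely illustrative), one boosting stage could look like:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Stage 0: a (very) weak learner that predicts the mean everywhere.
F0 = np.full_like(y, y.mean())

# Stage 1: fit the next estimator h to the residuals y - F0.
residuals = y - F0
h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

# The boosted model is the sum of the two stages.
F1 = F0 + h.predict(X)

print("MSE stage 0:", np.mean((y - F0) ** 2))
print("MSE stage 1:", np.mean((y - F1) ** 2))
```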
In general, we work in a function space, given a training set of $n$ points $\{(x_i, y_i)\}_{i=1}^n$, where each $x_i$ is a vector of features. Goal: finding a $\hat{F}(x)$ as the approximation of a $F^*(x)$ that minimises the expected value of some chosen loss function $L(y, F(x))$ over the distribution of the training set values, so that

$$\hat{F} = \underset{F}{\arg\min} \; \mathbb{E}_{x,y}\big[L(y, F(x))\big] \ .$$
The problem will have to be solved by numerical optimisation. We have the functional

$$\Phi(F) = \mathbb{E}_{x,y}\big[L(y, F(x))\big] \ ,$$

which we can write as

$$\Phi(F) = \mathbb{E}_x \Big[ \mathbb{E}_y \big[L(y, F(x))\big] \mid x \Big] \ .$$

This functional of $F$ has to be minimised over $F(x)$, which is considered as a parameter. In numerical optimisation, we will assume the form of the solution to be

$$\hat{F}(x) = \sum_{m=0}^{M} f_m(x) \ ,$$

where $f_0$ is an initial guess and the subsequent $f_m$ are the incremental boosts we perform at each step.
The optimisation method utilised will be gradient descent, whereby we compute the gradient at each step as

$$g_m(x) = \left[ \frac{\partial \Phi(F(x))}{\partial F(x)} \right]_{F(x) = F_{m-1}(x)} \ .$$

Assuming enough regularity that derivative and integration can be exchanged, we compute

$$g_m(x) = \mathbb{E}_y \left[ \frac{\partial L(y, F(x))}{\partial F(x)} \,\Big|\, x \right]_{F(x) = F_{m-1}(x)} \ .$$

By line-search we will have the parameter update

$$F_m(x) = F_{m-1}(x) - \rho_m g_m(x) \ , \quad \rho_m = \underset{\rho}{\arg\min} \; \mathbb{E}_{x,y} \Big[ L\big(y, F_{m-1}(x) - \rho \, g_m(x)\big) \Big] \ .$$
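To make the line-search idea concrete, here is a small, self-contained sketch of ours (not part of the original derivation) that minimises a simple quadratic by gradient descent, choosing the step length $\rho$ at each iteration by a 1D line search:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# A simple convex objective and its gradient.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
phi = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x = np.zeros(2)
for _ in range(20):
    g = grad(x)
    # Line search: pick the step rho that minimises phi along -g.
    rho = minimize_scalar(lambda r: phi(x - r * g)).x
    x = x - rho * g

print(x, np.linalg.solve(A, b))  # the two should (approximately) agree
```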
In practice though, we have finite data, our training set. This means that $\mathbb{E}_y[\,\cdot \mid x\,]$ cannot be estimated accurately by its value at each point $x_i$, because we'd only be using the training set to estimate it; we'd need to use other points as well.
This problem gets typically addressed by using a parametric form for $F$,

$$F\big(x; \{\beta_m, a_m\}_{m=1}^M\big) = \sum_{m=1}^{M} \beta_m h(x; a_m) \ ,$$

and performing parameter optimisation to minimise the training data-based approximation of the loss function, so that

$$\{\beta_m, a_m\}_{m=1}^M = \underset{\{\beta'_m, a'_m\}}{\arg\min} \; \sum_{i=1}^n L\Big(y_i, \sum_{m=1}^M \beta'_m h(x_i; a'_m)\Big) \ .$$

When this is unfeasible, a greedy approach is utilised, so that, for a given $m$,

$$(\beta_m, a_m) = \underset{\beta, a}{\arg\min} \; \sum_{i=1}^n L\big(y_i, F_{m-1}(x_i) + \beta h(x_i; a)\big) \ , \quad F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m) \ ,$$

where $h(x; a)$ is the weak learner employed.
Also, we use the data-based computation of the gradient,

$$g_m(x_i) = \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)} \ ,$$

and the data-based line-search update

$$\rho_m = \underset{\rho}{\arg\min} \; \sum_{i=1}^n L\big(y_i, F_{m-1}(x_i) - \rho \, g_m(x_i)\big) \ , \quad F_m(x) = F_{m-1}(x) - \rho_m g_m(x) \ .$$
In summary, the algorithm is:

1. Start by initialising with a constant function: $F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^n L(y_i, \gamma)$
2. At each stage $m = 1, \ldots, M$, compute the negative gradient (the "pseudo-residual") for each data point as $r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}$
3. Compute the parameters from the training set as per above, fitting the weak learner $h_m$ to the pseudo-residuals
4. Update the learner as above: $F_m(x) = F_{m-1}(x) + \rho_m h_m(x)$ (sketched in code below)
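Putting the steps above together, a minimal from-scratch sketch of the algorithm (illustrative Python of ours, assuming squared loss and scikit-learn trees as the weak learners, with the line search replaced by a fixed shrinkage factor for simplicity):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, M=100, nu=0.1, max_depth=2):
    """Gradient boosting for the squared loss L = 1/2 (y - F)^2."""
    F = np.full(len(y), y.mean())    # step 1: constant initial guess
    learners = []
    for m in range(M):
        r = y - F                    # step 2: negative gradient = residual
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)  # step 3
        F = F + nu * h.predict(X)    # step 4 (nu: shrinkage in place of rho_m)
        learners.append(h)
    return y.mean(), learners, nu

def gbm_predict(model, X):
    F0, learners, nu = model
    return F0 + nu * sum(h.predict(X) for h in learners)

# Toy usage
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)
model = gbm_fit(X, y)
print("train MSE:", np.mean((y - gbm_predict(model, X)) ** 2))
```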
In the simple case of an OLS regression, the loss function is

$$L = \frac{1}{2} \big(y_i - F(x_i)\big)^2 \ ,$$

so its negative gradient with respect to $F(x_i)$ is $y_i - F(x_i)$, that is, the residual. The update rule computes

$$F_m(x_i) = F_{m-1}(x_i) + \rho_m \big(y_i - F_{m-1}(x_i)\big) \ ,$$

so it clearly appears that the update rule is that of gradient descent.
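A quick numerical sanity check of this fact (a sketch of ours, assuming NumPy): the negative gradient of the squared loss, estimated by finite differences, matches the residual $y_i - F(x_i)$:

```python
import numpy as np

y_i, F_i, eps = 3.0, 2.2, 1e-6
L = lambda F: 0.5 * (y_i - F) ** 2

neg_grad = -(L(F_i + eps) - L(F_i - eps)) / (2 * eps)  # central difference
print(neg_grad, y_i - F_i)  # both ~0.8
```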
When used with a decision tree as the weak learner (the usual case), the algorithm is called Gradient Tree Boosting. Note that with respect to normal GB, GTB does not use the same $\gamma_m$ for the whole tree but rather a separate $\gamma_{jm}$ for each leaf $j$:

$$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} \, \mathbb{1}(x \in R_{jm}) \ ,$$

where the $R_{jm}$ are the leaf regions of the $m$-th tree.
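In practice one would rarely implement this by hand; as a usage sketch, scikit-learn ships Gradient Tree Boosting as `GradientBoostingRegressor` / `GradientBoostingClassifier` (parameter names below are from scikit-learn's API):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gtb = GradientBoostingRegressor(
    n_estimators=200,       # M, the number of boosting stages
    learning_rate=0.1,      # shrinkage applied to each stage
    max_depth=3,            # depth of the individual trees
    loss="squared_error",   # called "ls" in older scikit-learn versions
)
gtb.fit(X_tr, y_tr)
print("test R^2:", gtb.score(X_te, y_te))
```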