Gradient boosting

What it is and the idea

The gradient boosting method, sometimes called the Gradient Boosting Machine (GBM), like other boosting methods combines many weak learners into one strong learner in an iterative fashion, and can be used for regression or classification. The weak learner is typically chosen to be a decision tree, in which case the method is called Gradient Tree Boosting. The method works by optimising a differentiable loss function by means of gradient descent.
The goal is to learn a model $F$ that predicts $\hat y = F(x)$ by minimising a loss function $L$ (which can be, for instance, the mean squared error averaged over the training set):

$$L = ||\hat y - y||^2 \ ,$$

where $y$ are the true values.
Let's say we have a total of $M$ stages. At each stage $1 \leq m \leq M$, we have a learner $F_m$: an example of a (very) weak learner could be one that predicts the mean of all the sample points as the target. At the following stage, the GBM improves the learner by adding an estimator $h$, so that

$$F_{m+1}(x) = F_m(x) + h(x) \ .$$

The point is how to choose the estimator $h$. A perfect $h$ would give

$$F_{m+1}(x) = F_m(x) + h(x) = y \Leftrightarrow h(x) = y - F_m(x) \ ,$$

so $h$ represents the residual. As a matter of fact, the GBM fits $h$ to the residual. This way, like in other boosting methods, each learner learns to correct its predecessor.
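This residual-fitting mechanism can be sketched in a few lines. The following is a minimal illustration on made-up toy data (none of it comes from the text): the stage-0 learner predicts the mean of the targets, and a single boosting stage adds an estimator $h$ fitted to the residual, here a deliberately crude piecewise-constant fit just to show the mechanism.

```python
import numpy as np

# Toy 1-D regression data (illustrative, not from the text).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=x.shape)

# Stage 0: the (very) weak learner that predicts the mean of all targets.
F = np.full_like(y, y.mean())

def fit_h(x, residual):
    """A crude estimator h: one constant per half of the input range."""
    left = x < 0.5
    h = np.empty_like(residual)
    h[left] = residual[left].mean()
    h[~left] = residual[~left].mean()
    return h

# One boosting stage: fit h to the residual y - F and add it to F.
residual = y - F
F_next = F + fit_h(x, residual)

# Adding the residual-fitting estimator decreases the training loss.
loss_before = np.mean((y - F) ** 2)
loss_after = np.mean((y - F_next) ** 2)
```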

The Mathematics

In general, in a function space, following Friedman [1], we are given a training set of $N$ points

$$\{(\mathbf x_1, y_1), (\mathbf x_2, y_2), \ldots, (\mathbf x_N, y_N)\} \ ,$$

with $\mathbf x_i = (x_i^1, x_i^2, \ldots, x_i^n)$ a vector of $n$ features. The goal is to find an $\hat F(\mathbf x)$ approximating the $F^*(\mathbf x)$ that minimises the expected value of some chosen loss function $L(y, F(\mathbf x))$ over the distribution of the training set values, so that

$$F^*(\mathbf x) = \arg\min_F \mathbb{E}_{\mathbf x, y} [L(y, F(\mathbf x))] \ .$$
The problem has to be solved by numerical optimisation. We have the functional

$$\Phi(F) = \mathbb{E}_{\mathbf x, y} [L(y, F(\mathbf x))] \ ,$$

which we can write as

$$\Phi(F) = \mathbb{E}_{\mathbf x} [\phi(F(\mathbf x))] \ ,$$

where

$$\phi(F(\mathbf x)) = \mathbb{E}_y[L(y, F(\mathbf x)) \mid \mathbf x] \ .$$
This functional of $F$ has to be minimised over $F$, which is considered as a parameter. In numerical optimisation, we assume the form of the solution to be

$$F(\mathbf x) = \sum_{m=0}^{M} f_m(\mathbf x) \ ,$$

where $f_0$ is an initial guess and the subsequent $f_m$ are the incremental boosts we perform at each step. The optimisation method utilised is gradient descent, whereby we compute the gradient at each step as

$$\mathbf g_m (\mathbf x) = \Big [ \frac{\partial \phi(F(\mathbf x))}{\partial F(\mathbf x)} \Big ]_{F(\mathbf x) = F_{m-1}(\mathbf x)} \ ,$$

with

$$F_{m-1}(\mathbf x) = \sum_{i=0}^{m-1} f_i(\mathbf x) \ .$$
Assuming enough regularity that differentiation and integration can be exchanged, we compute

$$\mathbf g_m(\mathbf x) = \mathbb{E}_y \Big[\frac{\partial L(y, F(\mathbf x))}{\partial F(\mathbf x)} \Big| \mathbf x \Big]_{F(\mathbf x) = F_{m-1}(\mathbf x)} \ .$$

By line search we obtain the parameter update

$$F_m(\mathbf x) = F_{m-1}(\mathbf x) - \rho_m \mathbf g_m(\mathbf x) \ ,$$

with

$$\rho_m = \arg\min_\rho \mathbb{E}_{y, \mathbf x} [L(y, F_{m-1}(\mathbf x) - \rho \mathbf g_m(\mathbf x))] \ .$$
In practice, though, we have finite data: our training set. This means that $\phi(F(\mathbf x))$ cannot be estimated accurately by its value at each $\mathbf x_i$, because we would only be using the training points to estimate it; we would need other points as well. This problem is typically addressed by using a parametric form for $F$:

$$F(\mathbf x; \{\beta_m, \mathbf a_m\}_1^M) = \sum_{m=1}^M \beta_m h(\mathbf x; \mathbf a_m) \ ,$$

and performing parameter optimisation to minimise the training-data-based approximation of the loss function, so that

$$\{\beta_m, \mathbf a_m\}_1^M = \arg\min_{\{\beta'_m, \mathbf a'_m\}_1^M} \sum_{i=1}^N L\Big(y_i, \sum_{m=1}^M \beta'_m h(\mathbf x_i; \mathbf a'_m) \Big) \ .$$

When this is unfeasible, a greedy approach is utilised, so that, for a given $m$,

$$(\beta_m, \mathbf a_m) = \arg\min_{\beta, \mathbf a} \sum_{i=1}^N L\Big(y_i, F_{m-1}(\mathbf x_i) + \beta h(\mathbf x_i; \mathbf a) \Big)$$

and then

$$F_m(\mathbf x) = F_{m-1}(\mathbf x) + \beta_m h(\mathbf x; \mathbf a_m) \ ,$$

where $h(\mathbf x; \mathbf a_m)$ is the weak learner employed. Also, we use the data-based computation of the gradient

$$g_m(\mathbf x_i) = \Big[\frac{\partial L(y_i, F(\mathbf x_i))}{\partial F(\mathbf x_i)} \Big]_{F(\mathbf x) = F_{m-1}(\mathbf x)}$$

and the data-based line-search update

$$\rho_m = \arg\min_\rho \sum_{i=1}^N L(y_i, F_{m-1}(\mathbf x_i) + \rho h(\mathbf x_i; \mathbf a_m)) \ ,$$

$$F_m(\mathbf x) = F_{m-1}(\mathbf x) + \rho_m h(\mathbf x; \mathbf a_m) \ .$$
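The data-based line search above can be sketched numerically. In this toy setup (all names and data are illustrative assumptions, not from the text), $F_{m-1}$ is the zero function, the candidate weak learner is $h(x) = x$, the loss is squared error, and `scipy.optimize.minimize_scalar` finds the step length $\rho_m$ that minimises the training-set loss.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy data: y is roughly 3x plus noise, so the optimal step along h(x) = x
# should come out close to 3.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + rng.normal(0, 0.2, size=x.shape)

F_prev = np.zeros_like(y)   # F_{m-1}, here the zero function
h = x                       # a fixed candidate weak learner h(x) = x

def loss(rho):
    """Training-set loss summed over the N points (squared-error case)."""
    return np.sum((y - (F_prev + rho * h)) ** 2)

# Line search over the step length rho, then the update F_m = F_{m-1} + rho_m h.
rho_m = minimize_scalar(loss).x
F_m = F_prev + rho_m * h
```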

The Algorithmic Implementation

1. Start by initialising a constant function $F_0(\mathbf x)$:

$$F_0(\mathbf x) = \arg\min_\rho \sum_{i=1}^{N} L(y_i, \rho) \ .$$

2. For $m = 1$ to $M$:
   - Compute the negative gradient for each data point $i = 1, \ldots, N$ as
     $$g_{mi} = - \Big[\frac{\partial L(y_i, F(\mathbf x_i))}{\partial F(\mathbf x_i)} \Big]_{F = F_{m-1}} \ ;$$
   - Compute the parameters from the training set as per above;
   - Update the learner $F_m$ as above.
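The steps above can be sketched for the squared-error loss, where the negative gradient is simply the residual. This is an illustrative implementation (data and function names are my own), using scikit-learn regression trees as the weak learners and folding the step length into the tree fit, a common simplification for this loss.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=50, max_depth=2):
    """Boosting for L = (y - F)^2 / 2, following the two steps above."""
    F0 = y.mean()                        # step 1: optimal constant for squared error
    F = np.full_like(y, F0, dtype=float)
    learners = []
    for m in range(M):                   # step 2: M boosting stages
        g = y - F                        # negative gradient = residual
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, g)
        learners.append(tree)
        F += tree.predict(X)             # update F_m = F_{m-1} + h_m
    return F0, learners

def predict(F0, learners, X):
    return F0 + sum(t.predict(X) for t in learners)

# Usage on toy data: the boosted model fits far better than the mean alone.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, size=200)
F0, learners = gradient_boost(X, y)
mse_boost = np.mean((y - predict(F0, learners, X)) ** 2)
mse_mean = np.mean((y - y.mean()) ** 2)
```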

Choosing the loss function

Ordinary Least Squares Regression

In the simple case of an OLS regression, the loss function is

$$L(y, F) = \frac{1}{2}(y-F)^2 \ ,$$

so its negative gradient with respect to $F$ is $y - F$, that is, the residual. The update rule computes

$$\begin{aligned} F_m &= F_{m-1} + h \\ &= F_{m-1} + y - F_{m-1} \\ &= F_{m-1} - \frac{\partial L}{\partial F_{m-1}} \end{aligned}$$

so it clearly appears that the update rule is that of gradient descent.
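The identity between the residual and the negative gradient can be checked numerically (a small sanity check with made-up numbers, not part of the original text):

```python
# For L = (y - F)^2 / 2, the negative gradient with respect to the
# prediction F equals the residual y - F; verify by central differences.
y, F = 3.0, 1.2                              # arbitrary example values
L = lambda f: 0.5 * (y - f) ** 2
eps = 1e-6
grad = (L(F + eps) - L(F - eps)) / (2 * eps)  # numerical dL/dF at F
# -grad should equal the residual y - F = 1.8 up to rounding error.
```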

Gradient Tree Boosting

When used with a decision tree as the weak learner (the usual case), the algorithm is called Gradient Tree Boosting. Note that, with respect to plain gradient boosting, GTB does not use the same $\rho_m$ for the whole tree but rather a separate $\rho_{mj}$ for each leaf $j$.

References

1. J. H. Friedman, *Greedy function approximation: a gradient boosting machine*, Annals of Statistics, 2001