Gradient boosting

What it is and the idea

The gradient boosting method, sometimes called the Gradient Boosting Machine (GBM), like other boosting methods combines many weak learners into one strong learner in an iterative fashion, and can be used for regression or classification. The weak learner is typically chosen to be a decision tree, in which case the method is called Gradient Tree Boosting. The method works by optimising a differentiable loss function by means of gradient descent.
The goal is to learn a model F that predicts
\hat y = F(x)
by minimising a loss function L, which can be for instance the mean squared error averaged over the training set:
L = ||\hat y - y||^2
where y are the true values.
Let's say we have a total of M stages. At each stage 1 \leq m \leq M, we use a weak learner F_m: an example of a (very) weak learner could be one that predicts the mean of all the sample points as the target.
At the following stage, the GBM improves the learner by adding an estimator h so that
F_{m+1}(x) = F_m(x) + h(x)
The point is how to choose the estimator h. A perfect h would give
F_{m+1}(x) = F_m(x) + h(x) = y \Leftrightarrow h(x) = y - F_m(x)
where y - F_m(x) represents the residual. As a matter of fact, the GBM fits h to the residual.
This way, like in other boosting methods, each learner learns to correct its predecessor.
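The residual-fitting loop can be sketched in a few lines. The following is a minimal illustration (my own choices, not from the text above), assuming a squared-error loss and a hand-rolled one-split regression stump as the weak learner:

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split stump to residuals r: pick the split on x that
    minimises the squared error, predicting the mean on each side."""
    best = None
    for s in np.unique(x):
        left, right = r[x <= s], r[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, left.mean(), right.mean())
    _, s, lmean, rmean = best
    return lambda x_new: np.where(x_new <= s, lmean, rmean)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

F = np.full_like(y, y.mean())      # F_0: the (very weak) mean predictor
for m in range(50):
    h = fit_stump(x, y - F)        # fit h to the residual y - F_m
    F = F + h(x)                   # F_{m+1} = F_m + h

print(np.mean((y - F) ** 2))       # training MSE shrinks as stages accumulate
```

Each stage fits the current residuals, so the training error is non-increasing by construction; a real implementation would also use a learning rate and a held-out set to avoid overfitting.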

The Mathematics

In general, in a function space, given a training set of N points
\{(\mathbf x_1, y_1), (\mathbf x_2, y_2), \ldots, (\mathbf x_N, y_N)\} \ ,
where each
\mathbf x_i = (x_i^1, x_i^2, \ldots, x_i^n)
is a vector of n features, the goal is finding a function \hat F(\mathbf x) as the approximation of the target function F^*(\mathbf x) that minimises the expected value of some chosen loss function
L(y, F(\mathbf x))
over the joint distribution of the training set values, so that
F^*(\mathbf x) = \arg \min_F \mathbb{E}_{\mathbf x, y} [L(y, F(\mathbf x))]
The problem will have to be solved by numerical optimisation. We have the functional
\Phi(F) = \mathbb{E}_{\mathbf x, y} [L(y, F(\mathbf x))] \ ,
which we can write as
\Phi(F) = \mathbb{E}_{\mathbf x} [\phi(F(\mathbf x))]
with
\phi(F(\mathbf x)) = \mathbb{E}_y[L(y, F(\mathbf x)) \mid \mathbf x] \ .
This functional has to be minimised with respect to F(\mathbf x), where \mathbf x is considered as a parameter. In numerical optimisation, we will assume the form of the solution to be
F(\mathbf x) = \sum_{m=0}^{M} f_m(\mathbf x)
where f_0(\mathbf x) is an initial guess and the subsequent f_m(\mathbf x) are the incremental boosts we perform at each step.
The optimisation method utilised will be gradient descent, whereby we compute the gradient at each step as
\mathbf g_m (\mathbf x) = \Big [ \frac{\partial \phi(F(\mathbf x))}{\partial F(\mathbf x)} \Big ]_{F(\mathbf x) = F_{m-1}(\mathbf x)}
with
F_{m-1}(\mathbf x) = \sum_{i=0}^{m-1} f_i(\mathbf x)
Assuming enough regularity that derivative and integral can be exchanged, we compute
\mathbf g_m(\mathbf x) = \mathbb{E}_y \Big[\frac{\partial L(y, F(\mathbf x))}{\partial F(\mathbf x)} \Big| \mathbf x \Big]_{F(\mathbf x) = F_{m-1}(\mathbf x)}
By line search we will have the parameter update
F_m(\mathbf x) = F_{m-1}(\mathbf x) - \rho_m \mathbf g_m(\mathbf x)
with
\rho_m = \arg \min_\rho \mathbb{E}_{y, \mathbf x} [L(y, F_{m-1}(\mathbf x) - \rho \mathbf g_m(\mathbf x))]
In practice though, we have finite data, our training set. This means that \phi(F(\mathbf x)) cannot be estimated accurately by its value at each \mathbf x_i, because we'd only be using the training set to estimate it; we'd need to use other points as well.
This problem typically gets addressed by using a parametric form for F,
F(\mathbf x; \{\beta_m, \mathbf a_m\}_1^M) = \sum_{m=1}^M \beta_m h_m(\mathbf x; \mathbf a_m)
and performing parameter optimisation to minimise the training-data-based approximation of the loss function, so that
\{\beta_m, \mathbf a_m\}_1^M = \arg \min_{\{\beta'_m, \mathbf a'_m\}_1^M} \sum_{i=1}^N L\Big(y_i, \sum_{m=1}^M \beta'_m h(\mathbf x_i; \mathbf a'_m) \Big)
When this is unfeasible, a greedy stage-wise approach is utilised, so that, for a given stage m,
(\beta_m, \mathbf a_m) = \arg \min_{\beta, \mathbf a} \sum_{i=1}^N L\Big(y_i, F_{m-1}(\mathbf x_i) + \beta h(\mathbf x_i; \mathbf a) \Big)
and then
F_m(\mathbf x) = F_{m-1}(\mathbf x) + \beta_m h(\mathbf x; \mathbf a_m)
where h(\mathbf x; \mathbf a_m) is the weak learner employed.
Also, we use the data-based computation of the gradient
g_m(\mathbf x_i) = \Big[\frac{\partial L(y_i, F(\mathbf x_i))}{\partial F(\mathbf x_i)} \Big]_{F(\mathbf x) = F_{m-1}(\mathbf x)}
and the data-based line-search update
\rho_m = \arg \min_\rho \sum_{i=1}^N L(y_i, F_{m-1}(\mathbf x_i) + \rho h(\mathbf x_i; \mathbf a_m))
F_m(\mathbf x) = F_{m-1}(\mathbf x) + \rho_m h(\mathbf x; \mathbf a_m)

The Algorithmic Implementation

  1. Start by initialising a constant function
    F_0(\mathbf x) = \arg \min_\rho \sum_{i=1}^{N} L(y_i, \rho)
  2. Then, for each stage m = 1, \ldots, M:
    • Compute the negative gradient for each data point i = 1, \ldots, N
      g_{mi} = - \Big[\frac{\partial L(y_i, F(\mathbf x_i))}{\partial F(\mathbf x_i)} \Big]_{F=F_{m-1}}
    • Fit the weak learner to these values and compute the parameters from the training set as per above
    • Update the learner F_m(\mathbf x) = F_{m-1}(\mathbf x) + \rho_m h(\mathbf x; \mathbf a_m) as above
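The steps above can be sketched as a toy implementation (all names and modelling choices here are mine, for illustration only). It uses the absolute loss L = |y - F|, for which the negative gradient is sign(y - F) and the minimising constant F_0 is the median, and it finds \rho_m by a crude grid search:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = x + rng.normal(0, 0.5, 300)

def fit_weak_learner(x, target):
    """Very weak learner: a fixed midpoint split of the x range, predicting
    the target mean on each side (purely illustrative)."""
    mid = np.median(x)
    lmean, rmean = target[x <= mid].mean(), target[x > mid].mean()
    return lambda xn: np.where(xn <= mid, lmean, rmean)

# Step 1: F_0 = argmin_rho sum_i L(y_i, rho) -> the median, for absolute loss
F = np.full_like(y, np.median(y))
# Step 2: the boosting stages
for m in range(100):
    g = np.sign(y - F)                 # negative gradient of |y - F|
    h = fit_weak_learner(x, g)         # fit the weak learner to g
    # line search: rho_m = argmin_rho sum_i L(y_i, F(x_i) + rho h(x_i))
    grid = np.linspace(0.0, 2.0, 41)
    rho = grid[np.argmin([np.abs(y - (F + r * h(x))).sum() for r in grid])]
    F = F + rho * h(x)                 # F_m = F_{m-1} + rho_m h

print(np.mean(np.abs(y - F)))          # training mean absolute error
```

Because the grid for \rho includes 0, the training loss can never increase from one stage to the next.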

Choosing the loss function

Ordinary Least Squares Regression

In the simple case of an OLS regression, the loss function is
L(y, F) = \frac{1}{2}(y-F)^2 \ ,
so its negative gradient with respect to F is y - F, that is, the residual.
The update rule computes
\begin{aligned} F_m &= F_{m-1} + h \\ &= F_{m-1} + (y - F_{m-1}) \\ &= F_{m-1} - \Big[\frac{\partial L}{\partial F}\Big]_{F = F_{m-1}} \end{aligned}
so it clearly appears that the update rule is that of gradient descent.
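A quick numeric sanity check of this identity (the values are arbitrary illustrations): for the squared loss, the residual y - F coincides with the negative gradient, here estimated by a central finite difference.

```python
y_true, F_val = 3.0, 1.2
L = lambda F: 0.5 * (y_true - F) ** 2                # the OLS loss above

eps = 1e-6
grad = (L(F_val + eps) - L(F_val - eps)) / (2 * eps)  # numeric dL/dF
residual = y_true - F_val

print(residual, -grad)   # both equal 1.8 up to rounding
```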

Gradient Tree Boosting

When used with a decision tree as the weak learner (usual case), the algorithm is called Gradient Tree Boosting.
Note that, with respect to normal GB, GTB does not use the same \rho_m for the whole tree but rather a separate value \gamma_{jm} for each leaf j of the tree.
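A sketch of what the per-leaf update looks like (the loss and tree here are my own illustrative choices): for the absolute loss, the optimal constant in each leaf is the median of the residuals falling there, so each leaf gets its own \gamma_{jm} instead of a single \rho_m. In full GTB the tree structure would itself be fit to the pseudo-residuals at each stage; here a fixed two-leaf split stands in for the tree, for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = np.where(x < 5, 1.0, 4.0) + rng.normal(0, 0.2, 200)

F = np.full_like(y, np.median(y))   # F_0 for the absolute loss
split = np.median(x)                # a fixed two-leaf "tree" (illustrative)
for m in range(20):
    for leaf in (x <= split, x > split):
        # gamma_jm = argmin_gamma sum_{i in leaf} |y_i - (F(x_i) + gamma)|
        gamma = np.median(y[leaf] - F[leaf])
        F[leaf] += gamma            # each leaf gets its own update

print(np.mean(np.abs(y - F)))       # training MAE after per-leaf updates
```

Refitting the leaf values against the actual loss, rather than reusing a single step length, is what distinguishes this from the plain gradient-descent update.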


  1. J. H. Friedman, Greedy function approximation: a gradient boosting machine, Annals of Statistics, 2001