Regularisation techniques
Regularisation is the procedure of adding terms to the cost function an algorithm minimises, in order to control overfitting or to deal with ill-posed problems, that is, problems with multiple solutions or none.

Ridge: $L_2$

Ridge regression, or Tikhonov–Phillips regularisation, adds a quadratic term to the cost function, hence the $L_2$ type.

In general terms

Refer to the page about Linear Regression for a presentation of the Ordinary Least Squares method.

Discussion borrowed from Wikipedia. Suppose we want to solve the problem

$$A x = b \ .$$

While OLS minimises the function $||Ax - b||^2$, Ridge regression minimises

$$||Ax - b||^2 + ||\Gamma x||^2 \ ,$$

where $\Gamma$ is called the Tikhonov matrix, usually chosen as a multiple of the identity $\mathbb{1}$, and serves the purpose of giving preference to solutions with small norms. The explicit solution is

$$x = (A^t A + \Gamma^t \Gamma)^{-1} A^t b$$

and the effect of the regularisation may be varied via the scale of the matrix $\Gamma$. For $\Gamma = 0$ we fall back on the OLS solution, provided $(A^t A)^{-1}$ exists.
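As a minimal numerical sketch of the explicit solution above (the matrix $A$, the vector $b$ and the scale of $\Gamma$ are made up for illustration), one can compare the OLS and Tikhonov solutions on a nearly singular system:

```python
import numpy as np

# An ill-conditioned system A x = b: the two columns of A are nearly collinear.
A = np.array([[1.0, 1.0],
              [1.0, 1.001],
              [1.0, 0.999]])
b = np.array([1.0, 2.0, 0.0])

# OLS solution: x = (A^t A)^{-1} A^t b  -- unstable here.
x_ols = np.linalg.solve(A.T @ A, A.T @ b)

# Tikhonov/Ridge solution with Gamma = sqrt(lam) * identity:
# x = (A^t A + Gamma^t Gamma)^{-1} A^t b
lam = 0.1
Gamma = np.sqrt(lam) * np.eye(A.shape[1])
x_tikhonov = np.linalg.solve(A.T @ A + Gamma.T @ Gamma, A.T @ b)

print("OLS      :", x_ols)        # huge coefficients of opposite sign
print("Tikhonov :", x_tikhonov)   # small-norm solution
```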

In linear regression

In the context of linear regression, where $w$ are the parameters, Ridge regularisation solves the problem

$$\min_w ||wx - y||^2 + \beta ||w||^2 \ \ , \ \ \beta \geq 0 \ ,$$

where the regularisation term penalises large coefficients. This regularisation is typically applied when there are too many predictors (features), for instance when the number of input variables exceeds the number of observations; in such cases, fitting the model without penalisation results in large coefficients, and the added term controls this behaviour.

In fact, the Ridge problem is equivalent (thinking in terms of Lagrange multipliers) to minimising $||wx - y||^2$ under the constraint $||w||^2 < c$ for some $c > 0$, which means constraining the sizes of the coefficients. Therefore, Ridge regression is equivalent to putting a threshold on the size of the parameters. The trade-off is that a large $w$ would give a better (smaller) RSS but a higher penalty term, while a small $w$ would give a smaller penalty but a worse RSS. The larger the $\beta$ used, the more the coefficients are pushed towards 0.
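As a small illustration of the effect of $\beta$ (the data below is synthetic and the penalty values are arbitrary; in scikit-learn the penalty is called alpha), increasing the penalty shrinks the norm of the fitted coefficients:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic data with more features (20) than observations (15).
X = rng.normal(size=(15, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=15)

# The larger the penalty, the smaller the norm of the fitted coefficients.
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>6}: ||w|| = {np.linalg.norm(model.coef_):.3f}")
```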
The figure shows the relation between an OLS and a Ridge solution in two dimensions. The ellipses are the contours of the residual sum of squares: the inner ellipse has the smallest RSS, which is minimised at the OLS estimate. The constraint in Ridge regression corresponds to a circle; since we want to make the RSS as small as possible while satisfying the constraint, the Ridge estimate is the point where the expanding ellipses first touch the circle.
Ridge shrinks coefficients but does not nullify any. The solutions of a Ridge regression are found by solving ($E$ is the error function)

$$\frac{\partial E}{\partial w_j} = 2 \sum_i (y_i - w_j X_{ij})(-X_{ij}) + 2 \beta w_j = 0 \ ,$$

getting to

$$w_j = \frac{\sum_i X_{ij} y_i}{\sum_i X_{ij}^2 + \beta} = w_j^0 \, \frac{\sum_i X_{ij}^2}{\sum_i X_{ij}^2 + \beta} \ ,$$

$w^0$ being the solution of an ordinary (unregularised) regression and the second factor being a function of $\beta$, always smaller than 1. So it is apparent that Ridge scales the coefficients by a constant factor, which shrinks them but never sets them exactly to zero.
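A one-feature sketch (with made-up data and an arbitrary $\beta$) checks this multiplicative shrinkage directly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Single feature, no intercept, just to verify the formula above.
x = rng.normal(size=50)
y = 2.0 * x + 0.1 * rng.normal(size=50)

beta = 5.0
w_ols = np.sum(x * y) / np.sum(x**2)               # w_j^0, ordinary regression
w_ridge = np.sum(x * y) / (np.sum(x**2) + beta)    # Ridge solution
factor = np.sum(x**2) / (np.sum(x**2) + beta)      # shrinkage factor < 1

print(w_ols, w_ridge, w_ols * factor)  # the last two values coincide
```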

LASSO: $L_1$

The figure shows the different approaches to regularisation.
LASSO (Least Absolute Shrinkage and Selection Operator) is an $L_1$ regularisation where the cost function to be minimised is

$$||w x - y||^2 + \alpha ||w||_1 \ .$$

This is equivalent (thinking in terms of Lagrange multipliers) to minimising $||w x - y||^2$ subject to the constraint $||w||_1 < c$, with $c$ being a tuning parameter. If $c$ is large enough, the regularisation has no effect and the solution is the OLS one; for a sufficiently small $c$, instead, the solutions are shrunken versions of the OLS ones. LASSO shrinks some coefficients and sets others exactly to 0.
For a single coefficient (taking $w_j > 0$, so that the derivative of the absolute value is well defined), LASSO solves

$$\frac{\partial E}{\partial w_j} = 2 \sum_i (y_i - w_j X_{ij})(-X_{ij}) + \alpha = 0 \ ,$$

giving

$$w_j = \frac{\sum_i X_{ij} y_i - \alpha/2}{\sum_i X_{ij}^2} = w_j^0 - \frac{\alpha}{2 \sum_i X_{ij}^2} \ ,$$

and symmetrically for $w_j < 0$; whenever the shift would flip the sign of the coefficient, $w_j$ is set to 0 instead. So it is clear that LASSO shifts the coefficients by a constant amount (soft-thresholding), allowing for the possibility of setting some exactly to 0.
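A minimal sketch on synthetic data (the penalty value is arbitrary; scikit-learn calls it alpha) showing that, unlike Ridge, LASSO drives some coefficients exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# 10 features, but only the first 3 actually influence y.
X = rng.normal(size=(100, 10))
true_w = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_w + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# LASSO typically sets the irrelevant coefficients exactly to 0,
# Ridge only shrinks them towards 0.
print("LASSO coefficients exactly at zero:", int(np.sum(lasso.coef_ == 0.0)))
print("Ridge coefficients exactly at zero:", int(np.sum(ridge.coef_ == 0.0)))
```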

Elastic net

Elastic net is a linear combination of the $L_1$ and $L_2$ regularisations, so that the function to be minimised is

$$||wx - y||^2 + \alpha_1 ||w||_1 + \alpha_2 ||w||_2^2 \ .$$
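In scikit-learn the combined penalty is parameterised slightly differently, with a single overall strength alpha and a mixing weight l1_ratio instead of two separate coefficients; a minimal sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# l1_ratio = 1.0 recovers LASSO, l1_ratio = 0.0 recovers Ridge.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```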

References

  1. Wikipedia on the Tikhonov regularisation
  2. R. Tibshirani, Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society B, 58:1, 1996
  3. H. Zou, T. Hastie, Regularization and Variable Selection via the Elastic Net, Journal of the Royal Statistical Society B, 67:2, 2005