Regularisation is the procedure of adding terms to the cost function an algorithm minimises, in order to control overfitting and to handle ill-posed problems, e.g. when the problem admits multiple solutions or none at all.
Ridge regression, or Tikhonov-Phillips regularisation, adds a quadratic term to the cost function, hence its alternative name of L2 regularisation.
Refer to the page about Linear Regression for a presentation of the Ordinary Least Squares method.
While OLS minimises the function

$$||X \beta - y||^2 \ ,$$

Ridge regression minimises

$$||X \beta - y||^2 + ||\Gamma \beta||^2 \ .$$
$\Gamma$ is called the Tikhonov matrix, usually chosen as a multiple of the identity, $\Gamma = \alpha I$, and serves the purpose of giving preference to solutions with small norms. The explicit solution is

$$\hat{\beta} = (X^T X + \Gamma^T \Gamma)^{-1} X^T y \ ,$$

and the effect of the regularisation may be varied via the scale of the matrix $\Gamma$. For $\Gamma = 0$, we fall back on the OLS solution, provided $(X^T X)^{-1}$ exists.
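The explicit solution above can be sketched numerically; this is a minimal illustration assuming NumPy, with made-up data and $\Gamma = \alpha I$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge_solution(X, y, alpha):
    """Explicit Tikhonov solution with Gamma = alpha * I."""
    G = alpha * np.eye(X.shape[1])
    return np.linalg.solve(X.T @ X + G.T @ G, X.T @ y)

beta_ols = ridge_solution(X, y, alpha=0.0)    # Gamma = 0 falls back on OLS
beta_ridge = ridge_solution(X, y, alpha=10.0)

# the penalised solution has a smaller norm
assert np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols)
```

Setting `alpha=0.0` recovers the OLS estimate, matching the remark above.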
In the context of linear regression, where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$ are the parameters, the Ridge regularisation solves the problem

$$\hat{\beta}^{ridge} = \operatorname*{argmin}_{\beta} \left\{ \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\} \ ,$$
where the regularisation term penalises large coefficients. This regularisation is usually applied when there are too many predictors (features), for instance when the number of input variables exceeds the number of observations; in such cases, fitting the model without penalisation results in large coefficients, and the added term controls this behaviour.
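The ill-posed case can be seen directly: with more predictors than observations the normal-equations matrix is singular, while the penalised system is always invertible. A minimal sketch, assuming NumPy and arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 30            # more predictors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X^T X is singular when p > n, so the plain normal equations break down
assert np.linalg.matrix_rank(X.T @ X) < p

# adding lambda * I makes the system well-posed
lam = 1.0
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
assert np.isfinite(beta).all()
```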
In fact, the Ridge problem is equivalent (thinking in terms of Lagrange multipliers) to minimising

$$\sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \leq t \ ,$$

which means constraining the sizes of the coefficients. Therefore, Ridge regression is equivalent to putting a threshold on the size of the parameters. The trade-off is that a large $t$ gives a lower RSS but a larger penalty term; a small $t$ constrains the coefficients more tightly at the cost of a higher RSS.
The larger the $\lambda$ used, the more the coefficients are pushed towards 0.
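The shrinking effect of $\lambda$ can be verified numerically; this is an illustrative sketch with made-up data (assuming NumPy), showing that the norm of the Ridge coefficients decreases as $\lambda$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -1.5, 2.0, 0.5]) + rng.normal(size=100)

norms = []
for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    # closed-form Ridge solution for each value of lambda
    beta = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
    norms.append(np.linalg.norm(beta))

# the coefficient norm shrinks monotonically towards 0 as lambda grows
assert all(a > b for a, b in zip(norms, norms[1:]))
```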
The figure shows the relation between an OLS and a Ridge solution in the case of 2 dimensions. The ellipses are the contours of the residual sum of squares: the inner ellipse has the smallest RSS, which is minimised at the OLS estimate. The constraint in Ridge regression corresponds to a circle. We want the smallest RSS compatible with the constraint, so the Ridge estimate is the point of contact between the ellipse and the circle.
Ridge shrinks coefficients but does not nullify any. The solutions of a Ridge regression are found by solving ($E$ being the error function)

$$\frac{\partial E}{\partial \beta} = 0 \ .$$

In the case of orthonormal inputs ($X^T X = I$), the solution reduces to

$$\hat{\beta}^{ridge} = \frac{\hat{\beta}^{OLS}}{1 + \lambda} \ ,$$

the numerator being the solution of a normal (unpenalised) regression and the denominator a function of $\lambda$. So it's apparent that Ridge scales the coefficients by a constant factor.
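The constant-factor scaling in the orthonormal case can be checked directly; a small sketch assuming NumPy, building an orthonormal design matrix via a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(3)
# orthonormal design: Q^T Q = I
Q, _ = np.linalg.qr(rng.normal(size=(50, 3)))
y = rng.normal(size=50)

lam = 2.0
beta_ols = Q.T @ y                                   # OLS when X^T X = I
beta_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(3), Q.T @ y)

# Ridge just rescales the OLS coefficients by 1 / (1 + lambda)
assert np.allclose(beta_ridge, beta_ols / (1 + lam))
```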
The figure shows the different approaches to regularisation.
LASSO replaces the quadratic penalty with an L1 penalty on the coefficients. This is equivalent (thinking in terms of Lagrange multipliers) to minimising

$$\sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \leq t \ ,$$

$t$ being a tuning parameter.
If $t$ is large enough, the regularisation has no effect and the solution is the OLS one; for a sufficiently small $t$, instead, the solutions are shrunken versions of the OLS ones. LASSO shrinks some coefficients and sets others to 0.
In the orthonormal case, the LASSO solution is the soft-thresholded OLS one, $\hat{\beta}^{lasso}_j = \operatorname{sign}(\hat{\beta}^{OLS}_j)\,(|\hat{\beta}^{OLS}_j| - \lambda)_+$: LASSO shifts each coefficient towards 0 by a constant amount, setting to 0 those that would cross it.
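The soft-thresholding behaviour can be sketched in a few lines (assuming NumPy; the coefficient values are made up for illustration):

```python
import numpy as np

def soft_threshold(beta, lam):
    """LASSO solution in the orthonormal case: shift each OLS
    coefficient towards 0 by lam, zeroing those that would cross 0."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

beta_ols = np.array([3.0, -2.0, 0.5, -0.1])
beta_lasso = soft_threshold(beta_ols, lam=1.0)

# large coefficients are shifted by a constant amount, small ones set to 0
assert np.allclose(beta_lasso, [2.0, -1.0, 0.0, 0.0])
```

Contrast this with Ridge, which rescales every coefficient but never sets one exactly to 0.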
1. Wikipedia on Tikhonov regularisation
2. Tibshirani, R., *Regression Shrinkage and Selection via the lasso*, Journal of the Royal Statistical Society B, 58:1, 1996
3. Zou, H., Hastie, T., *Regularization and Variable Selection via the Elastic Net*, Journal of the Royal Statistical Society B, 67:2, 2005