Support vector machine
A Support Vector Machine (SVM) is a non-probabilistic method used both for classification (naturally) and for regression (see later) which builds a hyperplane (or a set of hyperplanes) in a high dimensional space to separate the data.
Given some labelled data, an SVM spits out the optimal hyperplane separating them: in
dimensions, this will have
dimensions. Referring to the figure, where we are in
so the hyperplanes are lines, and we have a binary classification problem (colours indicate the class), multiple possible lines can separate the data, but the optimal separation is achieved with the one line which is the farthest away from all points. The problem then is about finding the hyperplane which gives the largest minimum distance to all training samples, that is, which maximises the margin.
is given by
is called the bias and
the weight vector. Properties are
- ,is the vector normal to the surface of
- The signed distance of anytois given bywhere that is the distance ofto the hyperplane, henceis proportional to the signed distance fromto the hyperplane defined by.
The optimal hyperplane is represented in infinite number of ways by scaling
. The convention is to pick the representation which gives
the training samples closest to the hyperplane. These are called the support vectors.
For the canonical hyperplane, the distance to the support vectors is
. Because the margin
is defined by twice the distance between the hyperplane and the closest samples, then
the problem of maximising
becomes that of minimising
with some constraints:
being the number of training samples. This is called the SVM problem and the constraints ensure that all points are at least a signed distance
from the decision boundary defined by
The SVM problem can be solved using Lagrange multipliers. This was a brief round-up of the linear case, but SVMs can be used in several problems, including non-linearly separable data.
The hyperplane defining the margin are
so the constraints that points are not falling inside them are
which can be rewritten as
So the classifier is
The idea is mapping the training points to some higher-dimensional euclidean space, with possibly infinite dimensions, where it is linearly separable so that a linear SVM can then be applied. The "kernel trick" consists in applying a kernel function to operate in a high-dimensional space without computing the coordinates of the data but just the inner products between the images of all pairs of data, which is computationally cheaper.
Linear boundaries in the higher-dimensional space achieve a better separation and translate to non-linear boundaries in the original space. The classifier is fitted with features
are the change of base functions and the hyperplane function is now non-linear:
but the classifier
is the same.
Because in the linear SVM the Lagrange multipliers method furnishes
so that the hyperplane becomes
, which in the case of a non-linear SVM translates to
is only involved via inner products. There is no need to specify the transformation
at all, but only require knowledge of the kernel function
The kernel function computes inner products in the transformed space, it should be symmetric and semi-positive. Popular choices for
in literature are
- the d degree polynomial
- the radial basis
- the neural network
The effectiveness of a SVM depends on the selection of kernel and its parameters, which are usually optimised via grid search.
Given a tolerance level
, the idea is finding the optimal hyperplane with some fault tolerance. The SVR is trained solving
The inner product plus intercept
is the prediction for sample
is a free parameter used as a threshold: all predictions have to be within a range
of true predictions.
- It's effective in high-dimensional spaces
- Only use a subset of training data in the decision function (the support vectors), which makes them memory-efficient
- Versatility: choice of kernel and parameters
- If the number of features is much bigger than the number of samples, poor performance is likely
- They do not furnish a probability estimate