Logistic regression
For the code here, you need a few imports:
import numpy as np
from matplotlib import pyplot as plt

What it is

Logistic Regression is, despite the name, a classifier. Its procedure fits (hence the reference to regression) the logistic function, or sigmoid,
$$y = \frac{1}{1 + e^{-x}} \ ,$$
which is plotted below.
# Sample x on [-10, 10) and evaluate the sigmoid at each point
x = np.arange(-10, 10, 1)
y = 1. / (1 + np.exp(-x))

plt.plot(x, y, color='g', label='$1/(1 + e^{-x})$')
plt.title('A sigmoid function')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.legend()
plt.show()
Figure: the sigmoid, as produced by the code above.

How does it work

This exposition is inspired by, and re-worked from, the notes from a Stanford course (see References).
The idea is: given a categorical variable $y \in \{0, 1\}$ and some independent variable $x$ which we want to use in order to classify $y$, we could think of running a linear regression $y = mx + b$ followed by a classification (say, if $mx + b \geq 0.5$ we classify $y$ as 1 and if $mx + b < 0.5$ we classify it as 0).
Figure (yep, hand-drawn :D): linear regression used as a classifier, on two different training sets.
This would work fine in the case displayed on the left side of the picture. But if, as on the right side, the training set contains a point that is far away from the rest, this procedure would classify that point as a 0!
The logistic regression uses a sigmoid function model:
$$h_{\theta}(x) = \frac{1}{1 + e^{- \theta x}}$$
where $0 \leq h \leq 1$ can be interpreted as the probability that $y=1$ given $x$ as the input and $\theta$ the parameters:
$$h_\theta(x) = P(y=1 \ | \ x; \theta)$$
A logistic regression is a linear classifier: it predicts $y=1$ when $h \geq 0.5$ and $y=0$ otherwise, so the decision boundary is $\theta x = 0$.
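As a minimal sketch of this decision rule (the value of `theta` and the sample inputs are made up for illustration, and the helper names `sigmoid` and `predict` are not from the original notes), thresholding $h$ at 0.5 gives the same labels as checking the sign of $\theta x$:

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, theta, threshold=0.5):
    """Return 1 where h_theta(x) >= threshold, else 0 (scalar theta, no intercept)."""
    return (sigmoid(theta * x) >= threshold).astype(int)

theta = 2.0  # hypothetical parameter value, chosen only for illustration
x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])

labels_from_h = predict(x, theta)                # threshold the probability at 0.5
labels_from_sign = (theta * x >= 0).astype(int)  # same rule via the boundary theta * x = 0
```

Both arrays agree because the sigmoid is increasing and equals 0.5 exactly at zero.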

An example

Suppose we find parameters (we are in multiple dimensions)
$$\bar{\theta} = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix} \ ,$$
we would predict $y=1$ if $-3+x_1+x_2 \geq 0$, that is, if $x_1 + x_2 \geq 3$, and $y=0$ otherwise. See figure for a visual display of this.
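The same example can be checked numerically. This is a small sketch using the parameters $\bar{\theta} = (-3, 1, 1)$ from the text; the test points are hypothetical and chosen so that some fall on each side of the line $x_1 + x_2 = 3$:

```python
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])  # parameters from the example in the text

# Hypothetical points (x1, x2), chosen only for illustration
points = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 2.0], [2.5, 2.0], [4.0, -2.0]])

# Prepend a constant 1 to each point so that theta[0] acts as the intercept
X = np.hstack([np.ones((len(points), 1)), points])

z = X @ theta                    # -3 + x1 + x2 for each point
h = 1.0 / (1.0 + np.exp(-z))     # P(y = 1 | x; theta)
y_pred = (h >= 0.5).astype(int)  # equivalent to x1 + x2 >= 3
```

The point $(1, 2)$ sits exactly on the boundary, where $h = 0.5$, so with the $\geq$ convention it is labelled 1.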
Note though, that if the boundary given by the training data is not linear, we should use a logistic regression with higher order polynomials, as in $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$, with $g$ being the logistic function (the sigmoid). As in a polynomial regression, higher order features can be treated as first order ones with a substitution.
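To illustrate the substitution, here is a sketch in which the squared terms are simply added as extra columns, so the model remains linear in the expanded features. The parameter values are hypothetical, picked so that the boundary is the circle $x_1^2 + x_2^2 = 1$; the helper `polynomial_features` is an illustrative name, not part of any library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def polynomial_features(x1, x2):
    """Map (x1, x2) to [1, x1, x2, x1^2, x2^2]; the model is linear in these."""
    return np.stack([np.ones_like(x1), x1, x2, x1**2, x2**2], axis=-1)

# Hypothetical parameters: theta . features = x1^2 + x2^2 - 1
theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

x1 = np.array([0.0, 0.5, 1.0, 2.0])
x2 = np.array([0.0, 0.5, 1.0, 0.0])

features = polynomial_features(x1, x2)  # shape (4, 5)
h = sigmoid(features @ theta)
y_pred = (h >= 0.5).astype(int)         # 1 outside the unit circle, 0 inside
```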

Finding the parameters

Now, how do we find the coefficients $\bar{\theta}$? We use a cost function as in the regular linear regression.
Given a training set with $m$ samples, in a multi-dimensional space, $\{(\bar{x}^1, y^1), \cdots, (\bar{x}^m, y^m)\}$, where we have $n$ features for each sample, so $\bar x$ is a vector in $\mathbb{R}^n$ and $y \in \{0, 1\}$, the model is
$$h_\theta(\bar{x}) = \frac{1}{1 + e^{- \bar{\theta}^t \bar{x}}} \ .$$
The squared-error cost function minimised in linear regression would, with the logistic model, be a non-convex function with many local minima, so gradient descent would not be guaranteed to find the global minimum. Instead, the cost function we use is (illustrated in figure):
$$cost(h_\theta(\bar{x}), y) = \begin{cases} - \log(h_\theta(\bar{x})) & \text{if} \ \ y = 1 \\ - \log(1 - h_\theta(\bar{x})) & \text{if} \ \ y = 0 \end{cases}$$
This cost function captures the intuition that if $h_\theta(\bar{x}) = P(y=1 \ | \ \bar{x}; \bar{\theta}) = 0$ while actually $y = 1$, the algorithm is penalised with a very large cost (and likewise if $h_\theta(\bar{x}) = 1$ while $y = 0$). Averaged over the training set, all this can be written in a compact form:
$$J(\bar{\theta}) = - \frac{1}{m} \sum_{i=1}^{m} \left[ y^i \log(h_\theta(\bar{x}^i)) + (1 - y^i) \log(1 - h_\theta(\bar{x}^i)) \right] \ .$$
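A minimal NumPy sketch of this cost follows. It assumes the usual convention that the averaged form carries a leading minus sign, so that it agrees with the per-sample definition and is non-negative. The predicted probabilities are hypothetical, and the function names are illustrative. Values of $h$ exactly equal to 0 or 1 would give an infinite cost and are not handled here.

```python
import numpy as np

def per_sample_cost(h, y):
    """-log(h) where y == 1, -log(1 - h) where y == 0."""
    return np.where(y == 1, -np.log(h), -np.log(1.0 - h))

def mean_cost(h, y):
    """Average cost over the training set, written as a single expression."""
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1.0 - h))

y = np.array([1, 1, 0, 0])
h = np.array([0.9, 0.01, 0.1, 0.99])  # hypothetical predicted probabilities

costs = per_sample_cost(h, y)  # confident wrong predictions get the largest cost
total = mean_cost(h, y)        # equals the mean of the per-sample costs
```

Here the second and fourth samples are confidently misclassified, and their costs are much larger than those of the two well-classified samples.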
At this point gradient descent (see page) is used to minimise the cost function over the parameters $\bar \theta$.
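The minimisation can be sketched with plain batch gradient descent. For the averaged log cost, the gradient with respect to $\bar\theta$ is $\frac{1}{m} X^t (h - y)$, where $X$ stacks the samples as rows. The toy data, learning rate and iteration count below are assumptions made for illustration, not values from the original notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_cost(theta, X, y):
    """Averaged log cost for parameters theta."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1.0 - h))

def fit_logistic(X, y, learning_rate=0.1, n_iterations=5000):
    """Batch gradient descent; X must already contain a column of ones."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(n_iterations):
        h = sigmoid(X @ theta)
        gradient = X.T @ (h - y) / m       # gradient of the averaged log cost
        theta -= learning_rate * gradient  # step against the gradient
    return theta

# Hypothetical 1-D toy data with some overlap, so the classes are not separable
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])  # intercept column plus the feature

initial_cost = mean_cost(np.zeros(2), X, y)
theta = fit_logistic(X, y)
final_cost = mean_cost(theta, X, y)
```

The overlap in the toy data is deliberate: on perfectly separable data the cost has no finite minimiser and the parameters would keep growing.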
Logistic Regression is a case of a Generalised Linear Model: the log-odds of the predicted probability are indeed linear in the input variable:
$$h = \frac{1}{1 + e^{- \theta x}} \Leftrightarrow 1 + e^{- \theta x} = \frac{1}{h} \Leftrightarrow \theta x = \log{\frac{h}{1-h}}$$
with $h$ being interpreted as a probability as explained above.
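This inversion is easy to verify numerically. In the sketch below (with a hypothetical value of `theta`), applying the log-odds to the sigmoid output recovers the linear term $\theta x$ up to floating-point rounding:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(h):
    """Log-odds log(h / (1 - h)), the inverse of the sigmoid."""
    return np.log(h / (1.0 - h))

theta = 0.8                    # hypothetical parameter value
x = np.linspace(-3.0, 3.0, 7)  # a few sample inputs

h = sigmoid(theta * x)         # probabilities in (0, 1)
recovered = logit(h)           # should match theta * x
```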

References

  1. D R Cox, The regression analysis of binary sequences, Journal of the Royal Statistical Society B, 20:2, 1958
  2. Notes on linear and logistic regression from the Stanford ML course by A Ng