# What's in the residuals of a regression?

Refer to the page on linear regression for OLS, and to the page about the performance metrics in regression.
For the code, you will need some imports:

```python
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
import pandas as pd
```

## What are the residuals

For each point $(x_i, y_i)$ in the data set, where $x_i$ is the independent variable and $y_i$ the dependent variable, the residual $r_i$ is the difference between the observed value of the dependent variable $y_i$ and the predicted value $\hat y_i$ (the point on the fitted curve):
$r_i = y_i - \hat y_i \ .$
Residuals represent the variation not captured by the fit. The concepts of residual and error are similar and easy to confuse: the error has the same definition, except that the difference is computed between the observed value of the dependent variable and the true one, that is, the unknown population value of which the observed one is a realisation. Residuals thus act as estimates of the errors.
Note that we framed this in one dimension here, but the definition generalises to any number of dimensions.
The general assumption of a linear regression is that residuals are normally distributed around a mean of 0: the model is expected to predict above and below the actual value with the same probability, or, phrased better, we expect residuals to be pure noise, displaying no pattern.
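As a minimal sketch of the definition (with made-up numbers, not the data used below), computing residuals is just a subtraction between observed and predicted values:

```python
import numpy as np

# toy observations and an assumed fitted line y_hat = 2x (hypothetical values)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.9, 4.2, 5.8])
y_hat = 2 * x

# r_i = y_i - y_hat_i
residuals = y - y_hat
print(residuals)         # [ 0.1 -0.1  0.2 -0.2]
print(residuals.mean())  # essentially 0: no systematic over- or under-prediction
```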

## Let's play with them

### On some fake data

Let us fabricate some data and fit them with OLS in order to have a look at the residuals. We use 500 values of the independent variable in the range 0 to 100:

```python
x = np.linspace(0, 100, num=500)
```
and a dependent variable which lies on a line with slope 2 but contains some noise. The amplitude of the noise is modulated by a gaussian centred on the middle of the range, this one:

```python
x = np.linspace(0, 100, num=500)
g = np.exp(-(x - 50)**2 / 30**2)
plt.plot(x, g)
plt.show();
```
A beautiful gaussian
This way, points in the middle of the range are allowed to deviate more from the line. These are the $y$ points:

```python
y = [2 * i + np.random.randint(-np.exp(-(i-50)**2/30**2)*30, np.exp(-(i-50)**2/30**2)*30) for i in x]
```
Let's fit an OLS on these data (using statsmodels). We see that the slope we obtain is $\approx 1.99$ and the $R^2$ comes out at 0.99.

```python
# fit an OLS, just the slope (intercept is 0)
fit_result = sm.OLS(y, x).fit()
fit_result.summary()
```
Regression results
Now let's plot both the data points and the fitting line: the fact that residuals are larger in the middle is quite visible.

```python
plt.plot(x, y, 'o', label='points')
plt.plot(x, fit_result.params * x, label='fit')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.legend()
plt.show();
```
Points and fitting line
Residuals are, as expected, normally distributed around 0:

```python
plt.hist(fit_result.resid, bins=20)
plt.title('Histogram of the residuals')
plt.show();
```
In this silly dataset, residuals are normally distributed but they display a pattern with the independent variable (well, we have imposed it!). In fact, if we plot a scatter of them against $x$, this appears very clearly: points in the middle have larger residuals.
This is because we have imposed this feature on our data, so it is an obvious "discovery"; in the general case, though, the analysis of the trend of the residuals can lead to interesting considerations which the sole histogram would hide. See the treatment in the first reference for a vivid example.

```python
plt.scatter(x, fit_result.resid)
plt.vlines(x, 0, fit_result.resid, lw=1)
plt.xlabel('$x$')
plt.ylabel('$r$')
plt.title('Residuals vs. x')
plt.show();
```

### On some real data

We'll now use one of the datasets readily available in statsmodels to perform an OLS. In particular, let's choose the dataset named Scotland, which contains data about the devolution referendum held in Scotland in 1997: voters were asked about the creation of a Scottish Parliament with its own powers, specifically over taxation.
There are 32 rows in the data, one for each county; the dependent variable is the proportion of YES votes obtained. The attributes are:
- COUTAX: £ of council tax
- UNEMPF: percentage of total unemployment benefits claims from females
- MOR: standardised mortality rate (UK one set to 100)
- ACT: percentage of labor force participation
- GDP: GDP
- AGE: percentage of children aged 5 to 15
- COUTAX_FEMALEUNEMP: COUTAX * UNEMPF
```python
# load the Scotland dataset shipped with statsmodels
dataset = sm.datasets.scotland.load_pandas()
df = dataset.data
```
This is a sample of the dataset:
Sample of the Scotland devolution dataset
And these are the basic statistics and correlations:

```python
df.describe()
df.corr()
```
Statistics on the Scotland devolution dataset
Using all the attributes, an OLS fit yields a very high $R^2$:

```python
fit_result = sm.OLS(dataset.endog, dataset.exog).fit()
fit_result.summary()
```
Fitting the Scotland devolution dataset
This is the histogram of residuals (note that the number of points is quite small):

```python
plt.hist(fit_result.resid, bins=10)
plt.title('Histogram of residuals')
plt.show();
```
Clearly we cannot reproduce the residual plot we did above, because in this case we have a matrix of independent variables rather than a 1-dimensional array.

## Which points are "influential"?

### The projection matrix

A data point is influential in a regression if changes to it would noticeably modify the regression. The influence is quantified via the projection matrix (Wikipedia gives a nice overview), also called influence matrix or hat matrix, which is the matrix $\mathbf P$ such that
$\mathbf{\hat y} = \mathbf{P y} \ ,$
where $\mathbf{\hat y}$ is the fitted dependent variable and $\mathbf{y}$ is the observed dependent variable. You see why the name "hat matrix".
Because in a linear model
$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$
we have
$\boldsymbol{\hat \beta} = (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \mathbf{y} \ ,$
where $\boldsymbol{\hat \beta}$ are the estimated parameters such that $\mathbf{\hat y} = \mathbf{X} \boldsymbol{\hat \beta}$, it follows that
$\mathbf{P} = \mathbf{X} (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \ .$
The diagonal elements of $\mathbf{P}$ are called the leverages:
$l_i = p_{ii} = \frac{\partial \hat y_i}{\partial y_i} \ .$
Leverages quantify the influence of each point in the dataset.
The projection matrix is idempotent:
$\mathbf{P}^2 = \mathbf{X} (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \mathbf{X} (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t = \mathbf{X} (\mathbf{X}^t \mathbf{X})^{-1} (\mathbf{X}^t \mathbf{X}) (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t = \mathbf{X} (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t = \mathbf{P} \ ,$
a property which will come in handy in the following. More properties and detailed proofs around the projection matrix can be found in the third reference.
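As a quick numerical sanity check (a sketch on synthetic data, not tied to the datasets above), the projection matrix can be built explicitly and its properties verified:

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.column_stack([np.ones(20), rng.normal(size=20)])  # design matrix: intercept + one regressor

# P = X (X^t X)^{-1} X^t
P = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(P @ P, P))                    # True: P is idempotent
leverages = np.diag(P)
print(np.isclose(leverages.sum(), X.shape[1]))  # True: trace(P) = number of parameters
```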

### Studentising the residuals

"Studentising" (from W S Gosset, aka "Student") the residuals is a way to normalise them by a measure of their error, so that they are rendered comparable to one another. The idea is that residuals are estimates of the errors but do not all have the same variance, so simply dividing them by a single sample standard deviation would not be the best of ideas. A clean way has been provided by Mr. Student, and it relies on the projection matrix.
Now the residuals are expressed by
$\mathbf{r} = \mathbf{y} - \mathbf{\hat y} = (\mathbb{1} - \mathbf{P}) \mathbf{y} \ ,$
and we want to compute an estimate of their variance. Because
$Var[\mathbf{X a}] = \mathbf{X} Var[\mathbf{a}] \mathbf{X}^t$
(see the page on the moments of a distribution for the proof), and using the idempotence of the projection matrix, the variance of the residuals is given by
$Var[\mathbf{r}] = (\mathbb{1} - \mathbf{P}) Var[\mathbf{y}] (\mathbb{1} - \mathbf{P})^t = \sigma^2 (\mathbb{1} - \mathbf{P})^2 = \sigma^2 (\mathbb{1} - \mathbf{P}) \ ,$
hence
$Var[r_i] = \sigma^2 (1 - l_i) \ ,$
where $\sigma^2$ is the variance of the errors (in practice, estimated from the residuals).
Studentised residuals are obtained by dividing each residual by its standard deviation so calculated:
$s_i = \frac{r_i}{\sigma \sqrt{1-l_i}} \ .$

### The influence plot

The influence plot is obtained by displaying the studentised residuals of a regression against their leverages. This way, as a diagnostic, one can visually monitor which points have the highest influence on the regression itself.
statsmodels makes everything super-easy, as it has an API for plotting this data directly, which in the case of our data from above leads to (the size of the bubbles is given by Cook's distance):
Influence plot on a fit of the Scotland devolution dataset
What we see here is that there are points whose (studentised) residuals are not that high but which exert a certain influence on the regression, so that removing or changing them would lead to a different outcome.

## References

1.
2. Wikipedia on the projection matrix
3. C E Ginestet, Hat Matrix: Properties and Interpretation, Boston University lecture class