# What's in the residuals of a regression?

Refer to the page on linear regression for OLS, and to the page about performance metrics in regression.

For the code, you will need some imports:

```python
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
import pandas as pd
```

For each point $(x_i, y_i)$ in the data set, where $x_i$ is the independent variable and $y_i$ the dependent variable, its **residual** $r_i$ is the difference between the *observed* value of the dependent variable $y_i$ and the *predicted* value $\hat y_i$ (the point on the fitting curve):

$r_i = y_i - \hat y_i \ .$

Residuals represent the variation not captured by the fit. The concepts of *residual* and *error* are similar and easy to confuse: the error has in fact the same definition, except that the difference is computed between the *observed* value of the dependent variable and the *real* one, that is, the unknown population value of which the *observed* one is an observation. Residuals thus play the role of an *estimate* of the errors.

Note that we framed this in one dimension here, but the definition is general to any number of dimensions.
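As a tiny numeric sketch (the values here are made up for illustration), the residual is just observed minus predicted:

```python
import numpy as np

y_obs = np.array([3.1, 4.9, 7.2])   # hypothetical observed values
y_hat = np.array([3.0, 5.0, 7.0])   # hypothetical predictions from some fit

r = y_obs - y_hat                   # residuals: observed minus predicted
print(r)                            # [ 0.1 -0.1  0.2]
```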

The general assumption of a linear regression is that residuals are *normally distributed* around a mean of 0: the model is expected to predict higher or lower than the actual value with the same probability, or, better phrased, we expect residuals to be pure noise and not to display any pattern.

Let us fabricate some data and fit them with OLS in order to have a look at the residuals. We use 500 independent variables in the range 0 to 100:

```python
x = np.linspace(0, 100, num=500)
```


and a dependent variable which lies on a line with slope 2 but contains some noise. The noise amplitude is modulated by a gaussian, this one:

```python
g = np.exp(-(x - 50)**2 / 30**2)

plt.plot(x, g)
plt.show();
```

A beautiful gaussian

This way, points in the middle of the scale are given a higher possibility to vary from the line. So these are the $y$ points:

```python
y = [2 * i + np.random.randint(-np.exp(-(i-50)**2/30**2)*30,
                               np.exp(-(i-50)**2/30**2)*30)
     for i in x]
```
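A vectorized variant of the same idea (my own rephrasing, not the original code) draws continuous uniform noise scaled by the gaussian envelope, avoiding the Python loop:

```python
import numpy as np

x = np.linspace(0, 100, num=500)
g = np.exp(-(x - 50)**2 / 30**2)   # gaussian amplitude envelope

# continuous uniform noise in [-1, 1], scaled by the envelope times 30
y = 2 * x + np.random.uniform(-1, 1, size=x.size) * g * 30
```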

Let's fit an OLS on these data (using `statsmodels`). We see that the slope we obtain is $\approx 1.99$ and the $R^2$ comes out as 0.99.

```python
# fit an OLS, just the slope (the intercept is 0)
fit_result = sm.OLS(y, x).fit()
fit_result.summary()
```

Regression results

Now let's plot both the data points and the fitting line: the fact that residuals are larger in the middle is quite visible.

```python
plt.plot(x, y, 'o', label='points')
plt.plot(x, fit_result.params * x, label='fit')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.legend()
plt.show();
```

Points and fitting line

Residuals are, as expected, normally distributed and centered around 0:

```python
plt.hist(fit_result.resid, bins=20)
plt.title('Histogram of the residuals')
plt.show();
```

In this silly dataset, residuals are normally distributed but they display a pattern with the independent variable (well, *we* have imposed it!). In fact, if we plot a scatter of them against $x$, this appears very clearly: points in the middle have larger errors.

This is because we have imposed this feature on our data, so it is an obvious "discovery"; in a general case, though, the analysis of the trend of the residuals can lead to interesting considerations which the sole histogram would hide. See the treatment in the first reference for a vivid example.

```python
plt.scatter(x, fit_result.resid)
plt.vlines(x, 0, fit_result.resid, lw=1)
plt.xlabel('$x$')
plt.ylabel('$r$')
plt.title('Residuals vs. x')
plt.show();
```

We'll now use one of the datasets readily available in `statsmodels` to perform an OLS. In particular, let's choose the dataset named *Scotland*, which contains data about the devolution referendum held in Scotland in 1997: the question put to the voters was about the creation of a Scottish Parliament with its own powers, specifically over taxation.

There are 32 rows in the data, one for each county, and 8 columns: the dependent variable is the proportion of YES votes obtained, and the attributes are:

- *COUTAX*: £ of council tax
- *UNEMPF*: percentage of total unemployment benefits claims from females
- *MOR*: standardised mortality rate (the UK one set to 100)
- *ACT*: percentage of labor force participation
- *GDP*: GDP
- *AGE*: percentage of children aged 5 to 15
- *COUTAX_FEMALEUNEMP*: *COUTAX* * *UNEMPF*

```python
dataset = sm.datasets.scotland.load_pandas()
df = dataset.data
```

This is a sample of the dataset:

```python
df.head()
```

Sample of the Scotland devolution dataset

And these are the basic statistics and correlations:

```python
df.describe()
df.corr()
```

Statistics on the Scotland devolution dataset

Using all the attributes, an OLS fit yields a very high $R^2$:

```python
fit_result = sm.OLS(dataset.endog, dataset.exog).fit()
fit_result.summary()
```

Fitting the Scotland devolution dataset

This is the histogram of residuals (note that the number of points is quite small):

```python
plt.hist(fit_result.resid, bins=10)
plt.title('Histogram of residuals')
plt.show();
```

Clearly we cannot reproduce the residual plot we did above, because in this case we have a matrix of independent variables rather than a 1-dimensional array.

A data point is *influential* in a regression if changes to it would sensibly modify the regression. The influence is quantified via the *projection matrix* (Wikipedia gives a nice overview), also called the *influence matrix* or *hat matrix*: the matrix $\mathbf P$ such that

$\mathbf{\hat y} = \mathbf{P y} \ ,$

where $\mathbf{\hat y}$ is the fitted dependent variable and $\mathbf{y}$ is the observed dependent variable. You see why the name "hat matrix". Because in a linear model

$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$

we have

$\boldsymbol{\hat \beta} = (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \mathbf{y} \ ,$

where $\boldsymbol{\hat \beta}$ are the estimated parameters such that $\mathbf{\hat y} = \mathbf{X} \boldsymbol{\hat \beta}$, it follows that

$\mathbf{P} = \mathbf{X} (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \ .$

The diagonal elements of $\mathbf{P}$ are called the *leverages*:

$l_i = p_{ii} = \frac{\partial \hat y_i}{\partial y_i} \ .$

Leverages quantify the influence of each point in the dataset.

The projection matrix is idempotent:

$\mathbf{P}^2 = \mathbf{X} (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \mathbf{X} (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t = \mathbf{X} (\mathbf{X}^t \mathbf{X})^{-1} (\mathbf{X}^t \mathbf{X}) (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t = \mathbf{X} (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t = \mathbf{P} \ ,$

a property which will come in handy in the following. More properties and detailed proofs around the projection matrix can be found in the third reference.
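These properties are easy to verify numerically; a minimal sketch on made-up data (the design matrix and its sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))    # made-up design matrix: 20 points, 3 features
y = rng.normal(size=20)         # made-up observations

# projection (hat) matrix: P = X (X^t X)^{-1} X^t
P = X @ np.linalg.inv(X.T @ X) @ X.T

beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)    # OLS parameter estimate

print(np.allclose(P @ y, X @ beta_hat))          # P maps y onto y-hat: True
print(np.allclose(P @ P, P))                     # idempotence: True

leverages = np.diag(P)                           # the l_i
```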

"Studentising" (from W. S. Gosset, aka "Student") the residuals is a way to normalise them by a measure of their error so that they are rendered comparable to one another. The idea is that residuals are *estimates* of the error, so simply dividing them by their sample standard deviation would not be the best of ideas. A cleaner way, provided by Mr. Student, relies on the projection matrix.

The residual vector is expressed by

$\mathbf{r} = \mathbf{y} - \mathbf{\hat y} = (\mathbb{1} - \mathbf{P}) \mathbf{y} \ ,$

and we want to compute an estimate of its variance.

Because $Var[\mathbf{X a}] = \mathbf{X} Var[\mathbf{a}] \mathbf{X}^t$ (see the page on the moments of a distribution for the proof), it follows that the variance of the residuals is given by (using the idempotence of the projection matrix)

$Var[\mathbf{r}] = (\mathbb{1} - \mathbf{P}) Var[\mathbf{y}] (\mathbb{1} - \mathbf{P})^t = \sigma^2 (\mathbb{1} - \mathbf{P})^2 = \sigma^2 (\mathbb{1} - \mathbf{P}) \ ,$

hence

$Var[r_i] = \sigma^2 (1 - l_i) \ ,$

where $\sigma^2$ is the variance of the observations $\mathbf{y}$, namely the variance of the errors.

Studentised residuals are obtained by dividing each residual by its standard deviation so calculated:

$s_i = \frac{r_i}{\sigma \sqrt{1-l_i}} \ .$

The influence plot is obtained by displaying the studentised residuals of a regression against their leverages. This way, in a diagnostic effort, one can visually monitor which points have the highest influence on the regression itself.

`statsmodels` makes everything super-easy, as it has an API to plot this directly; in the case of our data from above, this leads to the following (the size of the bubbles is given by Cook's distance):

Influence plot on a fit of the Scotland devolution dataset

What we see here is that there are points whose (studentised) residuals are not that high but which exert a certain influence on the regression, so that removing or changing them would lead to a different outcome.

- 1.
- 2.
- 3.
