Machine Learning: fundamental algorithms
Independence, joint/marginal/conditional probability, covariance and correlation

Statistical independence

Two random variables $X$ and $Y$ are said to be independent when their joint probability (see below) is equal to the product of the probabilities of each:

$$P(X, Y) = P(X) P(Y) \ .$$

In terms of conditional probabilities, this means

$$P(X | Y) = \frac{P(X, Y)}{P(Y)} = \frac{P(X)P(Y)}{P(Y)} = P(X) \ ,$$

that is, the probability of $X$ occurring is not affected by the occurrence of $Y$. This is how independence is typically phrased in verbal terms: the occurrence of one event does not influence the occurrence of the other.
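As a quick numerical check (a minimal NumPy sketch, using a fair coin and a fair die as hypothetical examples of the two variables), we can simulate two independent variables, estimate the joint and marginal frequencies from the samples, and see that $P(X, Y) \approx P(X)P(Y)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two independent discrete variables: a fair coin and a fair die.
x = rng.integers(0, 2, size=n)   # X in {0, 1}
y = rng.integers(1, 7, size=n)   # Y in {1, ..., 6}

# Empirical joint probability P(X=1, Y=6) and marginals P(X=1), P(Y=6).
p_joint = np.mean((x == 1) & (y == 6))
p_x = np.mean(x == 1)
p_y = np.mean(y == 6)

print(p_joint, p_x * p_y)  # both close to 1/12 ≈ 0.0833
```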

IID variables

I.I.D. stands for independent and identically distributed, an abbreviation used all over statistics. IID variables are independent and also share the same distribution.
The concept is a basic assumption behind many foundational results in statistics.

The joint probability

The joint probability of two or more events is the probability that they happen together. If $X$, $Y$, $Z$, ... are the random variables, their joint probability is written as

$$P(X, Y, Z, \ldots)$$

or as

$$P(X \cap Y \cap Z \cap \ldots) \ .$$

The case of independent variables

If the variables are independent, their joint probability reduces to the product of their probabilities:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n P(X_i) \ .$$
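As a tiny worked example (a hypothetical one: $n$ fair dice rolled independently), the probability that all of them show a six is just the product of the individual probabilities:

```python
import numpy as np

# Probability that each of n independent fair dice shows a six:
# P(X_1 = 6, ..., X_n = 6) = prod_i P(X_i = 6) = (1/6)^n
n = 4
p_single = 1 / 6
p_all_sixes = np.prod(np.full(n, p_single))

print(p_all_sixes, (1 / 6) ** n)  # both ≈ 0.000772
```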

The marginal probability

Figure: Bscan, CC0, via Wikimedia Commons.
If we have the joint probability of two or more random variables, the marginal probability of each is the probability related to that variable and to its own space of events; it expresses the probability of the variable when the value of the other one is not known. It is calculated by summing the joint probability over the space of events of the other variable. More specifically, given $P(X, Y) = P(X=x, Y=y)$,

$$P(X=x) = \sum_y P(X=x, Y=y) \ .$$
The illustration here (image by IkamusumeFan, own work released under CC BY-SA 3.0, via Wikimedia Commons) shows points sampled from a joint distribution (the black dots) together with the marginal distributions.
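In code, marginalising is just a sum over one axis of the joint table; here is a minimal sketch with a made-up $2 \times 3$ joint probability table:

```python
import numpy as np

# Made-up joint probability table P(X=x, Y=y):
# rows index the values of X, columns the values of Y.
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.05, 0.30, 0.25],
])

# Marginals: sum the joint over the other variable's axis.
p_x = joint.sum(axis=1)  # P(X=x) = sum_y P(X=x, Y=y)
p_y = joint.sum(axis=0)  # P(Y=y) = sum_x P(X=x, Y=y)

print(p_x)  # [0.4 0.6]
print(p_y)  # [0.15 0.5 0.35]
```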

Covariance and correlation

Covariance

Given the random variables $X$ and $Y$ with respective means $\mu_x$ and $\mu_y$, their covariance is defined as

$$\text{cov}(X, Y) = \mathbb{E}[(X - \mu_x)(Y - \mu_y)] \ .$$

It is a measure of how jointly the two variables vary: a positive covariance means that when $X$ grows, $Y$ tends to grow as well, while a negative covariance means that when $X$ grows, $Y$ tends to decrease.
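In code, the definition translates directly into an average of products of deviations from the means; a minimal NumPy sketch (with made-up data), compared against `np.cov`:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)   # positively related to x

# Covariance from the definition: E[(X - mu_x)(Y - mu_y)].
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# ddof=0 makes np.cov normalise by n, matching the plain mean above
# (by default it uses the unbiased n-1 estimator).
print(cov_xy, np.cov(x, y, ddof=0)[0, 1])
```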

Correlation

Correlation is measured by a correlation coefficient, which exists in several definitions depending on what exactly is measured; it is always some sort of normalised covariance. The one corresponding to the covariance itself is Pearson's, which defines the correlation coefficient as the covariance normalised by the product of the standard deviations of the two variables:

$$\rho_{xy} = \frac{\text{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\mathbb{E}[(x - \mu_x)(y - \mu_y)]}{\sigma_x \sigma_y} \ ,$$

and it can also be written as

$$\begin{aligned} \rho_{xy} &= \frac{\mathbb{E}[xy - x \mu_y - \mu_x y + \mu_x \mu_y]}{\sigma_x \sigma_y} \\ &= \frac{\mathbb{E}[xy] - \mu_x\mu_y - \mu_y\mu_x + \mu_x\mu_y}{\sigma_x \sigma_y} \\ &= \frac{\mathbb{E}[xy] - \mu_x\mu_y}{\sigma_x \sigma_y} \ . \end{aligned}$$
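The same recipe gives Pearson's coefficient directly in code; a small sketch (again with made-up data), checked against `np.corrcoef`:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=10_000)
y = -0.5 * x + rng.normal(size=10_000)

# Pearson's correlation from the definition: cov(X, Y) / (sigma_x * sigma_y).
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov_xy / (x.std() * y.std())

print(rho, np.corrcoef(x, y)[0, 1])  # essentially identical
```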
The correlation coefficient has these properties:
  • $-1 \leq \rho_{xy} \leq 1$
  • It is symmetric: $\rho_{xy} = \rho_{yx}$
  • If the variables are independent, then $\rho_{xy} = 0$ (but the reverse is not true)

Independence and correlation

Let's expand on that last point. We said that if two random variables are independent, then the correlation coefficient is zero. This is easy to prove, as it follows directly from the definition above (also bear in mind Fubini's theorem):

$$\mathbb{E}[XY] = \int_{\Omega_X} \int_{\Omega_Y} \text{d}x \, \text{d}y \ xy \, P(x,y) = \int_{\Omega_X} \int_{\Omega_Y} \text{d}x \, \text{d}y \ xy \, P(x) P(y) = \mu_x \mu_y \ ,$$

so that $\mathbb{E}[XY] - \mu_x \mu_y = 0$ and hence $\rho_{xy} = 0$.
The reverse is not true: look at this amazing Q&A on Cross Validated for a well-explained counter-example.
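A classic counter-example of this kind (a small sketch, not necessarily the one in the linked Q&A) is $Y = X^2$ with $X$ symmetric around zero: $Y$ is completely determined by $X$, yet the correlation is (approximately, in the sample) zero:

```python
import numpy as np

rng = np.random.default_rng(3)

# X symmetric around 0, Y fully determined by X — clearly not independent.
x = rng.uniform(-1, 1, size=100_000)
y = x ** 2

# Yet the (sample) Pearson correlation is approximately zero.
print(np.corrcoef(x, y)[0, 1])  # ≈ 0
```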

Correlation and the relation between variables

Correlation says "how much" it happens that when $x$ grows, $y$ grows as well. It is not a measure of the slope of the linear relation between $x$ and $y$. This is greatly illustrated in the figure above (from Wikipedia's page, under CC0), which reports sets of data points with $x$ and $y$ and their correlation coefficient.
In the central figure the correlation is undefined because the variance of $y$ is 0. In the bottom row the relation between the variables is not linear, and the correlation does not capture that.
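A quick numerical illustration (with made-up data): rescaling $y$ changes the slope of the linear relation but leaves the correlation coefficient untouched:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=10_000)
noise = rng.normal(size=10_000)

# Same noise level relative to the signal, very different slopes.
y_steep = 10.0 * x + 10.0 * noise
y_shallow = 0.1 * x + 0.1 * noise

# The correlation coefficient is identical: it measures how tightly the
# points follow a linear relation, not how steep that relation is.
print(np.corrcoef(x, y_steep)[0, 1], np.corrcoef(x, y_shallow)[0, 1])
```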