Meta & resources
Machine Learning: concepts & procedures
Machine Learning: fundamental algorithms
Machine Learning: model assessment
Artificial neural networks
Natural language processing
The computer science appendix
k-means
The k-means is a clustering algorithm. Not to be confused with kNN!
For the code reported here, you can head to the notebook present in the repo. You will need some imports:
1
import math
2
import random
3
4
import pandas as pd
5
import numpy as np
6
from matplotlib import pyplot as plt
7
from sklearn.cluster import KMeans
8
from sklearn.preprocessing import scale
9
from scipy.spatial.distance import euclidean
Copied!

# How it works

The k-means clustering algorithm needs you, the user, to choose
$K$
, which is the number of clusters you want to end up with. It is the fastest of the clustering methods and it always converges, though it may do to a local minimum. Given a set of
$n$
observations,
$(\mathbf{x_1}, \mathbf{x_2}, \ldots, \mathbf{x_n})$
, each living in a
$d$
-dimensional space (
$\mathbf{x} \in \mathbb{R}^d$
), where
$d$
are the features, the algorithms wants to partition them into
$K \leq n$
clusters
${\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_n}$
so to minimise the within-cluster-sum-of-squares (WCSS,
$W$
):
$\boxed{W = \sum_{k=1}^{K} \sum_{\mathbf{x} \in \mathcal{C}_k} || \mathbf{x} - \mathbb{\mu}_k ||^2} \ ,$
where
$\mathbb{\mu}_k$
is the mean of the points falling in cluster
$C_k$
(the so-called centroid of the cluster), itself a vector in
$\mathbb{R}^d$
, each of its coordinates being the mean for that feature in the cluster.
In this horrible figure I did above,
$K$
is two and the red crosses indicate the centroids.

# The standard algorithm

The standard k-means algorithm is that of Lloyds (dated 1957, was only published outside of Bell Labs in 1982).
In short, the algorithms starts by initialising the
$K$
centroids with points picked randomly from the sample. It then proceeds iteratively, until convergence, through steps called
1. 1.
assignment
2. 2.
update
The assignment step consists in assigning each sample point to the cluster whose centroid is the nearest; this procedure is equivalent to partitioning the samples according to the Voronoi diagram generated by the centroids. The Voronoi diagram is a way to partition a plane in regions based on the distance to points. The update step calculates the new centroids with the new cluster configuration.
These steps are executed one after the other until the difference between new and old centroids does not exceed a chosen threshold.

# Further metrics

On top of the WCSS, we can use other metrics to assess and monitor the quality and the nature for our clustering.

## Compactness

We can assess how "compact" our clustering is via summing up all the intra-cluster distances for each cluster, divided by twice the number of points in each of them:
$\boxed{c = \sum_{k=1}^K \frac{1}{2n_k} d_k} \ .$
To expand on the above,
$n_k$
is the number of points in cluster
$\mathcal{C}_k$
identified by index
$k$
, with
$k \in {1, \ldots, K}$
, and
$d_k$
is the sum of distances within the cluster:
$d_k = \sum_{\mathbf{x} \in \mathcal{C}_k} \sum_{\mathbf{y} \in \mathcal{C}_k} || \mathbf{x} - \mathbf{y} ||^2 \ ,$
and it can be written as (
$\mathbf{\mu}_k$
being the centroid of the cluster)
$2 n_k \sum_{\mathbf{x} \in C_k} || \mathbf{x} - \mathbf{\mu}_k ||^2 \ .$

### Proof

By definition, we have (
$\mu_k$
is the centroid of
$\mathcal{C}_k$
)
$\mathbf{\mu}_k = \frac{\sum_{\mathbf{x} \in \mathcal{C}_k} \mathbf{x}}{n_k}$
because each component of
$\mathbf{\mu}_k$
is the mean of the corresponding components of the points in the cluster.
The first member is (we use the fact, in the second line, that
$\sum_{\mathbf{x}} 1 = n_k$
):
\begin{align} \sum_{\mathbf{x}} \sum_{\mathbf{y}} (||\mathbf{x}||^2 + ||\mathbf{y}||^2 - 2 \mathbf{x} \cdot \mathbf{y}) &= \sum_{\mathbf{x}} \sum_{\mathbf{y}} ||\mathbf{x}||^2 + \sum_{\mathbf{x}} \sum_{\mathbf{y}} ||\mathbf{y}||^2 - 2 \sum_{\mathbf{x}} \sum_{\mathbf{y}} \mathbf{x} \cdot \mathbf{y} \\ &= n_k \sum_{\mathbf{x}} ||\mathbf{x}||^2 + n_k \sum_{\mathbf{y}} ||\mathbf{y}||^2 -2 n_k \sum_{\mathbf{x}} \mathbf{x} \cdot \mathbf{\mu}_k \\ &= 2 n_k \sum_{\mathbf{x}} ||\mathbf{x}||^2 - 2n_k^2 ||\mu_k||^2 \end{align}
The second member is
\begin{align} 2 n_k \big[ \sum_{\mathbf{x}} ||\mathbf{x}||^2 + \sum_{\mathbf{x}} ||\mu_k||^2 -2 \sum_{\mathbf{x}} \mathbf{x} \cdot \mathbf{\mu}_k \big] &= 2 n_k \big[ \sum_{\mathbf{x}} ||\mathbf{x}||^2 + n_k ||\mu_k||^2 - 2 n_k ||\mu_k||^2 \big] \\ &= 2 n_k \big[ \sum_{\mathbf{x}} ||\mathbf{x}||^2 - n_k ||\mu_k||^2 \big] \\ &= 2 n_k \sum_{\mathbf{x}} ||\mathbf{x}||^2 - 2 n_k^2 ||\mu_k||^2 \ , \end{align}
so the two things are the same.

# How to choose the best$K$?

Clearly, if I choose a too large
$K$
, I end up with a too-fine clustering which fails in showing me similarities in the data. On the flip side, If I choose a too small
$K$
, I also fail in separating similar points for the reverse reason, as I'm putting too much stuff together. How should I go with choosing the
$K$
that best separates my points?
There can be (at least) three ways: one rule of thumb, one heuristic and one mathematical.

## The rule of thumb

The rule of thumb method suggests choosing
$K \sim \sqrt{n/2}$
. This is a (very) hand-waving method, and gives an "approximate" clustering (it typically gives too many clusters). The justification behind it comes from the fact that it runs in linear time so it really should only be taken as an indication when wanting to reduce dataset size.

## The elbow method

The elbow method is a heuristic method consisting in looking at when adding a new cluster does not model the data any better.
By plotting
$W$
as a function of
$K$
, it should be clear where the contribution of another cluster gives minimal further gain (the "elbow" point of the curve). It is clear that the choice of this point is kind of subjective anyway, as it depends on the situation and on the what we, in that situation, consider a minimal further gain not worth pursuing.

## The Gap Statistic method

It is a mathematical presented in the original paper, you can read an explanation plus implementation in the references. The authors define the gap statistic as
$G(K) = \mathbb{E}[\log W^*_K] - \log W_K \ ,$
where the asterisk refers to a null reference dataset. It is demonstrated that the best
$K$
clustering the data is the smallest
$K$
such that
$G(K) \geq G(K+1) - s_{K+1} \ ,$
where
$s_K$
is the simulation error calculated from the standard deviation
$\sigma_K$
of B Monte Carlo replicates
$\log(W^*_K)$
, as
$s_K = \sqrt{1 + \frac{1}{B}} \sigma_K \ ,$
and
$\mathbb{E}[\log W^*_K] = \frac{1}{B} \sum_{b=1}^{B} \log W^*_{Kb} \ ,$
$W^*_{Kb}$
given by the clustering of
$B$
reference datasets.
The reference distribution is obtained by generating each reference feature randomly (uniformly) within the range values for that feature, that is, within the bounding box for that feature. Then the expected value is estimated by an average of
$B$
copies
$\log W^*_K$
, each computed from a Monte Carlo sample
$x^*_1, \ldots, x^*_n$
, drawn from a reference distribution (x is the dataset matrix).
Because
$\mathbb{E}[\log W^*_K] = \frac{1}{B} \log \Pi_b W^*_{Kb} \ ,$
then
$G(K) = \log \left[ \frac{(\Pi_b W^*_{Kb})^{\frac{1}{B}}}{W_K} \right]$
In the logarithm, at the denominator, we got the geometric mean of
$W^*_K$
. This last equation is saying that the gap statistics is the logarithm of the ratio of the geometric mean of
$W^*_K$
to
$W_K$
.
The standard deviation is
$\sigma_K = \sqrt{\frac{1}{B} \sum_{b=1}^B (\log W^*_{Kb} - \mathbb{E}[\log W^*_{Kb}])^2 }$

# Trying it!

We will use the Old Faithful dataset for this, which reports some eruption times and waiting times between eruptions of the Old Faithful geyser in Yellowstone, a classic dataset. A copy of the data is available in the repo for this book for convenience.
1
# Let's read the Old Faithful dataset into a Pandas Dataframe, choosing the columns
2
3
data = pd.read_csv(dataset, delimiter=' ')[['eruptions', 'waiting']]
4
Copied!
Let's scale the data so that each column lives on the same scale:
1
scaled_data = scale(data)
Copied!
And let's see how the data looks like:
1
data.plot.scatter('eruptions', 'waiting')
2
plt.title('Scatter of the data in the "old faithful" dataset')
3
plt.show();
Copied! The Old Faithful dataset - you can eye two sets

## Apply the rule of thumb

1
thumb_k = np.sqrt(data.shape/2)
2
print(' * Rule of thumb asserts optimal k =', round(thumb_k, 2))
Copied!
The rule of thumb suggests an optimal k of 12!

## Apply the elbow method

You will see the elbow will appear quite soon. Let's define some functions first, for convenience:
1
def retrieve_cluster_points(cluster_index, labels, samples_matrix):
2
"""
3
In a clustering, retrieve points (their coordinates)
4
belonging to given cluster. Return array of such points.
5
samples_matrix is a Numpy array.
6
"""
7
8
cluster_points = []
9
for sample_index in np.where(labels == cluster_index):
10
cluster_points.append(samples_matrix[sample_index, :])
11
cluster_points = np.array(cluster_points)
12
13
return np.array(cluster_points)
14
15
16
def compute_wcss(centroids, labels, samples_matrix):
17
"""
18
Compute the WCSS for a k-means clustering, given the
19
centroids and the dataset matrix of samples.
20
samples_matrix is a Numpy array.
21
"""
22
23
k = len(centroids)
24
wcss = 0
25
for cluster_index in range(k):
26
cluster_points = retrieve_cluster_points(cluster_index,
27
labels,
28
samples_matrix)
29
wcss += sum([euclidean(point, centroids[cluster_index])
30
for point in cluster_points])
31
32
return wcss
Copied!
and then run the method, plotting the result k by k:
1
k_range = range(1, 21) # range of K (decide arbitrarily how many Ks to test)
2
inertia_k = dict() # inertia for each k
3
wcss_k = dict() # WCSS for each k
4
p_k = dict() # percentage of variance explained for each k
5
6
# Loop over the value of K
7
for k in k_range:
8
print('k = ', k)
9
10
# Fit the model
11
fit = KMeans(n_clusters=k).fit(scaled_data)
12
13
# Retrieve centroids for fitted model
14
# A centroid is the cluster center, given as the vector of coordinates over the n_features dimensions
15
# So there will be K number of n_features-dimensional centroids
16
centroids = fit.cluster_centers_
17
18
# Retrieve cluster labels for fitted model
19
labels = fit.labels_
20
21
# Retrieve the inertia of the fitted model
22
inertia_k[k] = fit.inertia_
23
24
# Compute the WCSS of the fitted model
25
wcss_k[k] = compute_wcss(centroids, labels, scaled_data)
26
27
plt.title('WCSS vs. K')
28
plt.plot(k_range, list(wcss_k.values()), marker='o')
29
plt.xlabel('K')
30
plt.xticks(k_range)
31
plt.ylabel('WCSS')
32
plt.show();
Copied! The elbow method on the Old Faithful dataset
It is clear that 2 is giving the sought elbow.

## Apply the gap statistics method

Let's define some preliminary convenience functions (note that they make use of the ones defined above):
1
def compute_gap_statistic(samples_matrix,
2
k,
3
b=10,
4
scale_data=True):
5
"""
6
Compute the Gap Statistic for a k-means clustering.
7
b is the number of replicates to use (defaults to 10);
8
k is the chosen number of clusters;
9
scale_data (defaults to True): if to scale replicates samples matrix.
10
Return the gap statistic, the stdev of the Monte Carlo replicas and
11
the simulation error.
12
samples_matrix is a Numpy array.
13
"""
14
15
if scale_data:
16
samples_matrix = scale(samples_matrix)
17
fit = KMeans(n_clusters=k).fit(samples_matrix)
18
centroids = fit.cluster_centers_
19
labels = fit.labels_
20
Wk = compute_wcss(centroids, labels, samples_matrix)
21
Wkb = []
22
for i in range(b):
23
replicate = []
24
for sample in range(samples_matrix.shape):
25
replicate_sample = \
26
[random.uniform(min(samples_matrix[:, feature]),
27
max(samples_matrix[:, feature]))
28
for feature in range(samples_matrix.shape)]
29
replicate.append(replicate_sample)
30
replicate = np.array(replicate)
31
if scale_data:
32
replicate = scale(replicate)
33
fit = KMeans(n_clusters=k).fit(replicate)
34
replicate_centroids = fit.cluster_centers_
35
replicate_labels = fit.labels_
36
Wkb.append(compute_wcss(replicate_centroids,
37
replicate_labels,
38
replicate))
39
gap = np.mean(np.log(Wkb)) - np.log(Wk)
40
stdk = np.std(np.log(Wkb))
41
sk = math.sqrt(1 + 1. / b) * stdk
42
43
return gap, stdk, sk
Copied!
then we will run the method:
1
k_gap = {}
2
for k in k_range:
3
gap = compute_gap_statistic(scaled_data, k, scale_data=False)
4
k_gap[k] = gap
5
6
k_gapdiff = {}
7
for k in range(min(k_range), max(k_range)-1):
8
k_gapdiff[k] = k_gap[k] - k_gap[k+1] - k_gap[k + 1]
9
10
'The best k according to the gap statistic is: ', evaluate_gap_statistic_best_k(scaled_data)
Copied!
which yields an optimal k of 2. Let's see it with a few plots:
1
plt.plot(list(k_gap.keys()), [value for value in k_gap.values()])
2
plt.title('Gap statistic')
3
plt.ylabel('$G#x27;) 4 plt.xlabel('$K#x27;)
5
#plt.grid()
6
plt.show();
Copied! The gap statistics on the Old Failthful dataset Gap diff
And finally, let's plot the clustered points - you can choose the k desired
1
# Choose feature indices from the header
2
f1 = 1
3
f2 = 2
4
5
k = 4
6
7
# List of colours: if len is < k, add some
8
colours = {
9
'red': 'r',
10
'black': 'k',
11
'blue': 'b',
12
'green': 'g',
13
'grey': '#929090',
14
'pink': '#FFB6C1',
15
'light_blue': '#00BFFF',
16
'light_green': '#29E191'
17
}
18
19
plt.title('Scatter plot of two features, k clusters', fontweight='bold', fontsize=16)
20
plt.xlabel('eruptions')
21
plt.ylabel('waiting')
22
plt.grid()
23
24
# Note: plot displays the scaled data
25
for cluster_index in range(k):
26
cluster_color = colours[list(colours)[cluster_index]]
27
cluster_points = retrieve_cluster_points(cluster_index, labels, scaled_data)
28
plt.scatter([point[f1-1] for point in cluster_points],
29
[point[f2-1] for point in cluster_points], color=cluster_color)
30
plt.scatter(centroids[cluster_index,f1-1], centroids[cluster_index, f2-1], marker='x', s=200, color=cluster_color)
31
32
plt.show();
Copied! Clustered points in the Old Faithful dataset, k=4
1. 1.
S Lloyds, Least squares quantization in PCM, IEEE transactions on information theory, 28.2, 1982
2. 2.
Stack Overflow on the elbow method
3. 3.
This great blog on the gap statistic, with a Python implementation and examples
4. 4.
Tibshirani, Walther, Hastie, Estimating the number of clusters in a data set via the gap statistic, J R Statistic Society, 63:2, 2001
5. 5.
K-means is no free lunch, Variance Explained