k-means
k-means is a clustering algorithm, not to be confused with kNN!
The code reported here is also available as a notebook in the repo. You will need some imports:
```python
import math
import random

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
from scipy.spatial.distance import euclidean
```

How it works

The k-means clustering algorithm needs you, the user, to choose $K$, the number of clusters you want to end up with. It is the fastest of the clustering methods and it always converges, though possibly to a local minimum.
Given a set of $n$ observations $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$, each living in a $d$-dimensional space ($\mathbf{x} \in \mathbb{R}^d$, $d$ being the number of features), the algorithm wants to partition them into $K \leq n$ clusters $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_K$ so as to minimise the within-cluster sum of squares (WCSS, $W$):

$$\boxed{W = \sum_{k=1}^{K} \sum_{\mathbf{x} \in \mathcal{C}_k} || \mathbf{x} - \mathbf{\mu}_k ||^2} \ ,$$

where $\mathbf{\mu}_k$ is the mean of the points falling in cluster $\mathcal{C}_k$ (the so-called centroid of the cluster), itself a vector in $\mathbb{R}^d$, each of its coordinates being the mean of that feature over the cluster.
In this horrible figure I did above, $K$ is two and the red crosses indicate the centroids.

The standard algorithm

The standard k-means algorithm is that of Lloyd (dating back to 1957, though it was only published outside of Bell Labs in 1982).
In short, the algorithm starts by initialising the $K$ centroids with points picked at random from the sample. It then proceeds iteratively, until convergence, through two steps:
  1. assignment
  2. update
The assignment step consists in assigning each sample point to the cluster whose centroid is the nearest; this procedure is equivalent to partitioning the samples according to the Voronoi diagram generated by the centroids (the Voronoi diagram is a way to partition a plane into regions based on the distance to a set of points). The update step calculates the new centroids from the new cluster configuration.
These two steps are executed one after the other until the centroids move by less than a chosen threshold. A minimal sketch of the procedure is given below.
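To make the two steps concrete, here is a minimal NumPy sketch of Lloyd's algorithm (a didactic version, not the scikit-learn implementation used later; the function lloyd_kmeans, its parameters and the random initialisation seed are my own choices):

```python
def lloyd_kmeans(X, k, tol=1e-6, max_iter=100, seed=0):
    """Didactic k-means (Lloyd's algorithm); X is a (n_samples, n_features) NumPy array."""
    rng = np.random.default_rng(seed)
    # Initialise the K centroids with points picked at random from the sample
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster whose centroid is nearest
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points
        # (empty clusters are not handled here, for brevity)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids move by less than the chosen threshold
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels
```

The scikit-learn KMeans used below does essentially this, with a smarter initialisation (k-means++) and several restarts.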

Further metrics

On top of the WCSS, we can use other metrics to assess and monitor the quality and the nature of our clustering.

Compactness

We can assess how "compact" our clustering is by summing up, for each cluster, all the intra-cluster (squared) distances, divided by twice the number of points in the cluster:

$$\boxed{c = \sum_{k=1}^K \frac{1}{2n_k} d_k} \ .$$

To expand on the above, $n_k$ is the number of points in cluster $\mathcal{C}_k$, identified by index $k$, with $k \in \{1, \ldots, K\}$, and $d_k$ is the sum of squared distances within the cluster:

$$d_k = \sum_{\mathbf{x} \in \mathcal{C}_k} \sum_{\mathbf{y} \in \mathcal{C}_k} || \mathbf{x} - \mathbf{y} ||^2 \ ,$$

which can also be written as ($\mathbf{\mu}_k$ being the centroid of the cluster)

$$d_k = 2 n_k \sum_{\mathbf{x} \in \mathcal{C}_k} || \mathbf{x} - \mathbf{\mu}_k ||^2 \ .$$
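As a quick sketch (not part of the original notebook), the compactness can be computed directly from its definition given the cluster labels; the helper name compute_compactness is mine, and it follows the labels/samples_matrix conventions used further down:

```python
def compute_compactness(labels, samples_matrix):
    """Compactness c = sum_k d_k / (2 n_k), d_k being the pairwise squared distances in cluster k."""
    c = 0.0
    for cluster_index in np.unique(labels):
        cluster_points = samples_matrix[labels == cluster_index]
        n_k = len(cluster_points)
        # d_k: sum over all ordered pairs (x, y) in the cluster of ||x - y||^2
        d_k = np.sum((cluster_points[:, None, :] - cluster_points[None, :, :]) ** 2)
        c += d_k / (2 * n_k)
    return c
```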

Proof

By definition, we have ($\mathbf{\mu}_k$ is the centroid of $\mathcal{C}_k$)

$$\mathbf{\mu}_k = \frac{\sum_{\mathbf{x} \in \mathcal{C}_k} \mathbf{x}}{n_k} \ ,$$

because each component of $\mathbf{\mu}_k$ is the mean of the corresponding components of the points in the cluster.
The first member, after expanding $||\mathbf{x} - \mathbf{y}||^2 = ||\mathbf{x}||^2 + ||\mathbf{y}||^2 - 2 \mathbf{x} \cdot \mathbf{y}$, is (we use, in the second line, the facts that $\sum_{\mathbf{x}} 1 = n_k$ and $\sum_{\mathbf{y}} \mathbf{y} = n_k \mathbf{\mu}_k$):

$$\begin{align} \sum_{\mathbf{x}} \sum_{\mathbf{y}} (||\mathbf{x}||^2 + ||\mathbf{y}||^2 - 2 \mathbf{x} \cdot \mathbf{y}) &= \sum_{\mathbf{x}} \sum_{\mathbf{y}} ||\mathbf{x}||^2 + \sum_{\mathbf{x}} \sum_{\mathbf{y}} ||\mathbf{y}||^2 - 2 \sum_{\mathbf{x}} \sum_{\mathbf{y}} \mathbf{x} \cdot \mathbf{y} \\ &= n_k \sum_{\mathbf{x}} ||\mathbf{x}||^2 + n_k \sum_{\mathbf{y}} ||\mathbf{y}||^2 - 2 n_k \sum_{\mathbf{x}} \mathbf{x} \cdot \mathbf{\mu}_k \\ &= 2 n_k \sum_{\mathbf{x}} ||\mathbf{x}||^2 - 2 n_k^2 ||\mathbf{\mu}_k||^2 \ . \end{align}$$

The second member is

$$\begin{align} 2 n_k \Big[ \sum_{\mathbf{x}} ||\mathbf{x}||^2 + \sum_{\mathbf{x}} ||\mathbf{\mu}_k||^2 - 2 \sum_{\mathbf{x}} \mathbf{x} \cdot \mathbf{\mu}_k \Big] &= 2 n_k \Big[ \sum_{\mathbf{x}} ||\mathbf{x}||^2 + n_k ||\mathbf{\mu}_k||^2 - 2 n_k ||\mathbf{\mu}_k||^2 \Big] \\ &= 2 n_k \Big[ \sum_{\mathbf{x}} ||\mathbf{x}||^2 - n_k ||\mathbf{\mu}_k||^2 \Big] \\ &= 2 n_k \sum_{\mathbf{x}} ||\mathbf{x}||^2 - 2 n_k^2 ||\mathbf{\mu}_k||^2 \ , \end{align}$$

so the two members are indeed equal.
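As a quick numerical sanity check of this identity (a sketch, not part of the original notebook):

```python
# Numerical check of d_k = 2 n_k * sum_x ||x - mu_k||^2 on a random "cluster"
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 2))   # 50 two-dimensional points
mu = points.mean(axis=0)            # their centroid
n_k = len(points)

d_k = np.sum((points[:, None, :] - points[None, :, :]) ** 2)       # pairwise form
rhs = 2 * n_k * np.sum(np.linalg.norm(points - mu, axis=1) ** 2)   # centroid form
print(np.isclose(d_k, rhs))  # True
```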

How to choose the best $K$?

Clearly, if I choose a $K$ that is too large, I end up with a clustering that is too fine and fails to show me similarities in the data. On the flip side, if I choose a $K$ that is too small, I also fail to separate similar points, for the reverse reason: I'm putting too much stuff together. How should I go about choosing the $K$ that best separates my points?
There are (at least) three ways: a rule of thumb, a heuristic one and a mathematical one.

The rule of thumb

The rule of thumb method suggests choosing $K \sim \sqrt{n/2}$. This is a (very) hand-waving method and gives an "approximate" clustering (it typically gives too many clusters). Its justification comes from the fact that it runs in linear time, so it should really only be taken as an indication when wanting to reduce the dataset size.

The elbow method

The elbow method is a heuristic which consists in looking at when adding a new cluster does not model the data any better.
By plotting $W$ as a function of $K$, it should be clear where the contribution of another cluster gives minimal further gain (the "elbow" point of the curve). The choice of this point remains somewhat subjective, as it depends on the situation and on what we, in that situation, consider a minimal further gain not worth pursuing.
See the Stack Overflow reference for a computational answer; a sketch of one such approach follows.
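One common computational approach is to pick the point of the curve farthest from the straight line joining its first and last points; the helper find_elbow below is mine (a sketch, and in practice one may want to normalise both axes first):

```python
def find_elbow(k_values, wcss_values):
    """Return the k whose (k, WCSS) point is farthest from the line joining the curve's endpoints."""
    points = np.column_stack([list(k_values), list(wcss_values)]).astype(float)
    start, end = points[0], points[-1]
    direction = (end - start) / np.linalg.norm(end - start)
    vecs = points - start
    # Perpendicular distance of each point from the endpoint-to-endpoint line (vector rejection)
    dists = np.linalg.norm(vecs - np.outer(vecs @ direction, direction), axis=1)
    return int(points[np.argmax(dists), 0])
```

It could be applied, for instance, to the wcss_k values computed in the "Apply the elbow method" section further down.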

The Gap Statistic method

It is a mathematical method presented in the original paper; you can read an explanation plus an implementation in the references. The authors define the gap statistic as

$$G(K) = \mathbb{E}[\log W^*_K] - \log W_K \ ,$$

where the asterisk refers to a null reference dataset. It is demonstrated that the best $K$ for clustering the data is the smallest $K$ such that

$$G(K) \geq G(K+1) - s_{K+1} \ ,$$

where $s_K$ is the simulation error, calculated from the standard deviation $\sigma_K$ of the $B$ Monte Carlo replicates $\log(W^*_K)$ as

$$s_K = \sqrt{1 + \frac{1}{B}} \, \sigma_K \ ,$$

and

$$\mathbb{E}[\log W^*_K] = \frac{1}{B} \sum_{b=1}^{B} \log W^*_{Kb} \ ,$$

with $W^*_{Kb}$ given by the clustering of the $B$ reference datasets.
The reference distribution is obtained by generating each reference feature randomly (uniformly) within the range of values of that feature, that is, within its bounding box. The expected value is then estimated by averaging the $B$ copies of $\log W^*_K$, each computed from a Monte Carlo sample $\mathbf{x}^*_1, \ldots, \mathbf{x}^*_n$ (the same size as the dataset) drawn from the reference distribution.
Because

$$\mathbb{E}[\log W^*_K] = \frac{1}{B} \log \prod_b W^*_{Kb} \ ,$$

then

$$G(K) = \log \left[ \frac{(\prod_b W^*_{Kb})^{\frac{1}{B}}}{W_K} \right] \ .$$

Inside the logarithm, the numerator is the geometric mean of the $W^*_{Kb}$: this last equation says that the gap statistic is the logarithm of the ratio of the geometric mean of the $W^*_{Kb}$ to $W_K$.
The standard deviation is

$$\sigma_K = \sqrt{\frac{1}{B} \sum_{b=1}^B \left(\log W^*_{Kb} - \mathbb{E}[\log W^*_K]\right)^2} \ .$$

Trying it!

We will use the Old Faithful dataset for this: a classic dataset reporting eruption durations and waiting times between eruptions of the Old Faithful geyser in Yellowstone. A copy of the data is available in the repo of this book for convenience.
```python
# Let's read the Old Faithful dataset into a Pandas DataFrame, choosing the columns
# (dataset holds the path to the data file)
data = pd.read_csv(dataset, delimiter=' ')[['eruptions', 'waiting']]
```
Let's scale the data so that each column lives on the same scale:
```python
scaled_data = scale(data)
```
And let's see what the data looks like:
```python
data.plot.scatter('eruptions', 'waiting')
plt.title('Scatter of the data in the "old faithful" dataset')
plt.show();
```
The Old Faithful dataset - you can eye two sets

Apply the rule of thumb

```python
thumb_k = np.sqrt(data.shape[0] / 2)
print(' * Rule of thumb asserts optimal k =', round(thumb_k, 2))
```
The rule of thumb suggests an optimal k of 12!

Apply the elbow method

You will see that the elbow appears quite soon. Let's define some functions first, for convenience:
```python
def retrieve_cluster_points(cluster_index, labels, samples_matrix):
    """
    In a clustering, retrieve the points (their coordinates)
    belonging to the given cluster. Return an array of such points.
    samples_matrix is a NumPy array.
    """
    cluster_points = []
    for sample_index in np.where(labels == cluster_index)[0]:
        cluster_points.append(samples_matrix[sample_index, :])

    return np.array(cluster_points)


def compute_wcss(centroids, labels, samples_matrix):
    """
    Compute the WCSS for a k-means clustering, given the
    centroids, the labels and the dataset matrix of samples.
    samples_matrix is a NumPy array.
    """
    k = len(centroids)
    wcss = 0
    for cluster_index in range(k):
        cluster_points = retrieve_cluster_points(cluster_index,
                                                 labels,
                                                 samples_matrix)
        # Squared distances to the centroid, as per the WCSS definition above
        wcss += sum([euclidean(point, centroids[cluster_index]) ** 2
                     for point in cluster_points])

    return wcss
```
and then run the method, plotting the result k by k:
```python
k_range = range(1, 21)  # range of K (decide arbitrarily how many Ks to test)
inertia_k = dict()      # inertia for each k
wcss_k = dict()         # WCSS for each k
p_k = dict()            # percentage of variance explained for each k

# Loop over the values of K
for k in k_range:
    print('k = ', k)

    # Fit the model
    fit = KMeans(n_clusters=k).fit(scaled_data)

    # Retrieve centroids for the fitted model
    # A centroid is the cluster centre, given as the vector of coordinates over the n_features dimensions,
    # so there will be K n_features-dimensional centroids
    centroids = fit.cluster_centers_

    # Retrieve cluster labels for the fitted model
    labels = fit.labels_

    # Retrieve the inertia of the fitted model
    inertia_k[k] = fit.inertia_

    # Compute the WCSS of the fitted model
    wcss_k[k] = compute_wcss(centroids, labels, scaled_data)

plt.title('WCSS vs. K')
plt.plot(k_range, list(wcss_k.values()), marker='o')
plt.xlabel('K')
plt.xticks(k_range)
plt.ylabel('WCSS')
plt.show();
```
The elbow method on the Old Faithful dataset
It is clear that $K = 2$ gives the sought elbow.

Apply the gap statistic method

Let's define a preliminary convenience function (note that it makes use of the ones defined above):
```python
def compute_gap_statistic(samples_matrix,
                          k,
                          b=10,
                          scale_data=True):
    """
    Compute the gap statistic for a k-means clustering.
    b is the number of replicates to use (defaults to 10);
    k is the chosen number of clusters;
    scale_data (defaults to True): whether to scale the replicate samples matrices.
    Return the gap statistic, the stdev of the Monte Carlo replicates and
    the simulation error.
    samples_matrix is a NumPy array.
    """
    if scale_data:
        samples_matrix = scale(samples_matrix)
    fit = KMeans(n_clusters=k).fit(samples_matrix)
    centroids = fit.cluster_centers_
    labels = fit.labels_
    Wk = compute_wcss(centroids, labels, samples_matrix)
    Wkb = []
    for i in range(b):
        # Build a reference dataset: each feature is sampled uniformly within its bounding box
        replicate = []
        for sample in range(samples_matrix.shape[0]):
            replicate_sample = \
                [random.uniform(min(samples_matrix[:, feature]),
                                max(samples_matrix[:, feature]))
                 for feature in range(samples_matrix.shape[1])]
            replicate.append(replicate_sample)
        replicate = np.array(replicate)
        if scale_data:
            replicate = scale(replicate)
        fit = KMeans(n_clusters=k).fit(replicate)
        replicate_centroids = fit.cluster_centers_
        replicate_labels = fit.labels_
        Wkb.append(compute_wcss(replicate_centroids,
                                replicate_labels,
                                replicate))
    gap = np.mean(np.log(Wkb)) - np.log(Wk)
    stdk = np.std(np.log(Wkb))
    sk = math.sqrt(1 + 1. / b) * stdk

    return gap, stdk, sk
```
then we will run the method:
```python
k_gap = {}
for k in k_range:
    gap = compute_gap_statistic(scaled_data, k, scale_data=False)
    k_gap[k] = gap

# Gap differences G(K) - (G(K+1) - s_{K+1}): the best k is the smallest one
# for which this quantity is non-negative
k_gapdiff = {}
for k in range(min(k_range), max(k_range)):
    k_gapdiff[k] = k_gap[k][0] - (k_gap[k + 1][0] - k_gap[k + 1][2])

best_k = min(k for k, diff in k_gapdiff.items() if diff >= 0)
print('The best k according to the gap statistic is:', best_k)
```
which yields an optimal k of 2. Let's see it with a few plots:
```python
plt.plot(list(k_gap.keys()), [value[0] for value in k_gap.values()])
plt.title('Gap statistic')
plt.ylabel('$G$')
plt.xlabel('$K$')
#plt.grid()
plt.show();
```
The gap statistic on the Old Faithful dataset
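The "Gap diff" figure below is not produced by the snippet above; a minimal sketch to obtain a similar plot from the k_gapdiff dictionary (rendered here as a bar chart, purely as an illustration) could be:

```python
# Plot the gap differences G(K) - (G(K+1) - s_{K+1}); the best K is the first one
# for which the bar is non-negative
plt.bar(list(k_gapdiff.keys()), list(k_gapdiff.values()))
plt.title('Gap differences')
plt.xlabel('$K$')
plt.ylabel('$G(K) - (G(K+1) - s_{K+1})$')
plt.show();
```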
Gap diff
And finally, let's plot the clustered points. You can choose the desired k; note that we re-fit the model for that k to get its labels and centroids.
```python
# Choose feature indices from the header
f1 = 1
f2 = 2

k = 4

# Fit the model for the chosen k, to get its labels and centroids
fit = KMeans(n_clusters=k).fit(scaled_data)
centroids = fit.cluster_centers_
labels = fit.labels_

# List of colours: if len is < k, add some
colours = {
    'red': 'r',
    'black': 'k',
    'blue': 'b',
    'green': 'g',
    'grey': '#929090',
    'pink': '#FFB6C1',
    'light_blue': '#00BFFF',
    'light_green': '#29E191'
}

plt.title('Scatter plot of two features, k clusters', fontweight='bold', fontsize=16)
plt.xlabel('eruptions')
plt.ylabel('waiting')
plt.grid()

# Note: the plot displays the scaled data
for cluster_index in range(k):
    cluster_color = colours[list(colours)[cluster_index]]
    cluster_points = retrieve_cluster_points(cluster_index, labels, scaled_data)
    plt.scatter([point[f1 - 1] for point in cluster_points],
                [point[f2 - 1] for point in cluster_points], color=cluster_color)
    plt.scatter(centroids[cluster_index, f1 - 1], centroids[cluster_index, f2 - 1],
                marker='x', s=200, color=cluster_color)

plt.show();
```
Clustered points in the Old Faithful dataset, k=4

References

  1. S. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory, 28:2, 1982
  2. Stack Overflow on the elbow method
  3. This great blog on the gap statistic, with a Python implementation and examples
  4. Tibshirani, Walther, Hastie, Estimating the number of clusters in a data set via the gap statistic, Journal of the Royal Statistical Society B, 63:2, 2001
  5. K-means is no free lunch, Variance Explained