The k-means is a clustering algorithm. Not to be confused with kNN!
For the code reported here, you can head to the notebook present in the repo. You will need some imports:
import math
import random
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
from scipy.spatial.distance import euclidean

How it works

The k-means clustering algorithm needs you, the user, to choose
, which is the number of clusters you want to end up with. It is the fastest of the clustering methods and it always converges, though it may do to a local minimum.
Given a set of
(x1,x2,,xn)(\mathbf{x_1}, \mathbf{x_2}, \ldots, \mathbf{x_n})
, each living in a
-dimensional space (
xRd\mathbf{x} \in \mathbb{R}^d
), where
are the features, the algorithms wants to partition them into
KnK \leq n
C1,C2,,Cn{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_n}
so to minimise the within-cluster-sum-of-squares (WCSS,
W=k=1KxCkxμk2 ,\boxed{W = \sum_{k=1}^{K} \sum_{\mathbf{x} \in \mathcal{C}_k} || \mathbf{x} - \mathbb{\mu}_k ||^2} \ ,
is the mean of the points falling in cluster
(the so-called centroid of the cluster), itself a vector in
, each of its coordinates being the mean for that feature in the cluster.
In this horrible figure I did above,
is two and the red crosses indicate the centroids.

The standard algorithm

The standard k-means algorithm is that of Lloyds (dated 1957, was only published outside of Bell Labs in 1982).
In short, the algorithms starts by initialising the
centroids with points picked randomly from the sample. It then proceeds iteratively, until convergence, through steps called
  1. 1.
  2. 2.
The assignment step consists in assigning each sample point to the cluster whose centroid is the nearest; this procedure is equivalent to partitioning the samples according to the Voronoi diagram generated by the centroids. The Voronoi diagram is a way to partition a plane in regions based on the distance to points. The update step calculates the new centroids with the new cluster configuration.
These steps are executed one after the other until the difference between new and old centroids does not exceed a chosen threshold.

Further metrics

On top of the WCSS, we can use other metrics to assess and monitor the quality and the nature for our clustering.


We can assess how "compact" our clustering is via summing up all the intra-cluster distances for each cluster, divided by twice the number of points in each of them:
c=k=1K12nkdk .\boxed{c = \sum_{k=1}^K \frac{1}{2n_k} d_k} \ .
To expand on the above,
is the number of points in cluster
identified by index
, with
k1,,Kk \in {1, \ldots, K}
, and
is the sum of distances within the cluster:
dk=xCkyCkxy2 ,d_k = \sum_{\mathbf{x} \in \mathcal{C}_k} \sum_{\mathbf{y} \in \mathcal{C}_k} || \mathbf{x} - \mathbf{y} ||^2 \ ,
and it can be written as (
being the centroid of the cluster)
2nkxCkxμk2 .2 n_k \sum_{\mathbf{x} \in C_k} || \mathbf{x} - \mathbf{\mu}_k ||^2 \ .


By definition, we have (
is the centroid of
μk=xCkxnk\mathbf{\mu}_k = \frac{\sum_{\mathbf{x} \in \mathcal{C}_k} \mathbf{x}}{n_k}
because each component of
is the mean of the corresponding components of the points in the cluster.
The first member is (we use the fact, in the second line, that
x1=nk\sum_{\mathbf{x}} 1 = n_k
xy(x2+y22xy)=xyx2+xyy22xyxy=nkxx2+nkyy22nkxxμk=2nkxx22nk2μk2\begin{align} \sum_{\mathbf{x}} \sum_{\mathbf{y}} (||\mathbf{x}||^2 + ||\mathbf{y}||^2 - 2 \mathbf{x} \cdot \mathbf{y}) &= \sum_{\mathbf{x}} \sum_{\mathbf{y}} ||\mathbf{x}||^2 + \sum_{\mathbf{x}} \sum_{\mathbf{y}} ||\mathbf{y}||^2 - 2 \sum_{\mathbf{x}} \sum_{\mathbf{y}} \mathbf{x} \cdot \mathbf{y} \\ &= n_k \sum_{\mathbf{x}} ||\mathbf{x}||^2 + n_k \sum_{\mathbf{y}} ||\mathbf{y}||^2 -2 n_k \sum_{\mathbf{x}} \mathbf{x} \cdot \mathbf{\mu}_k \\ &= 2 n_k \sum_{\mathbf{x}} ||\mathbf{x}||^2 - 2n_k^2 ||\mu_k||^2 \end{align}
The second member is
2nk[xx2+xμk22xxμk]=2nk[xx2+nkμk22nkμk2]=2nk[xx2nkμk2]=2nkxx22nk2μk2 ,\begin{align} 2 n_k \big[ \sum_{\mathbf{x}} ||\mathbf{x}||^2 + \sum_{\mathbf{x}} ||\mu_k||^2 -2 \sum_{\mathbf{x}} \mathbf{x} \cdot \mathbf{\mu}_k \big] &= 2 n_k \big[ \sum_{\mathbf{x}} ||\mathbf{x}||^2 + n_k ||\mu_k||^2 - 2 n_k ||\mu_k||^2 \big] \\ &= 2 n_k \big[ \sum_{\mathbf{x}} ||\mathbf{x}||^2 - n_k ||\mu_k||^2 \big] \\ &= 2 n_k \sum_{\mathbf{x}} ||\mathbf{x}||^2 - 2 n_k^2 ||\mu_k||^2 \ , \end{align}
so the two things are the same.

How to choose the best

Clearly, if I choose a too large
, I end up with a too-fine clustering which fails in showing me similarities in the data. On the flip side, If I choose a too small
, I also fail in separating similar points for the reverse reason, as I'm putting too much stuff together. How should I go with choosing the
that best separates my points?
There can be (at least) three ways: one rule of thumb, one heuristic and one mathematical.

The rule of thumb

The rule of thumb method suggests choosing
Kn/2K \sim \sqrt{n/2}
. This is a (very) hand-waving method, and gives an "approximate" clustering (it typically gives too many clusters). The justification behind it comes from the fact that it runs in linear time so it really should only be taken as an indication when wanting to reduce dataset size.

The elbow method

The elbow method is a heuristic method consisting in looking at when adding a new cluster does not model the data any better.
By plotting
as a function of
, it should be clear where the contribution of another cluster gives minimal further gain (the "elbow" point of the curve). It is clear that the choice of this point is kind of subjective anyway, as it depends on the situation and on the what we, in that situation, consider a minimal further gain not worth pursuing.
See the reference about Stack Overflow for a computational answer.

The Gap Statistic method

It is a mathematical presented in the original paper, you can read an explanation plus implementation in the references. The authors define the gap statistic as
G(K)=E[logWK]logWK ,G(K) = \mathbb{E}[\log W^*_K] - \log W_K \ ,
where the asterisk refers to a null reference dataset. It is demonstrated that the best
clustering the data is the smallest
such that
G(K)G(K+1)sK+1 ,G(K) \geq G(K+1) - s_{K+1} \ ,
is the simulation error calculated from the standard deviation
of B Monte Carlo replicates
, as
sK=1+1BσK ,s_K = \sqrt{1 + \frac{1}{B}} \sigma_K \ ,
E[logWK]=1Bb=1BlogWKb ,\mathbb{E}[\log W^*_K] = \frac{1}{B} \sum_{b=1}^{B} \log W^*_{Kb} \ ,
given by the clustering of
reference datasets.
The reference distribution is obtained by generating each reference feature randomly (uniformly) within the range values for that feature, that is, within the bounding box for that feature. Then the expected value is estimated by an average of
logWK\log W^*_K
, each computed from a Monte Carlo sample
x1,,xnx^*_1, \ldots, x^*_n
, drawn from a reference distribution (x is the dataset matrix).
E[logWK]=1BlogΠbWKb ,\mathbb{E}[\log W^*_K] = \frac{1}{B} \log \Pi_b W^*_{Kb} \ ,
G(K)=log[(ΠbWKb)1BWK]G(K) = \log \left[ \frac{(\Pi_b W^*_{Kb})^{\frac{1}{B}}}{W_K} \right]
In the logarithm, at the denominator, we got the geometric mean of
. This last equation is saying that the gap statistics is the logarithm of the ratio of the geometric mean of
The standard deviation is
σK=1Bb=1B(logWKbE[logWKb])2\sigma_K = \sqrt{\frac{1}{B} \sum_{b=1}^B (\log W^*_{Kb} - \mathbb{E}[\log W^*_{Kb}])^2 }

Trying it!

We will use the Old Faithful dataset for this, which reports some eruption times and waiting times between eruptions of the Old Faithful geyser in Yellowstone, a classic dataset. A copy of the data is available in the repo for this book for convenience.
# Let's read the Old Faithful dataset into a Pandas Dataframe, choosing the columns
data = pd.read_csv(dataset, delimiter=' ')[['eruptions', 'waiting']]
Let's scale the data so that each column lives on the same scale:
scaled_data = scale(data)
And let's see how the data looks like:
data.plot.scatter('eruptions', 'waiting')
plt.title('Scatter of the data in the "old faithful" dataset');
The Old Faithful dataset - you can eye two sets

Apply the rule of thumb

thumb_k = np.sqrt(data.shape[0]/2)
print(' * Rule of thumb asserts optimal k =', round(thumb_k, 2))
The rule of thumb suggests an optimal k of 12!

Apply the elbow method

You will see the elbow will appear quite soon. Let's define some functions first, for convenience:
def retrieve_cluster_points(cluster_index, labels, samples_matrix):
In a clustering, retrieve points (their coordinates)
belonging to given cluster. Return array of such points.
samples_matrix is a Numpy array.
cluster_points = []
for sample_index in np.where(labels == cluster_index)[0]:
cluster_points.append(samples_matrix[sample_index, :])
cluster_points = np.array(cluster_points)
return np.array(cluster_points)
def compute_wcss(centroids, labels, samples_matrix):
Compute the WCSS for a k-means clustering, given the
centroids and the dataset matrix of samples.
samples_matrix is a Numpy array.
k = len(centroids)
wcss = 0
for cluster_index in range(k):
cluster_points = retrieve_cluster_points(cluster_index,
wcss += sum([euclidean(point, centroids[cluster_index])
for point in cluster_points])
return wcss
and then run the method, plotting the result k by k:
k_range = range(1, 21) # range of K (decide arbitrarily how many Ks to test)
inertia_k = dict() # inertia for each k
wcss_k = dict() # WCSS for each k
p_k = dict() # percentage of variance explained for each k
# Loop over the value of K
for k in k_range:
print('k = ', k)
# Fit the model
fit = KMeans(n_clusters=k).fit(scaled_data)
# Retrieve centroids for fitted model
# A centroid is the cluster center, given as the vector of coordinates over the n_features dimensions
# So there will be K number of n_features-dimensional centroids
centroids = fit.cluster_centers_
# Retrieve cluster labels for fitted model
labels = fit.labels_
# Retrieve the inertia of the fitted model
inertia_k[k] = fit.inertia_
# Compute the WCSS of the fitted model
wcss_k[k] = compute_wcss(centroids, labels, scaled_data)
plt.title('WCSS vs. K')
plt.plot(k_range, list(wcss_k.values()), marker='o')
The elbow method on the Old Faithful dataset
It is clear that 2 is giving the sought elbow.

Apply the gap statistics method

Let's define some preliminary convenience functions (note that they make use of the ones defined above):
def compute_gap_statistic(samples_matrix,
Compute the Gap Statistic for a k-means clustering.
b is the number of replicates to use (defaults to 10);
k is the chosen number of clusters;
scale_data (defaults to True): if to scale replicates samples matrix.
Return the gap statistic, the stdev of the Monte Carlo replicas and
the simulation error.
samples_matrix is a Numpy array.
if scale_data:
samples_matrix = scale(samples_matrix)
fit = KMeans(n_clusters=k).fit(samples_matrix)
centroids = fit.cluster_centers_
labels = fit.labels_
Wk = compute_wcss(centroids, labels, samples_matrix)
Wkb = []
for i in range(b):
replicate = []
for sample in range(samples_matrix.shape[0]):
replicate_sample = \
[random.uniform(min(samples_matrix[:, feature]),
max(samples_matrix[:, feature]))
for feature in range(samples_matrix.shape[1])]
replicate = np.array(replicate)
if scale_data:
replicate = scale(replicate)
fit = KMeans(n_clusters=k).fit(replicate)
replicate_centroids = fit.cluster_centers_
replicate_labels = fit.labels_
gap = np.mean(np.log(Wkb)) - np.log(Wk)
stdk = np.std(np.log(Wkb))
sk = math.sqrt(1 + 1. / b) * stdk
return gap, stdk, sk
then we will run the method:
k_gap = {}
for k in k_range:
gap = compute_gap_statistic(scaled_data, k, scale_data=False)
k_gap[k] = gap
k_gapdiff = {}
for k in range(min(k_range), max(k_range)-1):
k_gapdiff[k] = k_gap[k][0] - k_gap[k+1][0] - k_gap[k + 1][2]
'The best k according to the gap statistic is: ', evaluate_gap_statistic_best_k(scaled_data)
which yields an optimal k of 2. Let's see it with a few plots:
plt.plot(list(k_gap.keys()), [value[0] for value in k_gap.values()])
plt.title('Gap statistic')
The gap statistics on the Old Failthful dataset
Gap diff
And finally, let's plot the clustered points - you can choose the k desired
# Choose feature indices from the header
f1 = 1
f2 = 2
k = 4
# List of colours: if len is < k, add some
colours = {
'red': 'r',
'black': 'k',
'blue': 'b',
'green': 'g',
'grey': '#929090',
'pink': '#FFB6C1',
'light_blue': '#00BFFF',
'light_green': '#29E191'
plt.title('Scatter plot of two features, k clusters', fontweight='bold', fontsize=16)
# Note: plot displays the scaled data
for cluster_index in range(k):
cluster_color = colours[list(colours)[cluster_index]]
cluster_points = retrieve_cluster_points(cluster_index, labels, scaled_data)
plt.scatter([point[f1-1] for point in cluster_points],
[point[f2-1] for point in cluster_points], color=cluster_color)
plt.scatter(centroids[cluster_index,f1-1], centroids[cluster_index, f2-1], marker='x', s=200, color=cluster_color);
Clustered points in the Old Faithful dataset, k=4


  1. 1.
    S Lloyds, Least squares quantization in PCM, IEEE transactions on information theory, 28.2, 1982
  2. 2.
    Stack Overflow on the elbow method
  3. 3.
    This great blog on the gap statistic, with a Python implementation and examples
  4. 4.
    Tibshirani, Walther, Hastie, Estimating the number of clusters in a data set via the gap statistic, J R Statistic Society, 63:2, 2001
  5. 5.
    K-means is no free lunch, Variance Explained