K-Means Clustering#
The idea behind k-means clustering is to choose a number of clusters, \(k\), together with initial centroids for those clusters, and then iteratively refine the cluster assignments and centroids until a maximum number of iterations is reached or the cluster assignments no longer change. Detailed mathematical developments can be found in [1].
Specifically, given a dataset of points \(\mathcal{D}= \{x_1, \dots, x_n\}\) where \(x_i\in\mathbb{R}^d\) and an integer parameter \(k\geq2\), we pick \(k\) random vectors \(c_1, \dots, c_k\in \mathbb{R}^d\) which serve as candidate centroids and iteratively perform the following steps:
- E step: this step updates the cluster assignments.
- M step: this step updates the centroids.
Adopting the convention that \(d(x_j, c_\ell) = ||x_j-c_\ell||_2^2\) and letting \(C(j)\in\{1,\dots, k\}\) denote the cluster assignment of data point \(x_j\), adding up over all of the clusters we penalize a given assignment \(C\) via the following objective function: $$ J(C; c_1,\dots,c_k) = \sum_{\ell=1}^{k}\sum_{j=1}^{n} A_{j\ell}\, d(x_j, c_\ell) = \sum_{\ell=1}^{k}\sum_{j=1}^{n} A_{j\ell}\, ||x_j-c_\ell||_2^2 $$
where \(A_{j\ell}\) is an indicator (0/1 variable) indicating whether observation \(j\) is assigned to cluster \(\ell\). Concretely, \(A_{j\ell} = 1\) if \(C(j)=\ell\) and \(A_{j\ell} = 0\) otherwise.
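As a quick numerical check, the objective \(J\) can be computed directly in NumPy; the function name kmeans_objective and the arrays X, centroids, and assignments below are hypothetical placeholders for \(x_j\), \(c_\ell\), and \(C(j)\):
import numpy as np
def kmeans_objective(X, centroids, assignments):
    # X: (n, d) data, centroids: (k, d), assignments: length-n array with C(j) in {0, ..., k-1}
    diffs = X - centroids[assignments]  # x_j - c_{C(j)} for every observation j
    return float(np.sum(diffs ** 2))    # J = sum_j ||x_j - c_{C(j)}||_2^2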
Looking at the objective function we can see that it separates over clusters; thus, for fixed centroids, we can minimize \(J\) by assigning each \(x_j\) to its closest centroid. Conversely, for a fixed assignment, we can minimize \(J\) within each cluster by updating each centroid to the mean of the points assigned to it.
Thus the algorithm to minimize \(J\) alternates between two steps:
1) E step This step is done first and assigns each point to its closest centroid: $$ C(j)=\underset{r=1,\dots, k}{\text{argmin }} d(x_j,c_r) $$
2) M step We update the centroids by computing the average of the points in each cluster: $$ c_\ell =\frac{\sum_{j=1}^n A_{j\ell}x_j}{N_\ell} $$ where \(N_\ell = \sum_{j=1}^nA_{j\ell}\), that is, the number of points in the \(\ell\)-th cluster.
The \(M\) step is derived by taking the gradient of \(J\) with respect to each centroid, \(\nabla_{c_\ell}J\), and setting it to zero. According to [1] the algorithm is guaranteed to converge, though not necessarily to a global minimum.
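To make the two steps concrete, here is a minimal from-scratch sketch of the loop in NumPy. This is not the scikit-learn implementation used below; the function name kmeans, the fixed iteration cap, and the initialization choice (sampling \(k\) data points as the starting centroids) are illustrative assumptions.
import numpy as np
def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)].astype(float)  # start from k random data points
    labels = None
    for _ in range(n_iter):
        # E step: assign each point to its closest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments no longer change
        labels = new_labels
        # M step: move each non-empty cluster's centroid to the mean of its assigned points
        for ell in range(k):
            if np.any(labels == ell):
                centroids[ell] = X[labels == ell].mean(axis=0)
    return labels, centroids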
\(K\)-means works as long as the metric \(d\) is differentiable with respect to \(c_\ell\). That is, if we replace \(||x_j-c_\ell||^2\) with any differentiable \(d(x_j, c_\ell)\) we can still use \(k\)-means. Of particular interest is the case where the metric is an inner-product norm derived from a positive definite matrix \(M\), that is: $$ d(x_j,c_\ell) = (x_j-c_\ell)^TM(x_j-c_\ell) $$ as in the case of metric learning (see [2] for inner product norms and [3] for metric learning).
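For instance, a distance of this inner-product form can be computed as below; the function name inner_product_distance and the matrix M (made positive definite by construction) are chosen purely for illustration:
import numpy as np
def inner_product_distance(x, c, M):
    diff = x - c
    return diff @ M @ diff  # (x - c)^T M (x - c)
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
M = A.T @ A + np.eye(5)  # positive definite by construction
x, c = rng.normal(size=5), rng.normal(size=5)
print(inner_product_distance(x, c, M))          # generalized squared distance
print(inner_product_distance(x, c, np.eye(5)))  # M = I recovers ||x - c||_2^2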
Worked Example#
We follow the same setup as in the worked example for spectral clustering. Specifically, we simulate expression data by generating two groups from negative binomial distributions and then applying a log transform.
from sklearn.cluster import KMeans
from scipy.stats import nbinom
import numpy as np
seed = 1234 # seed to get reproducible behavior
n1, p1 = 10, 0.3 # negative binomial parameters for group 1
n2, p2 = 15, 0.5 # negative binomial parameters for group 2
g1 = nbinom.rvs(n1, p1, size=50, random_state=seed).reshape(10, 5) # simulated expression for distribution 1
g2 = nbinom.rvs(n2, p2, size=50, random_state=seed).reshape(10, 5) # simulated expression for distribution 2
X = np.log2(np.vstack((g1, g2)) + 0.001) # stack both groups and log-transform (small offset avoids log2(0))
true_labels = 10*[0] + 10*[1] # first 10 rows come from group 1, last 10 from group 2
Running the algorithm and checking the predicted clusters is simple:
clustering_algo = KMeans(n_clusters=2, random_state=seed) # fix random_state so the fitted labels are reproducible
clustering_algo.fit(X)
print(true_labels)
print(clustering_algo.labels_)
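Because k-means labels its clusters arbitrarily (the group labeled 0 here may be labeled 1 on another run), comparing the printed labels to true_labels by eye can be misleading. One option is a permutation-invariant score such as the adjusted Rand index from scikit-learn:
from sklearn.metrics import adjusted_rand_score
# 1.0 means perfect agreement up to a relabeling of the clusters; values near 0 indicate chance-level agreement
print(adjusted_rand_score(true_labels, clustering_algo.labels_))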