K-Means Clustering#
The idea behind k-means clustering is to choose a number of clusters, \(k\), together with initial centroids for those clusters, and then iteratively refine the cluster assignments and centroids until a maximum number of iterations is reached or the cluster assignments no longer change. Detailed mathematical developments can be found in [1].
Specifically, given a dataset of points \(\mathcal{D}= \{x_1, \dots, x_n\}\) where \(x_i\in\mathbb{R}^d\) and an integer parameter \(k\geq2\), we pick \(k\) random vectors \(c_1, \dots, c_k\in \mathbb{R}^d\) which serve as candidate centroids and iteratively perform the following steps:
- E step: this step updates the cluster assignments.
- M step: this step updates the centroids.
Adopting the convention that \(d(x_j, c_\ell) = ||x_j-c_\ell||_2^2\) and letting \(C(j)\in\{1,\dots, k\}\) denote the cluster assignment of data point \(x_j\), adding up over all of the clusters we penalize a given assignment \(C\) via the following objective function: $$ J(C; c_1,\dots,c_k) = \sum_{\ell=1}^{k}\sum_{j=1}^{n} A_{j\ell}\, d(x_j, c_\ell) = \sum_{\ell=1}^{k}\sum_{j=1}^{n} A_{j\ell}\, ||x_j-c_\ell||_2^2 $$
where \(A_{j\ell}\) is an indicator (0/1 variable) indicating whether observation \(j\) is assigned to cluster \(\ell\). Concretely, \(A_{j\ell} = 1\) if \(C(j)=\ell\) and \(A_{j\ell} = 0\) otherwise.
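As a quick numerical check, the objective \(J\) can be computed directly in NumPy; the function name kmeans_objective and the arrays X, centroids, and assignments below are hypothetical placeholders for \(x_j\), \(c_\ell\), and \(C(j)\):
import numpy as np
def kmeans_objective(X, centroids, assignments):
    # X: (n, d) data, centroids: (k, d), assignments: length-n array with C(j) in {0, ..., k-1}
    diffs = X - centroids[assignments]  # x_j - c_{C(j)} for every observation j
    return float(np.sum(diffs ** 2))    # J = sum_j ||x_j - c_{C(j)}||_2^2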
Looking at the objective function we can see that it separates over clusters; thus, for fixed centroids, we can minimize \(J\) by assigning each \(x_j\) to its closest centroid. Conversely, for a fixed assignment, we can minimize \(J\) within each cluster by updating each centroid to the mean of the points assigned to it.
Thus the algorithm to minimize \(J\) alternates between two steps:
1) E step This step is done first and assigns each point to its closest centroid: $$ C(j)=\underset{r=1,\dots, k}{\text{argmin }} d(x_j,c_r) $$
2) M step We update the centroids by computing the average of the points in each cluster: $$ c_\ell =\frac{\sum_{j=1}^n A_{j\ell}x_j}{N_\ell} $$ where \(N_\ell = \sum_{j=1}^nA_{j\ell}\), that is, the number of points in the \(\ell\)-th cluster.
The \(M\) step is derived by taking the gradient of \(J\) with respect to each centroid, \(\nabla_{c_\ell}J\), and setting it to zero. According to [1] the algorithm is guaranteed to converge, though not necessarily to a global minimum.
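To make the two steps concrete, here is a minimal from-scratch sketch of the loop in NumPy. This is not the scikit-learn implementation used below; the function name kmeans, the fixed iteration cap, and the initialization choice (sampling \(k\) data points as the starting centroids) are illustrative assumptions.
import numpy as np
def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)].astype(float)  # start from k random data points
    labels = None
    for _ in range(n_iter):
        # E step: assign each point to its closest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments no longer change
        labels = new_labels
        # M step: move each non-empty cluster's centroid to the mean of its assigned points
        for ell in range(k):
            if np.any(labels == ell):
                centroids[ell] = X[labels == ell].mean(axis=0)
    return labels, centroids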
\(K\)-means works as long as the metric \(d\) is differentiable with respect to \(c_\ell\). That is, if we replace \(||x_j-c_\ell||^2\) with any differentiable \(d(x_j, c_\ell)\) we can still use \(k\)-means. Of particular interest is the case where the metric is an inner-product norm derived from a positive definite matrix \(M\), that is: $$ d(x_j,c_\ell) = (x_j-c_\ell)^TM(x_j-c_\ell) $$ as in the case of metric learning (see [2] for inner product norms and [3] for metric learning).
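For instance, a distance of this inner-product form can be computed as below; the function name inner_product_distance and the matrix M (made positive definite by construction) are chosen purely for illustration:
import numpy as np
def inner_product_distance(x, c, M):
    diff = x - c
    return diff @ M @ diff  # (x - c)^T M (x - c)
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
M = A.T @ A + np.eye(5)  # positive definite by construction
x, c = rng.normal(size=5), rng.normal(size=5)
print(inner_product_distance(x, c, M))          # generalized squared distance
print(inner_product_distance(x, c, np.eye(5)))  # M = I recovers ||x - c||_2^2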
Worked Example#
We follow the same setup as in the worked example for spectral clustering. Specifically, we simulate expression data by generating two groups from negative binomial distributions and then applying a log transform.
from sklearn.cluster import KMeans
from scipy.stats import nbinom
import numpy as np
seed = 1234 # seed to get reproducible behavior
n1, p1 = 10, 0.3 # negative binomial parameters for group 1
n2, p2 = 15, 0.5 # negative binomial parameters for group 2
g1 = nbinom.rvs(n1, p1, size=50, random_state=seed).reshape(10, 5) # simulated expression for distribution 1
g2 = nbinom.rvs(n2, p2, size=50, random_state=seed).reshape(10, 5) # simulated expression for distribution 2
X = np.log2(np.vstack((g1, g2)) + 0.001) # stack both groups and log-transform (small offset avoids log2(0))
true_labels = 10*[0] + 10*[1] # first 10 rows come from group 1, last 10 from group 2
Running the algorithm and checking the predicted clusters is simple:
clustering_algo = KMeans(n_clusters=2, random_state=seed) # fix random_state so the fitted labels are reproducible
clustering_algo.fit(X)
print(true_labels)
print(clustering_algo.labels_)
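Because k-means labels its clusters arbitrarily (the group labeled 0 here may be labeled 1 on another run), comparing the printed labels to true_labels by eye can be misleading. One option is a permutation-invariant score such as the adjusted Rand index from scikit-learn:
from sklearn.metrics import adjusted_rand_score
# 1.0 means perfect agreement up to a relabeling of the clusters; values near 0 indicate chance-level agreement
print(adjusted_rand_score(true_labels, clustering_algo.labels_))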