
How Do I Automate The Number Of Clusters?

Edit: I accept that my question has been closed for being similar but I think the answers have provided valuable knowledge for others so this should be open. I've been playing wi

Solution 1:

This is a slippery field, because it is very difficult to measure how "good" a clustering is without any ground-truth labels. To select the number of clusters automatically, you need a metric that compares how KMeans performs for different values of n_clusters.

A popular choice is the silhouette score. The scikit-learn documentation describes it as follows:

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1.
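
To make the formula concrete, here is a minimal sketch that computes (b - a) / max(a, b) by hand for one sample and compares it with scikit-learn's silhouette_samples. The data values and the helper silhouette_of_sample are made up for illustration, and the helper assumes Euclidean distances and that the sample's cluster has at least two members.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny hand-made data set (hypothetical values): two separated groups on a line.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])


def silhouette_of_sample(X, labels, i):
    """Hypothetical helper: (b - a) / max(a, b) for sample i, Euclidean distance."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # the sample itself does not count towards a
    a = dists[same].mean()  # mean intra-cluster distance
    # b: smallest mean distance to the members of any other cluster
    b = min(dists[labels == k].mean() for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)


manual = silhouette_of_sample(X, labels, 0)
reference = silhouette_samples(X, labels)[0]
print(manual, reference)
```

For the first sample, a = (1 + 2) / 2 = 1.5 and b = (10 + 11 + 12) / 3 = 11, so the coefficient is 9.5 / 11.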

As a result, you can only compute the silhouette score for n_clusters >= 2, which unfortunately might be a limitation for you, given your problem description.
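
A quick way to see this limitation is to pass a labelling with a single cluster to silhouette_score, which raises a ValueError. This is only a small sketch on made-up data; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Four made-up points that are all assigned to the same cluster.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
single_cluster_labels = np.zeros(len(X), dtype=int)

try:
    silhouette_score(X, single_cluster_labels)
    raised = False
except ValueError as err:
    # scikit-learn refuses to score a labelling with fewer than 2 labels
    raised = True
    print("silhouette_score failed:", err)
```

So if "one cluster" is a valid answer for your data, you would have to handle that case separately from the silhouette comparison.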

Here is how you would use it on a toy data set (a reproducible example that you can then adapt to your own code):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

iris = load_iris()
X = iris.data

sil_score_max = -1  # this is the minimum possible score
best_n_clusters = None  # set on the first iteration, since any score is >= -1

# Try each candidate number of clusters and keep the one with the best score
for n_clusters in range(2, 10):
    model = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    labels = model.fit_predict(X)
    sil_score = silhouette_score(X, labels)
    print("The average silhouette score for %i clusters is %0.2f" % (n_clusters, sil_score))
    if sil_score > sil_score_max:
        sil_score_max = sil_score
        best_n_clusters = n_clusters

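This will print something like the following (the exact values can change between runs, because n_init=1 and no random_state is set):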

The average silhouette score for 2 clusters is 0.68
The average silhouette score for 3 clusters is 0.55
The average silhouette score for 4 clusters is 0.50
The average silhouette score for 5 clusters is 0.49
The average silhouette score for 6 clusters is 0.36
The average silhouette score for 7 clusters is 0.46
The average silhouette score for 8 clusters is 0.34
The average silhouette score for 9 clusters is 0.31

So you end up with best_n_clusters = 2. (NB: in reality, Iris has three classes, so the highest silhouette score does not necessarily recover the true number of groups.)
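
Since the loop above uses n_init=1 without a seed, the scores can differ from run to run. One possible variation, sketched below, wraps the search in a function, fixes random_state and uses several initialisations per candidate. The name pick_n_clusters and its defaults (n_init=10, random_state=0) are my own choices for illustration, not part of scikit-learn.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score


def pick_n_clusters(X, candidates=range(2, 10), random_state=0):
    """Hypothetical helper: return (best_k, scores), scores maps k -> silhouette."""
    scores = {}
    for k in candidates:
        # Several seeded initialisations make the result repeatable
        model = KMeans(n_clusters=k, init="k-means++", n_init=10,
                       random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    # Keep the candidate with the highest mean silhouette score
    best_k = max(scores, key=scores.get)
    return best_k, scores


X = load_iris().data
best_k, scores = pick_n_clusters(X)
for k, s in scores.items():
    print("k=%d: silhouette=%.2f" % (k, s))
print("best k:", best_k)
```

Returning the whole scores dictionary, rather than only the best value, lets you inspect how close the runner-up candidates are before trusting the automatic choice.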

