How Do I Automate The Number Of Clusters?
Solution 1:
This is a slippery field, because it is very difficult to measure how "good" a clustering is without any ground truth labels. To select the number of clusters automatically, you need a metric that compares how KMeans performs for different values of n_clusters.
A popular choice is the silhouette score. The scikit-learn documentation describes it as follows:
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1.
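To make the formula concrete, here is a minimal sketch that computes (b - a) / max(a, b) by hand on a tiny made-up data set and compares it with scikit-learn's silhouette_samples. The data values and the helper name silhouette_by_hand are assumptions for illustration only, and the helper assumes every cluster has at least two samples:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 1-D data: two tight groups (values are made up for illustration).
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels, i):
    """Return (b - a) / max(a, b) for sample i (hypothetical helper)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the sample itself from its own cluster mean
    a = dists[same].mean()  # mean intra-cluster distance
    # mean distance to each other cluster; keep the nearest one
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_by_hand(X, labels, i) for i in range(len(X))])
reference = silhouette_samples(X, labels)
print(manual)
print(reference)
```

For the first sample, a = 1 and b = (10 + 11) / 2 = 10.5, so its coefficient is 9.5 / 10.5. The score used for model selection is the mean of these per-sample values.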
As a result, you can only compute the silhouette score for n_clusters >= 2 (which, unfortunately, might be a limitation for you given your problem description).
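A quick way to see this limitation is to pass a labelling with a single cluster to silhouette_score. This is a small sketch on made-up data; scikit-learn rejects the input with a ValueError rather than returning a score:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Made-up data where every sample is assigned to the same cluster.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
single_cluster_labels = np.zeros(len(X), dtype=int)

try:
    silhouette_score(X, single_cluster_labels)
    raised = False
except ValueError as err:
    # The score is undefined for fewer than 2 labels.
    raised = True
    print("silhouette_score failed:", err)
```

So if "one cluster" is a valid answer for your data, you need a separate check for that case; the silhouette score alone cannot select it.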
This is how you would use it on a toy data set (it is only meant as a reproducible example that you can adapt to your own code):
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

iris = load_iris()
X = iris.data

sil_score_max = -1  # this is the minimum possible score
best_n_clusters = None

# Try each candidate number of clusters and keep the best-scoring one
for n_clusters in range(2, 10):
    model = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    labels = model.fit_predict(X)
    sil_score = silhouette_score(X, labels)
    print("The average silhouette score for %i clusters is %0.2f" % (n_clusters, sil_score))
    if sil_score > sil_score_max:
        sil_score_max = sil_score
        best_n_clusters = n_clusters
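This will print something like the following (exact values can vary between runs, since KMeans is initialised randomly and no seed is set):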
The average silhouette score for 2 clusters is 0.68
The average silhouette score for 3 clusters is 0.55
The average silhouette score for 4 clusters is 0.50
The average silhouette score for 5 clusters is 0.49
The average silhouette score for 6 clusters is 0.36
The average silhouette score for 7 clusters is 0.46
The average silhouette score for 8 clusters is 0.34
The average silhouette score for 9 clusters is 0.31
And thus you will have best_n_clusters = 2
(NB: in reality, Iris has three classes, so the highest silhouette score does not necessarily recover the true number of groups.)
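If you want to reuse the selection loop, one option is to wrap it in a small function. The sketch below is an assumption on my part rather than part of the original answer: the name pick_n_clusters is hypothetical, and it fixes random_state and uses n_init=10 so that repeated runs give the same result:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

def pick_n_clusters(X, candidates=range(2, 10), random_state=0):
    """Hypothetical helper: return (best_k, scores) by average silhouette score."""
    scores = {}
    for k in candidates:
        # n_init=10 reruns KMeans from several starts and keeps the best inertia
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        scores[k] = silhouette_score(X, model.fit_predict(X))
    # The candidate with the highest average silhouette score wins
    best_k = max(scores, key=scores.get)
    return best_k, scores

best_k, scores = pick_n_clusters(load_iris().data)
print(best_k, round(scores[best_k], 2))
```

Returning the full scores dictionary alongside best_k lets you inspect how close the runner-up values are before trusting the automatic choice.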