Some of our users have expressed surprise and concern that k-means clustering can produce different classifications from the same data if the starting conditions are
varied. They point to the fact that this is not evidently the case with other software, and hence query the added complexity involved in deciding which cluster solution to choose from a range of different solutions.
If you are currently using other k-means software, we invite you to check your specifications for a discussion of the following important issues:
The chances are that few if any of the above issues will be discussed. Most of our competitors present one classification for a given dataset and number of clusters, as if it is the definitive, or best, cluster solution that can be found by the k-means procedure. However, this is unlikely to be the case. k-Means Procedure
The cluster means are re-computed at step 5 when any change in cluster membership occurs. We call
this "migrating" or "drifting" means, and it will be evident that the final cluster solution will be dependent to some extent on the order in which the cases are considered for relocation (see
below).
With ClustanGraphics you can choose to randomize the case order, so that different cluster solutions will typically be obtained from different k-means runs. Our Different Cluster Solutions
The letters A and B (contoured in red and blue, respectively) denote the cluster to which each point has
been assigned, and the crosses denote the cluster means where a cluster has 2 or 3 points. It should be evident, from inspection, that in all six cluster solutions each case is closer to the mean of the cluster
to which it is assigned than to the mean of the other cluster. A little high school mathematics can readily prove the point.
However, the two solutions at the left have a smaller Euclidean Sum of Squares than the four solutions at the right - to be exact, if the square has a side of length 1, the Euclidean Sum of Squares is 1 for the two
clusters of 2 points, and 1.33 for a triangular cluster of 3 points. Hence the two solutions at the left represent global optimum stable cluster solutions, while the four solutions to the right are
sub-optimal stable solutions. Again, a little high school math can prove the point. The moral of this exercise is - don't trust a k-means procedure that gives you only one cluster solution.
You need to explore the possible alternatives in more depth. Criterion Values
For example, when FocalPoint is used with the Mammals case study, 7 different cluster solutions are obtained from 500 random trials with ESS values and cluster sizes as follows:
Solution 1 has the smallest ESS value and reassuringly occurs in two-thirds of the random trials. This is
the cluster solution that we would normally use to characterize mammals' milk. However, it is interesting to note that solutions 2 and 4 assign elephant to a singelton cluster, and solutions 6 and 7 assign
dolphin and seal to singleton clusters. This section therefore illustrates the importance of investigating the criterion function values for the
classifications, to establish which of several cluster solutions is most appropriate to use as a model, and also the influence of outliers on a k-means classification. If you are using one of our competitors'
k-means clustering tools, try running it with different starting configurations (if you can) and check the criterion function values (if they are reported) - then ask your vendor's representative to explain the differences. Outlier Deletion
By comparison, if every point is forced to join a cluster the classification looks like this:
The clusters are typically larger when outliers are not removed, and the resulting cluster solution is more confused. We generally advise that outliers should be deleted when clustering a large dataset with ClustanGraphics, thus placing all remote or atypical cases in a residue subset of unclassified cases. When the cluster model has been saved, the distance from each outlier to its nearest cluster centre can be obtained and this can be used to classify the outliers, if desired. Failure To Converge Tests show that the number of cases moved in each k-means iteration does not necessarily decline
sequentially, and therefore this ad hoc stopping rule may actually prevent the procedure from even getting close to the global optimum solution. With ClustanGraphics, we do not provide such ad hoc
convergence criteria, because our implementation of k-means is guaranteed to converge and they are therefore not needed. We compute an exact relocation test
for the Euclidean Sum of Squares, by which a case is moved from one cluster to another only if the move reduces the Euclidean Sum of Squares, exactly computed. Since the Euclidean Sum
of Squares is a sum of squared distances about k cluster means, it cannot be reduced indefinitely and is bounded by zero, hence our exact k-means procedure must converge. Furthermore, we assert that our
procedure achieves greater accuracy in its assignment decisions and is more likely to find the global optimum solution than those offered by our competitors. Details Case Order Dependence
Completely Randomized Trials Conclusion
If you wish to explore this further, please download a paper that we presented at the German Classification Society in March, 2001 - Thank you for your interest in the ClustanGraphics k-means procedure. |