Cluster Keys starts with the whole dataset and examines each variable in turn for the best split into two cluster subsets, for example the division into two clusters that gives the maximum reduction in the Euclidean Sum of Squares. It is therefore analogous to agglomerative clustering that minimizes the Euclidean Sum of Squares (Ward's Method), but works by division rather than agglomeration. If the best split of the dataset is on a binary variable, Cluster Keys divides on the presence or absence of the key binary variable. If the best split is on an ordinal or continuous variable, Cluster Keys divides on a cut value between x and y, where x and y are values chosen from the data by Cluster Keys. One cluster subset will contain the cases with values x and lower on the chosen variable, and the other cluster subset will contain the cases with values y and higher. The following dialogue gives the abbreviated results for Cluster Keys with the Mammals Dataset.
The first division gives a simple divisional key for classifying the data: the best discriminating
variable is found and used to split the dataset into two clusters. In the above example, the first division is on Lactose > 3.3%, reducing the Euclidean Sum of Squares by 69.9, the largest reduction achievable by a split on any
variable, and forming clusters 1 and 5. The next step is to examine each of the two resulting clusters and find the best further split of one of
these clusters into two subsets, thus forming 3 clusters. In the example, the best second split is Water > 46.5% on cluster 5, forming clusters 5 and 7. Cluster Keys continues in this way, at each step finding the maximum reduction in the Euclidean Sum of Squares, until the
whole dataset has been subdivided into singleton clusters or cases. The result is a hierarchical classification obtained by
division, the reverse of hierarchical clustering by agglomeration (e.g. Cluster/Proximities).
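The divisive procedure described above can be sketched in a few lines of Python. This is only an illustration, not the Cluster Keys implementation: the function names `ess`, `best_split` and `divide` are hypothetical, the Euclidean Sum of Squares is assumed here to be the plain sum of squared deviations from the cluster mean over all variables (Clustan's own definition may normalise differently), and ties between equally good splits are resolved by simply keeping the first one found.

```python
def ess(rows):
    """Assumed definition: sum of squared deviations from the cluster mean, over all variables."""
    if not rows:
        return 0.0
    n = len(rows)
    total = 0.0
    for j in range(len(rows[0])):
        mean = sum(r[j] for r in rows) / n
        total += sum((r[j] - mean) ** 2 for r in rows)
    return total


def best_split(rows):
    """Return (reduction, variable index, x, y) for the best two-way split, or None.

    Cases with values <= x on the chosen variable go to one subset,
    cases with values >= y go to the other, where x and y are adjacent
    distinct values observed in the data.
    """
    base = ess(rows)
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for x, y in zip(values, values[1:]):
            low = [r for r in rows if r[j] <= x]
            high = [r for r in rows if r[j] >= y]
            reduction = base - ess(low) - ess(high)
            if best is None or reduction > best[0]:
                best = (reduction, j, x, y)
    return best


def divide(rows, k):
    """Repeatedly split whichever cluster gives the largest reduction, until k clusters exist."""
    clusters = [rows]
    while len(clusters) < k:
        candidates = []
        for i, c in enumerate(clusters):
            split = best_split(c) if len(c) > 1 else None
            if split is not None:
                candidates.append((split, i))
        if not candidates:
            break  # nothing left that can be divided
        (reduction, j, x, y), i = max(candidates, key=lambda t: t[0][0])
        c = clusters.pop(i)
        clusters.append([r for r in c if r[j] <= x])
        clusters.append([r for r in c if r[j] >= y])
    return clusters


# Toy data invented for illustration (not the Mammals Dataset).
toy = [[1.0, 10.0], [2.0, 10.5], [9.0, 10.2], [10.0, 10.1]]
print(best_split(toy))
print(divide(toy, 3))
```

Calling `divide(rows, len(rows))` would carry the division through to singleton clusters, provided no two cases are identical on every variable. A binary variable coded 0/1 is handled by the same loop, since its only possible cut lies between 0 and 1.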
When small clusters are being subdivided, the same split will frequently be identified for two or more
variables, and of course this always occurs with clusters of size 2. In this instance, the "best" split is ambiguous – any one of the variables could be selected as the best divisional key. Cluster Keys
therefore reports all ambiguous divisions, and these keys should not be applied in practice. Upon completing the division tree, the results can be copied to a document or spreadsheet, for example
into Excel as shown below. The keys can thus be used for further analysis or classification.
At present, Cluster Keys cannot sensibly divide on a nominal variable, but it is planned to provide this in
a future version – any nominal variables will currently be excluded from the analysis. However, it is of course open to the user to re-code each nominal variable as a set of dummy binary variables.
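As a minimal sketch of the re-coding suggested above, a nominal variable can be expanded into one 0/1 dummy variable per category before the data are analysed. The function name `dummy_code` and the example categories are hypothetical and are not taken from any Clustan dataset or feature.

```python
def dummy_code(values):
    """Expand one nominal variable into a set of binary (0/1) dummy variables.

    Returns the sorted category labels and, for each case, a row holding
    one indicator per category.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical nominal variable with three categories.
labels, dummies = dummy_code(["red", "green", "red", "blue"])
print(labels)
print(dummies)
```

Each resulting column can then be treated as a binary variable, so that a division is on the presence or absence of a single category.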
The results for Cluster Keys are currently displayed as a tree. Cluster Keys was introduced in ClustanGraphics 8.02, July 2005.