In this example we clustered a matrix of correlation coefficients between the test scores of a sample of 220 Scottish pupils on six school subjects, taken from Lawley and Maxwell*. The correlation matrix was read into ClustanGraphics as a square proximity matrix (see Reading Proximities). We also had to declare that the matrix contained proximities of type similarity (i.e. product-moment correlation coefficients).
First Cluster   Second Cluster   Fusion Value
Arithmetic      Geometry         0.464
Gaelic          English          0.439
Gaelic          History          0.351
Gaelic          Arithmetic       0.164
We can see that at the two-cluster level, the subjects Arithmetic, Algebra and Geometry are all inter-correlated at a value of 0.464, or higher; and that Gaelic, English and History are inter-correlated at 0.351, or higher. The two-cluster level, illustrated below, neatly separates the "verbal" subjects from the "mathematical" subjects.
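The fusion sequence can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not ClustanGraphics' own procedure. The text does not name the linkage rule, so the sketch assumes complete linkage, which fits the reading that all members of a cluster are inter-correlated at the fusion value or higher. The correlation values are also an assumption: they are the figures usually quoted for the Lawley and Maxwell six-subject example and do not appear in this text, so they should be checked against the source. The names `LOWER`, `full_matrix` and `complete_linkage` are hypothetical.

```python
# ASSUMED data: correlations usually quoted for the Lawley and Maxwell
# six-subject example. They are not given in this text; verify before use.
SUBJECTS = ["Gaelic", "English", "History", "Arithmetic", "Algebra", "Geometry"]
LOWER = [  # lower triangle, row i holds correlations with subjects 0..i-1
    [],
    [0.439],
    [0.410, 0.351],
    [0.288, 0.354, 0.164],
    [0.329, 0.320, 0.190, 0.595],
    [0.248, 0.329, 0.181, 0.470, 0.464],
]


def full_matrix(lower):
    """Expand a lower triangle into a symmetric matrix with unit diagonal."""
    n = len(lower)
    sim = [[1.0] * n for _ in range(n)]
    for i, row in enumerate(lower):
        for j, value in enumerate(row):
            sim[i][j] = sim[j][i] = value
    return sim


def complete_linkage(sim):
    """Agglomerate on similarities (ASSUMED linkage rule: complete linkage).

    Returns a list of (members_a, members_b, fusion_value) tuples, one per
    fusion, in the order the fusions occur.
    """
    clusters = [[i] for i in range(len(sim))]
    fusions = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Similarity of two clusters is the smallest pairwise
                # similarity between their members.
                s = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        fusions.append((list(clusters[a]), list(clusters[b]), s))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return fusions


FUSIONS = complete_linkage(full_matrix(LOWER))
for left, right, value in FUSIONS:
    print([SUBJECTS[i] for i in left], [SUBJECTS[i] for i in right], value)
```

Under these assumptions the last four fusions occur at 0.464, 0.439, 0.351 and 0.164, in agreement with the table, and the final fusion joins the verbal group to the mathematical group.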
We also took the opportunity to optimally re-order the tree so that the sequence from top to bottom makes most presentational sense. ClustanGraphics can display a shaded version of the correlation matrix in which the two clusters of variables are highlighted in green:
These results are very similar to what was obtained by factor analysis (see source, below), though we would venture to suggest that interpretation is much easier when the subjects are clustered. Our cluster analysis on variables is a discrete form of factor analysis, where each variable belongs to only one cluster (factor). Lawley and Maxwell state that the analysis demonstrates that individuals who do well on verbal subjects tend to do less well on mathematical subjects, and vice versa. We agree.
This application may appear trivial with only six variables, as the factoring of the correlation matrix can be done by inspection. But it would not be so obvious if there were 50 variables, or 5000 variables as can occur in, for example, gene expression studies. Clustering correlation matrices of this size is quite feasible using ClustanGraphics, and can lead to helpful insight into the structure of the data which is not so straightforward to analyze using factor analysis. Indeed, it is not really practicable to use factor analysis with more than about 100 variables, because of the cost of inverting the correlation or covariance matrix.
ClustanGraphics also finds the cluster exemplars, or the most typical members of each cluster. In this context, the exemplar is that variable which has the highest average correlation with the other variables in the cluster. At the two-cluster level illustrated above, the cluster exemplars are Gaelic and Arithmetic. These are the two key subjects which, on the basis of the scores of the 220 Scottish pupils analyzed, most exemplify the verbal versus mathematical dichotomy in the data.
A large matrix containing inter-correlated variables can thus be reduced to a more focussed subset of key variables which account for the principal dimensionality in the data, without resorting to the more abstract, and hence less meaningful, transformation to factor scores obtained by factor analysis.
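The exemplar rule described above (the variable with the highest average correlation with the other variables in its cluster) is simple enough to check by hand or with a short script. The following is a minimal sketch, not ClustanGraphics' implementation. The cluster memberships come from the two-cluster solution in the text. The correlation values are an assumption: they are the figures usually quoted for the Lawley and Maxwell six-subject example, are not given in this text, and should be verified against the source. The names `corr` and `exemplar` are hypothetical.

```python
# ASSUMED data: correlations usually quoted for the Lawley and Maxwell
# six-subject example. They are not given in this text; verify before use.
SUBJECTS = ["Gaelic", "English", "History", "Arithmetic", "Algebra", "Geometry"]
LOWER = [  # lower triangle, row i holds correlations with subjects 0..i-1
    [],
    [0.439],
    [0.410, 0.351],
    [0.288, 0.354, 0.164],
    [0.329, 0.320, 0.190, 0.595],
    [0.248, 0.329, 0.181, 0.470, 0.464],
]

# Two-cluster solution as described in the text.
CLUSTERS = {
    "verbal": ["Gaelic", "English", "History"],
    "mathematical": ["Arithmetic", "Algebra", "Geometry"],
}


def corr(a, b):
    """Look up the correlation between two distinct subjects by name."""
    i, j = SUBJECTS.index(a), SUBJECTS.index(b)
    return LOWER[max(i, j)][min(i, j)]


def exemplar(members):
    """Return the member with the highest mean correlation with the rest."""
    if len(members) == 1:
        return members[0]  # a singleton cluster is its own exemplar

    def mean_corr(name):
        others = [m for m in members if m != name]
        return sum(corr(name, other) for other in others) / len(others)

    return max(members, key=mean_corr)


EXEMPLARS = {label: exemplar(members) for label, members in CLUSTERS.items()}
print(EXEMPLARS)
```

With the assumed values, Gaelic has the highest mean correlation within the verbal cluster and Arithmetic narrowly exceeds Algebra within the mathematical cluster, which agrees with the exemplars reported in the text.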
* Factor Analysis as a Statistical Method, by D N Lawley and A E Maxwell, Butterworths 1971, p. 66.