Suppose that the data have been transformed to standard scores, or z-scores. A typical selection of five cases taken from the Mammals Case Study is then as follows:

Standard scores (z-scores)

Standardization to z-scores has the effect that each column has a mean of zero and a standard deviation of 1. See Data Transformations for further details.

The next step is to calculate proximities between all pairs of cases. In ClustanGraphics it's simple: just click Compute on the Proximities menu, and select a proximity coefficient from the drop-down list.
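The standardization step can also be sketched outside ClustanGraphics. The following is a minimal illustration only, not ClustanGraphics code: the function name `standardize_columns` and the small data set are hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption, since the text does not say which divisor the program uses.

```python
from statistics import mean, stdev

def standardize_columns(rows):
    """Return z-scores: each column rescaled to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Sample standard deviation (n - 1 divisor); this choice is an assumption.
    # A constant column has a standard deviation of 0 and cannot be standardized.
    sds = [stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, sds)]
        for row in rows
    ]

# Hypothetical raw measurements: 4 cases x 2 variables (not the Mammals data).
raw = [[90.1, 2.6], [81.6, 1.0], [86.5, 4.9], [46.4, 9.7]]
z = standardize_columns(raw)
```

After this transformation every variable contributes on the same scale, so no single variable dominates the distances computed in the next step.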
In this example, we have selected Squared Euclidean Distance. The result is a proximity matrix of order 25, that is, 25 x 25 elements. But since the diagonal elements are not important, and the matrix is symmetrical about the diagonal, there are effectively only 25 x 24 / 2 = 300 different distances. For ease of presentation, we just show the proximities for five of the cases below:

Squared Euclidean Distance Matrix
Just by examining the proximities we can determine interesting details, such as the fact that donkey and
zebra are quite similar, with a squared distance value of 0.186, whereas donkey and seal are the two most dissimilar cases, with a squared distance of 8.731.
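As a rough sketch of how such a matrix is built, the code below computes squared Euclidean distances between all pairs of standardized cases and counts the distinct off-diagonal pairs. It is not ClustanGraphics code. The names `squared_euclidean`, `proximity_matrix` and `distinct_pairs` are hypothetical, the data are invented, and the optional `average` flag reflects an assumption: some programs divide the sum of squared differences by the number of variables, and the text does not say which form is used here.

```python
def squared_euclidean(x, y, average=False):
    """Sum of squared differences; optionally averaged over the variables."""
    total = sum((a - b) ** 2 for a, b in zip(x, y))
    return total / len(x) if average else total

def proximity_matrix(cases, average=False):
    """Full symmetric matrix with zeros on the diagonal."""
    n = len(cases)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # lower triangle only, then mirror
            d[i][j] = d[j][i] = squared_euclidean(cases[i], cases[j], average)
    return d

def distinct_pairs(n):
    """Number of different off-diagonal proximities for n cases."""
    return n * (n - 1) // 2

# Hypothetical standardized scores: 3 cases x 2 variables.
scores = [[0.0, 1.0], [1.0, 1.0], [3.0, -1.0]]
d = proximity_matrix(scores)
# d[0][1] is 1.0, d[0][2] is 13.0 and d[1][2] is 8.0
# distinct_pairs(5) is 10: the five cases shown above have ten distinct distances
```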
Of course it's not very practicable to examine all 300 proximities (25 x 24 / 2) by inspecting the proximity matrix for 25 cases, and it certainly would not be practicable with 10,000 cases. However, we can use Nearest Neighbour analysis to find the nearest neighbours; and clustering the proximity matrix is the main way we can group the cases into clusters and thus describe the structure and diversity of the data.
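The idea behind a nearest-neighbour search on a proximity matrix can be sketched as follows. This is an illustration under stated assumptions, not the ClustanGraphics Nearest Neighbour analysis itself: the function `nearest_neighbours`, the labels and the distances are all hypothetical, and the matrix is assumed to hold dissimilarities, so smaller values mean closer cases.

```python
def nearest_neighbours(d, labels):
    """For each case, return (label of nearest other case, distance to it)."""
    result = {}
    for i, row in enumerate(d):
        # Skip the diagonal: a case is not its own neighbour.
        candidates = [(dist, j) for j, dist in enumerate(row) if j != i]
        dist, j = min(candidates)
        result[labels[i]] = (labels[j], dist)
    return result

# Hypothetical dissimilarity matrix for three cases.
labels = ["a", "b", "c"]
d = [
    [0.0, 0.2, 5.0],
    [0.2, 0.0, 4.0],
    [5.0, 4.0, 0.0],
]
nn = nearest_neighbours(d, labels)
# nn["a"] is ("b", 0.2), nn["b"] is ("a", 0.2), nn["c"] is ("b", 4.0)
```

For a similarity coefficient the search would take the largest value instead of the smallest.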
See the ClustanGraphics Preview, where the Mammals Case Study is taken further.

Continuous Data
Euclidean Distance
Euclidean Sum of Squares
City Block Distance
Pearson's Product-Moment Correlation [rho]
Pearson's Correlation as Distance |rho-1|

Binary Data
Simple Matching Coefficient (A+D)/M
Jaccard Similarity Coefficient A/(A+B+C)

These coefficients compare any two cases i and j across all M unmasked binary variables, as follows:
If a variable is "missing" for either case i or case j, then it is not considered for the computation of the coefficient. In this case M is the number of variables that are observed for both cases.
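The binary coefficients and the treatment of missing values can be sketched in a few lines. This is a hedged illustration, not ClustanGraphics code. It assumes the usual 2 x 2 convention for the counts: A is the number of variables present in both cases, B present in i only, C present in j only, and D absent from both, so that M = A + B + C + D. The function names are hypothetical, `None` is used here to mark a missing value, and the Binary Euclidean Distance (B+C)/M is included for comparison.

```python
def binary_counts(case_i, case_j):
    """Count A, B, C, D over variables observed for both cases."""
    a = b = c = d = 0
    for vi, vj in zip(case_i, case_j):
        if vi is None or vj is None:
            continue  # missing for either case: variable is not considered
        if vi and vj:
            a += 1
        elif vi:
            b += 1
        elif vj:
            c += 1
        else:
            d += 1
    return a, b, c, d

def binary_coefficients(case_i, case_j):
    a, b, c, d = binary_counts(case_i, case_j)
    m = a + b + c + d  # variables observed for both cases
    if m == 0:
        raise ValueError("no variable is observed for both cases")
    return {
        "simple_matching": (a + d) / m,
        # Jaccard is undefined when neither case has any attribute present.
        "jaccard": a / (a + b + c) if (a + b + c) else None,
        "binary_euclidean": (b + c) / m,
    }

# Hypothetical cases; the last variable is missing for case i.
case_i = [1, 1, 0, 0, 1, None]
case_j = [1, 0, 0, 1, 1, 1]
coeffs = binary_coefficients(case_i, case_j)
# A=2, B=1, C=1, D=1, so M=5:
# simple matching 3/5, Jaccard 2/4, binary Euclidean distance 2/5
```

Note that the simple matching coefficient and the Binary Euclidean Distance always sum to 1 under these definitions.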
Binary Euclidean Distance (B+C)/M is a dissimilarity coefficient and the other two are similarity coefficients. Use Binary Euclidean Distance if you intend to cluster by minimizing the Euclidean Sum of Squares (Ward's Method).

Mixed Data
Euclidean Distance
Euclidean Sum of Squares
City Block Distance
Jukes-Cantor Genetic Distance
Gower's Similarity Coefficient

There is no program limit to the size of proximity matrix which can be computed; the limit is determined
by the memory and disk resources available on the user's PC. As a rough guide only, a reasonable Pentium PC is capable of computing proximities for up to about 10,000 cases with ClustanGraphics. If
you have a larger data matrix, we recommend that you use Direct Data Clustering which can produce a hierarchical cluster analysis for 100,000 cases, or more.
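To see why memory becomes the limiting factor, it helps to estimate how a proximity matrix grows with the number of cases. The sketch below stores only the n(n-1)/2 distinct proximities. The function name is hypothetical, and the figure of 4 bytes per value is an assumption for illustration; the text does not state how ClustanGraphics stores proximities.

```python
def proximity_storage_bytes(n_cases, bytes_per_value=4):
    """Bytes needed for the distinct off-diagonal proximities of n cases."""
    pairs = n_cases * (n_cases - 1) // 2
    return pairs * bytes_per_value

small = proximity_storage_bytes(10_000)    # 199,980,000 bytes, about 0.2 GB
large = proximity_storage_bytes(100_000)   # 19,999,800,000 bytes, about 20 GB
```

Ten times as many cases need roughly a hundred times the storage, which is why a method that avoids the full proximity matrix is preferable for very large data sets.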
Note that ClustanGraphics can compute proximities from incomplete data - see
For further definitions and other details, please refer to the ClustanGraphics Primer and the ClustanGraphics Help file.