The General Distance Coefficients provided in ClustanGraphics are the complement of Gower's General Similarity Coefficient, with extensions that allow distances between clusters, as well as distances between cases, to be computed for mixed data types. For details of how to specify mixed data types click here.

The General Squared Euclidean Distance Coefficient compares two cases i and j, and is defined as follows:

It should be noted that the effect of the denominator

Ordinal and Continuous Variables

$d_{ijk}^2 = (x_{ik} - x_{jk})^2 / r_k^2$

where $r_k = x_{max} - x_{min}$ is the range of variable k. The components of distance $d_{ijk}^2$ for ordinal and continuous variables therefore range between 0, for identical values $x_{ik} = x_{jk}$, and 1, when the two cases take the extreme values $x_{max}$ and $x_{min}$.

When an ordinal or continuous variable is standardised to z-scores, the transformed values
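As a minimal sketch of the range-standardised component above, the following Python function computes the squared distance component for one ordinal or continuous variable. A second function combines the components as a weighted sum divided by the sum of the weights. That combining rule is an assumption, borrowed from the usual form of Gower's coefficient, because the defining formula is not reproduced in this text. The function names, and the use of None to mark a missing value, are illustrative only and are not ClustanGraphics features.

```python
def range_component(x_i, x_j, r_k):
    """Squared distance component (x_ik - x_jk)^2 / r_k^2 for one variable."""
    if r_k <= 0:
        raise ValueError("range r_k must be positive")
    return (x_i - x_j) ** 2 / r_k ** 2


def general_sq_distance(case_i, case_j, ranges, weights=None):
    """Weighted mean of the components over the variables that can be compared.

    ASSUMPTION: components are combined Gower-style, i.e. the weighted sum
    is divided by the sum of the weights of the valid comparisons. A variable
    is skipped when either value is None (missing) or its weight is zero.
    """
    if weights is None:
        weights = [1.0] * len(ranges)
    numerator = 0.0
    denominator = 0.0
    for x_i, x_j, r_k, w_k in zip(case_i, case_j, ranges, weights):
        if x_i is None or x_j is None or w_k == 0:
            continue  # comparison not valid for this variable
        numerator += w_k * range_component(x_i, x_j, r_k)
        denominator += w_k
    if denominator == 0:
        raise ValueError("no valid comparisons between the two cases")
    return numerator / denominator


# Suppose variable k takes values between 2 and 11, so r_k = 9.
print(range_component(2.0, 2.0, 9.0))   # identical values -> 0.0
print(range_component(2.0, 11.0, 9.0))  # the two extremes -> 1.0

# Three variables; the third is missing for the first case and is skipped.
print(general_sq_distance([2.0, 1.0, None], [11.0, 1.0, 4.0], [9.0, 4.0, 3.0]))  # 0.5
```

In the last call the two valid components are 1.0 and 0.0, so the result under the assumed combining rule is their mean, 0.5.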
Standardising to z-scores is generally preferable to standardising by range because the resulting values are not determined by the two extreme values.

If you choose not to transform an ordinal or continuous variable, the component of distance for that variable will not be divided by anything, i.e. $d_{ijk}^2 = (x_{ik} - x_{jk})^2$.

This may be appropriate if all your variables are ordinal variables on the same scale, or if you have standardised your variables externally by some other transformation. There is a further discussion of data transformations here.

Binary and Nominal Variables

For a binary variable,

For a nominal variable,

Differential Variable Weights
If the weight of any variable is zero, then the variable is effectively ignored for the calculation of proximities. Such variables are "masked" for clustering, but available for

Euclidean Distance

Increase in Sum of Squares

City Block Distance

City Block Distance is akin to the walking distance between two points in a city like New York's Manhattan district, where each component is the number of blocks in the directions North-South and East-West.

Maximum Distance

This measure is appropriate if you wish to locate all the cases in a cluster within a hypercube of side 2d, where d could be the outlier deletion threshold in k-means analysis. Since every case is within a distance d of the mean, it must lie inside the cell of size 2d. This criterion was added to ClustanGraphics so that it is comparable to CHAID analysis, which creates segments of hypercube shapes.

Distances Between Clusters

This is straightforward for ordinal and continuous variables. But with binary and nominal variables, $\mu_{pk}$ is a vector $\{\varphi_{pks}\}$ of probabilities of occurrence for each state s of variable k for cluster p. The distance component $d_{ijk}^2$ between two clusters (in hierarchical cluster analysis) or a case and a cluster (in k-means analysis) is computed between these vectors, and standardised. The chosen criterion function, including Euclidean Sum of Squares, can then be optimized in terms of the sum of the distance components across all variables k, thereby allowing for mixed data types in cluster analysis in the most general way.
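The idea of comparing clusters on a nominal variable through their vectors of state probabilities can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the component is standardised, so the sketch halves the squared Euclidean distance between the two probability vectors, which keeps the result between 0 and 1. The function names are hypothetical, and a single case is treated as a cluster of one, whose probability vector has a 1 for its own state and 0 elsewhere.

```python
from collections import Counter


def state_probabilities(labels, states):
    """Proportion of a cluster's cases falling in each state s of one variable."""
    counts = Counter(labels)
    return [counts[s] / len(labels) for s in states]


def nominal_component(p, q):
    """Squared Euclidean distance between two probability vectors, halved.

    ASSUMPTION: dividing by 2 is one possible standardisation. Two clusters
    concentrated on different single states then score exactly 1, and two
    clusters with identical state proportions score 0.
    """
    return sum((a - b) ** 2 for a, b in zip(p, q)) / 2.0


states = ["red", "green", "blue"]
cluster_p = state_probabilities(["red", "red", "green", "red"], states)
cluster_q = state_probabilities(["blue", "blue"], states)
single_case = state_probabilities(["green"], states)  # one-hot vector

print(cluster_p)                                   # [0.75, 0.25, 0.0]
print(nominal_component(cluster_p, cluster_q))     # cluster to cluster: 0.8125
print(nominal_component(cluster_p, single_case))   # case to cluster: 0.5625
```

Because a case and a cluster are both represented as probability vectors, the same function serves for the hierarchical (cluster to cluster) and k-means (case to cluster) comparisons mentioned above.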
Of course, these definitions also allow for missing values to be present, and for differential variable and case weighting. Where the data may be incomplete, the weights w

This uniquely flexible clustering capability sets Clustan software head and shoulders above the competition. With ClustanGraphics, you don't have to force categorical data to behave like continuous variables, or categorize continuous variables to fit a
Our General Distance Coefficients have been available in Clustan since 1984, and in ClustanGraphics since release 5 in 2001. It's one of our best-kept secrets!