With ClustanGraphics it is possible to cluster complex data structures involving different types of variables. But for the purposes of this discussion, let's suppose that all the variables are measured on a continuous or semi-continuous scale. For example, in the Mammals Case Study, the 5 variables are the percentages of Water, Protein, Fat, Lactose and Trace Elements (Ash) in the milk of different mammals. Five lines of data are reproduced below:
Composition of mammals' milk (percentages)
For the full data set, please refer to the file MammalsLabelled.txt distributed with ClustanGraphics.
Note that the variables' ranges differ quite markedly, from water with a range of 44 and fat with a range of 40, to ash with a range of less than 2. If we wish the variables to be treated equally, as is often the case,
it will be appropriate to standardize the values by dividing by each variable's range or standard deviation. Otherwise, clustering on the above table would be dominated by the diversity of water and fat, and ash
would have a negligible influence on any measure of cluster variance (ash, or trace elements, is arguably the most important variable).
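To make the point about dominance concrete, here is a minimal sketch in Python. The three rows are hypothetical toy values chosen only to mimic wide and narrow ranges; they are not taken from MammalsLabelled.txt. The helper names `z_scores` and `squared_contributions` are our own, and the use of the population standard deviation is an assumption, since the text does not say which form ClustanGraphics uses.

```python
from statistics import mean, pstdev

# Hypothetical toy values, NOT taken from MammalsLabelled.txt.
# Columns: water, protein, fat, lactose, ash (percentages).
rows = [
    [90.0, 2.0, 1.0, 6.5, 0.5],
    [70.0, 12.0, 13.0, 2.0, 2.0],
    [45.0, 10.0, 42.0, 0.0, 1.0],
]

def z_scores(data):
    """Subtract each column's mean and divide by its standard deviation."""
    columns = list(zip(*data))
    means = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]  # population SD: an assumption
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in data]

def squared_contributions(a, b):
    """Share of the squared Euclidean distance between a and b due to each variable."""
    squares = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(squares)
    return [v / total for v in squares]

# Compare the first and third cases before and after standardization.
raw_share = squared_contributions(rows[0], rows[2])
z = z_scores(rows)
z_share = squared_contributions(z[0], z[2])

print("raw share per variable:", [round(v, 4) for v in raw_share])
print("z-score share per variable:", [round(v, 4) for v in z_share])
```

On the raw toy values, water and fat account for almost all of the squared distance and ash for almost none; after standardization the ash share is no longer negligible.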
We note that the percentages for rabbit and seal do not sum to 100 - perhaps due to measurement or transcription error. Furthermore, it is not clear whether the value of lactose for seal is zero, or missing;
for this analysis we shall take a value of zero, but with ClustanGraphics we can denote the value as missing. As the analyst, there's not a lot we can do about these data queries - the data were published
in 1956, so asking the authors for an explanation is not very practicable.
Subtracting each variable's mean and dividing by its standard deviation yields z-scores, which have a mean of zero and a standard deviation of 1 for each variable. Thus, each variable contributes equally to the variance in the analysis:
Standard scores (z-scores)
Dividing by the range gives a different spread of values, as shown below:
Standardization to Unit Range
Here we can see that rabbit has the maximum value for both protein and ash, while seal has the
maximum for fat and the minimum for lactose. (These transformations relate to the full dataset of 25 mammals, and not to the selection of five cases used here for illustration.)
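Standardization to unit range can be sketched in the same way. The version below assumes that the minimum is subtracted before dividing by the range, so that each variable runs from 0 (its minimum) to 1 (its maximum); the text only says "dividing by the range", and plain division differs from this by a constant shift per variable, which leaves the distances between cases unchanged. The function name `unit_range`, the case labels and the values are hypothetical and are not taken from MammalsLabelled.txt.

```python
# Hypothetical toy values, NOT taken from MammalsLabelled.txt.
# Columns: water, protein, fat, lactose, ash (percentages).
cases = {
    "A": [90.0, 2.0, 1.0, 6.5, 0.5],
    "B": [70.0, 12.0, 13.0, 2.0, 2.0],
    "C": [45.0, 10.0, 42.0, 0.0, 1.0],
}

def unit_range(data):
    """Rescale each column to [0, 1] as (x - min) / (max - min).

    Assumes every column has a non-zero range; a constant column
    would cause a division by zero and needs separate handling.
    """
    columns = list(zip(*data))
    lows = [min(c) for c in columns]
    ranges = [max(c) - min(c) for c in columns]
    return [[(x - lo) / r for x, lo, r in zip(row, lows, ranges)] for row in data]

scaled = dict(zip(cases, unit_range(list(cases.values()))))

for label, values in scaled.items():
    print(label, [round(v, 3) for v in values])
```

With this scaling, a value of 1 marks the case holding the maximum for a variable and a value of 0 marks the case holding the minimum, which is how such extremes can be read directly off a range-standardized table.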
Having transformed the data, we are now in a position to |