Here's a Clustan program which runs Cluster with mixed variable types and missing values:

    read data,
    variables multi 1 cont 2-5 bin 6
    missing 6 * -1.0,
    cases 9
    2 .72 .54 .72 .41 0
    3 .50 -1. .68 .46 0
    -1 .47 .71 .84 .56 -1
    1 -1. .49 .53 .42 0
    1 .25 .38 .41 .33 0
    3 .45 .68 -1. .45 0
    1 .57 .70 .78 .55 1
    2 .87 .82 .87 .51 0
    read labels variables
    Blood
    Ht/Wt
    FRG val
    ECG ind
    Antibody
    Cyst
    cluster, method wards, print fusions,
    centres min 2 max 4,
    members min 2 max 4
    stop

The program calls three Clustan procedures: Read Data, Read Labels and Cluster.
Read Data defines 6 variables: the first is a multi-state or ordinal variable, variables 2-5 are continuous, and variable 6 is binary. We also specify a missing value code of -1 for each of the 6 variables. There are 9 cases to be read, and the data are read in free format; that is, successive values are separated by blanks. Note that a missing value (code -1) occurs in each of the 6 variables. Read Labels specifies alphabetical labels for the variables.
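The free-format reading and missing-value coding described above can be sketched in Python. This is only an illustration of the idea, not Clustan's own reader: the function name read_free_format and the use of None for a missing value are assumptions made for the example, and only the first three data rows of the listing are used.

```python
# First three data rows of the listing, deliberately wrapped unevenly:
# in free format only the order of the values matters, not the line breaks.
RAW = """
2 .72 .54 .72
.41 0 3 .50 -1. .68 .46 0
-1 .47 .71 .84 .56 -1
"""

N_VARS = 6
MISSING_CODES = [-1.0] * N_VARS  # mirrors "missing 6 * -1.0"


def read_free_format(text, n_vars, missing_codes):
    """Split on whitespace, regroup into cases, map missing codes to None."""
    values = [float(token) for token in text.split()]
    if len(values) % n_vars != 0:
        raise ValueError("value count is not a multiple of the variable count")
    cases = []
    for start in range(0, len(values), n_vars):
        row = values[start:start + n_vars]
        cases.append([None if v == code else v
                      for v, code in zip(row, missing_codes)])
    return cases


cases = read_free_format(RAW, N_VARS, MISSING_CODES)

# Number of valid (non-missing) observations per variable.
valid_counts = [sum(row[j] is not None for row in cases)
                for j in range(N_VARS)]
```

Keeping one missing code per variable, rather than a single global code, follows the form of the missing specification, which lists a code for each of the 6 variables.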
The third command runs Cluster directly on the data matrix, using Ward's method. The options selected for printing are the fusion schedule, cluster centres and cluster membership for 2, 3 and 4 clusters.
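For readers unfamiliar with Ward's method, the following Python sketch shows the fusion criterion working directly on a data matrix: at each step it merges the two clusters whose fusion gives the smallest increase in the error sum of squares. It is a minimal illustration on complete, continuous data. The function name ward_fusions and the four sample points are hypothetical, and the sketch does not attempt to reproduce how Cluster handles mixed variable types or missing values.

```python
from itertools import combinations


def ward_fusions(points):
    """Agglomerate points by Ward's criterion and return the fusion schedule.

    Each schedule entry is (members_a, members_b, increase_in_ESS).
    """
    # Each cluster is stored as (list of case indices, centroid).
    clusters = {i: ([i], list(p)) for i, p in enumerate(points)}
    next_id = len(points)
    schedule = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            (ma, ca), (mb, cb) = clusters[a], clusters[b]
            na, nb = len(ma), len(mb)
            sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
            # Increase in error sum of squares if a and b were fused.
            cost = na * nb / (na + nb) * sq_dist
            if best is None or cost < best[0]:
                best = (cost, a, b)
        cost, a, b = best
        ma, ca = clusters.pop(a)
        mb, cb = clusters.pop(b)
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters[next_id] = (ma + mb, centroid)
        schedule.append((ma, mb, cost))
        next_id += 1
    return schedule


# Hypothetical data: two tight pairs of points, far apart from each other.
schedule = ward_fusions([(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)])
for members_a, members_b, increase in schedule:
    print(members_a, "+", members_b, "increase in ESS =", round(increase, 3))
```

Because only cluster sizes and centroids are kept, no full distance matrix between cases is stored, although this simple version still compares every pair of clusters at each step.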
A typical output for 3 clusters is as follows:

Table of cluster centres for 3 clusters
We would like to draw attention to some special features in this table. Cluster weights refer to the number of cases in each of the 3 clusters; they are shown as decimal values because the cases could have been assigned differential decimal weights. The number of observations/weights for each variable in each cluster is the sum of the weights for valid observations only; missing values have been ignored. In the example, each variable has a weight total of 8, i.e. 9 cases less 1 missing value on each variable.
Cluster means are computed for continuous variables, but percentages of occurrence are computed for the multi-state and binary variables.
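The way such a table of centres can be built is sketched below in Python. This is an assumption-laden illustration rather than Clustan's code: the function cluster_centres, the unit case weights and the cluster membership [1, 1, 2] are all hypothetical, the three cases are the first three data rows of the program with -1 replaced by None, and taking percentages relative to the valid weight is assumed from the statement that missing values are ignored.

```python
from collections import defaultdict

VAR_TYPES = ["multi", "cont", "cont", "cont", "cont", "bin"]

# First three data rows of the program; None marks a missing value.
cases = [
    [2.0, 0.72, 0.54, 0.72, 0.41, 0.0],
    [3.0, 0.50, None, 0.68, 0.46, 0.0],
    [None, 0.47, 0.71, 0.84, 0.56, None],
]
weights = [1.0, 1.0, 1.0]   # hypothetical unit case weights
membership = [1, 1, 2]      # hypothetical cluster assignment


def cluster_centres(cases, weights, membership, var_types):
    """Summarise each cluster, ignoring missing (None) values per variable."""
    table = {}
    for k in sorted(set(membership)):
        idx = [i for i, m in enumerate(membership) if m == k]
        summary = {"weight": sum(weights[i] for i in idx), "variables": []}
        for j, vtype in enumerate(var_types):
            valid = [(cases[i][j], weights[i]) for i in idx
                     if cases[i][j] is not None]
            w_valid = sum(w for _, w in valid)
            if w_valid == 0:
                centre = None  # nothing observed for this variable
            elif vtype == "cont":
                # Weighted mean over valid observations only.
                centre = sum(v * w for v, w in valid) / w_valid
            else:
                # Percentage of the valid weight falling in each state.
                by_state = defaultdict(float)
                for v, w in valid:
                    by_state[int(v)] += w
                centre = {s: 100.0 * w / w_valid
                          for s, w in sorted(by_state.items())}
            summary["variables"].append(
                {"valid_weight": w_valid, "centre": centre})
        table[k] = summary
    return table


table = cluster_centres(cases, weights, membership, VAR_TYPES)
```

Each variable carries its own valid weight, so a case with one missing value still contributes to the centres of all its other variables.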
The tree corresponding to this hierarchical cluster analysis by Ward's method can also be displayed. We could, of course, add alphabetical case labels so that the nodes of the tree are labelled accordingly. We could also use this tree to identify new cases with procedure Classify.
This tutorial has illustrated the following features which are unique to Cluster:
Finally, we would emphasize that, since Cluster operates directly on a data matrix, we can use it to obtain a hierarchical classification for thousands of cases. See very large surveys for details.