Cluster Tutorial 

There are two ways to run Cluster in Clustan: interactively, or by specifying commands in a Clustan program. For this tutorial we'll focus on the command language approach, because it's easier to illustrate on a webpage.

Here's a Clustan program which runs Cluster with mixed variable types and missing values:

    read data, variables multi 1 cont 2-5 bin 6 missing 6 * -1.0, cases 9
    2 .57 .53 .69 .43  0
    2 .72 .54 .72 .41  0
    3 .50 -1. .68 .46  0
    -1 .47 .71 .84 .56 -1
    1 -1. .49 .53 .42  0
    1 .25 .38 .41 .33  0
    3 .45 .68 -1. .45  0
    1 .57 .70 .78 .55  1
    2 .87 .82 .87 .51  0

    read labels variables
    Blood Ht/Wt FRG val ECG ind AntibodyCyst

    cluster, method wards,
    print fusions, centres min 2 max 4, members min 2 max 4

    stop

The program calls three Clustan procedures - Read Data, Read Labels and Cluster.

Read Data defines 6 variables: the first is a multi-state or ordinal variable; variables 2-5 are continuous; and variable 6 is binary. We also specify a missing value code of -1 for all 6 variables. There are 9 cases to be read and the data are read in free-format; that is, successive values are separated by blanks. Note that a missing value (code -1) occurs once in each of variables 1-4 and 6; variable 5 has no missing values in this data set.
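To make the free-format and missing-value conventions concrete, here is a minimal Python sketch that reads the same nine rows and counts the missing values per variable. It is not Clustan code and does not reproduce how Read Data works internally; the names MISSING, RAW and parse_rows are hypothetical and chosen only for this illustration.

```python
# Illustrative only: a plain-Python reading of the tutorial's data block.
# This is not Clustan code; MISSING, RAW and parse_rows are hypothetical names.
MISSING = -1.0

RAW = """\
2 .57 .53 .69 .43  0
2 .72 .54 .72 .41  0
3 .50 -1. .68 .46  0
-1 .47 .71 .84 .56 -1
1 -1. .49 .53 .42  0
1 .25 .38 .41 .33  0
3 .45 .68 -1. .45  0
1 .57 .70 .78 .55  1
2 .87 .82 .87 .51  0
"""


def parse_rows(text, n_vars=6):
    """Split free-format rows on blanks and map the missing code to None."""
    rows = []
    for line in text.splitlines():
        values = [float(tok) for tok in line.split()]
        if len(values) != n_vars:
            raise ValueError(f"expected {n_vars} values, got {len(values)}")
        rows.append([None if v == MISSING else v for v in values])
    return rows


rows = parse_rows(RAW)
# Number of missing values in each of the 6 variables.
missing_per_variable = [sum(r[j] is None for r in rows) for j in range(6)]
print(len(rows), missing_per_variable)
```

Mapping the code -1 to None keeps missing entries from being mistaken for real measurements in any later arithmetic.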

Read Labels specifies alphabetical labels for the variables.

The third command runs Cluster directly on the data matrix to carry out a hierarchical cluster analysis by Ward's method. The options selected for printing are the fusion schedule, cluster centres and cluster membership for 2, 3 and 4 clusters.
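As background, Ward's method fuses at each step the pair of clusters whose union gives the smallest increase in the within-cluster error sum of squares. For two clusters with weights n_a and n_b and centres c_a and c_b, that increase is n_a n_b / (n_a + n_b) times the squared Euclidean distance between the centres, so it can be evaluated from centres and weights alone. The Python sketch below shows this for complete continuous data only. It is an assumption-laden illustration, not Clustan's implementation: the function names are hypothetical, and the way Cluster treats mixed variable types and missing values inside the criterion is not described in this tutorial and is not modelled here.

```python
# Illustrative sketch of Ward's fusion criterion on complete continuous data.
# Not Clustan code; ward_increase, merge and ward_fusions are hypothetical names.

def ward_increase(n_a, centre_a, n_b, centre_b):
    """Increase in error sum of squares if clusters a and b are fused."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(centre_a, centre_b))
    return n_a * n_b / (n_a + n_b) * sq_dist


def merge(n_a, centre_a, n_b, centre_b):
    """Weight and centre of the fused cluster."""
    n = n_a + n_b
    centre = [(n_a * a + n_b * b) / n for a, b in zip(centre_a, centre_b)]
    return n, centre


def ward_fusions(points):
    """Return the fusion schedule as a list of (id_a, id_b, increase)."""
    clusters = {i: (1.0, list(p)) for i, p in enumerate(points)}
    next_id = len(points)
    schedule = []
    while len(clusters) > 1:
        ids = sorted(clusters)
        # Pick the pair with the smallest increase; only centres are needed.
        cost, i, j = min(
            (ward_increase(*clusters[i], *clusters[j]), i, j)
            for pos, i in enumerate(ids)
            for j in ids[pos + 1:]
        )
        clusters[next_id] = merge(*clusters.pop(i), *clusters.pop(j))
        schedule.append((i, j, cost))
        next_id += 1
    return schedule


# Variables 2-5 of cases 1, 2 and 6, which have no missing values.
sample = [
    [0.57, 0.53, 0.69, 0.43],
    [0.72, 0.54, 0.72, 0.41],
    [0.25, 0.38, 0.41, 0.33],
]
for id_a, id_b, increase in ward_fusions(sample):
    print(id_a, id_b, round(increase, 4))
```

This exhaustive pairwise search is written for clarity rather than speed; it is only meant to show that the criterion needs no stored proximity matrix.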

A typical output for 3 clusters is as follows:

Table of cluster centres for 3 clusters
Cluster codes            1        2        3
Cluster weights          5.00     2.00     2.00

Variable Type Cluster:   1        2        3
Blood    Multi. 1 %      0      100      100
Observations/weights     5.       1.       2.
Blood    Multi. 2 %     60        0        0
Observations/weights     5.       1.       2.
Blood    Multi. 3 %     40        0        0
Observations/weights     5.       1.       2.
Ht/Wt    Cont. Mean      0.69     0.61     0.41
Observations/weights     5.       2.       1.
FRG val  Cont. Mean      0.47     0.50     0.36
Observations/weights     4.       2.       2.
ECG ind  Cont. Mean      0.72     0.87     0.13
Observations/weights     4.       2.       2.
Antibody Cont. Mean      1.27     1.49     1.10
Cyst     Binary %        0      100        0
Observations/weights     5.       1.       2.

We would like to draw attention to some special features in this table.  Cluster weights refer to the number of cases in each of the 3 clusters; they are decimal values because the cases could have been assigned differential decimal weights.

The number of observations/weights for each variable in each cluster is the sum of the weights for valid observations only; missing values have been ignored. In the table, each variable that has an Observations/weights line has a weight total of 8, i.e. 9 cases less 1 missing value on that variable. Antibody, which has no missing values in the data, has no such line.

Cluster means are computed for continuous variables, but percentages of occurrence are computed for the multi-state and binary variables.
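The rules above (means for continuous variables, percentages for multi-state and binary variables, with missing values left out of both the sums and the weight totals) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Clustan's code: the function names are hypothetical, missing values are represented as None, and the cluster membership used in the example is invented for illustration rather than taken from the Ward's analysis.

```python
# Illustrative sketch only; not Clustan code. Function names are hypothetical.

def centre_continuous(values, weights=None):
    """Weighted mean over valid (non-None) observations.

    Returns (mean, weight_total); mean is None if nothing is valid.
    """
    if weights is None:
        weights = [1.0] * len(values)
    valid = [(v, w) for v, w in zip(values, weights) if v is not None]
    total = sum(w for _, w in valid)
    if total == 0:
        return None, 0.0
    return sum(v * w for v, w in valid) / total, total


def centre_categorical(values, states, weights=None):
    """Percentage occurrence of each state over valid observations.

    Returns ({state: percent}, weight_total).
    """
    if weights is None:
        weights = [1.0] * len(values)
    valid = [(v, w) for v, w in zip(values, weights) if v is not None]
    total = sum(w for _, w in valid)
    percents = {}
    for s in states:
        hits = sum(w for v, w in valid if v == s)
        percents[s] = 100.0 * hits / total if total else None
    return percents, total


# Hypothetical cluster made of cases 3, 4 and 7 from the data block
# (membership chosen for illustration, not taken from the Ward's run).
blood = [3, None, 3]          # variable 1, multi-state
frg_val = [None, 0.71, 0.68]  # variable 3, continuous
cyst = [0, None, 0]           # variable 6, binary

print(centre_categorical(blood, states=[1, 2, 3]))
print(centre_continuous(frg_val))
print(centre_categorical(cyst, states=[0, 1]))
```

Returning the weight total alongside each centre mirrors the Observations/weights lines in the table, which show how many valid observations each figure rests on.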

The tree corresponding to this hierarchical cluster analysis by Ward's method can also be displayed using ClustanGraphics:

    [Figure: ClusterTutorialTree.gif, the tree for the Ward's method analysis]

We could, of course, add alphabetical case labels so that the nodes of the tree are labelled accordingly.  We could also use this tree to identify new cases with procedure Classify.

This tutorial has illustrated the following features which are unique to Cluster:

  • Mixed variable types can be combined in the analysis
  • Missing values can be taken into account, without bias
  • It wasn't necessary to calculate a proximity matrix
  • Cluster centres are computed and can be stored
  • The tree can be used to identify new cases with Classify

Finally, we would emphasize that since Cluster operates directly on a data matrix we can use it to obtain a hierarchical classification for thousands of cases.  See very large surveys for details.