|
Two-Stage k-Means Clustering Method
with Outlier Detection and Model Calibration D. Wishart1
, S. Simmons21 Department of Management, University of St. Andrews;
2McKinsey & Co, LondonKeywords : ClustanGraphics, cluster analysis, conjoint surveys, data mining, k-means,
knowledge management, market segmentation, outliersThe paper describes a two-stage k-means clustering method to search for an optimum
cluster model. The first stage provides six different starting strategies, randomized trials, differential variable and case weights, outlier deletion, and saving and profiling of the top solutions. In the
second stage, outliers are re-assigned to clusters and the model is calibrated. Six starting strategies are provided - tree partition, dense cliques, cluster exemplars, contingency table, initial
points and random assignment. Any combination of these can be used to specify the initial search partitions for k-means analysis. The Euclidean Sum of Squares is optimized, and it will be shown that the algorithm always
converges if allowed sufficient iterations. Outliers and intermediate cases can be excluded from the model, thereby tightening the core clusters and separating densely populated segments.
The variables used to specify the cluster model can be selected from the input data matrix and standardized or assigned differential weights. Cases can also be assigned differential weights, and random split samples can be
selected for cluster model validation. Weighting of cases is particularly important for aggregated units, such as geographical regions weighted by their population sizes, or customers weighted by their transaction values or
volumes. A specified number of the top convergent solutions can be saved, together with their criterion values and
frequencies of occurrence. Any of these top solutions can then be used as a core cluster model for second-stage calibration, evaluation, and case assignment.
Second-stage model calibration involves selecting the subset of the best discriminatory variables for the core cluster model; weighting variables by t-tests on cluster means or f-tests on cluster variances, to focus the
classificatory sub-space; summarizing the model by hierarchical cluster analysis; optimal ordering of the core clusters; and assigning outliers and intermediate cases. The model can then be used to classify new cases, such
as in data mining applications of any size. The paper will be illustrated by applications in market segmentation. New software for Windows 95/2000 and NT will be demonstrated. References
WISHART, D. (1998): ClustanGraphics3: Interactive Graphics for Cluster Analysis, in Gaul, Locarek-Junge (Eds.): Classification in the Information Age, Proceedings of the 22nd Annual Conference of the Gesellschaft für
Klassifikation, University of Dresden, Springer, Heidelberg, pp 268-275. |