Tools for Quantitative Archaeology
KMEANS: Nonhierarchical Cluster Analysis
© 1987-2007 Keith Kintigh & Hans Peter Blankholm
Keith Kintigh
2014 East Alameda Drive
Tempe, Arizona 85282 USA
Hans Peter Blankholm
Department of Prehistoric Archaeology
University of Aarhus, Moesgård
DK-8270 Højbjerg, DENMARK
K-means executes a variance-minimizing nonhierarchical cluster analysis. This Turbo Pascal program is translated (by Hans Peter Blankholm and Keith Kintigh) from a PL/I version that was in turn translated and expanded (by Keith Kintigh) from an IBM ALGOL program provided by F.R. Hodson of the Institute of Archaeology, London. Extensive revisions have been made to the input and output routines; however, the basic algorithm remains intact. Hodson's program was based on the work of Geoffrey Ball at SRI International in Menlo Park, California. See Doran and Hodson 1975, Kintigh and Ammerman 1982, MacQueen 1967, Blankholm 1990, and Kintigh 1991.
INTRODUCTION
K-means performs a non-hierarchical divisive cluster analysis on input data. K-means has several features that distinguish it from the more common hierarchical clustering techniques.
1. At times, there is an interpretive advantage to non-hierarchical clusters. For example, assume that when the data are divided into three clusters, units A and B fall in the same cluster. It may nonetheless make sense that when the data are divided into, say, two clusters, A and B fall in different clusters. This result is impossible with a hierarchical method.
2. With hierarchical methods, execution time and necessary storage generally increase with the square of the number of objects being clustered. With k-means, execution time goes up approximately linearly with the product of the number of units, the number of variables, the maximum number of clusters desired (usually much less than the number of units), and the (unpredictable) number of iterations. Storage increases linearly with the product of the number of objects and variables. Thus, it is possible to use k-means to cluster much larger numbers of objects.
3. K-means provides a method whereby the degree of clustering in the data can be evaluated. It is a property of most cluster analysis techniques (including k-means) that the desired number of clusters will be formed whether or not the data are clustered in any practical sense. K-means allows one to compare the degree of clustering observed in the actual data with clustering observed with comparable randomized data.
4. However, in contrast to the hierarchical methods, which are guaranteed to find the best solution given the clustering criteria, k-means does not necessarily find the optimal solution for a given number of clusters. (It is like multidimensional scaling in this respect.)
PROGRAM OPERATION
If it is desired that the input units be divided into a maximum of MAXCLUST clusters, the program will allocate the units into NCLS = 1, 2, 3, ... MAXCLUST clusters. That is, at clustering stage NCLS, each of the units is assigned to one of the NCLS clusters. Each cluster at a given stage is described by the units assigned to it and the means and standard deviations of the variables over these units. The degree of clustering for the entire configuration of NCLS clusters is measured by the SSE, the sum of the squared distances (in a space of VARS dimensions) from each unit to its cluster centroid. If there is a high degree of clustering, the distance from a given unit to its centroid will be small.
The program starts with all units included in a single cluster. One at a time, the unit farthest (using a Euclidean distance function) from its cluster centroid is split off to form a new cluster. After this, units that are closer to the new cluster centroid than to the original centroid are moved to the new cluster and the cluster centroids are recomputed. Then, another pass through the units is made to reallocate each unit to the cluster with the closest centroid. At each reallocation the cluster centroids are recomputed. This process repeats for the new set of clusters and continues until MAXCLUST clusters are formed.
Then, one at a time, the two nearest clusters are lumped together and the centroids are recomputed. Again a pass is made through the units to reassign them to the cluster with the closest centroid (and again, at each reassignment the centroids are recomputed).
The program will again attempt to split off new clusters as described above, as long as an improvement (reduction) in SSE can be made over the previous configuration with that number of clusters. The loop of lumping and splitting continues until all clusters are lumped back into a single cluster and no improvement in SSE can be achieved with further splitting. We are left with the best achieved configuration for each number of clusters from 1 to MAXCLUST.
               NCLUST = 1
                   |
                   v
  +-------> SPLIT: form a new cluster;
  |         NCLUST = NCLUST + 1
  |                |
  |                v
  |         BESTMOVE: reallocate units  <------------------+
  |         to the best (closest) cluster                  |
  |                |                                       |
  |                v         yes                           |
  |         NCLUST = 1? -----------> STOP                  |
  |                | no                                    |
  |                v                                       |
  |         remove clusters smaller than                   |
  |         the minimum cluster size (MINSIZE)             |
  |                |                                       |
  |                v                                       |
  |         is the SSE less than the previous    no        |
  |         configuration of NCLUST clusters? ------+      |
  |                | yes                            v      |
  |                |                         ISODATA:      |
  |                |                         restore the   |
  |                |                         previous      |
  |                |                         configuration |
  |                v                                |      |
  |   no    NCLUST = MAXCLUST?                      |      |
  +-------- (has the max number of                  |      |
            clusters been formed?)                  |      |
                   | yes                            |      |
                   v                                v      |
            LUMP: combine the 2 clusters                   |
            with the closest centroids  -------------------+
Figure 5. Flow Chart of k-means Operation.
The main computational procedures are described in more detail below, and the sequence of control is illustrated in the flow chart (Figure 5).
Split Procedure
The unit farthest from the cluster centroid is selected to start a new cluster. The selected unit is considered to be the centroid of a new cluster and the centroid of the cluster from which it was moved is recalculated (to account for its absence). Next, all other units are examined to see if they are closer to the new cluster centroid than to their old cluster centroid. If so, that unit is moved to the new cluster and the centroids of the two affected clusters are recalculated.
Bestmove Procedure
Now, another pass through the units is made that checks each unit to see if it is closer to any other cluster centroid than to its own. If so, the unit is moved and the centroids are recalculated. (In this procedure and in the Split procedure, the re-calculation is at the time of the move, not after all moves have been made.)
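The Split and Bestmove steps described above can be sketched as follows. This is a minimal Python illustration of the algorithm, not the program's Turbo Pascal source; all function and variable names are invented for the sketch. As the text describes, centroids are recomputed at the time of each individual move.

```python
def centroid(units):
    """Mean of each variable over the units of a cluster."""
    n = len(units)
    return [sum(u[j] for u in units) / n for j in range(len(units[0]))]

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(clusters):
    """Start a new cluster from the unit farthest from its own
    centroid, then pull over units closer to the new centroid."""
    ci, ui = max(((ci, ui)
                  for ci, c in enumerate(clusters)
                  for ui in range(len(c)) if len(c) > 1),
                 key=lambda t: dist2(clusters[t[0]][t[1]],
                                     centroid(clusters[t[0]])))
    new = [clusters[ci].pop(ui)]
    for c in clusters:
        for u in list(c):
            if len(c) > 1 and dist2(u, centroid(new)) < dist2(u, centroid(c)):
                c.remove(u)          # centroids recomputed on next call
                new.append(u)
    clusters.append(new)
    return best_move(clusters)

def best_move(clusters):
    """Reassign each unit to the cluster with the closest centroid,
    recomputing centroids after every individual move."""
    moved = True
    while moved:
        moved = False
        for ci, c in enumerate(clusters):
            for u in list(c):
                if len(c) == 1:
                    continue
                best = min(range(len(clusters)),
                           key=lambda k: dist2(u, centroid(clusters[k])))
                if best != ci:
                    c.remove(u)
                    clusters[best].append(u)
                    moved = True
    return clusters
```

Each individual move strictly reduces the SSE, so the Bestmove pass is guaranteed to terminate.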
Main Procedure Loop
The program returns to the main procedure after the splitting caused by the Split procedure and the adjustment by the Bestmove procedure. A check is made for whether any clusters smaller than MINSIZE, the minimum allowable cluster size, have been formed. If so, these units and clusters are permanently removed from the analysis, and a message is printed.
If this is the first time the current number of clusters (NCLS) has been formed, or if the current cluster configuration is a SSE improvement over the previous configuration of NCLS clusters then, as long as the maximum number of clusters (MAXCLUST) has not been reached, the program returns to Split to split off another cluster. If the maximum number of clusters has been reached, the program proceeds to the Lump procedure below. If the current configuration is not a SSE improvement over the previous NCLS configuration, then the program executes Isodata and then continues to procedure Lump. If at the time the main procedure loop is invoked there is only a single cluster remaining, then the program computation ends.
Isodata Procedure
This procedure returns the program to the last (previous best) configuration of NCLS clusters. However, as the program only saves the cluster centroids (not the complete configuration), it restores the previous centroids and then assigns each unit to the closest centroid, thus fully reconstructing the previous configuration. After all re-assignments, the cluster centroids are recomputed.
Procedure Lump
Lump, in effect, is the inverse of Split. It lumps together the two clusters that are closest together; in other words, it combines the two clusters that will cause the minimum SSE increase. After the Lump procedure, the program returns to the Bestmove procedure and then to the main procedure loop, where a new splitting or lumping sequence begins.
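Lump's choice of the two clusters with the closest centroids can be sketched as follows (an illustrative Python sketch with invented names, not the program's source):

```python
import itertools

def centroid(units):
    """Mean of each variable over the units of a cluster."""
    n = len(units)
    return [sum(u[j] for u in units) / n for j in range(len(units[0]))]

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lump(clusters):
    """Merge the pair of clusters whose centroids are closest."""
    i, j = min(itertools.combinations(range(len(clusters)), 2),
               key=lambda p: dist2(centroid(clusters[p[0]]),
                                   centroid(clusters[p[1]])))
    clusters[i].extend(clusters.pop(j))
    return clusters
```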
RUNNING THE PROGRAM
The first operation of the program is to read a set of parameters from the user and to read data and labeling files. Formation of clusters will not begin unless all input is error-free. The program parameters are specified using a series of interactive prompts. For information on the conventions used in the prompts and data files, see the section: "Program Conventions." To start the program, at the DOS prompt type: KMEANS<Enter>
PROGRAM PROMPTS
Pathname of Input Data File {.ADF} ?
After the informational message, the program requests the name of the file that contains the data. If the data that you want to cluster have UNITS observations (points) and VARS variables (2 for pure locational clustering), then this file must contain UNITS*VARS integer or real values, preceded by integer values for UNITS and VARS. The program first reads these two values from the file and uses this information to interpret the remaining values.
Care should be taken with data values that have a large number of significant digits or large absolute values (e.g., UTM coordinates in meters have 6 or 7 digits). Program accuracy will be diminished if there is little or no variation in the first few leading digits (e.g., the first two digits of all UTM north coordinates within a region will often be identical--they represent the number of thousands and hundreds of kilometers from the origin). If the absolute values are large (e.g., greater than about 10000), the output columns may not line up properly, making the program output hard to read. In either case, transformation of the data to a smaller range is advisable. These transformations can be accomplished using the Arithmetic option in ADFUTIL. For example, with UTMs one might subtract off the invariant leading digits (e.g., subtract 3900000 from all values) and change the scale from meters to more tractable kilometers by dividing the (subtracted) coordinates in meters by 1000. If you need to delete or remove columns from a fixed format file, you may find the MVC program helpful.
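The transformation itself would be done with ADFUTIL's Arithmetic option; purely as an illustration of the arithmetic described above (the offset here is the example value from the text, and the function name is invented):

```python
def rescale_utm(coords_m, offset_m=3900000, scale=1000):
    """Subtract the invariant leading digits and convert meters
    to kilometers, as suggested in the text."""
    return [(c - offset_m) / scale for c in coords_m]

# Two hypothetical UTM north coordinates in meters:
rescale_utm([3904250, 3917800])   # kilometers from the 3900000 m origin
```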
Reading ?? Rows and ?? Columns
After the file name is read, the program reads the data and reports the number of rows (observations) and columns (variables) read.
Number of Tabulation Variable (0 for none) {0} ?
Include Tabulation Variable in the Analysis {N} ?
A tabulation variable is an integer variable with values between 0 and 255. For each cluster, the program will report the number of observations in that cluster that have each separate value of the tabulation variable. Thus, in doing pure spatial clustering one might have a data set with 3 variables: east, north, and type (specified as an integer less than 256). In this case, one would specify variable 3 as the tabulation variable and indicate that it should not be used in the analysis. The cluster analysis will then report the number of artifacts of each type in each spatial cluster. Output option M (below) would be most appropriate if the assignments of the individual tools are not important.
For purposes of tabulation, non-integer values of the tabulation variable are truncated to integers; values greater than 255 are replaced by the remainder of the value when divided by 256.
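That mapping amounts to truncation followed by a remainder on division by 256; as a one-line sketch (the function name is invented):

```python
def tab_value(x):
    """Truncate to an integer, then wrap values above 255 by
    taking the remainder on division by 256."""
    return int(x) % 256
```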
Read Row (Point) Label File {N} ?
Row Label File Name {xxxxxxxx.ARL} ?
The program asks whether you want to read labels for the individual observations from a file. If you answer Y, then it prompts for the name of the file in which these labels are saved. A label for each observation is read from a single line of the label file. The labels may be of any length, but only the first 16 characters are retained by the program. If no label file is specified, then an observation is identified by a sequential number that indicates its position in the input data file.
Read Column (Variable) Label File {N} ?
Column Label File Name {xxxxxxxx.ACL} ?
K-means also permits labels for the columns (variables). Like the row labels, column labels are used in the printed program output and a maximum of 16 characters is used. If no labels are supplied, the variables are numbered sequentially. As with the row labels, each column label must be entered on a single line.
Standardize Data {N} ?
If one answers Y, then each data value is divided by that variable's standard deviation. The effect of this is that each variable used in the analysis has a standard deviation of 1, and thus in some sense, has equal weight in the analysis. In general, standardization is not desirable for pure locational clustering of artifact coordinates, but is desirable in cases where the variable measures are incommensurate or the variables differ greatly in standard deviation or range.
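What standardization does can be sketched as follows (an illustration, not the program's code; note that, as the text says, values are divided by the standard deviation but not centered):

```python
import math

def standardize(column):
    """Divide each value by the column's standard deviation so that
    the standardized column has a standard deviation of 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [x / sd for x in column]
```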
Name of Output Listing File {FIBULA.LST} ?
Enter the name of the file in which the program output is to be stored. If you give a file or path name, make sure that you have sufficient room on the disk drive. Otherwise, you may specify PRN to send the output directly to the printer or CON to send the output to the screen (console). Note, however, that if you send it to the screen it is not saved for later printing. If the program terminates with a run error, you may have run out of room on your disk and will have to start again.
Title ?
A title with up to 60 characters may be entered (don't worry about counting, extra characters will be ignored). This title will appear at the top of output pages.
List [N]one, [G]lobal, <+[C]luster, <+[M]eans, <+[U]nits, <+[L]og {U} ?
The answer to this prompt determines the detail of the program output. For most purposes, the default {U} response is appropriate.
If N is specified, only summary statistics from a run are printed. If G is specified then the SSE and related statistics for the configuration are reported. If C is specified then the global statistics are reported along with statistics on the individual clusters (e.g. the RMS and number of points). If M is specified then the cluster centroids (variable means and standard deviations of the items in the cluster) and the tabulation variable counts, if any, are reported in addition to the output of the C option. With the U option, the observations assigned to each cluster are listed along with the other information.
Finally, the L option requests that each step of the analysis be listed. In contrast to the other options, in which only the best configuration for each number of clusters is reported, with the L option each cluster configuration is printed as it is formed, with the detail associated with the U option. The L option produces at least twice as much output as the U option.
Print Run Statistics Summary: [O]riginal Data, [A]lways, [N]ever {O} ?
The program, on request, prints a summary of the statistics for each run (see example below). In most cases, the plot provides the information needed from the random runs.
Print Cluster Size Plot: [O]riginal Data, [A]lways, [N]ever {O} ?
On request, the program produces a map of the cluster sizes (sample below). These will ordinarily not be useful for random runs.
Name of PLOT File {xxxxxxxx.PLT} ?
The program produces a file with the essential information obtained from the clustering so that the results can be plotted using the KMPLT program. If this plotting is not required, reply NUL to this prompt and no file will be produced.
Number of Variable to Plot on X Axis {1} ?
Number of Variable to Plot on Y Axis {2} ?
Include Plot Variables in Analysis {Y} ?
If there are more than two variables in the analysis, the program now asks which 2 variables you want plotted by the KMPLT program. If you are not interested in viewing a plot, just press <Enter> in response to both of the first two questions. It then asks if you want those variables included in the analysis. For a cluster analysis of all variables, reply Y. However, if you have included two variables only for purposes of plotting (e.g., in an unconstrained clustering spatial analysis) and you do not want clustering done using those variables, you should reply N to this prompt.
Maximum Number of Clusters ?
Enter the number of clusters desired. All cluster configurations from 1 to this number of clusters will be reported.
Minimum Size of Clusters {1} ?
The minimum size a cluster can have and still be retained in the analysis. If this value is 1, even clusters with a single unit will be kept. If a value greater than 1 is specified and a cluster with fewer than that number of points is formed, the cluster will be eliminated, and the units in that cluster will be dropped from the analysis from that point on. This option allows the program to discard outlying units. However, it does create some statistical and interpretive problems and is not recommended.
Number of Random Runs {0} ?
The number of clustering runs to be done with randomized versions of the input data. If this value is 0, no random runs are made. If a number greater than 0 is given, then that number of randomized runs is made after an initial run on the actual data. If a number less than 0 is specified, then only the absolute value of that number of randomized runs is made. Where two-dimensional data are being analyzed, the configuration is rotated a random angle and randomized (see Kintigh 1990).
Randomization is done by randomizing the order of the values in each of the data columns. In this way, the set of values for a given variable does not change, but the association of different variable values for a single unit is randomized. Each variable column still has the same mean and standard deviation.
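The column-wise randomization can be sketched as follows (illustrative Python with invented names; the random rotation applied to two-dimensional data is omitted from this sketch):

```python
import random

def randomize_columns(data, seed=None):
    """Independently shuffle each column. Each variable keeps the
    same set of values (hence the same mean and standard deviation),
    but the association of values within a unit is randomized."""
    rng = random.Random(seed)
    cols = [list(col) for col in zip(*data)]   # transpose to columns
    for col in cols:
        rng.shuffle(col)
    return [list(row) for row in zip(*cols)]   # transpose back to rows
```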
Specify Random Number Seed Value {N} ?
If randomization is requested, then the program asks if you want to set the seed of the random number generator. Generally the default (using the system clock) is fine and you can press <Enter>. If you want to replicate a random run, you must enter that run's seed.
Run 0: Original Data
Cluster Level
2 3 4 ...
Run 1: Random Data - Seeds (xxxxxxxxx)
Cluster Level
2 3 4 ...
Compute Time: x.x Minutes
So that you can trace the program's progress (and to reassure you that it hasn't died), it displays the cluster configuration it is working on. Because of the iterative nature of the program, these numbers will increase to the maximum number of clusters, then oscillate, and eventually descend to 2. Larger numbers of clusters will take longer to form.
SAMPLE INPUT DATA SET
16 #points#
 3 #variables: x y type#
 1  1  0
 1  2  0
 2  1  0
 3  3  0
 8  2  2
 9  1  0
 9  3  2
10  2  0
 6 10 10
 6 11 11
 5 12 12
 7 12 13
12  8  4
13  7  2
14  9  2
15  7  4
SAMPLE COLUMN LABEL FILE - KMEANS.ACL
East North Type
SAMPLE ROW LABEL FILE - KMEANS.ARL
01-blade
02-blade
03-blade
04-blade
05-point
06-blade
07-point
08-blade
09-Uscraper
10-Escraper
11-Sscraper
12-Dscraper
13-burin
14-point
15-point
16-burin
PROGRAM OUTPUT
For each stage of clustering, global statistics are printed that describe characteristics of the current set of clusters. In addition, descriptive information is printed for the individual clusters. The statistics printed are described below, and a sample cluster configuration is included below. Note that the values of the various statistics will depend on whether or not the data were standardized.
Global Statistics
A given clustering stage (or configuration) is described by CLUS, the number of clusters in that configuration, N, the total number of units included in the clusters, and by the SSE, a measure of the degree of clustering. The SSE, for sum squared error, is the sum over all variables for all units of the squared difference between the variable value and its cluster mean:
         N   VARS
SSE  =   Σ    Σ   (Xij - XCij)²
        i=1  j=1
Where Xij is the value of unit i on variable j, and XCij is the mean of variable j for the cluster including unit i.
In addition, the current configuration SSE is given as a percentage of the single cluster SSE (the maximum SSE); this value is labelled %TOT_SSE. The logarithm to the base ten of this percentage value is also given, labelled LOG(%SSE). The two different SSE values printed, SSE1 and SSE2, will usually be very nearly equal if MINSIZE=1 (the default), the difference being due to minor computational differences. If MINSIZE>1 and units have been dropped from the analysis, SSE1 and SSE2 will diverge. SSE2 is calculated as described above only for those units remaining in the analysis. SSE1 includes the contribution of the deleted units to the total SSE at the time they are dropped from the analysis. The %TOT_SSE and the LOG(%SSE) values are based on SSE2.
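The SSE and the derived %TOT_SSE and LOG(%SSE) values can be reproduced as follows (an illustrative sketch; the function names are invented):

```python
import math

def sse(data, assign):
    """Sum, over all units and variables, of the squared deviation
    of each value from its cluster mean (SSE2 in the output)."""
    nvars = len(data[0])
    total = 0.0
    for c in set(assign):
        members = [u for u, a in zip(data, assign) if a == c]
        mean = [sum(m[j] for m in members) / len(members)
                for j in range(nvars)]
        total += sum((m[j] - mean[j]) ** 2
                     for m in members for j in range(nvars))
    return total

def pct_and_log(sse_k, sse_1):
    """%TOT_SSE (percent of the single-cluster SSE) and its log10."""
    pct = 100.0 * sse_k / sse_1
    return pct, math.log10(pct)
```

With the sample run's values, pct_and_log(22, 573) reproduces the 3.84 and 0.58 reported for the 4-cluster configuration.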
Kmeans Test Run
4 Clusters - Final Clusters

CLUS    N  SSE1  SSE2    %SSE  Log(%SSE)
   4   16    22    22    3.84       0.58
Additional global statistics, generally relevant only for pure locational clustering of artifact coordinates, are printed at each clustering stage. The mean number of points per cluster (nbar) is printed, as is the standard deviation of n (nstd). Similarly, RMSbar and RMSstd are printed, describing the distribution of the individual cluster RMS values. Finally, the program prints the mean and standard deviation of the r² values (r2wbar, r2wstd), calculated for the clusters with more than two members (n>2) and weighted by the number of members in each cluster.
CLUS  nbar  nstd  RMSbar  RMSstd  n>2  r2wbar  r2wstd
   4   4.0   0.0    1.16    0.15    4    0.11    0.17
Individual Cluster Statistics
For a given configuration of NCLS clusters, information is printed to describe the individual clusters. The number of units, N, in each cluster is given, followed by the RMS, a measure of the cluster radius in the space of VARS dimensions. The RMS is the square root of the sum of the VARS variable variances:
               1    n   VARS
RMS  =  √ [   ---   Σ    Σ   (Xij - XCj)²  ]
               n   i=1  j=1
Where unit i is a member of cluster C, n is the number of units in cluster C, Xij is the value of variable j for unit i, and Xcj is the mean of variable j for cluster C.
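As a sketch of the RMS computation for one cluster (illustrative Python, invented names):

```python
import math

def rms(members):
    """Square root of the sum of the per-variable variances of the
    cluster members; equivalently sqrt(cluster SSE / n)."""
    n = len(members)
    nvars = len(members[0])
    mean = [sum(m[j] for m in members) / n for j in range(nvars)]
    ss = sum((m[j] - mean[j]) ** 2 for m in members for j in range(nvars))
    return math.sqrt(ss / n)
```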
Note: The configuration in which all units are in a single cluster describes the complete sample. If the data are standardized, it is easy to see that for a single cluster SSE = UNITS*VARS, since the standard deviation of each variable is 1. Similarly, the RMS for the single cluster (with standardized data) is given by RMS = √VARS.
Additional individual cluster statistics, generally relevant only for pure locational clustering of artifact coordinates, are printed at each clustering stage. For each individual cluster, r² and SLOPE values are printed. These numbers are the squared product-moment correlation coefficient (r²) and the b value of a least-squares linear regression of the variable YV on XV for the points in the cluster. These values give an indication of a linear trend in the cluster for the plot variables XV and YV. Clearly, for many applications this statistic is not meaningful.
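The regression statistics can be sketched as follows (illustrative Python, invented names):

```python
def r2_slope(xs, ys):
    """Least-squares regression of YV on XV for one cluster:
    returns (r squared, slope b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    b = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return r2, b
```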
CLUS    n   RMS    r2  Slope
   1    4  1.00  0.00   0.00
   2    4  1.17  0.40   0.64
   3    4  1.09  0.00   0.00
   4    4  1.39  0.02  -0.10
The means and standard deviations of each variable, calculated over all the units in the cluster, are printed. The list of variable means is referred to as the centroid, which may be thought of as the center of the cluster.
CLUS  VAR MEAN/STDDEV...
   1  East :  9.000/ 0.707   North:  2.000/ 0.707
   2  East :  1.750/ 0.829   North:  1.750/ 0.829
   3  East :  6.000/ 0.707   North: 11.250/ 0.829
   4  East : 13.500/ 1.118   North:  7.750/ 0.829
If a tabulation variable is specified, then observations with each value of that variable are counted and the percentage is listed for each cluster.
CLUS  Type:Cnt/Pct...
   1   0: 2/ 50%   2: 2/ 50%   4: 0/  0%  10: 0/  0%  11: 0/  0%  12: 0/  0%  13: 0/  0%
   2   0: 4/100%   2: 0/  0%   4: 0/  0%  10: 0/  0%  11: 0/  0%  12: 0/  0%  13: 0/  0%
   3   0: 0/  0%   2: 0/  0%   4: 0/  0%  10: 1/ 25%  11: 1/ 25%  12: 1/ 25%  13: 1/ 25%
   4   0: 0/  0%   2: 2/ 50%   4: 2/ 50%  10: 0/  0%  11: 0/  0%  12: 0/  0%  13: 0/  0%
Finally, the units included in each cluster are listed with either their sequence numbers or labels.
CLUS  UNITS...
   1  05-point     06-blade     07-point     08-blade
   2  01-blade     02-blade     03-blade     04-blade
   3  09-Uscraper  10-Escraper  11-Sscraper  12-Dscraper
   4  13-burin     14-point     15-point     16-burin
The Run Statistics Summary
After each clustering run, including random runs, the global statistics for that run are summarized as shown above. For random runs only the global statistics for each stage of the clustering are given since the individual cluster composition is not relevant for randomized data.
Kmeans Test Run
Run Statistics Summary

CLUS    N  SSE1  SSE2    %SSE  Log(%SSE)
   1   16   573   573  100.00       2.00
   2   16   264   264   46.09       1.66
   3   16   127   127   22.19       1.35
   4   16    22    22    3.84       0.58
   5   16    18    18    3.11       0.49
   6   16    14    14    2.37       0.37
   7   16    11    11    1.93       0.29
   8   16     9     9    1.54       0.19

CLUS  nbar  nstd  RMSbar  RMSstd  n>2  r2wbar  r2wstd
   1  16.0   0.0    5.99    0.00    1    0.09    0.00
   2   8.0   0.0    4.06    0.27    2    0.42    0.36
   3   5.3   1.9    2.09    1.21    3    0.03    0.02
   4   4.0   0.0    1.16    0.15    4    0.11    0.17
   5   3.2   1.2    0.83    0.47    4    0.05    0.10
   6   2.7   1.1    0.76    0.38    3    0.07    0.11
   7   2.3   1.3    0.49    0.45    3    0.07    0.11
   8   2.0   1.0    0.48    0.41    2    0.11    0.12
The Cluster Size Plot
After each clustering run, a plot of the cluster sizes can be printed. It shows the number of clusters of each cluster size at each clustering stage. In the following plot, when the data are divided into 7 clusters, 2 clusters have 4 points, 1 has 3 points, 1 has 2, and 3 have only 1 point.
Kmeans Test Run
Cluster Size Distribution

cluster |        clustering stage
  size  |  1   2   3   4   5   6   7   8
--------+---+---+---+---+---+---+---+---+
     16 |  1
     15 |
     14 |
     13 |
     12 |
     11 |
     10 |
      9 |
      8 |      2   1
      7 |
      6 |
      5 |
      4 |          2   4   3   2   2   1
      3 |                  1   1   1   1
      2 |                      2   1   3
      1 |                  1   1   3   3
--------+---+---+---+---+---+---+---+---+
TECHNICAL NOTES
With k-means, execution time goes up approximately linearly with the product of the number of units, the number of variables, the maximum number of clusters desired (usually much less than the number of units) and the (unpredictable) number of iterations.
The difference in SSE compared with an earlier configuration that is considered sufficient to start another splitting and lumping loop (as described in the Program Operation section above) is determined by a parameter (not now user specifiable) called DELTA. If DELTA>0 the difference is simply the value of DELTA. If DELTA<0 the significant difference is computed as the absolute value of DELTA percent of the total SSE. If the data are not standardized then the default is -0.1 (0.1% of the total SSE). If the data are standardized then the default is 0.1 (0.1 SSE units). Note: In general this parameter will not require modification. However, in very large problems, an increase in speed (but perhaps a decrease in the ability of the program to find an optimal solution) may be obtained by increasing DELTA.
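The DELTA test can be sketched as follows (illustrative; the function name is invented):

```python
def significant_improvement(new_sse, old_sse, total_sse, delta):
    """An SSE reduction counts as an improvement only if it exceeds
    the DELTA threshold: DELTA > 0 is an absolute difference in SSE
    units; DELTA < 0 means abs(DELTA) percent of the total SSE."""
    threshold = delta if delta > 0 else abs(delta) / 100.0 * total_sse
    return (old_sse - new_sse) > threshold
```

For example, with the unstandardized default DELTA = -0.1 and a total SSE of 1000, only reductions greater than 1.0 SSE unit trigger another splitting and lumping loop.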
Page Last Updated - 19-Jul-2007